Self-Organized Integration of Adaptive Visual Cues

Self-Organized Integration of Adaptive Visual Cues for Face Tracking Jochen Triesch Department of Computer Science University of Rochester Rochester (NY), USA [email protected]

Christoph von der Malsburg Institut für Neuroinformatik Ruhr-Universität Bochum Bochum, Germany [email protected]

Abstract A mechanism for the self-organized integration of different adaptive cues is proposed. In Democratic Integration the cues agree on a result and each cue adapts towards the result agreed upon. A technical formulation of this scheme is employed in a face tracking system. The self-organized adaptivity leads to suppression and recalibration of discordant cues. Experiments show that the system is robust to sudden changes in the environment as long as the changes disrupt only a minority of cues at the same time, although all cues may be affected in the long run.

1. Introduction

The integration of information stemming from different cues or modalities is among the most fundamental problems of perception in biological and artificial systems. Due to frequent changes in complex environments, the integration of cues has to be adaptive. However, there usually is no teacher available to guide the adaptation: the agent has to figure out on its own which cues are reliable for the given task in the current context, so self-organization is required. We propose the following integration scheme for such situations [9]: the cues agree on a result, and the result agreed upon serves as the basis for individual adaptations. We call this scheme Democratic Integration. A system with this property will try to reach maximal coherence between its different cues. Two fundamental assumptions must hold for this scheme to produce useful behavior. First, concordances between the different cues must prevail in the environment. Second, the environment must exhibit continuous properties with respect to the majority of cues, i.e., environmental changes may only affect a minority of cues at any one time. In order to test this idea, we have applied it to the tracking of faces in a complex, changing environment. In the following, we describe our method, present experiments, and discuss the results.

Figure 1. System architecture: five visual cues (Intensity Change, Motion Continuity, Color, Shape, Contrast Range) are integrated into an estimate of the head position.

2. Method

Our architecture for face tracking integrates five different visual cues (Fig. 1). All cues operate on the basis of two-dimensional saliency maps $A_i(\mathbf{x}, t)$, where $i$ indexes the cues. We assume $0 \le A_i(\mathbf{x}, t) \le 1$, with values close to one indicating high confidence of the cue that the face is present at position $\mathbf{x}$. To this end, each cue may compare a prototype $P_i$, which describes the appearance of the face with respect to this cue, to all positions in the current image. The $P_i$ shall be fixed for the moment.

$$A_i(\mathbf{x}, t) = S_i(P_i, I(\mathbf{x})), \qquad 0 \le S_i(P_i, I(\mathbf{x})) \le 1 \qquad (1)$$

The functions $S_i$ measure the similarity of an image region $I(\mathbf{x})$ centered at point $\mathbf{x}$ to the prototype $P_i$ of the cue. For the integration of the different cues we introduce their reliabilities $r_i(t)$, with $\sum_i r_i(t) = 1$. The saliency maps of the cues are combined into a total result $R(\mathbf{x}, t)$ by computing a weighted sum, with the reliabilities acting as weights:

$$R(\mathbf{x}, t) = \sum_i r_i(t) \, A_i(\mathbf{x}, t) \qquad (2)$$

The estimated target position $\hat{\mathbf{x}}(t)$ is the point yielding the highest total result:

$$\hat{\mathbf{x}}(t) = \arg\max_{\mathbf{x}} \{ R(\mathbf{x}, t) \} \qquad (3)$$

Now a quality $\tilde{q}_i(t)$ is defined for each cue, which measures how successful the cue was in predicting the total result:

$$\tilde{q}_i(t) = \mathcal{R}\big( A_i(\hat{\mathbf{x}}(t), t) - \langle A_i(\mathbf{x}, t) \rangle \big), \qquad (4)$$

where $\langle \ldots \rangle$ denotes an average over all positions $\mathbf{x}$, and

$$\mathcal{R}(x) = \begin{cases} 0 & : \; x \le 0 \\ x & : \; x > 0 \end{cases} \qquad (5)$$
is the ramp function. In words, the response of the cue at the estimated position of the face is compared to its response averaged over the whole image. If the response at the estimated position is above average, the quality is given by the distance to that average; otherwise the quality is zero. Note that this choice is ad hoc and other definitions may seem more natural. In particular, current research is considering treating the saliency maps as probability distributions, computing the correlation between the saliency maps and the total result, or using the Kullback-Leibler distance. Our current choice was mainly motivated by its low computational cost compared to these alternatives. Now, normalized qualities $q_i(t)$ are defined by:

$$q_i(t) = \frac{\tilde{q}_i(t)}{\sum_j \tilde{q}_j(t)} \qquad (6)$$

The system is made adaptive by defining dynamics for the reliabilities, which couple them to the normalized qualities:

$$\tau \, \dot{r}_i(t) = q_i(t) - r_i(t) \qquad (7)$$

A cue with a quality measurement higher than its current reliability will tend to increase its reliability, and a cue with a quality lower than its current reliability will have its reliability lowered. In particular, a cue whose quality remains zero will end up with a reliability of zero; i.e., it is completely suppressed. The time constant $\tau$ is chosen sufficiently large that the dynamics of the $r_i(t)$ react slowly to changes in the $q_i(t)$, which are expected to be spoiled by high-frequency noise. Due to the normalization of the $q_i(t)$, the $r_i(t)$ also tend to be normalized: the sum of the reliabilities converges to one, as can easily be seen by summing (7) over $i$. If the reliabilities are initialized such that their sum is one, the sum will remain one throughout. Thus, the reliabilities can indeed be regarded as weights. The $r_i(t)$ are effectively a running average of the qualities of the cues; they express the cues' reliabilities in the current context. The $\tilde{q}_i(t)$, and hence the $q_i(t)$, depend on the shapes of the results $A_i(\mathbf{x}, t)$, which are defined by the particular problem at hand, and on the position $\hat{\mathbf{x}}$ of the maximum response in $R(\mathbf{x}, t)$. As $R(\mathbf{x}, t)$ is a superposition of the results $A_i(\mathbf{x}, t)$ with the weights $r_i(t)$, the qualities $\tilde{q}_i(t)$ are indirectly influenced by the reliabilities $r_i(t)$. This indirect or hidden influence is a function of the particular problem and usually unknown:

$$\tilde{q}_i = \tilde{q}_i(r_1(t), \ldots, r_N(t), t) \qquad (8)$$

The analysis of a number of canonical cases has been described elsewhere [9].
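To make the scheme concrete, the complete update of equations (2)-(7) can be sketched in a few lines of NumPy, using the Euler discretization that Section 3.2 applies to the continuous dynamics. This is a minimal illustration rather than the original implementation; the function and variable names and the guard against an all-zero quality vector are our own choices.

```python
import numpy as np

def democratic_integration_step(saliency_maps, reliabilities, tau, dt):
    """One tracking step following Eqs. (2)-(7).

    saliency_maps -- array of shape (N, H, W), one map A_i in [0, 1] per cue
    reliabilities -- array of shape (N,), summing to one
    """
    # Eq. (2): total result as reliability-weighted sum of the cue maps.
    R = np.tensordot(reliabilities, saliency_maps, axes=1)

    # Eq. (3): estimated target position = maximum of the total result.
    x_hat = np.unravel_index(np.argmax(R), R.shape)

    # Eq. (4): quality = rectified difference between each cue's response
    # at the estimated position and its response averaged over the image.
    response_at_target = saliency_maps[:, x_hat[0], x_hat[1]]
    mean_response = saliency_maps.mean(axis=(1, 2))
    q_tilde = np.maximum(response_at_target - mean_response, 0.0)

    # Eq. (6): normalized qualities (guard: keep reliabilities unchanged
    # if all qualities vanish -- a choice the paper leaves open).
    q = q_tilde / q_tilde.sum() if q_tilde.sum() > 0 else reliabilities

    # Eq. (7), Euler-discretized: reliabilities relax towards the qualities.
    new_reliabilities = reliabilities + (dt / tau) * (q - reliabilities)
    return x_hat, R, new_reliabilities
```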

2.1 Adaptation of Prototypes

The reliability dynamics introduced above allow the system to exhibit suppression of a discordant cue. Now we introduce additional dynamics for the prototypes $P_i$ of the cues, which in the previous discussion were assumed to be fixed. The cues shall change their prototypes in such a way that their outputs match the total result better, which leads to recalibration of discordant cues. Consider a function $f_i(I(\mathbf{x}, t))$ extracting a feature vector $\mathbf{P}_i(\mathbf{x}, t)$ of the same dimension as the prototype. The prototype of each cue is then adapted towards the feature vector extracted at the estimated target position:

$$\tau_i \, \dot{\mathbf{P}}_i(t) = \mathbf{P}_i(\hat{\mathbf{x}}(t), t) - \mathbf{P}_i(t) \qquad (9)$$
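In the same Euler-discretized form, the prototype dynamics of equation (9) reduce to a one-line relaxation towards the feature vector extracted at the estimated target position. A minimal sketch, assuming the feature extractor has already been evaluated there (the names are ours):

```python
def adapt_prototype(prototype, feature_at_target, tau_i, dt):
    """Euler step of Eq. (9): relax the prototype P_i towards the feature
    vector extracted at the estimated target position x_hat."""
    return prototype + (dt / tau_i) * (feature_at_target - prototype)
```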

Figure 2. Overview of the cues: original image, shape pattern, intensity change, color, motion continuity, contrast range, and total result.

Intensity Change. This cue responds to intensity changes between consecutive frames. The thresholded difference image is convolved with a $6 \times 6$-pixel binomial filter in order to smooth the result and detect larger blobs of motion. The threshold parameter is not adaptive.
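A minimal sketch of such a change-detection cue, assuming grey-level frames as NumPy arrays. The concrete threshold value and the final clipping are our assumptions; the $6 \times 6$ binomial filter is applied separably.

```python
import numpy as np
from math import comb
from scipy.ndimage import convolve

def binomial_kernel(size=6):
    # 1-D binomial coefficients; [1, 5, 10, 10, 5, 1] / 32 for size 6.
    k = np.array([comb(size - 1, j) for j in range(size)], dtype=float)
    return k / k.sum()

def intensity_change_cue(frame, prev_frame, threshold=0.1):
    """Threshold the absolute frame difference, then smooth it with a
    separable 6 x 6 binomial filter to bring out larger blobs of motion."""
    blobs = (np.abs(frame - prev_frame) > threshold).astype(float)
    k = binomial_kernel()
    smoothed = convolve(convolve(blobs, k[None, :]), k[:, None])
    return np.clip(smoothed, 0.0, 1.0)  # saliency map in [0, 1]
```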

Motion Continuity. This cue tries to exploit the continuity of persons' motions, using a linear predictor to forecast the current position:

$$\hat{\mathbf{X}}(t) = \hat{\mathbf{x}}(t-1) + \big( \hat{\mathbf{x}}(t-1) - \hat{\mathbf{x}}(t-2) \big) \qquad (13)$$

Its output $A_{\mathrm{position}}(\mathbf{x}, t)$ is given by:

$$A_{\mathrm{position}}(\mathbf{x}, t) = \exp\!\left( -\frac{(\mathbf{x} - \hat{\mathbf{X}})^2}{2\sigma^2} \right), \qquad (14)$$

with $\sigma = 5$. The position prediction is not adaptive.
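Equations (13) and (14) translate directly into code; a sketch assuming positions are given as (row, column) arrays:

```python
import numpy as np

def motion_continuity_cue(x_prev, x_prev2, shape, sigma=5.0):
    """Linear prediction (Eq. (13)) and Gaussian saliency (Eq. (14))."""
    x_pred = x_prev + (x_prev - x_prev2)       # Eq. (13)
    rows, cols = np.indices(shape)
    d2 = (rows - x_pred[0]) ** 2 + (cols - x_pred[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))    # Eq. (14)
```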

Shape Cue. The shape cue computes the correlation of a $6 \times 6$-pixel grey level template of the target with the image. High correlations indicate a high likelihood of the target being at this particular position:

$$A_{\mathrm{shape}}(\mathbf{x}, t) = S_{\mathrm{shape}}(P_{\mathrm{shape}}(t), I(\mathbf{x}, t)) \qquad (15)$$
$$= \mathcal{L}\big( C(P_{\mathrm{shape}}(t), I(\mathbf{x}, t)) \big) \qquad (16)$$

$\mathcal{L}$ is a linear rescaling ensuring that $S_{\mathrm{shape}}$ projects into the interval $[0, 1]$. $C$ is the correlation operator:

$$C(P, I(\mathbf{x})) = \frac{\sum_{\mathbf{x}'} \big( I(\mathbf{x} + \mathbf{x}') - \bar{I}(\mathbf{x}) \big) \big( P(\mathbf{x}') - \bar{P} \big)}{\sigma_{I(\mathbf{x})} \, \sigma_P} \qquad (17)$$

$\bar{I}(\mathbf{x})$ and $\bar{P}$ are the average grey level values of a $6 \times 6$-pixel image region around $\mathbf{x}$ and of the $6 \times 6$-pixel shape pattern $P$, respectively; $\sigma_{I(\mathbf{x})}$ and $\sigma_P$ are the corresponding standard deviations. The sum runs over a $6 \times 6$-pixel region. The shape template $P_{\mathrm{shape}}(t)$ is adapted to the grey level distribution at the estimated target position according to (9).

Contrast Range. The contrast cue compares the contrast of a local image patch centered at position $\mathbf{x}$ to a contrast prototype $P_{\mathrm{contrast}}(t)$, where contrast is defined as the standard deviation of the grey level values of a $6 \times 6$-pixel image region:

$$A_{\mathrm{contrast}}(\mathbf{x}, t) = |P_{\mathrm{contrast}}(t) - c_I(\mathbf{x}, t)| \qquad (18)$$

where $c_I(\mathbf{x}, t)$ denotes the contrast of the region around $\mathbf{x}$. The prototypical contrast of the target is also adapted according to (9).
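A direct, unoptimized sketch of the shape cue's correlation operator (17) with one concrete choice for the rescaling $\mathcal{L}$: dividing by the number of pixels keeps the raw correlation in $[-1, 1]$, which is then mapped linearly onto $[0, 1]$ (the paper states only that $\mathcal{L}$ is linear):

```python
import numpy as np

def shape_cue(image, template, eps=1e-6):
    """Normalized cross-correlation of a small grey-level template
    (Eq. (17)), rescaled from [-1, 1] into [0, 1] (the mapping L)."""
    h, w = template.shape
    P = template - template.mean()
    sigma_P = template.std() + eps
    out = np.zeros((image.shape[0] - h + 1, image.shape[1] - w + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            patch = image[r:r + h, c:c + w]
            corr = ((patch - patch.mean()) * P).sum() \
                   / (h * w * (patch.std() + eps) * sigma_P)
            out[r, c] = corr
    return 0.5 * (out + 1.0)  # L: linear map from [-1, 1] to [0, 1]
```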

Color Analysis. The color cue is computed by comparing the color of each pixel to a region of skin color tones in HSI (hue, saturation, intensity) color space [4]. If a pixel falls within a region defined by intervals of "allowed" values for the $H$, $S$, and $I$ components, the result is one, otherwise zero. The result is also filtered with a $6 \times 6$-pixel binomial filter. The prototype color region is adapted towards a color average taken from a $3 \times 3$-pixel region around the estimated position of the target, according to (9). If the standard deviation of the colors in the $3 \times 3$-pixel region exceeds a certain threshold, indicating that there is no homogeneous color around the estimated target position, the prototype is not adapted.
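A sketch of the skin-tone test, using the standard RGB-to-HSI conversion of Gonzalez and Woods [4]; the interval bounds are illustrative placeholders for the adapted prototype region, and the binomial smoothing step is omitted here:

```python
import numpy as np

def rgb_to_hsi(rgb):
    """RGB (floats in [0, 1]) to HSI; hue is normalized to [0, 1]."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    i = (r + g + b) / 3.0
    s = 1.0 - np.min(rgb, axis=-1) / np.maximum(i, 1e-6)
    num = 0.5 * ((r - g) + (r - b))
    den = np.sqrt((r - g) ** 2 + (r - b) * (g - b)) + 1e-6
    theta = np.arccos(np.clip(num / den, -1.0, 1.0))
    h = np.where(b <= g, theta, 2.0 * np.pi - theta) / (2.0 * np.pi)
    return h, s, i

def color_cue(rgb, h_int=(0.0, 0.1), s_int=(0.2, 0.6), i_int=(0.2, 0.9)):
    """One where (H, S, I) falls inside the prototype intervals, else zero.
    The interval values here are placeholders, not the paper's."""
    h, s, i = rgb_to_hsi(rgb)
    inside = ((h_int[0] <= h) & (h <= h_int[1]) &
              (s_int[0] <= s) & (s <= s_int[1]) &
              (i_int[0] <= i) & (i <= i_int[1]))
    return inside.astype(float)
```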

3 Experiments

3.1 Image Sequences

84 image sequences of people crossing a room were recorded in RGB color format. The original resolution of $768 \times 568$ pixels was downsampled with a rectangular filter of $8 \times 8$ pixels, yielding images of $96 \times 71$ pixels. The frame rate was 8 Hz and each sequence consisted of 100 frames, resulting in 12.5 s of recording time. The sequences are evenly distributed among the following six classes:

Normal: A person crosses the room, starting from either side, at low to normal speed. The person moves in front of different backgrounds; in some places the background is predominantly light, in others predominantly dark.

Lighting: Same as Normal, but when the person reaches the middle of the room the lighting is abruptly changed to green. For this purpose a transparent green piece of plastic was held in front of a 500 W lamp illuminating the scene.

Turning: The person walks back and forth in the room, turning 1–3 times. Although people were asked to turn both facing the camera and facing the wall, the former was observed more often.

Occlusion: Same as Normal, but shortly after the first person enters the room, a second person enters from the opposite side. The two meet in the middle of the room, shake hands, and then continue on their course. The tracked person is occluded by the second person while passing.

Light&Turn (Lighting&Turning): Same as Turning, but in the middle of the sequence the illumination is abruptly changed as in the Lighting condition.

Light&Occl. (Lighting&Occlusion): Same as Occlusion, but at the moment of occlusion the illumination is changed as in the Lighting condition.

3.2 Detection and Tracking

The formulas given above were discretized with the Euler method ($\Delta t = 1/8$ s), giving a set of difference equations suited for computer implementation. The system is initialized with the reliabilities of the color and intensity change cues set to 0.5 and all other reliabilities set to zero. This choice reflects the limited a priori knowledge about the target: the person's face is assumed to be a moving, skin-colored object, but the place of its appearance, its shape, and its contrast are unknown. Correspondingly, the prototypes of the shape and contrast cues are set to canonical values. A detection threshold $T$ was defined for deciding whether a person is in the scene. If the maximum total result $R(\hat{\mathbf{x}})$ is greater than $T = 0.7$, a person is assumed to be present. In the beginning, the system is thus looking for a moving, skin-colored object of sufficient size. Once a person is detected, the reliabilities $r_i(t)$ and the prototypes $P_i(t)$ are adapted to the current qualities and the current image as described above. The time constant $\tau$ for the reliability adaptation and the time constants $\tau_i$ for the adaptive cues were all set to 0.5 s. If the result $R(\hat{\mathbf{x}})$ is smaller than the detection threshold, the reliabilities and prototypes are adapted back towards their default values with a slower time constant of 5 s. This ensures that after a person has left the scene, the system returns to its unbiased default expectations after a while. The system can thus learn about and adapt to new situations (a new person, a new lighting situation) rather quickly, but forgets them somewhat more slowly.
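The two regimes can be sketched as a single relaxation step whose target and time constant depend on the detection decision; the threshold and time constants follow the values above, the rest is our own scaffolding:

```python
DT = 1.0 / 8.0  # Euler step, matching the 8 Hz frame rate

def relax(value, target, tau):
    """Euler step of the relaxation dynamics used throughout."""
    return value + (DT / tau) * (target - value)

def update_reliabilities(R_max, reliabilities, qualities, defaults,
                         T=0.7, tau_track=0.5, tau_forget=5.0):
    """While a person is detected (R_max > T), track the current
    qualities quickly; otherwise relax slowly back to the defaults,
    so the system learns fast but forgets slowly."""
    if R_max > T:
        return relax(reliabilities, qualities, tau_track)
    return relax(reliabilities, defaults, tau_forget)
```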

3.3 Results

The performance of the system was carefully analyzed for the 84 sequences. Two different results for a sequence are distinguished:

good tracking: the person's face is correctly tracked during the whole sequence. The face remains undetected or the position estimate is poorly placed in less than 5 frames.

poor tracking: continuous mistrackings occur. The face is undetected, other body parts of the person are tracked, the other person is tracked, or parts of the background are tracked for more than 5 frames, or the person's exit from the scene is not detected.

Table 1. Tracking results: number of sequences with good tracking for the adaptive and the non-adaptive version of the system.

class         total   adaptation   no adapt.
Normal          14     12 (86%)    10 (71%)
Lighting        14     13 (93%)     0 ( 0%)
Turning         14     10 (71%)     6 (43%)
Occlusion       14      4 (29%)     0 ( 0%)
Light&Turn      14      8 (57%)     0 ( 0%)
Light&Occl.     14      2 (14%)     0 ( 0%)
Total           84     49 (58%)    16 (19%)

The results according to this categorization are summarized in Tab. 1. For each sequence class, the number of sequences with good tracking is compared for an adaptive and a non-adaptive version of the system. For the non-adaptive system, the reliabilities and prototypes were not updated; the system had to rely on the motion detection and color cues alone. Not surprisingly, the adaptive version outperforms the non-adaptive one for all classes. Normal sequences and sequences with turning persons and/or lighting changes produce predominantly good results for the adaptive system. Sequences containing occlusion of the tracked person by a different person produced bad results. These results already indicate that the system can cope fairly well with sudden changes in the illumination or with the person turning around. On the other hand, occlusion by another person is very harmful. However, it turned out that tracking errors could have unexpected causes. For example, in the only "poor" sequence of the Lighting condition, the error was produced by a background change rather than the change of illumination. Therefore, a more careful analysis of the results was in order.

For all erroneous sequences, an analysis was made of which changes in the sequence were involved in the error. Four types of changes were identified as contributing to errors: background changes, turns, lighting changes, and occlusions. Each change affects a particular set of cues; for instance, a change in the lighting will primarily disrupt the color cue. It was investigated to what extent the changes contribute to tracking errors. Altogether, 56 changes contributed to 34 tracking errors, which gives an average of 1.65 changes per error. (In a 35th "poor" sequence, the system did not detect the person's exit from the scene and continued tracking the door frame.) Changes affecting only a small number of cues, such as background changes and lighting changes, produce only few errors, while changes affecting many cues simultaneously produce many errors. Although turns and occlusions can possibly influence the same set of cues, occlusions were more harmful, since they introduce a competing target (the other person) which "attracts" the system.

Figure 3. Example of a sequence of the "Lighting" condition. The images summarize a five second stretch (t = 0 s to t = 5 s) from the middle of the sequence; individual images are 1.25 seconds apart (10 frames). The circle marks the estimated face position. During the sequence two significant changes occur: between the second and third image the lighting is changed abruptly such that the whole scene appears green; between the fourth and fifth image the person steps in front of a different background.

An example of successful tracking is given in Figs. 3–4, showing a sequence of the Lighting class. Two significant changes occur during this sequence: at one instant the lighting is changed as described above, and later the person steps in front of a different background (Fig. 3). Both changes affect only a minority of cues (Fig. 4), and tracking remains stable. Had both changes occurred at the same time, though, they would have posed a serious threat to the system.

The most important parameters of the system are the time constants $\tau$ and $\tau_i$ for the adaptation of reliabilities and prototypes on the one hand, and the detection threshold $T$ on the other. If the system adapts too fast, it has no memory and is easily disturbed by high temporal frequency noise. If adaptation is too slow, the system has problems with harmless changes occurring in quick succession: if there were two subsequent changes in the scene as in Fig. 3, each of which would be tolerated by the system in isolation, their appearance in quick succession would pose a serious threat. If $T$ is too high, relatively small changes in the scene can result in missed detections. If $T$ is too low, the system will tend to continue tracking the background when the person leaves the room. Interestingly, a good choice of the parameters seems to depend on the types of changes occurring in the scene: with a higher detection threshold and slower adaptation, performance increased for the Occlusion sequences but decreased for the Turning sequences.

Figure 4. Graphs showing the cues' reliabilities (dotted) and qualities (solid) as a function of time for the sequence of Fig. 3, one panel per cue (intensity change, color, motion continuity, shape, contrast range). The bar on the ordinate marks the five second stretch displayed in Fig. 3. The system is initialized with the reliabilities of color and intensity change at 0.5 and the other cues' reliabilities at zero. As soon as the person enters the scene, responsibility starts being shared among the cues. Changes in the scene give rise to temporarily diminished qualities of some of the cues: the changed illumination produces sharp dips in the quality of the intensity change and color cues; later on, the changed background disrupts the shape and contrast cues.

4 Discussion

Related Work on Face Tracking. In its present guise, the system is not specific to the task of face tracking, its only a priori assumptions being to look for a moving object of a particular color. In addition, we deliberately chose very simple cues, since our focus was on the integration of cues rather than on the cues themselves. The system should thus be regarded as a first step towards a full-fledged system for detecting and tracking humans. There are a number of other systems (e.g. [8, 3, 1]) which are more specific to the task of face tracking, since they introduce more a priori knowledge about the problem. Often the characteristic shape of the head/shoulder contour is used [3, 1]. However, these approaches lack an adaptive component. The system by McKenna et al. is adaptive, but it relies on a single color cue [6] and can only adapt to slow, continuous changes. In contrast, the system presented here can also cope with successions of sudden changes, as long as they affect only a minority of cues at the same time.

Related Work on Sensor Fusion. Sensory integration plays an important role in many fields of technology, and diverse techniques have been employed for fusing different sensors [5], but issues of self-organized adaptivity have hardly been addressed so far. The most closely related approaches were proposed by Murphy and Dasarathy. Murphy's SFX architecture [7] stresses the importance of adapting the fusion strategies to the currently perceived object or current perceptual task, suggesting the use of different fusion strategies for perceiving different objects. The fusion strategies to employ for a particular task are constructed manually by the designer. Adaptations to environmental changes are handled by a separate "exception handling mechanism" observing the fusion process. It uses a set of pre-programmed "state failure conditions" to detect discordances between the sensors, and decides whether to recalibrate or suppress offending sensors. In Democratic Integration, the recalibration and suppression are an emergent property of the simple dynamics of the cues' reliabilities and prototypes, not requiring prespecified rules for the detection and resolution of discordances. Also, SFX does not attempt to predict a sensor's future reliability on the basis of its performance in the recent past. Dasarathy [2] considers a decision fusion architecture, mentioning that his idea may be applicable to other levels of sensor fusion as well. He envisions (without implementing it) an architecture where feedback of the fused decision result is used to fine-tune the decision boundaries of individual decision processes, leading to a "self-improvement" of the overall system. Democratic Integration goes beyond this scheme in the sense that it also allows for drastic recalibrations and even complete suppression of sensors which have been found to be discordant with the overall result in the recent past, thus assuming a certain temporal continuity in the reliabilities of the different sensors.

Conclusion. In Democratic Integration, all cues agree on a result and each cue adapts towards the result agreed upon. The system tries to reach maximal coherence among its cues. Our system can handle environmental changes as long as they affect only a minority of cues at the same time, although all cues may be disrupted in the long run. The system relies partly on cues for which no a priori knowledge about the task was available. These cues were bootstrapped by the others and could then keep the system on the right track in situations where the original cues would fail. From an engineering perspective this architecture is very handy, because new cues can be added easily, and the system will figure out by itself to what extent they are useful and how much it should rely on them.

Acknowledgements

The authors would like to thank the developers of the FLAVOR software environment, which served as the platform for this work. We are grateful to Alexander Heinrichs for help in recording the image sequences.

References

[1] H.-J. Boehme, U.-D. Braumann, A. Brakensiek, A. Corradini, M. Krabbes, and H.-M. Gross. User localisation for visually-based human-machine-interaction. In Proceedings of FG'98, the IEEE Third International Conference on Automatic Face and Gesture Recognition, pages 486–491, 1998.
[2] B. V. Dasarathy. Sensor fusion potential exploitation: innovative architectures and illustrative applications. Proceedings of the IEEE, 85(1):24–38, 1997.
[3] S. Feyrer and A. Zell. Ein integrierter Ansatz zur Lokalisierung von Personen in Bildfolgen [An integrated approach to localizing persons in image sequences]. In S. Posch and H. Ritter, editors, Dynamische Perzeption, number 8 in Proceedings in Artificial Intelligence, pages 183–190. infix, Sankt Augustin, Germany, 1998.
[4] R. C. Gonzalez and R. E. Woods. Digital Image Processing. Addison-Wesley, 1992.
[5] R. C. Luo and M. G. Kay. Multisensor integration and fusion in intelligent systems. IEEE Transactions on Systems, Man, and Cybernetics, 19(5):901–931, 1989.
[6] S. J. McKenna, S. Gong, and Y. Raja. Modelling facial colour and identity with Gaussian mixtures. Pattern Recognition, 31(12):1883–1892, 1998.
[7] R. R. Murphy. Biological and cognitive foundations of intelligent sensor fusion. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, 26(1):42–51, 1996.
[8] J. Steffens, E. Elagin, and H. Neven. PersonSpotter: fast and robust system for human detection, tracking and recognition. In Proceedings of FG'98, the IEEE Third International Conference on Automatic Face and Gesture Recognition, pages 516–521, 1998.
[9] J. Triesch. Vision-Based Robotic Gesture Recognition. PhD thesis, Institut für Neuroinformatik, Ruhr-Universität Bochum, Germany, 1999. Also published as a book by Shaker Verlag, Aachen, Germany, ISBN 3-8265-6257-7.
