A Probabilistic Framework for Tracking of Articulated Human Motion

Hedvig Sidenbladh, CVAP/NADA, Royal Institute of Technology, S-100 44 Stockholm, Sweden, [email protected]

Michael J. Black, Dept. of Computer Science, Box 1910, Brown University, Providence, RI 02912, USA, [email protected]

David J. Fleet, Xerox Palo Alto Research Center, 3333 Coyote Hill Rd., Palo Alto, CA 94304, USA, [email protected]

Abstract
The tracking and reconstruction of articulated human motion in 3D is a problem that has attracted a great deal of interest in recent years. A system that recovers 3D body pose from video sequences has applications in vision-based human-computer interaction, marker-less motion capture, animation, surveillance, and entertainment such as computer games. The fast, nonlinear motion and complicated appearance of humans, and the large number of degrees of freedom of the human body, make the tracking problem a difficult one. To address these problems, a system for tracking and reconstruction of human motion in 3D should possess the following: a strong model of the appearance of humans in images; a model of how people move; and an effective strategy for searching for the right pose in each time step. In previously presented systems, the most common way of addressing these issues has been to constrain the problem domain. The appearance of humans can be constrained by assuming certain clothing and a large contrast between the human and the background. Furthermore, by adding more camera views, more information about the 3D pose of the human can be extracted and ambiguities reduced, thus making the problem easier. The goal of the work presented here is to investigate to what extent the general problem of tracking and reconstructing human motion can be solved using only a monocular camera view. Thus, no assumptions about the appearance of either the human or the background are introduced. A probabilistic, Bayesian framework for tracking of articulated human motion in 3D is presented. The tracking makes use of a filter-based learned model of human appearance in images and image sequences, and two different types of models of human motion, intended to constrain the search in each time step of the tracking. Successful tracking results using the human appearance model and the different motion models are presented. Among the questions left open is the issue of initialization, a difficult problem in the high-dimensional search space of an articulated model in 3D.
Submitted to IEEE PAMI, December 2001
1 Introduction

Computer systems are, despite the rapid development in the area, still largely blind to their users. Since vision is a very important cue for humans, seeing computer systems would enable the development of human-computer interfaces that are more natural for the user. A basic ability of a seeing computer system would be to detect and track the humans that interact with it. This paper concentrates on the task of 3D tracking and reconstruction of human motion in monocular video sequences. Among the applications are gesture and action recognition for human-computer interfaces, and 3D motion capture from archival footage.
1.1 What are the Difficulties?
The appearance and motion of humans in video sequences vary in complicated ways, both between subjects and over time. Some of the main challenges in designing a system for tracking of humans in video sequences are discussed below.

2D-3D projection ambiguities. When a 3D scene is projected into a 2D image, all information about depth in the scene is lost. This is the most fundamental problem in computer vision. In Figure 1, the distance between the body and left foot of the employee of the Ministry of Silly Walks is not apparent even to a human observer. The depth information that was lost in projecting the world onto the image must be replaced by other information, for example learned knowledge about the relative size of human limbs and how humans generally move.

Difference in appearance and shape. Due to differences in clothing and body shape, two different subjects in the same pose seldom appear the same in images. Thus, there is no simple mapping between 3D body configuration and image appearance. Furthermore, shading, illumination changes, and deformable clothing can change the appearance of the same subject over time. For these reasons, it is difficult to formulate a universal model for human appearance in images.

Similarity between limbs and other structure. In Figure 1, all limbs appear very similar in the image – dark and elongated with edges at the boundaries. There are also structures in the background that bear similarity to the individual limbs, e.g. the tree further down the street (although on a different image scale). To be able to distinguish between "human" and "non-human" structure in images, as much information as possible has to be taken into account. However, this makes it difficult to keep the model of human appearance general enough to allow for the variations in human appearance discussed in the paragraph above.

[ FIGURE 1 ABOUT HERE ]

Self occlusion. In Figure 1, the left upper arm is completely occluded by the foot and leg. In a highly articulated structure such as a human, this will occur frequently. Other objects, such as the briefcase in Figure 1, can also be the cause of occlusion.

Clothing. An issue related to self occlusion is the problem of loosely fitting clothing. Most often in existing tracking systems, the tracking subject is assumed to wear tight-fitting clothes, which simplifies the tracking problem considerably. Extreme cases of loosely fitting clothes are a wide robe or a long skirt, which occlude the pose of the limbs entirely.

Dimensionality. A very simple 3D representation of a human, where limbs are modeled as rigid elements attached to each other at joints, has approximately 30 degrees of freedom (DOF). Hence, using a model like this for tracking a human in a video sequence requires estimating around 30 parameters at each time instance. The enormous volume of the search space makes robust tracking difficult. Note also that 30 DOF do not capture the real flexibility of a human body. This means that the tracking will be brittle and not robust to all motions. However, if the number of DOF is raised, the tracking will suffer from the increase in dimensionality.

Non-linear and fast motion. The size of the high-dimensional search space would be less of a problem if the model parameters (i.e. the position in the space) changed slowly, or in a predictable manner, over time (although initialization would still be a large problem). However, this is not generally the case. The arms and legs of humans move very fast, often with large acceleration. This creates difficulties in maintaining track of all body parameters.
Motion blur. Depending on the shutter time of the camera used to capture the sequence, fast motions will cause the moving limb to appear more or less blurred in the image. At the boundaries of the limb in the image, the background will affect the pixel values. Furthermore, the internal appearance structure of the limb will be blurred along the direction of motion. This adds further difficulties to tracking fast-moving limbs.

Kinematic singularities. Another problem is the presence of kinematic singularities [59]. If the position of a limb is represented in terms of joint angles, the representation of a certain spatial position of the limb in terms of the angles will not always be unique. Furthermore, if the joint angles are limited to a certain range, this will introduce more singularities [16, 45].
1.2 Approaches to Address the Difficulties
The problems described above must be taken into account when designing robust systems and algorithms for tracking of humans in image sequences. A natural way to represent knowledge about the shape of humans for use in tracking is to use a model-based approach [26]. As an example, phenomena such as self occlusion can be explained in terms of a layered model.¹ In general, to solve the tracking problem, a system must have:

1. A strong model of the appearance of humans in images. The model must take into account the variability in appearance, but also be strong enough to distinguish humans from non-humans.

2. Constraints to narrow the search. This could be knowledge about human motion, e.g. correlations between the motion of different parts of the body. It could also be knowledge about common body configurations.

3. An effective search strategy. In each time step, the space of possible configurations has to be explored in an efficient way, given the configuration in the previous time step and the motion constraints mentioned above.

¹ Another approach would be to extract local image structure, such as edges and corners, and combine them into structures such as limbs and collections of limbs. Such a bottom-up approach would experience large problems when one of the limbs is invisible in the image. Due to these problems, model-based approaches are most often used.
In general, it is difficult to fulfill all these demands with a reasonable computational effort. In previously presented systems (Section 2), the most common way to address the issues in the list above is therefore to constrain the problem domain in certain ways. One approach is to make sure the background is considerably different from the human in terms of appearance. Most often the tracking subject wears tight-fitting clothes, sometimes even markers. Thus, the design of an appearance model for humans is made easier. However, the applications of such a system are limited to areas like vision-based motion capture, where the background and clothing of the human can be controlled. In a general environment, such as a home or a street scene, this is not the case. By introducing constraints on the range of motion (often action-specific models of motion, such as walking) the search can be narrowed and the tracking made more robust. Again, this is useful for certain applications, e.g. surveillance or pedestrian detection, but must be augmented if it is to be used in the general case. The search problem is made easier if the tracking subject is viewed from two or more views. This reduces the problems with occlusion and depth ambiguities. However, it makes the method less appropriate for applications such as 3D reconstruction from archival footage, and applications for personal computers equipped with a web-cam, since most often only one (sometimes gray-scale) camera source is available.

In this paper, a probabilistic, Bayesian approach to 3D articulated tracking of humans in monocular image sequences is presented. The general approach is, instead of simplifying the tracking problem, to integrate as much prior knowledge about human motion and appearance into the tracker as possible, in a mathematically principled way, thus addressing the problems of ambiguities and missing information. The tracking problem is formulated as one of determining the posterior distribution over body poses in each time step, as the product of the likelihood of body poses given the observed image at time $t$, and a prior distribution over body poses that depends on the posterior distribution at the previous time step and on models of human motion. The posterior distribution over body poses is represented by a discrete set of particles, each representing a certain body pose. The particles are propagated in time using Condensation [24, 27]. This is further described in Section 3. For each of the particles, the likelihood of the body configuration is computed by projecting the 3D human model into the image plane and comparing with the observed image of the human. Instead of introducing constraints such as multiple cameras or a simplified background, our approach is to learn models of how humans appear in images. This is described briefly in Section 4, while a more detailed description can be found in [56]. The prior distribution is obtained by propagating the posterior one time step using some model of human motion. In this paper, two different types of models of human motion are presented: one very general, based on an assumption of constant velocity, and an action-specific model of cyclic motion, learned from 3D motion capture data. The two models are presented briefly in Section 5, along with results of tracking humans in natural environments. The paper is concluded in Section 6.
2 Related Work

Estimation of human motion is an active and growing research area [1, 20, 32, 43] with applications to human-machine interaction, surveillance, image coding, biometrics, animation, automatic pedestrian detection in cars, and computer games, among others. The numerous previously presented approaches include estimation of face, hand, and full-body motion at different resolutions. Here, we concentrate on full-body motion. Crudely, the problem of motion estimation can be divided into two sub-problems, initialization and tracking. For human models with many DOF, such as the model used in this paper, initialization in general is still an unsolved problem, although it has been done with special assumptions about pose [19, 26]. Hereafter, we concentrate on the problem of tracking.
2.1 Models of Shape
Depending on the application, different kinds of information may be extracted from the human model at each time step of the tracking. The choice of human model used for tracking depends on what kind of information has to be extracted, and also on what constraints can be introduced on the environment and on the activities of the tracked human. The models vary in complexity from assemblies of 2D color blobs [63] or areas with a certain color distribution [13], to layered 2D representations of articulated figures [11, 33], and, finally, to detailed 3D articulated structures [9, 15, 17, 19, 23, 26, 36, 48, 50, 52, 61, 62].
2.2 Models of Appearance
Tracking using articulated models involves (in the 3D case) projecting a certain configuration of the model into the image, and comparing the model features with the observed image features. In a probabilistic formulation of the problem, this corresponds to computing the likelihood of the observed image features, conditioned on the model configuration. Depending on the application, many different techniques have been used to extract features for image-model comparison.

Background subtraction [15, 25, 52, 63] can give an estimate of where the human is in the image, and of the outline of the human, but does not provide information about the motion of the foreground. To achieve more detailed information about the position of the individual limbs, researchers have also used detected image edges. Observing correlation between the boundaries of the human model and detected edges has proven successful in tracking, especially in indoor environments with little clutter [15, 17, 19, 21, 23, 26, 35, 61]. The common approach is to detect edges using a threshold on some image edge response, and to measure the distance between detected edges and model edges. The method proposed in Section 4 takes another approach: instead of first detecting edges in the image, we observe the continuous edge response along the predicted limb boundary and compute the likelihood of observing the response using probability distributions learned from image data. Thus, more information about the edge response is taken into account, while enabling a principled formulation of the likelihood.

Edges provide a quite sparse representation of the world, since they only provide information about the location of limb boundaries. More information about the limb appearance can be derived from the assumption of temporal brightness constancy – that two image locations originating from the same scene location at two consecutive time instances have the same intensity. This assumption is widely used for tracking of humans [9, 33, 62]. Since this cue does not give an absolute estimate of the appearance, only an estimate of the appearance change, problems with drift over time occur.

The cues described above for comparing human models with images exhibit different strengths and weaknesses. Thus, none of the cues is entirely robust when used on its own. Reliable tracking requires multiple spatial and temporal image cues. While many systems combine cues such as motion, color, or stereo for person detection and tracking [14, 15, 48, 61], the formulation and combination of these cues is often ad hoc. The Bayesian approach presented in this paper enables combination of different cues in a principled way. Moreover, by learning noise models and likelihood distributions from training data, the problems of hand-tuned noise models and thresholds are avoided.
2.3 Models of Motion
Walking motion, being regular and driven mainly by one parameter, the phase, has been modeled using a number of approaches. Hogg [26] and Rohr [52] presented analytically defined walking models. These models did not express any differences in walking between individuals, but provided sufficient information for tracking of pedestrians in outdoor scenes. Another common approach is to model the human body as a dynamical system, and use the dynamical constraints to limit the range of motions [47, 64]. Dynamical models have also been used for animation of human motion (cf. [10]). If the animation is driven by more information, such as hand and torso positions, the body pose in each time step can be generated using inverse kinematics and gravity constraints [2]. The drawback of the dynamical approach is that it is computationally expensive, since the human is an extremely complex dynamical system.

Instead of explicitly modeling the physics of the human body, one can learn statistical properties of human motion from examples of 3D motion capture data. This approach has also been used for animation. The learned statistical properties could be wavelet decompositions [49] or polynomial basis functions [22] of motion trajectories, or HMM models [8, 44]. Using these properties, often parameterized with respect to motion style or goal position of the motion, new plausible-looking motions can be generated.

In many methods, image measurements are first computed and then the temporal models are applied to either smooth or interpret the results. Little and Boyd [41] detected the spatial distribution of the optical flow in each frame of a sequence depicting a walking cycle, and computed image moments from the flow energy image. Different walkers could be identified by the Fourier coefficients of the moment curves. In a related approach [6], different actions were represented by their motion energy image (MEI).

The above approaches [6, 41] do not model the locations of individual limbs, but use the motion pattern itself to recognize activities. If the goal is to explicitly reconstruct the 3D configuration of the human, more extensive preprocessing of the image has to be performed. Brand [7] extracted image moments from silhouette training sequences of people. The observed moments formed a manifold in the space of moments, to which an HMM could be fitted. Using training data consisting of sequences of 3D configurations (acquired with a motion capture system) with associated silhouettes, models of plausible 3D configurations given a silhouette were constructed. Using this model, a 3D motion could be generated from a new silhouette sequence. Leventon and Freeman [39] addressed the problem of 3D reconstruction from a monocular image sequence by learning models of small "chunks" of motion in 3D. The chunks were grouped using a k-means clustering technique, and low-dimensional subspaces of motion chunks were learned using PCA. New chunks could be synthesized as linear combinations of eigenvectors in this space, and compared with a given 2D configuration. The most plausible 3D motion chunk given the observed 2D configuration could then be selected in each time step, and pasted together with chunks adjacent in time. The approach is somewhat similar to that of Brand, except that the observed 2D configuration in the image is a much stronger cue than silhouette moments, which means that the 3D reconstruction will follow the image data more closely. In both cases, the motion model is used as a post-processing step, to obtain the 3D reconstruction of the human from the found 2D image configuration. Thus, the motion model is not used to find the 2D configuration, as is the case in this paper.

The goal in Section 5 is to employ the same type of data as Leventon and Freeman [39] and Brand [7], but explore the use of complex non-linear temporal models early in the process, to constrain the estimation of low-level image measurements. In related work, Yacoob and Davis [66] used a learned "eigencurve" model of image motion [65] to constrain estimation of a 2D articulated model. Black [3] used similar non-linear temporal models within a probabilistic framework to constrain the estimation of optical flow.
3 Bayesian Model-Based Tracking Approach

In this section, the probabilistic framework in which the tracking is carried out is described. Furthermore, the geometry of the human and camera models is presented. Although the general tracking framework is independent of the choice of model, certain problems associated with the particular choice of model, and how they are addressed, are discussed. Therefore, the geometry of the model is described before the Bayesian formulation of the tracking.

Let the model of a human be parameterized at time $t$ by a set of parameters $\phi_t$. This section describes how these parameters are defined, and how they are propagated in time.
3.1 Geometrical Model
The human body is modeled as an articulated object, parameterized by a set of joint angles. Given a camera model, each 3D position on the surface of the model that is visible to the camera can be associated with a 2D position in the image. The geometric formulation is used when computing the likelihood of a certain configuration, described in Section 4.

[ FIGURE 2 ABOUT HERE ]

As shown in Figure 2, the human body is modeled as a configuration of 11 truncated cones with elliptical cross-sections. Each cone (or limb) is defined in a part-centric coordinate frame with the origin at the base of the cone. The configuration of this human model is expressed in terms of joint angles. All in all there are 19 angles, which means that the configuration of the human model can be expressed by 25 parameters: the 19 relative Euler angles $\theta = [\theta_1, \ldots, \theta_{19}]$, and the global translation $\tau^g = [t_x, t_y, t_z]$ and rotation $\theta^g = [\alpha_x, \alpha_y, \alpha_z]$ of the model. The parameters are denoted $\phi = [\tau^g, \theta^g, \theta]$. The corresponding velocities $V = [\dot{\tau}^g, \dot{\theta}^g, \dot{\theta}]$ are also included in the parameter space, which yields 50 parameters that describe the configuration and velocity of the human model. Using this representation, the configuration of each limb can be derived hierarchically from the global configuration and the relative angles.

Knowing the parameters of the camera observing the human, model configurations with different values of $\phi$ can be projected into the image plane and compared with the observed image, according to some criterion (see Section 4). Given that the limbs are opaque (a reasonable assumption), some parts of the surface of the human model are not visible in the image (cf. [61]). Surface areas that are occluded by other limbs are marked as self-occluded and discarded from likelihood computations.
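To make the hierarchical derivation concrete, the following Python sketch (a minimal illustration under our own simplifying assumptions, not the authors' implementation; the rotation-order convention and function names are ours) accumulates relative Euler rotations along a single kinematic chain and returns the base position of each limb, with each limb extending along its local z axis as in Figure 2:

```python
import numpy as np

def euler_to_matrix(rx, ry, rz):
    """Rotation matrix from Euler angles (assumed x, then y, then z order)."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def chain_positions(global_trans, global_rot, joint_angles, lengths):
    """Base position of each limb in one kinematic chain.

    Each limb is rotated relative to its parent by its Euler angles and
    extends along its local z axis by its length.
    """
    R = euler_to_matrix(*global_rot)
    origin = np.asarray(global_trans, dtype=float)
    positions = [origin]
    for angles, length in zip(joint_angles, lengths):
        R = R @ euler_to_matrix(*angles)  # accumulate relative rotation
        origin = origin + R @ np.array([0.0, 0.0, length])  # step along limb axis
        positions.append(origin)
    return positions
```

A full model would apply this along each of the chains of the assembly in Figure 2 and project the resulting cone surfaces through the camera model.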
3.2 Propagating the Parameters in Time
An approach to tracking could be to search for the configuration that best fits the image in each time step; the tracking is then formulated as an optimization or search task. Possibly, the 3D body configuration could be estimated using inverse kinematics, based on limb positions found in the image. However, this would lead to a number of problems with singularities of the type that appear in robotics [59].

Instead, we formulate the problem as a Bayesian inference task. The problem of tracking a human using a model, parameterized at time $t$ by the set of parameters $\phi_t$, can be formulated using Bayes' rule [5, 31]:

$$p(\phi_t \mid \bar{\mathbf{I}}_t) \propto p(\mathbf{I}_t \mid \phi_t) \int p(\phi_t \mid \phi_{t-1})\, p(\phi_{t-1} \mid \bar{\mathbf{I}}_{t-1})\, d\phi_{t-1} \qquad (1)$$

where $\mathbf{I}_t$ is the image at time $t$, assumed conditionally independent of earlier images given $\phi_t$, and $\bar{\mathbf{I}}_t = [\mathbf{I}_1, \ldots, \mathbf{I}_t]$ is the image history up to time $t$. For details on the notation, see [54].

$p(\phi_t \mid \bar{\mathbf{I}}_t)$ is called the posterior distribution over $\phi_t$ given $\bar{\mathbf{I}}_t$. This entity represents all knowledge extracted about the model configuration, and can be used for recognition of activities, or for reconstructing the motion of the human. The distribution $p(\mathbf{I}_t \mid \phi_t)$ is the likelihood of observing the image $\mathbf{I}_t$, conditioned on the model configuration $\phi_t$. This distribution is very difficult to formulate analytically, given the complicated appearance of natural images. However, a function can be designed that returns the (unnormalized) likelihood for a given value of $\phi_t$.

The integral is referred to as a prior, or a prediction, as it is equivalent to the probability over states at time $t$ given the image measurement history, i.e. $p(\phi_t \mid \bar{\mathbf{I}}_{t-1})$. It is useful to understand the integrand as the product of two terms: the posterior probability distribution over states at the previous time, $p(\phi_{t-1} \mid \bar{\mathbf{I}}_{t-1})$, and the dynamical process, the temporal prior $p(\phi_t \mid \phi_{t-1})$, that propagates this distribution over states from time $t-1$ to time $t$. This could either be activity dependent – the distribution for walking is probably quite different from the distribution for jumping – or a general model for human motion.

Note that Equation (1) tells us nothing about how to initialize the model parameters at time $t = 0$, $\phi_0$. The initialization problem is not addressed here, although it is discussed in Section 6.
3.3 Representing the Parameter Space
Previous approaches to the problem of articulated human tracking (see Section 2) have often assumed the posterior to be unimodal and Gaussian. This means that the problem of maintaining the posterior distribution over time is reduced to the problem of finding the maximum a posteriori (MAP) estimate, and sometimes the covariance of the model parameters.

However, self occlusions, matching ambiguities, and singularities (see the introduction) make the likelihood $p(\mathbf{I}_t \mid \phi_t)$ highly non-Gaussian, and often multi-modal. While we cannot derive an analytic expression for the likelihood function over the entire state space, we can evaluate the likelihood of observing the image given a particular state $\phi_t^{(i)}$. The computation of this likelihood is described in Section 4. Furthermore, the representation of the posterior is complicated by the use of non-linear models for estimating the temporal prior $p(\phi_t \mid \phi_{t-1})$.
Particle filtering. For the reasons above, we represent the posterior distribution as a weighted set of samples or particles, which are propagated using a particle filter with sequential importance sampling. The method is called Particle Filtering or Condensation [24, 27] and has previously been used for 2D image tracking with linear (cf. [29, 60]) and non-linear temporal models (cf. [3, 4]).

The posterior at time $t-1$ is represented by $N$ state samples, where sample $i$ is denoted $\phi_{t-1}^{(i)}$. Each of the samples has a normalized likelihood value $\pi_{t-1}^{(i)}$. A time step in the particle filter is carried out as follows:

1. $N$ new particles are drawn with replacement from the posterior distribution at time $t-1$ with Monte Carlo sampling, according to the likelihoods $\pi_{t-1}^{(i)}$.

2. The new samples are propagated in time by sampling from the temporal prior $p(\phi_t \mid \phi_{t-1})$. The $N$ samples $\phi_t^{(i)}$ now represent the prior distribution over $\phi_t$, $p(\phi_t \mid \bar{\mathbf{I}}_{t-1})$.

3. For each of the $N$ particles, $\pi_t^{(i)}$ is computed as $\pi_t^{(i)} = p(\mathbf{I}_t \mid \phi_t^{(i)}) \,/\, \sum_{j=1}^{N} p(\mathbf{I}_t \mid \phi_t^{(j)})$. The resulting set of samples, weighted by their normalized likelihoods, approximates the posterior distribution $p(\phi_t \mid \bar{\mathbf{I}}_t)$ at time $t$.
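The sketch below (our own minimal illustration of the three steps above; `sample_temporal_prior` and `likelihood` are assumed, generic stand-ins for the models of Sections 5 and 4, not the authors' code) implements one Condensation time step:

```python
import numpy as np

def condensation_step(particles, weights, sample_temporal_prior, likelihood, rng):
    """One Condensation time step: resample, propagate, reweight.

    particles: (N, d) array of state samples phi at time t-1.
    weights:   (N,) normalized likelihoods pi at time t-1.
    """
    N = len(particles)
    # 1. Draw N particles with replacement according to the old weights.
    idx = rng.choice(N, size=N, p=weights)
    resampled = particles[idx]
    # 2. Propagate each particle by sampling from the temporal prior.
    propagated = np.array([sample_temporal_prior(p, rng) for p in resampled])
    # 3. Compute normalized likelihoods of the new particles.
    new_weights = np.array([likelihood(p) for p in propagated])
    new_weights = new_weights / new_weights.sum()
    return propagated, new_weights
```

The weighted sample set returned here is exactly the discrete approximation of the posterior used throughout the paper; a point estimate such as the weighted mean can be read off from it when needed.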
As noted above, the particle representation has the advantage that it can represent distributions that are difficult to model analytically. However, an apparent drawback of representing the whole posterior distribution with particles is the computational complexity: the computing time for each time step is $O(N)$. Thus, it is desirable to use as few particles as possible.

It is difficult to automatically select the number of particles needed to model the search space. MacCormick and Isard [42] derive the requirement

$$N \geq \frac{N_{\min}}{\alpha^{d}} \qquad (2)$$

where the exponent $d$ is the number of dimensions in the parameter space, $N_{\min}$ is the smallest acceptable number of particles to survive the sampling, and the survival rate $\alpha < 1$ is a constant related to the shape of the posterior and prior distributions. In short, $\alpha$ measures how well the filter predicts the posterior distribution at each time step. Since $\alpha$ is very small, the lower bound on $N$ grows exponentially with the number of dimensions $d$. $\alpha$ will also in general be lower for noisy distributions with sharp peaks, which means that more samples are needed to represent them. Thus, the needed number of particles, $N$, depends both on the volume of the search space and on the shape of the distribution [12, 42]. In the case of highly articulated structures, such as the human model used in this paper, with complicated image appearance, $N$ will be very large, leading to high computational cost in the tracking.
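For a sense of scale (with purely illustrative numbers of our own choosing, not values measured in the experiments): even if the survival rate were as high as $\alpha = 0.5$ in a $d = 50$ dimensional state space, Equation (2) would demand $N \geq N_{\min} \cdot 2^{50} \approx N_{\min} \cdot 10^{15}$ particles, which is clearly infeasible; any improvement in how well the prior predicts the posterior changes the bound by many orders of magnitude.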
Improving the sampling. A number of methods (for an overview, see [54]) have been proposed to lower the number of samples needed for tracking, and to make the samples cover the parameter space more efficiently. Here, a method called ICondensation [28] is employed. The difference from Condensation is that the samples are propagated in time using a different distribution than the posterior at the previous time instant. In the original formulation [28] the algorithm was used for fusing information from two different image cues – for example, the true likelihood in the tracking could be based on edge information, while information from color blob detection could be used for propagating the samples in time. For each sample $i$, a value $g^{(i)}$ (e.g. the result of image-model comparison with respect to color blobs) is computed along with the normalized likelihood $\pi^{(i)}$ (which could be the result of image-model comparison with respect to edges). The value $g^{(i)}$ (normalized to sum to one over all samples) is used as the weight of sample $i$ in the Monte Carlo sampling step. A correction term $\lambda^{(i)}$ is computed to compensate for the difference between the true prior distribution of samples, and the distribution achieved by sampling from $g^{(i)}$. One time step in the algorithm proceeds as follows:

1. $N$ new samples $\phi_t^{(i)}$ are drawn from the distribution represented by all old particles $\phi_{t-1}^{(j)}$, weighted with $g_{t-1}^{(j)}$, using Monte Carlo sampling.

2. The new samples are propagated in time by sampling from the temporal prior $p(\phi_t \mid \phi_{t-1})$.

3. For each sample, a correction term $\lambda^{(i)} = f^{(i)} / h^{(i)}$ is now computed. The denominator, $h^{(i)} = \sum_{j=1}^{N} g_{t-1}^{(j)}\, p(\phi_t^{(i)} \mid \phi_{t-1}^{(j)})$, is the probability with which the sample was actually generated. The numerator, $f^{(i)} = \sum_{j=1}^{N} \pi_{t-1}^{(j)}\, p(\phi_t^{(i)} \mid \phi_{t-1}^{(j)})$, is the probability with which the sample $i$ could have been generated if sampling from the posterior (as in standard Condensation).

4. For each sample, the normalized likelihood weighted by the prior correction coefficient is computed as $\pi_t^{(i)} \propto \lambda^{(i)}\, p(\mathbf{I}_t \mid \phi_t^{(i)})$. Furthermore, the weight for propagation, $g_t^{(i)}$, is computed. Both functions are normalized to sum to one over all samples.

The set of samples $\phi_t^{(i)}$, weighted by $\pi_t^{(i)}$, approximates the posterior distribution $p(\phi_t \mid \bar{\mathbf{I}}_t)$ at time $t$.
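As an illustration of step 3, the sketch below (our reading of the correction term under the notation above; `transition_pdf`, an assumed function evaluating $p(\phi_t \mid \phi_{t-1})$, is hypothetical and not part of the authors' code) computes $\lambda^{(i)}$ for one propagated particle:

```python
import numpy as np

def correction_term(phi_new, old_particles, pi_old, g_old, transition_pdf):
    """Importance correction lambda = f / h for one propagated particle.

    f: probability of generating phi_new when sampling from the posterior.
    h: probability of generating phi_new when sampling with the weights g.
    """
    trans = np.array([transition_pdf(phi_new, p_old) for p_old in old_particles])
    f = np.dot(pi_old, trans)  # numerator: posterior-weighted transition
    h = np.dot(g_old, trans)   # denominator: actual sampling distribution
    return f / max(h, 1e-300)  # guard against division by zero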
A variation of this algorithm is to let a certain fraction of the samples be propagated with $g$, and another fraction by the actual posterior distribution at the previous time instant. Note that if all particles are to be propagated with $g$, the distribution over $g$ must be similar to the likelihood distribution, and especially have peaks in the same locations.

However, here, ICondensation is not used to fuse different image cues. Condensation encounters problems when the posterior distribution has very sharp and narrow peaks. By smoothing the likelihood function so that the peaks are damped, this effect is lowered. Thus, $g$ is a smoothed version of the real likelihood, so that $g^{(i)} \propto (\pi^{(i)})^{\gamma}$, where $\gamma < 1$ and $g^{(i)}$ is normalized to sum to one over all samples. This is similar to the annealing idea of Deutscher et al. [15]. However, this approach does maintain a probabilistic representation of the posterior distribution.

The sums over old samples in the computations of $f^{(i)}$ and $h^{(i)}$ cause the complexity of one time step in ICondensation to be $O(N^2)$. For very large $N$, this adds a considerable time factor to the tracking. Therefore, an approximation is introduced. Only the old samples for which $\pi_{t-1}^{(j)}$ and $g_{t-1}^{(j)}$ are above a certain very low threshold are considered in the summation. This excludes a very large part of the samples in the sums over $\pi$ and $g$ when $N$ is large.² The number of samples that are not excluded from the sampling is approximately constant with respect to $N$, which means that the complexity is lowered to approximately $O(N)$.

Making decisions about configuration. The tracking estimates a posterior distribution over the parameters $\phi_t$ at each time step $t$. However, for most applications, e.g. animation, a decision has to be made about the most probable configuration at each time step. In the tracking experiments presented in Section 5, the expected value $E[\phi_t] = \sum_{i=1}^{N} \pi_t^{(i)} \phi_t^{(i)}$ is used to visualize the tracking.³

² The percentage of excluded samples is a measure of the survival rate $\alpha$. Apparently, the survival rate for the original posterior distribution is very low, while the smoothed distribution displays a slightly higher survival rate. This means, according to Equation (2), that fewer samples are needed to correctly track the distribution using our version of ICondensation.

³ This is a good estimate, as long as the distribution has one maximum. In the case of several maxima (matching ambiguities), the expected value will fall somewhere in between the maxima, and give a poor summary of the distribution.
4 Likelihood Models of Human Appearance

In this section, the formulation of the likelihood $p(\mathbf{I} \mid \phi)$ of the image $\mathbf{I}$, given the model parameters $\phi$, is discussed. It is given as a function which returns the likelihood of a certain, hypothesized, configuration $\phi$. The likelihood is used for tracking according to Equation (1). In a particle filtering framework (Section 3.3), the normalized likelihood $\pi^{(i)}$ of each particle $i$ can be computed given the learned likelihood of human appearance provided here.

The detection and tracking of humans in unconstrained environments is made difficult by the wide variation in their appearance due to clothing, illumination, pose, gender, age, etc. We seek a generic model of human appearance that is somewhat invariant to the ways in which people's appearance varies and, at the same time, is specific enough to be useful for distinguishing people from other objects. Building on recent work in modeling natural image statistics, the approach exploits generic filter responses that capture information about appearance and appearance change over time. Statistical models of these filter responses are learned from training examples and provide a rigorous probabilistic model of the appearance of human limbs. Within a Bayesian
framework, these object-specific models can be compared with generic models of natural scene statistics. [ FIGURE 3 ABOUT HERE ] [ FIGURE 4 ABOUT HERE ] In contrast to previous approaches, which used edge models or models of brightness constancy, formulated in an ad hoc manner, to detect people, our goal is to formulate a rigorous probabilistic model of human appearance by learning distributions of image filter responses from training data. Given a database of images containing people, we manually mark human limb axes and boundaries for the thighs, calves, upper arm, and lower arm (Figure 3). Motivated by [37], probability distributions of various filter responses on human limbs, observed in the training images, are constructed (Figure 4). These filters are based on various derivatives of normalized Gaussians [40] and provide some measure of invariance to variations in clothing, lighting, and background.
4.1 Selection of Filters
The boundaries of limbs often differ in luminance from the background, resulting in perceptible edges. Filter responses corresponding to edges are therefore computed at the boundaries of the limbs. First derivatives of normalized Gaussian filters are steered [18] to the orientation of the limb and are applied at multiple scales. Note that an actual edge may or may not be present in the image depending on the local contrast; the statistics of this are captured in the learned distributions and vary from limb to limb.

In addition to boundaries, the elongated structure of a human limb can be modeled as a ridge at an appropriate scale. We employ a steerable ridge filter that responds strongly where there is high curvature of the image brightness orthogonal to the limb axis and low curvature parallel to it [40].

Motion of the body gives rise to the third and final cue considered here. The intensity pattern on the surface of the limb is assumed to change slowly over time. Given the correct motion of the limb, the image patch corresponding to it can be warped to register two consecutive frames. The assumption of brightness constancy implies that the temporal derivatives for this motion-compensated pair are small. Rather than assume some arbitrary distribution of these differences, we learn the distribution for hand-registered sequences and show that it is highly non-Gaussian.
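To make the steerable edge filter concrete, the sketch below (a minimal illustration, not the authors' implementation; function names and the use of scipy are our assumptions) computes a first-derivative-of-Gaussian response steered to an arbitrary orientation, using the steering property of Freeman and Adelson [18]: the oriented response is a linear combination of the x- and y-aligned basis responses.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def steered_edge_response(image, theta, sigma):
    """First-derivative-of-Gaussian response steered to angle theta (radians)."""
    gx = gaussian_filter(image, sigma, order=(0, 1))  # d/dx basis response
    gy = gaussian_filter(image, sigma, order=(1, 0))  # d/dy basis response
    return np.cos(theta) * gx + np.sin(theta) * gy

# Hypothetical usage: sample this response along a predicted limb boundary,
# at the limb's predicted image orientation, over several scales sigma.
```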
4.2 Definition of the Likelihood
These learned distributions can now form the basis for a likelihood model for human limbs. While the distributions characterize the "foreground" object, reliable tracking requires that the foreground and background statistics be sufficiently distinct. We thus also learn the distribution of edge, ridge, and motion filter responses for general scenes without people. This builds upon recent work on learning the statistics of natural scenes [37, 38, 46, 53, 58, 67] and extends it to the problem of people tracking. We show that the likelihood of observing the filter responses for an image is proportional to the ratio between the likelihood that the foreground image pixels are explained by the foreground object and the likelihood that they are explained by some general background (cf. [30, 51]):

$$p(\mathbf{I} \mid \phi) \propto \frac{p(\mathbf{f}_{\mathrm{fg}} \mid \mathrm{person})}{p(\mathbf{f}_{\mathrm{fg}} \mid \mathrm{background})}$$

where $\mathbf{f}_{\mathrm{fg}}$ denotes the filter responses at the image locations to which the foreground (person) model projects.

This ratio is highest when the foreground (person) model projects to an image region that is unlikely to have been generated by some general scene but is well explained by the statistics of people. The ratio also implies that there is no advantage to the foreground model explaining data that is equally well explained as background. It is important to note that the "background model" here is completely general and, unlike common background subtraction techniques, is not tied to a specific, known scene.

Additionally, we note that the absolute contrast between foreground and background is less important than the consistency of edge or ridge orientation. We therefore perform contrast normalization prior to filtering. The formulation of foreground and background models provides a principled way of choosing the appropriate type of contrast normalization. For an optimal Bayesian detection task, we would like the foreground and background distributions to be maximally distinct under some distance measure. We propose an approach for choosing among contrast normalization techniques, based on the Bhattacharyya distance between foreground and background distributions [34, 37].

The approach extends previous work on person tracking by combining multiple image cues, by using learned probabilistic models of object appearance, and by taking into account a probabilistic model of general scenes in the above likelihood ratio. Experimental results suggest that a combination of cues provides a rich likelihood model that results in more reliable and computationally efficient tracking than can be achieved with individual cues.
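As a concrete illustration of the ratio formulation, the sketch below (our own minimal reading, not the authors' code; representing the learned distributions as normalized histograms is an assumption) evaluates the log-likelihood ratio for a set of filter responses observed where the limb model projects:

```python
import numpy as np

def log_likelihood_ratio(responses, p_fg, p_bg, bin_edges):
    """Log of the foreground/background likelihood ratio for filter responses.

    p_fg, p_bg: learned histograms (normalized, identical binning) of filter
    responses on human limbs and on general scenes, respectively.
    """
    idx = np.clip(np.digitize(responses, bin_edges) - 1, 0, len(p_fg) - 1)
    eps = 1e-10  # guard against empty histogram bins
    return np.sum(np.log(p_fg[idx] + eps) - np.log(p_bg[idx] + eps))
```

In a tracker, the log-ratios from the edge, ridge, and motion cues would be summed, assuming conditional independence of the cues.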
5 Models of Human Motion

As discussed in Section 3, tracking is formulated as the problem of estimating a posterior distribution over the state variables. The posterior is the product of a prior distribution and a likelihood distribution. While the likelihood was discussed in the previous section, we here formulate models for estimation of the prior distribution, i.e. the integral in Equation (1).

As described in Section 3.3, the prior distribution at time $t$ is represented by particles sampled from the previous posterior distribution, which have been propagated in time using a temporal model, or temporal prior, $p(\phi_t \mid \phi_{t-1})$. The temporal model encodes information about the dynamics of the human body. Here it is formulated as a conditional probability distribution and is used to focus the sampling on portions of the parameter space that are likely to correspond to human motions.
5.1 Model of Constant Velocity
A simple and general form of temporal model is the smooth motion model, which assumes that the angular velocity of the joints and the velocity of the body are constant over time, and that each joint angle is conditionally independent of the others. It models the effects of limb inertia on the motion in a linear and simplified manner. For a description of the model, see [54, 57].

To test the performance of the general smooth motion model together with the likelihood function described in Section 4, a series of experiments was carried out. The test sequences contain clutter, no special clothing, and no special backgrounds. The experiments use monocular gray-scale sequences with both static and moving cameras.

[ FIGURE 5 ABOUT HERE ]
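As a concrete illustration, the sketch below (our own minimal reading of a constant-velocity prior with an assumed diagonal Gaussian diffusion and unit time step; not the exact model of [54, 57]) propagates one particle through the temporal prior:

```python
import numpy as np

def smooth_motion_sample(phi, velocity, sigma_phi, sigma_v, rng):
    """Constant-velocity temporal prior: predict the pose from the previous
    pose and velocity, then add Gaussian diffusion to both.

    phi, velocity: pose parameters and their temporal derivatives.
    sigma_phi, sigma_v: assumed per-parameter noise standard deviations.
    """
    phi_new = phi + velocity + sigma_phi * rng.standard_normal(phi.shape)
    v_new = velocity + sigma_v * rng.standard_normal(velocity.shape)
    return phi_new, v_new
```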
Combining different image cues. A model of low dimensionality, representing only the torso and one arm, is used. The configuration is represented by 20 parameters, including the arm angles, the global torso position and rotation, and their respective velocities. Figure 5 shows four different tracking results for the same sequence. The model is initialized with a Gaussian distribution around a manually selected set of start parameters $\phi_0$. Camera translation during the sequence causes motion of both the foreground and the background.

Figure 5a shows tracking results using only the motion cue. Generally, motion is an effective cue for tracking; however, in this example, the 3D structure is incorrectly estimated due to drift. The edge cue (Figure 5b) does not suffer from the drift problem, but the edge information at the boundaries of the arm is very sparse and the model is caught in local maxima. The ridge cue is even less constraining (Figure 5c) and the model has too little information to track the arm properly. Figure 5d shows the tracking result using all three cues together. We see that the tracking is qualitatively more accurate than when using any of the three cues separately. While the use of more particles would improve the performance of the individual cues, the benefit of the combined likelihood model is that it constrains the posterior and allows the number of particles to be reduced.

[ FIGURE 6 ABOUT HERE ]

Handling self occlusion. Next, we show an example of tracking two arms (Figure 6). The number of parameters is 16, including the arm angles and their velocities. In this example, the right arm is partly occluded by the left arm. Since each sample represents a generative prediction of the limbs in the scene, it is straightforward to predict occluded regions. The likelihood computations are then performed only on the visible surface points.

Tracking a whole human. Since the number of particles is exponentially dependent on the dimensionality of the search space, as shown in Equation (2), the required number of particles for a model with 50 parameters is very large. With 5000 samples, the tracking of a full human diverges after a few frames, due to a too-large search volume. The use of stereoscopic views of the scene would disambiguate the depth information and largely solve the problems of self occlusion, thus enabling full-body tracking with the smooth motion model. However, for many applications, only a monocular sequence is available. Therefore, it is interesting to study other ways to constrain the tracking.
5.2 Activity-Specific Model of Cyclic Motion
Although general and flexible, the smooth motion model presented above requires too many particles to make full-body tracking feasible. However, in many activities, such as walking or running, the motion pattern is quite constrained, which makes the smooth motion model unnecessarily general. More specific motion models, which capture the most common motion patterns of humans, can be learned from training data in the form of examples of human motion. Here, such a model is described. Since the model is more specific to human motion than the smooth motion model, fewer particles are needed to represent the probability distribution. This makes the tracking faster. However, the model is activity specific. The model activity here is walking, but models of any cyclic activity could be learned using the same scheme.

A set of examples of walking cycles is acquired using a motion capture system. From the set of cycles, a model of the mean walking cycle, as well as the most common variations, or "eigencycles", is learned using principal component analysis (PCA). A synthetic walking cycle can now be formulated as a linear combination of eigencycles. Given the mean cycle, the eigencycles, the coefficients for linear combination of the eigencycles, and the phase at time step $t-1$, the sought conditional probability $p(\phi_t \mid \phi_{t-1})$ can be formulated using a considerably smaller number of independent parameters than in the general smooth motion model. For a more detailed overview of the model, see [54, 57].
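The sketch below (a minimal illustration of the eigencycle synthesis idea under our own assumptions; the sampled-cycle representation, interpolation scheme, and function names are ours, not the model of [54, 57]) reads off the joint angles of a synthesized walking cycle at a given phase:

```python
import numpy as np

def walking_pose(mean_cycle, eigencycles, coeffs, phase):
    """Joint angles at a given phase of a synthesized walking cycle.

    mean_cycle:  (T, n_angles) mean walking cycle from motion capture.
    eigencycles: (K, T, n_angles) principal variations ("eigencycles").
    coeffs:      (K,) linear-combination coefficients for this walker.
    phase:       position in the cycle, in [0, 1).
    """
    T = mean_cycle.shape[0]
    # Linear combination of the mean cycle and the eigencycles.
    cycle = mean_cycle + np.tensordot(coeffs, eigencycles, axes=1)
    # Linear interpolation between the two nearest phase samples.
    x = phase * T
    i0 = int(np.floor(x)) % T
    i1 = (i0 + 1) % T
    frac = x - np.floor(x)
    return (1 - frac) * cycle[i0] + frac * cycle[i1]
```

In the temporal prior, the state then reduces mainly to the phase, its rate, the eigencycle coefficients, and the global pose, which is the source of the large reduction in independent parameters.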
To verify that the required number of particles is lower than for a more general motion model, the walking model was tested on two different sequences: one where the human is walking on a straight path and one where the human moves in a circle.

[ FIGURE 7 ABOUT HERE ]

Tracking a walking human.
The walking model is used to track a person walking on a straight path parallel to the camera plane over 50 frames (Figure 7). All parameters were initialized manually with a Gaussian prior at time $t = 0$ (Figure 7, frame 0). As shown in Figure 7, the model successfully tracks the person, although some parts of the body (here the arms) are poorly estimated. There are two reasons for this. Firstly, the training set is limited, and the mode of walking of the tracking subject has to be extrapolated from the training set. This creates artifacts in the walking motion, since small deviations within the training set are expanded. Secondly, the articulated shape model does not account for all shape deformations of the tracking subject, especially on the upper body where the range of motion is large. Despite this, the walking model makes it possible to track the person through the sequence.

[ FIGURE 8 ABOUT HERE ]

Change in orientation and depth. The next experiment involves tracking a person walking in a circular path, thus changing both depth and orientation with respect to the camera. Figure 8 shows the tracking results for frames 0 to 50. During the first 15 frames, the human is tracked correctly. In frame 20, the tracker is confused, in part because of the large self occlusion and the low contrast between the foreground and the background. However, the tracking recovers after a few frames, so that the position is correctly estimated in frame 30, when more of the human is visible. As the human turns away from the camera, less and less information about the walking phase is given. The tracker becomes confused and the parameters slowly diverge from the true values. This is not the effect of too few particles, since the same result is obtained when tracking with ten times as many particles. A probable cause is the limited training set: it is not always possible to extrapolate the true position from the current training set, and this becomes even more apparent when the model is viewed from a "degenerate" view where the motion in the image plane is very limited. Note also that the training data only contained examples of people walking in a straight line. Due to dynamic effects, the joint angles during a cycle of walking on a straight line differ slightly from the angles observed during a cycle of walking on a bent path. Thus, the true position can only be approximately estimated. If the deviation between the true and the estimated position is too large, the likelihood measure does not give correct information, since it is highly local. Therefore, it might be necessary to use less local likelihood measures for tracking with stronger motion priors.

[ FIGURE 9 ABOUT HERE ]
Significance of the walking prior. How significant is the temporal walking prior in the tracking? Figure 9 illustrates the effect of repeating the above experiment with a uniform likelihood function, so that the evolution of the parameters is determined entirely by the temporal model. While the prior is useful for constraining the model parameters to valid walking motions, it does not unduly affect the tracking.
6 Conclusions

Articulated tracking of 3D human motion is made difficult by the large number of dimensions in the human model, and by the complicated appearance and nonlinear motion of humans. Previous work has often constrained the problem by limiting the range of clothing and background appearance, and by using multiple cameras with a wide baseline. In contrast, the goal of the work presented here has been to investigate how well human motion can be tracked and reconstructed using only a monocular gray-scale camera image source and no particular assumptions about the appearance of either the human or the background. A probabilistic framework for articulated tracking of humans in monocular video sequences was presented in Section 3, and successful tracking results in a number of different environments were shown in Section 5. Although the results are qualitative, they show that it is possible to reconstruct full-body human motion in 3D from a monocular sequence if certain assumptions about the human motion pattern are made.

Although computationally expensive, the Bayesian framework for tracking presented here provides a mathematically rigorous way of combining different types of shape, appearance, and motion models, and of including learned knowledge of the appearance and motion of humans. With the constant increase in affordable computational power, the possibilities of employing probabilistic model-based approaches such as the one presented here increase.

Future issues to be addressed include the question of initialization, and the development of motion models that are less constraining than the model of cyclic motion presented in Section 5, while less general than the model of constant velocity. A suggested approach is presented in [55]. It is the hope of the authors that the development of algorithms for tracking of articulated motion will in the future lead to robust systems for vision-based human-computer interaction and marker-less motion capture, to enhance and simplify the communication between computers and their users.
References

[1] J. K. Aggarwal and Q. Cai. Human motion analysis: A review. Computer Vision and Image Understanding, 73(3):428–440, 1999.
[2] N. I. Badler, M. J. Hollick, and J. P. Granieri. Real-time control of a virtual human using minimal sensors. Presence, 2(1):82–86, 1993.
[3] M. J. Black. Explaining optical flow events with parameterized spatio-temporal models. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, volume 1, pages 326–332, 1999.
[4] M. J. Black and D. J. Fleet. Probabilistic detection and tracking of motion boundaries. International Journal of Computer Vision, 38(3):231–245, 2000.
[5] G. Blom. Sannolikhetsteori och statistikteori med tillämpningar. Studentlitteratur, Lund, Sweden, 1989.
[6] A. Bobick and J. Davis. An appearance-based representation of action. In International Conference on Pattern Recognition, ICPR, 1996.
[7] M. Brand. Shadow puppetry. In IEEE International Conference on Computer Vision, ICCV, volume 2, pages 1237–1244, 1999.
[8] M. Brand and A. Hertzmann. Style machines. In Computer Graphics, SIGGRAPH 2000, pages 183–192, 2000.
[9] C. Bregler and J. Malik. Tracking people with twists and exponential maps. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pages 8–15, 1998.
[10] A. Bruderlin and T. W. Calvert. Goal-directed, dynamic animation of human walking. Computer Graphics, 23:233–242, 1989.
[11] T.-J. Cham and J. M. Rehg. A multiple hypothesis approach to figure tracking. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, volume 1, pages 239–245, 1999.
[12] K. Choo and D. Fleet. People tracking using hybrid Monte Carlo filtering. In IEEE International Conference on Computer Vision, ICCV, volume 2, pages 321–328, 2001.
[13] D. Comaniciu, V. Ramesh, and P. Meer. Real-time tracking of non-rigid objects using mean shift. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, volume 2, pages 142–149, 2000.
[14] T. Darrell, G. Gordon, M. Harville, and J. Woodfill. Integrated person tracking using stereo, color, and pattern detection. International Journal of Computer Vision, 37(2):175–185, 2000.
[15] J. Deutscher, A. Blake, and I. Reid. Articulated motion capture by annealed particle filtering. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, volume 2, pages 126–133, 2000.
[16] J. Deutscher, B. North, B. Bascle, and A. Blake. Tracking through singularities and discontinuities by random sampling. In IEEE International Conference on Computer Vision, ICCV, volume 2, pages 1144–1149, 1999.
[17] T. Drummond and R. Cipolla. Real-time tracking of highly articulated structures in the presence of noisy measurements. In IEEE International Conference on Computer Vision, ICCV, volume 2, pages 315–320, 2001.
[18] W. T. Freeman and E. H. Adelson. The design and use of steerable filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(9):891–906, 1991.
[19] D. M. Gavrila. Vision-based 3-D Tracking of Humans in Action. PhD thesis, University of Maryland, College Park, MD, USA, 1996.
[20] D. M. Gavrila. The visual analysis of human movement: A survey. Computer Vision and Image Understanding, 73(1):82–98, 1999.
[21] D. M. Gavrila and L. S. Davis. 3-D model-based tracking of humans in action: a multi-view approach. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pages 73–80, 1996.
[22] L. Goncalves, E. Di Bernardo, and P. Perona. Reach out and touch space (motion learning). In IEEE International Conference on Automatic Face and Gesture Recognition, pages 234–238, 1998.
[23] L. Goncalves, E. Di Bernardo, E. Ursella, and P. Perona. Monocular tracking of the human arm in 3D. In IEEE International Conference on Computer Vision, ICCV, pages 764–770, 1995.
[24] N. Gordon, D. Salmond, and A. F. M. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F – Radar and Signal Processing, 140(2):107–113, 1993.
[25] I. Haritaoglu, D. Harwood, and L. Davis. A real time system for detecting and tracking people. Image and Vision Computing, to appear.
[26] D. C. Hogg. Model-based vision: A program to see a walking person. Image and Vision Computing, 1(1):5–20, 1983.
[27] M. Isard and A. Blake. Condensation – conditional density propagation for visual tracking. International Journal of Computer Vision, 29(1):5–28, 1998.
[28] M. Isard and A. Blake. ICondensation: Unifying low-level and high-level tracking in a stochastic framework. In European Conference on Computer Vision, ECCV, volume 1, pages 893–909, 1998.
[29] M. Isard and A. Blake. A mixed-state Condensation tracker with automatic model-switching. In IEEE International Conference on Computer Vision, ICCV, pages 107–112, 1998.
[30] M. Isard and J. MacCormick. BraMBLe: A Bayesian multiple-blob tracker. In IEEE International Conference on Computer Vision, ICCV, volume 2, pages 34–41, 2001.
[31] E. T. Jaynes. Probability Theory: The Logic of Science. Unpublished, preprint at http://bayes.wustl.edu/etj/prob.html, 1996.
[32] S. Ju. Human motion estimation and recognition (depth oral report). Technical report, University of Toronto, Toronto, Canada, 1996.
[33] S. X. Ju, M. J. Black, and Y. Yacoob. Cardboard people: A parameterized model of articulated motion. In IEEE International Conference on Automatic Face and Gesture Recognition, pages 38–44, 1996. [34] T. Kailath. The divergence and Bhattacharyya distance measures in signal selection. IEEE Transactions on Communication Technology, COM-15(1):52–60, 1951. [35] I. Kakadiaris and D. Metaxas. 3D human body model acquisition from multiple views. In IEEE International Conference on Computer Vision, ICCV, pages 618–623, 1995. [36] I. Kakadiaris and D. Metaxas. Model-based estimation of 3D human motion with occlusion based on active multi-viewpoint selection. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pages 81–87, 1996. [37] S. M. Konishi, A. L. Yuille, J. M. Coughlan, and S. C. Zhu. Fundamental bounds on edge detection: An information theoretic evaluation of different edge cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, submitted. [38] A. B. Lee, D. Mumford, and J. Huang. Occlusion models for natural images: A statistical study of a scale-invariant dead leaves model. International Journal of Computer Vision, 41(1/2):35–59, 2001. [39] M. E. Leventon and W. T. Freeman. Bayesian estimation of 3-d human motion from an image sequence. Technical Report TR–98–06, Mitsubishi Electric Research Lab, Cambridge, MA, USA, 1998. [40] T. Lindeberg. Edge detection and ridge detection with automatic scale selection. International Journal of Computer Vision, 30(2):117–156, 1998. [41] J. J. Little and J. Boyd. Recognizing people by their gate: The shape of motion. Videre, 1(2):1–32, 1998. [42] J. MacCormick and M. Isard. Partitioned sampling, articulated objects, and interface quality hand tracking. In European Conference on Computer Vision, ECCV, volume 2, pages 3–19, 2000. [43] T. B. Moeslund and E. Granum. A survey of computer vision-based human motion capture. Computer Vision and Image Understanding, 18:231–268, 2001. [44] L. Molina and A. Hilton. Realistic synthesis of novel human movements from a database of motion capture examples. In IEEE Workshop on Human Motion, HUMO, pages 137–142, 2000. [45] D. Morris and J. M. Rehg. Singularity analysis for articulated object tracking. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pages 289–296, 1998. [46] B. Olshausen and D. Field. Natural image statistics and efficient coding. Network Computation in Neural Systems, 7(2):333–339, 1996. [47] V. Pavolvi´c, J. Rehg, T-J. Cham, and K. Murphy. A dynamic Bayesian network approach to figure tracking using learned dynamic models. In IEEE International Conference on Computer Vision, ICCV, volume 1, pages 94–101, 1999. [48] R. Pl¨ankers and P. Fua. Articulated soft objects for video-based body modeling. In IEEE International Conference on Computer Vision, ICCV, volume 1, pages 394–401, 2001. [49] K. Pullen and C. Bregler. Animating by multi-level sampling. In IEEE Computer Animation, pages 36–42, 2000.
[50] J. Rehg and T. Kanade. Model-based tracking of self-occluding articulated objects. In IEEE International Conference on Computer Vision, ICCV, pages 612–617, 1995.
[51] J. Rittscher, J. Kato, S. Joga, and A. Blake. A probabilistic background model for tracking. In European Conference on Computer Vision, ECCV, volume 2, pages 336–350, 2000.
[52] K. Rohr. Towards model-based recognition of human movements in image sequences. CVGIP: Image Understanding, 59(1):94–115, 1994.
[53] D. L. Ruderman. The statistics of natural images. Network: Computation in Neural Systems, 5(4):517–548, 1994.
[54] H. Sidenbladh. Probabilistic Tracking and Reconstruction of 3D Human Motion in Monocular Video Sequences. PhD thesis, KTH, Stockholm, Sweden, 2001.
[55] H. Sidenbladh and M. J. Black. Implicit probabilistic models of human motion for synthesis and tracking. In European Conference on Computer Vision, ECCV, 2002, submitted.
[56] H. Sidenbladh and M. J. Black. Learning the statistics of people in images and video. International Journal of Computer Vision, submitted.
[57] H. Sidenbladh, M. J. Black, and D. J. Fleet. Stochastic tracking of 3D human figures using 2D image motion. In European Conference on Computer Vision, ECCV, volume 2, pages 702–718, 2000.
[58] E. P. Simoncelli. Statistical models for images: Compression, restoration and optical flow. In Asilomar Conference on Signals, Systems and Computers, 1997.
[59] W. Stadler. Analytical Robotics and Mechatronics. McGraw-Hill, New York, NY, USA, 1995.
[60] J. Sullivan, A. Blake, and J. Rittscher. Statistical foreground modelling for object localisation. In European Conference on Computer Vision, ECCV, volume 2, pages 307–323, 2000.
[61] S. Wachter and H. Nagel. Tracking of persons in monocular image sequences. Computer Vision and Image Understanding, 74(3):174–192, 1999.
[62] J. Wang, G. Lorette, and P. Bouthemy. Analysis of human motion: A model-based approach. In Scandinavian Conference on Image Analysis, SCIA, volume 2, pages 1142–1149, 1991.
[63] C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland. Pfinder: Real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):780–785, 1997.
[64] C. R. Wren and A. P. Pentland. Dynaman: Recursive modeling of human motion. Image and Vision Computing, to appear.
[65] Y. Yacoob and M. J. Black. Parameterized modeling and recognition of activities in temporal surfaces. Computer Vision and Image Understanding, 73(2):232–247, 1999.
[66] Y. Yacoob and L. Davis. Learned models for estimation of rigid and articulated human motion from stationary or moving camera. International Journal of Computer Vision, 36(1):5–30, 2000.
[67] S. C. Zhu and D. Mumford. Learning generic prior models for visual computation. IEEE Transactions on Pattern Analysis and Machine Intelligence, to appear.
[Figure 1 annotations: depth ambiguities; similarity between limbs and other structure; self-occlusion]
Figure 1: Extreme case. Tracking of humans in video sequences is made difficult by a number of ambiguities and singularities.
[Figure 2 annotations: global coordinate axes (X_g, Y_g, Z_g) and limb-local axes (X_1, Y_1, Z_1), (X_2, Y_2, Z_2); joint angles [θ_1, ..., θ_19] in groups θ_1, [θ_2, θ_3], [θ_4, θ_5, θ_6], θ_7, [θ_8, θ_9, θ_10], θ_11, [θ_12, θ_13, θ_14], θ_15, [θ_16, θ_17, θ_18], θ_19; global rot [α_x, α_y, α_z]; global trans [t_x, t_y, t_z]]
(a) 2 connected limbs with coordinate systems. (b) Overview of the assembly.
Figure 2: Human model. Each limb, i, has a local coordinate system with the Z_i axis directed along the limb. Joints have up to 3 angular DOF, expressed as Euler angles. (a) Part-centric coordinate systems. (b) The assembly of truncated cones used for tracking.
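Since the assembly in Figure 2 is a kinematic chain of limb-local coordinate frames, a short sketch may help make the parameterization concrete. The following is a minimal illustration, not the authors' implementation; the frame composition order and the convention that each frame's Z axis runs along its limb are assumptions read off the figure.

```python
import numpy as np

def euler_rotation(ax, ay, az):
    """Rotation matrix from Euler angles about the X, Y and Z axes (radians)."""
    cx, sx = np.cos(ax), np.sin(ax)
    cy, sy = np.cos(ay), np.sin(ay)
    cz, sz = np.cos(az), np.sin(az)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def limb_frames(global_R, global_t, joint_angles, limb_lengths):
    """Compose limb-local frames along one chain of the body model.

    joint_angles: list of (ax, ay, az) Euler-angle triples, one per joint
    (joints with fewer DOF simply keep the unused angles fixed at zero).
    limb_lengths: length of each parent limb along its local Z axis.
    """
    frames = [(global_R, global_t)]
    R, t = global_R, global_t
    for (ax, ay, az), length in zip(joint_angles, limb_lengths):
        t = t + R @ np.array([0.0, 0.0, length])  # walk to the joint along the parent's Z axis
        R = R @ euler_rotation(ax, ay, az)        # rotate into the child limb's frame
        frames.append((R, t))
    return frames
```

Each frame returned by limb_frames would then position one truncated cone of the assembly in panel (b).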
Figure 3: Example images from the training set with limb edges manually marked.
[Figure 4 plots: log(P_on) and log(P_off) versus filter response, for thigh and background and per image pyramid level (levels 0–3); horizontal axes: edge response in edge orientation, ridge response in ridge orientation, temporal difference on limb / on background]
(a) Edge response. (b) Ridge response. (c) Motion response.
Figure 4: Examples of probability distributions for different filter responses, learned from the set of training images. For definitions of the filters, see [56].
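The figure suggests how such learned distributions enter the tracker: filter responses measured on a hypothesized limb are scored under the foreground distribution P_on and the background distribution P_off. As a hedged sketch (the histogram lookup and the helper names below are illustrative, not the paper's code), the log-likelihood contribution of a set of responses can be accumulated as follows.

```python
import numpy as np

def cue_log_likelihood(responses, p_on, p_off, bin_edges):
    """Score filter responses under learned foreground/background histograms.

    responses: filter responses sampled at hypothesized limb locations.
    p_on, p_off: normalized histograms over response values (as in Figure 4).
    bin_edges: shared bin edges for both histograms.
    """
    idx = np.clip(np.digitize(responses, bin_edges) - 1, 0, len(p_on) - 1)
    eps = 1e-10  # guard against empty histogram bins
    return np.sum(np.log(p_on[idx] + eps) - np.log(p_off[idx] + eps))
```

Responses from the edge, ridge and motion cues could each be scored this way and their log-likelihoods summed, which mirrors the combination of cues compared in Figure 5.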
Figure 5: Tracking an arm, moving camera, 5000 samples. The sub-figures show frames 0, 10, 20, 30, 40 and 50 of the sequence. In each frame, the expected value of the posterior distribution over the model parameters is projected into the image. (a) Only flow cue. (b) Only edge cue. (c) Only ridge cue. (d) All cues.
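The "expected value" displayed in Figures 5–8 is the standard summary of a particle-filter posterior: a weighted mean of the samples. A minimal sketch, assuming the particles are stored as an array of parameter vectors (note that a plain weighted mean is only a rough summary for angular parameters):

```python
import numpy as np

def posterior_mean(particles, weights):
    """Weighted mean of a particle set.

    particles: (N, D) array of sampled model-parameter vectors.
    weights: (N,) importance weights (normalized here for safety).
    """
    w = weights / weights.sum()
    return (particles * w[:, None]).sum(axis=0)
```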
Figure 6: Tracking two crossing arms, 3000 samples. The images show the expected value of the posterior distribution in frames 0, 20, 40, 60, 80 and 100 of the sequence.
(a) Projection of the expected value of the model parameters. (b) 3D configuration of the expected value of the model parameters. [3D-plot axis ticks omitted]
Figure 7: Tracking a human walking in a straight line, 500 samples. (a) Projection of the expected model configuration at frames 0, 10, 20, 30, 40 and 50. (b) 3D configuration for the expected model parameters in the same frames.
(a) Projection of the expected value of the model parameters. (b) 3D configuration of the expected value of the model parameters. [3D-plot axis ticks omitted]
Figure 8: Tracking a human walking in a circle, 500 samples. (a) Projection of the expected model configuration at frames 0, 10, 20, 30, 40 and 50. (b) 3D configuration for the expected model parameters in the same frames.
Figure 9: How strong is the walking prior? Tracking results for frames 0, 10, 20, 30, 40 and 50, when no image information is taken into account (uniform likelihood function).
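Figure 9 isolates the temporal prior by making the likelihood constant, so that particles are never reweighted and the state simply diffuses under the motion model. The sketch below illustrates the idea with a generic Gaussian random-walk prior; the actual walking prior in the paper is learned, so this stand-in is an assumption for illustration only.

```python
import numpy as np

def propagate_prior_only(particles, n_frames, sigma, rng=None):
    """Run a particle set forward under a uniform likelihood.

    With constant weights, resampling is a no-op, so each frame just
    applies the temporal prior (here: additive Gaussian noise).
    """
    rng = np.random.default_rng() if rng is None else rng
    trajectory = [particles]
    for _ in range(n_frames):
        particles = particles + rng.normal(0.0, sigma, size=particles.shape)
        trajectory.append(particles)
    return trajectory
```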