Human Body Parts Tracking Using Pictorial Structures and a Genetic Algorithm
H. Bhaskar1, L. Mihaylova1 and S. Maskell2
1 Department of Communication Systems, Lancaster University InfoLab21, South Drive, Lancaster University, Lancaster LA1 4WA
2 QinetiQ, Malvern Technology Centre
{h.bhaskar, mila.mihaylova}@lancaster.ac.uk, [email protected]
Abstract— Tracking people and localising body parts is a challenging computer vision problem because people move unpredictably and are often partially or fully occluded. In this work we focus on the problem of automatic detection and tracking of humans, and we propose a technique that combines background subtraction (BS) and foreground modeling with matching based on a genetic algorithm. The developed architecture combines a self-adaptive cluster level BS scheme using a Gaussian mixture model (GMM) and an appearance learning model of the foreground with pictorial structures. The model of the human body parts is then matched with the background subtracted sequence using an efficient genetic algorithm. The efficiency of the designed technique is demonstrated over real video sequences.
I. INTRODUCTION
Motion detection and tracking are critical to many automated visual applications. A high degree of sensitivity and robustness is often desired from any detection mechanism. The simplest way of accomplishing detection is to build a representation of the scene background and compare each new frame with this representation. This procedure is popularly known as background subtraction [15], [16], [8], [2].
In the last several years, a number of different BS techniques have been proposed in the literature, including mixture of Gaussians [21], kernel density estimation [4], [12], colour and gradient cues [7], high level region analysis [18], Kalman filter [20], hidden Markov models [17], and Markov random fields [9]. The general idea behind the Bayesian type techniques for BS is to represent each pixel of a scene using a probability density function (PDF). A pixel from a new image is classified as background depending on how well it is described by its density function. Common difficulties of the BS techniques are in handling dynamic changes of the environment, gradual or sudden (e.g. moving clouds); motion changes including camera oscillations and high frequency background objects (tree branches, sea waves, etc.); and changes in the background geometry (such as parked cars) [2].
In contrast to the BS methods, foreground learning models concern detecting generic classes of objects rather than just specific templates. Most foreground learning models are
motivated by the pictorial structure representation introduced in [5]. According to the pictorial structure model, an object can be represented by a collection of parts arranged in a deformable configuration. This structure allows encoding different parts of the body with the local properties of the object, whilst the deformable configuration characterises spring-like movements. The second phase of the foreground learning model deals with matching these pictorial structures to an image. This is generally done through minimising an energy function that measures the cost for every part [6]. The pictorial structure representation is an appealing approach for foreground modeling, particularly due to its simplicity and generality. However, its application to automatic detection has been limited for the following reasons: (i) the model and its parameters are specific to different objects, so it is often hard to specify a generic model; (ii) the resulting energy minimisation problem is highly complex and requires efficient techniques for real-time applicability; and (iii) the matching process can contain a solution space with many outcomes, resulting in ambiguities. A number of different techniques have been proposed in the literature for pictorial structure matching (e.g., [6], [11]). The procedure of pictorial structure matching generally consists of two main stages: feature extraction and matching. In the first stage, discrete primitives, or features such as corners [14], are detected. In the second stage, models are matched against those features. Finding people by matching body parts has attracted considerable interest in recent years, and a number of studies (e.g., [10], [3], [13], [1]) have been performed in this domain. In the present work, we explore the advantages of combining the BS methods with the appearance of the foreground. Knowing the foreground can help improve the quality of BS and thus the detection process.
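To make the energy-minimisation view of pictorial structure matching more concrete, the following is a minimal sketch of an energy with one appearance cost per part plus a spring-like deformation cost per connected pair of parts. It is an illustration only, not the formulation used in this paper: the function name `pictorial_energy`, the quadratic form of the spring cost, the `stiffness` parameter and the dictionary-based data layout are all assumptions introduced here.

```python
def pictorial_energy(locations, appearance_cost, edges, stiffness=1.0):
    """Energy of one candidate configuration of body parts (hypothetical sketch).

    locations       -- dict: part name -> (x, y) candidate position
    appearance_cost -- dict: part name -> cost of placing the part there
    edges           -- list of (part_a, part_b, (dx, dy)), where (dx, dy) is
                       the assumed rest offset of part_b relative to part_a
    stiffness       -- assumed weight of the quadratic spring term
    """
    # Appearance term: how badly each part matches the image at its location.
    unary = sum(appearance_cost[part] for part in locations)

    # Deformation term: squared deviation of each connected pair from its
    # rest offset, which is a simple model of a spring-like connection.
    pairwise = 0.0
    for part_a, part_b, (dx, dy) in edges:
        ax, ay = locations[part_a]
        bx, by = locations[part_b]
        pairwise += stiffness * ((bx - ax - dx) ** 2 + (by - ay - dy) ** 2)

    return unary + pairwise


# Two-part toy model: the head is expected 10 pixels above the torso.
edges = [("torso", "head", (0, -10))]
costs = {"torso": 1.0, "head": 2.0}
rest = pictorial_energy({"torso": (0, 0), "head": (0, -10)}, costs, edges)
bent = pictorial_energy({"torso": (0, 0), "head": (3, -6)}, costs, edges)
```

In this toy example the configuration at the rest offset pays only the appearance costs, while the displaced head adds a deformation penalty. A matching procedure then searches over candidate locations for the configuration with the lowest energy.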
The main contribution of this paper is the combination of the cluster level BS technique using a mixture of Gaussians with a pictorial structure based genetic matching algorithm for foreground learning to obtain robust automatic human detection. The remaining part of the paper is organised as follows. Section II presents brief details on the algorithm
for cluster level BS. A foreground modeling technique is then presented in Section III to detect the parts of a human body. Section IV gives a detailed evaluation of the foreground modelling techniques over natural video sequences. Finally, some concluding remarks and open issues for future work are outlined in Section V.
II. CLUSTER BS
A. Update of the GMM Parameters at Cluster Level
The problem of cluster BS involves a decision whether a cluster of pixels belongs to the background (bG) or foreground (fG) object based on the ratio of probability density functions:
$$\frac{p(bG \mid c_k^i)}{p(fG \mid c_k^i)} = \frac{p(c_k^i \mid bG)\,p(bG)}{p(c_k^i \mid fG)\,p(fG)}, \qquad (1)$$
where the vector $c_k^i = (c_{1,k}^i, \ldots, c_{\ell,k}^i)$ characterises the $i$-th cluster ($0 \le i \le q$) at time instant $k$ (and current image), containing $\ell$ pixels, such that $[Im]_k = [c_k^1, \ldots, c_k^q]$ is the whole image; $p(bG \mid c_k^i)$ is the PDF of the background, subtracted based on a certain feature (e.g., colour, edges) of the cluster $c_k^i$; $p(fG \mid c_k^i)$ is the PDF of the foreground on the same cluster $c_k^i$; $p(c_k^i \mid bG)$ refers to the PDF model of the background and $p(c_k^i \mid fG)$ is the appearance model of the foreground object. In our cluster BS technique the decision that any cluster belongs to a background is made if:
$$p(c_k^i \mid bG) > \text{threshold} \left( = \frac{p(c_k^i \mid fG)\,p(fG)}{p(bG)} \right).$$
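The decision rule above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `is_background` and its parameters are hypothetical, the two likelihoods are assumed to have been evaluated already for the cluster, and the priors are assumed to satisfy p(fG) = 1 - p(bG).

```python
def is_background(lik_bg, lik_fg, prior_bg=0.5):
    """Classify one cluster as background (True) or foreground (False).

    lik_bg   -- p(c | bG), background likelihood of the cluster
    lik_fg   -- p(c | fG), foreground appearance likelihood of the cluster
    prior_bg -- p(bG); p(fG) is assumed to be 1 - p(bG)
    """
    if not 0.0 < prior_bg < 1.0:
        raise ValueError("prior_bg must lie strictly between 0 and 1")
    prior_fg = 1.0 - prior_bg

    # Threshold from the rule: p(c | fG) p(fG) / p(bG).
    threshold = lik_fg * prior_fg / prior_bg

    # Background if the background likelihood exceeds the threshold, which
    # is equivalent to the posterior ratio in (1) being greater than one.
    return lik_bg > threshold


equal_priors = is_background(0.05, 0.3, prior_bg=0.5)
strong_bg_prior = is_background(0.05, 0.3, prior_bg=0.9)
```

With equal priors the rule reduces to a plain likelihood comparison, so the cluster in the example is labelled foreground. Raising the background prior to 0.9 lowers the threshold enough for the same cluster to be labelled background, which shows how the prior shifts the decision boundary.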
• Determine the distance of every pixel from the cluster centres using a specific distortion criterion. For example, for each pixel feature $x$ from the image $[Im]_k$, the Euclidean norm is calculated between the pixel feature and its neighbourhood of $nB$ connected pixels $x_{nB}$. In our algorithm, we have used the sum of differences in the $\{h, s, v\}$ feature space and the spatial distance between these pixels as the distortion criterion.
• If the value $d$ of the Euclidean norm is smaller than a predefined threshold $\epsilon$, i.e., $d \le \epsilon$, then the pixels from regions sharing similar colour features are clustered together.
• Repeat until either the codewords do not change or the change in the codewords is small.
The results obtained from this clustering method, however, can vary in accordance with the features chosen and the type of clustering scheme. A GMM, containing $M$ components, is then used to represent the density distribution
M X
pe(cik |