FUSION OF MULTIPLE FEATURES FOR IDENTITY ESTIMATION

J. Annesley, V. Leung, A. Colombo, J. Orwell, S. A. Velastin
Digital Imaging Research Centre, Kingston University, U.K.

Keywords: pedestrians, tracking, recognition, fusion, information gain

Abstract

In a visual surveillance application where a network of cameras is used to observe many people, there are numerous situations where the consistent identification of a person is an important component of the system. There are many approaches to describing the appearance of a person. In this paper, we examine a combination of two types of identity descriptor: standard MPEG-7 color descriptors, and spatial descriptors including the height and texture of the object. Using a data set of 47 people with 6 observations per person, we show that fusing the information given by these descriptors in an information-theoretic framework gives an improvement over using individual descriptors alone. This approach is compared to a metric used in QBE (Query By Example) applications, relating the theory to the practical need in a visual surveillance application to ascertain the identity of a person.

1 Introduction

The consistent labeling of pedestrian identity can be usefully applied to many crime detection and prevention scenarios. For automatic visual surveillance systems, one objective is to maintain a single consistent label for an individual as they traverse the area under surveillance. This requires processing of appearance-based features that can be extracted from medium-range views from standard CCTV sensors. To maintain this label over longer timescales, e.g. from day to day, further data and techniques are required, e.g. face recognition on higher-resolution imagery, or contextual information such as the RFID tickets used on some urban public transport systems. Fundamentally, the requirement of many surveillance systems is to be able to locate a given individual at other times and locations in the area under surveillance. This is necessary for certain real-time scenarios, e.g. a control-room application that keeps a known suspect or at-risk individual in view, as well as for off-line scenarios, e.g. the retrieval of all instances of a person leading up to the time they abandoned a package.

There are two main approaches for automatically maintaining a label for each pedestrian. If the camera views are overlapping and the trajectory is continuously in view, then continuity of location can be used to track each pedestrian and provide them with a unique identifier. For non-contiguous camera views, features about each pedestrian are stored and used to retrieve similar instances. The reliability of this approach is constrained by the extent to which the features are invariant with respect to pose, lighting, camera position and color calibration.

In this paper, we examine the use of certain features of each person for maintaining their unique identities in a visual surveillance application. The features used are a combination of MPEG-7 color descriptors and spatial descriptors including height and texture. The efficacy of these descriptors for identity retrieval is evaluated using two methods. First, the Average Normalised Modified Retrieval Rank (ANMRR) [9] is used, where the closeness of match is based on a ranking approach. Second, an information-theoretic approach is used, where the information contribution from each feature is combined using standard pattern analysis techniques. The experiments show that fusing features provides a better identity estimate than using individual features, and the relationship between these two metrics links the theory to the practical need in a visual surveillance application to ascertain the identity of a person.

2 Previous Work

The retrieval or recognition of people from a database in the context of visual surveillance has been addressed by many authors. Usually, the process involves the extraction of salient features from the subjects, which are then used individually or in combination to provide a unique signature of a person. Common generic features include spatial, texture and color features. Given the context of the problem, it is also possible to use specific features, such as shape (i.e. silhouette), gait [14] and the head shape [4] of each person. The use of MPEG-7 descriptors as features has also been studied: Berriss et al. [2] used MPEG-7 shape and color descriptors for retrieving subjects from a database, while Hahnel et al. [5] used a combination of MPEG-7 color and texture descriptors. Once the features have been extracted from all the subjects, they can be used individually or in combination to generate a model from which the classification can be performed. One of the simplest models is a histogram of the feature values. More sophisticated parametric models, as well as neural networks (used in [5]) and Bayesian networks, can also be constructed.

The fusion of information from multiple features has also been considered. Simple methods include taking the maximum, minimum, or an average of the contributing features. Standard pattern analysis techniques such as parametric probability density function (p.d.f.) estimation provide a more rigorous approach to the problem.

To evaluate the performance of retrieving a person's identity, a traditional approach is to rank the closeness of match between the query subject and those in the database; one such metric is the ANMRR [9]. On the other hand, in order to represent uncertainty in the output, the probability of correct retrieval can be used, e.g. based on an information-theoretic approach [3]. These two evaluation measures operate on two distinct types of similarity measure: the ANMRR is designed to evaluate ranked orderings of similarities, while the information gain approach measures the difference between prior and posterior probability estimates. For crime prevention and detection scenarios, both types of similarity measure have their roles. For a human operator working in real time, probabilistic measures are not necessarily easy to interpret: it may be more suitable to have a ranked order of candidates from which they can choose for further inspection. For legal purposes, a statement of absolute probability is more relevant than a ranking; for example, DNA evidence is accompanied by a probability of correct match, rather than a ranking. A further benefit of probability-based measures is that they can be combined with probability measures from other systems, e.g. audio or contextual cues.

In this work, the estimation of the p.d.f. and the ANMRR is performed in the space of distances between feature values, rather than in the space of the values themselves, because models built on distances between observations can be applied to unseen individuals in the future. The most compelling practical reason for this is that, for the surveillance user, it is only important to ascertain whether the query subject has the same identity as an element in the database; if it is different, then it is not so important which of the other identities it belongs to. In other words, the problem can be decomposed into a series of binary decisions, for which the distance space can be used. There is the additional benefit that several orders of magnitude more training data are available to build the models than would be the case if the distributions were modelled in the original feature space.

3 Input Features

The two categories of features used to represent the attributes of the objects for identity estimation are MPEG-7 color descriptors and spatial descriptors.

3.1 Color Descriptors

The MPEG-7 visual descriptors are low-level descriptors designed for use in a CBIR (Content-Based Image Retrieval) system [9].

The descriptors used in these experiments are Scalable Color and Color Structure, with a quantisation level of 256, as these were found to be the best features in terms of retrieval rate [1]. These visual descriptors require only a single image to produce their results. Each color descriptor has an associated procedure for determining a scalar distance measure, or match measure, between any two descriptions. A mean-color descriptor has also been used as a baseline in this study; its distance measure is simply the Euclidean distance.
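To make the match-measure machinery concrete, the following is a minimal sketch of the mean-color baseline and its Euclidean match measure (the Scalable Color and Color Structure descriptors and their distance measures are specified by the MPEG-7 standard [9] and are not reproduced here; function names are illustrative):

```python
import numpy as np

def mean_color(pixels):
    """Baseline descriptor: mean color over the segmented foreground pixels.
    pixels: (N, 3) array of RGB values from the object's foreground mask."""
    return np.asarray(pixels, dtype=np.float64).mean(axis=0)

def mean_color_match(d1, d2):
    """Match measure for the baseline descriptor: Euclidean distance."""
    return float(np.linalg.norm(d1 - d2))
```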

3.2 Spatial Features: Texture

The texture of an object describes its regularity, coarseness, and spatial structure. The MPEG-7 standard has several texture descriptors; here, we examine the use of the co-occurrence method for texture description [6]. In this method, the relative frequencies of grey-level pairs of pixels at certain displacements and orientations are used to generate the co-occurrence matrix. Essentially, this is a p.d.f. of the pixel intensities of the object. This matrix can then be used directly as a descriptor; alternatively, summarising statistics can be extracted. Although the co-occurrence method is not compared directly with the standard MPEG-7 texture descriptors here, one of its advantages is its potential compactness, especially when the summarising statistics are used.

The generation of the co-occurrence matrix is as follows [12]. The input image or object is first quantised into G grey levels. Quantisation is performed both to reduce the dimensionality of the problem and to ensure the co-occurrence matrix is not sparse. The co-occurrences of different grey levels, at different distances d and different orientations, are then counted to give a G × G co-occurrence matrix P. The possible orientations include 0°, 45°, 90° and 135°, as well as an "all-direction" orientation where the eight neighbours of a pixel are used. In this paper, d = 1. The distance between two co-occurrence matrices, or essentially two discrete p.d.f.s, P^1 and P^2, can be computed using the Bhattacharyya metric [7] or the Kullback-Leibler (KL) divergence measure [8]:

d_{Bhat} = \sqrt{ 1 - \sum_i \sum_j \sqrt{ p^1_{ij} \, p^2_{ij} } }    (1)

d_{KL} = \sum_i \sum_j p^1_{ij} \log_2 \frac{p^1_{ij}}{p^2_{ij}}    (2)

where p^1_{ij} is the entry in the i-th row and j-th column of the matrix P^1. Haralick [6] suggested 14 features, or summarising statistics, that can be computed to describe each two-dimensional p.d.f. The entropy

ENT = - \sum_{i=0}^{G-1} \sum_{j=0}^{G-1} p_{ij} \log p_{ij}

has been found to be appropriate in this application. The difference between two summarising statistics is simply their absolute difference.
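The following sketch illustrates the all-direction co-occurrence matrix at d = 1 together with the three quantities defined above (Eqs. 1 and 2 and the entropy feature). The number of grey levels G = 16 and the small-epsilon smoothing of empty bins in the KL divergence are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def cooccurrence_matrix(img, G=16):
    """All-direction grey-level co-occurrence matrix at displacement d = 1,
    normalised to a discrete p.d.f. over G x G grey-level pairs.
    img: 2-D uint8 greyscale image of the segmented object."""
    q = (img.astype(np.uint32) * G) // 256      # quantise to G grey levels
    P = np.zeros((G, G), dtype=np.float64)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]  # the eight neighbours
    H, W = q.shape
    for dy, dx in offsets:
        # Pair each pixel with its neighbour at offset (dy, dx).
        a = q[max(0, dy):H + min(0, dy), max(0, dx):W + min(0, dx)]
        b = q[max(0, -dy):H + min(0, -dy), max(0, -dx):W + min(0, -dx)]
        np.add.at(P, (a.ravel(), b.ravel()), 1.0)
    return P / P.sum()

def bhattacharyya(P1, P2):
    """Eq. (1): Bhattacharyya distance between two discrete p.d.f.s."""
    return float(np.sqrt(max(0.0, 1.0 - np.sqrt(P1 * P2).sum())))

def kl_divergence(P1, P2, eps=1e-12):
    """Eq. (2): KL divergence in bits, with epsilon smoothing of empty bins."""
    return float((P1 * np.log2((P1 + eps) / (P2 + eps))).sum())

def glcm_entropy(P, eps=1e-12):
    """Haralick entropy summarising statistic, ENT."""
    return float(-(P * np.log(P + eps)).sum())
```

Because the loop visits all eight symmetric offsets, the resulting matrix is symmetric, matching the "all-direction" orientation described above.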

Figure 1: Sample image with calibration guides.

Figure 2: px/cm ratio: calibration points are shown as crosses; interpolated data as a line (fit: y = 0.358*x + 38.22).

3.3 Spatial Feature: Height

The height of a person can also be used as a spatial descriptor. The height feature requires calibration of the camera: the ratio of pixels to metres as a function of the y-position in the image (Fig. 1). Assuming zero skew, unitary aspect ratio and a null roll angle, camera calibration is straightforward, requiring only a set of features in the image whose pixels-to-metres ratios are known; in this case, the vertical panels with a height of 1.22 m are used. The data points are shown in Fig. 2 as crosses. Due to the perspective projection, the pixels-to-metres ratio is directly proportional to y + c, where c is a constant denoting the difference between the vanishing point of the image (which lies outside the image) and the top of the image; the value of c is the intercept of the vertical axis in Fig. 2. These points are linearly interpolated to obtain a function, called ppm, that maps y to pixels-per-metre ratios. A person's height can then simply be calculated as h_m = h_px / ppm(y).

Obviously, an estimate of a person's height can only be obtained if the entire person is visible in the image. This is addressed by discarding the observations where the bottom part of the person is occluded or clipped. An "occlusion mask" has been created manually (Fig. 3), in which the white pixels correspond to potentially occluding objects. When the height feature is extracted, a check is made to see whether the target's bottom pixel touches an occluding object or lies on the last row of the image; if it does, the height feature is not available. It should be pointed out that the accuracy of the person's height is a function of the quality of the segmentation. The estimate of this uncertainty has not been incorporated into the methodology here.
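A sketch of the height feature under the calibration described above; the function names are illustrative, and the linear fit follows the calibration points of Fig. 2:

```python
import numpy as np

def fit_ppm(y_px, ratio):
    """Least-squares line through the calibration points (Fig. 2):
    pixels-per-unit-length ratio as a linear function of image row y."""
    slope, intercept = np.polyfit(y_px, ratio, 1)
    return lambda y: slope * y + intercept

def person_height(top_y, bottom_y, ppm):
    """h = h_px / ppm(y), evaluating the ratio at the person's lowest pixel."""
    return (bottom_y - top_y) / ppm(bottom_y)

def height_available(bottom_y, bottom_x, occlusion_mask, image_height):
    """Discard the height feature if the person's lowest pixel touches the
    occlusion mask (Fig. 3) or lies on the last row of the image."""
    return bottom_y < image_height - 1 and not occlusion_mask[bottom_y, bottom_x]
```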

Figure 3: Occlusions. Left: manually created occlusion mask (displayed inverted for clarity). Middle and right: examples where the observations are discarded.

4 Methodology

This section describes the procedure followed in the experimentation: the collection and preparation of the data, used by both the ANMRR and information-gain evaluation methods; the estimation of the underlying model from which the data are generated, used only by the information-based evaluation method; and finally the estimation of identity using the extracted features.

4.1 Data Collection and Preparation

A total of 47 people, randomly selected from an undergraduate programming class, were filmed by a camera A as they went in and out of a room, forming 94 separate video sequences {A_i^in, A_i^out : i = 1, ..., 47}. Six images (at approximately 1-second intervals) are then extracted from each A^in sequence, providing 47 × 6 = 282 sample images in the data set. Each sample image is segmented to produce a foreground mask of the detected object, from which the features described in Section 3 are extracted. Fig. 4 shows an example of the 6 (superimposed) segmented images taken from view A^in; the left-hand side shows the same person from view A^out, which is the query image used in the evaluation by ANMRR in QBE.

Figure 4: Left: example query subject. Right: six sub-images comprising object data set.

4.2 Model Estimation

Essentially, the true p.d.f. of any feature or descriptor can only be modelled if the entire population is measured. Since this is not practicable, a subset of the population is measured and the p.d.f.(s) of the features are approximated. These estimated p.d.f.s, or models, are used in the information-based evaluation method.


If a single feature or descriptor is used, the distance measures between different instances of the same object constitute the true samples, while the distance measures between instances of different objects are the false samples. A one-dimensional p.d.f. can be estimated from training sets of true and false samples, the simplest model being a Gaussian. If more than one feature or descriptor is used, a parametric model is more appropriate; here, a multi-dimensional Gaussian mixture model (GMM) is used. Each of the GMMs is estimated by the Expectation-Maximisation (EM) algorithm using Netlab [11].


In this experiment, for each of the true and false models, half the samples are used for training, with the remaining half used for testing, i.e. in the evaluation of Eq. 4, which is discussed in Section 4.4. This approach to partitioning the data is a form of "soft partitioning": although the samples in each set are selected at random, observations belonging to the same subject are not required to fall exclusively in one of the two sets.
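A sketch of this model-estimation step follows. The original work used Netlab's EM implementation [11]; scikit-learn's GaussianMixture is substituted here as a stand-in, and the number of mixture components is an illustrative assumption:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def distance_samples(features, ids, dist_fns):
    """Build multi-dimensional distance vectors between all observation pairs.
    features: per-observation tuples of descriptors; ids: subject labels;
    dist_fns: one match-measure function per descriptor. Returns the true
    samples (same identity) and false samples (different identity)."""
    true_x, false_x = [], []
    n = len(features)
    for i in range(n):
        for j in range(i + 1, n):
            x = [d(features[i][k], features[j][k])
                 for k, d in enumerate(dist_fns)]
            (true_x if ids[i] == ids[j] else false_x).append(x)
    return np.array(true_x), np.array(false_x)

def fit_gmm(samples, n_components=3):
    """Estimate a GMM over distance vectors by EM; half of each sample set
    is used for training, the held-out half for testing (soft partitioning)."""
    return GaussianMixture(n_components=n_components,
                           covariance_type='full').fit(samples)
```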

4.3 Identity Estimation

Ultimately, the aim here is to retrieve the identity of the person from the database of people. This section discusses the two evaluation methods by which the identity of the subject is estimated: the ANMRR and an information-theoretic approach.

4.3.1 Evaluation by ANMRR

The ANMRR metric is widely used to evaluate the performance of retrieval systems [9, 13]. The purpose of the metric is to allow an evaluation of different descriptors that is unbiased with respect to different sample and ground-truth sizes, and that correlates well with perceptual judgment about the retrieval success rate [10]. Scores are based upon the rank of results, not their values. The rank of each correctly retrieved datum is counted, and penalties are issued if any of the items comes after a threshold K; the same penalty applies to all items after K, regardless of how low-ranking the item is. The size of the ground-truth set determines the rank at which the threshold is placed; a rule of thumb is for K to be twice the number of correct items in the data set [9]. Each retrieval operation z is assigned an NMRR, the Normalised Modified Retrieval Rank:

NMRR(z) = \frac{MRR(z)}{1.25 \cdot K(z) - 0.5 \cdot [1 + n_1(z)]}    (3)

where K is the relevant rank mark (or threshold), n_1 is the number of correct data elements, and z is the query. This is averaged over all retrieval operations in the set to obtain the ANMRR.
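A sketch of the NMRR and ANMRR computation. Eq. 3 specifies the normalisation; the modified retrieval rank MRR(z) is computed here following the standard MPEG-7 definition [9], in which every correct item ranked beyond the threshold K receives the same penalty of 1.25K:

```python
import numpy as np

def nmrr(ranked_ids, query_id, K=None):
    """Normalised Modified Retrieval Rank (Eq. 3) for one retrieval operation.
    ranked_ids: database identities sorted by ascending match distance."""
    ranks = [r + 1 for r, rid in enumerate(ranked_ids) if rid == query_id]
    n1 = len(ranks)                           # number of correct data elements
    if K is None:
        K = 2 * n1                            # rule-of-thumb threshold [9]
    # Items ranked beyond K all receive the same penalty of 1.25 * K.
    penalised = [r if r <= K else 1.25 * K for r in ranks]
    avr = sum(penalised) / n1                 # average (penalised) rank
    mrr = avr - 0.5 * (1 + n1)                # modified retrieval rank
    return mrr / (1.25 * K - 0.5 * (1 + n1))  # Eq. (3)

def anmrr(retrievals):
    """Average NMRR over a set of (ranked_ids, query_id) retrieval operations."""
    return float(np.mean([nmrr(r, q) for r, q in retrievals]))
```

With this definition, a perfect retrieval gives NMRR = 0 and a retrieval in which every correct item falls beyond K gives NMRR = 1.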

4.4 Evaluation by Information Gain

Two sets of experiments are performed here. In the first set, the experiments are arranged to ensure that exactly one of the n elements in the data set has the same identity as the query subject and that, a priori, each element has an equal probability of being the correct match. In the second set, three of the elements in the data set have the same identity as the query subject (this is possible given that there are 6 observations per subject); this provides a sufficient ground-truth size for an ANMRR measure to be computed for comparison.

To measure the information provided by one or more descriptors about the identity of a query subject, expressions must be found for the uncertainty of the state before and after the match measures with the data set elements are observed: the information added is equal to their difference. The identity of the query subject can be written as a discrete random variable Z that can assume values between z_1 and z_n. An n-dimensional vector x represents the match measures between the query and each element in the data set, using an arbitrary feature or descriptor; it has a corresponding continuous random variable X. The information gained through observation of the descriptor(s) can be written [3] as

I(X, Z) = \sum_i \int_x p(x, z_i) \log \frac{p(x, z_i)}{p(x) p(z_i)} \, dx    (4)

        \approx E \left[ \log \frac{p(x, z_i)}{p(x) p(z_i)} \right]    (5)

where E[·] is the expectation operator, ranging over the expected joint input of x and z_i. This expression can be evaluated using the two p.d.f.s p(x_i | y_i), where y_i = 1 if Z = z_i (i.e. a correct match) and y_i = 0 (incorrect) otherwise; the estimation of these p.d.f.s is discussed in Section 4.2. The two terms in Eq. 4 can be approximately expressed in terms of these p.d.f.s:

p(x) = p(x_1, \ldots, x_n) \approx \prod_i p(x_i)    (6)

     = \prod_i \left( p(x_i, 1) + p(x_i, 0) \right)    (7)

     = \prod_i \left( \frac{1}{n} p(x_i | 1) + \frac{n-1}{n} p(x_i | 0) \right)    (8)

provided that n is not too small, and similarly

p(x, z_i) \approx \frac{1}{n} p(x_i | 1) \prod_{j \neq i} \frac{n-1}{n} p(x_j | 0)    (9)
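A sketch of the numerical evaluation of Eqs. 4 to 9: given the true- and false-match models and held-out test samples, the expectation in Eq. 5 is approximated by averaging over randomly assembled query events. The function names and the trial count are illustrative:

```python
import numpy as np

def information_gain(log_p1, log_p0, true_x, false_x, n, trials=1000, rng=None):
    """Monte-Carlo estimate of I(X, Z), Eqs. (4)-(9), in nats.
    log_p1 / log_p0: log-density functions of the true / false match models
    (e.g. GaussianMixture.score_samples); true_x / false_x: held-out test
    samples; n: number of elements in the data set."""
    if rng is None:
        rng = np.random.default_rng(0)
    gains = []
    for _ in range(trials):
        # One correct element and n-1 incorrect ones, drawn from the test sets.
        xi = true_x[rng.integers(len(true_x))]
        xj = false_x[rng.integers(len(false_x), size=n - 1)]
        x = np.vstack([xi, xj])
        l1, l0 = log_p1(x), log_p0(x)          # per-element log densities
        # log p(x, z_i), Eq. (9), with the correct element in position 0.
        log_joint = (np.log(1.0 / n) + l1[0]
                     + np.sum(np.log((n - 1.0) / n) + l0[1:]))
        # log p(x), Eq. (8).
        log_px = np.sum(np.logaddexp(np.log(1.0 / n) + l1,
                                     np.log((n - 1.0) / n) + l0))
        gains.append(log_joint - log_px - np.log(1.0 / n))
    return float(np.mean(gains))
```

Working in log space with logaddexp avoids numerical underflow of the products in Eqs. 8 and 9 when n is large.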

Examples of p(x_i | 0) and p(x_i | 1) for two color descriptors are shown in Fig. 5. These are the one-dimensional p.d.f.s for each descriptor; higher-dimensional p.d.f.s represented by Gaussian mixture models are not shown here due to the difficulty of visualising them. It should be noted that in the experiments, model estimation from the training data and the calculation of Eq. 4 are performed a number of times (10 in this experiment), and the average of the resulting I(X, Z) is reported.

Figure 5: p.d.f.s for correct (1) and incorrect (0) matches, as a function of the distance measure (x), for two types of MPEG-7 color descriptor.

Figure 6: Information gain of individual features with the validity mask.

Figure 7: Information gain of combined features using a single Gaussian model.

5 Results

Individual features are first examined, to characterise the information gain provided by each of them separately. The validity mask used in conjunction with the height feature is extended to all the features, to ensure that the feature vectors are compatible in dimensionality when the Gaussian mixture models are estimated. The information gain of individual features with the validity mask is shown in Fig. 6: Scalable Color provides by far the best information gain when used separately.

Features are then combined, using a single Gaussian as the model, and the information gain provided by different combinations is computed. The results are shown in Fig. 7: using both MPEG-7 color descriptors, in conjunction with the height as well as the Bhattacharyya distance between all-direction co-occurrence matrices, provides the highest information gain.

The single Gaussian model is further improved by the use of GMMs; the results are shown in Fig. 8. Combinations of the MPEG-7 color descriptors, the Bhattacharyya distance and KL divergence between all-direction co-occurrence matrices, and the height are examined. The results are very similar between different combinations (and are therefore not labelled on the figure), with the combination of the MPEG-7 color descriptors, the height and the KL divergence between all-direction co-occurrence matrices being the best.

Figure 8: Information gain of combined features using a GMM.

In terms of reducing uncertainty, consider the case with 45 subjects in the database. Assuming each subject is, a priori, an equally likely match, the initial uncertainty is log(45) ≈ 3.8 nats, or 5.5 bits. The best combination of features removes approximately 1.9 nats (2.7 bits) of entropy. The residual uncertainty of about 2 nats (2.9 bits) is equivalent to an equal choice between e^2 ≈ 7.4 people, though in practice the posterior estimates will have an uneven distribution over the subjects.

The evaluation using information gain is converted to an ANMRR metric, and the results are shown in Fig. 9. The ordering of the performance here is almost identical to that provided by the information gain (Fig. 8), illustrating that the probabilistic and ranking evaluation paradigms are highly consistent in measuring the effectiveness of a given feature set. Evaluation using ANMRR in a QBE application, with the MPEG-7 color descriptors fused using the addition, multiplication and minimum operations on a database size of 45, gives slightly better results; these points are superimposed on Fig. 9.

Figure 9: ANMRR metrics computed from information-based results, for the same set of feature and descriptor combinations as in Fig. 8; the ANMRR values calculated from QBE (query by example) are superimposed.

6 Conclusion and Further Work

In this paper, we examined the use of both standard MPEG-7 descriptors and other spatial features for providing a unique identification of subjects in a database. Various combinations of these features and descriptors were investigated, and both a single Gaussian and a Gaussian mixture model were employed as methods for combining them. It was found that using the GMM, a combination of MPEG-7 color descriptors, height and the KL divergence between all-direction co-occurrence matrices provided the highest information gain across all database sizes. The evaluation by information gain was also converted into the ANMRR metric, and the two have a direct relationship, linking the probabilistic approach to the ranking method favoured in many practical surveillance systems.

From the results, it is observed that the Scalable Color descriptor provides most of the information gain, while the other features provide much less. It would be worthwhile to perform a PCA (Principal Component Analysis) on the data to investigate the correlation within the data set. The comparison of the co-occurrence method with the standard MPEG-7 texture descriptors would also be of interest.

Acknowledgements

Funded under the GENERICK project (Faraday Imaging Partnership) and the CARETAKER project (European Union IST 4-027231). The authors would like to thank J.P. Renno for providing the segmented data.

References

[1] J. Annesley, J. Orwell and J.R. Renno, "Evaluation of MPEG-7 Color Descriptors for Visual Surveillance Retrieval", in Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance (VS-PETS), Beijing, China, 2005.

[2] W.P. Berriss, W.G. Price and M.Z. Bober, "The use of MPEG-7 for intelligent analysis and retrieval in video surveillance", in IEE Symposium on Intelligent Distributed Surveillance Systems, 8/1, UK, 2003.

[3] T.M. Cover and J.A. Thomas, Elements of Information Theory, John Wiley and Sons, Inc., 1991.

[4] M. Everingham and A. Zisserman, "Identifying Individuals in Video by Combining 'Generative' and Discriminative Head Models", in ICCV 2005, 1103-1110, 2005.

[5] M. Hahnel, D. Klunder and K.F. Kraiss, "Color and texture features for person recognition", in IEEE Int. Joint Conf. Neural Networks, 1, 652, 2004.

[6] R. Haralick, K. Shanmugam and I. Dinstein, "Textural Features for Image Classification", IEEE Trans. Systems, Man, Cybernetics, 3, 610-621, Nov. 1973.

[7] T. Kailath, "The divergence and Bhattacharyya distance measures in signal selection", IEEE Trans. Commun. Tech., vol. 15, pp. 52-60, 1967.

[8] http://mathworld.wolfram.com/RelativeEntropy.html

[9] B.S. Manjunath, P. Salembier and T. Sikora, Introduction to MPEG-7, John Wiley and Sons, Ltd., 2002.

[10] P. Ndjiki-Nya, J. Restat, T. Meiers, J.R. Ohm, A. Seyferth and R. Sniehotta, "Subjective evaluation of the MPEG-7 retrieval accuracy measure - ANMRR", Technical Report M6029, 2000.

[11] http://www.ncrg.aston.ac.uk/netlab/index.php

[12] T. Randen and J. Hakon Husoy, "Filtering for Texture Classification: A Comparative Study", IEEE Trans. Pat. Anal. Machine Intel., vol. 21, no. 4, pp. 291-310, April 1999.

[13] K. Wong and L. Po, "MPEG-7 dominant color descriptor based relevance feedback using merged palette histogram", in IEEE Int. Conf. Acous., Speech and Sig. Proc., 3, 433, 2004.

[14] S. Yu, L. Wang, W. Hu and T. Tan, "Gait Analysis for Human Identification in Frequency Domain", in ICIG'04, 282-285, 2004.
