SHAPE-BASED DETECTION OF HUMANS FOR VIDEO ... - CiteSeerX

SHAPE-BASED DETECTION OF HUMANS FOR VIDEO SURVEILLANCE APPLICATIONS

Herbert Ramoser (1), Thomas Schlögl (1), Csaba Beleznai (1), Martin Winter (2), and Horst Bischof (3)

(1) Advanced Computer Vision GmbH, Vienna, Austria, {herbert.ramoser, thomas.schloegl, csaba.beleznai}@acv.ac.at
(2) Siemens AG Österreich, Programm- und Systementwicklung, Graz, Austria
(3) Institute for Computer Graphics and Vision, Univ. of Technology, Graz, Austria

ABSTRACT

In this paper we describe a surveillance system that is not only able to detect blobs and track them but also determines if a blob is a person. The given blob is segmented into sub-regions. A person model is fit to these regions such that a likelihood measure is maximized. The likelihood measure depends on the number of identified body parts, their length, location, and aspect ratio. The method is translation, rotation, and scaling invariant and computationally efficient. The results obtained for test video sequences are very encouraging.

1. INTRODUCTION

Detecting the presence of humans is important, for example, in safe robot navigation and automatic video surveillance. In this paper we focus on video surveillance applications with stationary cameras, where human detection is generally used as an assistive technology for the system operator. A computer-based surveillance system must be able to reliably detect a possible intruder and alert the system operator. The method should be robust with respect to the wide range of appearances of persons (e.g., due to clothing), illumination conditions, and background scenes.

The detection algorithm is integrated into a surveillance system which is currently under development [1]. The system detects foreground pixels using a model of the background scene. The foreground pixels are grouped into blobs which are used for further analysis. All non-foreground pixels are used to update the background model. In a second stage, a basic calibration of the camera allows every blob to be classified by its size as either a (potential) single person or another object (e.g., a group of persons or a car). All single person sized blobs are subjected to further analysis by the algorithm presented in this paper.

The remainder of the paper is structured as follows. Section 1.1 gives a brief overview of related literature. Section 2 describes the shape segmentation and model fitting algorithm in detail. Section 3 presents some encouraging results on test video sequences. Finally, Section 4 concludes with a summary of advantages and limitations and suggestions for further improvements.

1.1. Related Work

The detection of humans in images or video sequences has attracted growing attention from several research groups. Three distinct detection methods can be distinguished: shape based, color and texture based, and motion based. Shape based analysis is the most widely used approach. Shape based methods include simple blob area measures [2], projection histograms [3, 4], clustering of statistical shape descriptors [5], Fourier and wavelet silhouette descriptors [6, 7], and fitting of a human model to the blob [8, 9, 10]. Few methods use color and texture features in order to detect humans [11, 12]. All methods mentioned so far operate on static images. An alternative is to use the unique motion pattern of a walking human for classification [13, 14].

Most of the published methods have severe problems in detecting clothed or partially occluded humans. The most promising exception is described in [10]: a human model is fit to the image blob such that a likelihood measure is maximized. Varying clothing and occlusions are allowed by the dynamic model assembly. In this paper we present an improved version of this algorithm which increases the processing speed by detecting the body parts in a single iteration and by a modified calculation of the model likelihood.

This work has been carried out within the K plus Competence Center ADVANCED COMPUTER VISION. This work was funded from the K plus Program.

2. METHODS

2.1. System Architecture

The human detection algorithm is integrated into a surveillance system which is currently under development [1]. The analysis performed by the system is outlined in Fig. 1. The basic steps are:

• Frame acquisition: Frames are captured at regular time intervals, typically at a rate of 6 to 12 frames per second.

Fig. 1. Principal steps of the motion analysis and human detection process.

• Motion detection: Intensity profile analysis is used to classify every pixel of a single frame as moving, stationary or background. Standard adaptive background and threshold models are computed on RGB-color images.
• Connected component analysis: All pixels which are classified as stationary or moving are combined into blobs based on their connectedness.
• Blob filtering: All blobs obtained from the preceding step are analyzed based on the local scaling at the image position of the base line and the size of the bounding box. Blobs which are too small or too large to contain a single person are removed from further analysis.
• Model fitting: Every blob is split into sub-regions which are considered candidate body parts. In order to avoid any dependency on clothing color and texture the algorithm is solely based on the blob contour. In a second stage a human model is fit to these parts. The wide range of human appearances is covered by providing several models for front/side view and different clothing. The model which gives the highest likelihood is finally selected. In order to keep the model fitting stage computationally tractable we require that the trunk is part of the blob. Furthermore, we require that the pixel aspect ratio is known (without loss of generality we assume square pixels).
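The connected component analysis and blob filtering steps listed above can be illustrated with a minimal Python sketch. It is not the system's implementation: the foreground mask is assumed to come from the motion detection step, 8-connectivity is an assumption, and `is_person_sized`, its `expected_height` callback (standing in for the camera calibration at the blob's base line), and the `tolerance` parameter are hypothetical names introduced only for illustration.

```python
from collections import deque

def connected_components(mask):
    """Group truthy (foreground) pixels of a 2-D mask into 8-connected blobs.

    Returns a list of blobs; each blob is a list of (row, col) pixels.
    """
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    blobs = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited foreground pixel.
            queue, blob = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                blob.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            blobs.append(blob)
    return blobs

def bounding_box(blob):
    """Return (top, left, bottom, right) of a blob, all inclusive."""
    ys = [p[0] for p in blob]
    xs = [p[1] for p in blob]
    return min(ys), min(xs), max(ys), max(xs)

def is_person_sized(blob, expected_height, tolerance=0.5):
    """Hypothetical size test: compare the bounding-box height with the
    person height (in pixels) expected at the blob's base line."""
    top, _, bottom, _ = bounding_box(blob)
    height = bottom - top + 1
    expected = expected_height(bottom)
    return abs(height - expected) <= tolerance * expected

# Toy mask: a 4-pixel-high column and an isolated noise pixel.
mask = [
    [0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 0],
]
blobs = connected_components(mask)
person_blobs = [b for b in blobs if is_person_sized(b, lambda base_row: 4.0)]
```

In this toy example the tall column passes the size test while the single noise pixel is removed from further analysis.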


Fig. 2. Steps performed during blob decomposition. (a) input image, (b) blob contour, (c) smoothed blob contour with marked NCMs, (d) possible region boundaries starting from NCMs, (e) final set of regions, (f) approximated regions (rectangles) and joints (dots).

2.2. Blob Decomposition

Blob Extraction. An inspection of the blobs determined by the surveillance system described above showed that the borders of these blobs are too fuzzy for shape based analysis. Therefore, every people sized blob is replaced by a new blob determined by background subtraction (the background model is maintained by the surveillance system).

Finding Blob Region Boundaries. An inspection of the human contour shows that a negative minimum of curvature (NCM) is a possible boundary point of a body part (e.g., armpit). This property can be used to derive a method to sensibly segment a blob into possible body parts. Note that while one end point of a body part boundary must be an NCM, the other end point may have positive curvature (e.g., hip). The following processing steps are performed to segment the blob into regions (for a sample decomposition see Fig. 2(a–e)):

• Determine NCMs: The blob contour is smoothed to reduce the effect of noisy blob edges. Every negative local minimum of the contour curvature is then used as a candidate boundary end point.
• Find possible boundaries: Every NCM can be the end point of at most two boundaries. For every half of the contour (measured from the NCM) the shortest cut is stored as a possible boundary.
• Fuse close cuts: If two cuts have close end points the cuts are combined.
• Remove contradicting cuts: Some of the cuts intersect. If this is the case the longer cut is deleted.
• Remove over-segmentations: A cut is deleted if the two regions which are separated by the cut have identical orientation.
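The "Determine NCMs" step can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: the contour is taken to be a closed polygon ordered so that convex turns are positive (counter-clockwise with the y axis pointing up), the signed turning angle at each vertex serves as a discrete stand-in for curvature, and smoothing is a simple circular moving average. The function names and the `half_width` parameter are hypothetical.

```python
import math

def smooth_contour(points, half_width):
    """Circular moving average over 2 * half_width + 1 contour points."""
    n = len(points)
    size = 2 * half_width + 1
    smoothed = []
    for i in range(n):
        xs = sum(points[(i + k) % n][0] for k in range(-half_width, half_width + 1))
        ys = sum(points[(i + k) % n][1] for k in range(-half_width, half_width + 1))
        smoothed.append((xs / size, ys / size))
    return smoothed

def turning_angles(points):
    """Signed turning angle at every vertex of a closed contour.

    Positive values mark convex vertices, negative values concave ones
    (assuming the ordering convention stated in the lead-in).
    """
    n = len(points)
    angles = []
    for i in range(n):
        px, py = points[i - 1]
        cx, cy = points[i]
        nx, ny = points[(i + 1) % n]
        ax, ay = cx - px, cy - py   # incoming edge
        bx, by = nx - cx, ny - cy   # outgoing edge
        angles.append(math.atan2(ax * by - ay * bx, ax * bx + ay * by))
    return angles

def negative_curvature_minima(points, half_width=0):
    """Indices of vertices whose turning angle is negative and a local minimum."""
    angles = turning_angles(smooth_contour(points, half_width))
    n = len(angles)
    return [i for i in range(n)
            if angles[i] < 0
            and angles[i] <= angles[i - 1]
            and angles[i] <= angles[(i + 1) % n]]

# L-shaped contour: the only concave corner is the vertex (1, 1) at index 3.
l_shape = [(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]
ncm_indices = negative_curvature_minima(l_shape)
```

For the L-shaped test contour, the single concave corner is the only candidate boundary end point, which mirrors how an armpit would be picked up on a human silhouette.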

Fig. 3. Hierarchy of body parts. Arrows indicate an (optionally) attached part, dashed arrows represent possible decompositions into sub-parts. For simplicity decompositions are shown only once. In addition to the labels shown a region can also be labeled as an outlier.

Blob Region Representation. In order to ease subsequent analysis it is necessary to use a simplified representation of the blob regions. Each region is represented as a rectangle with the same size and orientation as the region polygon. The location of a joint is the center of the corresponding cut. See Fig. 2(f) for an example.

2.3. Model Fitting

Person Model. The body model used for analysis is shown in Fig. 3. Three different trunk lengths are used to allow a variety of clothing (e.g., skirt, coat, etc.). After the model fitting step every region of the blob is either assigned a body part label according to the model or labeled as an outlier.

Goodness Function. The result of the blob decomposition procedure is a set of candidate body parts C (i.e., blob regions) which are connected by joints. The goal of the model fitting step is to assign a label to each region (i.e., a hypothesis H) such that the probability that the labeled blob resembles a human, P(H, C | Person), is maximized. Using Bayes' theorem it can be shown that this maximization is identical to the maximization of the goodness function

$$G(H) = P(C \mid H, \mathrm{Person})\, P(\mathrm{Person} \mid H)$$

P(C | H, Person) is the degree of resemblance between the candidate body parts and the corresponding person model. In our model it depends on the normalized length and aspect ratio of the body parts and the location of the joints with respect to the trunk. P(Person | H) is determined by the number of identified body parts and their relevance to the model. The calculation of these probabilities is explained in the following paragraphs.

Model Conformity Measure. The resemblance of a labeled region i with the person model is calculated as

$$P(C^{(i)} \mid H, \mathrm{Person}) = P(J^{(i)} \mid H_J)\, P(L^{(i)} \mid H_L)\, P(A^{(i)} \mid H_A)$$

where P(J^(i) | H_J) is the probability of the observed joint location (normalized with respect to the trunk), P(L^(i) | H_L) is the probability of the body part length (normalized with respect to the trunk), and P(A^(i) | H_A) is the probability of the observed aspect ratio. The joint location, length, and aspect ratio probabilities are modeled as normal distributions (different parameters are used for front and side view of a body). The normalization ensures that the observations are scale, translation, and rotation invariant. The mean and standard deviation of the distributions are derived from an anthropometric database [15]. The individual probabilities of the blob regions are combined to give

$$P(C \mid H, \mathrm{Person}) = \min_i P(C^{(i)} \mid H, \mathrm{Person}) \tag{1}$$

where the overall probability is determined by the least fitting body part. The use of the minimum allows an efficient model fitting algorithm and enforces that poorly matching parts are labeled as outliers.

Blob Likelihood Measure. The overall likelihood of the labeled blob is

$$P(\mathrm{Person} \mid H) = \sum_i W(H^{(i)})$$

where W(H^(i)) is the relevance of body part H^(i). Body parts which could not be decomposed into sub-parts are penalized. For example, a single region labeled as arm contributes less to P(Person | H) than the summed contribution of upper arm, forearm, and hand. When all body parts are detected P(Person | H) is one.

Likelihood Maximization. The goal of the model fitting algorithm is to label every region of the blob such that the goodness function G(H) is maximized. This is achieved by an exhaustive search, i.e., every possible hypothesis H is tested. The requirement that the trunk is visible makes this approach computationally feasible. Every region of the blob is considered a possible trunk. All permutations of region labels (i.e., hypotheses) are tested for the neighboring regions of the candidate trunk. First the view (i.e., front or side) is determined using the hypothesis labels of all regions connected to the trunk and the corresponding location of the joints. Then the greedy algorithm analyzes one neighboring region at a time. When the use of the hypothesis region label increases G(H) the label is retained, otherwise the region is labeled as an outlier. Note that when it is possible to decompose a part into sub-parts (e.g., arm) all decompositions must be tested as well. This procedure is repeated for every trunk candidate and finally the hypothesis H (i.e., labeling) with maximum G(H) is selected.
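How the goodness function combines the model conformity measure (minimum over parts) with the blob likelihood measure (sum of relevance weights) can be shown with a small Python sketch. Several points are assumptions made only for illustration: each normal distribution is evaluated as an unnormalized Gaussian so that every factor stays in (0, 1], the joint location is reduced to a single scalar, and all means, standard deviations, and weights in `PERSON_MODEL` are invented placeholders, not values from the anthropometric database. Arms and sub-part decompositions are omitted.

```python
import math

def gaussian_score(value, mean, std):
    """Unnormalized Gaussian in (0, 1]; equals 1 when value == mean (assumption)."""
    z = (value - mean) / std
    return math.exp(-0.5 * z * z)

# Hypothetical (mean, std) pairs for trunk-normalized joint location, length,
# and aspect ratio, plus a relevance weight W per label. Illustrative only.
PERSON_MODEL = {
    "trunk": {"joint": (0.0, 0.10), "length": (1.0, 0.10), "aspect": (0.5, 0.15), "weight": 0.4},
    "head":  {"joint": (0.0, 0.10), "length": (0.4, 0.10), "aspect": (0.8, 0.20), "weight": 0.2},
    "leg":   {"joint": (1.0, 0.10), "length": (1.4, 0.20), "aspect": (0.2, 0.10), "weight": 0.2},
}

def part_conformity(label, joint, length, aspect):
    """P(C_i | H, Person) as the product of joint, length, and aspect scores."""
    model = PERSON_MODEL[label]
    return (gaussian_score(joint, *model["joint"])
            * gaussian_score(length, *model["length"])
            * gaussian_score(aspect, *model["aspect"]))

def goodness(labeled_regions):
    """G(H) = min_i P(C_i | H, Person) * sum_i W(H_i).

    labeled_regions is a non-empty list of (label, joint, length, aspect);
    regions labeled as outliers are simply left out by the caller.
    """
    conformity = min(part_conformity(*region) for region in labeled_regions)
    likelihood = sum(PERSON_MODEL[region[0]]["weight"] for region in labeled_regions)
    return conformity * likelihood

# A trunk and a head that match the model exactly ...
good_fit = [("trunk", 0.0, 1.0, 0.5), ("head", 0.0, 0.4, 0.8)]
# ... and the same blob with a region labeled "leg" whose aspect ratio is far off.
poor_fit = good_fit + [("leg", 1.0, 1.4, 0.6)]
g_good = goodness(good_fit)
g_poor = goodness(poor_fit)
```

Because of the minimum, one badly matching part pulls the whole score down even though it adds weight to the sum, which is why such a region ends up labeled as an outlier instead.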


Fig. 4. Labeling of three sample blobs. The goodness function G(H) for the models is (a) 0.37, (b) 0.25, and (c) 0.28. The height of the images is 46, 54, and 38 pixels.
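The greedy labeling step described under Likelihood Maximization can be sketched for a single trunk candidate as follows. This is a simplified illustration, not the authors' implementation: it enumerates label hypotheses with `itertools.product` (so a label may be used for more than one region), and it leaves out the view determination, the sub-part decompositions, and the outer loop over all trunk candidates. The callbacks `conformity` and `weight` and all numeric values are hypothetical.

```python
from itertools import product

def goodness(conformities, weights):
    """G(H): least fitting part times the summed relevance of the labeled parts."""
    return min(conformities) * sum(weights)

def greedy_fit(trunk, neighbours, candidate_labels, conformity, weight):
    """Label the neighbours of one trunk candidate.

    conformity(region, label) returns a value in [0, 1]; weight(label) returns
    the relevance W of a label. Returns (best G(H), {region: label}).
    """
    best_score, best_labels = -1.0, None
    for hypothesis in product(candidate_labels, repeat=len(neighbours)):
        confs = [conformity(trunk, "trunk")]
        weights = [weight("trunk")]
        labels = {trunk: "trunk"}
        score = goodness(confs, weights)
        # Greedy pass: keep a hypothesized label only if it increases G(H).
        for region, label in zip(neighbours, hypothesis):
            trial = goodness(confs + [conformity(region, label)],
                             weights + [weight(label)])
            if trial > score:
                confs.append(conformity(region, label))
                weights.append(weight(label))
                labels[region] = label
                score = trial
            else:
                labels[region] = "outlier"
        if score > best_score:
            best_score, best_labels = score, labels
    return best_score, best_labels

# Toy example with invented conformity values for regions "t", "a", and "b".
CONF = {("t", "trunk"): 0.9,
        ("a", "head"): 0.8, ("a", "leg"): 0.2,
        ("b", "head"): 0.1, ("b", "leg"): 0.7}
W = {"trunk": 0.4, "head": 0.2, "leg": 0.2}

best_g, best_labeling = greedy_fit(
    "t", ["a", "b"], ["head", "leg"],
    conformity=lambda region, label: CONF[(region, label)],
    weight=lambda label: W[label],
)
```

In a full system the same routine would be run once per trunk candidate and the labeling with the largest G(H) kept.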

Fig. 5. Sample frames taken from the two test sequences: (a) Video A, (b) Video B.

3. RESULTS

Region Labeling. Figure 4 shows results obtained for three test blobs. For better visualization the corresponding input image is shown in the background. The algorithm determines correct region labels for Fig. 4(a), where the shadow of one leg is correctly labeled as an outlier while all other regions of the blob get the label of the correct underlying body part. The blob detection algorithm partly fails for Fig. 4(b) and does not correctly separate the legs. However, the labeling of the blob is still correct. An example where the algorithm fails is shown in Fig. 4(c). This failure can be attributed mostly to the poor blob detection (the shadow is considered part of the blob). Furthermore, we designed the algorithm to be rotation invariant, which makes it possible that the bottom region is labeled as head. This invariance can be easily removed and in this case the blob would be labeled correctly.

Video Sequences. The performance of the algorithm was tested on two video sequences. Sample screen shots of the two sequences are given in Fig. 5. Test video A (see Fig. 5(a)) shows a crowded indoor scene with many overlapping and relatively small persons. Test video B (Fig. 5(b)) contains mostly non-overlapping persons. The resolution of both sequences is 377x291 pixels. Table 1 summarizes the properties of the test video sequences. The people sized blobs were analyzed with our algorithm. Note that people sized blobs with a height < 10 pixels were not analyzed.

Table 1. Properties of the two test video sequences. People sized and other (i.e., larger and smaller) blobs were detected by the surveillance system. The last column gives the number of people sized blobs which were too small for shape based analysis.

           # frames   # blobs:
                      people sized   other   rejected
Video A    500        1007           1218    62
Video B    500        923            192     24

The distribution of the goodness function G(H) obtained for the test sequences is shown in Fig. 6. For most of the blobs G(H) is below 0.25. In order to achieve a G(H) close to one it would be necessary that nearly all body parts are identified and that the identified parts perfectly fit the body model. For both test sequences the size of the blobs makes it impossible to identify all parts. This explanation is supported by the fact that the increased blob size of video B (which allows the identification of more parts) causes a shift of the distribution towards higher values. The peak in the distribution at about 0.1 is caused by blobs which are not split into sub-regions. The trunk is the only body part which is fit to such a blob, which results in a G(H) of approximately 0.1.

Fig. 6. Distribution of the goodness function G(H) obtained for all blobs of the two test sequences. (Panels: Video A and Video B; horizontal axis: likelihood measure G(H); vertical axis: blob count.)

Performance. The algorithm was implemented in C++ and executed on a 700 MHz Intel Pentium III processor. The average time required for the analysis of one blob in the test sequences is about 15 ms.

4. DISCUSSION

This paper describes a method for shape based detection of persons in video streams. The given blob is segmented into regions and a human model is fit to these regions such that a goodness function is maximized. The method is rotation, translation, and scale invariant. The analysis is computationally efficient and can be performed in near real time for a small number of blobs.

This modeling approach depends on the visibility of body parts. The number of body parts which can be found depends on the size of the blob. Therefore, a threshold for the classification of humans should depend not only on the value of the goodness function but also on the size of the blob. The selection of this threshold requires further analysis on a large database which should also contain blobs of non-human objects.

The results obtained on the test sequences are encouraging. The method is robust with respect to noise caused by poor blob detection or environmental influences (e.g., shadows). There are some possible improvements which require further analysis:

• The quality of the blob is of crucial importance to the subsequent segmentation and labeling procedures. Evaluating various extraction methods and, for example, using shadow invariant measures may improve the results.
• The use of additional cues such as edge and color information of the input image may improve the segmentation of the blob.

5. REFERENCES

[1] T. Schlögl, B. Wachmann, H. Bischof, and W. Kropatsch, “People counting in complex scenarios,” in Proc. ÖAGM/AAPR, 2002, pp. 159–166.
[2] J. A. Freer, B. J. Beggs, H. L. Fernandez-Canque, F. Chevrier, and A. Goryashko, “Automatic intruder detection incorporating intelligent scene monitoring with video surveillance,” in Proc. Eur. Conf. on Security and Detection, 1997, pp. 109–113.
[3] I. Haritaoglu, R. Cutler, D. Harwood, and L. S. Davis, “Backpack: detection of people carrying objects using silhouettes,” in Proc. ICCV, 1999, vol. 1, pp. 102–107.
[4] Y. Kuno, T. Watanabe, Y. Shimosakoda, and S. Nakagawa, “Automated detection of human for visual surveillance system,” in Proc. ICPR, 1996, vol. 3, pp. 865–869.
[5] A. N. Rajagopalan, P. Burlina, and R. Chellappa, “Detection of people in images,” in Proc. IJCNN, 1999, vol. 4, pp. 2747–2752.
[6] C. Cenker, “Wavelet contour classification,” in Proc. ÖAGM/AAPR, 1996, pp. 263–272.
[7] R. Díaz de León and L. E. Sucar, “Human silhouette recognition with Fourier descriptors,” in Proc. ICPR, 2000, vol. 3, pp. 709–712.
[8] S. Ioffe and D. A. Forsyth, “Probabilistic methods for finding people,” Int. Journal of Computer Vision, vol. 43, no. 1, pp. 45–68, 2001.
[9] S. X. Ju, M. J. Black, and Y. Yacoob, “Cardboard people: a parameterized model of articulated image motion,” in Proc. Int. Conf. on Automatic Face and Gesture Recognition, 1996, pp. 38–44.
[10] L. Zhao, Dressed human modeling, detection, and parts localization, Ph.D. thesis, Carnegie Mellon University, 2001.
[11] C. Papageorgiou and T. Poggio, “A pattern classification approach to dynamical object detection,” in Proc. ICCV, 1999, vol. 2, pp. 1223–1228.
[12] A. Utsumi and N. Tetsutani, “Human detection using geometrical pixel value structures,” in Proc. IEEE Int. Conf. on Automatic Face and Gesture Recognition, 2002, pp. 39–44.
[13] R. Cutler and L. S. Davis, “Robust real-time periodic motion detection, analysis, and applications,” IEEE Trans. PAMI, vol. 22, no. 8, pp. 781–796, 2000.
[14] I. Haritaoglu, D. Harwood, and L. S. Davis, “W4: real-time surveillance of people and their activities,” IEEE Trans. PAMI, vol. 22, no. 8, pp. 809–830, 2000.
[15] A. D. Fisk and W. A. Rogers, Handbook of human factors and the older adult, Academic Press, San Diego, CA, 1996.
