2009 Advanced Video and Signal Based Surveillance

An algorithm for detection of partially camouflaged people D. Conte, P. Foggia, G. Percannella, F. Tufano and M. Vento Dip. di Ing. dell'Informazione e Ing. Elettrica Università di Salerno, ITALY {dconte,pfoggia,pergen,ftufano,mvento}@unisa.it

Abstract

Several video analysis applications perform object detection using a background subtraction approach. Camouflage can be a serious problem for these applications, since the objects of interest may appear fragmented into small, disconnected pieces, with a dramatic negative impact on later processing phases such as classification or tracking. Nevertheless, this problem is largely underestimated in the literature. In this paper an effective, model-based solution is presented for the case of people detection. The proposed method acts as a post-processing phase, grouping the fragmented blocks together to restore the original object. A quantitative evaluation of the effectiveness of this method has been performed on real world videos from a video-surveillance application. The videos used for the experiments (with metadata) have been made publicly available on the Internet.

978-0-7695-3718-4/09 $25.00 © 2009 IEEE DOI 10.1109/AVSS.2009.83

1. Introduction

Several important video analysis applications, such as video-surveillance, behavior analysis and traffic monitoring, are based on a fixed camera setup. Under these conditions, background subtraction is a powerful and widely adopted technique for detecting the objects of interest as opposed to the static elements that are part of the observed scene. Background subtraction is subject to several well known problems, categorized in [12]. Algorithms have been proposed to address each of these problems, with the noteworthy exception of camouflage, the problem that occurs when the pixel characteristics of a foreground object are too similar to the background to be discerned. Camouflage has received comparatively little attention, probably because most detection methods operate at the pixel level, where there is not enough information to effectively tackle this problem. So most authors either ignore the issue, testing their detection algorithms in contexts where camouflage is unlikely, or assume that later processing phases will be able to correct the anomalies induced by camouflage.

Among the papers specifically devoted to the camouflage problem, Tankus and Yeshurun [11] propose an operator that enhances areas whose shading corresponds to a convex object, in order to separate such areas from a "flat" background with similar intensity and texture. However, the method is not suitable for environments in which the background also contains convex objects, and does not work well for dark-colored objects. The paper by Harville et al. [9] is representative of approaches that use depth information to detect camouflaged objects. The authors also evaluate other popular video analysis methods proposed in the literature, maintaining that, among the considered systems, only the ones incorporating depth information are able to deal with camouflage. While the use of depth information can surely improve the detection performance of a video analysis system, it has a non-negligible computational cost and, more importantly, it precludes the use of the legacy cameras often already installed for applications such as video-surveillance or traffic monitoring. The paper by Boult et al. [6] is devoted to intentional camouflage, as opposed to accidental camouflage, but the proposed techniques are also applicable to the latter. Their method uses background subtraction with two thresholds: a larger one, used to detect pixels that are certainly in the foreground, and a smaller one, to detect pixels that can be either part of the background or a camouflaged part of the foreground. The regions detected using the two thresholds are then grouped, using suitable conditions, into so-called "Quasi Connected Components" to recover the split parts of camouflaged objects.

The approach proposed by TrakulPong and Bowden [13] is instead based on a simple model of the shape, represented as a bounding box, and is integrated in the tracking phase rather than in the detection phase. The method builds a statistical model of the shape of the tracked object; when an abrupt shape change occurs, the algorithm assumes it is due to camouflage and tries to match the object image from the previous frame to restore the correct shape. Huang and Jiang [10] propose a method to tackle the camouflage of wildlife animals based on two ideas: an iterative region consolidation operator to fill the gaps introduced by camouflage, and a model of the contour, built while tracking an object, to improve the detection of the actual shape. However, being based on inter-frame differences instead of background subtraction, their method is not suitable for slowly moving objects or for objects with uniformly colored areas. The paper by Guo et al. [8] proposes to address the camouflage problem by temporally averaging the frames before computing or updating the background model. The idea is that the model will then have a smaller variance, so a smaller detection threshold can be used. However, as the experiments performed by the authors show, the method has problems with slowly moving objects.

In this paper we propose an object detection method based on background subtraction that deals with camouflage by introducing a grouping phase. Grouping is performed on the basis of a model of the shape of the targets, which in our system are people. In section 2 the rationale of the method is explained, while in section 3 a detailed description is provided. Experimental results are then presented and discussed in section 4.

Authorized licensed use limited to: Universita degli Studi di Salerno. Downloaded on October 12, 2009 at 09:40 from IEEE Xplore. Restrictions apply.

2. Rationale

Figure 1. Conceptual structure of the first phases of a background subtraction algorithm.

At least conceptually, a background subtraction algorithm is organized as shown in fig. 1. First, the "difference" between the current frame and a background model is computed (usually, as some distance in the adopted color space). This difference is then thresholded, giving a foreground mask that indicates which pixels belong to the foreground. Foreground pixels are finally grouped according to some criterion (e.g. connected component analysis) into objects, which constitute the output of object detection. Of course, actual algorithms are more complex than this scheme, having to deal with details such as the definition and maintenance of the background model (which usually changes with time), noise at the pixel level that must be filtered, or the need to adapt the threshold to different conditions in different parts of the image.

Camouflage occurs when some of the pixels belonging to an object differ from the background model by less than the threshold used in the thresholding phase, and so are not considered foreground pixels. One could expect this situation to be very unlikely, given the huge number of colors that current acquisition devices can distinguish in their output; at least, unlikely unless someone is deliberately trying to hide the object from the video analysis system. Unfortunately, this is not true: in real applications, especially in outdoor environments, camouflage occurs quite frequently. The reason is that in a real world application the system has to deal both with the noise of the image acquisition device and with changes in the background due, for instance, to the different lighting at different times of day. The latter problem is usually faced by continuously updating the background model, but for obvious reasons the updates cannot be instantaneous. As a consequence, the threshold on the difference cannot be too tight, or else too many actual background pixels would be classified as foreground. Hence, camouflage cannot be avoided.

Figure 2. Example of the problem. On the first row, the original sequence. On the second row, the foreground mask, with detected objects represented by white boxes. Notice that the person appears divided into several objects because of camouflage between his trousers and the background.

An example frame sequence from a real world video-surveillance installation is shown in fig. 2. It can easily be seen that the person in the video has a large part of his trousers missing in the foreground mask because of camouflage. More importantly, as a consequence of the missing parts, the person is detected as two or more disconnected objects.
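The pipeline of fig. 1 and the splitting effect of fig. 2 can be reproduced with a short sketch. The array sizes, intensity values and threshold below are illustrative assumptions, and `scipy.ndimage` is used for the connected-component step:

```python
import numpy as np
from scipy import ndimage

def detect_objects(frame, background, threshold):
    """Minimal sketch of fig. 1: per-pixel difference, thresholding,
    then connected-component grouping into detected objects."""
    diff = np.abs(frame.astype(float) - background.astype(float))  # "difference"
    mask = diff > threshold                                        # foreground mask
    labels, _ = ndimage.label(mask)                                # grouping
    # One bounding box (top, left, bot, right) per connected component.
    boxes = [(sl[0].start, sl[1].start, sl[0].stop, sl[1].stop)
             for sl in ndimage.find_objects(labels)]
    return mask, boxes

# Toy frame: a "person" whose middle band (the trousers) is camouflaged,
# i.e. its difference from the background stays under the threshold.
bg = np.zeros((12, 8))
frame = bg.copy()
frame[1:4, 3:6] = 50    # head and bust, well above threshold
frame[4:7, 3:6] = 5     # camouflaged trousers, under threshold
frame[7:10, 3:6] = 50   # legs
mask, boxes = detect_objects(frame, bg, threshold=10)
print(len(boxes))       # -> 2: the single person comes out as two objects
```

This is exactly the failure mode described above: the object is not lost, but fragmented into disconnected pieces.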


This splitting along the vertical direction is a very frequent effect of camouflage when the objects of interest are people, since their overall shape is narrow and tall. The phenomenon can have a disastrous impact on the subsequent processing phases that use the output of object detection:

• tracking becomes more difficult, since the splitting happens only in some of the frames, and furthermore the object can be split at different points in different frames (consider that the object is moving, so in each frame the portion of background behind it can be somewhat different);

• classification cannot reliably use geometric information such as the height of the object or its aspect ratio, since the detected object may be only a part of the actual one.

This problem cannot be solved during the background subtraction process, since that process operates at the pixel level, where there simply is not enough information to decide whether a pixel showing a small difference from the background model is a background pixel affected by noise or a camouflaged foreground pixel. On the other hand, it is desirable to solve the problem before the subsequent, higher level processing phases: first, this avoids adding complexity to those phases, which already have several other problems to solve; second, it insulates the subsequent phases from the concerns of object detection, promoting a greater modularity of the application (for instance, making it easier to replace the tracking algorithm with a different one). So our solution involves a post-processing step on the object detection output, before it is fed to the subsequent processing phases. This step is aimed at recognizing the object splits produced by camouflage and regrouping the broken pieces into something closer to the original object. The goals for this step are:

• sensitivity: the method should regroup as many broken objects as possible;
• specificity: the method should limit the number of cases where unrelated objects are grouped together as if they were parts of the same object;
• speed: the computational cost should not add a significant overhead to the overall system.

3. The proposed method

In this section we provide a description of the proposed method for model-based person detection in the presence of the camouflage phenomenon.

The grouping algorithm starts from the observation that the splitting caused by camouflage always affects the same parts of the connected component representing a person. In fact, observing the detection results of several video sequences, it is possible to notice that a person is always split into two or three parts, and that these parts usually correspond to the head, the bust and the legs. It is worth noting that this phenomenon does not occur, or occurs only rarely, in standard databases such as the PETS databases ([2], [3]).

Usually the appearance of a person has two or three main color regions, according to his clothing. If a person wears a suit, his appearance will consist of two main color regions, corresponding to the head and the rest of the body (e.g. see Fig. 3b). If instead a person wears a sweater and a pair of trousers, his appearance will consist of three main color regions, corresponding to the head, the sweater and the trousers (e.g. see Fig. 3a). The splitting occurs exactly between these parts. Typically, when the camouflage problem occurs, a person is divided into several parts according to the configurations shown in Fig. 4.

Figure 3. Examples of the splitting of a person. a) A person wearing a sweater and a pair of trousers; b) a person wearing a suit.

Figure 4. The different configurations in which the box of a person can be separated because of the camouflage problem.
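The regrouping illustrated by Fig. 4 can be sketched as a pairwise merge test between an upper box X and a lower box Y, checked against the person model introduced formally in the next section (parameters h1 and h2 bound the model height, b1 and b2 the width). The tuple representation and the example values below are illustrative assumptions:

```python
def should_group(X, Y, h1, h2, b1, b2):
    """Decide whether upper box X and lower box Y are two pieces of the
    same person. Boxes are (top, left, bot, right) tuples in pixels;
    h1, h2, b1, b2 are assumed already scaled for the boxes' position."""
    TOP, LEFT, BOT, RIGHT = range(4)
    # 1. the horizontal projections of the two boxes intersect
    c1 = X[RIGHT] >= Y[LEFT] and X[LEFT] <= Y[RIGHT]
    # 2. the grouped height is closer to h1 than either box's own height
    gh = Y[BOT] - X[TOP]
    c2 = (abs(gh - h1) <= abs(X[BOT] - X[TOP] - h1)
          and abs(gh - h1) <= abs(Y[BOT] - Y[TOP] - h1))
    # 3. the grouped height does not exceed h2
    c3 = gh <= h2
    # 4. the width (of Y, as in the paper's fourth formula) lies in [b1, b2]
    c4 = b1 <= Y[RIGHT] - Y[LEFT] <= b2
    return c1 and c2 and c3 and c4

# A head/bust box and a legs box split off by camouflage merge...
X = (0, 10, 20, 30)     # height 20
Y = (35, 11, 60, 29)    # height 25; grouped height 60
print(should_group(X, Y, h1=60, h2=80, b1=10, b2=25))  # -> True
# ...while a box far below (another person) fails the height tests.
print(should_group(X, (90, 10, 120, 30), h1=60, h2=80, b1=10, b2=25))  # -> False
```

Applied repeatedly over all pairs, including already merged boxes, this single test handles every configuration of Fig. 4.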


Figure 5. The model of the bounding box of a person.

The aim of the grouping algorithm is to merge two or more boxes (arranged according to the configurations of Fig. 4) into a single box representing a person (according to the model of Fig. 5). We introduce four parameters h1, h2, b1 and b2 to represent the model of the bounding box of a person (see Fig. 5). In the following we describe the rules for merging two boxes into a single box that is closer to the person model. Given two boxes X and Y (which we can consider, without loss of generality, arranged as in Fig. 4a), they are grouped into a single box if and only if all the following assertions are true:

1. The projections of the two boxes on the horizontal axis intersect.

2. The height of the grouped box is closer to h1 than the height of X or the height of Y.

3. The height of the grouped box is no greater than h2.

4. The width of the grouped box lies between b1 and b2.

In formulas:

1. right(X) ≥ left(Y) ∧ left(X) ≤ right(Y), where left(·) (right(·)) is the left (right) coordinate of the box.

2. |bot(Y) − top(X) − h1| ≤ |bot(X) − top(X) − h1| ∧ |bot(Y) − top(X) − h1| ≤ |bot(Y) − top(Y) − h1|, where top(·) (bot(·)) is the top (bottom) coordinate of the box.

3. bot(Y) − top(X) ≤ h2

4. right(Y) − left(Y) ∈ [b1, b2]

We apply this grouping rule to each pair of detected boxes, re-applying it between already grouped boxes as well. In this way we can group the boxes in all the configurations of Fig. 4. In fact, it is important to notice that configurations a–d of Fig. 4 reduce to configurations e–g once box A has been grouped with box B (considering only boxes A and B, configurations a–d are identical). Furthermore, configuration e reduces to configuration f or g (which are similar) once box B has been grouped with box C or D.

The parameters h1, h2, b1 and b2 are expressed in pixels, and they are not constant over the entire frame: they depend on the position of the bounding boxes being grouped. In fact, because of perspective effects, the size (in pixels) of a bounding box representing a person is not independent of its position in the frame. In the proposed method we adopt an initial semi-automatic calibration phase; the parameters h1, h2, b1 and b2 are then calculated for each pair of bounding boxes to be grouped, so that they always represent the same width and height (expressed in meters) of a person, independently of the position of the considered bounding boxes within the scene. Finally, it is important to highlight that the proposed method is not computationally expensive, because the number of boxes detected in a frame is never more than one or two dozen.

4. Experimental results

The proposed method for recovering object splits due to camouflage can be used as a post-processing module operating after a generic blob detection algorithm. Several approaches have been proposed in the literature to model the background in a more or less sophisticated way. In this paper we used the algorithm in [7], which models the background through a Gaussian, an approach followed by many authors, and which offers a good trade-off between detection performance and computational complexity. With regard to the processing workflow of the method described in [7], the proposed grouping algorithm is introduced just before the shadow suppression module. It has to be noted that, since here we are interested in assessing the performance improvement produced by the proposed method with respect to the case where no measure is taken to manage splitting problems, the choice of the foreground detection algorithm is not particularly crucial.

The test of the proposed system has been carried out on a new, large dataset whose characteristics are summarized in Table 1. The videos used in this paper are publicly available in the database section of http://nerone.diiie.unisa.it/zope/home/mivia/databases. The creation of a new dataset was motivated by the fact that the datasets available in the literature [1, 2, 3, 4, 5] do not account for occurrences of camouflage, or such situations occur only marginally. This is mainly because the sequences within those datasets are very short and conceived to highlight only the classical problems, such as "light of day", "shadows", "waving tree", etc., typically with one video per problem. On the contrary, our dataset is composed of two videos obtained with a camera that framed a public outdoor area for a long time. The two videos show the same scene in two different environmental situations (see fig. 6). In the video with ID = 1 the weather is cloudy and camouflage problems are frequent. In the video with ID = 2 the scene was framed just before sunset, so there are very long shadows.

Table 1. Characteristics of the employed dataset. Videos were framed at 25 fps.

Video ID | Length (# of frames) | Description
1        | 4,575                | cloudy, very high camouflage, few shadows
2        | 21,000               | late afternoon, high camouflage, very long shadows

Figure 6. Example frames extracted from the videos with (a) ID = 1 and (b) ID = 2.

Table 2. Performance obtained by the original and the modified algorithm on the considered dataset.

Video ID | Original algorithm (Pr / Re / f) | Modified algorithm (Pr / Re / f)
1        | 0.229 / 0.396 / 0.290            | 0.495 / 0.417 / 0.452
2        | 0.093 / 0.312 / 0.143            | 0.216 / 0.397 / 0.279

Since we are interested in assessing the performance improvement produced by the proposed method in solving problems ascribable to camouflage, we have measured the performance obtained both by the original algorithm in [7] and by the same algorithm modified with the proposed model-based box grouping. In the literature there has been much effort to evaluate the performance of tracking algorithms, whereas similar results have not been obtained for the assessment of object detection. One reason is the huge effort needed to produce the ground truth, which would require determining, for each pixel of each frame, whether it belongs to the foreground or the background. Furthermore, an evaluation at the pixel level, i.e. counting misdetected pixels (as in [12]), provides a measure that is not very meaningful with respect to the problem of object splitting caused by camouflage. Thus, here we use a quantitative method widely used in the context of information retrieval systems, but never before in the evaluation of this kind of algorithm. To measure the effectiveness of detection systems, the precision and recall figures of merit are often used. They are defined as follows:

precision = TP / (TP + FP),    recall = TP / (TP + FN)    (1)

where TP is the number of true positives, that is, objects correctly detected by the system; FP is the number of false positives, that is, false objects detected by the system but not actually present; and FN is the number of false negatives, that is, actual objects that are not detected by the system. Sometimes it is preferable to have a single index for measuring the performance (e.g. for the performance tuning of a parametric system); in this case some authors propose the f-score, defined as the harmonic mean of precision and recall:

f-score = 2 · precision · recall / (precision + recall)    (2)
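A small sketch of eqs. 1–2; the counts in the example are illustrative:

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and f-score (eqs. 1-2) from detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f

# Illustrative counts: 30 correct detections, 10 spurious, 20 missed.
pr, re, f = precision_recall_f(tp=30, fp=10, fn=20)
print(round(pr, 2), round(re, 2), round(f, 2))  # -> 0.75 0.6 0.67
```

As a sanity check, the harmonic mean of Pr = 0.495 and Re = 0.417 from Table 2 gives ≈ 0.452, matching the reported f-score up to rounding.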

In the context of this paper we declare an object as correctly identified, and thus as a TP, if there is a detected bounding box that covers the object. In particular, given an object g with ground truth bounding box G and detected bounding box D, we say that D correctly covers g if the following condition is verified:

|x(D) − x(G)| ≤ δx  for all x ∈ {top, right, bot, left}    (3)

where

δx = λ · |top(G) − bot(G)|    if x ∈ {top, bot}
δx = λ · |left(G) − right(G)|  if x ∈ {right, left}
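The matching test of eq. 3 can be sketched as follows; the tuple representation of boxes is an illustrative assumption:

```python
def covers(D, G, lam=0.2):
    """Eq. 3: detected box D covers ground-truth box G if each side
    differs by at most lam times the corresponding size of G.
    Boxes are (top, left, bot, right) tuples in pixel coordinates."""
    TOP, LEFT, BOT, RIGHT = range(4)
    dy = lam * abs(G[TOP] - G[BOT])     # vertical tolerance
    dx = lam * abs(G[LEFT] - G[RIGHT])  # horizontal tolerance
    return (abs(D[TOP] - G[TOP]) <= dy and abs(D[BOT] - G[BOT]) <= dy and
            abs(D[LEFT] - G[LEFT]) <= dx and abs(D[RIGHT] - G[RIGHT]) <= dx)

G = (0, 0, 100, 40)                 # ground-truth person box
print(covers((5, 2, 95, 38), G))    # slightly misaligned detection -> True
print(covers((40, 0, 100, 40), G))  # only the legs detected -> False
```

Note that a detection covering only part of the person (the second case) fails the test, so split objects count against recall, which is exactly the effect the evaluation is meant to capture.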

For our experiments we used λ = 0.2. Given the above defined values of TP, FP and FN, we have computed the precision, recall and f-score obtained on the considered dataset by the original algorithm in [7] and by the same algorithm modified with the proposed method for managing the splitting problem. The results, summarized in Table 2, clearly demonstrate that the proposed method significantly improves the overall object detection


performance of the method in [7]. As could be expected, its use is particularly effective in reducing the number of FP, thus improving the precision. More importantly, this is not paid for in terms of sensitivity; on the contrary, the modified algorithm correctly detects more objects than the original on both videos. Fig. 7 shows an example in which the proposed model-based method correctly detects a moving person that the original algorithm, because of camouflage, splits into two or more parts (see the sequence of frames in fig. 2 for the output of the original algorithm).


References [1] Database: Elgammal. http://www.cs.rutgers.edu/ %7eelgammal/Research/BGS/research bgs.htm. [2] Database: Pets2001. http://peipa.essex.ac.uk/ipa/pix/pets/ PETS2001/. [3] Database: Pets2006. http://www.cvg.rdg.ac.uk/PETS2006/ data.html. [4] Database: W4: Who? when? where? what? http://www.umiacs.umd.edu/%7ehismail/W4 outline.htm. [5] Database: Wallflower. http://research.microsoft.com/enus/um/people/jckrumm/WallFlower/TestImages.htm. [6] T. E. Boult, R. J. Michaels, X. Gao, and M. Eckmann. Into the woods: visual surveillance of noncooperative and camouflaged targets in complex outdoor settings. Proceedings of the IEEE, 89(10):1382–1402, 2001. [7] D. Conte, P. Foggia, M. Petretta, F. Tufano, and M. Vento. Evaluation and improvements of a real-time background subtraction method. In Lecture Notes in Computer Science, volume 3704, pages 1234–1241. Springer-Verlag, Berlin Heidelberg, 2005. [8] H. Guo, Y. Dou, T. Tian, J. Zhou, and S. Yu. A robust foreground segmentation method by temporal averaging multiple video frames. In International Conference on Audio, Language and Image Processing, pages 878–882, 2008. [9] M. Harville, G. Gordon, and J. Woodfill. Foreground segmentation using adaptive mixture models in color and depth. In IEEE Workshop on Detection and Recognition of Events in Video, pages 3–11, 2001. [10] Z. Q. Huang and Z. Jiang. Tracking camouflaged objects with weighted region consolidation. In Proceedings of the Digital Imaging Computing: Techniques and Applications, 2005. [11] A. Tankus and Y. Yeshurun. Convexity-based visual camouflage breaking. Computer Vision and Image Understanding, 82:208–237, 2001. [12] K. Toyama, J. Krumm, B. Brumitt, and B. Meyers. Wallflower: Principles and practice of background maintenance. In Seventh IEEE International Conference on Computer Vision, volume 1, pages 255–261, 1999. [13] P. K. TrakulPong and R. Bowden. A real time adaptive visual surveillance system for tracking low-resolution colour targets in dynamically changing scenes. Image and Vision Computing, 17:913–929, 2003.

Figure 7. Results of the proposed method on the sequence of fig. 2. On the first row, the original sequence. On the second row, the foreground mask, with detected objects represented by white boxes, when the proposed grouping method is adopted.

5. Conclusions

Camouflage is a serious problem for applications that perform object detection using a background subtraction approach. In spite of its importance, this problem is disregarded by most papers. To cope with it, an effective, model-based solution for the case of people detection has been presented in this paper. The proposed method has been tested on a large set of real world videos from a video-surveillance application, demonstrating its effectiveness. In particular, when used as a post-processing module of a standard object detection algorithm based on background subtraction, the method significantly improves the specificity without reducing the sensitivity. Several issues will be considered in future work, such as the assessment of the method in combination with other object detection approaches, experimental validation on an even larger dataset including different situations causing camouflage, and the specialization of the method to other objects commonly appearing in video-surveillance videos.
