2014 11th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)
Improved Color and Intensity Patch Segmentation for Human Full-Body and Body-Parts Detection and Tracking

Hai-Wen Chen
Booz Allen Hamilton Inc., Belcamp, MD 21017, USA
[email protected]

Mike McGurr
Booz Allen Hamilton Inc., Belcamp, MD 21017, USA
[email protected]
Abstract

This paper presents a new method for detecting and tracking the human full-body and body-parts (head, torso, arms, and legs) with color and intensity patch segmentation. The original R, G, and B channels are transformed to the H (hue), S (saturation), and V (value) domain, as well as to the Y, I, and Q domain of the NTSC system. With the help of morphological image processing, the fusion of the S, V, Y, I, and Q segmentations is used for full-body detection, while the individual V, I, and Q segmentations are used for body-parts detection. An adaptive thresholding scheme has been developed to deal with body size changes, illumination condition changes, and cross-camera parameter changes. Preliminary tests with the PETS 2014 datasets show that we can obtain a high probability of detection (Pd = 100%) and a low probability of false alarm (Pfa = 1.95%) for both full-body and body-parts. The reliable body-parts (e.g. head) detection allows us to continuously track an individual person even when the torsos and legs of several closely spaced persons are merged together, and accurate human head localization is critical for human ID (face recognition). Furthermore, the detected body-parts allow us to extract important local constellation features of the body-parts' positions and angles relative to the centroid position of the full-body. These features are critical for human walking gait estimation (a biometric feature for walking pattern recognition), as well as for human pose (e.g. standing or falling down) estimation for potential abnormal behavior and accidental event detection.
1. Introduction

Security surveillance camera networks play an important role in protecting our daily life. However, the excessive video data from multiple cameras in a crowded urban area pose a huge load on human analysts trying to detect and track every abnormal or criminal activity in a timely manner. An automated human detection, tracking, and recognition system is urgently needed. With manufacturing advances improving both spatial and temporal resolution in recent years, R, G, B color video cameras have replaced traditional high-resolution panchromatic cameras as the major sensor platform for security surveillance. The additional color
978-1-4799-4871-0/14/$31.00 ©2014 IEEE
features (R, G, B colors vs. intensity only) enable us to design algorithms with better target detection, tracking, and recognition performance. There are several technical challenges for human (vs. vehicle) detection and tracking in a crowded area: 1) the non-rigid shape and form of human bodies make it difficult to apply matched filter techniques; 2) human motion has much more complicated kinematic patterns than vehicle motion, making it more difficult to predict forward (future) positions and the tracking gate size; and 3) in a crowded area, there is a high chance that a person is partially or totally blocked by other persons nearby, or that a group of closely spaced persons is detected as a single large detection blob, leading to difficulty with tracking associations for each person between video frames. A popular state-of-the-art feature descriptor for human detection is the Histogram of Oriented Gradients (HOG) developed by Dalal and Triggs [1]. This method is similar to the scale-invariant feature transform (SIFT) descriptor developed by Lowe [2], but differs in that it is computed on a dense grid of uniformly spaced cells and uses overlapping local contrast normalization for improved accuracy. In [3], Zweng and Kapel reviewed more recent approaches to human detection based on the original HOG approach, as well as recent approaches to human body-parts (head, torso, arms, and legs) detection. In a recent technical paper [4], Chen and McGurr presented a different approach to human detection. In their approach, the color features are obtained by taking differences of the R, G, B channels (R−B and R−G) and by converting R, G, B to HSV space. For stationary camera platforms, a time-differencing (TD) process [5], [8] is applied to suppress the heavy urban clutter for improved human body segmentation and detection.
The extracted color and intensity features based on morphological patch filtering and regional segmentation are applied for target detection [6]. Target detection is conducted on both color and intensity (black and white) spaces. Several important image color, intensity, shape, and size features associated with detections are also extracted to be used for the following tracking process. The color and intensity features are invariant when the body is partially occluded. Furthermore, the morphological operations are conducted
on binary images (vs. gray-scale image processing for HOG), and the detection process does not need machine learning (SVM) to train the features and to classify image regions over all possible image positions at multiple image scales as required by the HOG approach, thus leading to much lower computational throughput requirements. In this paper, we have developed an improved color and intensity patch segmentation method for human full-body and body-parts detection and tracking. There are three major differences between our approach and the approach in [4]:
(i) Instead of using differences of the R, G, B channels (R−B and R−G), we convert the original R, G, and B to Y (luma), I (orange-blue range chroma), and Q (purple-green range chroma) for the National Television System Committee (NTSC) system. The I and Q images provide us with more accurate sub-patch color feature segmentation for body-parts detection.
(ii) In [4], the detected patches are tracked with three separate trackers (the color patch tracker, the bright intensity tracker, and the dim intensity tracker), and the sub-body patches are later associated into a full-body patch based on tracking motion kinematics and spatial position features. In this paper, on the other hand, we first fuse the S, V, Y, I, and Q sub-patch segmentations into full-body detections. In many cases the detected sub-patches may not connect together to form a full-body patch, so we have developed a morphological vertical bar dilation filtering method to actively fuse the body-parts sub-patches into a full-body patch.
(iii) The approach in [4] does not have the capability to extract the local constellation features of the body-parts relative to the full-body detection. In this paper, we show that we can reliably detect the body-parts (e.g. the head and arms) and associate them with the full-body position via the individual V, I, and Q sub-patch segmentations.
The detected body-parts allow us to extract important local constellation features of the body-parts' positions and angles relative to the centroid position of the full-body. A critical parameter in the patch segmentation approach is the selection of segmentation thresholds. An adaptive thresholding scheme has been developed for dealing with body size changes and cross-camera parameter changes. Preliminary tests with the PETS 2014 datasets show that we can obtain a high probability of detection (Pd) and a low probability of false alarm (Pfa) for both full-body and body-parts. The reliable body-parts (e.g. head) detection allows us to continuously track an individual person even when the torsos and legs of several closely spaced persons are merged together. The paper is organized in the following way: HSV and YIQ color transformations in Section 2; full-body detection and tracking with non-linear morphological operations in Section 3; adaptive segmentation thresholding in Section 4; body-parts sub-patch detection and association in Section 5; and finally, we summarize the paper in Section 6.
2. HSV and YIQ Color Transformations

HSV is a common cylindrical-coordinate representation of points in an RGB color model developed in the 1970s [7]. This representation rearranges the geometry of RGB in an attempt to be more intuitive and perceptually relevant than the Cartesian (cube) representation, by mapping the values into a cylinder:

$$H = 60^\circ \times H', \quad H \in [0^\circ, 360^\circ) \tag{1}$$

with

$$H' = \begin{cases} \dfrac{G-B}{\max(R,G,B)-\min(R,G,B)} \bmod 6, & \text{if } \max(R,G,B) = R \\[4pt] \dfrac{B-R}{\max(R,G,B)-\min(R,G,B)} + 2, & \text{if } \max(R,G,B) = G \\[4pt] \dfrac{R-G}{\max(R,G,B)-\min(R,G,B)} + 4, & \text{if } \max(R,G,B) = B \end{cases} \tag{2}$$

$$S = \frac{\max(R,G,B) - \min(R,G,B)}{\max(R,G,B)} \tag{3}$$

$$V = \max(R,G,B) \tag{4}$$
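As a concrete illustration of the cylindrical HSV mapping of Equations (1)-(4), a per-pixel conversion can be sketched in Python as below. This is a minimal sketch of the standard transform, not the authors' implementation; the function name is our own.

```python
def rgb_to_hsv(r, g, b):
    """Convert one RGB pixel (each channel in [0, 1]) to (H, S, V).
    H follows Eqs. (1)-(2) in degrees, S follows Eq. (3), V follows Eq. (4)."""
    mx, mn = max(r, g, b), min(r, g, b)
    v = mx                                     # Eq. (4): value = max channel
    s = 0.0 if mx == 0 else (mx - mn) / mx     # Eq. (3): saturation
    if mx == mn:
        h = 0.0                                # achromatic: hue undefined; use 0
    elif mx == r:
        h = (60.0 * (g - b) / (mx - mn)) % 360.0
    elif mx == g:
        h = 60.0 * (2.0 + (b - r) / (mx - mn))
    else:
        h = 60.0 * (4.0 + (r - g) / (mx - mn))
    return h, s, v
```

Applied per pixel, this produces the H, S, and V planes; for pure red (1, 0, 0) it returns hue 0°, saturation 1, and value 1.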
YIQ is the color space used by the NTSC color TV system. The YIQ system is intended to take advantage of human color-response characteristics: the eye is more sensitive to changes in the orange-blue (I) range than in the purple-green (Q) range. With $R, G, B, Y \in [0, 1]$, $I \in [-0.5957, 0.5957]$, and $Q \in [-0.5226, 0.5226]$:

$$\begin{bmatrix} Y \\ I \\ Q \end{bmatrix} = \begin{bmatrix} 0.299 & 0.587 & 0.114 \\ 0.595716 & -0.274453 & -0.321263 \\ 0.211456 & -0.522591 & 0.311135 \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix} \tag{5}$$
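Since Equation (5) is a fixed 3x3 linear map (the same coefficients used by MATLAB's rgb2ntsc), it can be sketched in a few lines of NumPy. This is an illustrative sketch with a function name of our own choosing, not the authors' code.

```python
import numpy as np

# NTSC RGB -> YIQ coefficients from Eq. (5)
RGB2YIQ = np.array([[0.299,     0.587,     0.114],
                    [0.595716, -0.274453, -0.321263],
                    [0.211456, -0.522591,  0.311135]])

def rgb_to_yiq(rgb):
    """rgb: array of shape (..., 3) with channels in [0, 1].
    Returns an array of the same shape holding (Y, I, Q) per pixel."""
    return rgb @ RGB2YIQ.T
```

For white, (1, 1, 1), the Y row sums to 1 while the I and Q rows each sum to 0, so the chroma channels vanish, as expected for a gray pixel.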
As indicated by Equations (1)-(5), HSV is a nonlinear transform of RGB, while YIQ is a linear transform of RGB. Figure 1(a) shows an example of an original image from the ARENA 01_02_TRK_RGB_2 dataset, and figure 1(b) shows the Time-Differenced (TD) image obtained by subtracting the original image from a reference image that has no pedestrian in the scene. It is seen that most of the heavy background clutter has been subtracted out. Figure 2 shows the S, V, H, Y, I, and Q images transformed from the TD image in figure 1(b). It is seen that the V image is similar to the Y image.
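The TD step itself is a per-pixel difference against a pedestrian-free reference frame; a minimal sketch (assuming NumPy arrays; the function name is ours, not from the paper):

```python
import numpy as np

def time_difference(frame, reference):
    """Time-differencing: suppress static background clutter by
    differencing the current frame against a reference frame that
    contains no pedestrians. Works per channel for color images."""
    return np.abs(frame.astype(np.float64) - reference.astype(np.float64))
```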
Figure 2: The S, V, H, Y, I, and Q images transformed from the TD image in figure 1(b).

3. Full-Body Detection and Tracking with Non-linear Morphological Operations

Figure 3 shows the histograms of the V, S, I, and Q images. It is seen that the V image may be approximated by an exponential probability density function (pdf), the S image by a one-sided-tail Rayleigh pdf, and the I and Q images by a Gaussian pdf. The pixels with the extreme values in the pdf tails are what we want to segment out as the target detections. In general, the left tail in V is related to dim intensity objects (black clothes, black hair, shadows, etc.), and the right tail of V is related to bright intensity objects (white shirts and pants, objects with strong sunshine reflection, etc.). The left tail in I is related to objects with bluish color, while the right tail in I is related to objects with orange-like color, including human skin. As shown in figure 2, human skin (face and arms) has high I values. Likewise, the left tail in Q is related to objects with greenish color, while the right tail in Q is related to objects with purple-like color. From figure 3 it is also noted that the mean value is around 0.5 for the V image and around zero for the I and Q images.

Figure 3: The histograms of the V, S, I, and Q images.

The sub-patch segmentation process is expressed as

$$\mathrm{Seg}_L(x,y) = \begin{cases} 1, & \text{if } f(x,y) < \mu_f - T_L \\ 0, & \text{otherwise} \end{cases} \qquad \mathrm{Seg}_R(x,y) = \begin{cases} 1, & \text{if } f(x,y) > \mu_f + T_R \\ 0, & \text{otherwise} \end{cases} \tag{6}$$

where $\mathrm{Seg}_L(x,y)$ and $\mathrm{Seg}_R(x,y)$ are the segmented binary images for the left and right tails, respectively, $f(x,y)$ is the original (V, Y, S, I, or Q) image, $\mu_f$ is the mean of the original image, and $T_L$ and $T_R$ are the threshold levels for the left and right tails, respectively. These thresholds are set as constants in this section, and will be discussed further in the next section on Adaptive Segmentation Thresholding, where they are set to change adaptively over time as functions of the image standard deviations. Finally, the full-body segmentation is obtained by a logical OR operation of nine binary sub-patch segmentation images (two tails each for V, Y, I, and Q, and one tail for S). The full-body segmentation for the image shown in figure 1(b) is shown in figure 4(a).

Figure 4: (a) Full-body detection before morphological filtering, (b) Full-body detection after morphological filtering.
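The two-tail thresholding of Eq. (6) and the nine-image OR fusion can be sketched as below. This is a minimal sketch, not the authors' code: NumPy is assumed, the function names are ours, and using the right tail for the one-sided S histogram is our reading of the Rayleigh-like shape described above.

```python
import numpy as np

def two_tail_segment(f, t_left, t_right):
    """Eq. (6): segment the left and right pdf tails of image f
    (one of V, Y, S, I, Q) about its mean. Returns two binary masks."""
    mu = f.mean()
    seg_left = f < (mu - t_left)    # dim / bluish / greenish objects
    seg_right = f > (mu + t_right)  # bright / orange / purple objects
    return seg_left, seg_right

def fuse_full_body(images, thresholds):
    """Logical OR of the nine sub-patch segmentations:
    two tails each for V, Y, I, Q, and one tail for S."""
    fused = np.zeros(images["V"].shape, dtype=bool)
    for name in ("V", "Y", "I", "Q"):
        left, right = two_tail_segment(images[name], *thresholds[name])
        fused |= left | right
    # S is one-sided (Rayleigh-like), so only one tail is used here
    _, s_right = two_tail_segment(images["S"], np.inf, thresholds["S"][1])
    return fused | s_right
```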
One problem with constant thresholding is that illumination varies with time, and the body intensity also changes with the distance between the body and the camera. This will cause either over-thresholding, where the three bodies are connected together, or under-thresholding, where a full-body is broken into several small pieces. Here we introduce three morphological operations, erosion, dilation, and opening [7], to solve this problem.

Dilation is defined in terms of set operations. With A and B as sets, the dilation of A by B, denoted $A \oplus B$, is defined as

$$A \oplus B = \{ z \mid (\hat{B})_z \cap A \neq \varnothing \} \tag{7}$$

where $\varnothing$ is the empty set and B is the structuring element ($\hat{B}$ denotes its reflection and $(\cdot)_z$ its translation by $z$). The erosion of A by B, denoted $A \ominus B$, is defined as

$$A \ominus B = \{ z \mid (B)_z \cap A^c = \varnothing \} \tag{8}$$

where $A^c$ is the complement of A. The opening of set A by structuring element B, denoted $A \circ B$, is defined as

$$A \circ B = (A \ominus B) \oplus B \tag{9}$$
Thus, the opening of A by B is the erosion of A by B followed by a dilation of the result by B. The morphological operations that we applied are expressed as

$$\mathrm{Seg}_{out}(x,y) = \big( \mathrm{Seg}_{FB}(x,y) \circ B \big) \oplus C \tag{10}$$

where $\mathrm{Seg}_{FB}(x,y)$ is the binary full-body segmentation image (figure 4(a) shows such an example), the B structuring element is a 9x1 vertical vector, and the C structuring element is a 30x1 vertical vector. The first two operations with B indicate that any connection with a height of less than 9 pixels will be cut (a scissors operation), and the third operation with C means that any vertical gap of less than 30 pixels will be bridged. One example of the output $\mathrm{Seg}_{out}(x,y)$ after the morphological operations is shown in figure 4(b). It is seen that these operations can also reduce small-size false detections.
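The vertical-bar morphology above can be sketched with plain NumPy. This is a simplified sketch of ours, not the authors' code: np.roll wraps around at the image borders, unlike zero-padded morphology, so border pixels differ slightly from the textbook definitions.

```python
import numpy as np

def dilate_v(img, k):
    """Binary dilation with a k x 1 vertical bar structuring element."""
    out = np.zeros_like(img)
    h = k // 2
    for dy in range(-h, k - h):
        out |= np.roll(img, dy, axis=0)   # OR of vertical shifts
    return out

def erode_v(img, k):
    """Binary erosion with a k x 1 vertical bar (AND of vertical shifts)."""
    out = np.ones_like(img)
    h = k // 2
    for dy in range(-h, k - h):
        out &= np.roll(img, dy, axis=0)
    return out

def clean_full_body(seg):
    """Eq. (10): opening with a 9x1 bar cuts vertical connections and blobs
    shorter than 9 px; dilation with a 30x1 bar bridges vertical gaps."""
    opened = dilate_v(erode_v(seg, 9), 9)
    return dilate_v(opened, 30)
```

On a column containing a 12-pixel-tall blob and a 3-pixel-tall blob, the opening preserves the first and removes the second, mirroring the "scissors" behavior described above.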
where the former are the thresholds for Y and V when the body sizes are relatively large, and the latter is the STD of V (or Y) measured at different time frames as shown in figure 5. After we have obtained the full-body segmentations, the detection process estimates the centroids of the body patches and the four corner x, y positions of the bounding boxes. Several important image color, intensity, shape, and size features associated with detections are also extracted, to be used for selecting target detection types and for the following tracking process. The color and intensity features include the mean and standard deviation of the full-body patches in the V, S, I, and Q images, forming an 8-tap feature vector. The shape and size features include detection sizes, detection height-to-width ratios, and detection patch extent deviation from a rectangular shape. The frame numbers are also recorded in the detection output data structure.
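The per-detection feature extraction just described (centroid, bounding box, and 8-tap mean/STD vector) can be sketched as follows. The function name and data layout are our own assumptions; we assume one binary mask per detected body patch.

```python
import numpy as np

def patch_features(mask, channels):
    """Features for one binary body patch.
    mask: (H, W) bool array; channels: dict mapping 'V', 'S', 'I', 'Q'
    to (H, W) float images. Returns the centroid, the bounding box
    corners, and the 8-tap feature vector (mean and STD per channel)."""
    ys, xs = np.nonzero(mask)
    centroid = (float(xs.mean()), float(ys.mean()))
    bbox = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
    feats = []
    for name in ("V", "S", "I", "Q"):     # 2 taps per channel -> 8 taps
        vals = channels[name][mask]
        feats.extend([vals.mean(), vals.std()])
    return centroid, bbox, np.array(feats)
```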
Figure 6: Detection results from further away to nearby.
Figure 5: Image statistics at different time frames.
To test our algorithms, we used an ARENA 01_02_TRK_RGB_2 video clip with 336 frames. This scene contains three pedestrians walking from far away towards the camera position. The body sizes changed considerably from small to large, with about 20 times size difference. The first two statistical moments (the mean and the STD) of these video frames are shown in figure 5. As shown in figure 5, the STD of Y (and also V) are small (