IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. XX, NO. XX, XXX 2008
Appearance Modeling Using a Geometric Transform
Jian Li, Shaohua Kevin Zhou, Member, IEEE, and Rama Chellappa, Fellow, IEEE
Abstract—A general transform, called the Geometric Transform (GeT), that models the appearance inside a closed contour is proposed. The proposed GeT is a functional of an image intensity function and a region indicator function derived from the closed contour. It can be designed to combine the shape and appearance information at different resolutions and to generate models invariant to deformation, articulation, or occlusion. By choosing appropriate functionals and region indicator functions, the GeT unifies the Radon transform, the trace transform, and a class of image warpings. By varying the region indicator and the types of features used for appearance modeling, five novel types of GeTs are introduced and applied to fingerprinting the appearance inside a contour: GeTs based on a level set, shape matching, and feature curves, a GeT invariant to occlusion, and a multiresolution GeT (MRGeT). Applications of the GeT to pedestrian identity recognition, human body part segmentation, and image synthesis are illustrated. The proposed approach produces promising results when applied to fingerprinting the appearance of a human and body parts despite the presence of non-rigid deformations and articulated motion.

Index Terms—Geometric transform, multiresolution, appearance model, invariant feature, Radon transform, trace transform, image warping, level set, shape matching, interpolation, pedestrian identity recognition, human body part segmentation.
(Manuscript note: Partially funded by the ARDA/VACE program under contract 2004H80200000. J. Li and R. Chellappa are with the Center for Automation Research and the Department of Electrical and Computer Engineering, University of Maryland, College Park, MD 20742, USA. Emails: {lij,rama}@cfar.umd.edu. J. Li is now with Goldman Sachs & Co. in New York, NY. S. Kevin Zhou, corresponding author, is with the Integrated Data Systems Department, Siemens Corporate Research, Princeton, NJ 08540, USA. Email: [email protected].)

I. INTRODUCTION

In order to account for geometric variations between two image templates, the standard 2D affine transformation is used for registration before direct comparisons of the two appearances. However, when the appearance inside an arbitrary contour is to be modeled, a more general transform that can incorporate geometric transformations and include prior information about the shape is needed, especially when the contour has very large non-linear deformations or articulated motion of parts.

Many methods for incorporating geometric context into appearance models have been proposed in the literature. Active appearance models (AAM) [1], where the statistical behavior of a set of tracked feature points is modeled and these points are then used to generate a normalized appearance in the mean shape, have been proposed for image warping. However, AAM requires explicit detection of feature points and can only deal with small deformations that obey Gaussian distributions, making it ineffective for tasks like modeling the appearance of pedestrians. The most comprehensive approach is to have a full 3D generative model; attempts to recover the imaging process can then be carried out. For example, 3D models have been successfully used in face recognition [2] to deal with pose changes. However, this approach is computationally intensive, requires significant prior information, and offers no guarantee of convergence to the globally optimal solution; it is thus ineffective for low-resolution imagery [3]. Elastic Graph Matching (EGM) [4] is also a popular method to extract an appearance signature with some prior knowledge of the geometric structure. Gabor filters are used to extract features at fiducial points that have certain link architectures. Similar to AAM, EGM needs feature points and an explicit model. In [5] and [6], the authors use the trace transform to generate invariant features with respect to a group of affine transformations. Their approach uses properties of the transform to deal with 2D rigid motions as well as small non-linear deformations. It has the advantage of not using explicit models, but does not have the capacity to include complicated prior knowledge.

In this paper, we propose a general definition of the geometric transform (GeT), which can be designed to incorporate geometric prior knowledge, or the so-called geometric context, into appearance models. In the case of modeling the appearance inside a closed contour, the context is mostly inferred from the shape itself. Mathematically, a curve C is defined as a continuous mapping C : [0, 1] → R^2; a closed curve satisfies C(0) = C(1). To illustrate the concept of GeT, consider the following example. The signed distance function φ_C(x) [7] is widely used to represent the contour boundary. Each level set {x | φ_C(x) = ω} is the set of points at equal distance ω from the closed contour; in particular, φ_C(x) = 0 is the contour itself. We define the GeT as

R(\omega) = \int I(\mathbf{x})\,\delta(\phi_C(\mathbf{x}) - \omega)\,d\mathbf{x}, \qquad (1)

where I is the image intensity function. It will be shown in section III-A that the above GeT is invariant to translation, rotation, scaling, and even certain non-rigid deformations, and hence provides a better way to represent the appearance by simply leveraging the geometric context arising from the closed curve.

The GeT has very important applications in recognizing humans with articulated motion and in automatic body part segmentation after background subtraction. In these applications, objects have very large deformations and articulation, rendering rigid transformations insufficient to capture the correspondences. We provide a generic transform-based framework for appearance modeling of objects with large deformations. The resulting transform is used to represent the visual pattern with certain invariance before additional learning algorithms are applied. Guidelines for designing a GeT that transforms an image into a deformation-invariant model are given.

Three major contributions are made in this paper.
1) A unifying framework based on a generally defined geometric transform [8] is proposed in section II. Section
II also shows that image warping [9], the Radon transform [10][11][12], and the trace transform [6] are special instances of GeT.
2) Different ways of designing the geometric set and the functional of GeT are proposed to incorporate both explicit and implicit knowledge into appearance models. Five new types of GeTs are proposed. Level-set-based and shape-matching-based GeTs are proposed in section III; their selection schemes are based on the contour itself, so that large deformations and articulated motions can be handled. The feature-curve-based GeT in section IV goes beyond the feature-point-based warping used in AAM by using feature curves to find point-to-point correspondences. An occlusion-invariant GeT using a proper functional is proposed in section V. A multiresolution version of GeT is proposed in section VI and used for the combined representation of shape and appearance information at different scales.
3) GeT is a very useful tool for obtaining deformation-invariant appearance models. We focus on obtaining the signature of the appearance inside a contour. The GeT based on shape matching is used to build a geometric appearance model of walking humans. Section VII shows the application of our approach to the recognition of humans and body parts and to automatic body part segmentation.

II. GEOMETRIC TRANSFORM

Inspired by the difficulties of appearance modeling inside a contour and by the mathematical representation of the Radon transform, we give a unifying definition of GeT in this section. Many existing methods are shown to be special instances of GeT. Guidelines for designing a GeT to incorporate prior knowledge are also given.

A. Inspirations

1) Appearance inside a contour: The idea of the geometric transform first arises when the appearance inside a contour is modeled. The matching of two appearances can be viewed as a comparison of two 2D functions with compact support regions. For regular image matching, usually the two images are registered through an affine transform before the pixel-wise difference is computed. But for the appearance inside a contour, since the support region has an arbitrary shape, it is hard to find a transform for direct comparison. The role of such a transform is to align the corresponding parts, and the key to accurate matching is to find the correspondences. Once the correspondences inside the two regions are established, comparisons can be made directly in the transform domain. Note that the correspondences do not have to be pixel-wise; they can be curve-to-curve, region-to-region, or even volume-to-volume in the case of 3D data or videos. In Fig. 1, different types of correspondences are illustrated. Therefore, correspondences can be viewed as a mapping between two sets. If each set contains more than one point, certain statistics can be computed over that set, yielding features such as the mean, the variance, or even a histogram. These ideas form the basis of the geometric set and the geometric functional in our definition of the Geometric Transform.
Fig. 1. Illustration of different types of correspondence. (a) The curve correspondence across different views. (b) The region correspondence across different poses. Marked contours with the same color show the boundary of two corresponding regions.
Another important fact is that these correspondences can be based on either explicit or implicit prior knowledge. AAM uses an explicit model for faces [1], where the tracked feature points are used to generate the correspondences. But to model appearances inside an arbitrary contour, quite often there is no explicit model available. On the other hand, the contour boundary contains very important information about the correspondence, as it can be used to infer the correspondence implicitly. Further, such knowledge can be combined with feature points to generate additional correspondences. So our goal is to find a transform-domain representation in which the support regions of the two appearance patches are properly registered based on explicit or implicit knowledge. The registration is not necessarily pixel-wise. In the transform domain, geometric deformations are taken care of before direct comparisons are made.

2) Radon Transform: To build a transform that solves the problem of matching the appearance inside a contour, we first briefly review a special instance of GeT, the Radon transform, which has all the key elements of our framework. In 2D, the Radon transform [11], [12] is defined as

R(\theta, p) = \iint I(x, y)\,\delta(x\cos\theta + y\sin\theta - p)\,dx\,dy. \qquad (2)

The Radon transform applies integral operations to the image I(x, y) along a set of lines, as illustrated in Fig. 2. It can also be viewed as a line projection that yields a directional histogram. The Radon transform has been well studied in computed tomography (CT) [12]. In CT, the focus is more on image reconstruction based on the transform-domain representation. Given enough resolution in θ and p, the image can be fully reconstructed using filtered back-projection according to the Fourier slice theorem, or using algebraic reconstruction techniques. In computer vision, the basic use of the Radon transform has been for line detection, which is also referred to as the Hough transform [13]. Usually an edge detector is applied to an image, and lines are then detected by locating the peaks in the Radon transform of the edge map. In Fig. 2, an edge map and its Radon transform are illustrated. By changing the geometric sets into arbitrary shapes, such as circles, rectangles, or even non-parametric shapes, other shapes can be similarly detected.
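As an illustration only (the paper itself provides no code), the following minimal Python/SciPy sketch approximates the Radon transform of (2) by rotating the image and summing along one axis; locating peaks in the resulting sinogram is the Hough-style line detection described above. The function name and the synthetic edge map are our own.

import numpy as np
from scipy import ndimage

def radon_projection(image, angles_deg):
    # For each angle, rotate the image and integrate along the vertical
    # axis, so every column of the output is one line projection (Eq. (2)).
    projections = []
    for theta in angles_deg:
        rotated = ndimage.rotate(image, -theta, reshape=False, order=1)
        projections.append(rotated.sum(axis=0))
    return np.stack(projections, axis=1)  # shape: (p, number of angles)

# Example: a single bright line in a synthetic edge map shows up as a peak.
edge_map = np.zeros((64, 64))
edge_map[20, 10:50] = 1.0
sinogram = radon_projection(edge_map, np.arange(0.0, 180.0, 1.0))
peak_p, peak_theta = np.unravel_index(np.argmax(sinogram), sinogram.shape)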
Fig. 2. Illustration of Radon transform and its use for line detection. (a) Radon transform is essentially a line projection. (b) An edge map. (c) The Radon transform of the edge map in (b). Line detection is accomplished by locating the peaks in the transform domain and finding the lines corresponding to these peaks.
The Radon transform carries two important elements: the geometric sets of straight lines, and the functionals defined over these sets, which are integrals. Through arbitrary choices of these two elements, we provide a general definition of the geometric transform.

B. Definition of Geometric Transform

Suppose that an image I is a function I : Ω → R^q, where the support region of I is Ω ⊂ R^p, and that a subset of Ω indexed by ω, S(ω) ⊂ Ω, is equivalently defined by a region indicator function χ_{S(ω)} : Ω → {0, 1}. The Geometric Transform R(ω) is defined as a functional G of the two functions I and χ_{S(ω)}, mapping them into R^r:

R(\omega) = G[I; \chi_{S(\omega)}]. \qquad (3)
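To make the definition concrete, here is a minimal discrete sketch (ours, not the authors') of (3) for the common case in which the functional G is a statistic computed over the indicated set; the ring-shaped sets in the usage example are purely illustrative.

import numpy as np

def geometric_transform(image, region_indicators, functional=np.mean):
    # R(w) = G[I; chi_{S(w)}]: apply the functional to the image values
    # selected by each region indicator (a boolean mask per index w).
    return np.array([functional(image[mask]) for mask in region_indicators])

# Example: S(w) = pixels whose integer distance to the image centre equals w,
# and G = the mean intensity over that set.
h, w = 64, 64
yy, xx = np.mgrid[0:h, 0:w]
ring_index = np.hypot(yy - h / 2.0, xx - w / 2.0).astype(int)
image = np.random.rand(h, w)
masks = [ring_index == r for r in range(1, 30)]
signature = geometric_transform(image, masks)  # one scalar per ring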
While the above definition seems very general, the GeT has very practical implications, as clarified by the remarks below.
• Image function I : Ω → R^q. Usually it corresponds to the intensity function of an image. As Ω ⊂ R^p, a 2D color image has p = 2 and q = 3. In most cases considered in this paper, I is chosen as the image intensity function defined on a 2D compact region Ω ⊂ R^2, which is usually the inside of a closed contour. The domain of interest in (3) is not limited to an image plane that lies in R^2; it can be generalized to the x−y−t spatio-temporal domain, or to an x−y−n domain where n is the index of the camera when multiple cameras are available.
• Geometric set S(ω) and its associated function χ_{S(ω)}. First, the index ω can be viewed as the transform domain coordinate(s). Second, using different types of S(ω), we obtain different types of GeTs; for example, if S(ω) only contains a set of points, it is called a GeT based on point sets, whereas in the case of the Radon transform, S(ω) is a line set. Finally, the region function χ_{S(ω)} can be further generalized to have an arbitrary output value, χ_{S(ω)} : Ω → R. We will address this while introducing the multiresolution GeT in section VI.
• Geometric functional G[I; χ_{S(ω)}]. If we denote a function space as ℱ = {f | f : S(ω) → R^q}, then the resulting functional is G : ℱ → R^r. The only difference between a geometric functional and a regular functional is that it is also a function of the selected geometric set S(ω). For example, if the functional calculates the mean of a function, then the geometric functional gives the mean function value over the set S(ω).
• Geometric transform R(ω). As mentioned earlier, the index ω in R(ω) can be viewed as the transform domain coordinate(s). Regarding the dimension r of the transform, we mostly study the case r = 1 in this paper. Ideally, the scalar gives the signature of the image I.

The key of GeT is to embed the geometric context in the selection of the sets {S(ω)}. Intuitively, as discussed in section II-A1, a geometric set reflects the correspondences, and the functional G(·) extracts the feature vector by computing the desired statistics over the set S(ω). Details of how to select the set and the functional are provided in section III.

C. Special instances

Many existing transforms and methods are special cases of the general definition of GeT in (3).

1) Radon transform: As mentioned above, the Radon transform is a special case of GeT. In the n-dimensional Radon transform [10], [11], [12], the collection of sets S(ω) are hyperplanes parameterized by ω = (n, p), such that S(ω) = {x ∈ R^n | x^T n − p = 0}. The region function is χ_{S(n,p)}(x) = δ(x^T n − p). The functional G is an integral operating on the set S:

R(\mathbf{n}, p) = G[I; \chi_{S(\mathbf{n},p)}] = \int_{\Omega} I(\mathbf{x})\,\chi_{S(\mathbf{n},p)}(\mathbf{x})\,d\mathbf{x} = \int_{S(\mathbf{n},p)} I(\mathbf{x})\,d\mathbf{x}. \qquad (4)

2) Trace transform: In 2D, by changing the functionals defined on the line set, the Radon transform is generalized to the trace transform. The trace transform has been successfully used for object recognition in [5][6]. We give some examples of functionals in the trace transform. In the trace transform, the geometric set remains straight lines. The points x = (x, y) in the line set S(s, p), with s = (s_1, s_2) being the line normal and ω = (s, p), and the corresponding region function χ_{S(s,p)}(x, y), are represented as

x = p\,s_1 + t\,s_2, \quad y = p\,s_2 - t\,s_1, \quad -\infty < t < +\infty,
\chi_{S(s,p)}(x, y) = \delta(x - p\,s_1 - t\,s_2,\; y - p\,s_2 + t\,s_1).

One type of trace transform is defined as
R(\mathbf{s}, p) = G[I; \chi_{S(\mathbf{s},p)}] = \left( \iint |I(x, y)|^{q}\,\chi_{S(\mathbf{s},p)}(x, y)\,dx\,dy \right)^{r} = \left( \int |I(t)|^{q}\,dt \right)^{r}, \qquad (5)

which is a GeT. Other trace transforms using different geometric functionals G are GeTs too, such as

R = \frac{\int t\,I(t)\,dt}{\int I(t)\,dt}, \qquad R = \int |I'(t)|\,dt. \qquad (6)

In [6], the authors focus on designing combinations of functionals so that the extracted features vary or remain invariant under a group of affine transforms, which is very useful for recognition of appearances inside a contour. However, because
they limit the geometric sets to be straight lines, their methods lack the ability to capture object appearances with large nonrigid motions.

3) Image warping: Traditional geometric transformations of images [9], which include affine and perspective transformations, are instances of GeT. Usually, a transformation refers to a transform that only changes coordinates, with a one-to-one mapping between the original domain and the transform domain. In this case, ω can simply be the new coordinates in the transform domain, rewritten as ω = \tilde{\mathbf{x}} = (\tilde{x}, \tilde{y}). For an affine transformation, the geometric set S(\tilde{\mathbf{x}}) is

S(\tilde{\mathbf{x}}) = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \tilde{\mathbf{x}} + \begin{bmatrix} t_x \\ t_y \end{bmatrix}. \qquad (7)

For more complicated image warping, like the feature-point-based warping used in AAM [1], where two sets of feature points, \{\mathbf{x}^c_p\}_{p=1}^{P} and \{\tilde{\mathbf{x}}^c_p\}_{p=1}^{P}, are registered, the geometric set is given by

S(\tilde{\mathbf{x}}) = \Big\{ \sum_{i=1}^{P} h_i(\tilde{\mathbf{x}})\,\mathbf{x}^c_i \Big\}, \qquad (8)
where the function h_i(·) is an interpolator that satisfies h_i(\tilde{\mathbf{x}}^c_j) = 1 for i = j and h_i(\tilde{\mathbf{x}}^c_j) = 0 for i ≠ j. This way, S(\tilde{\mathbf{x}}^c_p) = \{\mathbf{x}^c_p\} is guaranteed. In [1], it is shown that by properly choosing h_i, the corresponding warping can be reduced to piecewise affine or thin-plate splines. Letting the corresponding function χ_{S(\tilde{\mathbf{x}})} be χ_{S(\tilde{\mathbf{x}})}(\mathbf{x}) = δ(\mathbf{x} − S(\tilde{\mathbf{x}})), the GeT form for image warping is

R(\tilde{\mathbf{x}}) = G[I; \chi_{S(\tilde{\mathbf{x}})}] = \int I(\mathbf{x})\,\chi_{S(\tilde{\mathbf{x}})}(\mathbf{x})\,d\mathbf{x} = I(S(\tilde{\mathbf{x}})). \qquad (9)

D. Designing GeT for appearance modeling

The special cases discussed above have proven very useful in appearance modeling. Here we further exploit the key elements of GeT and develop more complicated transforms for effective appearance modeling under different scenarios. As seen above, one fundamental form of GeT is

R(\omega) = \int I(\mathbf{x})\,\chi_{S(\omega)}(\mathbf{x})\,d\mathbf{x}. \qquad (10)

The existence of the above integral is an important theoretical question. While we believe that some carefully crafted non-measurable sets S(ω) could make the above integral undefined, the sets used in this paper are measurable. Proper selection of the set can help to find a representation with certain invariance properties. Ideally, the selected set incorporates meaningful prior knowledge or corresponds to regions of homogeneous distribution. Mostly, S(ω) in (3) reflects correspondences inferred from prior geometric knowledge. In this paper, we focus on geometric sets generated from a closed contour itself.

In Fig. 3, we illustrate the relations among all the GeTs discussed in this paper.

Fig. 3. The hierarchical graph of GeT families illustrates the relations between different types of GeTs.

Three different elements in the original Radon transform are changed: the geometric functional, the indicator function, and the geometric set. By changing the indicator function from a Dirac delta function to a Gaussian kernel, a multiresolution GeT is introduced in
section VI. According to the size of the geometric set, we have point-set-based and curve-set-based GeTs. According to different applications of the transform, we have the contour-driven GeT and traditional image warping. Detailed discussions on how to design each part of the framework are provided later.

III. CONTOUR-DRIVEN GeT

An interesting way of generating the set is from the contour itself. Neither explicit models nor feature points are required. Both curve sets and point sets can be generated from the contour as follows.

A. GeT based on level set

In the introduction, we mentioned that the widely used level set is an implicit way of representing the contour. The signed distance function φ(x) is the solution of the Eikonal equation [7], ||∇φ||_2 = 1. Another choice of φ(x) comes from the Poisson equation with the same boundary condition [14], Δφ = φ_xx + φ_yy = −1, which gives a smoother solution due to the use of second-order derivatives. If the Eikonal equation is used, each level set corresponds to the points with equal distance to the contour boundary. The geometric set can be generated from these level sets. As defined in (1), the GeT is given as

R(\omega) = \int I(\mathbf{x})\,\delta(\phi(\mathbf{x}) - \omega)\,d\mathbf{x}.

For ω > 0, the integral is over the level set inside the contour. When the contour translates or rotates, the relative position of each level set does not change, so the transform in (1) is translation and rotation invariant; it can easily be made scale invariant by rescaling ω. We empirically show that this selection of the set is particularly useful for modeling the appearance of a single-component contour with bending and small distortion, for example when it is applied to modeling human arms in section VII-A. In Fig. 4(d), we plot the curves of R(ω) displaying the average intensity of arm images for different levels ω. For comparison, we also show in Fig. 4(e) the curves of average intensities of arm images along rigid straight lines of slope -1. The curves in Fig. 4(d) are tightly clustered together, indicating that the GeT based on the level set is less sensitive to bending.
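A minimal sketch of how the level-set GeT can be computed in practice, assuming the contour is supplied as a binary mask; the bin count and the use of distance_transform_edt are our choices, and normalizing the levels by the maximum distance is one way to obtain the scale invariance mentioned above.

import numpy as np
from scipy import ndimage

def level_set_get(image, mask, num_levels=20):
    # Distance of each interior pixel to the boundary defines the level sets;
    # the signature is the average intensity per distance bin (Eq. (1)).
    phi = ndimage.distance_transform_edt(mask)
    edges = np.linspace(0.0, phi.max(), num_levels + 1)
    omega = 0.5 * (edges[:-1] + edges[1:])
    signature = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sel = (phi > lo) & (phi <= hi) & (mask > 0)
        signature.append(image[sel].mean() if sel.any() else 0.0)
    return omega, np.array(signature)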
Fig. 4. Illustration of the contour-driven GeT applied to bending. (a)(b) The level sets of two contours. (c) Two arm images. (d)(e) The average intensities of the r, g, and b color components along (d) the level sets for the two arm images and (e) straight lines of slope -1.

Fig. 5. Illustration of the modified GeT based on a level set designed for human arms. (a)(b) The skeleton of the contour is obtained by thinning the contour and dividing it into upper and lower parts. (c)(d) The geometric sets consist of points with equal distance to the upper or lower skeleton. Each color represents a geometric set. (e) The GeT gives the average intensity of the two arm images in Fig. 4 over the geometric sets for the upper part of the arm. (f) The GeT for the lower part of the arm.
The GeT provides rough correspondences between these curves generated from the level set. In practice, since the contour boundary is noisy and may change the topology of the level set, a modified version can be used instead. First, the contour region is thinned to a skeleton as in Figs. 5(a) and (b). Then the level-set curves are generated from the distance transform with respect to the skeleton. The modified level set still keeps the topology shown in Fig. 4, but each set contains points with equal distance to the skeleton instead of the contour boundary. These sets are marked with different colors in Figs. 5(c) and (d). The above method is good for a contour with a single skeleton. For improved modeling of the appearance of human arms, these sets are further divided into upper and lower parts according to whether the point is closer to the upper or the lower skeleton. The functional calculates the average intensity over these two sets. In Figs. 5(e) and (f), we see that the resulting transforms of the two arm images are very close to each other. This modified transform is used for invariant matching of bending human arms in section VII-A.
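For illustration, a compact sketch of the modified transform, assuming the silhouette is given as a binary mask; it measures distances to a thinned skeleton rather than to the boundary, and omits the upper/lower split used for arms. Function and parameter names are our own.

import numpy as np
from scipy import ndimage
from skimage.morphology import skeletonize

def skeleton_level_get(image, mask, num_levels=10):
    # Same idea as the level-set GeT, but distances are measured to a
    # thinned skeleton instead of the contour boundary.
    skeleton = skeletonize(mask > 0)
    dist = ndimage.distance_transform_edt(~skeleton)  # distance to the skeleton
    inside = mask > 0
    edges = np.linspace(0.0, dist[inside].max(), num_levels + 1)
    signature = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sel = (dist > lo) & (dist <= hi) & inside
        signature.append(image[sel].mean() if sel.any() else 0.0)
    return np.array(signature)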
B. GeT based on shape matching

The GeT based on shape matching integrates shape matching and image warping: given two closed contours, shape matching is invoked to establish the correspondence of points along the two contours, and then the images are warped to establish the correspondence of points inside the two contours. Note that conventional image warping often uses feature points, such as corners or points with large curvature, that are automatically detected or manually specified. Using the GeT based on shape matching, we are able to obtain an appearance model independent of shape deformations, so that pixel-wise comparisons can be made, which is useful for applications such as appearance-based recognition of pedestrians at different poses. More details are given in section VII-A.

To generate the corresponding point sets, we focus on contour-based shape matching methods [15][16][17][18], among which a descriptor called shape context [16][18] finds correspondences without feature points. We incorporate this idea into GeT and use it to model the appearance of pedestrians, where the objects have articulated motion of parts and self-occlusions. A GeT based on shape matching is defined as follows: suppose we have two 2D regions Γ_0 and Γ_1 bounded by two contours C_0 and C_1, respectively. Denote the intensity function in region Γ_0 as A_0. Let the two contours be represented by the points sampled on them, i.e., C_0 = \{\mathbf{x}^c_i \mid i = 1, ..., N_0\} and C_1 = \{\tilde{\mathbf{x}}^c_i \mid i = 1, ..., N_1\}. Then, by applying any shape matching method to the two sets of points, a one-to-one mapping between subsets of them is found as \tilde{\mathbf{x}}^m_i \leftrightarrow \mathbf{x}^m_i, for i = 1, ..., M and M ≤ min(N_0, N_1). We design geometric sets for the interpolated dense correspondences as

S(\tilde{\mathbf{x}}) = \Big\{ \sum_{i=1}^{M} h_i(\tilde{\mathbf{x}})\,\mathbf{x}^m_i \Big\}, \qquad (11)
for x ∈ Γ_0, \tilde{x} ∈ Γ_1, and h_i(·) satisfying h_i(\tilde{\mathbf{x}}^m_j) = 1 for i = j and h_i(\tilde{\mathbf{x}}^m_j) = 0 for i ≠ j, as in section II-C3. The corresponding GeT, denoted R_{Γ_0Γ_1}, is the image warping function based on shape matching between Γ_0 and Γ_1. The GeT R_{Γ_0Γ_1} transforms the appearance A_0 inside the contour C_0 to the appearance A_1 inside the contour C_1.
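The following sketch illustrates how R_{Γ0Γ1} can warp the appearance once matched contour points are available. It is not the authors' implementation: the thin-plate-spline interpolator is replaced by simple normalized inverse-distance weights (which only approximately satisfy h_i(x̃_j^m) = δ_ij), and points are assumed to be stored in (row, column) order.

import numpy as np

def shape_matching_warp(src_image, src_pts, dst_pts, dst_mask):
    # Warp the appearance A0 (src_image, region Gamma0) into the region
    # Gamma1 (dst_mask) using M matched contour points:
    # S(x~) = sum_i h_i(x~) x_i^m, with h_i normalized inverse-distance weights.
    out = np.zeros(dst_mask.shape, dtype=float)
    ys, xs = np.nonzero(dst_mask)
    for y, x in zip(ys, xs):
        d = np.hypot(dst_pts[:, 0] - y, dst_pts[:, 1] - x) + 1e-6
        h = (1.0 / d) ** 2
        h /= h.sum()
        sy, sx = (h[:, None] * src_pts).sum(axis=0)  # interpolated source point
        out[y, x] = src_image[int(round(sy)), int(round(sx))]
    return out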
IV. FEATURE CURVE/SKELETON BASED GeT

Point sets are used in the GeT based on shape matching and in AAM. These sets are generated from matched contour points or feature points. But sometimes we only have correspondences between feature curves or skeletons. Direct interpolation becomes difficult since the points on a curve may be nearly collinear, as illustrated in Fig. 6(b). In this section, a feature-curve-based point set generation is proposed. The feature curves can be curves along the intensity edges of the image, as in Fig. 1(a), or they can be curves inferred from skeletons. The GeT based on the latter is used as an interface between skeleton-based shape matching [19] and appearance modeling. The skeleton can be generated using morphological operations, the medial axis space [19], principal curves, or some prior model space [20].

For example, in order to deal with bending, we can use the correspondence between the two skeletons of the shapes in Fig. 6. The local coordinate system along the skeleton can be specified through differential geometry: at every point on the skeleton, the y-axis in Fig. 6 is the tangent vector of the curve, while the x-axis is the normal vector. This coordinate system can be used to generate dense point-to-point correspondences by retaining the local coordinates at each point.
The feature-curve-based GeT is complementary to the GeT based on shape matching when no canonical shapes are available or when feature curves can be tracked more reliably. For example, the appearance of a human with arbitrary articulations cannot be handled by the methods in section III-B, but it is possible to use the skeleton-based GeT. One may argue that the feature curves can be represented by a set of points, and then an interpolation similar to (11) can be used. But all the interpolation methods, including thin-plate splines, only work well for the region inside or near the convex hull of these points. So when the convex hull of the feature curves does not cover most of the contour region, as in the case of Fig. 6(a), simple interpolations are not effective.

Fig. 6. Illustration of mapping through local coordinate systems defined by skeletons. Images (a,b) show how to map P1 to P0 according to the local coordinate system at Q1, which is the closest point to P1 on curve C1. C0 and C1 are matched curves and Q0 corresponds to Q1. Image (c): The skeletons of the arm images in Fig. 4 and the synthetic appearance generated using the skeleton-based GeT from one arm to another are shown.

Fig. 7. (a,b) Illustration of the skeleton-based transform for the appearance of a human with articulations: (a) The skeleton of a human silhouette. (b) Results of part segmentation. (c) Illustration of reconstructing the convex hull from the support of the transform. Ω is the original contour. Π is the convex hull of this contour. From the support of the transform R along each direction, the convex hull Π can be found.

A GeT based on feature curves is defined as follows. Suppose that we have two matched curves as in Fig. 6, C_0 = \{(x(s), y(s)) \mid s ∈ [0, 1]\} and C_1 = \{(\tilde{x}(s), \tilde{y}(s)) \mid s ∈ [0, 1]\}, and for s ∈ [0, 1],

S((\tilde{x}(s), \tilde{y}(s))) = \{(x(s), y(s))\}; \qquad (12)

then one simple way of generating S((\tilde{x}, \tilde{y})) for any point (\tilde{x}, \tilde{y}) inside the contour is as follows:
1) Define a local coordinate system for every point on C_0 and C_1 that reflects the local correspondences. For example, the tangent and normal vectors of the curve at that point can be chosen as bases. For end points, joints, or discontinuities, the system needs to be chosen carefully.
2) For each P_1 = (\tilde{x}, \tilde{y}) inside the contour, find Q_1 = arg min_{Q ∈ C_1} |Q − P_1|. Then find the local coordinates of P_1 at point Q_1, denoted (x_loc, y_loc).
3) S((\tilde{x}, \tilde{y})) is the point P_0 = (x, y) that has the local coordinates (x_loc, y_loc) at point Q_0, the corresponding point of Q_1 on curve C_0.

The corresponding GeT is denoted R_{C_0C_1}. In Fig. 6, we illustrate synthetic images of human arms obtained from a skeleton-based GeT. Here the skeleton is obtained by thinning the shape. The synthetic images are very close to the real ones.

For a more complicated skeleton-based GeT that can be used for synthesizing the appearance of a human with arbitrary articulations, the following steps can be followed. First, associate each segment of the skeleton with a body part; the human silhouette is then divided into parts according to which skeleton segment each point is closest to. This is illustrated in Fig. 7. Second, for each body part, the GeT discussed above is used. The transform of each part is applied in order, from the part farthest from the camera to the closest part, to deal with occlusions. Also note that, in order to have a smooth synthesis near the boundary between body parts, a small margin is added to each part boundary. This method is illustrated in the experimental section.
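A minimal sketch of steps 1)-3), assuming the two matched curves are given as equal-length point arrays with corresponding indices; tangents are estimated by finite differences, and the handling of end points is deliberately naive.

import numpy as np

def map_via_matched_curves(p1, curve1, curve0):
    # Map a point P1 near curve C1 to P0 near the matched curve C0 by
    # preserving its local (normal, tangent) coordinates (steps 1-3).
    # curve0, curve1: (N, 2) arrays of corresponding points (same index).
    diffs = curve1 - p1
    k = int(np.argmin(np.hypot(diffs[:, 0], diffs[:, 1])))  # closest point Q1
    t1 = curve1[min(k + 1, len(curve1) - 1)] - curve1[max(k - 1, 0)]
    t1 = t1 / (np.linalg.norm(t1) + 1e-9)      # tangent at Q1 (local y-axis)
    n1 = np.array([-t1[1], t1[0]])             # normal at Q1 (local x-axis)
    rel = p1 - curve1[k]
    x_loc, y_loc = rel @ n1, rel @ t1          # local coordinates of P1
    t0 = curve0[min(k + 1, len(curve0) - 1)] - curve0[max(k - 1, 0)]
    t0 = t0 / (np.linalg.norm(t0) + 1e-9)
    n0 = np.array([-t0[1], t0[0]])
    return curve0[k] + x_loc * n0 + y_loc * t0  # P0 with the same local coords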
Fig. 8. Images 1 and 2: The original image and its skeleton. Images 3,4 and 5: Synthesis results, the ground truth and the skeleton. Next 3 images: Another set of synthetic imagery.
We illustrate the above skeleton-based GeT using image synthesis. Assuming that the corresponding skeletons across frames are known, we generate the appearance of a human with articulated motion in subsequent frames from the GeT of the first frame. The results are shown in Fig. 8 together with the ground truth. We observe that the curve-based GeT can handle articulated motion with large nonlinear deformation and occlusions. Although the synthetic images have artifacts due to out-of-plane rotation, etc., the model is simple and can be further developed for accurate image-based rendering.
V. SELECTION OF THE GEOMETRIC FUNCTIONAL

In (10), the geometric functional can be varied over the geometric set, as in the trace transform [6], to obtain different statistics. Some examples of functionals are listed in section II-C. We now show a useful geometric functional that helps to deal with occlusions. The following GeT can be used to find the average intensity over the set S:

R = \frac{\int I(\mathbf{x})\,\chi_S(\mathbf{x})\,d\mathbf{x}}{\int H(I(\mathbf{x}))\,\chi_S(\mathbf{x})\,d\mathbf{x}}, \qquad (13)

where H(·) is the Heaviside function and we set H(0) = 0. The function H(I(x, y)) gives the mask of the contour region. We now show why the GeT in (13) is insensitive to occlusions that do not change the convex hull of the shape. By letting the set S contain straight lines, the GeT becomes

R(\theta, p) = \frac{\int I(x, y)\,\delta(x\cos\theta + y\sin\theta - p)\,dx\,dy}{\int H(I(x, y))\,\delta(x\cos\theta + y\sin\theta - p)\,dx\,dy}. \qquad (14)

In (14), if I is constant inside Ω, then the transform is constant when the line passes through the contour region and zero otherwise, as shown in Fig. 7(c). Therefore, the shape information is partly lost: from the support of the transform, we can only reconstruct the convex hull of the contour. This reconstruction is well studied in computational geometry. All contours that have the same convex hull and the same intensity inside have identical GeTs, so the GeT in (14) is unable to differentiate among these appearances. However, it becomes useful when the appearance inside the convex hull, rather than the exact shape, is to be modeled. Fig. 9 illustrates an example in which a human walks sideways to the camera and the torso is partly occluded, but the average intensity along each line does not change much even in the presence of occlusion. The GeT in (14) can then be effective.
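A small sketch of (14) using scikit-image's radon: dividing the sinogram of the masked intensity by the sinogram of the mask gives the average intensity along every line. The tolerance used to avoid division by zero is our own choice.

import numpy as np
from skimage.transform import radon

def average_intensity_along_lines(image, mask, angles=None):
    # Eq. (14): Radon transform of the masked intensity divided by the Radon
    # transform of the mask = average intensity along every line (theta, p).
    theta = np.arange(180.0) if angles is None else angles
    num = radon(image * mask, theta=theta, circle=False)
    den = radon(mask.astype(float), theta=theta, circle=False)
    return np.where(den > 1e-6, num / np.maximum(den, 1e-6), 0.0)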
Fig. 9. Two illustrations of partially occluded human torsos as examples of when the contour changes but the convex hull remains similar. Images 1 to 4 contain a partially occluded torso, the ground-truth appearance inside the convex hull containing the torso, the reconstructed appearance from the Radon transform of image 1, and the reconstructed appearance obtained by applying filtered back-projection to the average intensity times the Radon transform of the convex hull, as in (15). Images 5 to 8 show another set of illustrations, as in images 1 to 4. Note that in image 6, the ground-truth image has outliers because the arm occludes the torso.
To illustrate the accuracy of such a representation, we compare the reconstructions from two different transforms (shown in Fig. 9). The first is from the original Radon transform of the function I(x) inside Ω, as in (2); the reconstruction is through filtered back-projection. As observed from the figure, the exact shape and appearance are recovered, but the appearance in the missing part is not inferred. The second is from the GeT defined in (14), which finds the average intensity along each line. From the support of the GeT, the binary mask of the convex hull is obtained as χ_Π. The Radon transform of I(x) inside the convex hull Π is then estimated as

\tilde{R}(\theta, p) = R(\theta, p) \int \chi_{\Pi}(x, y)\,\delta(x\cos\theta + y\sin\theta - p)\,dx\,dy, \qquad (15)

which is the product of the average intensity along each direction and the corresponding Radon transform of the binary mask. Finally, reconstruction is achieved by applying filtered back-projection to the estimated Radon transform \tilde{R}. This reconstruction gives an estimate of the appearance inside the convex hull. A comparison of this estimate with the true appearance inside the convex hull in Fig. 9 shows a fairly accurate reconstruction. Thus, using such a GeT helps to represent the appearance under partial occlusions. In section VII-A, a method based on this GeT is used for fingerprinting torsos with occlusions. It essentially gives a very good estimate of the average intensity along each line, even for regions of non-uniform intensity. Finally, it is interesting to note that the above reconstruction, which fills in the occluded pixels in the convex hull of a contour, implements a form of image inpainting [21].
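A sketch of the reconstruction in (15), assuming the convex-hull mask has already been computed and that the average-intensity transform was obtained with the same image size and angle set; iradon performs the filtered back-projection.

import numpy as np
from skimage.transform import radon, iradon

def reconstruct_inside_hull(avg_line_intensity, hull_mask, angles=None):
    # Eq. (15): multiply the average intensity along each line by the Radon
    # transform of the convex-hull mask, then apply filtered back-projection.
    theta = np.arange(180.0) if angles is None else angles
    r_hull = radon(hull_mask.astype(float), theta=theta, circle=False)
    r_tilde = avg_line_intensity * r_hull
    return iradon(r_tilde, theta=theta, circle=False)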
VI. MULTIRESOLUTION ANALYSIS

The resolution problem becomes significant when using explicit model-based methods, because it is not reliable to impose point-to-point correspondences. Here we propose a multiresolution geometric transform (MRGeT) that can deal with noisy observations and inexact contour extraction at the proper scale, as well as properly combine the appearance and shape information. Specifically, we can change χ(·) in (10) into the following kernel function:

\delta_{\epsilon}(x) = \frac{1}{\sqrt{2\pi\epsilon^2}} \exp\Big(-\frac{x^2}{2\epsilon^2}\Big), \qquad (16)

where ε determines the resolution of the kernel. Note that lim_{ε→0} δ_ε(x) = δ(x). If x is the distance of a point from the geometric set S, then replacing χ(·) with δ_ε in (10) produces a weighted integral over the neighborhood of the set S. Such a kernel is also denoted χ^ε_S(x). In the case of the basic Radon transform, the function δ_ε corresponds to a line spread function and ε determines the width of the spread.

Fig. 10. (a) The shape mask of an occluded torso. (b) The appearance of an occluded torso. (c) The Radon transform of the shape mask in (a). (d) The result of (c) convolved with a kernel function with ε2 = 10, which is the denominator in (17). By choosing a large ε2, this denominator becomes almost constant, especially in the central region.

Fig. 11. Illustration of the MRGeTs of Fig. 10(a) with different selections of (ε1, ε2) and different top-one recognition rates. In order to compare with the average intensity values in (a), all transform values are kept in the original scale; values above one are clipped to one, which causes some saturation. (a): (0,0). The MRGeT in (17) becomes identical to (14), which calculates the average intensity along each line, and has a recognition rate of 66.2%. (b): (0,10). The MRGeT in (17) becomes approximately a scaled Radon transform, with a recognition rate of 55.7%. (c): (2,0). This MRGeT corresponds to the worst recognition rate, 2.6%, among the extreme selections of ε1 and ε2 shown in Table I(b). (d): (2,3). This MRGeT corresponds to the best recognition rate, 69.8%, when ε1 = 2 along the fourth row of Table I(a). (e): (2,10). Rate: 59.0%. (f): (4,4.5). Rate: 69.8%.

The multiresolution representation can be used to combine the shape and appearance information. Consider introducing a kernel to the functional defined in (13). The key is to use different resolution parameters in the numerator and the denominator. For the basic geometric set of straight lines, we have

R(\theta, p) = \frac{\int I(x, y)\,\delta_{\epsilon_1}(x\cos\theta + y\sin\theta - p)\,dx\,dy}{\int H(I(x, y))\,\delta_{\epsilon_2}(x\cos\theta + y\sin\theta - p)\,dx\,dy}. \qquad (17)

By changing ε1 and ε2, we achieve different representations for various purposes.
1) When ε1 → 0 and ε2 → 0, R(θ, p) corresponds to the average intensity over a straight line, as discussed in section V.
2) When ε1 → 0 and ε2 → +∞, the denominator becomes almost constant. For example, in Fig. 10(d), ε2 = 10 and
the denominator in (17) has a uniform intensity pattern, especially in the central region. So in this case R(θ, p) is a scaled Radon transform of I(x, y), and it can be fully reconstructed using filtered back-projection.
3) When ε1 → +∞ and ε2 → 0, the numerator in (17) becomes almost constant, so that most appearance information is lost, while the denominator retains all the shape information. This is the reverse of case (2). In the torso recognition test discussed later, such a transform gives very poor performance.
4) Other combinations of ε1 and ε2 give a combined representation of shape and appearance at different resolutions. ε1 adjusts the resolution of the appearance. ε2 varies depending on whether we need to model the appearance in the actual shape or in its convex hull. Using a bigger ε2 allows a more accurate description of the actual shape, since R(θ, p) is closer to a scaled Radon transform. Using a smaller ε2 makes R(θ, p) closer to the average intensity along the line, thus modeling the appearance inside the convex hull; this helps to handle occlusions that do not change the convex hull of the shape. If the shape is close to its convex hull, then occlusion does not pose a serious issue and ε2 can be bigger, so that we have a fully reconstructible transform.

The numerator and denominator can also be obtained by convolving the Radon transforms of I(x, y) and H(I(x, y)) with Gaussian filters along the direction of p. In Fig. 11, we illustrate six MRGeTs with different selections of (ε1, ε2).

The results of a recognition test carried out to study how to select (ε1, ε2) to handle occlusions are now presented. The appearances of occluded torsos from the USF database are matched using MRGeT with different (ε1, ε2), and the recognition rates are illustrated in Fig. 12 and Table I. The torso recognition experiment has 71 classes. The gallery has one image per class and the probe set has twenty-eight images per class. We classify each test image to one of these classes according to its distances to the gallery images in the MRGeT domain. The size of the torso images is less than 52 × 25, as illustrated in Fig. 10(b). Details of the USF database can be found in section VII-A. Fig. 12(a) shows the histogram of the occlusion ratios of the torso images; this ratio is one minus the ratio of the area inside the contour to the area of the convex hull. Fig. 12(b) shows the top-one recognition rate as a surface as (ε1, ε2) changes. The exact rates are in Table I(a). Fig. 12(c,d) show the recognition rates for fixed ε1 and ε2, respectively. In addition, Table I(b) shows the rates for some extreme choices of (ε1, ε2).
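Following the remark above about Gaussian smoothing along p, a minimal sketch of the MRGeT in (17); the argument names eps1 and eps2 play the roles of ε1 and ε2, and the example call in the comment uses hypothetical torso variables.

import numpy as np
from scipy.ndimage import gaussian_filter1d
from skimage.transform import radon

def mrget(image, mask, eps1, eps2, angles=None):
    # Eq. (17) via sinogram smoothing: blur the Radon transforms of the
    # intensity and of the shape mask along p with widths eps1 and eps2,
    # then take their ratio.  eps = 0 means no smoothing.
    theta = np.arange(180.0) if angles is None else angles
    num = radon(image * mask, theta=theta, circle=False)
    den = radon(mask.astype(float), theta=theta, circle=False)
    if eps1 > 0:
        num = gaussian_filter1d(num, sigma=eps1, axis=0)  # axis 0 is p
    if eps2 > 0:
        den = gaussian_filter1d(den, sigma=eps2, axis=0)
    return np.where(den > 1e-6, num / np.maximum(den, 1e-6), 0.0)

# Hypothetical example: the torso setting used in section VII-A.
# signature = mrget(torso_image, torso_mask, eps1=1.5, eps2=2.0)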
Fig. 12. (a) The histogram of occlusion ratios for torsos over the whole database. (b) The surface constructed from the top-one recognition rates for different (ε1, ε2), as shown in Table I(a). (c) The top one, two, and three recognition rates as ε1 = 2.0 and ε2 varies. (d) The top one, two, and three recognition rates as ε2 = 2.5 and ε1 varies.
From these tests, we observe the following:
• ε1 ≤ ε2 gives better performance. From Figs. 12(b)(c)(d) and Table I, we observe very low recognition rates when ε1 > ε2. In extreme cases, such as ε1 = 2.0, ε2 = 0.0, the rate is as low as 2.6%, which is similar to case (3): large values of ε1 combined with small values of ε2 tend to smooth out the appearance and cause the numerator in (17) to be almost constant. The appearance information is lost while the shape information does not suffice for recognition, so the recognition rate drops. Therefore, we should always choose ε2/ε1 ≥ 1.
• For a given ε1, selecting ε2 slightly larger than ε1 gives the best performance. For example, (1.5, 2.5) gives the best rate, 70.5%, among all tests. As ε2 gets even larger, the rate drops, because the GeT is closer to a scaled Radon transform and does not handle occlusion as well.
• Increasing both ε1 and ε2 while keeping ε2/ε1 close to and greater than one does not degrade the performance much. Although increasing ε1 reduces the resolution of the appearance, the increased ε2 ensures that the numerator in (17) is averaged over a larger support region. In our experiment, the color pattern of each subject is simple and does not have much texture, so the setting is similar to modeling the appearance of an object with solid color inside, and reducing the resolution of both shape and appearance does not greatly degrade the result. For this dataset, ε1 = 1.5–2 is a good choice.

TABLE I
(a) Top-one recognition rates (%) for different (ε1, ε2). (b) Top-one recognition rates (%) for extreme choices of ε1 and ε2.

(a)
ε1 \ ε2   0.5   1.0   1.5   2.0   2.5   3.0   3.5   4.0   4.5   5.0   5.5   6.0
0.5      67.1  68.2  67.3  66.2  64.8  63.2  62.2  61.1  60.0  58.8  58.6  57.9
1.0      13.0  67.8  69.3  69.3  67.9  67.3  65.5  64.4  63.3  62.0  61.1  60.4
1.5      12.1  26.2  67.9  69.2  70.5  69.4  68.3  66.8  65.3  64.4  63.5  63.0
2.0      12.1  16.2  42.1  67.7  69.2  69.8  69.3  68.6  67.6  66.9  65.6  65.2
2.5      12.1  14.3  23.8  51.0  68.1  69.1  69.0  69.2  68.9  68.3  67.0  66.5
3.0      12.1  13.2  19.7  33.4  56.4  68.3  69.5  68.7  68.4  68.4  67.9  66.9
3.5      12.1  12.7  17.8  26.1  42.3  60.2  68.8  69.7  68.8  67.6  67.7  67.4
4.0      12.0  12.5  16.8  23.3  33.0  48.2  62.9  69.1  69.8  69.1  67.4  67.0
4.5      12.2  12.5  15.7  21.3  29.1  39.9  53.6  65.1  69.1  69.9  69.3  67.4
5.0      11.9  12.0  14.8  19.7  26.2  33.7  45.5  57.5  66.5  69.3  70.0  69.6
5.5      12.1  12.1  14.5  18.5  24.2  31.1  39.7  50.5  60.1  67.0  69.5  70.2
6.0      11.8  11.8  14.0  18.4  23.0  29.8  36.8  44.7  55.2  63.2  68.7  70.2

(b)
(ε1, ε2)   (0, 0.0)  (0, 10.0)  (2, 0.0)  (2, 10.0)  (4, 0.0)  (4, 10.0)
rate         66.2      55.7       2.6       59.0       2.7       62.1

Another nice property of R(θ, p) in (17) is that it still carries the properties of a basic Radon transform with respect to the similarity transform. Suppose that

\tilde{I}(x, y) = I(T(x, y)) = I(s(x\cos\alpha + y\sin\alpha) + t_x,\; s(-x\sin\alpha + y\cos\alpha) + t_y);

then the transform of \tilde{I} can easily be shown to be

\tilde{R}(\theta, p) = R(\theta - \alpha,\; t_x\cos(\theta - \alpha) + t_y\sin(\theta - \alpha) + s\,p). \qquad (18)

Following (18), registration with respect to the similarity transform can easily be obtained. If we register two contours by first aligning their centroids and then scaling them according to the ratio of their areas, the only unknown in (18) is α, which corresponds to a translation in the transform domain. Thus it is easy to match the appearances inside two contours when they are related by a 2D similarity transform.

VII. APPLICATIONS

As GeT is a generic transform, it is applicable to various tasks [22]. Here we explore in detail the applications of GeT to pedestrian recognition, part segmentation, and video retrieval, leveraging its invariance to deformation and occlusion and its multiresolution properties.

A. Pedestrian recognition

We design GeTs to incorporate geometric context into appearance modeling for objects with articulated motion, bending, and local deformations. The articulated appearance variations of humans and body parts provide very good examples for our study. It is useful to model the appearance of humans because sometimes the appearance is more reliable
than gait, for example, in the application of persistent tracking. Here we focus on linking the identity of humans to the representation of appearances in the transform domain.
Fig. 13. Samples from the USF database for three classes: walking pedestrians with manually segmented body parts. The first set of images is from the gallery; the second set of images is from the probe set.
We use the USF database [23]. Fig. 13 shows some sample images with their body parts manually segmented. The size of each image is around 125 × 72. There are 71 classes (i.e., 71 different people) in this dataset. For each class, we have one image in the gallery and 28 in the probe set, taken under similar conditions. We classify each probe image according to its distance to the gallery images; the distance is defined accordingly in each of the tested approaches. We report the results using top-one recognition rates and cumulative matching characteristic (CMC) curves. We implemented both holistic and part-based approaches to pedestrian recognition. Given a query image, the part-based approach uses its known part information and designs a different GeT for each part; the holistic approach does not use the part information and treats the image as a whole.¹
(¹ However, six canonical pose templates are used in our experiments, and the part information of those pose templates is used.)

1) Holistic GeT approach to pedestrian recognition: Using the GeT based on shape matching defined in section III-B, we propose a pose-invariant representation to model the appearance of a pedestrian. We use images of pedestrians without background pixels. It is difficult to compare them directly because of the differences in pose and in the size of the silhouette. However, if we focus on the side view of the person, a walking human usually has six typical poses, as in Fig. 14(a). Although each person may have a different shape and walk differently, the topology of body parts remains roughly the same for different people at the same pose. We can use this property to normalize appearances at the same pose. By normalization we mean warping the appearance to be inside a canonical shape through GeT for pixel-wise comparison. This way, shape variations of different subjects are handled in a fashion similar to obtaining the normalized appearance of a face inside a mean shape in the AAM.
Fig. 14. (a) Canonical silhouettes of six typical poses of a walking human, along with the segmentation of body parts, taken from the USF database [23]. (b) Shape matching between a pedestrian's silhouette and the canonical silhouette at a similar pose. The corresponding points are used for the GeT R_{Γ0Γj}. (c) Shape matching between parts at different poses, used to generate the GeT R_{γ^k_j γ^k_i}. Here we show the matching of the head, left arm, and left lower leg. By applying the GeT to each part in a certain order, the appearance can be transformed from one pose to another. (d) The first image is the sample image and the second image is the synthetic image at the closest pose, followed by the synthesized normalized appearance at the remaining five typical poses.

So given only one image of a pedestrian with an arbitrary pose, we can obtain the normalized appearance of the pedestrian at all other poses, as illustrated in Fig. 14(d), by using the GeT based on shape matching. We assume we have canonical silhouettes at six typical poses {Γi | i = 1, ..., 6}, as shown in Fig. 14(a), along with their eight-part segmentations {γ^k_i | i = 1, ..., 6, k = 1, ..., 8}. The six poses are found by clustering all training silhouettes. The eight parts are the head, torso, left arm, right arm, left upper leg, left lower leg, right upper leg, and right lower leg. The eight-part segmentation of the six poses is known in advance. We first normalize the pedestrian's appearance inside the canonical silhouette for the closest pose, before synthesizing the pedestrian's normalized appearance at other poses. Denote the appearance inside the pedestrian's silhouette Γ0 as A0. Here are the two steps:
• Use shape matching to find the most similar pose, j = argmin_{i=1,...,6} MatchCost(Γ0, Γi). Use the GeT R_{Γ0Γj} to find the normalized appearance Aj inside Γj, as illustrated in Fig. 14(b).
• Synthesize from pose j to pose i. For body part k, transfer the appearance a^k_j inside γ^k_j to the appearance a^k_i inside γ^k_i by applying R_{γ^k_j γ^k_i} to a^k_j, as in Fig. 14(c). Transform each part in order, from the part farthest from the camera to the part closest to the camera, so that self-occlusions are handled.

The final representation of the appearance does not depend on the initial pose. The results in Fig. 14 show that the designed GeTs capture the structure and deformation of the parts very well. After applying the GeT, appearances inside two contours can be compared directly in the transform domain. Here, for shape matching, we use the inner-distance-based shape context method [18], which is insensitive to articulations, and the h_i(·)'s are chosen to generate thin-plate spline interpolations. Fig. 14 shows how GeT can deal with large deformations and articulations without feature points, while AAM is known to be ineffective for large deformations of feature points that do not obey Gaussian distributions [1]. The idea of modeling appearances based on shape matching has been illustrated in [16], but here we formulate it using a GeT that can handle articulations and self-occlusions, and apply it to modeling the appearances of real-world objects.

To improve robustness, we transform each image in the probe set not only to the normalized appearance at its closest pose but also to the corresponding mirror pose. By a mirror pose, we mean two similar silhouettes with different topologies of parts, such as poses 1 and 4 in Fig. 14(a). Then we match the transformed image with the normalized gallery image at the same pose and choose the closest match. This way we accomplish human identification without part segmentation.

For comparison, we implemented two approaches. The first approach is based on template matching: after compensating for the rigid similarity transformation, we calculate the summed squared distance between the two images. The second approach uses the SIFT representation [24], which is invariant to scale and rotation and robust to affine and pose/view changes and illumination variation [25]. There is a large body of literature on descriptors that are invariant to a large class of motions. However, none of them has been tested on characterizing the appearance of pedestrians for pedestrian identity recognition; they are mostly used for pedestrian detection, object categorization, etc. Pedestrian detection aims to localize pedestrian(s) in an image (e.g., separating the pedestrian and non-pedestrian classes). Object categorization aims to recognize the pedestrian class among other classes like car, airplane, football, chair, etc. Identity recognition is based on the pedestrian image (and is also referred to as pedestrian recognition in this paper). Pedestrian identity recognition is clearly more challenging, as it must distinguish images belonging to different pedestrians captured at different viewpoints and under different illuminations. Here we investigate the use of SIFT in characterizing complex objects like pedestrians. Since SIFT outputs a histogram, we calculate the Bhattacharyya coefficient as the distance metric.

The top-one recognition rate is reported in Table II and the CMC curves are in Fig. 16(d). The holistic GeT approach outperforms the other two holistic approaches by a significantly large margin: the top-one recognition rate for the holistic GeT approach is 69%, while the rates for the other two approaches are below 40.9%. The CMC curve of the holistic GeT not only stands much higher than the curves of holistic template matching and SIFT but also converges to one more
quickly. This clearly demonstrates the robustness of the holistic GeT to nonrigid deformation.

2) Part-based GeT approach to pedestrian recognition: We design a GeT for each body part and study the matching of the exact appearances of parts as well as the combined recognition of humans. The task of matching the exact appearances of parts is still not easy because of low resolution, poor imagery quality, and errors in segmentation. We apply a different transform to each body part according to its motion and possible occlusions. Because the right arms and right legs are often severely occluded in the dataset, we did not match these parts for recognition purposes.
Our choices of GeT are as follows:
• Head: use the MRGeT in (17) with ε1 = 1.5 and ε2 = 4; ε2 is set to 4 because the actual shape is close to its convex hull.
• Torso: since occlusion needs to be considered while the convex hull of the shape does not change much, we use the MRGeT in (17) with ε1 = 1.5 and ε2 = 2 to deal with occlusions.
• Left arm: we use the modified GeT based on a level set, as discussed in section III-A and illustrated in Fig. 5.
• Left upper leg: mainly 2D rigid motion; we choose ε1 = 1.5 and ε2 = 3 in (17) to deal with occlusion by the arm.
• Left lower leg: mainly 2D rigid motion; we choose ε1 = 1.5 and ε2 = 4.
Whenever the MRGeT is used, its property in (18) is used to align the two parts. For comparison, we also implemented part-based template matching and SIFT approaches; for the torso region, we use only the appearance inside the convex hull to reduce the effect of occlusion. For combined recognition, we classify the probe image to the class that has the least weighted distance. The results for matching each body part, as well as the heuristically chosen weights, are shown in Fig. 15 and Table II. Note that the distances (except the Bhattacharyya coefficient for the SIFT approach) calculated for matching a specific part are normalized to a standard log-normal distribution in order to overcome the difference in the range of these part-specific distances. Fig. 16(a,b,c,d) shows the cumulative matching curves for matching each body part and the combined recognition results; a sketch of this combination scheme follows the figure captions below.
Fig. 15. Sample results for matching body parts using the GeT. Probe images from three classes are illustrated, corresponding to the subjects in Fig. 13. Each class has five images per part: the first image is the probe image, the second image is the correct match in the gallery using the GeT for parts, and the next three images show the top three matches in the gallery. The ranks of the correct match for each class and each part are, from top to bottom: 1, 3, 50 for the head; 3, 1, 1 for the torso; 18, 11, 1 for the left arm; 5, 10, 2 for the left upper leg; and 4, 2, 3 for the left lower leg. The ranks of the correct match of the human obtained by combining parts are 1, 1, 1 for the part-based GeT approach and 6, 13, 10 for the part-based template matching approach.
Fig. 16. (a,b,c) Cumulative matching curves (CMC) of all parts for the part-based methods (recognition rate versus number of top matches, with curves for the head, torso, left arm, left upper leg, and left lower leg). (d) Combined recognition rate of the human appearance for all six approaches.
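For concreteness, the following is a minimal sketch of the combined-recognition step described above, not the exact implementation: per-part gallery distances are mapped to a standard log-normal distribution by standardizing their logarithms and are then combined with the heuristic part weights of Table II. The callable part_distance is a hypothetical placeholder for the GeT-based (or template/SIFT) distance of one body part.

import numpy as np

# Heuristic part weights from Table II (head, torso, left arm, left upper leg, left lower leg).
PART_WEIGHTS = {"head": 0.20, "torso": 0.40, "left_arm": 0.13,
                "left_upper_leg": 0.13, "left_lower_leg": 0.13}

def normalize_log_normal(distances):
    # Map positive distances to a standard log-normal distribution by
    # standardizing their logarithms (zero mean, unit variance in log space).
    logs = np.log(np.asarray(distances, dtype=float))
    return np.exp((logs - logs.mean()) / (logs.std() + 1e-12))

def combined_recognition(probe, gallery, part_distance):
    """Rank gallery entries by the weighted sum of normalized per-part distances.

    part_distance(probe, gallery_item, part) -> float is a hypothetical callable
    standing in for the distance used to match one body part."""
    total = np.zeros(len(gallery))
    for part, weight in PART_WEIGHTS.items():
        d = np.array([part_distance(probe, g, part) for g in gallery])
        total += weight * normalize_log_normal(d)
    return np.argsort(total)   # gallery indices, best match first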
The part-based GeT outperforms the other two part-based approaches both for each body part and collectively. Specifically, the GeT gives more than 16% higher recognition rates for the torso and the left upper leg, mainly due to its ability to handle occlusions. Overall, matching the exact appearances of parts is a difficult task, as Fig. 15 shows. The left arm usually contains very few pixels and is very blurred, so the contour-driven GeT achieves only about a 6% higher rate than rigid templates. For the left lower leg, the GeT is only slightly better, because this part mostly undergoes 2D rigid motion with no occlusion. For overall recognition, despite the non-rigid motion of the probe images with respect to the gallery images, the top-one recognition rate of the part-based GeT is as high as 93.0%, while the holistic GeT gives 69.0%. It is interesting to observe that the holistic GeT even outperforms the part-based template matching and SIFT approaches; this again confirms that the designed GeT handles articulation better, even though the latter two approaches use the part information. The poor performance of both the holistic and part-based SIFT approaches is probably due to the lack of distinctive texture in the images, as the pedestrians in the dataset mostly wear clothes of uniform color.

B. Human body part segmentation and video retrieval

In this section, the holistic GeT approach is applied to body part segmentation, part-based human identification, and video retrieval on the Honeywell database². In the Honeywell database, there are nine subjects wearing thirty sets of clothing, recorded by a single camera as they walk along similar paths, as illustrated in Fig. 17. People change part of their clothing in different videos, so there are two different kinds of identity: clothing identity ID1 and person identity ID2.

² The sequence is obtained from Honeywell Corporation under the HSARPA contract 433829 monitored by the Office of Naval Research.
TABLE II. Top one recognition rates (%) for the pedestrian recognition experiment.

Part                          | Head | Torso | Left arm | Left upper leg | Left lower leg | All
Holistic GeT                  |  -   |   -   |    -     |       -        |       -        | 69.0
Holistic template matching    |  -   |   -   |    -     |       -        |       -        | 36.0
Holistic SIFT                 |  -   |   -   |    -     |       -        |       -        | 40.9
Part-based GeT                | 53.8 | 69.2  |   27.0   |      36.4      |      30.5      | 93.0
Part-based template matching  | 36.4 | 52.4  |   21.6   |      20.4      |      26.2      | 64.9
Part-based SIFT               | 42.5 | 40.8  |   12.4   |      19.7      |      12.8      | 40.4
Weights                       | 0.2  | 0.4   |   0.13   |      0.13      |      0.13      |  -
ID1 requires the same person with the same clothing; ID2 requires the same person but allows different clothing. Each subject with the same clothing has one or two non-overlapping short sequences of about 21 frames each, for a total of 54 short sequences. Based on ID1 there are 30 classes, and based on ID2 there are only 9 classes. Our goal is to match these short sequences: given one short video, we try to retrieve similar videos from the database according to both ID1 and ID2. A system that segments the body parts automatically and extracts the signature of their appearance is used to accomplish this goal. Compared with the USF database, this set is more difficult because of noisy background subtraction and the lack of good canonical templates of typical poses like those in Fig. 14(a). However, the key assumption of the algorithm still holds: each body part has a similar topology for different people at the same pose.
Fig. 17. Honeywell database along with background subtraction results. It contains 54 short sequences; there are 30 classes (each considered as a subject) based on ID1 and 9 classes based on ID2. From left to right, except for subjects one and six, groups of four neighboring subjects are the same person, i.e., have the same ID2. For example, subjects two to five are the same person wearing different clothing.
Fig. 19. Normalized appearance of pedestrians along with part segmentation and appearance signature extraction. (a) The original image, the background subtraction result, and the "normalized" appearance Ãfp in the smooth silhouette. (b) Each column of the original image is followed by two columns showing the part segmentation for the smooth silhouette, marked with the two dominant colors of each body part estimated through a Gaussian mixture model.
Fig. 18. Illustration of how to construct the shape space. (a) For pose one, all the appearances and masks in the training data are 'normalized' using the GeT based on shape matching. These masks are used to construct a shape space. (b) Top row: average masks for each typical pose. Middle row: appearance inside the mean shapes, which are obtained by thresholding the first row. Bottom row: manual part segmentation of the mean shapes; it contains eleven parts, as listed in Table III.

Our method works as follows. After constructing a normalized shape space, we perform a manual part segmentation of the mean shapes; the results are shown in Fig. 18. These steps constitute the training phase of our algorithm. Following the ideas in steps (c) and (d), when a query sequence comes in, its frames are first temporally aligned with the mean shapes; then each silhouette is normalized and projected onto the shape space to be smoothed. The smoothed silhouette is again matched with the mean shapes, and the GeT based on this matching is used to produce the part segmentation; see Fig. 19. Finally, since each part has many samples from all the frames, instead of the small region in a single image as in section VII-A, the color features can be extracted more reliably. Video retrieval and human identification are based on these features; implementation details are given in [22]. As seen in Fig. 19(b), the segmentation and the color signature are quite accurate. The video retrieval tests based on the color histogram of each body part are shown in Fig. 20. In these tests, the query video is matched against the remaining 53 sequences, and the top three matches based on the color of each body part are illustrated. The Bhattacharyya distance is used as the measure between histograms. The part-based retrieval gives very interesting results. For example, the leg-based retrieval finds other subjects wearing jeans of similar colors, and changes in clothing of the lower body do not affect the retrieval results based on the torso or the arm.
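As a rough sketch of the color-signature and retrieval steps just described (assumed details: 8-bit RGB frames, per-part color histograms pooled over all frames of a sequence, a Gaussian mixture for the two dominant colors as in Fig. 19(b), and the Bhattacharyya coefficient turned into a distance), one could proceed as follows; the function names are illustrative, not the paper's.

import numpy as np
from sklearn.mixture import GaussianMixture

def part_color_histogram(frames, part_masks, bins=8):
    # Pool the RGB values of one body part over all frames of a sequence
    # and build a normalized joint color histogram with bins^3 cells.
    pixels = np.vstack([f[m] for f, m in zip(frames, part_masks)])   # (N, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins, bins, bins),
                             range=((0, 256),) * 3)
    return hist.ravel() / max(hist.sum(), 1)

def dominant_colors(pixels, n=2):
    # Two dominant colors of a part estimated with a Gaussian mixture model.
    gmm = GaussianMixture(n_components=n, covariance_type="diag").fit(pixels)
    return gmm.means_

def bhattacharyya_distance(h1, h2):
    # Bhattacharyya distance between two normalized histograms.
    bc = np.sum(np.sqrt(h1 * h2))        # Bhattacharyya coefficient
    return -np.log(max(bc, 1e-12))

def retrieve_top_k(query_hists, database, part, k=3):
    """Return indices of the k database sequences closest to the query for one part.

    query_hists and database[i] are dicts mapping a part name to its histogram."""
    d = [bhattacharyya_distance(query_hists[part], seq[part]) for seq in database]
    return np.argsort(d)[:k]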
Fig. 20. Illustration of retrieval results based on the color of each body part. The triplet of images in the first column includes a sample image from the query video along with the image showing two dominant colors of each body part. The other two columns show the top three retrieval results based on each body part in the following order: 1. head, 2. right lower arm, 3. torso, 4. right upper leg, 5. right lower leg, and 6. right shoe.
To give some quantitative analysis of our method, two experiments are carried out. Experiment I uses one sequence from each subject as the gallery. Note that these thirty
sequences are selected from the same walking videos used for shape space construction, but each sequence only partially overlaps with the sequences selected for shape learning. The remaining twenty-four sequences make up the probe set. The correct match is decided according to ID1, so there are thirty classes in total. Experiment II matches each sequence against the other fifty-three sequences; the correct match is decided using ID2, so there are nine classes and fifty-four test sequences. The matching results for each body part and the combined recognition rates are shown in Table III. The weights for the combined recognition are heuristically chosen, based roughly on the area of each body part. For Experiment I, we observe that the torso, right lower arm, head, and right upper arm give the best recognition results, meaning they are reliable cues for identifying the same person in the same clothing. For Experiment II, the head shows the highest rate, because subjects with the same ID2 have the same hair color, making it the most reliable cue for determining ID2; apart from the head, the torso and right upper arm also give very high rates. Since there may be more than one sample of the same class in the gallery, the rates at which all top two or top three matches belong to the correct class are also listed. The combined top-one recognition rates are 95.8% and 96.3% for Experiments I and II, respectively.
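The rates reported in Table III can be computed from a probe-gallery distance matrix with bookkeeping of the following kind; this is only a sketch under the assumption that ties are broken by sort order, and the function names are illustrative.

import numpy as np

def top_n_rate(dist, probe_labels, gallery_labels, n):
    # Fraction of probes whose correct class appears among the n closest gallery items
    # (the EXP1-1..EXP1-3 and EXP2-1..EXP2-3 rows use n = 1, 2, 3).
    hits = 0
    for i, row in enumerate(dist):
        top = np.argsort(row)[:n]
        hits += any(gallery_labels[j] == probe_labels[i] for j in top)
    return hits / len(dist)

def all_top_k_correct_rate(dist, probe_labels, gallery_labels, k):
    # Fraction of probes for which *all* of the k closest gallery items are correct
    # (the EXP2A2 and EXP2A3 rows use k = 2, 3).
    hits = 0
    for i, row in enumerate(dist):
        top = np.argsort(row)[:k]
        hits += all(gallery_labels[j] == probe_labels[i] for j in top)
    return hits / len(dist)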
VIII. CONCLUSION
In summary, a general definition of the geometric transform is given to unify the Radon transform, the trace transform, and image warping. We show how to design each element of the GeT, particularly the geometric set and the functional, to incorporate geometric context into appearance modeling. We also propose a multiresolution representation using kernel functions. The GeT is shown to be useful in a broad range of applications. Future work includes further exploration of the contour-driven GeT and the design of GeTs for videos or multi-view sequences. We will also conduct a theoretical sensitivity analysis of the GeT with respect to segmentation quality. For the contour-driven GeT, the geometric set determines the sensitivity. For the GeT based on a level set, the choice of level set is crucial: using the Poisson equation [14] generates a smoother level-set solution than the signed distance, and hence the sensitivity is lower. The GeT based on shape matching uses the inner-distance shape context [18] to generate the geometric set, which is robust to deformation and articulation, and hence is less sensitive to segmentation artifacts. The MRGeT, which uses a set of lines as the geometric set, is sensitive to poor segmentation, but a wider smoothing kernel (e.g., a larger ε in δε(x) in (16)) yields a less sensitive representation.

REFERENCES
[1] T. Cootes and C. Taylor, "Statistical models of appearance for medical image analysis and computer vision," in Proc. SPIE Medical Imaging, 2001.
[2] V. Blanz and T. Vetter, "Face recognition based on fitting a 3D morphable model," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 9, 2003.
[3] S. Zhou and R. Chellappa, "Image-based face recognition under illumination and pose variations," Journal of the Optical Society of America A, vol. 22, pp. 217–229, 2005.
[4] M. Lades, C. Vorbrüggen, J. Buhmann, J. Lange, C. von der Malsburg, R. Würtz, and W. Konen, "Distortion invariant object recognition in the dynamic link architecture," IEEE Trans. on Computers, vol. 42, pp. 300–311, 1993.
[5] A. Kadyrov and M. Petrou, "The trace transform and its applications," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 8, pp. 811–828, 2001.
[6] M. Petrou and A. Kadyrov, "Affine invariant features from the trace transform," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 1, pp. 30–44, 2004.
[7] S. Osher and N. Paragios, Geometric Level Set Methods. Springer-Verlag, 2003.
[8] J. Li, S. Zhou, and R. Chellappa, "Appearance modeling under geometric context," in Proc. IEEE Conf. on Computer Vision, 2005, pp. 1252–1259.
[9] G. Wolberg, Digital Image Warping. IEEE Computer Society Press, 1994.
[10] L. Ehrenpreis, The Universality of the Radon Transform. Clarendon Press, Oxford, 2003.
[11] A. Jain, Fundamentals of Digital Image Processing. Prentice-Hall, Inc., 1989.
[12] A. C. Kak and M. Slaney, Principles of Computerized Tomographic Imaging. Soc. of Industrial and Appl. Math., 2001.
[13] D. A. Forsyth and J. Ponce, Computer Vision: A Modern Approach. Prentice Hall, 2003.
[14] L. Gorelick, M. Galun, E. Sharon, R. Basri, and A. Brandt, "Shape representation and classification using the Poisson equation," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, June 2004, pp. 61–67.
[15] R. Veltkamp and M. Hagedoorn, "State-of-the-art in shape matching," Utrecht University, Tech. Rep. UU-CS-1999-27, 1999.
[16] S. Belongie, J. Malik, and J. Puzicha, "Shape matching and object recognition using shape contexts," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 4, pp. 509–522, 2002.
[17] M. Li and C. Kambhamettu, "Nonrigid point correspondence recovery for planar curves using Fourier decomposition," in Proc. of Asian Conf. on Computer Vision, Jan. 2004.
[18] H. Ling and D. Jacobs, "Using the inner-distance for classification of articulated shapes," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, June 2005.
[19] T. Sebastian, P. Klein, and B. Kimia, "Recognition of shapes by editing their shock graphs," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 5, pp. 550–571, May 2004.
[20] T. Tian, R. Li, and S. Sclaroff, "Articulated pose estimation in a learned smooth space of feasible solutions," in Proc. of IEEE Workshop on Learning in Computer Vision and Pattern Recognition, 2005.
[21] M. Bertalmío, G. Sapiro, V. Caselles, and C. Ballester, "Image inpainting," in Proceedings of SIGGRAPH, 2000.
[22] J. Li, Appearance Modeling Using a Geometric Transform for Object Recognition. Ph.D. dissertation, University of Maryland, 2006.
[23] P. Phillips, S. Sarkar, I. Robledo, P. Grother, and K. Bowyer, "The gait identification challenge problem: data sets and baseline algorithm," in Proc. IEEE Conf. on Pattern Recognition, 2002, pp. I:385–388.
[24] D. Lowe, "Distinctive image features from scale invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.
[25] K. Mikolajczyk and C. Schmid, "A performance evaluation of local descriptors," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 2003.
Jian Li (S'04) received his B.S. in Electronic Engineering from Tsinghua University, Beijing, in 2001. In Aug. 2006, he received his Ph.D. degree from the Department of Electrical and Computer Engineering, University of Maryland, College Park. During his studies at Maryland, he published papers on fingerprinting humans and vehicles for persistent tracking through camera networks, structure from planar motion, and geometric transform based appearance modeling and image segmentation. He was an intern with Siemens Corporate Research, Princeton, NJ, in 2005. Since Aug. 2006, he has been working as an Equity Strategist at Goldman Sachs & Co.
TABLE III. Recognition rates (%) for the Honeywell database. Columns: HEAD; RUA: right upper arm; RLA: right lower arm; LAM: left arm; TORSO; RUL: right upper leg; RLL: right lower leg; LUL: left upper leg; LLL: left lower leg; RFT: right foot; LFT: left foot; ALL: all parts combined. Rows: EXP1-1 to EXP1-3: top one, two, and three recognition rates for Experiment I; EXP2-1 to EXP2-3: top one, two, and three recognition rates for Experiment II (the percentage of query videos whose correct match is among the top n matches); EXP2A2 and EXP2A3: the percentage of queries for which all top two or top three matches are correct in Experiment II; WGT: the heuristic weight of each part for combined recognition.

        | HEAD  | RUA   | RLA   | LAM   | TORSO | RUL   | RLL   | LUL   | LLL   | RFT   | LFT   | ALL
EXP1-1  | 75.0  | 70.8  | 79.2  | 62.5  | 87.5  | 66.7  | 50.0  | 62.5  | 58.3  | 45.8  | 62.5  | 95.8
EXP1-2  | 87.5  | 100.0 | 87.5  | 83.3  | 100.0 | 75.0  | 62.5  | 79.2  | 79.2  | 70.8  | 75.0  | 100.0
EXP1-3  | 87.5  | 100.0 | 91.7  | 83.3  | 100.0 | 83.3  | 79.2  | 83.3  | 79.2  | 75.0  | 79.2  | 100.0
EXP2-1  | 100.0 | 96.3  | 85.2  | 87.0  | 98.1  | 81.5  | 79.6  | 75.9  | 85.2  | 83.3  | 77.8  | 96.3
EXP2-2  | 100.0 | 96.3  | 87.0  | 94.4  | 100.0 | 88.9  | 90.7  | 88.9  | 90.7  | 87.0  | 88.9  | 98.1
EXP2-3  | 100.0 | 98.1  | 90.7  | 94.4  | 100.0 | 94.4  | 94.4  | 92.6  | 92.6  | 88.9  | 92.6  | 100.0
EXP2A2  | 85.2  | 72.2  | 51.9  | 61.1  | 66.7  | 59.3  | 48.1  | 55.6  | 55.6  | 55.6  | 44.4  | 66.7
EXP2A3  | 79.6  | 38.9  | 18.5  | 31.5  | 42.6  | 31.5  | 22.2  | 33.3  | 27.8  | 31.5  | 25.9  | 42.6
WGT     | 0.161 | 0.081 | 0.081 | 0.048 | 0.161 | 0.081 | 0.161 | 0.048 | 0.048 | 0.065 | 0.065 |  -
Shaohua Kevin Zhou (S'01–M'04) received his B.E. degree in Electronic Engineering from the University of Science and Technology of China, Hefei, China, in 1994, his M.E. degree in Computer Engineering from the National University of Singapore in 2000, and his Ph.D. degree in Electrical Engineering from the University of Maryland at College Park in 2004. He then joined Siemens Corporate Research, Princeton, New Jersey, as a research scientist; currently he is a project manager. Dr. Zhou has general research interests in signal/image/video processing, computer vision, pattern recognition, machine learning, and statistical inference and computing, with applications to biomedical image analysis (especially biomedical image context learning), biometrics, and surveillance (especially face and gait recognition). He has written two research monographs: he is the lead author of the book Unconstrained Face Recognition (with R. Chellappa and W. Zhao, Springer) and a coauthor of the book Recognition of Humans and Their Activities Using Video (with A. Roy-Chowdhury and R. Chellappa, Morgan & Claypool Publishers). He has edited a book on Analysis and Modeling of Faces and Gestures (with Zhao, Tang, and Gong, Springer LNCS), has published over 60 book chapters and peer-reviewed journal and conference papers, and holds over 30 provisional and issued patents. He has served on the technical program committees of premier computer vision and medical imaging conferences, gave a tutorial talk on Surveillance Biometrics at ICIP 2006, and organized the third international workshop on Analysis and Modeling of Faces and Gestures (AMFG) in conjunction with ICCV 2007. He was identified as a Siemens Junior Top Talent in 2006.
Rama Chellappa (S'78–M'79–SM'83–F'92) received the B.E. (Hons.) degree from the University of Madras, India, in 1975 and the M.E. (Distinction) degree from the Indian Institute of Science, Bangalore, in 1977. He received the M.S.E.E. and Ph.D. degrees in electrical engineering from Purdue University, West Lafayette, IN, in 1978 and 1981, respectively. Since 1991, he has been a Professor of electrical engineering and an affiliate Professor of computer science at the University of Maryland, College Park. He is also affiliated with the Center for Automation Research (Director) and the Institute for Advanced Computer Studies (Permanent Member). Currently, he holds a Minta Martin Professorship in the College of Engineering. Prior to joining the University of Maryland, he was an Assistant Professor (1981–1986), Associate Professor (1986–1991), and Director of the Signal and Image Processing Institute (1988–1990) at the University of Southern California, Los Angeles. Over the last 27 years, he has published numerous book chapters and peer-reviewed journal and conference papers on image and video processing, analysis, and recognition. He has also co-edited/co-authored six books on neural networks, Markov random fields, face/gait-based human identification, and activity modeling. His current research interests are face and gait analysis, 3D modeling from video, automatic target recognition from stationary and moving platforms, surveillance and monitoring, hyperspectral processing, image understanding, and commercial applications of image processing and understanding. Dr. Chellappa has served as an associate editor of four IEEE Transactions. He was a co-Editor-in-Chief of Graphical Models and Image Processing and served as the Editor-in-Chief of the IEEE Transactions on Pattern Analysis and Machine Intelligence during 2001–2004. He served as a member of the IEEE Signal Processing Society Board of Governors during 1996–1999 and as its Vice President of Awards and Membership during 2002–2004. He has received several awards, including an NSF Presidential Young Investigator Award in 1985, three IBM Faculty Development Awards, the 1990 Excellence in Teaching Award from the School of Engineering at USC, and the 2000 Technical Achievement Award from the IEEE Signal Processing Society. He co-authored the 1992 Best Industry Related Paper (with Q. Zheng) and the 2006 Best Student Authored Paper in the Computer Vision Track (with A. Sundaresan), both presented at the International Conference on Pattern Recognition. He was elected as a Distinguished Faculty Research Fellow (1996–1998) and as a Distinguished Scholar-Teacher (2003) at the University of Maryland. He is a co-recipient (with A. Sundaresan) of the 2007 Outstanding Innovator Award from the Office of Technology Commercialization and received the A. J. Clark School of Engineering 2007 Faculty Outstanding Research Award. He is a Golden Core Member of the IEEE Computer Society and also received its Meritorious Service Award in 2004. He is serving as a Distinguished Lecturer of the IEEE Signal Processing Society and will receive the Society's Meritorious Service Award at ICASSP 2008. He is a Fellow of the IEEE and the International Association for Pattern Recognition. He has served as a General and Technical Program Chair for several IEEE international and national conferences and workshops.