2010 International Conference on Pattern Recognition
A Robust Approach for Person Localization in Multi-camera Environment
Luo Sun, Huijun Di, Linmi Tao and Guangyou Xu
Tsinghua National Laboratory for Information Science and Technology
Tsinghua University, 100084 Beijing, P.R. China
{sunluo00, dhj98}@mails.tsinghua.edu.cn, {linmi, xgy-dcs}@tsinghua.edu.cn
Abstract—Person localization is fundamental in human-centered computing, since a person must be localized before being actively served. This paper proposes a robust approach to localizing a person based on geometric constraints in a multi-camera environment. The proposed algorithm has several advantages: 1) no assumption on the positions and orientations of the cameras, except that the cameras should share a certain common field of view; 2) no assumption on the visibility of a particular body part (e.g., feet), except that a portion of the person should be observed in at least two views; 3) reliability in terms of tolerating occlusion, body posture change and inaccurate motion detection. It can also provide error control and be further extended to measure person height. The efficacy of the approach is demonstrated on challenging real-world scenarios.

Keywords—person localization; multi-camera
I. INTRODUCTION
Person localization in indoor environments is important in many applications, such as visual surveillance and ambient assisted living. Recently there has been increasing research interest in human-centered computing [1]. The main motivation is that computers should be able to adapt to people rather than vice versa. For instance, in a smart home for the health care of elderly people, users move freely in the scene, and HCI systems should be able to seamlessly interact with them at any site within the scene. In this regard, knowing the users' location (i.e., person localization) is a precondition for human-centered computing.

There are several ways of localization, including vision-based methods, RFID-based methods, and so on. Vision-based methods are more attractive due to advantages such as no need to wear a sensor. This paper focuses on vision-based methods of person localization which place no restrictions on a person's movement or posture, and which are robust in real situations. However, enabling free movement and posture variation gives rise to difficulties caused by occlusion and the complicated articulated motion of the human body. In this situation, a logical step is to make use of multiple cameras so as to recover information that might be missed in a particular view.

Under a multi-camera configuration, point-wise matching based stereo is one possible option for localization; e.g., the 3D location of a person can be obtained from estimated dense disparities [2], or by direct 3D feature tracking [3]. However, reliable estimation of dense disparities or 3D feature tracking under articulated motion can hardly be guaranteed. Consequently, it is attractive to accomplish person localization without dealing with the point-wise correspondence problem, as in [4]-[7]. In [4], the ground location of a person is obtained by estimating the cluster centroid of a likelihood map on the ground plane that is determined according to color information. Under a homographic constraint, the feet region of a person can be obtained by multiplying the warped multi-view foreground likelihood maps [5], or by computing the intersection of the projected multi-view foreground images [6]. In these methods, either the likelihood calculation based on color information is prone to being unreliable and relies heavily on assumptions about the colors of people's clothes, or the feet must be visible in at least two views, which is not always guaranteed, especially in indoor environments. Only the method in [7] neither relies on color information nor assumes the visibility of the feet; there, the ground location of a person is obtained by projecting the principal axes detected in each view onto the ground and calculating their intersection point. However, it assumes that every camera is oriented such that a perpendicular line in the scene projects to an approximately vertical line in the image, which requires that the camera's image plane be perpendicular to the ground plane, or that the target person be sufficiently far away from the camera. This assumption is usually reasonable for outdoor environments, but it introduces a systematic error in indoor environments where the assumption is not well satisfied.

Inspired by [7], this paper proposes a robust vision-based approach for person localization which eliminates this systematic error without the need for full calibration. We introduce the perpendicular projection of a camera's optical center and explore the multi-view geometric constraints regarding it. By employing these constraints, the localization problem is finally posed as a linear optimization problem. The proposed approach has several advantages: 1) no assumption on the positions and orientations of the cameras, except that the cameras should share a certain common field of view; 2) no assumption on the visibility of a particular body part (e.g., feet), except that a portion of the person should be observed in at least two views; 3) reliability in terms of tolerating occlusion, body posture change and inaccurate motion detection. Our approach can also provide error control regarding the localization result and can be further extended to measure person height via a simple geometric relationship.
II. MULTI-VIEW GEOMETRIC CONSTRAINTS
Suppose a planar surface $\pi$ in space. In the context of localization, $\pi$ is usually the ground plane on which the person moves. In this section, when we say "perpendicular", we mean "perpendicular to $\pi$".

Let us first define the homography mapping of an arbitrary line in space. Suppose the image plane $I$ of a camera facing $\pi$, and assume the homography $H$ from $I$ to $\pi$ is known. Let $UV$ denote an arbitrary line in space and $uv$ the corresponding line on $I$, as shown in Fig. 1. $U_gV_g$ is the homography mapping of $UV$ on $\pi$ under the camera, where $U_g$ (or $V_g$) is the intersection point of $\pi$ and the line going through the camera's optical center $C$ and $U$ (or $V$). The plane coordinates of $U_g$ (or $V_g$) on $\pi$ can be calculated by directly applying the homography $H$ to $u$ (or $v$).

Figure 1. Homography mapping of an arbitrary line.

Figure 2. Three lines perpendicular to plane $\pi$, namely $P_1Q_1$, $P_2Q_2$ and $P_3Q_3$ ($P_1$, $P_2$ and $P_3$ are on $\pi$), watched by two cameras from distinct viewpoints.
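To make this concrete, here is a minimal numpy sketch of the mapping just defined: the plane coordinates of $U_g$ and $V_g$ follow from applying $H$ directly to the image endpoints $u$ and $v$. The function names and the example homography are our own illustration, not from the paper.

```python
import numpy as np

def map_point(H, p):
    """Apply the image-to-ground homography H (3x3) to image point p."""
    q = H @ np.array([p[0], p[1], 1.0])  # lift to homogeneous coordinates
    return q[:2] / q[2]                  # back to plane coordinates

def map_line(H, u, v):
    """Homography mapping U_g V_g of the image line uv onto the ground:
    per Sec. II, H is applied directly to the endpoints u and v."""
    return map_point(H, u), map_point(H, v)

# Hypothetical homography and image line endpoints:
H = np.array([[ 0.9,  0.1, 5.0],
              [-0.2,  1.1, 3.0],
              [1e-4, 2e-4, 1.0]])
U_g, V_g = map_line(H, u=(120.0, 40.0), v=(125.0, 230.0))
```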
In a multi-camera environment, it can be proved that the following two constraints regarding perpendicular lines hold.

Constraint 1: The homography mappings of the same perpendicular line on $\pi$ under different cameras are concurrent, and the intersection point is the line's perpendicular projection. For instance, in Fig. 2 the homography mappings of $P_1Q_1$ on $\pi$ under the two cameras intersect at $P_1$, which is the perpendicular projection of $P_1Q_1$.

Constraint 2: The homography mappings of perpendicular lines at different locations on $\pi$ under the same camera are concurrent, and the intersection point is the perpendicular projection of the camera's optical center. E.g., in Fig. 2 the homography mappings of $P_1Q_1$, $P_2Q_2$ and $P_3Q_3$ on $\pi$ under the first camera intersect at $C_g^1$, which is the perpendicular projection of the optical center $C^1$.

In [7], Constraint 1 is used to determine the ground location of a person, by projecting the principal axes of the person in each view onto the ground and calculating their intersection point. By assuming that a perpendicular line in the scene projects to an approximately vertical line in the image plane, the detection of the principal axis is reduced to estimating the vertical axis's horizontal position from the extracted foreground pixels.
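For illustration, Constraint 1 can be exercised in homogeneous coordinates, where the line through two points and the intersection of two lines are both cross products; this is essentially the intersection step of [7]. The coordinates below are invented so that both mapped lines pass through the foot point $(2, 1)$.

```python
import numpy as np

def line_through(a, b):
    """Homogeneous line through 2D points a and b (cross product)."""
    return np.cross([a[0], a[1], 1.0], [b[0], b[1], 1.0])

def intersect(l1, l2):
    """Intersection of two homogeneous lines, as a 2D point."""
    p = np.cross(l1, l2)
    return p[:2] / p[2]

# Homography mappings of the same perpendicular line under two cameras;
# by Constraint 1 they meet at its perpendicular projection P_1 = (2, 1).
l1 = line_through((2.0, 1.0), (6.0, 5.0))  # mapping under camera 1
l2 = line_through((2.0, 1.0), (0.0, 4.0))  # mapping under camera 2
P1 = intersect(l1, l2)                     # -> array([2., 1.])
```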
To eliminate the above assumption, which is too restrictive for indoor environments, Constraint 2 is exploited in this paper: we make use of the perpendicular projection of the camera's optical center, which coincides with the intersection of the lines fitting each projected person region on the ground. As a result, the objective of person localization can be posed as an optimization problem under the proposed geometric constraints. This is the basis of our approach and will be discussed in the following section. It should be noted that, once the homography from a camera's image plane to the ground plane (used in [7]; it can be determined by the algorithm in [8]) is known, the perpendicular projection of the camera's optical center can be obtained automatically by calculating the vanishing point of the homography mappings of a perpendicular ruler standing at different locations in the scene.
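A sketch of how this calibration step might be implemented: fit a line to the ground-plane mapping of the ruler at each placement, then recover $c_g$ as the least-squares common point of those lines (Constraint 2). All names and coordinates below are hypothetical.

```python
import numpy as np

def fit_line(points):
    """Total-least-squares line n.x = d through 2D points (unit normal n)."""
    pts = np.asarray(points, dtype=float)
    c = pts.mean(axis=0)
    _, _, vt = np.linalg.svd(pts - c)  # smallest right singular vector
    n = vt[-1]                         # is normal to the point spread
    return n, n @ c

def concurrent_point(lines):
    """Point minimizing sum_i (n_i . x - d_i)^2 over lines (n_i, d_i)."""
    A = np.array([n for n, _ in lines])
    b = np.array([d for _, d in lines])
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x

# Ground-plane mappings (via H) of a vertical ruler standing at three
# different spots; by Constraint 2 each mapped line, extended, passes
# through c_g = (3, 2).
ruler_mappings = [[(5.0, 2.0), (9.0, 2.0)],
                  [(3.0, 5.0), (3.0, 9.0)],
                  [(6.0, 6.0), (9.0, 10.0)]]
c_g = concurrent_point([fit_line(m) for m in ruler_mappings])  # ~ (3, 2)
```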
III. PERSON LOCALIZATION
To obtain the person region in the image of each camera, we use background modeling and foreground subtraction. The extracted foreground from each view is then projected onto the ground plane under the corresponding homography, yielding a set of projected foreground pixels. The procedure is shown in Fig. 3. Let $M$ denote the number of cameras and $N_i$ $(i = 1, 2, \ldots, M)$ the number of foreground pixels extracted from the $i$th camera. All projected foreground pixels on the ground plane are represented by the set $F = \{p_j^i\}$ $(i = 1, 2, \ldots, M;\ j = 1, 2, \ldots, N_i)$, where $p_j^i$ denotes the $j$th pixel of the projected foreground from the $i$th camera.

Figure 3. Projected foregrounds on the ground plane and the expected localization result $x_p$; $c_g^1$, $c_g^2$ and $c_g^3$ are the ground locations of the optical centers' perpendicular projections.

We define the person location as the ground location of the person's perpendicular axis, along which the volume of the person is distributed symmetrically. Let $X$ denote the random variable that represents the person location. The objective of localization is to maximize the posterior probability distribution $P(X \mid F)$ given the observation $F$, namely

$x_p = \arg\max_X P(X \mid F)$,   (1)

where $x_p$ is the expected localization result. Using the Bayesian rule, we have

$P(X \mid F) \propto P(X) P(F \mid X)$,   (2)

where $P(X)$ is essentially related to the prior knowledge about the current person location. A uniform prior is assumed in our approach, suggesting no prior information about the person location. Under this assumption, we only need to consider the likelihood term in (2). Assuming the projected foreground pixels are independent of each other, we have

$P(X \mid F) \propto P(F \mid X) = \prod_{i=1}^{M} \prod_{j=1}^{N_i} P(p_j^i \mid X)$.   (3)

As discussed in the previous section, we know that for a particular projected pixel $p_j^i$, the distance to the line $c_g^i X$ can be used to measure how well the pixel fits the hypothesized location $X$. Intuitively, the smaller the distance, the better the fit. Therefore, we define $P(p_j^i \mid X)$ in our approach as

$P(p_j^i \mid X) \propto \exp(-D(p_j^i, c_g^i X))$,   (4)

where $D(p_j^i, c_g^i X)$ is the distance from the point $p_j^i$ to the line $c_g^i X$. Substituting (4) into (3), we have

$x_p = \arg\min_X \sum_{i=1}^{M} \sum_{j=1}^{N_i} D(p_j^i, c_g^i X)$.   (5)

If we define $D$ as the square of the algebraic distance from the point $p_j^i$ to the line $c_g^i X$, equation (5) becomes a least-squares problem that can be solved by linear algebra. Meanwhile, the error ellipse of the localization result can also be determined, which would be useful for future probability-based processing.
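The following is a minimal numpy sketch of this least-squares step, under the assumption that the squared algebraic distance is the squared cross product $(p_j^i - c_g^i) \times (X - c_g^i)$, which is linear in $X$; the interface, variable names and toy data are ours.

```python
import numpy as np

def localize(foregrounds, centers):
    """Solve eq. (5) with D = squared algebraic distance: for a pixel p of
    camera i, the residual (p - c_g^i) x (X - c_g^i) is linear in X, so
    the minimizer x_p and an error-ellipse covariance follow from linear
    least squares.
    foregrounds: per-camera (N_i, 2) arrays of projected foreground pixels;
    centers: per-camera ground projections c_g^i of the optical centers."""
    rows, rhs = [], []
    for pts, c in zip(foregrounds, centers):
        c = np.asarray(c, dtype=float)
        d = np.asarray(pts, dtype=float) - c   # pixel minus c_g^i
        rows.append(np.column_stack([-d[:, 1], d[:, 0]]))
        rhs.append(d[:, 0] * c[1] - d[:, 1] * c[0])
    A, b = np.vstack(rows), np.concatenate(rhs)
    x_p, res, *_ = np.linalg.lstsq(A, b, rcond=None)
    sigma2 = res[0] / max(len(b) - 2, 1) if res.size else 0.0
    cov = sigma2 * np.linalg.inv(A.T @ A)       # error ellipse shape
    return x_p, cov

# Hypothetical toy data: two cameras, pixels on lines through X = (4, 3).
fg = [np.array([[2.0, 1.5], [6.0, 4.5]]),   # camera 1, c_g^1 = (0, 0)
      np.array([[4.0, 5.0], [4.0, 7.0]])]   # camera 2, c_g^2 = (4, 9)
x_p, cov = localize(fg, centers=[(0.0, 0.0), (4.0, 9.0)])  # x_p ~ (4, 3)
```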
Given the ground location of the person, the person's height can be measured by a simple geometric relationship when the head region is correctly extracted in any view. E.g., in Fig. 2, by similar triangles the following relationship holds:

$P_1Q_1 = (P_1Q_{1g}^1 / C_g^1Q_{1g}^1) \cdot C_g^1C^1$,   (6)

where $C_g^1C^1$ is the height of the first camera's optical center, which can be determined along with the ground location of the optical center, and $Q_{1g}^1$ is the homography mapping of the head point $Q_1$ on $\pi$ under the first camera.
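Under the same assumptions, eq. (6) is a one-liner; the argument names and sample values below are ours.

```python
import numpy as np

def person_height(P1, Q1g, cg, camera_height):
    """Eq. (6): similar triangles give P1Q1 / (C_g^1 C^1) =
    |P1 - Q1g| / |c_g - Q1g|, so the person's height is that ratio
    times the camera height C_g^1 C^1."""
    P1, Q1g, cg = (np.asarray(v, dtype=float) for v in (P1, Q1g, cg))
    return np.linalg.norm(P1 - Q1g) / np.linalg.norm(cg - Q1g) * camera_height

# Hypothetical values: camera 2.8 m high; the mapped head point Q1g lies
# 3 m beyond the foot P1 and 5 m from c_g along the same ground line.
h = person_height(P1=(4.0, 0.0), Q1g=(7.0, 0.0), cg=(2.0, 0.0),
                  camera_height=2.8)  # -> 3/5 * 2.8 = 1.68 m
```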
IV. EXPERIMENT
The proposed approach has been examined in several real-world scenarios. One is an indoor environment with a table and a chair inside (Scene1). There are four stationary cameras monitoring the environment from distinct viewpoints; the cameras are mounted near the ceiling, tilting toward the floor. Fig. 4 shows the experimental results of person localization and height measurement in this scenario. It is difficult to compare our approach directly with others, as different methods use different definitions of location as well as different geometric constraints. However, the output of most localization methods is the motion trajectories of objects. Thus, we manually label the perpendicular projection of the body centroid on the ground as the ground truth and compare the trajectories obtained by our approach and by [7]. We can see that the trajectory of our approach is much closer to the ground truth than that from [7], where a systematic error exists. We further measure accuracy quantitatively by the mean distance between the localization result and the ground truth under the indoor scenarios listed in Tab. 1. Our approach has a significant advantage over that in [7] in terms of robustness and accuracy in indoor environments.
TABLE I. MEAN DISTANCE FROM CENTROID TRAJECTORY

                      Scene1 (4 cams)   Scene2 (4 cams)   Scene3 (3 cams)   Scene4 (3 cams)
Our approach          5.32 cm           5.19 cm           4.42 cm           4.82 cm
The approach in [7]   22.81 cm          21.33 cm          18.58 cm          19.27 cm
Figure 4. Experimental results in one scenario. The person trajectories on the ground obtained by our approach and by [7], as well as the manually labeled ground truth, are shown in (b), colored white, gray and blue, respectively. (a) shows the person height coinciding with the trajectories. (c) shows the localization procedure at one time instant: the centers of the yellow ellipse (the error ellipse) and the white circle are the localization results determined by our approach and by [7], respectively, and the projected principal axes used in [7] are shown as white lines. The original images and the corresponding foregrounds are shown in (d).
We have also tested the processing capability of our approach on a normally configured PC; the results are shown in Tab. 2. We notice that the processing frame rate of the four-camera configuration is roughly 3/4 of that of the three-camera configuration, and the processing frame rate of the high-resolution configuration is roughly 1/4 of that of the low-resolution configuration. This observation indicates that most of the running time is consumed by per-camera processing, e.g., background modeling, which implies that our localization, which is essentially a linear optimization problem, is computationally efficient.

TABLE II. PROCESSING SPEED OF OUR APPROACH

                                3 cams, 320×240   4 cams, 320×240   4 cams, 640×480
Average processing frame rate   12.2 fps          8.9 fps           2.4 fps
V. CONCLUSION

In this paper, we have proposed a robust vision-based approach for person localization in a multi-camera environment. This is achieved by incorporating the proposed geometric constraints into a probabilistic formulation of the localization problem. Real-world experiments have demonstrated the robustness, accuracy and computational efficiency of the approach, which can be widely used in human-centered computing applications.

ACKNOWLEDGMENT

This research was supported in part by the National Natural Science Foundation of China under Grant Nos. 60873266 and 90820304. We are thankful to our reviewers for their thoughtful comments and suggestions.

REFERENCES

[1] A. Jaimes, N. Sebe, and D. Gatica-Perez, "Human-Centered Computing: A Multimedia Perspective," Proc. ACM International Conf. on Multimedia, ACM Press, pp. 855-864, Oct. 2006.
[2] S. Bahadori, L. Iocchi, G. R. Leone, D. Nardi, and L. Scozzafava, "Real-Time People Localization and Tracking through Fixed Stereo Vision," Applied Intelligence, vol. 26, pp. 83-97, Apr. 2007.
[3] S. L. Dockstader and A. M. Tekalp, "Multiple Camera Tracking of Interacting and Occluded Human Motion," Proceedings of the IEEE, vol. 89, pp. 1441-1455, Oct. 2001.
[4] A. Mittal and L. S. Davis, "M2Tracker: A Multi-View Approach to Segmenting and Tracking People in a Cluttered Scene Using Region-Based Stereo," International Journal of Computer Vision, vol. 51, pp. 189-203, Feb. 2002.
[5] S. Khan and M. Shah, "A Multi-view Approach to Tracking People in Crowded Scenes Using a Planar Homography Constraint," Proc. European Conf. on Computer Vision, pp. 133-146 (IV), 2006.
[6] S. Park and M. M. Trivedi, "Multi-perspective Video Analysis of Persons and Vehicles for Enhanced Situational Awareness," Proc. IEEE International Conf. on Intelligence and Security Informatics, pp. 440-451, 2006.
[7] W. Hu, M. Hu, T. Tan, J. Lou, and S. Maybank, "Principal Axis-Based Correspondence between Multiple Cameras for People Tracking," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 28, pp. 663-671, Apr. 2006.
[8] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, Mar. 2004.