Multiple-Person Tracking Using a Plan-View Map with Error Estimation

Kentaro Hayashi1, Takahide Hirai1, Kazuhiko Sumi2, and Koichi Sasakawa1

1 Advanced Technology R&D Center, Mitsubishi Electric Co., 8-1-1 Tsukaguchi-Honmachi, Amagasaki, Hyogo 661-8661, Japan
{Hayashi.Kentaro, Hirai.Takahide, Sasakawa.Koichi}@wrc.melco.co.jp
2 Graduate School of Informatics, Kyoto University, 36-1 Yoshida-Honmachi, Sakyo-ku, Kyoto 606-8501, Japan
[email protected]

Abstract. In this paper we describe a new method for detecting and tracking multiple persons with a stereo camera. The method is based on the idea of the plan-view map, i.e., the 2D histogram of projected 3D measurements from the camera. It estimates a statistical feature in an optimally sized window, e.g., a rectangular region, on the histogram, considering stereo measurement error, human breadth, and height. It then measures the same statistical feature in the same window on the input histogram and compares the estimated feature with the measured one to detect and track persons. Experimental results show that our method achieves higher performance than a normal plan-view map.

1 Introduction

Person tracking systems based on vision are useful for many applications, including human-computer interaction, traffic counters, and surveillance systems. In this paper, we mainly consider surveillance systems. In many surveillance systems, a camera must be installed in a limited space. Especially inside a building, the space is limited by the ceiling, so the camera must be located at a low height. In this situation, the human silhouette is strongly affected by projection, and half-occlusion occurs frequently. 3D information, such as depth from a stereo camera, enables us to handle these issues easily. However, many previous methods [1–5] assume that the camera is positioned low and faces in a horizontal direction, and they do not deal with long-term half-occlusion. Beymer has proposed a person counting method using a stereo camera [6]. Although this method assumes that the stereo camera is above the path of a person and looking downward, it can deal with great changes in the viewing direction in a simple way by using a plan-view map. The plan-view map is a 2D rectangular histogram that counts the number of 3D measurements falling into the vertical square prism above each bin (histogram component). Stereo measurements are distributed vertically on a standing person, so it can be assumed that the value of the bin under the person is higher than the others.


Several papers [7, 8] have improved this basic idea. For instance, Harville [8] introduced the novel idea of a “plan-view height map,” which adds object height to the plan-view map. However, he did not explicitly deal with the measurement error of a stereo camera. Generally speaking, stereo measurement error grows as depth increases; therefore, detection and tracking performance degrades in areas far from the camera. This motivates us to take measurement error into account. In this paper we propose a new method that analyzes the measurement error of depth from a stereo camera and reduces its effect. This method is general enough to be combined with other methods, such as Harville's plan-view height map. First we derive a simplified formula for stereo measurement error and analyze the error of voting into the 2D histogram, i.e., the plan-view map. Next we calculate an adaptive window at each position on the histogram using both the nature of the error and the breadth and height of a person. The larger the window, the more reliable the detection and tracking of people; however, too large a window decreases the precision of the position. Therefore, we calculate the optimal window size. Lastly, we estimate a statistical feature, e.g., the average of the values in the window, and measure the same feature of the input histogram in the same window. We compare the estimated feature with the measured one to detect a person candidate at each position on the histogram. If these features are similar, a person is assumed to exist at that position. Experimental results show that our method has higher performance than a normal plan-view map.

2 Problem Definition

2.1 Camera Configuration

Figure 1 shows the configuration of a stereo camera, its right image plane, a world coordinate system, persons, and a 2D histogram.


Fig. 1. Configuration of stereo camera, image plane, and world coordinate system


The camera modules are placed parallel at distance B, at height D from the ground, and their slant angle is θ. The cameras have wide-angle lenses with horizontal angle α. We calculate a disparity d from the right image, whose coordinate system is represented by u − v. The width and height of the right image are au and av respectively. The stereo camera observes a 3D point p on an object in the world coordinate system X − Y − Z. For simplification, we set axis Y to be parallel to the axis u and fix the origin O to the ground plane. The 2D histogram is attached to the X − Y plane, and its bins are arranged in a lattice parallel to the X and Y axes. The origin of the histogram is (Xij0, Yij0). The size of a bin is ξX × ξY. The height H of a person can be calculated from observations, or we can fix H to an average human height. We assume that any other person is outside the circle whose center and radius are p and G, respectively. When G is large enough, e.g., 1 m, we can ignore heavy occlusion. Our method outputs the 2D positions of standing and/or walking persons from the input images.

2.2 Stereo Measurement Error Model

First, we undistort the input images, smooth them with a Gaussian filter, calculate a dense disparity map by a block matching algorithm [9], and subtract the background depth map from the input depth map by a conventional background subtraction method such as [10]. We start from this depth map. Rodriguez et al. [11] analyzed stereo depth errors with respect to d. In this paper we first derive a simpler formula so as to focus on the errors on the 2D histogram. Suppose a 3D point pc, which is in the camera coordinate system, is observed at (u, v) on the right image plane and at (u + d, v) on the left. pc is represented as

pc = [(u − u0)B/d  (v − v0)B/d  ζ/d]^t = (1/d) pcm,    (1)
ζ = au B / (2 tan(α/2)),    (2)

where x^t means the transpose of x. We can translate pc into p on the world coordinate system using an appropriate matrix M and vector C as

p = M pc + C.    (3)

Let d̄ be the true value of the disparity at pc. We assume that d has a noise δd that follows a Gaussian distribution whose expectation and variance are 0 and σd² respectively, that is,

d = d̄ + δd.    (4)

The expectation µpc of pc at d = d̄ is

µpc = pc |d=d̄.    (5)

Applying a 1-dimensional Taylor expansion to pc, we obtain

pc ≈ ṗc |d=d̄ (d − d̄) + pc |d=d̄.    (6)


ṗc is the derivative of pc with respect to d. Subtracting the expectation µpc and substituting d∗ = (d − d̄)/σd into the right term, it is simplified as

σd ṗc |d=d̄ d∗.    (7)

The expectation of d∗ is 0, and its variance is 1. Assuming that each component of ṗc is independent, the standard deviation σpc of pc is obtained from equations 1 and 7,

σpc = (σd / d̄²) diag(pcm),    (8)

where diag(p) is a matrix whose diagonals are the components of p. The expectation µp and standard deviation σp on the world coordinate system are obtained by applying equation 3,

µp = M pc |d=d̄ + C,    (9)
σp = (σd / d̄²) M diag(pcm) M^t.    (10)
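As a concrete illustration of this error model, the following Python sketch propagates a disparity noise level σd through equations 1, 2, 9, and 10. The camera parameters, the calibration matrix M, and the vector C in the example are hypothetical placeholders, and the measured disparity d is used in place of the true value d̄.

```python
import numpy as np

def stereo_error_model(u, v, d, B, alpha, a_u, u0, v0, sigma_d, M, C):
    """Propagate disparity noise sigma_d to world coordinates (eqs. 1, 2, 9, 10).

    M (3x3) and C (3-vector) map camera coordinates to world coordinates;
    they are assumed to be known from calibration.
    """
    zeta = a_u * B / (2.0 * np.tan(alpha / 2.0))            # eq. 2
    p_cm = np.array([(u - u0) * B, (v - v0) * B, zeta])     # numerator of eq. 1
    p_c = p_cm / d                                          # eq. 1
    mu_p = M @ p_c + C                                      # eq. 9 (d taken as the true disparity)
    sigma_p = (sigma_d / d**2) * (M @ np.diag(p_cm) @ M.T)  # eq. 10
    return mu_p, sigma_p

# Hypothetical usage: 150 mm baseline, 109 deg horizontal field of view,
# 640-pixel-wide image; M and C are placeholder calibration values.
M = np.eye(3)
C = np.array([0.0, 0.0, 2000.0])
mu_p, sigma_p = stereo_error_model(u=400, v=300, d=20.0, B=150.0,
                                   alpha=np.deg2rad(109.0), a_u=640,
                                   u0=320, v0=240, sigma_d=0.1, M=M, C=C)
print(mu_p, np.diag(sigma_p))   # diagonals give sigma_X, sigma_Y, sigma_Z
```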

3 Detecting and Tracking Method

As described in section 1, a simple and effective way to detect a standing or walking person is to use a 2D histogram. We improve the performance of this basic idea by considering measurement error.

3.1 Generating 2D Histogram

First we describe how to generate a 2D histogram. A 3D point p = [pX pY pZ]^t is projected onto the bin at

(i, j) = (⌊(pX − Xij0)/ξX⌋, ⌊(pY − Yij0)/ξY⌋),    (11)

where ⌊x⌋ denotes the largest integer not greater than x. The value of the bin at (i, j) is increased by the projection. After all 3D points are projected onto the appropriate bins, it is expected that the bin under a person has a higher value than the others. We denote the bin value at (i, j) by h(i, j).
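As an illustration of this voting step, the following Python sketch accumulates a set of 3D points into a plan-view map according to equation 11; the bin size, map origin, and map extent are hypothetical parameters rather than values from the paper.

```python
import numpy as np

def plan_view_map(points, origin, bin_size, shape):
    """Accumulate 3D points into a 2D plan-view histogram (eq. 11).

    points:   (N, 3) array of world coordinates [pX, pY, pZ]
    origin:   (Xij0, Yij0), world position of bin (0, 0)
    bin_size: (xi_X, xi_Y)
    shape:    (number of bins along X, number of bins along Y)
    """
    h = np.zeros(shape, dtype=np.int32)
    i = np.floor((points[:, 0] - origin[0]) / bin_size[0]).astype(int)
    j = np.floor((points[:, 1] - origin[1]) / bin_size[1]).astype(int)
    inside = (i >= 0) & (i < shape[0]) & (j >= 0) & (j < shape[1])
    np.add.at(h, (i[inside], j[inside]), 1)   # one vote per 3D measurement
    return h

# Hypothetical usage: 100 mm bins over a 6 m x 4 m area
pts = np.random.rand(1000, 3) * np.array([6000.0, 4000.0, 1800.0])
h = plan_view_map(pts, origin=(0.0, 0.0), bin_size=(100.0, 100.0), shape=(60, 40))
```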

3.2 Analyzing 2D Histogram

A 3D point p includes measurement error, which increases with the distance from the camera. Therefore, in general, the histogram generated by equation 11 has a broad, low peak underneath a person. Additional measurement error, such as extra depth attached to a person, further decreases the signal-to-noise ratio. We propose a new method that takes the measurement error into account and proceeds as follows:

1. Calculate a window on the histogram from the variance of the noise and the human breadth. The window size varies with the distance from the camera.


2. Estimate statistical feature(s) in the window, such as an average or a variance. Since these features are independent of the actual measurements, we can estimate all of them on the histogram in advance. In this paper we adopt the average as the feature.
3. Calculate the same feature in the same window on the measured histogram.
4. Compare the estimated features with the measured features at the same position. If they are similar, the probability that a person is at the position is high. We obtain a probability map by comparing features at each position of the histogram.
5. Detect and track peaks on the probability map.

The detailed algorithm is explained in the following subsections.

3.3 Calculating Optimal Window Size

We assume a rectangular window on the histogram and calculate the minimum window that contains most of the measurement points on one person. For simplification, we consider the diagonals of σp, [σX σY σZ]^t. Let W denote the average human breadth and WD (< W) the human depth. The window size (ww, hw) is expressed as

ww = Fq σX,    (12)
Fq σY + WD ≤ hw ≤ Fq σY + W,    (13)

using a constant Fq. Figure 2 shows these relations.


Fig. 2. Relation between standard deviation σX , σY and window size

If most persons move along lines in the depth direction, hw can be approximated as hw = W.
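A minimal sketch of this window computation is shown below; σX and σY are assumed to come from equation 10, the values of W and WD are illustrative assumptions, and the upper bound of equation 13 is used for hw.

```python
def window_size(sigma_x, sigma_y, W=500.0, W_D=300.0, F_q=6.0):
    """Window size (w_w, h_w) from eqs. 12 and 13, in mm.

    sigma_x, sigma_y: plan-view standard deviations from eq. 10
    W, W_D:           assumed average human breadth and depth (not given in the paper)
    F_q:              coverage constant; F_q = 6 corresponds to +/-3 sigma
    """
    w_w = F_q * sigma_x         # eq. 12
    h_w = F_q * sigma_y + W     # upper bound of eq. 13
    return w_w, h_w
```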

3.4 Estimating Statistical Feature(s) in the Window

We describe how to calculate the average in the window. We use a simple human model: a plane perpendicular to the ground whose normal direction faces the camera. The size of the model plane is W × H.


Suppose the plane is projected onto the image; the distance from the bottom to the top of the plane is n pixels in image coordinates, and the bin width ξY corresponds to m pixels. All n · m measurement points are projected into the same bin of the histogram if we ignore noise and measurement errors. Therefore, the true bin value k is k = n · m. Let ρr be the probability of acquiring measurements on the human model plane and νk be the variance of the bin value caused by random noise on the image. Because of the noise on the depth, kρr + νk measurement points are distributed over the window width ww. Therefore, the average µk is

µk = (kρr + νk)/ww.    (14)

νk can be calculated as follows. Let ρd be the occurrence probability of a uniform disparity at a pixel. A randomly generated disparity d is projected to the bin if d̄ − δd/2 < d < d̄ + δd/2 and its pixel lies on the vertical line through the bin. Using the distance hmax from the ground to the image boundary along that vertical line and the maximal disparity dmax,

νk = hmax m ρd (δd/dmax).    (15)

If ρd is small enough, νk is also small. We can estimate other statistical features, such as the standard deviation, in the same way. Since we are focusing on the effectiveness of our general framework, the simplest feature, i.e., the average, is sufficient for evaluating our method.
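For reference, a small sketch of equation 14 is given below; the inputs n, m, ρr, and νk are assumed to be available from the camera geometry and the noise model of equation 15, and the window width is treated here as a bin count.

```python
def expected_average(n, m, rho_r, nu_k, w_w):
    """Estimated in-window average mu_k (eq. 14).

    n:     height of the projected model plane in pixels
    m:     width of a histogram bin in pixels
    rho_r: probability of acquiring a measurement on the model plane
    nu_k:  bin-value variance caused by random image noise (eq. 15)
    w_w:   window width, expressed here in bins
    """
    k = n * m                         # true bin value without noise
    return (k * rho_r + nu_k) / w_w
```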

3.5 Measuring Statistical Feature(s) in the Window

The average Ak(p) in the window on the input histogram is

Ak(p) = ( Σ(i,j)∈W(p) h(i, j) ) / (ww hw),    (16)

where W(p) is the window on the histogram corresponding to p. To calculate all averages on the histogram, we use the integral image, a fast algorithm developed by Viola et al. [12]. Let M² be the average area of a window and N² the area of the histogram. The computational order of Ak by a brute-force method is O(M²N²), which can be reduced to O(N²) using the integral image.
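The following Python sketch shows how such window averages can be computed with an integral image; for simplicity it assumes a fixed window size in bins, whereas the paper varies the window with the distance from the camera.

```python
import numpy as np

def window_averages(h, w_w, h_w):
    """Average of h over a (w_w x h_w)-bin window centred at every bin, via an integral image.

    Corresponds to eq. 16 with a fixed window size (w_w, h_w given in bins).
    """
    ii = np.zeros((h.shape[0] + 1, h.shape[1] + 1))
    ii[1:, 1:] = np.cumsum(np.cumsum(h, axis=0), axis=1)   # integral image [12]
    a, b = w_w // 2, h_w // 2
    A = np.zeros(h.shape, dtype=float)
    for i in range(h.shape[0]):
        for j in range(h.shape[1]):
            i0, i1 = max(i - a, 0), min(i + a + 1, h.shape[0])
            j0, j1 = max(j - b, 0), min(j + b + 1, h.shape[1])
            box_sum = ii[i1, j1] - ii[i0, j1] - ii[i1, j0] + ii[i0, j0]
            A[i, j] = box_sum / (w_w * h_w)
    return A
```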

3.6 Detecting and Tracking Persons

From equations 14 and 16, we obtain µk and Ak at each point of the histogram. Although Ak will vary because of occlusion and other ignored noise, a high probability that a person is at the point may be expected when the two are similar. Their similarity can be evaluated in a number of ways, for example,

h′ = 1 − |Ak − µk| / µk.    (17)

If Ak ≤ µk, this simplifies to h′ = Ak/µk. We detect a person candidate at a point when h′ ≥ Tµ. After detection, we can track the peak near the previous peak by filtering methods such as the Kalman filter, the condensation method [13], or dynamic programming [7]. We take the map of h′ as a probability map and adopt the condensation method to track each peak on it.
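A minimal sketch of this detection step is given below; clipping h′ to [0, 1] and the threshold value are illustrative assumptions, and the condensation-based tracking stage is omitted.

```python
import numpy as np

def probability_map(A, mu):
    """Similarity h' of eq. 17 between the measured (A) and estimated (mu) averages."""
    h_prime = 1.0 - np.abs(A - mu) / mu
    return np.clip(h_prime, 0.0, 1.0)

def detect_candidates(h_prime, T_mu=0.8):
    """Bins where h' reaches the threshold T_mu (0.8 is the value used in the simulation)."""
    return np.argwhere(h_prime >= T_mu)
```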

4 Experiments

4.1 Simulating µk and Ak

The objectives of the simulation are: (a) to check the correctness of the statistical feature estimates by comparing estimated features and simulated ones, and (b) to find appropriate thresholds that will work in actual situations. We simulate µk and Ak as follows: (1) locate a vertical plane of size W × H at position p; (2) scan all optical rays that pass through both the image and the plane; (3) calculate the disparity of each ray and add Gaussian noise to it; (4) count all 3D points calculated from the disparity map into the histogram; (5) move p in the range −1500 ≤ pY ≤ 1500 and 2000 ≤ pX ≤ 5500, and calculate µk and Ak at each position of p.

Fig. 3. Simulated results: (a) simulated µk at each point p; (b) simulated Ak/µk at each point p

Figure 3(a) shows µk at each point p. µk decreases nonlinearly as the distance from the camera increases, because the window size increases and the number of measurement points decreases with the distance. Ak behaves similarly to µk. Most past methods approximated this behavior as linear; therefore, it was difficult to maintain detection and tracking performance at all points. We analyze the behavior of Ak more precisely. Figure 3(b) shows the ratio Ak/µk. From this simulation, when pY is near 0, Ak/µk is close to 1. However, as |pY| increases, Ak/µk decreases. The reason for this phenomenon is that the window shape and the actual peak shape differ, especially when |pY| is large. Nevertheless, we can detect persons at most positions, for instance, by setting Tµ = 0.8.

4.2 Detecting and Tracking Actual Images

We take Beymer's method as a baseline because his idea is basic and easy to comprehend, and we compare the two methods. We developed a person tracking system using our method and another using his method on a standard PC.

366

K. Hayashi et al.

Y

4.2m

stereo camera

open space

m 2

m .41

person

O

X

Fig. 4. Specifications of camera locations in our experimental setup

Many performance evaluation methods have been proposed; in this paper we use Pingali's method [14], which investigates the success and false alarm duration rates of the total tracking. Pingali defines one track as lasting from the appearance to the disappearance of a person in the field of view. A video sequence contains multiple tracks. Let A be the set of true tracks in a sequence, A′ the true tracks that correspond to detected tracks, R the detected tracks that correspond to true tracks, and R′ the detected tracks that do not correspond to any true track. T(X) is the total duration of the tracks in X. The durational miss detection rate md and the durational false alarm rate fd are represented as

md = 1 − T(A′)/T(A),    (18)
fd = T(R′)/T(A).    (19)

md = 0 and fd = 0 denote ideal tracking.
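A small sketch of how these rates could be computed from track durations is given below; the track-matching step that produces A′ and R′ is assumed to be done beforehand.

```python
def durational_rates(T_A, T_A_prime, T_R_prime):
    """Durational miss detection and false alarm rates (eqs. 18 and 19).

    T_A:       total duration of all true tracks
    T_A_prime: total duration of true tracks that correspond to detected tracks (A')
    T_R_prime: total duration of detected tracks with no true counterpart (R')
    """
    m_d = 1.0 - T_A_prime / T_A
    f_d = T_R_prime / T_A
    return m_d, f_d
```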

We located the stereo camera 2 m high at a 40 degree slant angle. The horizontal field of view was 109 degrees, and the baseline length was 150 mm. While multiple persons walked or ran freely in the open space shown in figure 4, we captured video from the stereo camera for about 73 s. The video contained 44 true tracks, which were detected and tracked by a human operator, by Beymer's method, and by our method. We set the parameters σd = 0.1 and Fq = 6, which corresponds to ±3σ. We used the tracks picked by the operator as the true tracks and calculated md and fd by comparing them with Beymer's tracks and our method's tracks. Figure 5 shows the relationship between md and fd calculated from this experiment. Triangles (Beymer's method) were calculated using the true tracks and his tracks at several different threshold values. Circles (proposed method) were calculated using the true tracks and our method's tracks. The solid line in the figure is the envelope of Beymer's results, which is the boundary of its best performance. The dashed line is the same envelope for ours. The dash-dotted line stands for the equal error rate (EER). Obviously, our method shows higher performance than Beymer's. Table 1 shows the EERs of Beymer's method and ours. We show an example of the tracking process of our system in figure 6. The figure shows the moment when the system is tracking four persons.



Fig. 5. Relation between miss detection (md) and false alarm detection (fd)

Table 1. Equal error rate (EER) of Beymer's and our proposed methods

          Beymer  proposed
EER (%)     8.5      5.5

Fig. 6. Example of tracking results. Top left shows the original image, bottom left shows disparities overlaid on the original, bottom right shows the map of h′, the tracking trajectories, and the world coordinate system, and top right shows rectangular person regions projected from the tracking results.

The top left corner shows the original image, the bottom left shows the disparities overlaid on the original, the bottom right shows the map of h′, the tracking trajectories, and the world coordinate system, and the top right shows rectangular person regions projected from the tracking results. Although the farthest person is heavily occluded by another person, the system can continue to track him because the occlusion duration is short.

5 Conclusions

We described a new method for detecting and tracking multiple persons with a stereo camera. The method estimates statistical feature(s) in an optimally sized window on the histogram, considering stereo measurement error and human breadth. It then measures the feature(s) in the same window on the input histogram and compares the estimated feature with the measured one. The experimental results showed that our method reduces both miss detections and false alarms compared with the normal plan-view map.

References

1. Haritaoglu, I., Harwood, D., Davis, L.S.: W4S: A real-time system for detecting and tracking people in 2½D. In: European Conf. on Computer Vision. (1998) 877–892
2. Okada, R., Shirai, Y., Miura, J.: Object tracking based on optical flow and depth. In: IEEE/SICE/RSJ Intl. Conf. on Multisensor Fusion and Integration for Intelligent Systems. (1996)
3. Rehg, J.M., Loughlin, M., Waters, K.: Vision for a smart kiosk. In: Intl. Conf. on Computer Vision and Pattern Recognition. (1997) 690–696
4. Darrell, T., Gordon, G., Harville, M., Woodfill, J.: Integrated person tracking using stereo, color, and pattern detection. Intl. Journal of Computer Vision 37 (2000) 175–185
5. Beymer, D., Konolige, K.: Real-time tracking of multiple people using stereo. In: Intl. Conf. on Computer Vision. (1999)
6. Beymer, D.: Person counting using stereo. In: Intl. Workshop on Human Motion. (2000) 127–133
7. Darrell, T., Demirdjian, D., Checka, N., Felzenszwalb, P.: Plan-view trajectory estimation with dense stereo background models. In: Intl. Conf. on Computer Vision. (2001) 628–635
8. Harville, M.: Stereo person tracking with adaptive plan-view templates of height and occupancy statistics. Image and Vision Computing 22 (2004) 127–142
9. Faugeras, O., Hotz, B., Mathieu, H., Vieville, T., Zhang, Z., Fua, P., Theron, E., Moll, L., Berry, G., Vuillemin, J., Bertin, P., Proy, C.: Real time correlation-based stereo: Algorithm, implementations and applications. Technical Report No. 2013, INRIA (1993)
10. Stauffer, C., Grimson, W.E.L.: Learning patterns of activity using real-time tracking. IEEE Trans. on Pattern Analysis and Machine Intelligence 22 (2000) 747–757
11. Rodriguez, J.J., Aggarwal, J.: Stochastic analysis of stereo quantization error. IEEE Trans. on Pattern Analysis and Machine Intelligence 12 (1990) 467–470
12. Viola, P., Jones, M.J.: Robust real-time object detection. Technical Report CRL 2001/01, COMPAQ Cambridge Research Laboratory (2001)
13. Isard, M., Blake, A.: Contour tracking by stochastic propagation of conditional density. In: European Conf. on Computer Vision. (1996) 343–355
14. Pingali, S., Segen, J.: Performance evaluation of people tracking systems. In: Workshop on Applications of Computer Vision. (1996)
