IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART B: CYBERNETICS, VOL. 41, NO. 1, FEBRUARY 2011

Multitarget Visual Tracking Based Effective Surveillance With Cooperation of Multiple Active Cameras

Cheng-Ming Huang and Li-Chen Fu, Fellow, IEEE

Abstract—This paper presents a tracking-based surveillance system that is capable of tracking multiple moving objects, with almost real-time response, through the effective cooperation of multiple pan-tilt cameras. To construct this surveillance system, a distributed camera agent, which tracks multiple moving objects independently, is first developed. The particle filter is extended with a target depth estimate to track multiple targets that may overlap with one another. A strategy to select the suboptimal camera action is then proposed for a camera mounted on a pan-tilt platform that has been assigned to simultaneously track multiple targets within its limited field of view. This strategy is based on mutual information and the Monte Carlo method and aims to maintain coverage of the tracked targets. Finally, so that a surveillance system with a small number of active cameras can effectively monitor a wide space, the system aims to maximize the number of targets being tracked. We further propose a hierarchical camera selection and task assignment strategy, known as the online position strategy, to integrate all of the distributed camera agents. The overall performance of the multicamera surveillance system has been verified with computer simulations and extensive experiments.

Index Terms—Active vision, cooperative system, multicamera system, multitarget visual tracking.

I. INTRODUCTION

VISUAL surveillance in a dynamic environment has drawn a great deal of attention in the last decade. To construct a wide-area surveillance system economically, people have come to utilize active cameras to track multiple targets. This idea is the key concept and motivation behind this paper, which is also driven by a wide range of applications that involve the detection and tracking of multiple targets in image scenes. With advances in tracking theory and increases in computational power, it is now possible for more targets to be monitored simultaneously, whereas the use of multiple cameras to track multiple targets has opened up the prospect of seamless tracking [12].

Manuscript received June 18, 2009; revised December 3, 2009 and March 15, 2010; accepted April 20, 2010. Date of publication June 14, 2010; date of current version January 14, 2011. This work was supported by the National Science Council of the Republic of China under Grant NSC 98-2218E-002-004. This paper was presented in part at the IEEE/RSJ International Conference on Intelligent Robots and Systems, San Diego, CA, USA, October 2007, and in part at the IEEE International Conference on Systems, Man, and Cybernetics, Singapore, October 2008. This paper was recommended by Associate Editor S. X. Yang. C.-M. Huang is with the Department of Electrical Engineering, National Taiwan University, Taipei 106, Taiwan. L.-C. Fu is with the Department of Electrical Engineering and the Department of Computer Science and Information Engineering, National Taiwan University, Taipei 106, Taiwan (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TSMCB.2010.2050878

Thus, by integrating the multitarget tracking technique with the multicamera cooperative methodology, a powerful wide-area surveillance system can be constructed with real-time performance. It can also be shown that once the surveillance system has been equipped with multitarget tracking capabilities, the visual perception and camera actions become free of instability and allow better tracking [1].

For the task of tracking multiple targets with a camera, or any similarly high-dimensional estimation problem, a comprehensive search of the state space is computationally expensive, which makes real-time performance of the surveillance system difficult to achieve. The particle filter, or sequential Monte Carlo (SMC) method [17], which is based on the Bayesian filtering framework, is a natural solution to this problem. By approximating the probability density function in the state space with discrete samples, one can obtain the target estimate from the sample set. Concentrating on tracking multiple targets in the image space, the Markov network or the Markov random field [2], [18] has been used to model the interaction between joint targets, i.e., those that are close to one another and must be estimated at the same time. These methods are, however, still unable to track multiple targets reliably when the targets seriously overlap with one another. Some related work [19], [20] has also discussed the problem of occlusion handling during target tracking. MacCormick and Blake [20] incorporated a mixed-state conditional density propagation (CONDENSATION) tracker with the exclusive likelihood for tracking two interacting targets. The target's state is augmented with a discrete labeling variable to describe the geometric relations among overlapping targets. Here, we first employ the concepts in [20] and define an auxiliary depth-order state for multiple interacting (overlapping) targets. Next, a more general mixed-state transition and joint image likelihood [5] are proposed to track these multiple interacting targets. This basic multitarget tracking function is then applied to each camera platform of our system.

Since the field of view (FOV) of a single camera is limited, a camera platform is usually equipped with some degrees of freedom to extend its observation range. In general, a camera's pan and tilt motors are controlled to track a single target after the target has been detected and localized [1]. In visual servo applications [8], the camera is mounted on a moving platform, such as a robot arm or a vehicle, which is controlled to follow the target via visual feedback of the target's location extracted from the images captured by the camera. To design a motion strategy that allows a single camera to successfully track multiple targets, mutual information (MI) is gathered for evaluation.


Fig. 1. Flowchart of the proposed multicamera system. The colored boxes in each unit describe the considered models.

In information theory, MI is a quantity commonly used to measure the mutual dependence of two variables. It has been widely applied to assess the relationship between sensor performance and state estimation under actuator control, such as camera viewpoint selection for recognition [13] and sensor management for tracking targets in cooperative sensor networks [14]. Here, the designed optimal motion strategy of the camera, which seeks to maximize the MI, determines the next pose of the camera based on the predicted views of the targets in transition.

Another way to increase the surveillance area is to use multiple cameras. Some multicamera surveillance systems utilize stationary cameras, without any degree of freedom, to continuously track targets in various places of interest [10], [28] or to observe the same target from different viewpoints [9]. A small surveillance space can be completely monitored by initially allocating several static cameras with multitarget tracking capabilities at well-designed locations [11]. However, as the surveillance space grows, more cameras are required, and the implementation cost of this method becomes too high. Some research [1], [21], [22] has proposed cooperative distributed vision systems in which targets are observed with multiple active cameras whose viewpoints can be changed under control. However, these works typically utilize each camera to track only a single target, even when that target is within a crowd. In our proposed system, each camera platform has multitarget tracking ability, and the targets observed by each camera platform are kept as different as possible from those observed by the other cameras, so that the number of targets tracked by the entire surveillance system can be maximized. As a result, a salient feature of our work is that a larger surveillance area can be monitored with only a few cameras.

Most current cooperation methods for multiple active cameras [1], [21]–[23] require the allocation of several agents and additional central managers (CMs). Each agent may have a camera platform, a computational processing unit, and a network communication module, enabling it to perform its designated tasks independently. The additional CMs are then responsible for coordinating these agents to optimize the performance of the entire system. Unlike previous research [1], [21], where each agent is triggered simply by the detection of a new target and a group of agents is then allocated to track that target, the system proposed in this paper gathers a group of agents dynamically to track multiple targets.

In other words, the cameras that are assigned to monitor the targets may change during surveillance as the cameras' and targets' positions and characteristics change over time. To meet our goal of real-time multitarget surveillance, the negotiation process of [22] is omitted, as it would slow the system. Instead, the CM hierarchically assigns each individual camera agent to observe its designated targets based on the time-varying visual feedback.

Fig. 1 presents the architecture of the multicamera surveillance system proposed in this paper. Each distributed camera agent changes its camera viewpoint based on the command it received during the previous time instant. It then captures the image, including the targets' various attributes. The multitarget visual tracking unit acts as an observer, estimating the state of each target. The input/output hidden Markov model (HMM) models the dependencies between each target, a given camera action, and the captured image. The corresponding SMC inference is extended with depth-order estimation to facilitate tracking of overlapping targets. The state estimate can be represented by the maximum a posteriori of the probabilistic inference. The targets' attributes, including the state estimates and visual appearances, are collected from each camera and used to create the subsequent commands. The CM's main role is to coordinate the camera agents in a hierarchical manner based on their respective embedded visual information: a camera agent with higher camera utility is selected in each iteration, and this camera is commanded to design its action according to its assigned task. Under this software architecture, some cameras are assigned as master cameras and must keep track of the groups of assigned targets within their FOVs. The cameras designated as slave cameras assist the master cameras in tracking specific targets. The free cameras, which are not yet assigned to any particular observation task, search the unobserved area along a predefined path for more targets. In addition, the action of every master camera is generated based on the proposed moving strategy for a single camera so as to maintain the maximum coverage of the multiple tracked targets.


The action hypothesis of the master camera can also predict a corresponding FOV after the next command. Finally, the camera actions for the next time instant are sent, moving the cameras' viewpoints accordingly.

The remainder of this paper is organized as follows. Sections II and III deal with the construction of a distributed camera agent capable of tracking multiple targets within our proposed system. Section II shows how the input/output HMM is used to model a distributed surveillance subsystem with a depth-order estimate to track multiple overlapping targets, whereas Section III presents the basic moving strategy of a single camera to track multiple targets by evaluating its MI. Section IV describes the cooperation algorithm of multiple active cameras for efficiently tracking multiple targets and defines the various roles and moving strategies of the different cameras; simulations and comparisons are given there to demonstrate the effectiveness of multicamera cooperation. The performance of the developed system is demonstrated in Section V with real-time experiments. Finally, conclusions are drawn in Section VI.

II. MULTITARGET TRACKING WITH DEPTH ESTIMATE ON AN ACTIVE CAMERA

In a surveillance system with a single pan-tilt camera, the camera is typically installed at a fixed location and can expand its FOV by controlling its pan/tilt motors. The target is observed with respect to the moving camera platform coordinates and tracked in the corresponding image plane. Let the set Xt = {x1,t, . . . , xMt,t} represent the state vectors of the current Mt targets in the spherical coordinates of the camera platform [23]. The number of targets Mt, which is assumed to be known through the target detection process, may change over time as targets enter or leave the camera's FOV. Given the image observation zt and the camera action ut at the current time t, the problem of tracking multiple targets with a moving camera can be expressed as the posterior probability p(Xt | u0:t, z0:t), where u0:t and z0:t are the camera actions and image observations, respectively, from t = 0 to the current time. The goal of visual tracking is to estimate the states of the targets in Xt through evaluation of this conditional distribution.

In the visual tracking problem, it may be sufficient to describe the observation and state transition using an HMM without knowing the control input. However, since the camera's pan/tilt motors are controlled to sense the targets in active vision, the targets' states Xt are estimated from the camera's viewpoint, so the commanded camera control input ut also influences the targets' states in the camera coordinates. The input/output HMM [16] is therefore applied to represent the visual tracking problem relative to the active camera platform. The graphical model of the input/output HMM in Fig. 2 shows that the image depends on the targets, and the targets' states in the camera coordinates also depend on the camera's viewpoint. Utilizing the recursive Bayesian filtering algorithm [17] and assuming that the targets' state transition relative to the camera is Markov, we can express the posterior probability of visual tracking as

$$
p(X_t \mid u_{0:t}, z_{0:t}) \cong \alpha\, p(z_t \mid X_t) \int p(X_t \mid u_t, X_{t-1})\, p(X_{t-1} \mid u_{0:t-1}, z_{0:t-1})\, dX_{t-1} \tag{1}
$$

Fig. 2. Graphical model of the input/output HMM for the multitarget visual tracking on an active camera.

Fig. 3. Depth-order definition for occlusion handling. The depth-order set of this joint state Xg,t is Dg,t = {1, 3, 2}.

where α is a normalization constant, p(zt | Xt) denotes the joint likelihood that describes the observation in the current image given the set of targets, and p(Xt | ut, Xt−1) predicts the current states of all targets from the camera action ut and the previously obtained states through their respective state transition models. In fact, both the state prediction and the likelihood update from the previous posterior p(Xt−1 | u0:t−1, z0:t−1) in (1) are highly correlated among the targets. Since the posteriors of some targets are mutually independent, we first separate the targets into groups to reduce the computational complexity. After investigating the dependence of the previous state estimates Xt−1, we assemble the states of the interacting targets from the entire target set and define the so-called joint states Xg,t, g = 1, . . . , G, each being the state of one group of interacting targets among the total of G interacting groups. Hence, the posterior in (1) is equivalent to

$$
p(X_t \mid u_{0:t}, z_{0:t}) \cong \alpha \prod_{g=1}^{G} p(X_{g,t} \mid u_{0:t}, z_{0:t}). \tag{2}
$$

Let Mg,t be the number of dependent targets in each interacting group, so that $M_t = \sum_{g=1}^{G} M_{g,t}$. Note that Mg,t = 1 represents an independent target, and the posterior of an independent target in (2) can be efficiently evaluated by a sequential importance sampling (SIS) particle filter [4]. Although the importance function of the SIS particle filter provides visual information about a target by guiding the drawing of samples, the importance functions of overlapping targets may not be distinguishable from one another; thus, one challenge in tracking multiple interacting targets is that their images may overlap. For the dependent terms Mg,t > 1 in (2), the sampling importance resampling (SIR) particle filter [3], [17], [20], in which the samples are drawn by resampling from the joint posterior of the overlapping targets at the previous time instant, is extended with a depth-order estimate. The following description focuses on the multitarget visual tracking function for one interacting group g, where the depth order is the sequence of the targets involved, reflecting the ascending order of the relative distances between the targets and the camera, as shown in Fig. 3.
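For an independent target (Mg,t = 1), the recursion in (1) reduces to the familiar predict-reweight-resample loop of a particle filter. The short Python sketch below illustrates that loop for a single target with state [θ, φ, s]; the Gaussian diffusion values, the placeholder likelihood, and the weighted-mean point estimate are assumptions made for illustration and are not the authors' implementation (the paper uses image-based likelihoods and reports the maximum a posteriori estimate).

```python
import numpy as np

rng = np.random.default_rng(0)

def sis_update(particles, weights, u_t, likelihood_fn, sigma=(0.02, 0.02, 1.0)):
    """One predict-reweight-resample step for an independent target.

    particles     : (N, 3) array of [theta, phi, size] hypotheses
    weights       : (N,) importance weights carried over from the last frame
    u_t           : pan/tilt command, in the same angular units as (theta, phi)
    likelihood_fn : maps one particle to p(z_t | x_t); a stand-in here
    """
    # Predict: compensate the camera motion, then diffuse (cf. (22) in Section V).
    noise = rng.normal(0.0, sigma, size=particles.shape)
    predicted = particles - np.array([u_t[0], u_t[1], 0.0]) + noise

    # Update: reweight by the image likelihood and renormalize.
    weights = weights * np.array([likelihood_fn(p) for p in predicted])
    weights = weights / weights.sum()

    # Resample when the effective sample size collapses (SIR-style refresh).
    if 1.0 / np.sum(weights ** 2) < 0.5 * len(weights):
        idx = rng.choice(len(weights), size=len(weights), p=weights)
        predicted = predicted[idx]
        weights = np.full(len(weights), 1.0 / len(weights))

    # A simple point estimate; the paper reports the MAP of the posterior instead.
    return predicted, weights, np.average(predicted, axis=0, weights=weights)

# Toy usage with 30 particles (the count used in Section V) and a dummy likelihood.
parts = rng.normal(0.0, 0.05, size=(30, 3))
w = np.full(30, 1.0 / 30)
parts, w, x_hat = sis_update(parts, w, (0.01, 0.0),
                             lambda p: np.exp(-np.sum(p[:2] ** 2) / 0.01))
```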


If the depth-order set of the joint targets in one interacting group is defined as Dg,t, then there are Mg,t! possible depth-order hypotheses for inferring the overlapping configuration of the joint state Xg,t. Fig. 3 shows an example with six depth-order hypothesis events: {1, 2, 3}, {1, 3, 2}, {2, 1, 3}, {2, 3, 1}, {3, 1, 2}, and {3, 2, 1}. With this augmentation of the depth-order hypotheses, the posterior of the dependent targets p(Xg,t | u0:t, z0:t) in (2) can be redefined as p(Dg,t, Xg,t | u0:t, z0:t), and (1) can be rewritten as

$$
p(D_{g,t}, X_{g,t} \mid u_{0:t}, z_{0:t}) \cong \beta\, p(z_t \mid X_{g,t}, D_{g,t}) \sum_{D_{g,t-1}} \int p(D_{g,t}, X_{g,t} \mid D_{g,t-1}, X_{g,t-1}, u_t)\, p(D_{g,t-1}, X_{g,t-1} \mid u_{0:t-1}, z_{0:t-1})\, dX_{g,t-1} \tag{3}
$$

Fig. 4. Depth-order transition is modeled as a normal distribution with adaptive covariance. The previous depth order is assumed to be Dg,t−1 = {1, 3, 2}.

where β is a normalization constant, p(zt | Xg,t, Dg,t) is the joint image likelihood [5] with the depth-order variable, and p(Dg,t, Xg,t | Dg,t−1, Xg,t−1, ut) denotes the mixed-state transition model, which describes the state transition of the overlapping targets and the variation of their depth order in the 2-D image after the camera action ut is applied. Using the chain rule, we can divide the mixed-state transition probability for the filtering update into the following two terms:

$$
p(D_{g,t}, X_{g,t} \mid D_{g,t-1}, X_{g,t-1}, u_t) = p(X_{g,t} \mid D_{g,t}, D_{g,t-1}, X_{g,t-1}, u_t)\, p(D_{g,t} \mid D_{g,t-1}, X_{g,t-1}, u_t). \tag{4}
$$

The first factor p(Xg,t | Dg,t, Dg,t−1, Xg,t−1, ut) in (4) is the joint state transition model of the dependent targets given the known camera action ut, the current depth-order hypothesis Dg,t, and the joint states with the depth-order estimate at the previous time t − 1. When the targets start overlapping, their joint states are formed by sampling the posterior of each individual state. The second factor p(Dg,t | Dg,t−1, Xg,t−1, ut) in (4) is the depth-order transition probability, which models how the depth order changes over time based on the target estimates at the previous time t − 1. The depth order is initialized with a uniform distribution over the Mg,t! depth-order hypothesis events when the targets first interact. Assuming that the targets are solid without any holes in the projected image frame, and observing the typical motion trajectories of dependent targets in one interacting group, we find that the depth order of each target does not change significantly; usually, when the depth order of a target changes, the change is +1 or −1. Furthermore, as shown in Fig. 4, the difficulty of changing the depth order is proportional to the probability that the targets are overlapping. Hence, the depth-order transition probability p(Dg,t | Dg,t−1, Xg,t−1, ut) is modeled as a normal distribution with adaptive covariance

$$
p(D_{g,t} \mid D_{g,t-1}, X_{g,t-1}, u_t) \propto \prod_{v} \mathcal{N}\!\left(D_{g,t}(v);\, D_{g,t-1}(v),\, \sigma_D^2(X_{g,t-1})\right) \tag{5}
$$

where Dg,t(v), as the vth element of the depth order Dg,t, is the depth order of target v, $\mathcal{N}(D_{g,t}(v); D_{g,t-1}(v), \sigma_D^2(X_{g,t-1}))$ is the normal distribution of the variable Dg,t(v) with the previous depth order Dg,t−1(v) as its mean, and the adaptive covariance $\sigma_D^2(X_{g,t-1})$ is defined as

$$
\sigma_D^2(X_{g,t-1}) = \sigma_g \frac{1}{M_{g,t-1}-1} \sum_{s=1,\, s\neq v}^{M_{g,t-1}} \left\| x_{g,t-1}^{s} - x_{g,t-1}^{v} \right\| \tag{6}
$$

in which σg is a predefined covariance constant, and $x_{g,t-1}^{v}$ is the estimated state of target v at the previous time t − 1 in the interacting group g. If the targets in one interacting group severely overlap, the covariance becomes small; otherwise, the covariance is large.
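A direct way to realize (5) and (6) is to score every candidate depth-order permutation of a group against the previous one with unnormalized Gaussian factors whose variances follow the mean inter-target distances. In the sketch below, the distance between states is taken as a Euclidean norm over the angular components; that choice of metric, and the value of σg, are assumptions of the sketch rather than details given in the paper.

```python
import itertools
import numpy as np

def depth_order_transition(prev_order, prev_states, sigma_g=0.5):
    """Discrete transition p(D_t | D_{t-1}, X_{t-1}) in the spirit of (5)-(6).

    prev_order  : tuple of depth ranks, e.g. (1, 3, 2), one entry per target
    prev_states : (M, d) array of previous target states; at least two targets
    Returns a dict mapping each candidate permutation to its probability.
    """
    m = len(prev_order)
    # Adaptive variance per target: sigma_g times the mean distance to the others (6).
    var = np.empty(m)
    for v in range(m):
        dists = [np.linalg.norm(prev_states[s, :2] - prev_states[v, :2])
                 for s in range(m) if s != v]
        var[v] = sigma_g * np.mean(dists)

    # Product over targets of N(D_t(v); D_{t-1}(v), var_v), up to a constant (5).
    scores = {}
    for cand in itertools.permutations(range(1, m + 1)):
        diff = np.array(cand) - np.array(prev_order)
        scores[cand] = float(np.prod(np.exp(-diff ** 2 / (2.0 * var + 1e-9))))
    total = sum(scores.values())
    return {k: s / total for k, s in scores.items()}

# Closely spaced targets give small variances, so permutations that change their
# ranks receive little probability mass; spreading the targets out relaxes this.
states = np.array([[0.00, 0.00], [0.02, 0.01], [0.50, 0.40]])
transition = depth_order_transition((1, 3, 2), states)
```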

Finally, the Monte Carlo approximation of the solution of (3) can be expressed through application of the SIR particle filter with a set of weighted samples $\{D_{g,t}^{i}, X_{g,t}^{i}, w_{g,t}^{i}\}_{i=1}^{N_g}$ as

$$
p(D_{g,t}, X_{g,t} \mid u_{0:t}, z_{0:t}) \approx \gamma \sum_{i=1}^{N_g} w_{g,t}^{i}\, \delta\!\left(\begin{bmatrix} D_{g,t} \\ X_{g,t} \end{bmatrix} - \begin{bmatrix} D_{g,t}^{i} \\ X_{g,t}^{i} \end{bmatrix}\right) \tag{7}
$$

where δ(·) is the Dirac delta function, γ is a normalization constant, and the weights $w_{g,t}^{i} = p(z_t \mid X_{g,t}^{i}, D_{g,t}^{i})$ measure the joint image likelihood [5]. Note that the multidimensional sample $[D_{g,t}^{i}, X_{g,t}^{i}]^{T}$ is generated as follows:

$$
\left[D_{g,t}^{i}, X_{g,t}^{i}\right]^{T} \sim \sum_{i=1}^{N_g} w_{g,t-1}^{i}\, p\!\left(D_{g,t}, X_{g,t} \mid D_{g,t-1}^{i}, X_{g,t-1}^{i}, u_t\right) \tag{8}
$$

where $\{D_{g,t-1}^{i}, X_{g,t-1}^{i}, w_{g,t-1}^{i}\}_{i=1}^{N_g}$ are the weighted samples yielded from the approximation of the posterior probability at the previous time.

III. SINGLE CAMERA ACTION DESIGN

After obtaining the target information, the camera action can now be generated. In fact, the action design approach proposed here is one of the basic positioning strategies applied to our multicamera system. Due to the limited FOV of a camera, the pan/tilt motors of the camera platform must be activated to continuously track moving targets. This task can become quite difficult, as a single camera is often insufficient, particularly when the targets start to diverge from the current camera FOV. Further problems arise when the camera action is not designed properly, causing additional tracking errors.


This section aims at a camera action design that keeps the majority of the current targets within future observations. To do this, we first generate hypotheses of the camera action ut+1. The state estimate at the next time instant is predicted as $\hat{X}_{t+1}$ under the influence of each hypothesized camera action ut+1 in the input/output HMM together with the targets' state transition model. According to the sensor model, after the camera action ut+1 has been applied to the motors of the camera platform, the camera can be forecast to capture a predicted image observation $\hat{z}_{t+1}$ with a limited FOV at the new viewpoint. The MI $I(\hat{X}_{t+1}; \hat{z}_{t+1} \mid u_{t+1})$ is utilized to evaluate the relationship between the predicted states of the targets and the predicted image observation $\hat{z}_{t+1}$ with respect to the hypothesized camera action at time t + 1. Finally, the best camera action is the one whose viewpoint produces the predicted image $\hat{z}_{t+1}$ containing the most information about the predicted target estimates in its FOV. This MI is formalized as [13], [14]

$$
I(\hat{X}_{t+1}; \hat{z}_{t+1} \mid u_{t+1}) = \int_{\hat{X}_{t+1}} \int_{\hat{z}_{t+1}} p(\hat{X}_{t+1}, \hat{z}_{t+1} \mid u_{t+1}) \log \frac{p(\hat{X}_{t+1}, \hat{z}_{t+1} \mid u_{t+1})}{p(\hat{X}_{t+1} \mid u_{t+1})\, p(\hat{z}_{t+1} \mid u_{t+1})}\, d\hat{z}_{t+1}\, d\hat{X}_{t+1}. \tag{9}
$$

Since we want to maintain the relationship between targets and images captured by the camera, the optimal camera action $u_{t+1}^{*}$ is selected to maximize MI as

$$
u_{t+1}^{*} = \arg\max_{u_{t+1} \in \mathcal{U}} I(\hat{X}_{t+1}; \hat{z}_{t+1} \mid u_{t+1}) \tag{10}
$$

where $\mathcal{U}$ is the set of admissible actions of the camera, corresponding to feasible steps of the pan and tilt motors. From the definition of the expected value in probability theory and the usage of the chain rule, (9) can be rewritten as

$$
I(\hat{X}_{t+1}; \hat{z}_{t+1} \mid u_{t+1}) = E_{p(\hat{z}_{t+1} \mid \hat{X}_{t+1}, u_{t+1})\, p(\hat{X}_{t+1} \mid u_{t+1})} \left[ \log \frac{p(\hat{z}_{t+1} \mid \hat{X}_{t+1}, u_{t+1})}{p(\hat{z}_{t+1} \mid u_{t+1})} \right] \tag{11}
$$

where $p(\hat{z}_{t+1} \mid \hat{X}_{t+1}, u_{t+1})$ is defined as the "camera likelihood." In addition, $p(\hat{z}_{t+1} \mid u_{t+1})$ can be expanded by using the law of total probability:

$$
p(\hat{z}_{t+1} \mid u_{t+1}) = \int p(\hat{z}_{t+1} \mid \hat{X}_{t+1}, u_{t+1})\, p(\hat{X}_{t+1} \mid u_{t+1})\, d\hat{X}_{t+1}. \tag{12}
$$

To achieve our aim, the camera must capture as many targets as possible in its limited FOV, and positioning the camera so that the captured targets are located near the center of the camera image is the best way to continuously observe these targets. Let $\hat{X}_{\hat{z},t+1}$ be the set of states of the targets within the image plane $\hat{z}_{t+1}$, $M_{\hat{z}_{t+1}}$ be the number of such targets, and $(P_{\hat{z}}^{\theta}(\hat{X}_{\hat{z},t+1}^{i}), P_{\hat{z}}^{\phi}(\hat{X}_{\hat{z},t+1}^{i}))$ be the position of the ith such target in the predicted image plane $\hat{z}_{t+1}$. Then, the camera likelihood function can be defined to reflect the richness of the sensed target information as

$$
p(\hat{z}_{t+1} \mid \hat{X}_{t+1}, u_{t+1}) \propto \exp\!\left( -c_0 \sum_{v \in \hat{X}_{\hat{z},t+1}} \lambda_v - c_1 \left( 1 - \frac{M_{\hat{z}_{t+1}}}{M_t} \right) - c_2 \frac{\bar{P}_{\hat{z}}^{\theta}}{H_{\theta}} \cdot \frac{\bar{P}_{\hat{z}}^{\phi}}{H_{\phi}} \right) \tag{13}
$$

where $c_0$, $c_1$, and $c_2 \in [0,1]$ are the relative weightings, $\lambda_v \in [0,1]$ is the user-defined priority factor for each tracked target, $H_{\theta}$ and $H_{\phi}$ are the half sizes of the image plane in the horizontal and vertical directions of the camera coordinates, respectively, and

$$
\bar{P}_{\hat{z}}^{\theta} = \left| \max_{i} P_{\hat{z}}^{\theta}\!\left(\hat{X}_{\hat{z},t+1}^{i}\right) + \min_{j} P_{\hat{z}}^{\theta}\!\left(\hat{X}_{\hat{z},t+1}^{j}\right) \right| \Big/ 2, \qquad \bar{P}_{\hat{z}}^{\phi} = \left| \max_{i} P_{\hat{z}}^{\phi}\!\left(\hat{X}_{\hat{z},t+1}^{i}\right) + \min_{j} P_{\hat{z}}^{\phi}\!\left(\hat{X}_{\hat{z},t+1}^{j}\right) \right| \Big/ 2. \tag{14}
$$

The second term in the exponential of (13) measures the number of targets in the camera's FOV under a candidate action ut+1, whereas the third term quantifies the proximity of those targets to the center of the image under this action.

Instead of finding an optimal action by exhaustive search over a large discretized moving space $\mathcal{U}$, the Monte Carlo method and the weighted samples from the Bayesian filtering of Section II are exploited here to efficiently evaluate (11). First, we resample $N_c$ weighted samples $\{X_t^{i}, W_t^{i}\}_{i=1}^{N_c}$ from the posterior distribution p(Xt | u0:t, z0:t) approximated in Section II. By the law of total probability and the input/output HMM assumption shown in Fig. 2, the probability distribution $p(\hat{X}_{t+1} \mid u_{t+1})$ can be approximated by samples drawn from

$$
\hat{X}_{t+1}^{i} \sim \sum_{i=1}^{N_c} W_t^{i}\, p\!\left(\hat{X}_{t+1} \mid u_{t+1}, X_t^{i}\right) \tag{15}
$$

with the corresponding weight $\hat{W}_{t+1}^{i} = W_t^{i}$, where $p(\hat{X}_{t+1} \mid u_{t+1}, X_t^{i})$ is the state transition model used to predict the future targets' states relative to the camera action ut+1. Then, we draw the samples $\hat{z}_{t+1}^{i} \sim p(\hat{z}_{t+1} \mid \hat{X}_{t+1}^{i}, u_{t+1})$ from the camera likelihood function. Through these two sampling procedures, the set of weighted samples $\{\hat{X}_{t+1}^{i}, \hat{z}_{t+1}^{i}, \hat{W}_{t+1}^{i}\}_{i=1}^{N_c}$ is appropriately distributed over $p(\hat{X}_{t+1}, \hat{z}_{t+1} \mid u_{t+1})$ in (11). Finally, the expected value of the MI in (11) can be approximated by

$$
I(\hat{X}_{t+1}; \hat{z}_{t+1} \mid u_{t+1}) \approx \sum_{i=1}^{N_c} \hat{W}_{t+1}^{i} \log \frac{p\!\left(\hat{z}_{t+1}^{i} \mid \hat{X}_{t+1}^{i}, u_{t+1}\right)}{p\!\left(\hat{z}_{t+1}^{i} \mid u_{t+1}\right)} \tag{16}
$$

where the term $p(\hat{z}_{t+1}^{i} \mid \hat{X}_{t+1}^{i}, u_{t+1})$ can be measured by (13), and the term $p(\hat{z}_{t+1}^{i} \mid u_{t+1})$ is evaluated by the law of total probability and approximated as

$$
p\!\left(\hat{z}_{t+1}^{i} \mid u_{t+1}\right) \approx \sum_{j=1}^{N_c} \hat{W}_{t+1}^{j}\, p\!\left(\hat{z}_{t+1}^{i} \mid \hat{X}_{t+1}^{j}, u_{t+1}\right). \tag{17}
$$
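To make the evaluation above concrete, the sketch below scores each admissible pan/tilt step with a reduced surrogate of (16): the camera likelihood of (13) is computed on predicted joint particles and the expected log-likelihood is maximized, instead of sampling the observation and forming the full MI ratio with (17). The pixel-scale constant, the rectangular FOV test, the added diffusion noise, and the zeroed priority weight (c0 = 0, the value used in the experiments of Section V) are all assumptions of this sketch, not the paper's models.

```python
import numpy as np

rng = np.random.default_rng(1)

H_THETA, H_PHI = 160.0, 120.0   # half image size in pixels (320x240 frames)
PIX_PER_RAD = 800.0             # assumed angular-to-pixel scale

def project(x, u):
    """Image-plane position of a target state [theta, phi] after pan/tilt action u."""
    return (np.asarray(x[:2]) - np.asarray(u)) * PIX_PER_RAD

def camera_likelihood(X_pred, u, c0=0.0, c1=1.0, c2=0.5, lam=None):
    """Camera likelihood in the spirit of (13); M_t is taken as len(X_pred)."""
    lam = np.ones(len(X_pred)) if lam is None else lam
    pix = np.array([project(x, u) for x in X_pred])
    vis = (np.abs(pix[:, 0]) < H_THETA) & (np.abs(pix[:, 1]) < H_PHI)
    if not vis.any():                       # nothing in the predicted FOV: worst case
        return np.exp(-(c1 + c2))
    p_theta = abs(pix[vis, 0].max() + pix[vis, 0].min()) / 2.0   # midpoints as in (14)
    p_phi = abs(pix[vis, 1].max() + pix[vis, 1].min()) / 2.0
    return np.exp(-c0 * lam[vis].sum()
                  - c1 * (1.0 - vis.sum() / len(X_pred))
                  - c2 * (p_theta / H_THETA) * (p_phi / H_PHI))

def select_action(particles, weights, actions, motion_sigma=0.01):
    """Pick the pan/tilt step maximizing the expected log camera likelihood.

    particles : (Nc, M, 2) joint samples of M targets resampled from the posterior
    weights   : (Nc,) their weights
    This is a reduced surrogate for the sampled MI criterion of (16)-(17).
    """
    preds = particles + rng.normal(0.0, motion_sigma, size=particles.shape)
    scores = [np.sum(weights * np.log([camera_likelihood(Xi, u) for Xi in preds]))
              for u in actions]
    return actions[int(np.argmax(scores))]

# Toy usage: 30 joint particles over 3 targets, a 5x5 grid of admissible steps.
parts = rng.normal(0.0, 0.05, size=(30, 3, 2))
w = np.full(30, 1.0 / 30)
grid = [(dp, dt) for dp in np.linspace(-0.05, 0.05, 5)
        for dt in np.linspace(-0.05, 0.05, 5)]
best_u = select_action(parts, w, grid)
```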


The set of admissible camera actions $\mathcal{U}$ is determined by the precision of the pan/tilt motors and their range of motion during the processing period between two consecutive image frames. The hypotheses of ut+1 can be uniformly selected in $\mathcal{U}$, ranging from a coarse search to a fine search. The resulting suboptimal action $u_{t+1}^{*}$ is then treated as a known input for multitarget tracking at time t + 1 after the command action has been issued.

IV. MULTICAMERA COOPERATION

We now move to the multicamera design, which is required to accomplish effective and continuous wide-area visual monitoring of larger spaces or of a larger number of targets that may easily move out of the FOV of one camera. If the number of active cameras is insufficient to track all targets observed at the current time instant, then the multicamera surveillance system must determine which targets should be tracked by the cameras at the next time instant. Although some targets will have to be dismissed after the actions of the cameras have been set, most targets, particularly those of highest priority, will be continuously tracked with the tracking system proposed in the previous sections. This requires close cooperation between the cameras; otherwise, the overall surveillance capability of the entire set of cameras will not be effectively utilized.

A. Cooperation Strategy

The system assumes that the cameras have been calibrated and that the map of the surveillance space is available. In this scenario, we install K active cameras in the surveillance space, and their placements are known. Each camera is equipped with a computational processing unit, a multitarget tracker, a camera action module, and network communication capability. To maximize the number of targets tracked with fewer active cameras, every camera should try its best to observe targets that are not in another camera's FOV. Thus, instead of tracking one target in 3-D world space with several cameras [1], [21], each camera is asked to individually track multiple targets in its own camera coordinates. Since the 3-D target position cannot be accurately estimated, in contrast to other works [1], [21], the 3-D view line from the 2-D target position in the image plane through the camera's projection center is not used to evaluate the target's 3-D position. Instead, the 3-D view line of a target within one camera's FOV is only utilized to assist in confirming the target correspondence between different cameras.

In this multicamera surveillance setting, the optimal actions of all cameras must be designed to simultaneously maintain the observations of all tracked targets at every command instant. To achieve this, we extend the single-camera action design in (10) to the multicamera scenario. In other words, the set of optimal actions of multiple cameras, denoted as $U_{t+1}^{*}$, should be selected to maximize the MI as

$$
U_{t+1}^{*} = \arg\max_{U_{t+1} \in \mathbb{U}} I(\hat{X}_{t+1}; \hat{Z}_{t+1} \mid U_{t+1}) \tag{18}
$$

where $\mathbb{U}$ is the set of combined feasible moving steps of all K cameras' pan and tilt motors, and $I(\hat{X}_{t+1}; \hat{Z}_{t+1} \mid U_{t+1})$ is the MI that evaluates the dependence between the predicted states of all targets $\hat{X}_{t+1}$ and the predicted image observations $\hat{Z}_{t+1} = \{\hat{z}_{1,t+1}, \ldots, \hat{z}_{K,t+1}\}$ of the multiple cameras after the hypothesized multicamera actions $U_{t+1} = \{u_{1,t+1}, \ldots, u_{K,t+1}\}$ have been executed.


However, selecting the multicamera action from the set $\mathbb{U}$ is a highly complex process. If each camera has F discretized feasible pan-tilt steps, then the K active cameras together have $F^K$ possible joint motions. To reduce this complexity, we propose a hierarchical decision-making strategy for designing the multicamera actions: one camera is selected to design its action from its F feasible steps in each iteration, so that K iterations yield the actions of all cameras while examining only $F \cdot K$ candidate solutions.

At every command instant, each camera transmits to the CM its own orientation status, as given by the encoder readings of its pan-tilt motors, together with its current multitarget tracking result. The duration for which each target has been continuously tracked by each camera is also recorded and sent to the CM. With this information, the CM constructs the correspondence of tracked targets between different cameras based on the targets' geometric locations and visual appearances [10], [28]. As a result, we obtain H distinct targets in the multicamera system. During the process of camera action design, the CM hierarchically selects the cameras and then assigns the targets for which each camera is responsible. The order of camera selection depends on the visual utility of each camera, which describes the visual information in that camera's image observation. The visual utility Ck of camera k is defined as [21], [22]

$$
C_k = \sum_{h=1}^{H} V_{kh} T_{kh} \left[ d_0 P_{kh} + d_1 (1 - I_{kh}) + d_2 L_{kh} \right] \tag{19}
$$

where $d_0$, $d_1$, $d_2 \in [0,1]$ are weighting constants, V is the visibility matrix, T is the task matrix, $P_{kh}$ is the user-defined task priority value that specifies camera k to track target h, $I_{kh}$ is the normalized distance from the target position to the image center of camera k, and L is the duration matrix. The visibility matrix V is defined such that every entry is a binary element $V_{kh}$, where $V_{kh} = 1$ denotes that target h is currently observed and tracked by camera k. The task matrix T, with binary elements $T_{kh}$, indicates whether camera k will be assigned to observe target h during the process of action selection, where 1 indicates assignment. The duration matrix L is defined such that each entry $L_{kh}$ is proportional to the duration over which camera k has continuously sensed target h and is normalized to satisfy $\sum_{k=1}^{K} L_{kh} = 1$, $h = 1, \ldots, H$. The duration term $L_{kh}$ in (19) is included to avoid the situation where the tracking of one target is frequently switched between different cameras when that target is simultaneously observed by several of them; the target tracked by one camera over a long period of time will preferably be kept by that camera, resulting in higher tracking reliability. Clearly, $C_k = 0$ means that camera k does not detect any targets in its own FOV. By sorting the visual utilities of the cameras, the CM then hierarchically assigns the cameras' tasks accordingly.
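As a concrete reading of (19), the visual utility of every camera can be computed in one line from the V, T, P, I, and L matrices. The weighting values and the example matrices below are placeholders; only the functional form follows (19).

```python
import numpy as np

def visual_utilities(V, T, P, I, L, d0=0.4, d1=0.3, d2=0.3):
    """Per-camera visual utility C_k from (19).

    V, T : (K, H) binary visibility and task matrices
    P    : (K, H) user-defined task priorities
    I    : (K, H) normalized distances from each target to each camera's image center
    L    : (K, H) duration matrix, column-normalized so each target's durations
           over the K cameras sum to one
    """
    return np.sum(V * T * (d0 * P + d1 * (1.0 - I) + d2 * L), axis=1)

# Toy example with 2 cameras and 3 targets; the task matrix starts as a copy of V,
# as in the cooperation procedure described below.
V = np.array([[1, 1, 0], [0, 1, 1]])
T = V.copy()
P = np.full((2, 3), 0.5)
I = np.array([[0.2, 0.6, 1.0], [1.0, 0.3, 0.1]])
L = np.array([[1.0, 0.7, 0.0], [0.0, 0.3, 1.0]])
C = visual_utilities(V, T, P, I, L)   # the camera seeing richer views gets a larger C_k
```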


Fig. 5. Two-dimensional illustration of the master camera selection procedure in the camera cooperation strategy. (a) First, Camera 1, which captures more target visual information than Camera 2, is selected and assigned as a master camera. (b) After action design of the selected master camera, Camera 1 cannot cover all targets. The red target will be the focus of Camera 1, and the blue target will be taken over by the unselected camera. Here, Camera 2 will also be assigned as a master camera at a later stage.

Given the matrices V, P, I, and L, the camera selection and task assignment of all cameras are accomplished through the following steps at every command instant (a code sketch of this selection loop follows the list):

0) All cameras act as free cameras to search for targets until at least one target is detected.
1) Initialize the task matrix T = V and the set of unselected camera indices K = {1, . . . , K}.
2) Calculate the visual utility Ck of each camera k ∈ K according to (19).
3) Select the camera κ with nonzero and maximum Ck that has not yet been selected, i.e.,

$$
\kappa = \arg\max_{k \in K} \{ C_k \mid C_k \neq 0 \} \tag{20}
$$

and set this camera as a master camera. In addition, let each target be considered by only one camera, i.e., set Tkh = 0 for k ∈ {K − κ} and h ∈ {1, . . . , H | Tκh = 1}.
4) Set the master camera κ to design its own camera action for monitoring its assigned targets (Tκh = 1). The visibility matrix V is also transmitted to camera κ in binary bit sequence form.
   a) If this master camera cannot complete its task, then it reports the uncovered targets to the CM. An uncovered target h will be taken over by an unselected camera k′ with Vk′h = 1, where k′ ∈ {K − κ}. Set Tκh = 0 and Tk′h = 1 for all k′ ∈ {K − κ | Vk′h = 1}. Fig. 5 illustrates this situation.
   b) If none of the unselected cameras can handle this uncovered target, then roughly estimate the 3-D position of the target and let the CM record it in an uncovered target set S.
5) Remove κ from K. Go to Step 2 until each column of T has only one nonzero element. Regard the remaining unselected cameras in the set K as free cameras.
6) If S ≠ ∅ and the number of free cameras is nonzero, then select free cameras as slave cameras to cover the positions of the entries in S.
7) Randomly select one camera from the remaining free cameras and set it to cruise the unmonitored regions of the surveillance space. Repeat this step until the actions of all cameras have been designed.

The cameras' roles and the cooperation steps mentioned above are reevaluated and updated dynamically based on the newest camera observations of all objects at every command instant.
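Steps 1)-5) amount to a greedy loop: repeatedly pick the unselected camera with the largest nonzero utility, reserve its visible targets for it, and let it design its action. The sketch below only illustrates that assignment logic; it reuses the utility form of (19) with fixed placeholder weights and stubs out the action-design and hand-over sub-steps 4a) and 4b).

```python
import numpy as np

def assign_cameras(V, P, I, L, d=(0.4, 0.3, 0.3)):
    """Greedy hierarchical camera selection (steps 1-5).

    Returns the pruned task matrix T and the master cameras in selection order.
    """
    K, H = V.shape
    T = V.astype(float).copy()                 # step 1: T = V
    unselected = set(range(K))
    masters = []
    while unselected:
        # Step 2: visual utility (19) of every camera not yet selected.
        util = {k: float(np.sum(V[k] * T[k] *
                                (d[0] * P[k] + d[1] * (1 - I[k]) + d[2] * L[k])))
                for k in unselected}
        k_star, best = max(util.items(), key=lambda kv: kv[1])
        if best == 0.0:                        # remaining cameras see nothing: free cameras
            break
        masters.append(k_star)                 # step 3: next master camera
        others = [k for k in unselected if k != k_star]
        for h in range(H):                     # reserve its visible targets for it alone
            if T[k_star, h] == 1:
                T[others, h] = 0
        # Step 4 would run the master's action design here and report uncovered targets.
        unselected.remove(k_star)              # step 5: repeat with the rest
    return T, masters

# Toy run reusing the placeholder matrices from the previous sketch.
V = np.array([[1, 1, 0], [0, 1, 1]])
P = np.full((2, 3), 0.5)
I = np.array([[0.2, 0.6, 1.0], [1.0, 0.3, 0.1]])
L = np.array([[1.0, 0.7, 0.0], [0.0, 0.3, 1.0]])
T, masters = assign_cameras(V, P, I, L)
```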

B. Positioning Strategy

The details of the action design for the master, slave, and free cameras are given below.

1) Master Camera: When the master camera receives the matrices V and T from the CM, its action is designed through the maximization of MI, as proposed in Section III. During the action design process, the master camera κ only considers the targets h ∈ {1, . . . , H | Tκh = 1} currently under its surveillance, neglecting all other targets. The problem that one camera may not cover all of its targets within its limited FOV still exists, but it is alleviated by the use of multiple cooperating cameras. The camera likelihood function (13) is modified as

$$
p(\hat{z}_{t+1} \mid \hat{X}_{t+1}, u_{t+1}) \propto \exp\!\left( -c_0 \sum_{v \in \hat{X}_{\hat{z},t+1}} \lambda_v - c_1 \left( 1 - \frac{M_{\hat{z}_{t+1}} + M_{\bar{z}_{t+1}}}{M_t} \right) - c_2 \frac{\bar{P}_{\hat{z}}^{\theta}}{H_{\theta}} \cdot \frac{\bar{P}_{\hat{z}}^{\phi}}{H_{\phi}} \right) \tag{21}
$$

where $M_{\bar{z}_{t+1}}$ is the number of targets that cannot be covered in the FOV of the predicted image observation $\hat{z}_{t+1}$ of this master camera but are currently visible to other unselected cameras. As shown in Fig. 5, after the action of this master camera κ is decided, the $M_{\bar{z}_{t+1}}$ uncovered targets will be taken over by the unselected cameras. The CM updates the task matrix T and receives the decided action of the master camera κ. However, if the uncovered targets are not visible in the unselected cameras' FOVs, then the 3-D positions of those uncovered targets are estimated and sent back to the CM. Given the mounting location and pan-tilt orientation of the camera that originally observed the target, the 3-D position of the target in the surveillance space can be roughly estimated [26]. Those uncovered targets will be handed over to the slave cameras later.

2) Slave Camera: Once a master camera cannot handle an uncovered target that is not taken over by any unselected camera, the CM must designate a free camera to become a slave camera and track this target. Suppose there are Ms uncovered targets in the set S and Ks free cameras available. The CM evaluates the distance between the estimated position of each uncovered target and the current optical axis of each free camera, as mentioned above, and constructs the distance matrix S, whose entry Sij denotes the distance cost when camera i is assigned to uncovered target j, where i = 1, . . . , Ks and j = 1, . . . , Ms. Based on this distance matrix, the free camera with the shortest cost is assigned as the slave camera to observe the uncovered target. The Hungarian algorithm [6] is applied to find the optimal assignment minimizing the sum of the distance costs. If Ms < Ks, then only Ms cameras will be assigned as slave cameras, and the remaining cameras will continue to act as free cameras in the search process. However, if Ms > Ks, then only the Ks uncovered targets with the minimum sum of distance costs will be handed over to the Ks slave cameras. The slave camera is commanded to move directly to track its target; when the required motion exceeds the feasible moving steps of the camera motors within one command period, this task is maintained over the following time instants.
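The slave-camera dispatch is a standard rectangular assignment problem, so any Hungarian-algorithm implementation can solve it; the sketch below uses SciPy's linear_sum_assignment with a made-up distance matrix simply to show how the Ms < Ks and Ms > Ks cases fall out of the same call.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Rows: Ks free cameras; columns: Ms uncovered targets. Each entry is the distance
# cost between a camera's current optical axis and a target's estimated 3-D position.
S = np.array([[2.0, 5.5, 3.1],
              [4.2, 1.0, 6.3]])    # Ks = 2 free cameras, Ms = 3 uncovered targets

cam_idx, tgt_idx = linear_sum_assignment(S)   # Hungarian assignment, minimal total cost
# Here Ms > Ks, so only Ks targets are covered; with Ms < Ks the spare cameras stay free.
slave_tasks = dict(zip(cam_idx.tolist(), tgt_idx.tolist()))   # camera index -> target index
```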

3) Free Camera: Once the actions of the master and slave cameras for tracking all of the currently observed targets while maximizing the MI [see (18)] have been decided, the remaining cameras become "free." Nevertheless, we still need to organize the free cameras effectively to monitor newly entering targets or targets whose tracking has been cut off. Following the traditional surveillance strategy of a pan-tilt camera that periodically cruises along a predetermined path, we predefine check points along the cruising path of each camera. Once a camera is set to act as a free camera, it controls its view to move along the sequence of check points on its cruising path. Furthermore, the FOVs of the free cameras are spread over the unobserved regions of the surveillance space. We define the valuable sensing region of each camera as a viewing frustum model [11], which is a rectangular pyramid extending from the camera's optical center. The CM also sends the actions of the other cameras to the free cameras. In addition to cruising, the free cameras design their actions so as to keep their viewing frustums from overlapping with those of the other cameras. Instead of applying an exhaustive search for the optimal action of a free camera, we utilize the principle of importance sampling [17] around the check points to speed up the decision process.

C. Simulation Results

The epipolar geometry toolbox [15] is employed to demonstrate the proposed multicamera cooperation algorithm. In these simulation scenarios, the surveillance covers a space of 10 m × 12 m × 3 m (height) with three pan-tilt cameras set up, as shown in Fig. 6. We assume that all of the target information in each image frame has been extracted and sent to the CM. The time instants of each camera's observation and action within the surveillance system are synchronized by the CM. The time interval between two contiguous command instants of the overall system is set to 0.1 s, the FOV of each camera is fixed to 8° horizontally and 6° vertically, and the available sensing distance is within 13 m of the camera lens. Each camera can move its pan/tilt motors with a maximum speed of 30 deg/s, and the total allowable moving ranges of the pan and tilt motors of each camera are both set to 90°.

Fig. 6. Simulation scene. The solid dots represent the targets. The cruising path of each camera's viewing direction is drawn on the horizontal ground plane.

Ten sets of simulation data with different target-motion patterns were generated, where each motion pattern lasts for 30 s. Using these patterns and the settings mentioned above, we compare five multicamera moving strategies:

1) MAC: the multiple-active-camera cooperation strategy with the multitarget visual tracker, as proposed in Section IV.

2) Free: all cameras act as free cameras, which collaboratively explore the entire surveillance space with the target-searching strategy mentioned in the previous section throughout the whole simulation. The cruising path of each camera is predefined, as shown in Fig. 6.
3) Random: each camera independently searches the surveillance space with random actions.
4) Static: the cameras are fixed with maximum coverage and do not overlap with each other [11].
5) Focus: the multicamera cooperation algorithm of [1], which dispatches at least one camera to concentrate on each target; a camera that is not responsible for any target randomly searches for a new target.

The initial positions of the cameras in each strategy are the same as those in the static case. To display the performance of each strategy, 100 Monte Carlo simulations were run for each multicamera motion strategy versus each target-motion pattern over 300 command instants. We define the continuous tracking time and the total tracking time for a sensed target as the surveillance performance measures of each multicamera system, where the continuous tracking time denotes the duration, in number of command instants, over which an identical target is persistently tracked by the whole multicamera system, and the total tracking time is evaluated by accumulating the number of command instants over which a specified target is observed by the whole multicamera system throughout the simulation. The total tracking time can also be regarded as the sensing or detection capability of a surveillance system. Since there are multiple targets in the scenario, we present the mean and the maximum values of the aforementioned continuous/total tracking times over a group of targets as our performance measures.

Fig. 7 shows the result of comparing the continuous/total tracking times among the five strategies for the case where three cameras track three targets, and Fig. 8 illustrates similar data for the case where three cameras are set to track six targets. The static camera strategy with maximum coverage is regarded as the baseline contribution of a multicamera surveillance system. The random strategy shows the worst performance in each quality index.


Fig. 7. Comparisons among the five strategies (MAC, free, random, static, focus) for the case where three cameras track three targets. (a) Mean of continuous tracking time. (b) Max of continuous tracking time. (c) Mean of total tracking time. (d) Max of total tracking time.

Fig. 8. Comparisons among the five strategies (MAC, free, random, static, focus) for the case where three cameras track six targets. (a) Mean of continuous tracking time. (b) Max of continuous tracking time. (c) Mean of total tracking time. (d) Max of total tracking time.

Although the free strategy with our proposed cruising algorithm operates similarly to the random strategy and lacks continuous tracking capability, it provides much better sensing efficiency in terms of the total tracking time. The focus strategy and our proposed MAC cooperation strategy, both with well-designed camera maneuvers, show better results in the continuous/total tracking times than the static camera scenario. In particular, in the ninth and tenth target-motion patterns, where the targets' trajectories are purposely set far from the coverage of the static cameras, the static cameras simply cannot detect the targets throughout the simulation. Because targets cannot be detected by an inefficient search strategy with random camera motion, the focus strategy also shows poor performance in the seventh and eighth target-motion patterns of Fig. 7. Furthermore, the focus strategy usually shows the best performance in the maximum value of the continuous/total tracking time, since it is designed to always concentrate on one particular target. Nevertheless, our proposed multitarget surveillance system with MAC outperforms the focus strategy in the mean value of the continuous/total tracking time over a group of targets, at the cost of not observing the detailed behavior of a single target.

V. EXPERIMENTS

In the following experiments, we test our proposed visual surveillance system with the image sequences processed online and the pan-tilt camera platforms controlled in real time. The image size of all frames is set to 320 × 240 pixels, and the pan-tilt camera platforms are Directed Perception PTU-D46 units. The state of each target is defined as $x_{m,t} = [\theta, \phi, s]^T$, where (θ, φ) is the position in the camera's spherical coordinates, and s is the target's size. For the state transition model, the predicted state is assumed to be normally distributed around the original state [2], i.e.,

$$
x_{m,t+1} = x_{m,t} - u_t + [\Delta\theta, \Delta\phi, \Delta s]^T \tag{22}
$$

where $[\Delta\theta, \Delta\phi, \Delta s]^T \sim [\mathcal{N}(0, \sigma_\theta^2), \mathcal{N}(0, \sigma_\phi^2), \mathcal{N}(0, \sigma_s^2)]^T$. The image likelihood evaluation for updating the sample weights considers a combination of a contour model, a texture template model, and a homogeneous color model [3], [5], [25]. Since the appearance and posture of a target change over time, the templates must be updated accordingly; to do this, we apply the Kalman filter update strategy [24] to the target texture and color templates. When the targets are independent, each target is tracked by one particle filter with 30 particles. Moreover, during the stage of evaluating the MI [see (16)] for a camera's action, 30 particles are resampled from the targets' posterior estimates of the multitarget tracking.
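Equation (22) is a constant-position model expressed in camera coordinates: the commanded pan/tilt step is subtracted from the angular components and Gaussian diffusion is added. A minimal sketch follows; the noise standard deviations are illustrative placeholders, not the values used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(2)

def predict_state(x_prev, u_t, sigma=(0.02, 0.02, 1.5)):
    """One-step prediction of x = [theta, phi, s] following (22).

    The pan/tilt command u_t shifts the angular components; the size s only diffuses.
    """
    shift = np.array([u_t[0], u_t[1], 0.0])
    return np.asarray(x_prev, dtype=float) - shift + rng.normal(0.0, sigma)

x_next = predict_state([0.10, -0.05, 40.0], u_t=(0.02, 0.00))
```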

A. Multitarget Tracking With an Active Camera

First, we examine the performance of the multitarget tracker by tracking multiple tennis balls with a single pan-tilt camera. The homogeneous appearance of these targets makes them difficult to distinguish once they get close to one another. The color of the tennis ball is used to detect an entering target and acts as the importance function of the SIS particle filter for independent target tracking.

Fig. 9. Multitarget tracking with an active camera.

The relative weightings of the camera likelihood [see (13)] are assigned as c0 = 0, c1 = 1, and c2 = 0.5. The priority factor λv of each target is set to 1, making all targets equally important. Fig. 9 shows several snapshots taken while tracking the tennis balls as they overlap with one another. The proposed visual tracker does not have memory or recognition capabilities, so a target that has left the camera's FOV is regarded as a new target if it returns.

The computational complexity of the particle filter for tracking a single target is proportional to O(N), where N is the number of particles. For tracking M independent targets, the computational complexity is O(M · N) without any reduction. However, when these targets interact at the same time, the computational complexity increases to O(M! · M · N) because of the need to estimate the joint target state and the auxiliary depth-order hypotheses. As M increases, the growth rate of the factorial M! surpasses that of any polynomial or exponential function of M. Fortunately, the number of simultaneously overlapping targets in general surveillance is quite low. Recall that along the typical motion trajectories of dependent targets in one interacting group, the depth order does not change significantly; hence, depth hypotheses are only proposed for targets newly joining an interacting group, and the depth order of the already present targets, which seldom changes, is retained. The computational complexity O(M! · M · N) can then be reduced to O(M² · N) with this modification, and the reduction becomes more significant when M ≥ 3.

As shown in Fig. 9, the multitarget visual tracking system described in Section II successfully overcomes the occlusions among the dependent targets and distinguishes them from one another. We also note that the pan-tilt camera always attempts to capture these targets at the same time through the action design that maximizes the coverage of all tracked targets. For example, Frame #152 to Frame #206 in Fig. 9 show the tennis ball enclosed in the magenta circle moving away from the other two, such that the camera is not able to observe all three tennis balls in its limited FOV. The camera action under which the camera can cover both of the targets labeled in red and brown attains a higher value of the camera likelihood function [see (13)] than the other actions. This higher camera likelihood dominates the maximization of the MI [see (16)], so the camera drops the magenta target. Similarly, the red and brown interacting targets are

continuously tracked and successfully distinguished even when the third target is in motion.

The multiple-object tracking accuracy (MOTA) and the multiple-object tracking precision (MOTP) for region-defined targets [29] are evaluated to assess the performance of the proposed tracking system. The MOTA accounts for missed detections, false alarms, and the number of inconsistent labelings of the same target, relative to the number of ground-truth objects in the image sequence. The MOTP gives the precision ratio by computing the overlap between the tracked region and the corresponding ground-truth object in the image sequence, whether the objects are independent or merged. The MOTA of the depth order (MOTA-D) employs the definition of MOTA to judge the accuracy of the estimated depth order between interacting targets. We also define the multiple-object servoing precision (MOSP) to evaluate the camera action design unit. The desired viewpoint $I_v$ in image space for monitoring multiple targets at the same time can be denoted as $I_v = (\bar{P}_{z_t}^{\theta}, \bar{P}_{z_t}^{\phi})$, where $\bar{P}_{z_t}^{\theta}$ and $\bar{P}_{z_t}^{\phi}$ are defined as in (14) but with respect to the current image and target states. Since the designed action is meant to place the desired viewpoint at the center of the image in every frame, the MOSP, in pixel units, can be defined as

$$
\mathrm{MOSP} = \sum_{t=1}^{N_{\mathrm{frames}}} |I_v - I_c| \,/\, N_{\mathrm{frames}} \tag{23}
$$

where $I_c$ denotes the center of the image, and $N_{\mathrm{frames}}$ is the number of frames in the experiment. Here, the MOTA, MOTP, MOTA-D, and MOSP are 0.98, 0.87, 0.98, and 9 pixels, respectively.

B. Multitarget Tracking With Multicamera Cooperation

In the experiments on multicamera cooperation, the overall surveillance system consists of one CM and two distributed cameras. Here, we regard the human head as the target to be tracked. Although the 2-D projection of the outline of the human head is not rigid when a person turns his or her head, we can still use an elliptical contour model [7] to find the approximate contour along the edges of the image. With the assistance of the target image model update, we are able to reliably track even a nonrigid target, given that the appearance and the 2-D projection of the target do not vary too quickly over time.

Fig. 10. Layout of the experimental environment. The blue region is the available sensing range of Camera 1, the yellow region is that of Camera 2, and the green region is their intersection. The dotted region describes the maximum coverage of one camera at a fixed position in the horizontal plane. The dotted and dashed lines indicate the cruising paths of Cameras 1 and 2, respectively.

Fig. 11. Relationship between each target and camera (blank: unobserved; color: observed).

The color of each ellipse encapsulating the silhouette of the human head is used to distinguish between different targets. As before, a target that leaves the surveillance region of the entire system is treated as a new target upon return. To detect a person, we use skin color with some feasible constraints [4] to draw samples for the SIS particle filter.

Fig. 10 roughly illustrates the layout of the experimental environment and the available sensing range of each camera. Notice that the union of the cameras' FOVs when both are fixed at a certain position does not cover the whole experimental environment, which justifies the necessity of the pan-tilt motion of the cameras. Since a time delay may exist in the communication and the processing times of the distributed cameras are difficult to synchronize, the CM will not always be able to decide the action of each camera at every time instant in a practical application. That is, although each camera continues to transmit the target information and its own position to the CM, the CM only uses the newest information to coordinate the multicamera cooperation process. Each camera system continues to design its own actions based on the current image observation and the role (master/slave/free) previously assigned to it by the CM. As soon as the camera system receives notice from the CM that it has been assigned a new role and task, it updates its action design methodology accordingly. As a result, the whole multicamera surveillance system achieves near real-time performance. In the following experiments, the overall system, with one central server and two camera clients, tracks three targets at approximately 15 frames/s.

Fig. 11 depicts the relationship between each target and the cameras that are observing this target over time. The camera roles assigned by the CM are indicated in Fig. 12, whereas Fig. 13 shows several corresponding snapshots of the two cameras at the same frame instants, as recorded by the CM. Initially, each camera searches for a target along its predefined cruising path, as shown in Fig. 10. As soon as it detects a target, it starts to track it. Through the target correspondence at Frame #52, the CM determines that there are two different targets, so both cameras are assigned as master cameras to track their sensed targets.

Fig. 12. Variation of camera role (master/slave/free).

Fig. 11 depicts the relationship between each target and the cameras observing it over time. The camera roles assigned by the CM are indicated in Fig. 12, whereas Fig. 13 shows several corresponding snapshots of the two cameras at the same frame instants, as recorded by the CM. Initially, each camera searches for a target along its predefined cruising path, as shown in Fig. 10. As soon as a camera detects a target, it starts to track it. Through the target correspondence at Frame #52, the CM determines that there are two different targets, so both cameras are assigned as master cameras to track their sensed targets. At Frame #250, the person labeled with a red ellipse in both cameras is identified as the same person. At this moment, the CM sets Camera 2 as the master camera to track both targets at the same time. Camera 1 is set as the free camera, whereas Camera 2 continues to capture both targets simultaneously through action design that maximizes the MI [see (10)]. Although both targets are observed several times by Camera 1 during its cruising stage, the visual utility [see (19)] of Camera 1 remains lower than that of Camera 2. At Frame #395 of Camera 2 in Fig. 13, a new person enters the environment. Since this new person is outside the available sensing range of Camera 2, the master camera (Camera 2) is not responsible for monitoring her. Because each camera can track multiple targets at the same time, the free camera, which is currently not responsible for tracking a specific target, is able to detect and track this new target, as shown in Fig. 13 (Frame #495 of Camera 1).

To evaluate the performance of this multicamera system, we modify the MOTA introduced in the previous experiment. The original MOTA is evaluated over the ground-truth objects that appear in a single image sequence. Here, the MOTA in a surveillance space (MOTA-S) is defined to evaluate the multitarget tracking accuracy over all people in the room. In addition, the MOTP of the whole surveillance system compares each tracked target with its corresponding ground truth in the image from the camera assigned to track that target. The MOSP for each camera platform accumulates the precision only over the frames in which that camera is assigned as the master camera. From the comparison in Table I, the MOTA-S and MOTP of the whole system are better than those obtained when Cameras 1 and 2 operate individually. The MOTP here is lower than in the previous ball-tracking experiment because a human head with long hair is harder to label precisely and the environment is more complex. These experimental results demonstrate the reliability and effectiveness of the proposed system with multiple cooperative active cameras.
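As a rough illustration of how MOTA-S differs from the per-sequence MOTA, the sketch below pools the tracks reported by all cameras and counts misses and false positives against every person present in the room. The data layout is an assumption, and the identity-switch term of the standard MOTA is omitted for brevity, so this is a simplified reading rather than the paper's exact bookkeeping.

```python
# Simplified sketch of MOTA-S: errors are counted against all people in the
# surveillance space, with the tracking output of every camera pooled together.
# The identity-switch term of standard MOTA is omitted here for brevity.
def mota_s(gt_per_frame, tracked_per_frame):
    """
    gt_per_frame     : list of sets -- IDs of all people in the room, per frame
    tracked_per_frame: list of sets -- IDs the whole camera network reports as
                       tracked in the same frames (pooled over all cameras)
    """
    misses = false_pos = gt_total = 0
    for gt_ids, tracked_ids in zip(gt_per_frame, tracked_per_frame):
        gt_total += len(gt_ids)
        misses += len(gt_ids - tracked_ids)      # people no camera is tracking
        false_pos += len(tracked_ids - gt_ids)   # reported tracks with no person
    return 1.0 - (misses + false_pos) / max(gt_total, 1)

# Toy usage: three frames, two people, one briefly untracked in the second frame.
print(mota_s([{1, 2}, {1, 2}, {1, 2}], [{1, 2}, {1}, {1, 2}]))
```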



Fig. 13. Three targets tracking with the cooperation of two cameras.

TABLE I. Performance Evaluation for the First Multicamera Experiment.

Fig. 14. Relationship between each target and camera (blank: unobserved; color: observed).

Fig. 15. Variation of camera role (master/slave/free).

In the next experiment, a more complicated scenario in the same environment is used to demonstrate the collaboration among multiple active cameras. Once some of the tracked targets leave the tracking camera's FOV, the cooperation mechanism is activated so that other cameras can track or search for these targets. The relationship showing which target appears in which camera's FOV is plotted in Fig. 14. The camera roles assigned by the CM over time are shown in Fig. 15, whereas Fig. 16 illustrates several corresponding snapshots of the captured image sequences. From Frame #265 to Frame #351, Camera 1, which has the higher visual utility [see (19)], acts as a master camera to monitor both targets. When target occlusion occurs at Frame #306 of Camera 1, the multitarget visual tracker successfully overcomes it. Camera 2 moves along its predefined cruising path without concentrating on targets that are already being tracked by Camera 1.

At Frame #493, Camera 1 is about to lose one target due to its limited FOV. The target labeled with the magenta ellipse is released, since it also appears in the FOV of Camera 2. Camera 1 then hands over the tracking responsibility for this target to Camera 2, and both cameras are assigned as master cameras by the CM. After Frame #753, Camera 1, which is tracking two targets, is set as the master camera, and Camera 2 pans right to search for new targets. The target-loss problem occurs again at Frame #798 for Camera 1; in this case, however, none of the targets in Camera 1's FOV appear in another camera's FOV. Camera 1 decides to give up the target labeled with the green ellipse, and the CM reassigns Camera 2 as the slave camera to take over monitoring of this target. Camera 2 immediately pans left to the estimated location of this newly assigned target and maintains this task for several frames. Camera 2 detects a new target at Frame #824 and confirms it as the assigned target. The slave task of Camera 2 is then complete, and Camera 2 is reassigned as a master camera. Table II summarizes each performance index. The experimental results show that the proposed multicamera cooperation algorithm yields a dynamic and tightly collaborative surveillance system, which is particularly applicable to a space wider than the union of the FOVs of all available cameras.
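The handover and slave-takeover decisions exercised in this experiment can be summarized by the sketch below, which follows the two cases described above: hand the target to a camera that already observes it, or otherwise send a camera that can reach the target's estimated position as a slave. All class and field names are hypothetical stand-ins for the CM's internal bookkeeping, not the paper's actual interfaces.

```python
# Hedged sketch of the CM-side handover decision illustrated by this experiment.
from dataclasses import dataclass, field

@dataclass
class Target:
    name: str
    estimated_position: tuple  # estimated location in the room

@dataclass
class Camera:
    name: str
    visible: set = field(default_factory=set)    # target names currently in FOV
    reachable: set = field(default_factory=set)  # positions the pan-tilt FOV can reach
    role: str = "free"
    task: Target = None

def handle_losing_target(cameras, losing_camera, target):
    """Decide what to do when `losing_camera` is about to lose `target`."""
    # Case 1: another camera already observes the target -> direct handover,
    # and that camera (also) acts as a master for this target.
    for cam in cameras:
        if cam is not losing_camera and target.name in cam.visible:
            cam.role, cam.task = "master", target
            return f"handover to {cam.name}"
    # Case 2: no camera sees it -> send a camera whose reachable range covers
    # the target's estimated position there as a slave.
    for cam in cameras:
        if cam is not losing_camera and target.estimated_position in cam.reachable:
            cam.role, cam.task = "slave", target
            return f"{cam.name} reassigned as slave"
    return "target given up by the whole system"

# Toy run mirroring Frame #798: Camera 1 loses the green target, nobody sees it,
# but Camera 2 can reach its estimated position, so Camera 2 becomes the slave.
green = Target("green", estimated_position=("left", "zone"))
cam1 = Camera("Camera 1", visible={"magenta"})
cam2 = Camera("Camera 2", reachable={("left", "zone")})
print(handle_losing_target([cam1, cam2], cam1, green))
```

Preferring a camera that already sees the target keeps the handover seamless; the slave role is only invoked when no current observation exists and a camera must be steered to a predicted position first.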



Fig. 16. Handover of tracked targets with cooperation between two cameras.

TABLE II. Performance Evaluation for the Second Multicamera Experiment.

VI. CONCLUSION

In this paper, we have presented a tracking-based multitarget visual surveillance system with tight cooperation among multiple active cameras and near real-time performance. Each camera platform is designed as a distributed vision system capable of tracking a crowd of targets simultaneously within its limited FOV. The problem of tracking multiple overlapping targets is solved by extending the SIR particle filter with a depth-order estimate. The suboptimal camera action for covering multiple targets simultaneously is then designed by maximizing the MI and is efficiently evaluated by the Monte Carlo method. The experiments presented in this paper show that the proposed visual tracking function on each distributed camera successfully tracks multiple targets simultaneously. Moreover, building on these distributed vision systems, the multicamera cooperation objective is to use fewer cameras to track more targets and to observe a larger space. Under this objective, the sets of targets observed by the individual cameras should overlap as little as possible. A hierarchical camera selection and task assignment algorithm is described for efficient cooperation among the distributed cameras. Cameras are categorized as master when tracking existing targets, slave when taking over lost targets, and free when searching for new targets. The positioning strategy of each category has also been defined. Through this style of cooperation, the surveillance area can be made as wide as possible with the fewest cameras while maintaining a high level of performance. The efficiency of this multicamera surveillance system has been verified in both simulation and experiment.

When the number of interacting targets increases, the number of particles needed to estimate the joint state of multiple targets also increases. The conventional particle filter may then suffer from sample degeneration and require a very large number of particles to complete its high-dimensional estimation. Choosing appropriate camera mounting locations in advance, such as installing cameras on ceilings to track people [30], could alleviate the occlusion problem and reduce the number of overlapping targets. Improved particle filters, such as those with partitioned sampling [20], Markov chain Monte Carlo sampling [2], or particle swarm optimization [27], could also be applied to further balance tracking robustness against computational complexity.

In the future, we aim to extend our system to more complex practical scenarios and to reduce the overall computational time. The target model will be defined as the whole human body, and a human detector will also be incorporated. The distributed vision system equipped with a pan-tilt platform can be further implemented on a mobile robot. Seamless tracking can then be fully realized to monitor activities across the entire surveillance region or to support human–machine interaction in an intelligent space.

REFERENCES

[1] T. Matsuyama and N. Ukita, “Real-time multitarget tracking by a cooperative distributed vision system,” Proc. IEEE, vol. 90, no. 7, pp. 1136–1150, Jul. 2002.
[2] Z. Khan, T. Balch, and F. Dellaert, “MCMC-based particle filtering for tracking a variable number of interacting targets,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 11, pp. 1805–1819, Nov. 2005.
[3] M. Isard and A. Blake, “CONDENSATION—Conditional density propagation for visual tracking,” Int. J. Comput. Vis., vol. 29, no. 1, pp. 5–28, Aug. 1998.
[4] M. Isard and A. Blake, “ICONDENSATION: Unifying low level and high level tracking in a stochastic framework,” in Proc. 5th Eur. Conf. Comput. Vis., Freiburg, Germany, 1998.
[5] C. Rasmussen and G. D. Hager, “Probabilistic data association methods for tracking complex visual objects,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 6, pp. 560–576, Jun. 2001.
[6] R. E. Burkard, M. Dell’Amico, and S. Martello, Assignment Problems. Philadelphia, PA: SIAM, 2009.
[7] H. Yoon, D. Kim, S. Chi, and Y. Cho, “A robust human head detection method for human tracking,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Oct. 2006, pp. 4558–4563.
[8] F. Chaumette and S. Hutchinson, “Visual servo control. I. Basic approaches,” IEEE Robot. Autom. Mag., vol. 13, no. 4, pp. 82–90, Dec. 2006.
[9] B. A. Stancil, C. Zhang, and T. Chen, “Active multicamera networks: From rendering to surveillance,” IEEE J. Sel. Topics Signal Process., vol. 2, no. 4, pp. 597–605, Aug. 2008.
[10] S. Khan and M. Shah, “Consistent labeling of tracked objects in multiple cameras with overlapping fields of view,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 10, pp. 1355–1360, Oct. 2003.
[11] Y. Yao, C. H. Chen, B. Abidi, D. Page, A. Koschan, and M. Abidi, “Can you see me now? Sensor positioning for automated and persistent surveillance,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 40, no. 1, pp. 101–115, Feb. 2010.
[12] Y. Li and B. Bhanu, “A comparison of techniques for camera selection and handoff in a video network,” in Proc. ACM/IEEE Int. Conf. Distrib. Smart Cameras, Aug. 2009, pp. 1–8.
[13] J. Denzler and C. M. Brown, “Information theoretic sensor data selection for active object recognition and state estimation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 2, pp. 145–157, Feb. 2002.
[14] T. Vercauteren, D. Guo, and X. Wang, “Joint multiple target tracking and classification in collaborative sensor networks,” IEEE J. Sel. Areas Commun., vol. 23, no. 4, pp. 714–723, Apr. 2005.
[15] G. L. Mariottini and D. Prattichizzo, “EGT for multiple view geometry and visual servoing: Robotics vision with pinhole and panoramic cameras,” IEEE Robot. Autom. Mag., vol. 12, no. 4, pp. 26–39, Dec. 2005.
[16] K. Murphy, “Dynamic Bayesian networks: Representation, inference and learning,” Ph.D. dissertation, Dept. Comput. Sci., Univ. of California, Berkeley, Berkeley, CA, 2002.
[17] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, “A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking,” IEEE Trans. Signal Process., vol. 50, no. 2, pp. 174–188, Feb. 2002.
[18] J. Xue, N. Zheng, J. Geng, and X. Zhong, “Tracking multiple visual targets via particle-based belief propagation,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 1, pp. 196–209, Feb. 2008.


[19] P. Pérez and J. Vermaak, “Bayesian tracking with auxiliary discrete processes. Application to detection and tracking of objects with occlusions,” in Proc. ICCV Workshop Dyn. Vis., 2005, pp. 190–202.
[20] J. MacCormick and A. Blake, “A probabilistic exclusion principle for tracking multiple objects,” Int. J. Comput. Vis., vol. 39, no. 1, pp. 57–71, Aug. 2000.
[21] R. T. Collins, A. J. Lipton, H. Fujiyoshi, and T. Kanade, “Algorithms for cooperative multisensor surveillance,” Proc. IEEE, vol. 89, no. 10, pp. 1456–1477, Oct. 2001.
[22] A. Bakhtari and B. Benhabib, “An active vision system for multitarget surveillance in dynamic environments,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 37, no. 1, pp. 190–198, Feb. 2007.
[23] F. Z. Qureshi and D. Terzopoulos, “Multi-camera control through constraint satisfaction for persistent surveillance,” in Proc. IEEE Conf. Adv. Video Signal Based Surveillance, 2008, pp. 211–218.
[24] C. Haworth, A. M. Peacock, and D. Renshaw, “Performance of reference block updating techniques when tracking with the block matching algorithm,” in Proc. IEEE Int. Conf. Image Process., 2001, vol. 1, pp. 365–368.
[25] P. Pérez, J. Vermaak, and A. Blake, “Data fusion for visual tracking with particles,” Proc. IEEE, vol. 92, no. 3, pp. 495–513, Mar. 2004.
[26] A. J. Davison, “Real-time simultaneous localisation and mapping with a single camera,” in Proc. IEEE Int. Conf. Comput. Vis., 2003, vol. 2, pp. 1403–1410.
[27] X. Zhang, W. Hu, S. Maybank, X. Li, and M. Zhu, “Sequential particle swarm optimization for visual tracking,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recog., 2008, pp. 1–8.
[28] B. Song and A. K. Roy-Chowdhury, “Robust tracking in a camera network: A multi-objective optimization framework,” IEEE J. Sel. Topics Signal Process., vol. 2, no. 4, pp. 582–596, Aug. 2008.
[29] R. Kasturi, D. Goldgof, P. Soundararajan, V. Manohar, J. Garofolo, R. Bowers, M. Boonstra, V. Korzhova, and Z. Jing, “Framework for performance evaluation of face, text, and vehicle detection and tracking in video: Data, metrics, and protocol,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 319–336, Feb. 2009.
[30] N. Takemura and J. Miura, “View planning of multiple active cameras for wide area surveillance,” in Proc. IEEE Int. Conf. Robot. Autom., 2007, pp. 3173–3179.


Cheng-Ming Huang received the B.S. degree from the National Chiao Tung University, Hsinchu, Taiwan, in 2000, the M.S. degree from the National Cheng Kung University, Tainan, Taiwan, in 2002, and the Ph.D. degree from the National Taiwan University, Taipei, Taiwan, in 2009. Since 2009, he has been a Postdoctoral Researcher with the Department of Electrical Engineering, National Taiwan University. His research interests include computer vision, visual tracking, visual servoing, stochastic control, and multicamera cooperation.

Li-Chen Fu (S’85–M’88–SM’02–F’04) received the B.S. degree from the National Taiwan University, Taipei, Taiwan, in 1981 and the M.S. and Ph.D. degrees from the University of California, Berkeley, in 1985 and 1987, respectively. Since 1987, he has been a member of the faculty and is currently a Full Professor with the Department of Electrical Engineering and the Department of Computer Science and Information Engineering, National Taiwan University. He currently serves as Editor-in-Chief of the Asian Journal of Control. His research interests include robotics, smart home, visual detection and tracking, intelligent vehicles, production scheduling, virtual reality, and control theory and applications. Dr. Fu was invited to serve as a Distinguished Lecturer of the IEEE Robotics and Automation Society during 2004–2005 and in 2007. He has received numerous academic recognitions, such as the Distinguished Research Awards from the National Science Council, Taiwan, the Irving T. Ho Chair Professorship, and elevation to IEEE Fellow in 2004. He was also awarded the Lifetime Distinguished Professorship by the National Taiwan University in 2007.