
Proceedings of the 2009 IEEE International Conference on Mechatronics. Málaga, Spain, April 2009.

Cooperative localization and tracking with a camera-based WSN

J. M. Sánchez-Matamoros, J. R. Martínez-de Dios and A. Ollero
Robotics, Vision and Control Group, Universidad de Sevilla
Escuela Superior de Ingenieros, Avda. de los Descubrimientos s/n, Seville, Spain
{jmatamoros, jdedios, aollero}@cartuja.us.es

Abstract—This paper presents a vision-based system for cooperative object detection, localization and tracking using Wireless Sensor Networks (WSNs). The proposed system exploits the distributed sensing capabilities, communication infrastructure and parallel computing capabilities of the WSN. To reduce the bandwidth requirements, the captured images are processed at each camera node in order to extract the location of the object on each image plane, which is then transmitted through the WSN. The measures from all the camera nodes are processed by means of sensor fusion techniques such as Maximum Likelihood (ML) and the Extended Kalman Filter (EKF). The paper describes hardware and software aspects and presents some experimental results.

Keywords: cooperative perception, Wireless Sensor Networks, Extended Kalman Filter

I. INTRODUCTION

The growing need for security has increased research in surveillance systems and, in particular, in those based on imaging sensors. The advantages of observing the same object from different views justify the high interest in multi-camera systems, which for instance allow developing cooperative surveillance strategies that exploit the complementarities of fixed cameras at different locations [1] and/or cameras on mobile platforms such as ground or aerial robots [2]. These cooperative surveillance systems can be centralized or decentralized from the processing point of view. Decentralized systems have clear advantages, such as higher reliability and parallel processing [3]. These decentralized cooperative surveillance systems require a specific communication infrastructure; issues such as communication errors and delays, medium access and routing protocols, among others, should be carefully addressed.

Wireless sensor networks (WSNs) are a growing technology with a large number of potential applications. A WSN consists of a large number of spatially distributed low-cost and low-consumption devices (nodes) with computing, perception and communication capabilities. WSNs usually provide low-bandwidth communications but, on the other hand, they offer significant flexibility, scalability and tolerance to failures of nodes and sensors. WSNs use network formation and packet routing algorithms that allow flexible and adaptive topology self-reconfiguration [4].


The nodes of a WSN can be equipped with a growing variety of sensors, including light intensity sensors, optical barriers, presence sensors, gas sensors and GPS. Also, the standardization of communication protocols for WSNs, such as IEEE 802.15.4 and ZigBee, has facilitated extending their range of applications and has recently attracted significant research and development efforts. WSNs have been applied to building control [5], environmental monitoring [6] and manufacturing automation [7], among others.

The method proposed in this paper uses a WSN of camera sensors for object detection, localization and tracking. Although WSNs have been applied to localization and tracking using Received Signal Strength Indication (RSSI) or time of flight (TOF) [8], camera-based WSNs have scarcely been applied. Also, methods based on RSSI and TOF require that the moving object transmits at a frequency suitable for the WSN, typically by attaching a mobile WSN node to the moving object [9], which constrains the problems to which they can be applied.

The system proposed in this paper not only exploits the distributed sensing capabilities of the WSN. It also overcomes typical multi-camera network problems by exploiting the communication infrastructure of the WSN, and it benefits from the parallel computing capabilities of the distributed nodes. To reduce the bandwidth requirements, the images captured synchronously by the cameras are processed at each node in order to extract the essential information of the object. To cope with the usually low bandwidth of the WSN, only this distilled information is sent through the network. The measures from all the cameras are integrated using information fusion methods such as Maximum Likelihood and Extended Kalman Filters.

The rest of the paper is organized as follows. Section II briefly presents the proposed system, pointing out hardware and software issues. Section III focuses on the description of the detection, localization and tracking methods. Section IV shows some experimental results. The last sections are conclusions and acknowledgements.

II. DESCRIPTION OF THE SYSTEM

Figure 1 depicts the architecture of the proposed system. The WSN consists of a large number of nodes with sensing, computing and wireless communication capabilities. For interfacing with other systems or networks, the WSN is typically connected to one or more gateways with base nodes that act as information sinks.

Figure 1: Network scheme of the proposed system.

The wireless communication infrastructure provided by the WSN allows information interchange among the nodes. WSN communication algorithms such as network formation and information routing have been extensively analyzed in the literature with the objective of minimizing the energy consumption [10], minimizing the communication delays expressed as the number of hops [11] or optimizing the network reconfiguration and reliability against failures [12]. Many routing algorithms are based on the establishment of clusters of nodes that are "sensing" the same event, see for instance [13]. Several criteria are used to select the structure of the cluster and, particularly, the cluster leader, such as the node with the best connection to the rest of the network or the node that best senses the event. A survey of wireless sensor network routing protocols can be found in [14]. Although WSNs provide easy-to-use communication capabilities, their bandwidth is often low. For instance, ZigBee nominally provides 250 kbps, but in operating conditions this rate can be reduced to an effective bandwidth of 110 kbps [15], which can be insufficient for some problems.

The nodes of the WSN can be equipped with different sensors depending on the application. For instance, sensors to measure temperature, light intensity and gas concentration can be used for environment monitoring. The system proposed in this paper uses cameras as the main sensors of the WSN. Some approaches such as [16] transmit the images gathered by the cameras through the WSN. This approach has severe bandwidth constraints in problems that require images of a certain resolution at a certain rate. In our case, camera nodes are equipped with extra computational capabilities able to execute image-processing algorithms in order to extract the required information from the images, for instance the location of an object on the image plane. Hence, only this essential, bandwidth-reduced information from each camera node is transmitted through the WSN.

As in any other decentralized vision-based object tracking method, all the entities should be synchronized in order to allow the reconstruction of the trajectory of the moving object. Thus, we included the node synchronization method described in [17] as part of the WSN communication network layer. The following sections briefly describe the hardware and software characteristics of the proposed system.

A. Hardware

Requirements such as size, weight and energy consumption are very important in our system. The camera nodes should be durable and easy to deploy. In our system we use Mica2 motes from Crossbow Inc. These motes use a 916 MHz transceiver with a rate of 9.6 kbps and an ATMega128 8-bit microprocessor at 7 MHz. The Mica family has different sensor boards, such as the MTS400 with accelerometers and temperature and light sensors, or the MTS420 with GPS. It also allows adding a wide range of external sensors/actuators using adaptation boards such as the MDA100, whose input/output ports include, among others, A/D converters, interrupt inputs, an I2C bus and serial communication ports [18].

The ATMega128 microprocessor used in Mica2 motes can execute algorithms with a low computational burden, but it is not capable of applying image processing methods with sufficient image resolution or frame rate. Furthermore, the RAM of the Mica2 (only 4 KB) is not suitable for image processing problems. The solution adopted is to select low-cost cameras with extra computing capabilities, such as the CMUCam2 by Carnegie Mellon University. It includes hardware-based image processing capabilities implemented in an SX52 microprocessor (Ubicom) at 75 MHz. The CMUCam2 includes a low-cost camera with a 255x176 pixel 1/4'' CMOS detector and an M12 0.5 mm optics [19]. The CMUCam2 provides several simple image processing methods implemented in hardware, such as color segmentation, which returns the region of the image whose pixel colors lie inside a predefined RGB subspace, or frame-difference segmentation, which returns the region that differs from a reference image, typically static. The computing capabilities of the CMUCam2 guarantee real-time image processing (up to 50 fps).

Each CMUCam2 is controlled by a single mote. The mote commands the CMUCam2 to perform image processing tasks through an RS-232 protocol, which is detailed in the next section. Figure 2 shows a camera node with the CMUCam2 camera attached to a Mica2 mote. The CMUCam2 cameras were mounted on small tripods to facilitate deployment and orientation.

B. Software

Two main software modules were considered. The first one deals with node synchronization. The second is responsible for information acquisition and processing.

Figure 2: Detail view of a camera node with a CMUCam2 and a Mica2 mote.

The algorithm selected for node synchronization is the so-called Flooding Time Synchronization Protocol (FTSP). This scalable protocol establishes hierarchies among the WSN nodes. The leader node, usually the base node, periodically sends a time synchronization message. Each node that receives the message resends it following a broadcast strategy. The local time of each node is corrected depending on the time stamp in the message and on the sender of the message. The synchronization error of this protocol is demonstrated to be of a few milliseconds [17]. The results of our synchronization tests confirm this error.

From the acquisition point of view, each node runs a program that reads its sensors, processes the obtained measures and sends the information to the WSN. This program differs depending on the sensor considered. Figure 3 depicts a simplified scheme of the acquisition and processing software executed at each camera node. Camera nodes execute a program that implements the command interface with the CMUCam2 through the serial port using low-level routines of TinyOS. In the following, the node-CMUCam2 communication is briefly summarized. The node commands the CMUCam2 to start capturing images at a certain rate and to apply object segmentation methods to the captured images. After the image capture and segmentation, the CMUCam2 responds with data on the segmented region, such as its centroid and size. The data obtained are sent through the WSN to a node responsible for collecting the measures from all camera nodes and applying the sensor fusion techniques. It should be noticed that the synchronization among the nodes guarantees that the measures of the object are obtained by all the cameras at the same time instants, which facilitates the application of sensor fusion techniques, as discussed in Section III.

III. COOPERATIVE PERCEPTION TECHNIQUES

The objective of the developed techniques is to use measurements obtained from distributed camera nodes to cooperatively locate and track an object in the scenario. Two of the algorithms implemented in the CMUCam2 were particularly interesting for object localization and tracking: color segmentation, for identifying objects of a predefined color, and frame-difference segmentation, for identifying moving objects in a static scenario. In both cases, among its outputs the CMUCam2 provides the centroid of the segmented region on the image plane.

Figure 3: Simplified flow diagram of the processes executed at the nodes: a) camera nodes request image capture and image segmentation and send processed data; b) localization and tracking using measures from the distributed camera nodes.
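As a rough, hypothetical illustration of the camera-node loop of Figure 3a: the sketch below only mirrors the request/segment/send flow. The command and reply strings, port name and baud rate are placeholders of our own, not the actual CMUCam2 serial protocol (documented in [19]), and the real nodes run TinyOS rather than Python.

```python
import serial, time

# Placeholder command string; the real CMUCam2 commands are defined in its manual [19].
CMD_CAPTURE_AND_SEGMENT = b"SEGMENT\r"

def camera_node_loop(port="/dev/ttyUSB0", period_s=0.1, send_to_wsn=print):
    """Illustrative camera-node cycle: ask the camera to capture and segment,
    read back the centroid/size reply, and forward only that distilled
    measurement (with a timestamp) to the WSN."""
    cam = serial.Serial(port, 115200, timeout=1)
    while True:
        cam.write(CMD_CAPTURE_AND_SEGMENT)      # request image capture + segmentation
        reply = cam.readline()                  # e.g. centroid and size of the region
        if reply:
            send_to_wsn((time.time(), reply.strip()))  # only this data enters the WSN
        time.sleep(period_s)
```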

Before transmitting the information for sensor fusion, each camera node should correct the optical distortions introduced by the often low-cost lenses of the cameras. Let Pp=[xp yp]T be the projection on the image plane of point P, the centroid of the segmented object of interest. First, the pin-hole projection Pd is obtained using Pp = f·Pd + C, where f is the focal length and C contains the coordinates of the principal point of the lens. Pd can be affected by optical distortions of the lens. Assuming the model in [20], these distortions are corrected using the expression:

$$P_n = \frac{1}{d_r}\,(P_d - d_t),\qquad(1)$$

where $P_n$ is the distortion-corrected pin-hole projection measure, and $d_r$ and $d_t$ are simplified radial and tangential distortion terms [20]:

$$d_t \approx \begin{bmatrix} c\,(r^2 + 2x_d^2) \\ 2c\,x_d y_d \end{bmatrix},\qquad d_r \approx 1 + a r^2 + b r^4,\qquad(2)$$

where a, b and c are optical parameters determined during camera calibration and $r^2 = x_d^2 + y_d^2$. These efficient expressions need to be applied only to the centroid of the segmented region of interest. The distortion-corrected pin-hole projections are transmitted by each camera node for sensor fusion.
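A minimal numpy sketch of the correction (1)-(2) applied to one centroid. The parameter names follow the text; their values would come from the calibration of each camera, and nothing here is taken from the authors' code.

```python
import numpy as np

def undistort_centroid(p_p, f, C, a, b, c):
    """Recover the pin-hole projection P_d from the pixel measure P_p
    (inverting P_p = f*P_d + C) and apply the correction of eqs. (1)-(2)."""
    p_d = (np.asarray(p_p, dtype=float) - C) / f
    xd, yd = p_d
    r2 = xd**2 + yd**2
    d_r = 1.0 + a * r2 + b * r2**2               # simplified radial term
    d_t = np.array([c * (r2 + 2 * xd**2),        # simplified tangential term
                    2 * c * xd * yd])
    return (p_d - d_t) / d_r                     # eq. (1): distortion-corrected P_n
```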

A. Maximum Likelihood (ML)

Maximum Likelihood (ML) methods estimate the state S by maximizing a statistical likelihood function:

$$\hat{S} = \arg\max_{S}\; P(m \mid S)\qquad(3)$$

The ML method aims to estimate the state that best justifies the observations. In this method we fuse the projections on the scenario obtained from all camera nodes. Let Pi=[xn,i yn,i]T be the distortion-corrected measure of a point viewed from camera node i. Assuming a pin-hole model, the projection on the scenario of Pi at a distance zi from camera i is expressed by:

$$m_i = \begin{bmatrix} x_i \\ y_i \\ z_i \end{bmatrix} = \begin{bmatrix} x_{n,i}\, z_i \\ y_{n,i}\, z_i \\ z_i \end{bmatrix}\qquad(4)$$

Figure 4: Simple example of the statistical intersection method.

mi represents the location of the object on the scenario measured by camera i, expressed in the local reference frame of camera i. The following expression transforms the projection to a common reference frame:

$$\begin{bmatrix} m_i^{c} \\ 1 \end{bmatrix} = T_i^{-1} \begin{bmatrix} m_i \\ 1 \end{bmatrix},\qquad(5)$$

where Ti is the transformation matrix that maps the common reference frame into the reference frame of camera i. Thus, mic represents the location of the object measured by camera i expressed in the common reference frame. Assume that the mic contain errors that can be modeled as zero-mean Gaussian noise with covariance matrix Covi, and that measures from different cameras can be considered statistically independent. If both conditions are met, the ML method estimates the state S by fusing the measures mic using the expression:

$$S = \left(\sum_{i=1}^{N} \mathrm{Cov}_i^{-1}\right)^{-1} \sum_{i=1}^{N} \mathrm{Cov}_i^{-1}\, m_i^{c}\qquad(6)$$

Covi can be decomposed into an eigenvector matrix and an eigenvalue matrix, Covi = PΛP-1. The matrix Λ can be constructed taking into account that its diagonal entries, the eigenvalues, are the variances along each axis of the camera local reference frame. The eigenvectors form the columns of matrix P; they are orthonormal vectors that represent the axes of the camera local reference frame. P and Λ, and thus Covi, can be easily constructed knowing the orientation of camera i and estimating the noise in the measure.

Figure 4 shows an illustration of the method with two cameras. The object location probability values corresponding to Cameras 1 and 2 are shown in blue. The probability of the fused estimation, in red, is significantly higher (which implies lower covariance), which denotes an increase in location accuracy. Notice that for applying (4) the value of zi, the depth of the object in the local camera reference frame, should be assumed. A typical approach is to set zi to an average value and compensate the error by assuming a high value for the variance of the error along the z axis.
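A minimal numpy sketch of equations (4)-(6) and of the covariance construction just described. The function names and the numerical variances are illustrative choices of ours, not values from the paper.

```python
import numpy as np

def project_to_scene(p_n, z_assumed):
    """Eq. (4): back-project the distortion-corrected measure [x_n, y_n]
    at an assumed depth z_i (the depth itself is only a guess)."""
    return np.array([p_n[0] * z_assumed, p_n[1] * z_assumed, z_assumed])

def to_common_frame(m_i, T_i):
    """Eq. (5): express the camera-frame projection in the common frame,
    T_i being the 4x4 homogeneous transform of eq. (5)."""
    return (np.linalg.inv(T_i) @ np.append(m_i, 1.0))[:3]

def cov_in_common_frame(R_cam, var_xy=0.05, var_z=4.0):
    """Cov_i = P Lambda P^{-1}: the columns of P (here R_cam) are the orthonormal
    camera axes in the common frame; Lambda holds the per-axis variances, with a
    large value along the optical axis because z_i is assumed. Values illustrative."""
    return R_cam @ np.diag([var_xy, var_xy, var_z]) @ R_cam.T

def ml_fuse(measures):
    """Eq. (6): ML fusion of the pairs (m_i^c, Cov_i) from all camera nodes."""
    A = sum(np.linalg.inv(C) for _, C in measures)
    b = sum(np.linalg.inv(C) @ m for m, C in measures)
    return np.linalg.solve(A, b)
```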

This simple method is very efficient and can be executed on a Mica2 in less than 20 ms. On the other hand, it is sensitive to losses of WSN packets, which can be frequent in some cases. This sensor fusion method relies totally on measures, and its performance degrades when measures are lost. Other sensor fusion techniques, such as Kalman Filters, rely both on measures and on models, which is very useful in case of lack of measures.

B. Extended Kalman Filter (EKF)

Extended Kalman Filters require adopting a prediction model and a measurement model. These models should be efficient enough to allow real-time processing on a single mote. We selected the following prediction model: Sk+1 = A·Sk + wk, where wk is noise as described later and S is the state vector, which for efficiency contains only the object location. We selected A = I. This model can represent simple and irregular motions and has been extensively applied in KF and EKF systems, e.g. [21] or [22]. It is sufficient for slowly moving objects. Also, it should be noticed that, besides increasing the computational burden, the adoption of more complex prediction models requires knowledge of the motion, which cannot be assumed available in advance in our system, mainly oriented to tracking people with potentially unpredictable motion.

The measurement model used in the proposed EKF uses as measures the distortion-corrected pin-hole projections from each camera node i. It is derived as follows. Let P be the location of the object in the common reference frame. The location of the object in the reference frame of camera i, Pi, can be obtained using [Pi 1]T = Ti[P 1]T, where Ti is the transformation matrix to the reference frame of camera i. Using the pin-hole model, the distortion-corrected pin-hole projection for camera i follows the expression:

$$m_i = \begin{bmatrix} \dfrac{t_{1,i}\,[P\;1]^{T}}{t_{3,i}\,[P\;1]^{T}} \\[10pt] \dfrac{t_{2,i}\,[P\;1]^{T}}{t_{3,i}\,[P\;1]^{T}} \end{bmatrix}\qquad(7)$$

where tj,i represents the j-th row of Ti. Notice that tj,i[P 1]T is a scalar. In our EKF we used mi as measures. The measurement model, which relates the state at time k, Sk, to the measures obtained at time k, mk,i, is obtained from (7) by substituting P with Sk and adding observation noise:

$$m_{k,i} = h(S_k, v_k) = \begin{bmatrix} \dfrac{t_{1,i}\,[S_k\;1]^{T}}{t_{3,i}\,[S_k\;1]^{T}} \\[10pt] \dfrac{t_{2,i}\,[S_k\;1]^{T}}{t_{3,i}\,[S_k\;1]^{T}} \end{bmatrix} + v_k,\qquad(8)$$

where vk summarizes the error sources at the observation step, including those in the location and orientation of the cameras. This simplified error formulation is adopted for computational efficiency. We also assume that wk and vk are additive zero-mean Gaussian noises with covariance matrices Q and R, respectively. This observation model is non-linear. At the updating stage the EKF requires the Jacobian matrices of h, Hk and Vk:

$$H_k = J(h, S_k) = \frac{\partial h}{\partial S_k},\qquad V_k = J(h, v_k) = \frac{\partial h}{\partial v_k} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\qquad(9)$$
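For completeness, one possible explicit form of $H_k$ (this expansion is ours, not given in the paper): applying the quotient rule to (8), and writing $\bar t_{j,i}$ for the first three entries of row $t_{j,i}$, $n_j = t_{j,i}[S_k\;1]^T$ and $d = t_{3,i}[S_k\;1]^T$, gives the 2x3 matrix

$$H_k = \frac{1}{d^{2}} \begin{bmatrix} \bar t_{1,i}\, d - n_1\, \bar t_{3,i} \\[4pt] \bar t_{2,i}\, d - n_2\, \bar t_{3,i} \end{bmatrix}.$$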

Figure 5 shows the algorithm implemented. The simplifications adopted improve the efficiency of the EKF. The prediction model simplifies the prediction stages at lines 1 and 2. The state Sk has only three components, so its covariance is a 3x3 matrix. The simple observation noise model adopted results in Vk = I, which avoids matrix products at the updating stage at line 3. Also, the computation of the Kalman gain at line 3 requires inverting a matrix of only 2x2. Each set of synchronized measures from the cameras requires only one prediction step and one updating step per measure. Assuming 3 cameras, the execution of one iteration of the EKF requires approximately 600 products and 500 sums (less than 60 ms).
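As an illustration only (not the authors' mote implementation, which runs on TinyOS following the pseudo-code of Figure 5), the numpy sketch below performs one EKF iteration with the models (7)-(9): a single prediction with A = I followed by one update per synchronized camera measurement. Function and variable names are ours; T_i is assumed to be the 3x4 matrix formed by the first three rows of the homogeneous transform that maps the common frame into the frame of camera i.

```python
import numpy as np

def h_and_jacobian(S, T):
    """Measurement model (8) and its Jacobian H_k = dh/dS for one camera,
    with T the 3x4 transform rows t_{1..3,i}."""
    Sh = np.append(S, 1.0)                       # homogeneous state [x y z 1]
    n1, n2, d = T[0] @ Sh, T[1] @ Sh, T[2] @ Sh
    h = np.array([n1 / d, n2 / d])
    t1, t2, t3 = T[0, :3], T[1, :3], T[2, :3]    # translation part has zero derivative
    H = np.vstack([t1 * d - n1 * t3,
                   t2 * d - n2 * t3]) / d**2     # quotient rule, 2x3 Jacobian
    return h, H

def ekf_step(S, P, measurements, Q, R):
    """One iteration: prediction with the static model, then one update per
    pair (m_i, T_i) of synchronized camera measurements."""
    P = P + Q                                    # prediction: S_{k+1} = S_k, A = I
    for m, T in measurements:
        hS, H = h_and_jacobian(S, T)
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)   # only a 2x2 inverse
        S = S + K @ (m - hS)
        P = (np.eye(3) - K @ H) @ P
    return S, P
```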

Figure 5: Pseudo-code of the EKF algorithm.

Figure 6: Results of the EKF (in blue) vs. the ground-truth trajectory (in red) for two experiments with different object trajectories.

The probabilistic nature of the EKF provides high robustness. For instance, if the object moves out of the cameras' field of view and cannot be segmented in the images, only the prediction part of the EKF algorithm (lines 1 and 2) is applied and, thus, the covariance matrix (the uncertainty of the state) keeps growing until the object is found again. This behavior increases the robustness in case of loss of the tracked object.
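A tiny illustration of this behavior under the static model (the numerical values below are arbitrary): with no measurements there are no updates, so each iteration only adds the prediction-noise covariance and the uncertainty accumulates.

```python
import numpy as np

P = np.diag([0.1, 0.1, 0.5])   # current state covariance (arbitrary values)
Q = 0.05 * np.eye(3)           # prediction-noise covariance (arbitrary)
for k in range(5):             # object not segmented: prediction-only steps
    P = P + Q                  # covariance prediction (cf. Figure 5)
    print(k, round(np.trace(P), 2))   # the total uncertainty grows each step
```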

The EKF is significantly robust to losses of measures. If there are no measures there is no update, and the static prediction model makes no change to the estimated state. In contrast, with the ML method described in Section III-A, if there are no measures there is no estimation. Figure 7 shows the results obtained for the experiment shown at the top of Fig. 6.

IV. SOME EXPERIMENTAL RESULTS

The objective of the experiments was to cooperatively locate and track one person moving in a scenario using three camera nodes. The camera nodes were configured to segment the color of the clothes of the person to be tracked. The camera locations and orientations were set to cover a large part of the scenario. The coordinate frame of each camera is depicted in Fig. 6. The variance of the observation noise was experimentally measured as σ²m = 1.26. The simplicity of the prediction model adopted required selecting higher values of the covariance for the prediction noise, R = 10·I3. Figure 6 compares the results obtained (thin dotted line) with the ground-truth data (wide line) when the object follows two different trajectories.

The initialization of the state vector is of high interest for iterative estimation methods such as the EKF. We set an arbitrary initial value (the location of Camera 3 in the example of Fig. 6) and gave high initial values to P, reflecting the lack of accuracy of the initial position of S. It can be noticed that, after only one iteration, S reduces this initial error and converges to the real location. Another approach is to initialize S with the value resulting from the ML method described in Section III-A.

Figure 7: Results of the ML method for the experiment at the top of Fig. 6.

In order to assess the errors of the system, we also carried out experiments with a static object whose real position was measured. Figure 8 shows that, for both axes, the state of the EKF converges to the real location (straight line) in two iterations. The values of the covariance (top and bottom lines) also soon become very low.

Figure 8: Results of EKF with a static object for axes X and Y. The EKF converges to the real location after two iterations.

V. CONCLUSION

This paper proposes to exploit WSN capabilities such as distributed sensing, communication and parallel computing to develop a vision-based system for object detection, localization and tracking. The system uses synchronized WSN nodes with cameras as the main sensors. To reduce the bandwidth requirements, the images are processed at each camera node and only the information of interest is transmitted through the WSN. The measures from all the camera nodes are the inputs of sensor fusion techniques such as Maximum Likelihood (ML) and the Extended Kalman Filter (EKF), adapted to reduce their computational burden.

The experimental results in the paper show the validity of the proposed system. For moving objects, the trajectory provided by the system is very close to the real one. For static objects, the system provided the object location with virtually no error. The proposed system can be executed in real time on one single mote. The sensor fusion process applied has poor properties for distributed execution on multiple motes; hence, techniques more suitable for distributed implementation, such as Information Filters, are being investigated. The automatic determination of node locations to facilitate rapid deployment is also the object of current research.

ACKNOWLEDGMENT

The authors thank Francisco Pazos and Jesús Capitán for their contribution preparing the hardware and their help in the development of the EKF. This work was supported in part by the European Commission IST Programme under projects AWARE (FP6-IST-2006-33579) and URUS (FP6-EU-IST-045062) and by the Junta de Andalucía (Spain) under project DETECTRA (P07-TIC-02966).

REFERENCES

[1] J. R. Martínez-de Dios, B. Ch. Arrue, A. Ollero, L. Merino and F. Gomez-Rodriguez, "Computer Vision Techniques for Forest Fire Perception", Image and Vision Computing, vol. 26, no. 4, 2008, pp. 550-562.
[2] B. Grocholsky, J. Keller, V. Kumar and G. Pappas, "Cooperative air and ground surveillance", IEEE Robotics & Automation Magazine, vol. 13, no. 3, 2006, pp. 16-25.
[3] W. Qu, D. Schonfeld and M. Mohamed, "Decentralized Multiple Camera Multiple Object Tracking", Proc. IEEE Intl. Conf. on Multimedia and Expo, 2006, pp. 245-248.
[4] I. F. Akyildiz et al., "A Survey on Sensor Networks", IEEE Communications Magazine, 2002, pp. 102-114.
[5] J. S. Sandhu, A. M. Agogino and A. K. Agogino, "Wireless sensor networks for commercial lighting control: decision making with multi-agent systems", Proc. of the AAAI Workshop on Sensor Networks, 2004.
[6] J. Polastre, R. Szewczyk, A. Mainwaring, D. Culler and J. Anderson, "Analysis of wireless sensor networks for habitat monitoring", Wireless Sensor Networks, Kluwer Academic Publishers, 2004, pp. 399-423.
[7] M. Hanssmann, S. Rhee and S. Liu, "The Applicability of Wireless Technologies for Industrial Manufacturing Applications Including Cement Manufacturing", IEEE Cement Industry Technical Conference, 2008, pp. 155-160.
[8] S. Lanzisera, D. T. Lin and K. S. J. Pister, "RF Time of Flight Ranging for Wireless Sensor Network Localization", Intl. Workshop on Intelligent Solutions in Embedded Systems, 2006, pp. 1-12.
[9] F. Caballero, L. Merino, P. Gil, I. Maza and A. Ollero, "A probabilistic framework for entire WSN localization using a mobile robot", Robotics and Autonomous Systems, vol. 56, no. 10, 2008, pp. 798-806.
[10] J.-C. Cano and P. Manzoni, "A performance comparison of energy consumption for Mobile Ad Hoc Network routing protocols", Proc. Intl. Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, 2000, pp. 57-64.
[11] H. Y. Lim et al., "Maximum Energy Routing Protocol Based on Strong Head in Wireless Sensor Networks", Intl. Conf. ALPIT 2007, 2007, pp. 414-419.
[12] P. K. K. Loh, S. H. Long and Y. Pan, "An efficient and reliable routing protocol for wireless sensor networks", Proc. IEEE WoWMoM 2005, pp. 512-516.
[13] Y.-C. Chang, Z.-S. Lin and J.-L. Chen, "Cluster based self-organization management protocols for wireless sensor networks", IEEE Trans. on Consumer Electronics, vol. 52, no. 1, 2006, pp. 75-80.
[14] J. N. Al-Karaki and A. E. Kamal, "Routing techniques in wireless sensor networks: a survey", IEEE Wireless Communications, vol. 11, no. 6, 2004, pp. 6-28.
[15] F. García, A. J. García and J. García, "Estimación de la capacidad efectiva de las redes IEEE 802.15.4 en interferencia de redes IEEE 802.11", Univ. Politécnica de Cartagena, Telecoforum 2007.
[16] T. Wark et al., "Real-time Image Streaming over a Low-Bandwidth Wireless Camera Network", Intl. Conf. on Intelligent Sensors, 2007, pp. 113-118.
[17] M. Maróti, B. Kusy, G. Simon and A. Lédeczi, "The Flooding Time Synchronization Protocol", Proc. ACM SenSys'04, 2004.
[18] "Mica2 Datasheet", Crossbow Inc., http://www.xbow.com
[19] "CMUCam2 Vision Sensor, user manual", CMU, USA, http://www.cs.cmu.edu/~cmucam2
[20] http://www.vision.caltech.edu/bouguetj/calib_doc/htmls/parameters.html
[21] N. P. Papanikolopoulos, P. K. Khosla and T. Kanade, "Visual Tracking of a Moving Target by a Camera Mounted on a Robot: A Combination of Control and Vision", IEEE Trans. on Robotics and Automation, vol. 9, no. 1, 1993, pp. 14-25.
[22] R. J. Qian, M. I. Sezan and K. E. Matthews, "A robust real-time face tracking algorithm", Proc. Intl. Conf. on Image Processing, ICIP 98, vol. 1, 1998, pp. 131-135.