A PARTICLE FILTER BASED FUSION FRAMEWORK FOR VIDEO-RADIO TRACKING IN SMART SPACES

Alessio Dore, Andrea F. Cattoni and Carlo S. Regazzoni
Department of Biophysical and Electronic Engineering, University of Genova, Genova, Italy
Abstract

One of the main issues for Ambient Intelligence (AmI) systems is to continuously localize the user and to detect his/her identity in order to provide dedicated services. A video-radio fusion methodology, relying on the Particle Filter algorithm, is here proposed to track objects in a complex, extensive environment, exploiting the complementary benefits provided by both systems. Visual tracking commonly outperforms radio localization in terms of precision, but it suffers from occlusions and illumination changes. Radio measurements, instead, gathered by a user's radio device, are unambiguously associated to the respective target through the "virtual" identity (i.e. MAC/IP addresses). The joint usage of the two data typologies allows a more robust tracking and a greater flexibility in the architectural setup of the AmI system. The method has been extensively tested in a simulated, off-line framework and on real world data, proving its effectiveness.

1. Introduction

Ambient Intelligence (AmI) can be defined as the addition of interactive and proactive capabilities to an environment, which can then be called a Smart Space (SS). The main goal of such a system is to provide innovative and immersive services to the user through advanced Human/Machine Interfaces (HMI) and terminals. Even if SSs can be seen as an innovative extension of surveillance systems (in fact the main problems to be faced are the same), there is a shift of focus from the system itself to the user: indeed, such systems are also defined as user-centred. Among all the open issues in designing SSs, the present paper faces the problem of target identification and tracking. Tracking is one of the basic tasks of most AmI systems, as they usually need information regarding the location of the users to provide them with services. However, whereas the localization of an isolated single target moving in a limited area can be rather easily handled by visual tracking, when the number of objects to be tracked grows and the position of a target has to be determined in an extended area, many difficulties arise. It is well known that visual tracking performances dramatically fall in the case of occlusions, i.e. the overlapping of objects in the image plane that affects the target. In fact, the lack of observations renders the position estimate, and hence the association to a track, the more difficult the longer the occlusion lasts. Moreover, if some zones of the scene are not covered by the field of view of any camera, assumptions have to be made to obtain a continuous track (see Makris et al. [5]). Concerning the issue of occlusions, in recent years a large body of research has addressed methods able to integrate data gathered from multiple, possibly heterogeneous, sensors. Because of increasing robustness requirements and new functionalities, researchers moved from multi-camera systems to the usage of heterogeneous sensors for acquiring information from the scene. This enhances both the quantity of information that can be extracted and the redundancy provided by different signal typologies. The contribution of the paper is a new framework, based on the Particle Filter algorithm, that is able to track users moving in a SS by employing video and radio data jointly. This approach aims at exploiting the complementary benefits provided by the two types of data. In fact, visual tracking commonly outperforms radio localization in terms of precision, but inefficiency arises in case of occlusions or when the scene is too vast. Conversely, radio measurements, gathered by a user's radio device, are unambiguously associated to the respective target through the "virtual" identity (i.e. MAC/IP addresses), and they are available in extended areas. The proposed framework thus allows localization even where/when either video or radio observations are missing or unsatisfactory, and it renders the setup of the AmI system more flexible and robust.
2. System Framework
Under the AmI paradigm, each ambient should be able to proactively supply different kinds of services. An example of a SS is the one described in [8], which provides virtual guidance services in an open outdoor area, like a university campus.
Anyhow, the proposed architecture is designed to be completely scalable in terms of number of sensors and of area coverage, and to be portable to many other applications. The user is free to move inside the SS. The area is observed by a set of video cameras, whose signals are directly sent to a system server. The server is also devoted to acquiring the radio measures of the user terminal regarding the radio environment. In the case presented here, radio signals are generated by a set of WLAN IEEE 802.11b Access Points (APs). This information is used to Localize-and-Track (LaT) the user, in order to provide him/her, through an appropriate HMI, the virtual guidance to the desired destination. The logical architecture of the LaT module of the server is shown in Figure 1.

Figure 1: Logical architecture of the LaT module

The system acquires a set of measurements MES = {mes_i^c} about the interacting object/observed scene, where c = {V, R} refers to video or radio sensing and i = {1, ..., N_V} or i = {1, ..., N_R} identifies the sensor that gathered the measurement. The proposed architecture follows the multi-level data-fusion model described in [3], already used by Marchesotti et al. [6] in a similar context. All the measurements first have to be temporally aligned, in order to maintain the temporal coherence of the fusion process. Given the heterogeneity of the data, two Meta-data extraction modules are present, one dedicated to video data and the other to radio data: the Video Analysis Module (VAM) and the Radio Analysis Module (RAM). The output of the former is a synthetic representation of the interacting entity:

O_m^V(k) = [p_m^V(k), id_m^V(k)]    (1)

where p_m^V(k) is the Meta-datum representing the position of the target in a proper reference system at the discrete time instant k, and id_m^V(k) is the VAM internal object identifier. It is possible to notice that this ID is time-varying, because it is assigned by the system and it can vary depending on the environmental clutter. The RAM output is instead composed by:

O_m^R(k) = [RSS_m^R(k), id_m^R]    (2)

where RSS_m^R(k) is the set of power measures of the radio WLAN signal (Received Signal Strength, RSS) gathered by the user terminal at the k-th instant and transmitted to the server, while id_m^R can be either the MAC (Medium Access Control) or the IP address. It is clear from this formalization that the radio ID is not time-varying: each transmitted packet contains a univocal sender address. The following module, the Data Association stage, is responsible for assigning each measurement gathered by the VAM and the RAM to the correspondent target track, so as to make the state estimation possible. This module is also able to detect whether no video Meta-data are available for the m-th target; in this case the stage sets the Hyb_enabler flag that drives the State Estimation/Particle Filter level. The proposed estimation system is based on the usage of multiple observation models: if Hyb_enabler is active, the radio observation model is used instead of the video one. A more detailed description of all the functionalities of the filter, together with how the Model-Switching Function (MSF) works, will be provided in Section 3. The proposed architecture can be considered a general framework: each stage hosts different problems that have to be faced in order to obtain a functioning system. Among the open issues, some of the most remarkable are: 1) how to manage a multi-target tracking system; 2) how to perform data association in presence of multiple interacting users [1]; 3) how to manage a multi-camera system with non-overlapped fields of view. In particular, this paper focuses on the tracking of a single target and on the performances of the hybrid system, which exploits the joint usage of heterogeneous vectors of information, when video observations are missing or unreliable. In the following, the basic tasks that have to be performed in terms of processing and joint usage of video and radio data for tracking are presented.
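For concreteness, the VAM and RAM outputs of Eqs. (1) and (2) could be represented as in the following minimal Python sketch; the class and field names are illustrative, not taken from the paper:

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class VideoMetaData:
        """VAM output O_m^V(k) of Eq. (1)."""
        position: Tuple[float, float]  # p_m^V(k) = (x_M, y_M), map coordinates
        video_id: int                  # id_m^V(k): assigned by the system, time-varying

    @dataclass
    class RadioMetaData:
        """RAM output O_m^R(k) of Eq. (2)."""
        rss: List[float]               # RSS_m^R(k): one RSS reading per Access Point
        radio_id: str                  # id_m^R: MAC or IP address, time-invariant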
2.1 Video Localization and Tracking

The Video Analysis Module (VAM) first extracts from the digital image sequence the image-plane target position p_m^{Vi}(k) = (x_I, y_I), where i = {1, ..., N_V} indicates the camera, k the time instant, and m refers to the object detected by camera V_i. To do so, raw video data are digitized by a frame grabber (if the device is analog); objects of interest are then identified by a motion detection algorithm. This process yields groups of pixels (blobs), each bounded by a rectangle (bounding box) whose middle point of the horizontal bottom segment is taken as the object position p_m^{Vi}(k). An identifier ĩd_m^{Vi} is coherently associated to each bounding box over successive time instants; in this way a target track in the image plane is obtained. However, as the environment is guarded by more than one camera, a process to get a common representation for all the views is necessary. This is obtained by means of camera calibration (e.g. [12]), which establishes a correspondence between the image-plane coordinates (x_I, y_I) and the map coordinates p_m^V(k) = (x_M, y_M). If two or more cameras share the same field of view, a spatial data association process has to be fulfilled to obtain a unique map measurement for each object of interest. Moreover, an identifier id_m^V is assigned to each target in the overall map representation. The tasks mentioned above are common to many visual tracking systems. In the presented framework, the video measurement p_m^V(k) is associated to the target whose prediction is the closest, in order to obtain an estimate of the position (x, y) (see Sect. 3.2.1) by means of the Particle Filter updating step. When no p_m^V(k) can be related to a track, the update cannot be fulfilled, rendering the State Estimation less reliable.
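As an illustration of the image-to-map correspondence established by calibration, a minimal sketch is given below; it assumes a planar ground and a precomputed 3×3 homography, whereas the paper relies on Tsai calibration [12], so this simplified form is an assumption made only for illustration:

    import numpy as np

    def image_to_map(p_image, H):
        """Project the bounding-box foot point (x_I, y_I) to map coordinates
        (x_M, y_M) using a 3x3 homography H (planar-ground assumption)."""
        q = H @ np.array([p_image[0], p_image[1], 1.0])
        return q[0] / q[2], q[1] / q[2]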
2.2 Radio Map Construction

A wireless channel is one of the most unstable media used to transmit information. Its time-variability is due to the electro-magnetic absorption/reflection phenomena that affect the transmitted signal along its path from the transmitter to the receiver. If the environment changes (it is sufficient that a person moves inside the area), then the channel characteristics vary as well. It is hence possible to consider all the characteristics of the radio signal as statistical variables. Different features can be extracted from the radio signal in order to compute the user location [10]. In the present work the Received Signal Strength (RSS), i.e. the instantaneous power of the radio signal, has been chosen. Let us now consider a discrete grid of points P = {P_g}, g = 1, ..., G, where the radio signal is analyzed in order to extract its statistics. In each point, and for each AP active in the considered environment, a set of measurements is acquired:

D_i(P_g) = {RSS^i(P_g, k), ..., RSS^i(P_g, k + T_acq)}    (3)

where i represents the considered AP, k is the initial time instant and T_acq is the acquisition time window. It is hence possible to define the histogram function H : Z → N, so that H(D_i(P_g)) represents the histogram of the set of measures acquired in P_g. If the time window is sufficiently wide to acquire a significant number of RSS samples, it is possible to approximate the ensemble statistics with the temporal statistics. Let us therefore assume that the probability density function (pdf) of the RSS is proportional to a normalized version of the histogram:

p(RSS^i | P_g) ∝ H(D_i(P_g))

In order to provide an analytic version of this pdf, defined as p(RSS^i | P_g) : R → R, the histogram has been modelled through a multi-modal asymmetric Gaussian (MAG):

p(RSS^i | P_g) ≃ Σ_{j=1}^{J} AG_j(RSS^i | P_g)    (4)

where J is the number of modes and AG_j(RSS^i | P_g) is a mono-modal asymmetric Gaussian [11]. In Fig. 2 an example of the Bhattacharyya pdf extraction used to compute the theoretical pdf is shown.

Figure 2: Qualitative evaluation of the pdf approximation of Eq. 4

For each point P_g ∈ P a similar process is repeated. In order to obtain a continuous pdf over the whole considered environment, an interpolation is performed:

{p(RSS^i | P_g)}_{g=1,...,G}  --interp-->  p(RSS^i, x_M, y_M)    (5)

where (x_M, y_M) are the map coordinates already introduced in Sect. 2.1. The interpolation coefficients that define p(RSS^i, x_M, y_M) are called the Radio Map. In this paper a Radio Map obtained from measures acquired in a real environment is considered. The Radio Map will be extensively used both in defining the target model for the Hybrid Particle Filter (Sect. 3.2.2) and in simulating the radio data (see Sect. 4.1).
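To make the construction concrete, the following sketch approximates p(RSS^i | P_g) with the normalized histogram of the samples of Eq. (3); the MAG fit of Eq. (4) and the interpolation of Eq. (5) are only indicated in the comments, since the paper does not detail their implementation:

    import numpy as np

    def empirical_rss_pdf(samples, rss_range=(-100, -20)):
        """Approximate p(RSS^i | P_g) by the normalized histogram H(D_i(P_g))
        of the RSS samples (dBm) acquired at grid point P_g for AP i. A full
        implementation would then fit the multi-modal asymmetric Gaussian
        (MAG) of Eq. (4) to this histogram."""
        bins = np.arange(rss_range[0], rss_range[1] + 1)  # 1 dBm bins (assumed)
        hist, edges = np.histogram(samples, bins=bins)
        return hist / max(hist.sum(), 1), edges           # p(RSS|P_g) ∝ H(D_i(P_g))

    # Repeating this for every grid point P_g and interpolating the resulting
    # pdfs (or MAG parameters) over (x_M, y_M) yields the Radio Map of Eq. (5).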
3. Hybrid Particle Filter

3.1 Particle Filter Algorithm
The Particle Filter algorithm [9] is a numerical approach to recursive Bayesian state estimation (Bayesian filtering) in which the posterior density function is approximated by a finite set of weighted samples χ_k = {x_k^i, w_k^i}_{i=1}^{N_s} computed by the importance sampling approach. Bayesian filtering operates basically in two steps: prediction and update. The system transition model p(x_k | x_{k−1}) and the set of available observations z_{1:k−1} = {z_1, ..., z_{k−1}} provide the posterior prediction as:
p(x_k | z_{1:k−1}) = ∫ p(x_k | x_{k−1}) p(x_{k−1} | z_{1:k−1}) dx_{k−1}    (6)

The new observations z_k at time k and the observation model provide the likelihood probability p(z_k | x_k), used to correct the prediction by means of the update process:

p(x_k | z_{1:k}) = p(z_k | x_k) p(x_k | z_{1:k−1}) / p(z_k | z_{1:k−1})    (7)

The Particle Filter provides an approximated solution to Bayesian filtering, also in the case of non-linear and non-Gaussian transition and observation models. The set of N_s candidate samples (i.e. particles) {x̃_k^i}_{i=1}^{N_s} representing the prediction is drawn from the so-called proposal distribution (or importance distribution) q_k(x_k | x_{1:k−1}, z_{1:k}). In many applications the proposal distribution can be reasonably obtained from the transition model, so that particles are drawn from p(x_k | x_{k−1}^i). The values of the associated weights are obtained by means of the equation:

w_k^i = [ p(z_k | x_k^i) p(x_k^i | x_{k−1}^i) / q(x_k^i | x_{0:k−1}^i, z_{0:k}) ] · w_{k−1}^i    (8)
When the proposal distribution is given by the transition model, the weight computation simplifies to w_k^i = p(z_k | x_k^i) w_{k−1}^i. However this choice, while computationally efficient, leads to weight degeneracy, in the form of the propagation of several particles with very low weight. To overcome this issue, a resampling procedure is required to eliminate the particles with low weight and to replicate the probable ones. This realization of the Particle Filter algorithm is called Sequential Importance Resampling (SIR).
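A compact sketch of one SIR iteration is given below, under the assumption, as in the paper, that the proposal coincides with the transition model; this is an illustration rather than the authors' implementation:

    import numpy as np

    def sir_step(particles, weights, transition, likelihood, z):
        """One SIR iteration: propagate the particles through the transition
        model (used as proposal), weight them by the likelihood (Eq. 8 then
        reduces to w_k^i ∝ p(z_k|x_k^i) w_{k-1}^i) and resample to counter
        weight degeneracy."""
        particles = transition(particles)                # prediction, Eq. (6)
        weights = weights * likelihood(z, particles)     # update, Eqs. (7)-(8)
        weights = weights / weights.sum()
        n = len(weights)                                 # systematic resampling
        u = (np.arange(n) + np.random.uniform()) / n
        idx = np.minimum(np.searchsorted(np.cumsum(weights), u), n - 1)
        return particles[idx], np.full(n, 1.0 / n)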
3.2 Integration of Video and Radio Information

The Particle Filter algorithm has been demonstrated to be particularly flexible and successful in many applications. In recent years a wide body of research showed its capability of efficiently handling the problem of fusing information coming from heterogeneous types of data. In [13] the Particle Filter was used to integrate target cues regarding shape, colour and position. In [7] the Particle Filter algorithm is employed to fuse motion and colour information acquired with one camera with sound detected by two microphones. It is shown that the joint processing of the two sources of signals improves tracking performances in terms of recovery of lock, lost due to rapid movements of a talker's head, and allows the tracking procedure to switch between two subjects speaking alternately, as frequently happens in video conferencing applications.

In this work, the Received Signal Strength (RSS) gathered by a mobile device connected to an AmI system is employed jointly with visual data to track users. However, radio localization and tracking suffers from unsatisfactory precision for the application domain considered here: radio localization systems based on 802.11x provide a position estimate with an error of around 4 meters [2, 4]. It has therefore been decided to exploit the more accurate visual tracking when available, and to use radio information to overcome its limitations in handling long-lasting occlusions and limited video coverage. The simultaneous usage of both data in situations not affected by the above-stated issues is not pursued, since the noise on the RSSs is too high to guarantee an effective data fusion. However, when visual observations are not available or reliable, radio localization provides useful information both on the position and, more importantly, on the identity of the target, which is automatically given by the initial target authentication in the WLAN network. This implies that the motion of a target can be estimated continuously regardless of long-lasting occlusions or passages through areas blind to the cameras. The proposed method uses a SIR Particle Filter that operates alternately on video and radio data according to the availability of video observations of the target. When the object is not visible, an observation model for the RSSs of the target is used to update the prediction of the object position obtained by means of the transition model. Moreover, the particles computed in the last iteration of the filter with visual measurements, which memorize the target motion up to that moment, are used to draw the candidates to be updated.

3.2.1 Target Models for Video Data

The state of the Particle Filter is given by the kinematic characteristics of the target, i.e. x = [x, y, v_x, v_y], where x and y are the coordinates of the target location in the map and v_x and v_y give the velocity. In the SIR Particle Filter the predicted samples are drawn from the transition probability. The dynamic model of the target's motion is second-order autoregressive, as the movement of the target is supposed to be fairly regular:
x_k = [[1, 0, T, 0], [0, 1, 0, T], [0, 0, 1, 0], [0, 0, 0, 1]] x_{k−1} + [[T²/2, 0], [0, T²/2], [T, 0], [0, T]] ω_{k−1}    (9)

where T is the time interval and ω_{k−1} is a Gaussian zero-mean two-dimensional vector with diagonal covariance matrix Σ_ω.
The observation model of visual cues is assumed to be linear and affected by Gaussian noise, i.e.:

z_k = [[1, 0, 0, 0], [0, 1, 0, 0]] x_k + [ν_k^x, ν_k^y]ᵀ    (10)

where the measurement vector is z_k = [x, y], the first matrix is the observation matrix H, and the random noise ν_k = [ν_k^x, ν_k^y] is Gaussian, with ν_k^{(·)} ~ N(0, σ_{(·)}²). The likelihood is then obtained as:

p(z_k | x_k^i) = exp( − ‖z_k − H x_{k|k−1}^i‖² / (σ_x² · σ_y²) )    (11)
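Evaluated over all particles at once, the likelihood of Eq. (11) reads as in this sketch; the noise standard deviations are illustrative values, not taken from the paper:

    import numpy as np

    def video_likelihood(z, particles, sigma_x=0.3, sigma_y=0.3):
        """Gaussian video likelihood of Eq. (11); H simply selects the (x, y)
        components of the state."""
        pred = particles[:, :2]                          # H x_{k|k-1}^i
        d2 = (z[0] - pred[:, 0])**2 + (z[1] - pred[:, 1])**2
        return np.exp(-d2 / (sigma_x**2 * sigma_y**2))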
3.2.2 Target Models for Radio Data

When no video observations are available, or when an occlusion is not handled reliably by the video tracker, radio signal information is used in the Particle Filter to support the localization. The transition model has the same form as the one used in the video Particle Filter (Eq. 9), in order to represent the target dynamics. The observation model has to relate the RSS measurements to the position of the user in the environment; to do so, the Radio Map obtained as described in Sect. 2.2 is exploited. The measurement vector z_k is composed by the RSSs of each Access Point; herein the definition is limited to three RSSs, to be compliant with the experiments fulfilled, so z_k = [RSS^{AP1}, RSS^{AP2}, RSS^{AP3}]. Worth noticing is that the method is general in terms of the number of Access Points used to locate a user. The likelihood probability is computed as follows: 1) a predicted target position is given by H x_{k|k−1}^i = (x, y); 2) from the Radio Map, the pdfs of the RSS of each h-th Access Point at the predicted position of the user are extracted; 3) the likelihood p(RSS^{AP_h} | x_k^i) of the observed RSS^{AP_h} concerning the h-th Access Point is given by its probability under the correspondent pdf coming from the Radio Map; 4) considering that each Access Point transmits its signal independently of the others, the likelihood for the i-th particle is then calculated as:

p(z_k | x_k^i) = ∏_h p(RSS^{AP_h} | x_k^i)    (12)
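The per-particle product of Eq. (12) could be computed as in the following sketch, where radio_map.pdf(x, y, h, rss) is an assumed interface to the interpolated Radio Map of Eq. (5), not an object defined in the paper:

    import numpy as np

    def radio_likelihood(z_rss, particles, radio_map):
        """Radio likelihood of Eq. (12): for each particle, look up the pdf of
        each AP's RSS at the predicted position and multiply the independent
        per-AP probabilities."""
        lik = np.ones(len(particles))
        for i, (x, y) in enumerate(particles[:, :2]):
            for h, rss in enumerate(z_rss):              # one factor per AP
                lik[i] *= radio_map.pdf(x, y, h, rss)    # p(RSS^{AP_h} | x_k^i)
        return lik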
A pictorial representation of the procedure used to compute the likelihood is presented in Fig. 3.

Figure 3: Computation of the likelihood with radio measurements

3.2.3 Model Switching Function

The Hyb_enabler signal coming from the Data Association module is sent when visual measurements are not available or not sufficiently reliable; in this case the update employs the likelihood for radio measurements (Eq. 12). The filter for radio localization is initialized using the posterior p(x_k | z_k) estimated by the Particle Filter at k = k_lastVideoObs, i.e. with the last video measurement; the update is then obtained with the likelihood of Eq. 12. When a video observation z_k is again available, a new Particle Filter is instantiated from an initial Gaussian probability centred in z_k. Worth noticing is that in this work the association of each observation with the correspondent track has not been taken into account, as this problem has already been addressed by the authors in [1].
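The switching behaviour can be summarized by the hedged sketch below, where pf is an assumed particle-filter object exposing predict/update/reseed operations; the paper defines the behaviour, not this API:

    def msf_step(pf, z_video, z_rss, hyb_enabler):
        """One step of the Model-Switching Function: with Hyb_enabler raised,
        the filter keeps its particles and switches to the radio likelihood
        (Eq. 12); when video returns, it is re-seeded around the measurement."""
        pf.predict()                                 # transition model, Eq. (9)
        if hyb_enabler:
            pf.update(z_rss, model="radio")          # Eq. (12)
        else:
            if pf.was_radio_only:
                pf.reseed_gaussian(mean=z_video)     # new filter centred in z_k
            pf.update(z_video, model="video")        # Eq. (11)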
4. Simulations and Results

4.1 Simulated Results

In order to prove the effectiveness of the proposed method, a simulation framework has been built. A trajectory generator provides a sequence of spatially-coherent points in a pseudo-random way. This trajectory is used to generate both a video track (a noisy version of the ground truth) and a radio vector signal (based on the Radio Map). The aim of these simulations is to test the robustness of the Hybrid Particle Filter in the case of static occlusions. In these experiments, since it is not the focus of the paper, the problem of data association between tracks and observations has not been considered, that is, only one trajectory at a time is processed. A large number of trajectories has been generated in order to give statistical robustness to the evaluation of the proposed method; this is the reason why a simulative framework has been employed. Different occlusion times, from 5 to 13 seconds, have been taken into account, and 300 tracks for each period of absence of video measures have been simulated. Among the possible representations of quantitative results, the Root Mean Square Error (RMSE) between the estimated track and the ground truth is considered here. In the described experiments the Particle Filters use 1000 particles, though even with a few hundred particles the performances do not collapse. Table 1 shows the obtained results: it is possible to infer that, as the occlusion time grows, accuracy generally decreases.

Table 1: RMSE and percentage of accuracy obtained for different occlusion times

                Percentage of accuracy [%]
    RMSE [m]   5 sec.   7 sec.   9 sec.   11 sec.
    0.25         0.7      0        0         0
    0.5          8.2      1.4      0         0
    0.75        20.6      3.2      0         0
    1.5         61.7     27.3     13.8       5.3
    2           72.3     46.5     31.6      16
    4           81.6     95       81.5      68.4
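For reference, the RMSE figure of merit used in Table 1 and in Sect. 4.2 can be computed from an estimated track and its ground truth as follows:

    import numpy as np

    def track_rmse(estimate, ground_truth):
        """RMSE between an estimated track and the ground truth, both (K, 2)
        arrays of map coordinates."""
        err = estimate - ground_truth
        return float(np.sqrt(np.mean(np.sum(err**2, axis=1))))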
4.2 Real World Results

The method has also been tested on a real world sequence. To fulfil this experiment, a person handling a tablet PC connected to a WLAN network composed by three Access Points moves in a parking lot monitored by a static camera. The sequence lasts 40 seconds, and three occlusions of respectively 5, 2 and 2 seconds occur. Visual tracking is realized by a simple tracker performing background differencing and frame-by-frame blob association; world-coordinate video measurements are provided through camera calibration. Radio measurements are acquired every second and temporally aligned to the video measurements by means of a common system clock. The RMSE over the whole sequence is 2.05 meters, whereas during the occlusions it is equal to 2.43 meters. In Fig. 4 the track and the ground truth in map coordinates are shown. It can be noticed that the proposed approach maintains continuous tracking despite the occlusions, by using radio data when video measurements lack.

Figure 4: Real world track with 3 occlusions

5. Conclusion and Future Works

In the current paper, after a brief introduction to AmI systems, the problem of localization and tracking of an interacting user has been defined. In order to solve it, a possible general architecture has been presented, which exploits multiple and heterogeneous measures to Localize and Track (LaT) the target. The joint usage of video and radio data, in the considered example, allows a robust and stable relationship between user position and identity. The Hybrid Particle Filter, which is the focus of the paper, is the core of the LaT module, and it has been extensively tested both in a completely simulated environment and on real world data. The obtained results prove the effectiveness of the proposed method. Given the generality of the considered architecture, ongoing research is focused on the application of the Hybrid Particle Filter in multi-camera systems, where the joint position-ID information is useful when non-overlapped fields of view are present.
References

[1] A. F. Cattoni, A. Dore, and C. S. Regazzoni. Video-radio fusion approach for target tracking in smart spaces. In International Conf. on Information Fusion, Québec City, Québec, Canada, July 2007.
[2] E. Elnahraway, X. Li, and R. P. Martin. The limits of localization using RSS. In Proc. of the 2nd Int. Conf. on Embedded Networked Sensor Systems, SenSys '04, pages 283–284, 2004.
[3] D. L. Hall and J. L. Llinas. Multisensor data fusion. In D. L. Hall and J. L. Llinas, editors, Handbook of Multisensor Data Fusion, chapter 1, pages 1–10. CRC Press, 2001.
[4] J. Hightower and G. Borriello. Location systems for ubiquitous computing. IEEE Computer, 34(8):57–66, 2001.
[5] D. Makris, T. Ellis, and J. Black. Bridging the gaps between cameras. In Proc. of IEEE Conf. on CVPR, pages 205–210, 2004.
[6] L. Marchesotti, R. Singh, and C. S. Regazzoni. Extraction of aligned video and radio information for identity and location estimation in surveillance systems. In Int. Conf. on Information Fusion, Stockholm, Sweden, 2004.
[7] P. Pérez, J. Vermaak, and A. Blake. Data fusion for visual tracking with particles. Proceedings of the IEEE, 92(3):495–513, March 2004.
[8] S. Piva, C. Bonamico, C. S. Regazzoni, and F. Lavagetto. A flexible architecture for ambient intelligence systems supporting adaptive multimodal interaction with users. In G. Riva, F. Davide, F. Vatalaro, and M. Alcañiz, editors, Ambient Intelligence: The Evolution of Technology, Communication and Cognition Towards the Future of Human-Computer Interaction, pages 97–120. IOS Press, 2005.
[9] B. Ristic, S. Arulampalam, and N. Gordon. Beyond the Kalman Filter. Artech House Publishers, 2004.
[10] G. Sun, J. Chen, W. Guo, and K. J. R. Liu. Signal processing techniques in network-aided positioning: a survey of state-of-the-art positioning designs. IEEE Signal Processing Magazine, 22(4):12–23, July 2005.
[11] A. Tesei and C. S. Regazzoni. The asymmetric generalized Gaussian function: a new HOS-based model for generic noise pdfs. In Proc. of the 8th IEEE Sig. Proc. Workshop on SSAP, Washington, DC, USA, 1996.
[12] R. Tsai. A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses. IEEE Journal of Robotics and Automation, 3(4):323–344, August 1987.
[13] Y. Wu and T. S. Huang. A co-inference approach to robust visual tracking. In Proc. of ICCV, pages 26–33, Vancouver, Canada, 2001.