Machine Vision and Applications (2011) 22:505–520 DOI 10.1007/s00138-010-0264-1
ORIGINAL PAPER
Efficient detection and tracking of moving objects in geo-coordinates Yuping Lin · Qian Yu · Gérard Medioni
Received: 14 December 2008 / Revised: 15 March 2010 / Accepted: 25 March 2010 / Published online: 20 April 2010 © Springer-Verlag 2010
Abstract We present a system to detect and track moving objects from an airborne platform. Given a global map, such as a satellite image, our approach can locate and track the targets in geo-coordinates, namely the longitude and latitude obtained from geo-registration. A motion model in geo-coordinates is more physically meaningful than one in image coordinates. We propose a two-step geo-registration approach to stitch images acquired by satellite and UAV cameras. Mutual information is used to find correspondences between these two very different modalities. After motion segmentation and geo-registration, tracking is performed in a hierarchical manner: at the temporally local level, moving image blobs extracted by motion segmentation are associated into tracklets; at the global level, tracklets are linked by their appearance and spatio-temporal consistency on the global map. To achieve efficient time performance, graphics processing unit techniques are applied in the geo-registration and motion detection modules, which are the bottlenecks of the whole system. Experiments show that our method can efficiently deal with long-term occlusion and segmented tracks, even when targets fall out of the field of view.
Electronic supplementary material The online version of this article (doi:10.1007/s00138-010-0264-1) contains supplementary material, which is available to authorized users.
Y. Lin (B) · Q. Yu · G. Medioni
Institute for Robotics and Intelligent Systems, University of Southern California, 3737 Watt Way, PHE 101, Los Angeles, CA 90089, USA
Y. Lin e-mail: [email protected]
Q. Yu e-mail: [email protected]
G. Medioni e-mail: [email protected]
Keywords Airborne video surveillance · Geo-registration · Moving object detection and tracking · Real-time GPU processing
1 Introduction One of the goals in video surveillance is to identify and track all the relevant moving objects in the scene, and to generate exactly one track per object. This involves detecting the moving objects, tracking them while they are visible, and re-acquiring them once they emerge from an occlusion so as to maintain identity. This is a very difficult problem, even more so when the sensor is moving, as in aerial surveillance scenarios. To track from a moving camera, we need to describe the motion of moving objects in a common coordinate frame. The mosaic space, which is derived from the image coordinates of one frame, is usually selected as the tracking coordinate frame. Without further refinement, accumulated errors are inevitable if such fixed coordinates are selected. More importantly, object motion in image coordinates is not physically meaningful. Here, we propose to use a global map (a satellite image) as the reference frame. By registering images with a satellite image, we can generate the absolute geo-location of targets. Also, tracking in geo-coordinates provides physically meaningful measurements of object motion. In airborne surveillance applications, long-term occlusion is common. We introduce a hierarchical procedure to track moving objects under occlusion. In temporally local association, detected regions within a sliding window are associated to generate tracklets. Over a larger temporal span (called global data association), the tracklets are associated to form longer tracks and maintain track IDs. Local association is critical for successful tracking, since errors in local association cannot be
Fig. 1 Overview of the framework
rectified in the global one. We formulate the local association as finding the best partition of observations. In global association, given the knowledge of the maximum speed and acceleration of targets in geo-coordinates, we define the compatibility of tracklets, which reduces ambiguity in tracklet association. In addition, we adopt a rotation-invariant appearance descriptor [13] to represent both the color and shape distribution of targets in each tracklet. To achieve efficient time performance, we use a graphics processing unit (GPU) to implement the bottlenecks of the whole system, which include geo-registration and background modeling. The flowchart of our framework is shown in Fig. 1. The paper is organized as follows. We introduce the related work in geo-registration and aerial surveillance in Sect. 2. We present the two-step geo-registration approach and its GPU implementation in Sect. 3. The method to detect moving regions from a moving camera and its GPU implementation are presented in Sect. 4. A hierarchical tracking approach at the temporally local and global levels is presented in Sect. 5, followed by experimental results on real data sets and the overall time performance of the whole system in Sect. 6. 2 Related work Numerous algorithms and approaches have been proposed over the past decades to address different aspects of airborne video analysis. Irani et al. [9] proposed to convert a video stream into a series of mosaics that provide a complete coverage of the whole video. The mosaics can further be applied to facilitate compression, storage and indexing of the original video. Beyond 2D image registration, Kumar et al. [7] proposed a coarse-to-fine approach for geo-registration with depth information using a digital elevation model (DEM). The coarse initialization is implemented with local appearance matching using normalized correlation. The fine geo-registration is acquired by estimating the projection matrix of
the camera given the DEM. Throughout this paper, we assume the scene is planar and only 2D geo-registration is considered. After compensating for the principal camera motion, motion detection and tracking techniques can be applied to airborne videos. For example, the background subtraction method in [11] was applied in [1], which is a UAV tracking system implemented in Matlab. Many specific approaches have been proposed to deal with various aspects of airborne video analysis. In [32], a global parametric method is proposed to compensate for illumination changes, which is especially useful for thermal UAV videos. In [34], the authors proposed to use a motion history image to detect and segment motion, which is more robust than segmenting objects in a single frame. Kang et al. [12] proposed to use a sliding window method to establish a dynamic background model. In this paper, we adopt Kang's method for background modeling and use the motion history image to segment objects. After motion segmentation, moving objects are represented as primitive rectangles, i.e., bounding boxes. UAV tracking approaches usually select the mosaic space as the tracking coordinates, such as in [1,14], where the first frame is used as the common coordinate frame for tracking. Instead, we use progressively updated geo-coordinates as the tracking coordinates. Most of the existing approaches address one particular issue in airborne video analysis. Not much attention has been paid to developing a complete system that can automatically and efficiently detect and track multiple moving objects in geo-coordinates. In this paper, we aim to design a system that streamlines robust and efficient modules to achieve a complete capability to detect and track moving objects in geo-coordinates. Recently, the general purpose GPU (GPGPU) framework has been successfully applied to many computer vision problems to achieve high performance computation, for example, real-time stereo [25,33], feature extraction and matching [27], and foreground segmentation [6]. Some open source GPU-based libraries, such as OpenVidia [10] and MinGPU [2], have become available to the computer vision community. The Video Orbits algorithm [20] has been implemented on GPU and parallel GPU in OpenVidia with real-time performance. The processes of feature extraction, feature matching and parametric transformation estimation by RANSAC, which are involved in motion compensation, have GPU-based implementations as well, such as in [10,27]. Due to the computational cost of airborne video analysis, robust but complex methods suffer from slow time performance. Thus, we use a GPU to achieve efficient implementations of the bottlenecks of the whole system.
3 Two-step geo-registration using mutual information In our UAV environment, we assume the scene to be locally a planar surface. This is a reasonable assumption when the structures on the ground plane are relatively small compared
with the camera height above the ground plane. Under this assumption, we use a homography to represent both the transformation between two consecutive UAV images (Eq. 1) and the transformation between an UAV image and the map, which is the reference image that we align to (Eq. 2):

x^j = H_{i,j} \, x^i,    (1)
x^M = H_{i,M} \, x^i,    (2)

where x^i, x^j and x^M represent the same point in the UAV frames I_i, I_j and in the map M, respectively. Similarly, we use I_i^j and I_i^M to represent the images obtained by warping I_i to the coordinates of the UAV frame I_j and of the map M, respectively, i.e.,

I_i^j = H_{i,j} \, I_i,    (3)
I_i^M = H_{i,M} \, I_i.    (4)
The geo-registration task then aims at finding the best H_{i,M} that aligns I_i and M together. This is challenging since the map and the UAV images are taken at different times, from different sensors and viewpoints, and may have different dynamic content, such as vehicles and shadows. As a result, it is difficult to directly match each incoming image to the map. Instead, we propose to use a two-step procedure as illustrated in Fig. 2. In the first step, we register consecutive frames to estimate H_{i,i-1}. Given that the frame I_{i-1} has already been registered to the map, H_{i,M} can be estimated as:

H_{i,M} \approx H_{i-1,M} \times H_{i,i-1}.    (5)

In the second step, with the estimated H_{i,M}, we first produce a local mosaic in the coordinate frame of I_i. Then we register it directly to M and derive H so that:

H_{i,M} = H \times H_{i-1,M} \times H_{i,i-1}.    (6)
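To make the two-step update concrete, the following is a minimal sketch (ours, not the authors' code) of how the homography chain of Eqs. 5 and 6 can be maintained with NumPy; `refine_against_map` is a hypothetical hook standing in for the mutual-information refinement described later in Sect. 3.2.

```python
import numpy as np

def update_geo_registration(H_prev_map, H_i_prev, refine_against_map):
    """Two-step geo-registration update (Eqs. 5 and 6).

    H_prev_map         : 3x3 homography mapping frame I_{i-1} to the map M
    H_i_prev           : 3x3 homography mapping frame I_i to frame I_{i-1}
    refine_against_map : callable returning the tuning homography H
                         (hypothetical stand-in for the MI-based refinement)
    """
    # Step 1: chain the frame-to-frame estimate (Eq. 5)
    H_i_map_approx = H_prev_map @ H_i_prev
    # Step 2: register the local mosaic to the map and apply the tuning H (Eq. 6)
    H_tune = refine_against_map(H_i_map_approx)
    return H_tune @ H_i_map_approx
```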
Fig. 2 Overview of the geo-registration framework

The use of a local mosaic enhances the robustness, as will be discussed later. In the following sections, we first describe the method we use to register I_i to its previous image I_{i-1}. Then we introduce our method to register the local mosaic, which derives H.

3.1 Registration of consecutive UAV images

To compute H_{i,i-1}, we extract feature points from both I_i and I_{i-1} and match them to form correspondences. After trying many kinds of features, we selected Scale Invariant Feature Transform (SIFT) [19] features. SIFT features are invariant to image scale and rotation, and provide robust descriptions across changes in 3D viewpoint. In the feature matching step, we use nearest neighbor matching [4]. Then we use RANSAC [5] to filter outliers among the set of correspondences and derive H_{i,i-1}. With H_{i,i-1} and the given H_{0,M}, we can roughly register the UAV image to the map by estimating H_{i,M} as

H_{i,M} = H_{i-1,M} \times H_{i,i-1} = H_{0,M} \times \prod_{k=1}^{i} H_{k,k-1}.    (7)
This shows that if there exists a subtle transformation error in each H_{k,k-1}, these errors accumulate and result in a significant error. Thus, a direct registration that establishes correspondences between the UAV image and the map and refines the homography is necessary.

3.2 UAV to map registration

Registering an aerial image to a map is a challenging problem [8,21]. Due to significant differences in lighting conditions, resolution, and 3D viewpoint between the UAV image and the map, the same point may yield quite different SIFT descriptors. Therefore, poor feature matching and poor registration can be expected. Since it is difficult to register an UAV image to the map directly, we make use of H_{i,i-1} derived from UAV-to-UAV registration and approximate H_{i,M} (Eq. 5). Our goal is to derive a tuning homography H that compensates for the error introduced in the composition (Eq. 6), so that the image is accurately aligned to the map. The advantage of this approach is that, with the approximated H_{i,M}, I_i is roughly aligned to the map. We can then perform a local search for correspondences under the same scale and orientation. Therefore, the ambiguity of matching and the computation time are far less than when directly registering I_i to the map.

Fig. 3 The correspondences between an UAV image and a map. The red points are correspondences whose spatial distances are over a threshold, which are excluded from the input for RANSAC. The green points and blue points are the RANSAC inliers and outliers, respectively
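As an illustration of the consecutive-frame registration of Sect. 3.1 that produces H_{i,i-1}, the sketch below (ours, assuming an OpenCV build with the SIFT implementation available) matches SIFT features with a nearest-neighbor test and estimates the homography with RANSAC.

```python
import cv2
import numpy as np

def register_consecutive(img_i, img_prev, ratio=0.75):
    """Estimate the homography H_{i,i-1} between two consecutive UAV frames."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img_i, None)
    kp2, des2 = sift.detectAndCompute(img_prev, None)

    # Nearest-neighbor matching with Lowe's ratio test
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
            if m.distance < ratio * n.distance]

    src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

    # RANSAC filters outlier correspondences and yields H_{i,i-1}
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    return H
```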
3.2.1 Finding correspondences between UAV image and map

To derive H, we have to find correspondences between Ĩ_i^M and the map M, where Ĩ_i^M is the image obtained by warping I_i using the approximated homography, i.e.,

\tilde I_i^M = (H_{i-1,M} \times H_{i,i-1}) \, I_i.    (8)

We use a patch-based method to establish correspondences. A point p in M is matched to the point q in Ĩ_i^M that has the highest similarity measure, defined using mutual information [31]. Namely,

q = \arg\max_x \; mi\big( w(x, \tilde I_i^M), \, w(p, M) \big),    (9)

where mi(·) is the function that computes the mutual information of two image patches, and w(x, I) is an image patch in I centered at point x. The search can be performed efficiently by looking at patches in a local area, since Ĩ_i^M and M are roughly aligned.

The use of mutual information as the similarity measure can be justified since the UAV images and the reference map are captured at different times and from different views. The illumination and the dynamic content can be very different. The mutual information of two random variables measures the dependence of the two variables. Taking two images as the random variables, it measures how much information the two images share, or how much one image depends on the other. As it requires no a priori model of the relationship between scene intensities in different views, this measure is more robust under our conditions than measures such as cross-correlation or the sum of squared differences. However, mutual information is computationally demanding, as it needs to compute the entropy and the joint entropy of two variables, which involves estimating the marginal probabilities and joint probabilities of each sample pixel. Our goal is to provide a stabilized image sequence for real-time moving object detection. Therefore, the speed-up of computing mutual information becomes a critical issue. We have ported the computation to a GPU platform and improved the speed-up to a factor of 400, as described later in Sect. 3.3.

3.2.2 Extracting reasonable correspondences

We can generate as many correspondences as we want by matching different pixels in the map. In our experiments, the generated correspondences can be classified into three categories: outliers, correspondences not on the reference planar surface, and correspondences on the reference plane. For image patches in homogeneous areas, such as those containing only road or field, the similarity measure is meaningless and tends to result in outliers. Even if the correspondences are correct, they may come from structures that do not belong to the reference plane, such as roofs, which should be discarded when deriving the homography. Therefore, we need to filter out the first two types of correspondences and use only the correct ones on the ground plane when deriving H. To achieve this, we rely on the approximated homography. As mentioned for Eq. 9, with a good approximation of H_{i,M}, a point p in M should have its correspondence q in Ĩ_i^M nearby. Hence we only take the correspondences whose spatial distance is under a threshold when deriving H. As shown in Fig. 3, the red points illustrate the correspondences that are not close enough to each other. They most likely appear on structures that cause parallax, or in homogeneous regions.

3.2.3 Local mosaic

In addition to using the filtered correspondences as described in the previous sections, we can include the inlier correspondences computed in the previous frames for estimating H. Namely, we are registering a local mosaic, defined as {I_i^M} (we use 10 for the set size), to the map. There are several advantages in registering a local mosaic instead of registering only a single frame to the map. First, it provides a large number of correspondences as input to RANSAC, so the derived H is more robust. Second, since we use correspondences from multiple frames to derive H, the correspondences in I_i only have a marginal effect, which implies that H_{i,M} will not change abruptly. Most importantly, we
can consider the correspondences in the previous frames as prior knowledge for deriving H, so even if the correct correspondences in I_i are not dominant after extracting good correspondences as described in the previous section, they will still stand out after performing RANSAC. As shown in Fig. 4, registering only a single frame results in significant discontinuities, while registering a local mosaic results in a smooth transition.

Fig. 4 The left figure is the registration result of registering only a single frame. The dashed lines highlight places of discontinuities. The right figure is the result of registering a local mosaic in each iteration, which has a smooth transition

3.3 Mutual information computation with GPU acceleration

As described earlier, the computation of mutual information is computationally demanding. We focus on speeding up Viola's approach [31] to computing mutual information. In this approach, the number of exponential computations is n², where n is the number of samples required to estimate the entropy of a variable. The simplest way to reduce the number of exponential computations is to precompute the Gaussian densities of all possible intensity values into a lookup table. However, if the intensity range is large, the precomputation becomes an overhead. Meihe et al. [22] present an approach that maps all the intensities into a smaller dynamic range and computes/stores the Gaussian densities for them. Nevertheless, such efforts produce limited speed-up, which can still be impractical for many applications. In addition, extra interpolation is required if the intensity values are in floating-point precision. We find that Viola's method is parallelizable, and we achieve a significant speed-up using a GPU. Specifically, we use CUDA [23], a new hardware and software architecture for computing on a GPU. We describe Viola's approach to approximating mutual information below, followed by the implementation details in the next section.
3.3.1 Mutual information approximation

From information theory, the entropy of a random variable is defined as:

h(v) = -\int p(v) \ln p(v) \, dv,    (10)

and the joint entropy of two random variables is defined as:

h(u, v) = -\int\!\!\int p(u, v) \ln p(u, v) \, du \, dv.    (11)

The mutual information of two random variables u, v is then defined as

mi(u, v) = h(u) + h(v) - h(u, v).    (12)

In [31], the probability density p(v) is estimated using the Parzen window method:

p(v) \approx \frac{1}{N_A} \sum_{v_j \in A} G_\psi(v - v_j),    (13)

where N_A is the number of samples in A, and G_\psi denotes a Gaussian function with variance \psi. The Parzen window is a widely adopted technique for estimating a probability density, since it directly uses the samples drawn from the unknown distribution and uses a Gaussian mixture model to estimate the density, which is robust to noise. Equation 10 can be expressed as a negative expectation of \ln p(v), i.e., h(v) = -E_v(\ln p(v)). Together with Eq. 13, the entropy of a random variable v is approximated as:

h(v) \approx \frac{-1}{N_B} \sum_{v_i \in B} \ln \frac{1}{N_A} \sum_{v_j \in A} G_\psi(v_i - v_j),    (14)

where B is another sample set. The joint entropy can be approximated in the same manner by drawing pairs of corresponding samples from the two variables. Hence the mutual information is approximated as:

mi(u, v) \approx \frac{-1}{N_B} \left\{ \sum_{u_i \in B} \ln \frac{1}{N_A} \sum_{u_j \in A} G_{\psi_u}(u_i - u_j) + \sum_{v_i \in B} \ln \frac{1}{N_A} \sum_{v_j \in A} G_{\psi_v}(v_i - v_j) - \sum_{w_i \in B} \ln \frac{1}{N_A} \sum_{w_j \in A} G_{\psi_{uv}}(w_i - w_j) \right\}    (15)
in which w = [u, v]^T. There exist other approximation formulas for computing mutual information [16,29], which we have not considered here. Algorithm 1 implements Eq. 15 to compute the mutual information of two images. I_u and I_v denote the two images, while I(x) is the intensity value at x in image I. B_u, B_v, B_uv and A_u, A_v, A_uv are the values of the outer summation and inner summation in Eq. 14, respectively. Note that we assume the two images to be independent and approximate the joint probability as the product of the two marginal probabilities (line 9). We also assume the two images have the same variance ψ, i.e., ψ_u = ψ_v = ψ. From Algorithm 1, it is not hard to see that the statistics computed for each element i ∈ B are independent of the others, and can therefore be computed efficiently on a GPU.
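The inner loops of Algorithm 1 map directly onto array operations. The following NumPy sketch (ours, not the paper's GPU code) evaluates the approximation of Eq. 15 for two grayscale patches; the random draw of the sample sets A and B is a simplification.

```python
import numpy as np

def mutual_information(patch_u, patch_v, n_a=225, n_b=225, psi=10.0, rng=None):
    """Parzen-window approximation of mutual information (Eq. 15 / Algorithm 1)."""
    rng = np.random.default_rng() if rng is None else rng
    u = patch_u.astype(np.float64).ravel()
    v = patch_v.astype(np.float64).ravel()

    # Sample sets A and B, drawn with replacement for simplicity
    idx_a = rng.integers(0, u.size, n_a)
    idx_b = rng.integers(0, u.size, n_b)

    def gauss(d):                        # unnormalized Gaussian kernel G_psi
        return np.exp(-0.5 * d ** 2 / psi)

    # Pairwise kernel values between samples in B (rows) and A (columns)
    g_u = gauss(u[idx_b, None] - u[None, idx_a])
    g_v = gauss(v[idx_b, None] - v[None, idx_a])
    g_uv = g_u * g_v                     # joint term under independence (line 9)

    b_u = np.log(g_u.mean(axis=1)).sum()
    b_v = np.log(g_v.mean(axis=1)).sum()
    b_uv = np.log(g_uv.mean(axis=1)).sum()
    return -(b_u + b_v - b_uv) / n_b
```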
Fig. 5 Mutual information computation time (1000 iterations) with respect to different patch sizes
3.3.2 Experimental results The experiments are performed on a workstation with a Xeon Quad Core 2.33 GHz CPU and 4 GB of DDR2 memory (667 MHz). The graphics card we use with CUDA support is a GeForce 8800 GTX.¹ We run a thousand iterations of both implementations and compare the average time with respect to different image patch sizes. For each pixel in the image patch, we use 225 pixels around it for its entropy approximation, namely |A| = 225. The result is shown in Fig. 5. As the size of the image patch increases, the computation time for the GPU remains relatively constant compared with that of the CPU. When the patch size equals 25, the size which we use to extract correspondences, the computation on the GPU is faster than on the CPU by a factor of 40.
If the mutual information of multiple patches in the same pair of images is to be computed, the GPU computation time can be further reduced, since the image data only needs to be transferred once. This results in a huge speed boost when registering an UAV image to the map, where hundreds of correspondences are to be computed. In our implementation, we use 384 (24 horizontally and 16 vertically) point correspondences for estimating the homography from an UAV image to the map. For each point in an UAV image, we search in a 15 × 15 neighborhood in the map image for the best match, as described in Sect. 3.2.1. This amounts to 86,400 mutual information computations, which take only 30 s on the GPU, whereas the CPU needs almost 3 h.

¹ 128 streaming processors, 575 MHz, 768 MB DDR3 at 1.8 GHz, and 16 KB of shared memory per block.

Algorithm 1: Mutual information
input : Images I_u and I_v; sample sets A and B
output: Mutual information of I_u and I_v
 1  Initialize B_u, B_v, B_uv to 0;
 2  foreach i in B do
 3      Initialize A_u, A_v, A_uv to 0;
 4      foreach j in A do
 5          G_u ← G_ψ(I_u(i) − I_u(j));
 6          G_v ← G_ψ(I_v(i) − I_v(j));
 7          A_u ← A_u + G_u;
 8          A_v ← A_v + G_v;
 9          A_uv ← A_uv + G_u × G_v;
10      end
11      B_u ← B_u + ln(A_u / N_A);
12      B_v ← B_v + ln(A_v / N_A);
13      B_uv ← B_uv + ln(A_uv / N_A);
14  end
15  return −(B_u + B_v − B_uv) / N_B;

4 Efficient motion segmentation on a moving platform
The goal of motion segmentation is to differentiate independently moving regions from the background. Motion detection from a stationary camera has been explored extensively and is regarded as a computationally efficient solution for many applications. However, it is still computationally demanding to achieve the same goal from a moving platform. 4.1 Approach For a moving camera, such as an airborne camera, the camera motion needs to be compensated for prior to estimating the background model. We assume the camera motion can be approximately compensated by a 2D parametric transformation. As presented in Sect. 3, a 3 × 3 homography is used to represent this 2D parametric transformation. This assumption is exact for Pan-Tilt-Zoom (PTZ) cameras and is a good approximation when the scene is far from the optical center and the depth variation is small compared to the distance, which is the case in airborne image sequences.
Fig. 6 Background modeling using a sliding window
Given the transformation between any two frames, a sequence of frames can be warped to a reference frame. In the warped frames, the camera motion is stabilized relative to the reference frame. Then the background model is computed by collecting statistics of the appearance at each 2D location. We use the same technique as described in Sect. 3.1 to estimate the homography H_{i-1,i} between two consecutive frames, and register any two frames by concatenating the estimated pairwise transformations as:

H_{i,j} = \begin{cases} \prod_{k=i}^{j-1} H_{k,k+1} & i < j \\ I & i = j \\ (H_{j,i})^{-1} & i > j \end{cases}    (16)

To avoid accumulated errors, we adopt Kang et al.'s [12] sliding window-based method, where a number of frames are warped to a reference frame. The center frame of the sliding window is used as the reference frame. This reduces the accumulated errors by half in terms of the length of the sliding window. Also, even a wrong registration result will not influence the motion detection in the entire sequence, but only within the sliding window. After motion compensation, the decision is made according to the statistics of the corresponding pixels. Suppose the size of the sliding window is w; the background model I_{bg} for the reference frame I_r is

I_{bg} = \mathrm{mode}\big( H_{r-w/2,r} \, I_{r-w/2}, \ldots, I_r, \ldots, H_{r+w/2,r} \, I_{r+w/2} \big).    (17)
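For reference, here is a minimal CPU-side sketch (ours, not the paper's GPU implementation) of Eqs. 16 and 17: each frame in the sliding window is warped to the reference frame with its given homography, a 16-bin histogram is built per pixel, and the mean of the samples in the mode bin is used as the background value.

```python
import cv2
import numpy as np

def background_mode(frames, homographies, n_bins=16):
    """Mode-based background model for the reference frame (Eqs. 16-17).

    frames       : list of grayscale frames in the sliding window
    homographies : list of 3x3 matrices H_{i,r} mapping each frame to the reference
    """
    h, w = frames[0].shape
    warped = np.stack([cv2.warpPerspective(f, H.astype(np.float64), (w, h))
                       for f, H in zip(frames, homographies)]).astype(np.float32)

    # Per-pixel histogram over the sliding window, then pick the mode bin
    bins = np.minimum((warped * n_bins / 256.0).astype(np.int32), n_bins - 1)
    hist = np.stack([(bins == b).sum(axis=0) for b in range(n_bins)])
    mode_bin = hist.argmax(axis=0)

    # Average the samples falling into the mode bin (avoids quantization error)
    in_mode = bins == mode_bin[None, :, :]
    return (warped * in_mode).sum(axis=0) / np.maximum(in_mode.sum(axis=0), 1)
```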
Figure 6a illustrates the correspondence of pixel locations prior to the computation of the background model. Figure 6b–g shows the warped images and the reference image. As we can see from Fig. 6b–g, the 2D image motion induced by the moving camera is compensated. However, the sliding window method is computationally expensive, as the reference frame changes when the sliding window moves and all the warped images have to be recomputed. This involves many floating-point interpolation operations, which is a heavy burden for the CPU, especially for a large sliding window such as 91 frames. 4.2 GPU-based implementation In the GPGPU framework, the fully programmable vertex and fragment processors provide powerful computational tools for general purpose computations. Vertex shaders allow us to manipulate the data that describes a vertex. Fragment shaders serve to manipulate a pixel. To implement an algorithm on the GPU, different computational steps are often mapped to different vertex and fragment shaders. We present a GPU implementation of the sliding window-based method
and separate the process into two steps: warping images in the vertex profile and computing the background model in the fragment profile. Also, we minimize memory transfer between the GPU and the CPU: the inputs to our implementation are a sequentially loaded image frame and the corresponding homography transformation, and the output is the background image for each frame. The overview structure of the implementation is shown in Fig. 7.

Fig. 7 Overview of the GPU-based implementation

Our implementation stores the sequential frames of the sliding window as 2D textures in the GPU memory. The most recent frame is loaded and the oldest frame is moved out of the texture pool. The warping involves changing the texture coordinates and is implemented in the vertex profile. The vertex profile takes the 3 × 3 homography matrix as input and outputs the warped texture coordinates by a simple matrix multiplication. The transformed texture coordinates are used in the fragment profile. To compute the background model, different statistics can be applied in Eq. 17, such as the mode, the median and the mean of the samples in the sliding window. For motion detection from a moving platform, the mode is usually preferred over the other two. This is due to the limited number of samples that can be collected for each pixel: the mean and the median do not differentiate the foreground and background pixels and lead to a biased background model when limited samples are available. The computation of the mode requires building a histogram and then finding the bin with the largest number of samples (another way is to maintain a dynamic histogram by constructing a binary search tree, but this involves too many branching operations and is thus not appropriate for a GPU implementation). One histogram is built for each location in the reference frame, and each bin records the number of hits over the sliding window. This is different from the usual notion of a histogram built over the whole image. The overview of the GPU implementation is shown in Algorithm 2.

Algorithm 2: Overview of GPU implementation
input : (I_1, ..., I_w), (H_{1,r}, ..., H_{w,r})
output: Difference image I_diff
foreach frame I_i in the sliding window do
    1. Generate texture coordinates according to H_{i,r}
    2. Build a histogram for each pixel
end
Compute the mode by reduction
Compute the average of the samples in the mode bin as the background intensity

We use a fixed number of bins, n = 16, to construct a histogram. Using an RGBA texture, a histogram of n bins for each channel leads to a tile of √(n/4) × √(n/4) texture elements. For each pixel, we need to construct such a histogram; therefore, the size of the RGBA texture is W√(n/4) × H√(n/4), where W and H are the width and the height of the original images, respectively. Suppose n = 16 (this is enough for an eight-bit dynamic range); then the histogram texture memory is 4 times the size of the original one. In an RGBA texture with n = 16, (H(0), H(4), H(8), H(12)) is stored in a float4 vector on the GPU with the same texture coordinates. After one frame texture is loaded, the hits in each bin are updated. A bin is indexed by an intensity value, e.g., between [0, 15] for n = 16. It is not efficient to use "if–else" type statements to determine the placement of one intensity value in the histogram. For efficiency, it is preferable to use standard Cg functions (http://developer.nvidia.com/object/cg_toolkit.html) rather than branching statements. We adopt the approach in [10], which uses the function δ_b(x) = cos²(α(x − b)) to indicate whether the value x belongs to the bin b, where α is used to keep α(x − b) within (−π/2, π/2). The cos(·) function can be squared repeatedly or filtered by a floor operation to yield a more concentrated impulse. To find the mode of the histogram, we use the reduction operation as in [17]. The required number of reduction iterations is log₂(√(n/4)). The procedure of computing the mode is illustrated in Fig. 8, where each grid cell corresponds to one bin. Updating the histogram is implemented in a "ping-pong" fashion [10]. After we find the mode, i.e., the bin with the largest number of samples, we use the average of the samples that fall in the mode bin as the background model. This refinement is necessary to avoid the quantization error introduced by building histograms. The difference between the background model and the reference image is the output that is transferred to the CPU. For color videos, we compute the background model independently for each channel.
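The branch-free bin test adopted from [10] is easy to express in a few lines. The sketch below is a CPU illustration in Python of the idea only (the paper implements it in Cg on the GPU): with α = π/(2n), the floored cos² term is exactly 1 when the quantized value x equals the bin index b and 0 otherwise, without any if–else statement.

```python
import numpy as np

def bin_indicator(x, b, n_bins=16):
    """Branch-free indicator that quantized value x falls into bin b (after [10]).

    x and b are integer bin indices in [0, n_bins - 1]; the result is 1.0 when
    x == b and 0.0 otherwise, with no conditional branching.
    """
    alpha = np.pi / (2.0 * n_bins)      # keeps alpha * (x - b) inside (-pi/2, pi/2)
    return np.floor(np.cos(alpha * (np.asarray(x, dtype=float) - b)) ** 2)

# Accumulating a per-pixel histogram over the sliding window then reduces to
# hist[b] += bin_indicator(quantized_frame, b) for each bin b.
```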
Fig. 8 Procedure to compute the mode. a Construct a histogram: a 16-bin histogram is built in one RGBA texture with the doubled width and height; b compute the mode: the mode is computed among all bins in the RGBA histogram texture
4.3 Experiments and results

The GPU and CPU versions are implemented on a workstation with an Intel Xeon Quad Core CPU at 3.0 GHz, 4 GB of DDR2 RAM (667 MHz) and an NVIDIA Quadro FX 3500. Both the CPU and GPU versions take the same input, namely the original frames and homographies. The CPU version uses Intel IPL 2.5 (Image Processing Library) to perform warping with linear interpolation (the interpolation method affects the timing). For the GPU version, RGBA floating point textures are used for storing the frame textures on the GPU. This is supported on most GPU profiles. For color videos, we use three RGBA textures to compute the background model in the RGB channels in parallel. Most modern GPUs support at least four texture attachments. For both the CPU and GPU versions, 16 bins are used in constructing the histograms. All time measures focus only on background modeling, excluding loading images from videos, extracting features and estimating homographies.

We first evaluate the quality of the background model computed by the GPU. This evaluation is in general difficult to perform, as it would require ground truth background models. Hence, we compare the background model output by the GPU with the one output by the CPU version. The average difference of 1,000 frames of GPU and CPU outputs, 2|I_GPU − I_CPU| / (I_GPU + I_CPU), is 0.3% and the variance is 0.1%. Thus, we can regard the GPU and CPU versions as providing the same results. Some of the background model results are shown in Fig. 9. We show a mosaic view of many background images over time to demonstrate the quality of the background models in Fig. 9c. When we generate the mosaic, we simply overwrite the mosaic view with new frames without any blending operation. From the mosaic view, we can see that the quality of the background model is consistently good over time.

Fig. 9 Visual results of background modeling using the mode approach. a Reference frame; b background model in the reference frame; c motion mask in the reference frame; d mosaic view of the background models over time

We then evaluate the speed-up that the GPU version achieves over the CPU implementation, with different sizes of the sliding window and at different resolutions. The timing comparison between the CPU and GPU implementations is shown in Fig. 10. The timing of both methods is computed by averaging multiple runs. Figure 10 shows that the time performance is basically proportional to the image size and the sliding window size. A larger sliding window provides more samples and generates a better background model (we usually use 91 frames in a sliding window).
Fig. 10 GPU timings compared with the CPU counterpart for a range of sliding window sizes and different resolutions. a Mode background model on 640 × 480 videos; b mode background model on 320 × 240 videos. (Each plot shows average time in seconds versus sliding window size from 61 to 131 frames, for the GPU version using mode and the CPU version using mode.)
For 320 × 240 resolution videos, the GPU version can run at 10 fps and achieves a speed-up over the CPU version by a factor of 15.
5 Hierarchical tracking of multiple moving objects

In airborne video analysis, long-term occlusion is common. Also, objects may fall out of the field of view (FoV) due to the camera motion. We introduce a two-step procedure for tracking to cope with these issues. The first step (called local association) links detected moving regions within a sliding window and generates tracklets. The second step (called global association) links the tracklets to form longer tracks and maintain track IDs.

5.1 Local data association

Given a set of observations Y over time T, the local data association problem is formulated as maximizing a posteriori (MAP) the probability of a partition ω = {τ_0, τ_1, τ_2, ..., τ_K}, such that:

\omega^* = \arg\max_\omega \, p(\omega \mid Y),    (18)

where τ_0 is the set of false alarms and τ_k is the track k among the K tracks of the given partition. We use a graph representation of all measurements within the time frame [0, T]. Let y_t = {y_t^i : i = 1, ..., n_t} denote the observations at time t; Y = ∪_{t ∈ {1,...,T}} y_t is the set of all the observations during [0, T]. The partition can be explicitly drawn from this measurement graph (V, E), where each measurement y_t^i is represented by a node in V, and each edge corresponds to a temporal association reflecting spatial properties such as the spatial overlap between detected regions. We define a neighborhood in the graph (V, E) where edges are defined between any two neighboring nodes:

N = \{ (y_{t_1}^i, y_{t_2}^j) : \| y_{t_1}^i - y_{t_2}^j \| < \Delta t \cdot v_{max} \},    (19)

where v_max is the maximum speed of targets.

The posterior distribution for the partition with an unknown number of targets and observations over T frames can be modeled as:

P(\omega \mid Y) \propto \prod_{k=1}^{K} \psi(\tau_k) \prod_{j \neq k} \phi(\tau_k, \tau_j),    (20)

where ψ(τ_k) is the temporal compatibility within one track and φ(τ_k, τ_j) is the spatial compatibility between different tracks, respectively. The posterior distribution in Eq. 20 can be viewed as having two distinct components: (i) ψ(τ_k) controls the inner-smoothness of each track, encoded by the joint motion and appearance likelihood; (ii) φ(τ_k, τ_j) encodes the interaction between different tracks. We will now discuss each one of these in turn.

5.1.1 Motion and appearance model
Here targets are represented by image blobs. Once a partition ω is chosen, the tracks {τ_1, ..., τ_K} and false alarms τ_0 are determined, and for each track the assigned observations are determined. To make full use of the observations for target tracking, we consider a joint probability framework incorporating both motion and appearance information. Therefore, ψ(τ_k) in Eq. 20 can be represented as follows:

\psi(\tau_k) = \prod_{l=1}^{|\tau_k|-1} P_{mot}(\tau_k(t_{l+1}) \mid \bar\tau_k(t_l)) \, P_{app}(\tau_k(t_{l+1}) \mid \tau_k(t_l))    (21)

Given the geo-registration result, we can map an image blob from the UAV image to the map. We denote by x_t^k the state vector of target k at time t, [l_x, l_y, w, h, \dot l_x, \dot l_y] (centroid position, width, height and velocity on the 2D map). We consider a linear kinematic model with constant velocity dynamics:

x_{t+1}^k = A^k x_t^k + w^k,    (22)

where A^k is the transition matrix, and we assume w^k to be normally distributed, w^k ∼ N(0, Q^k). The observation y_t^k = [l_x, l_y, w, h] contains the measurement of a target's position and size on the 2D map. Since observations often contain false alarms, the observation model is represented as:

y_t^k = \begin{cases} H x_t^k + v^k & \text{if it belongs to a target} \\ \delta_t & \text{if it is a false alarm} \end{cases}    (23)

where y_t^k represents the measurement, which may arise either from a false alarm or from the target. We assume v^k to be normally distributed, v^k ∼ N(0, R^k); δ_t is a 2D random variable with uniform distribution on the map. Let \hat\tau_k(t_i) and \hat P_t(\tau_k) denote the posterior estimated state (i.e., x_t^k in Eq. 22) and the posterior covariance matrix of the estimation error at time t of τ_k; τ_k(t) is the associated observation (i.e., y_t^k in Eq. 23) for track k at time t. The motion likelihood of track τ_k along one edge (τ_k(t_1), τ_k(t_2)) ∈ E, t_1 < t_2, can be represented as P_{mot}(τ_k(t_2) | \hatτ_k(t_1)). Given the transition and observation model of a Kalman filter, the motion likelihood can then be written as:

P_{mot}(\cdot) = \frac{1}{(2\pi)^2 \sqrt{\det(\hat P_{t_2}(\tau_k))}} \exp\left( \frac{-e^T \hat P_{t_2}^{-1}(\tau_k) \, e}{2} \right),    (24)

where e = τ_k(t_2) − H A^{t_2 - t_1} \hatτ_k(t_1), and \hat P_{t_2}(τ_k) can be computed recursively by a Kalman filter as \hat P_{t_2}(τ_k) = H (A \hat P_{t_2-1}(τ_k) A^T + Q) H^T + R.

To model the appearance of each detected region, we adopt a histogram-based appearance model of the image blobs. All RGB bins are concatenated to form a one-dimensional histogram. The appearance likelihood between two connected image blobs (τ_k(t_1), τ_k(t_2)) ∈ E, t_1 < t_2, in track k is measured using the symmetric Kullback–Leibler (KL) distance and is defined as follows, where P(c) is the bin value of the normalized histogram:

P_{app}(\cdot) = \exp\left( -\frac{1}{2} \sum_{c=r,g,b} (P_i(c) - P_j(c)) \log \frac{P_i(c)}{P_j(c)} \right)    (25)

5.1.2 Interaction model

The motion and appearance likelihood models provide the inner-smoothness constraint for each track independently. However, without a priori knowledge of the number of targets, the inner-smoothness constraint favors shorter paths, and therefore tends to split a trajectory into a large number of sub-tracks. To overcome this overfitting problem, a prior on the detection and the targets' behavior (such as detection and false alarm rate, termination and birth rate, etc.) is commonly assumed known [3,24]. We propose to use an interaction model that penalizes object overlapping, based on Markov Random Fields (MRFs) [15,28] defined on the neighborhood graph. The joint interaction between all existing nodes over time is factored as
the product of local potential functions at each node. In this MRF, the cliques are pairs of nodes that are connected in the graph (V, E). The interaction potential between τ_k and τ_j is defined by:

\phi(\tau_k, \tau_j) = \prod_{l=1}^{|\tau_k|-1} \prod_{m=1}^{|\tau_j|-1} \exp(-\lambda \, \rho(\tau_k(l), \tau_j(m))),    (26)
where ρ is the spatial overlap between two observation nodes. The interaction potential is minimal when the observations have a large spatial overlap and maximal when they do not overlap. The introduction of this inter-track exclusion avoids smoothness overfitting, e.g., splitting tracks into smaller tracks.

5.1.3 MCMC data association algorithm

We use data-driven MCMC to estimate the best partition of the solution space. The sampling is guided by the posterior distribution defined in Eq. 20. Here the sampling is similar to the one in [24]. The difference is that we drive the sampler, in a probabilistic manner, using both motion and appearance likelihoods. Moreover, to make the sampler more flexible, we draw samples in both temporal directions, looking forward and backward in time. This bi-directional sampling gives more flexibility and significantly reduces the total number of samples needed for convergence. We use the following notation on the graph structure: N(·) is the neighbor set of an observation, i.e., N(y_{t_1}^i) = {y_{t_2}^j : (y_{t_1}^i, y_{t_2}^j) ∈ E}; an observation y_{t_2}^j ∈ N(y_{t_1}^i) belongs to the child set N^c(y_{t_1}^i) or to the parent set N^p(y_{t_1}^i), exclusively, when t_2 > t_1 or t_2 < t_1, respectively.

Extension/reduction: The purpose of the extension/reduction move is to extend or shorten the estimated trajectories given a new set of observations. For a forward extension, we select uniformly at random (u.a.r.) a track τ_k from the K available tracks τ_1, ..., τ_K. Let τ_k(end) denote the last node in the track τ_k. For each node y ∈ N^c(τ_k(end)), we have the association probability p(y) = p(y | \barτ_k(end)) / Σ_z p(z | \barτ_k(end)). We associate y and track τ_k according to this normalized probability, and then append the new observation y to τ_k with a probability γ, where 0 < γ < 1. Similarly, for a backward extension, we consider a node y ∈ N^p(τ_k(start)) and use reverse dynamics for estimating the association probability p(y). The reduction move consists of randomly shortening a track τ_k (selected u.a.r. from the K available tracks τ_1, ..., τ_K) by selecting a cutting index r u.a.r. from 2, ..., |τ_k| − 1. In the case of a forward reduction the track τ_k is shortened to {τ_k(t_1), ..., τ_k(t_r)}, while in a backward reduction we consider the subtrack {τ_k(t_r), ..., τ_k(t_{|τ_k|})}.

Birth/death: This move controls the creation of a new track or the termination of an existing trajectory. In a birth move, we
select u.a.r. a node y ∈ τ_0, associate it to a new track, and increase the number of tracks, K = K + 1. The birth move is always followed by an extension move: from the node y we select the extension direction, forward or backward, u.a.r. to extend the track τ_K. Similarly, in a death move we choose u.a.r. a track τ_k and delete it. The nodes belonging to the deleted track are added to the unassigned set of measurements τ_0.

Split/merge: In a split move, we select u.a.r. a track τ_k and a split point t_s, which is selected according to the normalized joint probability between two consecutive connected nodes in the track:

\big(1 - p(\tau_k(t_{s+1}) \mid \bar\tau_k(t_s))\big) \; \Big/ \; \sum_{i=1}^{|\tau_k|-1} \big(1 - p(\tau_k(t_{i+1}) \mid \bar\tau_k(t_i))\big),

and we split τ_k into two new tracks τ_{s1} = {τ(t_1), ..., τ(t_s)} and τ_{s2} = {τ(t_{s+1}), ..., τ(t_{|τ_k|})}.
Often, due to missing or erroneous detections, trajectories of objects are fragmented. The merge move provides the ability to link these fragmented sub-tracks according to their joint likelihood of appearance and motion and the interaction based on spatial overlap. The merge move operates on the candidate set of track pairs for which the start node of one track is a child node of the end node of the other track, defined by the set:

C^t_{merge} = \{ (\tau_{k_1}, \tau_{k_2}) : \tau_{k_2}(start) \in N^c(\tau_{k_1}(end)) \}.

We select u.a.r. a pair of tracks from C^t_{merge} and merge the two tracks into a new track τ_k = {τ_{k_1}} ∪ {τ_{k_2}}.

Switch: In a switch move, we probe the solution space for a better labeling of nodes that belong to multiple tracks. We consider the following candidate set of track pairs:

C^t_{switch} = \{ (\tau_{k_1}(t_p), \tau_{k_2}(t_q)) : \tau_{k_1}(t_p) \in N^p(\tau_{k_2}(t_{q+1})), \; \tau_{k_2}(t_q) \in N^p(\tau_{k_1}(t_{p+1})), \; k_1 \neq k_2 \}.

We select u.a.r. a candidate from C^t_{switch} and define two new tracks as:

\tau'_{k_1} = \{\tau_{k_1}(t_1), \ldots, \tau_{k_1}(t_p), \tau_{k_2}(t_{q+1}), \ldots, \tau_{k_2}(t_{|\tau_{k_2}|})\} and
\tau'_{k_2} = \{\tau_{k_2}(t_1), \ldots, \tau_{k_2}(t_q), \tau_{k_1}(t_{p+1}), \ldots, \tau_{k_1}(t_{|\tau_{k_1}|})\}.
Online processing: The complexity of local data association depends on the size of the observation graph. Moreover, the algorithm is performed in a deferred-logic manner: the decision is made when all observations in the graph are available. Thus, we implement the proposed association algorithm as an online algorithm within a sliding window that contains the latest 45 frames; only observations within this sliding window are stored in the observation graph. When the sliding window moves, the partition of the graph at the previous time is used as the initialization.
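For illustration, the two per-edge likelihood terms of Sect. 5.1.1 that drive the sampler can be sketched as follows (ours, simplified; it assumes the Kalman prediction quantities A, H and the innovation covariance are available, and follows Eqs. 24 and 25).

```python
import numpy as np

def motion_likelihood(z_t2, x_hat_t1, P_t2, A, H, dt):
    """Gaussian motion likelihood of one edge (cf. Eq. 24).

    z_t2     : observation [l_x, l_y, w, h] at time t2
    x_hat_t1 : posterior state estimate of the track at time t1
    P_t2     : innovation covariance predicted by the Kalman filter at t2
    A, H     : transition and observation matrices; dt = t2 - t1 (integer frames)
    """
    e = z_t2 - H @ np.linalg.matrix_power(A, dt) @ x_hat_t1
    norm = (2.0 * np.pi) ** 2 * np.sqrt(np.linalg.det(P_t2))
    return float(np.exp(-0.5 * e @ np.linalg.solve(P_t2, e)) / norm)

def appearance_likelihood(p_hist, q_hist, eps=1e-8):
    """Appearance likelihood from the symmetric KL distance between
    normalized RGB histograms (cf. Eq. 25)."""
    p = np.asarray(p_hist, dtype=float) + eps
    q = np.asarray(q_hist, dtype=float) + eps
    sym_kl = 0.5 * np.sum((p - q) * np.log(p / q))
    return float(np.exp(-sym_kl))
```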
Fig. 11 Computing spatio-temporal consistency between two tracklets, derived from the concept of casting vote in tensor voting
5.2 Global tracklets association

Although the merge/split operations in local data association can deal with missed detections, they only consider observations within a short time span. Some situations, such as long occlusions, may cause the tracker to lose a target's identity. Increasing the size of the sliding window cannot always solve the problem and increases the complexity. Thus, we introduce global data association to associate tracklets and maintain track identities.

5.2.1 Spatio-temporal consistency

First, we define the consistency of the temporal and spatial relationship between tracklets. Given two tracklets τ_1 and τ_2, which start at times s_1, s_2 and terminate at times t_1, t_2: if the condition s_1 ≥ t_2 or s_2 ≥ t_1 holds, the two tracklets are temporally consistent. For two temporally consistent tracklets τ_1 and τ_2, say s_2 ≥ t_1, the terminating position and velocity of τ_1 on the global map are P_{t_1} and V_{t_1}, and the starting position and velocity of τ_2 on the global map are P_{s_2} and V_{s_2}. If ||P_{t_1} − P_{s_2}|| ≤ v_max × (s_2 − t_1) and ||V_{t_1} − V_{s_2}|| ≤ a_max × (s_2 − t_1), the two are spatially consistent as well, where v_max and a_max represent the maximum speed and acceleration of objects on the map. For all spatio-temporally consistent candidates, we use a vote-casting method inspired by Tensor Voting [30] to calculate the spatio-temporal consistency between tracklets, as shown in Fig. 11. Let O denote one end of a tracklet and N denote its normal in 2D space. We want to compute its consistency with another end P from a different tracklet. The consistency should consider both orientation and strength. As can be seen in Fig. 11, the ideal orientation N_{O→P} (gray arrow starting from P) is given by drawing a circle whose center C lies on the line of N and which passes through both O and P while preserving the normal N. The ideal orientation ensures the smoothest connection between the two ends O and P. The actual normal at P is N_P. The consistency between O and P is computed by the following function:

S(O, P) = \exp\left( -|s|^2 - c k^2 \right) \left( N_{O \to P} \cdot N_P \right),    (27)
where |s| is the arc length, k is the curvature, and c is the decay rate. Note that, besides the introduction of the dot product, the scale σ of Tensor Voting's decay function [30] is gone, since there is no concept of neighbors, i.e., any two ends from consistent candidate tracklets can be associated. Tracklets that are not consistent have zero spatio-temporal consistency.

5.2.2 Tracklet descriptor

To associate the temporally and spatially consistent tracklets, we adopt the appearance model proposed in [13]. This descriptor is invariant to 2D rotation and scale change, and tolerates small shape variations. Instead of applying this descriptor to a single image blob, we use the descriptor on a tracklet, which contains a sequence of image blobs. For each detected moving blob within a tracklet, the reference circle is defined as the smallest circle containing the blob. The reference circle is delineated as the 6 bin images in 8 directions depicted in Fig. 12. For each bin i, a Gaussian color model is built on all the pixels located in bin i, for all eight directions and for all image blobs within the tracklet. Thus, the color model for each tracklet is defined as a 6D vector obtained by summing the contribution of each bin image in all 8 directions and for all image blobs. We can similarly encode the shape properties of each blob by using the distribution of the number of edge pixels within each bin, namely a normalized vector [E_1(τ), E_2(τ), ..., E_6(τ)].

Fig. 12 Appearance descriptor of tracklets. a Appearance color model; b appearance shape model

The appearance likelihood between two compatible tracklets can be defined as:

p_{app}(\tau_i, \tau_j) = \exp\left( -\lambda \left( d_{color}(\tau_i, \tau_j) + d_{edge}(\tau_i, \tau_j) \right) \right),    (28)

where τ_i, τ_j are the tracklets on which the appearance probability model is defined. The appearance distance between two compatible tracklets is computed using the Kullback–Leibler (KL) divergence. For the color descriptor, since each bin is modeled by a Gaussian, the KL distance reduces to:

d_{color}(\tau_i, \tau_j) = \frac{1}{2N} \sum_{n=1}^{N} \left[ \left( \frac{1}{\sigma_j^2} + \frac{1}{\sigma_i^2} \right) (\mu_i - \mu_j)^2 + \frac{\sigma_j^2}{\sigma_i^2} + \frac{\sigma_i^2}{\sigma_j^2} - 2 \right],    (29)

where μ_i, μ_j, σ_i and σ_j are the parameters of the color Gaussian models. For the edge descriptor we use the following similarity measure:

d_{edge}(\tau_i, \tau_j) = \frac{1}{2} \sum_{r=1}^{6} (E_r(\tau_i) - E_r(\tau_j)) \log \frac{E_r(\tau_i)}{E_r(\tau_j)}.    (30)
Due to the existence of both target motion and camera motion, a target's orientation can be quite different in different tracklets; thus the rotation-invariant property of the descriptor is quite important for our tracklet association. After calculating the motion and appearance consistency between each pair of tracklets, we use the Hungarian algorithm [18] to find the best associations. Suppose there are n tracklets; the similarity matrix A_{2n×2n} used in the Hungarian algorithm is a matrix of size 2n × 2n. A_{(1,...,n)×(1,...,n)}, except for its diagonal elements, contains the similarity between any pair of tracklets; the diagonal of A_{(n+1,...,2n)×(1,...,n)} stores the termination probability; and the diagonal of A_{(1,...,n)×(n+1,...,2n)} stores the birth probability. All the other elements in A are zero. By expanding the similarity matrix, we can deal with the cases where new tracks emerge and existing tracks terminate.
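The final assignment step can be sketched with SciPy's Hungarian solver. The code below is our illustration (not the paper's implementation) of the expanded 2n × 2n similarity matrix described above; `consistency` is a hypothetical n × n matrix combining the spatio-temporal term (Eq. 27) and the appearance term (Eq. 28).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate_tracklets(consistency, birth=0.1, termination=0.1):
    """Link n tracklets with the Hungarian algorithm on a 2n x 2n matrix.

    consistency : n x n array; consistency[i, j] scores linking the end of
                  tracklet i to the start of tracklet j (diagonal ignored)
    """
    n = consistency.shape[0]
    sim = np.zeros((2 * n, 2 * n))
    sim[:n, :n] = consistency
    np.fill_diagonal(sim[:n, :n], 0.0)       # a tracklet cannot continue itself
    for i in range(n):
        sim[n + i, i] = termination          # tracklet i may terminate
        sim[i, n + i] = birth                # tracklet i may start a new track

    # The Hungarian algorithm minimizes cost, so negate the similarities
    rows, cols = linear_sum_assignment(-sim)
    return [(i, j) for i, j in zip(rows, cols)
            if i < n and j < n and sim[i, j] > 0]   # (i, j): j continues i
```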
6 Experimental results We have shown partial experiments on geo-registration and background modeling above. Here, we report the overall time performance in Table 1. The geo-registration and motion detection modules, which were the bottlenecks of the whole system, are implemented on the GPU.
Table 1 Overall time performance of the streamlined system

Procedure (resolution = 320 × 240)                                      Average time (s)
Image registration                                                      ~0.25
Geo-registration (GPU)                                                  ~30 every 25 frames
Motion detection and segmentation in a 91-frame sliding window (GPU)    ~0.22
Tracking                                                                ~0.2
Total                                                                   ~1.9
Fig. 13 Comparison with and without geo-registration. a Tracklet snapshot at frame 80; b tracklet snapshot at frame 950; c trajectory of tracklets without geo-registration; d trajectory of tracklets with geo-registration
This improves the overall time performance of the system. We show tracking results on the following two UAV sequences. Using the longitude and latitude information that comes with the image sequences, the map is acquired from Google Earth. The homography between the first frame and the map, H_{0,M}, is manually computed offline. Figure 13 shows the tracking result on a sequence with one moving object. Considering the computational cost, the geo-registration refinement with the map is performed every 50 frames. Figure 13c, d display the tracking result on the map. The trajectory of tracklets in Fig. 13c is generated using the initial homography between the UAV image and the map, without refinement. Figure 13d is generated using our geo-registration. It is clear that the trajectories of tracklets without geo-registration fall outside the road boundary. Since the target is fully occluded by the shadows of trees, the trajectory of the single target breaks into tracklets. In real scenarios, moving shadows may affect the target's appearance. We apply the deterministic non-model-based method [26], working in HSV space, to remove strong moving shadows. Figure 14 shows tracking results on a sequence with multiple moving targets. Again, when targets are occluded by shadows, local data association may lose the track identification and thus tracklets are formed. The missed detections caused by occlusion can even last longer than the sliding window of local data association (45 frames). However, in global data association, the tracklets are associated with the correct ID throughout the video. The different tracks are listed along the Z direction in different colors. Figure 14b–e shows the
beginning frame of the tracklets of the red truck. Although the appearance of the white van and the white SUV in Fig. 14b is quite similar, the temporal and spatial constraints on the global map prevent them from being associated together.
7 Conclusions and future work We have proposed a framework to detect and track moving objects from airborne platforms for surveillance purposes. Our efforts are twofold: robustness and real-time performance. To improve robustness, we adopt geo-registration with a global map, which provides reference coordinates to geo-locate targets with physical meaning. We have proposed a two-step (local and global) data association algorithm. The local data association, which provides the ability to form tracklets, takes care of short-term occlusion and missed detections. Then, in geo-coordinates, association between the tracklets produced by the local data association algorithm is achieved using spatio-temporal consistency and similarity of appearance. Experiments show that the local and global association can maintain track IDs across long-term occlusion. To improve the time performance, we have implemented the two most time-consuming modules (geo-registration and dynamic background modeling) on the GPU platform. Experiments show that the GPU implementation achieves significant speed-up against the CPU implementation. In the future, we will continue our investigation in the following directions:
Fig. 14 Tracklets and tracks obtained using the local and global data association framework. a The tracking results with geo-mosaicing the UAV images on the satellite image; b, c, d, e 2D tracking snapshots
1. Scene-aware airborne surveillance: Thanks to geo-registration, we have the ability to make use of geographic information system (GIS) information, such as road networks, infrastructure, etc., to assist our tracking system. This information is very appropriate for tracklet association at the global level.
2. Motion pattern analysis: Motion segmentation or object detection may fail in very crowded scenarios. Instead of tracking each individual object in the scene, we can first analyze the motion patterns that are caused by all objects. This knowledge can provide several benefits: (1) motion patterns define global motion characteristics that can be used as motion priors or constraints with which all objects within the pattern comply; (2) motion patterns can be used to remove false alarms caused by motion segmentation or parallax; (3) motion patterns of a single vehicle or of multiple vehicles define the concept of "road" even without the road network from GIS. Motion patterns can also be used directly for querying a single vehicle or a group of vehicles that undergo a particular motion, without the need to track each individual vehicle.
3. Further GPU (CUDA) implementation of motion segmentation and object detection: We plan to port the entire detection module onto the GPU to achieve real-time performance.
4. Geo-registration failure detection: Our current approach to geo-registering UAV images to a map relies on RANSAC for robust estimation from the established UAV-to-map correspondences. There are cases where the number of inliers is too small for RANSAC to produce accurate results, which leads to grossly wrong alignments that are not recoverable. We will work on detecting such failures so that a further correction process can be applied to avoid successive failures.
Acknowledgments This work was supported in part by MURI ARO W911NF-06-1-0094.
References 1. Ali, S., Shah, M.: Cocoa—tracking in aerial imagery. In: SPIE (2006) 2. Babenko, P., Shah, M.: Mingpu: A minimum gpu library for computer vision. http://server.cs.ucf.edu/~vision/MinGPU/ 3. Bar-Shalom, Y., Fortmann, T., Scheffe, M.: Joint probabilistic data association for multiple targets in clutter. In: Proceedings of Conference on Information Sciences and Systems (1980) 4. Brown, M., Lowe, D.G.: Recognizing panoramas. In: ICCV’03. Proceedings of Ninth IEEE International Conference on Computer Vision, pp. 1218–1225 (2003)
5. Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24(6), 381–395 (1981) 6. Griesser, A., et al.: Real-time, gpu-based foreground-background segmentation. In: Vision, Modeling, and Visualization, pp. 319–326 (2005) 7. Hanna, K., Sawhney, H., Kumar, R., Guo, Y., Samarasekara, S.: Annotation of video by alignment to reference imagery. In: ICCV'99, pp. 253–264 (1999) 8. Huang, X., Sun, Y., Metaxas, D., Sauer, F., Xu, C.: Hybrid image registration based on configural matching of scale-invariant salient region features. In: CVPRW '04: Proceedings of the 2004 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'04), vol. 11, pp. 167 9. Irani, M., Anandan, P., Hsu, S.: Mosaic based representations of video sequences. In: ICCV, pp. 605–611 (1995) 10. Fung, J., Mann, S.: OpenVidia: parallel GPU computer vision. In: Proceedings of the ACM Multimedia 2005, pp. 849–852 (2005) 11. Javed, O., Shafique, K., Shah, M.: A hierarchical approach to robust background subtraction using color and gradient information. In: Proceedings of the Workshop on Motion and Video Computing, pp. 23–28 (2002) 12. Kang, J., Cohen, I., Medioni, G.: Continuous tracking within and across camera streams. In: CVPR, vol. 1, pp. 267–272 (2003) 13. Kang, J., Cohen, I., Medioni, G.: Object reacquisition using invariant appearance model. In: ICPR, pp. 759–762 (2004) 14. Kaucic, R., Perera, A.G.A., Brooksby, G., Kaufhold, J., Hoogs, A.: A unified framework for tracking through occlusions and across sensor gaps. In: CVPR, pp. 990–997 (2005) 15. Khan, Z., Balch, T., Dellaert, F.: MCMC-based particle filtering for tracking a variable number of interacting targets. IEEE PAMI 11, 1805–1918 (2005) 16. Kim, J., Kolmogorov, V., Zabih, R.: Visual correspondence using energy minimization and mutual information. In: ICCV'03, p. 1033 17. Kruger, J., Westermann, R.: Linear algebra operators for gpu implementation of numerical algorithms. In: International Conference on Computer Graphics and Interactive Techniques, pp. 908–916 (2003) 18. Kuhn, H.W.: The hungarian method for the assignment problem. Naval Res. Logist. Q. 2, 83–97 (1955) 19. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004)
20. Mann, S., Picard, R.W.: Video orbits of the projective group: a simple approach to featureless estimation of parameters. IEEE Trans. Image Process. 6, 1281–1295 (1997) 21. Medioni, G.: Matching of a map with an aerial image. In: Proceedings of the 6th International Conference on Pattern Recognition, pp. 517–519 (1982) 22. Meihe, X., Srinivasan, R., Nowinski, W.L.: A fast mutual information method for multi-modal registration. In: Information Processing in Medical Imaging, pp. 466–471 (1999) 23. NVIDIA CUDA Programming Guide 1.1. (2007) 24. Oh, S., Russell, S., Sastry, S.: Markov chain monte carlo data association for general multiple-target tracking problems. In: Proceedings of the 43rd IEEE Conference on Decision and Control, pp. 735–742 (2004) 25. Labatut, P., Keriven, R., Pons, J.-P.: A gpu implementation of level set multiview stereo. In: International Conference on Computational Science, pp. 212–219 (2006) 26. Prati, A., Mikic, I., Trivedi, M.M., Cucchiara, R.: Detecting moving shadows: algorithms and evaluation. PAMI 25(7), 918–923 (2003) 27. Sinha, S.N., Frahm, J.-M., Pollefeys, M., Genc, Y.: Gpu-based video feature tracking and matching. Technical report. Department of Computer Science, UNC, Chapel Hill (2006) 28. Smith, K., Gatica-Perez, D., Odobez, J.-M.: Using particles to track varying numbers of interacting people. In: CVPR, pp. 962–969 (2005) 29. Studholme, C., Hill, D.L.G., Hawkes, D.J.: An overlap invariant entropy measure of 3d medical image alignment. Pattern Recognit. 32(1), 71–86 (1999) 30. Tang, C., Medioni, G., Lee, M.: N-dimensional tensor voting, application to epipolar geometry estimation. In: PAMI, pp. 829–844 (2001) 31. Viola, P., Wells, W.M. III: Alignment by maximization of mutual information. Int. J. Comput. Vis. 24(2), 137–154 (1997) 32. Yalcin, H., Collins, R., Hebert, M.: Background estimation under rapid gain change in thermal imagery. In: Object Tracking and Classification in and Beyond the Visible Spectrum (2005) 33. Yang, R., Pollefeys, M.: Multi-resolution real-time stereo on commodity graphics hardware. In: CVPR, pp. 211–217 (2003) 34. Yin, Z., Collins, R.: Moving object localization in thermal imagery by forward–backward MHI. In: CVPR Workshop on Object Tracking and Classification in and Beyond the Visible Spectrum (OTCBVS) (2006)