Synchronization of video sequences from free-moving cameras

Joan Serrat, Ferran Diego, Felipe Lumbreras and José Manuel Álvarez
Computer Vision Center & Computer Science Dept.
Edifici O, Universitat Autònoma de Barcelona, 08193 Cerdanyola, Spain
Abstract. We present a new method for the synchronization of a pair of video sequences and the spatial registration of all the temporally corresponding frames. This is a mandatory step for the pixel-wise comparison of a pair of videos. Several proposals for video matching can be found in the literature, with a variety of applications like object detection, visual sensor fusion, high dynamic range imaging and action recognition. The main contribution of our method is that it is free from three common restrictions assumed in previous works. First, it does not impose any condition on the relative position of the two cameras, since they can move freely. Second, it does not assume a parametric temporal mapping relating the time stamps of the two videos, like a constant or linear time shift. Third, it does not rely on the complete trajectories of image features (points or lines) along time, something difficult to obtain automatically in general. We present our results in the context of the comparison of videos captured from a camera mounted on moving vehicles.
1 Introduction
Image matching or registration has received considerable attention for many years and is still an active subject for its role in segmentation (background subtraction), recognition, sensor fusion, construction of panoramic mosaics, motion estimation, etc. Video matching shares a great deal of potential applications with still image matching. It requires simultaneous alignment in the temporal and spatial dimensions. Temporal alignment or synchronization means finding a mapping from the time domain of the first sequence to that of the second, such that corresponding frame pairs, one from each sequence, show 'similar content'. The simplest notion of similar content is that a warping can be found which spatially aligns one frame with the other, to the extent that they can be compared pixel-wise. But this notion is not unique, as we will discuss. Several solutions to the problem of video synchronization can be found in the literature. Here we briefly review those we consider the most significant. This puts our work into context, but is also relevant because, under the generic label of temporal alignment, these methods try to solve rather different problems. The distinction is based on the assumptions made by each method. For instance, some methods
[1,2,3,4,5,6] assume the temporal correspondence to be a simple constant time offset, c(t1) = β, or linear [7,8], c(t1) = αt1 + β, the latter due to the different frame rates of the two cameras, whereas others [9,10] let it be of free form. More importantly, some methods [1,2,3,7,4,5] are tailored to videos acquired simultaneously, so that they show exactly the same motion or the relative position and orientation of the two cameras is kept constant. Others [10,9,8,6], instead, can also deal with sequences recorded at different times, showing slightly different object motions, like the same action performed by different people. A few works [9,4] address the case of freely moving cameras, where no fixed geometric relationship exists between them.

Each method needs some input data, which can be more or less difficult to obtain and thus hamper its practical applicability. For instance, feature-based methods require tracking one or more characteristic points along the two whole sequences [7,3,10,5,6], or points and lines in three sequences [4]. In contrast, direct methods are those built just on the image intensity values [7,9,8]. What's more, some methods need to estimate quantities for which no very robust techniques exist, like the fundamental matrix [2,6] and the trifocal tensor [4]. Concerning the basis of these methods, most of them rely on the existence of a geometric entity which somehow constrains the relationship between the coordinate systems of two frames if they are corresponding: an affine transform [8], a plane-induced homography [1,7], the fundamental matrix [2,6], the trifocal tensor [4], or a rank-deficiency condition on a matrix made of the complete trajectories of points tracked along a whole sequence [10,3,5]. This allows either formulating some minimization over the time correspondence parameters (e.g. α, β), performing an algorithmic search for them, or at least directly looking for pairs of corresponding times. A few methods, in our opinion more realistic from the point of view of practical applicability, are based on the image intensities instead of point trajectories [7,9,8].

Our goal is to synchronize videos recorded simultaneously or at different times, which can thus differ in intensity and even in content, i.e., show different objects or actions (motion), to some extent. The videos can be recorded by a pair of freely moving cameras, but their motion is not completely free. For the video matching to be possible, there must be some overlap in the fields of view of the two cameras when they are at the same or close positions. Thus, we require that they follow approximately coincident trajectories and, more importantly, that the relative camera rotations between corresponding frames are not too large. Note that, even in this case, free motion precludes the use of a constant epipolar constraint. The scene is 3D: we do not impose the condition of planar or very distant scenes, so the constant homography constraint cannot be applied. Neither do we want to depend on error-free and complete feature trajectories, provided manually or by an ideal tracker. Finally, the time correspondence is of free form: either camera can stop while the other keeps moving.

Our work is most closely related to [9] in the sense of striving for generality and applicability. Beyond this, each of the steps is completely different. For instance, they do not adopt any explicit motion field model for corresponding
frames, as we do. Also, their frame matching measure is based on point (Harris corner) correspondences, computed with an EM-like algorithm plus a Kanade-Lucas-Tomasi local motion optimization. We believe this makes their method dependent on having a number of such characteristic points evenly distributed over the images, along the whole sequences, as shown in their results. In contrast, we are able to synchronize videos with a much sparser structure (e.g. night sequences).

We propose a method which replaces the aforementioned constraints on the coordinates of every pair of corresponding frames (provided by a certain fixed affine transform, homography, fundamental matrix or trifocal tensor) by a specific image motion field model (Sect. 2.1). Its five parameters can vary from pair to pair, due to the freely moving cameras assumption, but some dependencies exist among them which we enforce. For each candidate pair of frames, the estimation of these parameters allows us to compute a spatial alignment error (Sect. 2.2). Based on it, an efficient divide-and-conquer procedure searches for the corresponding frame in the second video for every frame in the first one (Sect. 2.3). We present some results in the context of a realistic and challenging application (Sect. 4). Imagine a car, equipped with a forward-facing camera, which repeatedly drives along the same track for building surveillance. We want to compare two videos recorded at different times, because differences are potential signs of intruders: office lights switched on or off, parked cars, etc., which have changed since the previous round. Finally, Sect. 5 draws the conclusions.
2 Method

2.1 A motion model for corresponding frames
Two frames, one from each video sequence, are corresponding if the cameras were at the same 3D location at the time they were recorded. Thus, ideally, only the camera orientation can vary; the relative orientation is expressed by a rotation matrix R. Let P1 = K1 [I | 0] and P2 = K2 [R | 0] be the projection matrices of the two cameras, with the reference coordinate system centered on the first camera. It can then be seen that the coordinates of the two frames are related by the homography H = K2 R K1^{-1}. We aim to define a simple, linearly parametrized model for the image coordinate difference (or motion vector) of two corresponding points, both in space and time. To this end, we state several simplifying, yet reasonable, assumptions:
1. The two cameras have the same intrinsic parameters, that is, K1 = K2 = K. Then H = K R K^{-1}, a conjugate rotation.
2. The principal point (the origin of the image coordinate system) is at the image center, and the focal lengths for the x and y axes are equal, fx = fy = f. Hence, K = diag(f, f, 1).
3. Let the rotation R be parametrized by the Euler angles Ωx, Ωy, Ωz (pitch, yaw and roll, respectively). If they are all small enough, R can be replaced by its first order approximation, R ≈ I + [(Ωx, Ωy, Ωz)]×. Accordingly,

\[
H \;\approx\; \begin{pmatrix} 1 & -\Omega_z & f\Omega_y \\ \Omega_z & 1 & -f\Omega_x \\ -\Omega_y/f & \Omega_x/f & 1 \end{pmatrix} \qquad (1)
\]

Note that the relationship between the coordinates of corresponding frames is linear, but in homogeneous coordinates. Thus, the motion vector of a point x from the first to the second frame is

\[
\mathbf{u}(\mathbf{x}) = \mathbf{x}_2 - \mathbf{x}_1 =
\begin{bmatrix} u(\mathbf{x}) \\ v(\mathbf{x}) \end{bmatrix} =
\frac{1}{H_3\mathbf{x}} \begin{bmatrix} (H_1 - xH_3)\,\mathbf{x} \\ (H_2 - yH_3)\,\mathbf{x} \end{bmatrix} \qquad (2)
\]

where Hi denotes the i-th row of H. Let us add a final assumption to obtain a linear dependence in non-homogeneous coordinates.
4. For f large enough (that is, a medium to narrow camera field of view), and since the rotation angles are small, H3 x = −xΩy/f + yΩx/f + 1 ≈ 1. Finally, we obtain a parametric motion field model which is called quadratic for its dependence on the terms x², y² [11], but linear with respect to its parameters pi:
\[
\mathbf{u}(\mathbf{x}; p) = X p =
\begin{bmatrix} 1 & y & x^2 & xy & 0 \\ 0 & -x & xy & y^2 & 1 \end{bmatrix}
\begin{bmatrix} p_1 \\ p_2 \\ p_3 \\ p_4 \\ p_5 \end{bmatrix},
\qquad
p = S\,\Omega =
\begin{bmatrix} 0 & f & 0 \\ 0 & 0 & -1 \\ 0 & 1/f & 0 \\ -1/f & 0 & 0 \\ -f & 0 & 0 \end{bmatrix}
\begin{bmatrix} \Omega_x \\ \Omega_y \\ \Omega_z \end{bmatrix}
\qquad (3)
\]
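To make the model concrete, here is a minimal sketch, in Python/NumPy, of how the quadratic motion field of Eq. (3) can be evaluated for a set of pixel coordinates, given small Euler angles and the focal length. It is our own illustration, not the authors' Matlab implementation; the function name motion_field and the point layout (one row per pixel, coordinates centred at the principal point) are assumptions of the example.

import numpy as np

def motion_field(points, omega, f):
    """Motion vectors u(x; p) = X(x) p of Eq. (3) for an (N, 2) array of pixel
    coordinates centred at the principal point, small Euler angles
    omega = (Omega_x, Omega_y, Omega_z) and focal length f (pixels)."""
    # p = S Omega, with S as in Eq. (3)
    S = np.array([[ 0.0,      f,       0.0],
                  [ 0.0,      0.0,    -1.0],
                  [ 0.0,      1.0 / f, 0.0],
                  [-1.0 / f,  0.0,     0.0],
                  [-f,        0.0,     0.0]])
    p = S @ np.asarray(omega, dtype=float)
    x, y = points[:, 0], points[:, 1]
    # Rows of X(x): (1, y, x^2, xy, 0) for the u component, (0, -x, xy, y^2, 1) for v
    Xu = np.stack([np.ones_like(x), y, x**2, x * y, np.zeros_like(x)], axis=1)
    Xv = np.stack([np.zeros_like(x), -x, x * y, y**2, np.ones_like(x)], axis=1)
    return np.stack([Xu @ p, Xv @ p], axis=1)  # (N, 2) array of (u, v) vectors

For instance, a pure yaw of 0.01 rad with f = 500 pixels yields a nearly uniform horizontal shift of about f·Ωy = 5 pixels, plus the small quadratic terms.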
2.2 Spatial frame matching
We need a measure of the spatial registration of a pair of frames, in order to choose the frame K in the second sequence that best matches a given frame J from the first one. To this end, we have devised the motion field model of Eq. (3), which parametrizes the motion field between frames if they are corresponding. Consequently, we need to estimate the parameters p that minimize some registration error measure and use its magnitude. We have chosen the sum of squared linearized differences (i.e., the linearized brightness constancy):

\[
\sum_{\mathbf{x}} \bigl(K(\mathbf{x}) - J(\mathbf{x} + \mathbf{u}(\mathbf{x}; p))\bigr)^2 \;\approx\; \sum_{\mathbf{x}} \bigl(K(\mathbf{x}) - J(\mathbf{x}) - \nabla J(\mathbf{x})^T X p\bigr)^2 \qquad (4)
\]
where ∇J(x) = (∂J/∂x(x), ∂J/∂y(x))^T is the spatial gradient of J. It has been widely used in the past in the context of image matching, for instance to build panoramic mosaics or to match neighbouring frames in sequences of planar scenes [12,13]. The reason for choosing this technique is that it does not depend on characteristic points/regions, that is, we do not require images to have a prominent structure (distinct objects well distributed over the image). In addition, we intend to synchronize sequences recorded at night, where often there is not much 'content'. The error is minimized by differentiating with respect to the unknown p and setting the result to zero. This leads to a system of five linear equations in the five unknowns,

\[
C\,p = b, \qquad C = \sum_{\mathbf{x}} X^T \nabla J(\mathbf{x})\, \nabla J(\mathbf{x})^T X, \qquad b = \sum_{\mathbf{x}} \bigl(K(\mathbf{x}) - J(\mathbf{x})\bigr)\, X^T \nabla J(\mathbf{x}) \qquad (5)
\]
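As an illustration, the normal equations (5) for a single pair of frames could be assembled and solved as in the following sketch. The coordinate centring follows assumption 2 of Sect. 2.1; the helper name estimate_p and the use of np.gradient for the spatial derivatives are our own choices, not the paper's implementation.

import numpy as np

def estimate_p(J, K):
    """One estimate of the five motion parameters p relating frames J and K
    (2-D float arrays of equal size), by solving C p = b of Eq. (5)."""
    h, w = J.shape
    Jy, Jx = np.gradient(J)                      # spatial gradient of J (rows, cols)
    ys, xs = np.mgrid[0:h, 0:w]
    x = (xs - w / 2.0).ravel()                   # coordinates centred at the
    y = (ys - h / 2.0).ravel()                   # principal point (assumption 2)
    gx, gy = Jx.ravel(), Jy.ravel()
    # The 5-vector grad(J)(x)^T X for every pixel, one row per pixel
    GX = np.stack([gx,
                   gx * y - gy * x,
                   gx * x**2 + gy * x * y,
                   gx * x * y + gy * y**2,
                   gy], axis=1)
    r = (K - J).ravel()                          # temporal difference K(x) - J(x)
    C = GX.T @ GX                                # 5 x 5 matrix of Eq. (5)
    b = GX.T @ r
    return np.linalg.solve(C, b)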
In practice, we cannot directly solve for p because the first order approximation of Eq. (4) holds only if the motion field u(x; p) is small. Instead, p is estimated successively in a coarse-to-fine manner. A Gaussian pyramid is built for both J and K and, at each resolution level, p is re-estimated from the value obtained at the previous level. This means that K is successively warped towards J. In addition, at each pyramid level, several iterations of this process are performed. For a detailed description we refer the reader to [12,13,7].
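A possible coarse-to-fine loop around the previous estimate_p sketch is shown below. It only approximates the scheme described above: the Gaussian pyramid is emulated with scipy.ndimage.zoom, the frame J (rather than K) is warped with the current estimate, and the parameter update is additive; all of these are simplifying assumptions of the example.

import numpy as np
from scipy.ndimage import map_coordinates, zoom

def warp(I, p):
    """Resample I at x + u(x; p), i.e. apply the motion model of Eq. (3) to I."""
    h, w = I.shape
    ys, xs = np.mgrid[0:h, 0:w]
    x, y = xs - w / 2.0, ys - h / 2.0
    u = p[0] + p[1] * y + p[2] * x**2 + p[3] * x * y
    v = -p[1] * x + p[2] * x * y + p[3] * y**2 + p[4]
    return map_coordinates(I, [ys + v, xs + u], order=1, mode='nearest')

def register(J, K, levels=3, iters=5):
    """Coarse-to-fine estimation of the motion parameters p between J and K."""
    J, K = J.astype(float), K.astype(float)
    p = np.zeros(5)
    for lev in reversed(range(levels)):          # coarsest resolution first
        s = 2.0 ** (-lev)
        Jl, Kl = zoom(J, s), zoom(K, s)
        scale = np.array([s, 1.0, 1.0 / s, 1.0 / s, s])
        pl = p * scale                           # parameters at this resolution
        for _ in range(iters):
            pl += estimate_p(warp(Jl, pl), Kl)   # re-estimate on the warped frame
        p = pl / scale                           # back to full-resolution units
    return p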
2.3 Correspondence finding
In the previous section we have explained how to assess the matching between a pair of frames. But how can it be used to determine the correspondence c from the first to the second sequence? Obviously, the brute force approach of an exhaustive test of all possible pairs is infeasible, since our target videos may have hundreds to thousands of frames. A less costly alternative, trying a fixed number of frames in the second sequence for every frame in the first one, still misses the chance to cut down the number of comparisons, as we will see. We propose for this task a divide-and-conquer procedure. Let S1 and S2 be two sequences, m and n frames long, respectively. We impose the condition that the first sequence is contained within the second one, that is, the first and last frames of S1 have some corresponding frame within S2. The mapping c must be defined for all the time instants t1 = 1 . . . m and be monotonically increasing, c(t1 + 1) − c(t1) ≥ 0. Equality accounts for the fact that the first camera may move slower than the second one, or even stop while the other keeps moving. The reverse case is c(t1 + 1) > c(t1) + 1. Suppose that, somehow, we decide frames S1(t1) and S2(t2) are corresponding. Then necessarily, 1 ≤ t ≤ t1 ⇒ 1 ≤ c(t) ≤ t2 and t1 ≤ t ≤ m ⇒ t2 ≤ c(t) ≤ n. This means that each time we augment c with a pair of corresponding time instants, the set of possible pairs may be strongly reduced. Consider the particular case that each camera was moving at a (possibly different) constant speed and we already know the corresponding frames of t1 = 1, m. Then c would be the line t2 = c(1) + t1 (c(m) − c(1))/m (Fig. 1a). The largest possible reduction, to a half, is achieved by looking for the correspondence of t1 = m/2. In Fig. 1a the possible correspondences prior to this decision lie within the lighter rectangle, and after it within the two darker ones. Based on it, the procedure for correspondence finding, illustrated by Fig. 1b, is:
Fig. 1. Divide–and–conquer correspondence search, see text.
1. Set a maximum time offset ∆T (the height of the thin bars in Fig. 1b).
2. For t1 = 1 try t2 = 1 . . . ∆T/2 and for t1 = m try t2 = n − ∆T/2 . . . n, and choose in each case the frame of minimum error as c(1) and c(m), respectively.
3. Look for the frame corresponding to t1 = m/2: first interpolate a line l1,m(t) between (1, c(1)) and (m, c(m)), and then try t2 = max{c(1), l1,m(m/2) − ∆T/2} . . . min{c(m), l1,m(m/2) + ∆T/2}, taking the time of minimum error as c(m/2).
4. Repeat step 3 for the two resulting intervals (now [1, m/2] and [m/2, m]) if the interval length is greater than 2 (a compact sketch of the procedure is given below).

Fig. 1b shows the intervals and their bounds for the first two subdivisions, with darker shading for the later one. Fig. 1c illustrates a real case: vertical bars represent the tried pairs, with ∆T = 120. In another experiment, for m = 521 and n = 421 frames and ∆T = 200, the number of evaluated pairs was only 3423.
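The divide-and-conquer search can be sketched as follows. The alignment_error(t1, t2) callback stands for the spatial registration error of Sect. 2.2; the 0-based indexing, the function names and the recursion structure are illustrative assumptions of the example.

def find_correspondence(alignment_error, m, n, dT):
    """Return c(t1) for t1 = 0..m-1 (0-based), where alignment_error(t1, t2)
    is the registration error between frames S1(t1) and S2(t2)."""
    def best(t1, lo, hi):
        # exhaustively try t2 in [lo, hi] and keep the frame of minimum error
        return min(range(lo, hi + 1), key=lambda t2: alignment_error(t1, t2))

    # Step 2: anchor the first and last frames of S1 within S2
    c = {0: best(0, 0, min(dT // 2, n - 1)),
         m - 1: best(m - 1, max(0, n - 1 - dT // 2), n - 1)}

    def split(a, b):
        if b - a <= 1:
            return
        mid = (a + b) // 2
        # Step 3: interpolate a line between the two known correspondences
        guess = c[a] + (mid - a) * (c[b] - c[a]) / float(b - a)
        lo = max(c[a], int(round(guess)) - dT // 2)
        hi = min(c[b], int(round(guess)) + dT // 2)
        c[mid] = best(mid, lo, hi)
        split(a, mid)                            # Step 4: recurse on both halves
        split(mid, b)

    split(0, m - 1)
    return [c[t1] for t1 in sorted(c)]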
3 Efficiency
Efficiency cannot be an afterthought in this problem. For the method to be of practical use, it must be able to synchronize videos of hundreds or thousands of frames in a reasonable time. We briefly report two ways to improve efficiency without an important loss of precision. The first is to speed up the spatial registration of a pair of frames. This can be done by not iterating the linear system of Eq. (5) at the lowest level of the image pyramid, that is, at maximum resolution. This achieves a gain because at each iteration one of the images must be warped according to the newly computed parameters p. The second way is to reduce the number of frame pairs to match by sampling the temporal dimension of the first video: instead of finding the correspondence for each frame in the first video, do it just for every tenth frame, for instance, and interpolate the correspondence and parameters for the frames in between. Just to provide some specific figures, with our current Matlab implementation two videos of 720×288 pixels/frame, around 720 frames each, were synchronized in 3 hours 45 min. With the former two approximations the computation time was reduced to 24 min.
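The second speed-up could be implemented along these lines: the correspondence and the five motion parameters are computed only on a temporal subsample of the first video and linearly interpolated for the frames in between. The function name densify and the use of np.interp are our own assumptions.

import numpy as np

def densify(t_sampled, c_sampled, p_sampled, m):
    """Interpolate the correspondence c and the motion parameters p for every
    frame t1 = 0..m-1 from values computed on a temporal subsample.
    t_sampled: sampled frame indices, c_sampled: their correspondences,
    p_sampled: their motion parameters, one row of five values per sample."""
    t_all = np.arange(m)
    c_all = np.interp(t_all, t_sampled, c_sampled)
    p_all = np.column_stack([np.interp(t_all, t_sampled, p_sampled[:, k])
                             for k in range(5)])
    return np.rint(c_all).astype(int), p_all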
Fig. 2. Registration of corresponding frames. From left to right: frame of the first video, warped frame of the second video, and contrast-inverted difference.
4 Results
The application motivating this research was to compare videos recorded at night from a moving vehicle with a forward-facing camera. This may be a complement to the surveillance of parking lots, warehouses or widespread facilities. We have successfully synchronized relatively long parts of day and night videos (hundreds of frames), even with significant differences in content. The main limitation of our method is its difficulty in dealing with large initial misalignments due to significant camera translation and/or relative rotation, which may give rise to synchronization errors. The reason is that, to deal with this situation, we rely only on the hierarchical estimation of the motion parameters. Another source of synchronization errors is the hard (irreversible) decisions of the correspondence finding algorithm: wrong correspondences introduce errors because they set a bound for the following correspondences to be found, and the sooner they are computed, the wider their influence. Fig. 2 shows some examples from which small differences could be detected by subtraction. Still images, however, are a poor reflection of the synchronized videos, which can be viewed at www.cvc.uab.es/adas/projects/sincro/IbPRIA07.html.
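For completeness, the comparison stage itself can be sketched as follows: the corresponding frame of the second video is warped onto a frame of the first (reusing the warp helper of Sect. 2.2) and the absolute difference is thresholded to flag potential changes. The threshold value is an arbitrary assumption of this example, not a parameter of the method.

import numpy as np

def change_map(J, K_corr, p, threshold=30.0):
    """Flag pixels whose intensity differs notably between a frame J of the
    first video and its spatially aligned corresponding frame of the second."""
    Kw = warp(K_corr.astype(float), p)       # spatially align K to J (Sect. 2.2)
    diff = np.abs(J.astype(float) - Kw)      # pixel-wise difference of aligned frames
    return diff > threshold                  # potential changes (lights, parked cars, ...)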
5 Conclusions
We have presented a new method for video synchronization, which includes spatial in addition to temporal registration. Compared to most of the previous works, we try to solve a less constrained version of this problem: free camera motion and no need for tracked features or geometric entities that are difficult to estimate, like the fundamental matrix. Efficiency is an issue in this problem, which we address through the correspondence search procedure, the interpolation of the correspondence and motion parameters along time, and the fast registration of frame pairs. The main limitation of our method is the registration errors due to large misalignments. In spite of this, it can synchronize many sequences without paying special care to the camera motion (driving speed and style), including night sequences where the structure is sparse.

Acknowledgments. This research has been partially funded by grant TRA200406702/AUT of the Spanish Ministerio de Educación y Ciencia.
References

1. Stein, G.: Tracking from multiple view points: self-calibration of space and time. In: Proc. DARPA Image Understanding Workshop (1998) 521–527
2. Carceroni, R., Pádua, F., Santos, G., Kutulakos, K.: Linear sequence-to-sequence alignment. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Washington DC (2004) 746–753
3. Tresadern, P., Reid, I.: Synchronizing image sequences of non-rigid objects. In: British Machine Vision Conf., Norwich, UK (2003) 629–638
4. Lei, C., Yang, Y.: Trifocal tensor-based multiple video synchronization with subframe optimization. IEEE Trans. Image Processing 15(9) (2006) 2473–2480
5. Wolf, L., Zomet, A.: Wide baseline matching between unsynchronized video sequences. Int. Journal of Computer Vision 68(1) (2006) 43–52
6. Tuytelaars, T., Van Gool, L.: Synchronizing video sequences. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition. Volume 1, Washington DC (2004) 762–768
7. Caspi, Y., Irani, M.: Spatio-temporal alignment of sequences. IEEE Trans. Pattern Analysis and Machine Intelligence 24(11) (2002) 1409–1424
8. Ukrainitz, Y., Irani, M.: Aligning sequences and actions by maximizing space-time correlations. In: Proc. European Conf. on Computer Vision, Graz, Austria (2006)
9. Sand, P., Teller, S.: Video matching. ACM Transactions on Graphics (Proc. SIGGRAPH) 22(3) (2004) 592–599
10. Rao, C., Gritai, A., Shah, M., et al.: View-invariant alignment and matching of video sequences. In: Proc. IEEE Int. Conf. Computer Vision, Nice, France (2003) 939–945
11. Irani, M.: Multi-frame correspondence estimation using subspace constraints. Int. Journal of Computer Vision 48(3) (2002) 173–194
12. Szeliski, R.: Image alignment and stitching: a tutorial. Technical Report MSR-TR-2004-92, Microsoft Research (2006)
13. Zelnik-Manor, L., Irani, M.: Multi-frame estimation of planar motion. IEEE Trans. Pattern Analysis and Machine Intelligence 22(10) (2000) 1105–1116