INCREASED ACCURACY ORIENTATION ESTIMATION FROM OMNIDIRECTIONAL IMAGES USING THE SPHERICAL FOURIER TRANSFORM

Timo Schairer, Benjamin Huhle, Wolfgang Straßer
University of Tübingen, WSI/GRIS

ABSTRACT

Orientation estimation based on image data is a key technique in many applications, and robust estimates are possible in the case of omnidirectional images. A very efficient technique is to solve the problem in Fourier space. In this paper we present a fast and simple method to overcome one of the main drawbacks of this approach, namely the large quantization steps. Due to high memory demands, the Fourier-based solution can be computed on low-resolution input only, and the resulting rotation estimate is given on an equiangular grid. We estimate the mode of the likelihood density based on the grid values in order to obtain a rotation estimate of increased accuracy. We show results on data captured with a spherical video camera and validate the approach by comparing the orientation estimates of the real data to ground-truth values.

Index Terms— image registration, omnidirectional vision, spherical Fourier transform, orientation estimation

1. INTRODUCTION

The problem of estimating the motion of a camera between a pair of images has been studied extensively over the last years. While the approaches differ greatly in the way translation and rotation are estimated, most algorithms have typically been developed for conventional perspective images and were later adapted to new image modalities, such as panoramic images. The larger information content inherently present in images with a large field of view (FOV) has made omnidirectional vision popular for different applications, e.g., in robot vision and scene acquisition. As Fermüller and Aloimonos [1] conclude, a large FOV on a spherical image may be optimal for the recovery of egomotion. In the field of 3DTV, omnidirectional vision can be expected to play an important role for background model acquisition (e.g., for setups like [2]) as well as in different semi-3D applications like image-based rendering or realistic image-based lighting in mixed-reality applications.

Typically, local salient features of the image, mostly points, are detected in the two images. Examining the correspondences between these features (e.g., SIFT) allows for an estimation of quite large camera motions. For an application on panoramic images see, e.g., [3]. On a global scale, the optical
flow can be computed to extract the motion parameters if the motions are sufficiently small (differential motions). Only recently, optical flow was measured at a single pair of antipodal points on the image sphere to obtain motion estimates [4].

The fact that panoramic images can be straightforwardly mapped onto the unit sphere allows for the use of spherical signal analysis. A method for rotation estimation directly from images defined on the sphere was presented by Kostelec and Rockmore [5]. Their approach is related to the method of estimating the relative translational movement between two planar images by the use of phase correlation in the Fourier domain (see, e.g., [6]). Since these techniques do not rely on correspondences of local image features, they are less affected by small changes in dynamic environments and, in contrast to optical-flow methods, perform well even if the images change significantly (i.e., in the case of large movements). Furthermore, the Fourier-based approach is fast to compute since a whole set of orientation hypotheses is evaluated in one step. However, since for spherical images the memory consumption scales cubically with the image resolution, the orientation estimate suffers from a significant quantization effect due to the fact that the orientation hypotheses are given on an equiangular grid. This grid structure is inherently determined by the Fourier-based approach. Methods to achieve sub-pixel accuracy have been proposed for the 2D problem of planar image registration (e.g., [7]), where typically a parametric function is fitted to the grid-based values. This solution does not directly extend to spherical images, where the grid-based estimate is given in the 3D space of Euler angles. Our approach is to estimate the mode of the likelihood density based on the grid values in order to obtain a rotation estimate of increased accuracy.

In the remainder of this paper we first review the Fourier-based orientation estimation in Section 2. In Section 3 we present our refinement method and discuss results on real image data in Section 4.

2. ROTATION ESTIMATION

2.1. Spherical Fourier Transform

Images captured with a catadioptric sensor or, as in our case, with a spherical camera system can be mapped onto a sphere given the intrinsic parameters of the camera system. Therefore,
these omnidirectional images can be considered a function f(θ, φ) = f(ω) on the 2-sphere, where θ denotes the colatitude angle in the range [0, π] and φ the azimuth defined in [0, 2π). Since the spherical harmonic functions Y^l_m form a complete orthonormal basis over the unit sphere, any square-integrable function f(ω) ∈ L²(S²) can be expanded as a linear combination of spherical harmonic functions:

f(\omega) = \sum_{l \in \mathbb{N}} \sum_{|m| \le l} \hat{f}^l_m Y^l_m(\omega).    (1)
Here, Y^l_m denotes a spherical harmonic function of degree l and order m and is given by

Y^l_m(\theta, \phi) = (-1)^m \sqrt{\frac{(2l+1)(l-m)!}{4\pi(l+m)!}} \, P^l_m(\cos\theta) \, e^{im\phi},    (2)
where \hat{f}^l_m ∈ ℂ are the complex expansion coefficients and P^l_m(cos θ) denote the associated Legendre polynomials. Since we are dealing with omnidirectional images, the spherical functions f(ω) exist on a uniformly sampled equiangular grid. Driscoll and Healy [8] proved that a perfect reconstruction from a 2B × 2B grid is possible when bandlimiting f(ω) to B. To summarize, by using the Spherical Fourier Transform (SFT), any spherical image of size 2B × 2B can be represented by (and perfectly reconstructed from) B² complex expansion coefficients \hat{f}^l_m.
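As a concrete illustration, the following minimal Python sketch evaluates the SFT coefficients by a direct discretized inner product. The function name, the grid convention, and the simple sin(θ) area weighting are our own assumptions for illustration; an exact implementation would use the Driscoll-Healy quadrature weights and the fast transforms of [8].

```python
import numpy as np
from scipy.special import sph_harm

def sft_coefficients(f, B):
    """Naive SFT of an image f sampled on a 2B x 2B equiangular grid
    (rows: colatitude theta, columns: azimuth phi). Returns a dict
    (l, m) -> complex coefficient for l < B, |m| <= l. A simple
    sin(theta) area weighting replaces the exact Driscoll-Healy
    quadrature here, purely for clarity."""
    n = 2 * B
    theta = np.pi * (np.arange(n) + 0.5) / n        # colatitudes in (0, pi)
    phi = 2.0 * np.pi * np.arange(n) / n            # azimuths in [0, 2*pi)
    Theta, Phi = np.meshgrid(theta, phi, indexing="ij")
    dA = np.sin(Theta) * (np.pi / n) * (2.0 * np.pi / n)  # per-sample area
    coeffs = {}
    for l in range(B):
        for m in range(-l, l + 1):
            # scipy's sph_harm signature: (order m, degree l, azimuth, colatitude)
            Ylm = sph_harm(m, l, Phi, Theta)
            coeffs[(l, m)] = np.sum(f * np.conj(Ylm) * dA)
    return coeffs
```

Note that this direct evaluation costs O(B⁴); it is meant only to make the inner-product structure of the transform explicit.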
2.2. Fourier Transform on the Rotation Group

Similar to the phase correlation method on planar images, Kostelec and Rockmore presented a method of rotation estimation directly from images defined on the sphere [5]. They showed that the correlation between two images g (signal) and h (pattern) is a function of rotations,

C(r) = \int_{S^2} g(\omega) \, \Lambda(r) h(\omega) \, d\omega,    (3)

that can be efficiently evaluated in the Fourier domain as well. Here, Λ(r) denotes the rotation operator given a rotation r = r(α, β, γ), where α, β, γ are the Euler angles (ZYZ) defining the rotation. Because the spherical harmonic functions Y^l_m form an orthonormal basis for the representations of SO(3), the SO(3) Fourier transform (SOFT) coefficients of the correlation of two spherical functions can be obtained directly by calculating the outer product of their individual SFT coefficients. Taking the inverse SOFT yields C(r) evaluated on the 2B × 2B × 2B grid of Euler angles, and its maximum value indicates the rotation separating the two images. The accuracy of the rotation estimation is directly related to the resolution of the likelihood grid, which in turn is specified by the number of bands used in the SFT. Given images of bandwidth B, the resolution of the likelihood grid implies an accuracy of up to ±(180/(2B+1))° in α and γ and ±(90/(2B+1))° in β.
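For intuition, the grid and its quantization bound can be made explicit in a few lines. The helper euler_grid is hypothetical, and the β sampling follows one common SOFT convention; only the error bound itself comes from the text above.

```python
import numpy as np

def euler_grid(B):
    """ZYZ Euler-angle grid of size 2B x 2B x 2B (one common SOFT
    sampling convention) and the worst-case quantization error:
    +/-180/(2B+1) degrees in alpha, gamma and +/-90/(2B+1) in beta."""
    n = 2 * B
    alpha = 2.0 * np.pi * np.arange(n) / n                  # alpha, gamma in [0, 2*pi)
    beta = np.pi * (2.0 * np.arange(n) + 1.0) / (2.0 * n)   # beta in (0, pi)
    err = (180.0 / (2 * B + 1), 90.0 / (2 * B + 1))         # degrees
    return alpha, beta, alpha.copy(), err

# B = 16 gives a 32^3 grid and roughly +/-5.45 deg in alpha/gamma and
# +/-2.73 deg in beta; B = 64 reduces this to about +/-1.40 / +/-0.70 deg.
print(euler_grid(16)[3])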
3. REFINING THE ROTATION ESTIMATION

The accuracy of the rotation estimation is severely restricted to smaller bandwidths by its memory requirements. The O(B³) SOFT coefficients and the likelihood grid have to be kept in memory, so a pixel-correct rotation estimation of spherical images is only feasible for small image resolutions (e.g., 512 × 512 pixels). While Makadia and Daniilidis [9] present a more efficient algorithm for the computation of the correlation function, this formulation is still dominated by the inverse SOFT.

Typically, the correct orientation is estimated by searching for the maximum in the correlation grid given by Eq. 3. Since this suffers from quantization errors, our approach is to refine the estimate using several grid points with their associated correlation measures. Given the correlation grid, our task is to estimate the maximum of the corresponding continuous likelihood function. Since the grid-based samples are given in the space of Euler angles, averaging across the grid boundaries is problematic. Therefore, we transform the samples to a unit quaternion representation. The resulting data space is better suited for interpolation and averaging operations [10]. Still, common interpolation techniques are not applicable in a straightforward manner. This is due to the fact that the samples lie on the unit sphere in 4D quaternion space. Additionally, the grid contains many samples: about 2 million for bandwidth B = 64 or 32,768 samples for B = 16. Therefore, a fast algorithm is essential for real-time applications.

As our experiments show, a simple heuristic allows us to handle the refinement problem. We assume that the grid values give us a good guess of the true maximum and that locally the density function is symmetric and unimodal. In this case, the mode can be determined by the weighted mean, i.e., we compute the mean rotation

\tilde{q} = \frac{1}{\sum_{i \in S} w_i} \sum_{i \in S} w_i q_i,    (4)
where S denotes the subset of the N samples q_i with a correlation above a certain quantile, and the weights w_i are based on the correlation \hat{C} normalized by the maximum correlation of the grid. This corresponds to the idea that the density function is dominated by the highest mode. In case some of the chosen samples originate from different modes, this can be compensated for by penalizing the distance to the median \bar{q} of S:

w_i = g_{\sigma_C}\big(\hat{C}(r_{q_i})\big) \, d_{\sigma_d}(q_i, \bar{q}).    (5)

For both functions g and d we use a Gaussian kernel with empirically chosen bandwidths σ_C and σ_d, respectively. Note that the refinement can be computed in linear time with respect to the grid size.
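A minimal sketch of this refinement follows, assuming the grid samples have already been flattened to arrays. The function name, the default quantile, and the kernel bandwidths sigma_c and sigma_d are placeholders (the paper only states that the bandwidths were chosen empirically), and applying the Gaussian kernel to 1 − Ĉ and to the Euclidean quaternion distance is one plausible reading of Eq. 5, not the authors' exact code.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def refine_rotation(eulers, corr, quantile=0.99, sigma_c=0.1, sigma_d=0.2):
    """Mode estimate via a weighted quaternion mean (Eqs. 4-5).
    eulers: (N, 3) ZYZ Euler angles [rad] of the correlation-grid samples
    corr:   (N,) correlation values C(r) at those samples."""
    # S: samples with correlation above the chosen quantile
    keep = corr >= np.quantile(corr, quantile)
    q = Rotation.from_euler("ZYZ", eulers[keep]).as_quat()  # (M, 4), (x, y, z, w)
    c = corr[keep] / corr.max()                             # normalized correlation

    # q and -q encode the same rotation: flip all samples into the
    # hemisphere of the best grid sample before averaging
    ref = q[np.argmax(c)]
    q[q @ ref < 0.0] *= -1.0

    # robust center: component-wise median, renormalized (one plausible
    # reading of the median in Eq. 5)
    q_med = np.median(q, axis=0)
    q_med /= np.linalg.norm(q_med)

    # Gaussian kernels on (1 - correlation) and on the distance to the median
    dist = np.linalg.norm(q - q_med, axis=1)
    w = np.exp(-(1.0 - c) ** 2 / (2.0 * sigma_c ** 2)) \
      * np.exp(-dist ** 2 / (2.0 * sigma_d ** 2))

    q_mean = (w[:, None] * q).sum(axis=0) / w.sum()         # Eq. 4
    return Rotation.from_quat(q_mean / np.linalg.norm(q_mean))
```

Aligning the quaternion signs before averaging is essential: without it, antipodal representations of the same rotation would cancel in the mean.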
Fig. 1. The spherical camera (red) mounted on a high-accuracy pan/tilt unit, which can be rotated around the depicted axes (laser scanner and standard camera not used).
4. RESULTS

We validate the proposed algorithm using spherical images acquired with a LadyBug2 camera system by Point Grey (www.ptgrey.com). This system consists of six wide-angle cameras covering about 360° × 130° of the entire sphere. The stitching of the single XGA frames is computed in hardware in real time. For ground-truth comparison we mounted the camera on a pan/tilt unit which allows us to rotate the camera with high accuracy (see Fig. 1). Note that the camera origin does not coincide with the rotation axis. Therefore, an additional translational movement (up to ≈ 20 cm and ≈ 50 cm in the first and second experiment, respectively) is induced, which however does not affect the comparison. On the contrary, a correct rotation estimate demonstrates a certain robustness to translational movements.

In the first experiment we rotate around the pan axis, in the second one around the tilt axis. While the first plot in Fig. 2(a) shows a very smooth reconstruction of the rotation angle that is very close to the correct values, in the second experiment the estimate is less exact. This is due to the incomplete spherical field of view of the camera, i.e., a rotation around the tilt axis introduces a great amount of new image content. The mean squared errors (MSE) are given in Table 1.

MSE [deg²]         Experiment 1    Experiment 2
max. grid value    11.69           8.67
refined            2.25            4.97

Table 1. Mean squared error for the estimate based on the maximum grid value and for the refined solution.
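Such MSE figures can be reproduced from per-frame angle errors in a few lines. The helper below is hypothetical (est_deg and gt_deg stand for arrays of estimated and ground-truth angles in degrees); wrapping the differences avoids spurious 360° jumps.

```python
import numpy as np

def mse_deg(est_deg, gt_deg):
    """MSE in deg^2 between estimated and ground-truth angles,
    wrapping each difference into (-180, 180] first."""
    d = (np.asarray(est_deg) - np.asarray(gt_deg) + 180.0) % 360.0 - 180.0
    return np.mean(d ** 2)
```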
To illustrate the quantization effect in comparison to our refined solution, we show different frames from a video sequence captured with the same camera. The registration is done pairwise with respect to the preceding frame, which allows for larger translational movements over time, yet accumulates registration errors. Fig. 3 shows selected frames of the sequence. The plot in Fig. 2(b) shows the superiority of the refined estimates: the estimated orientation using bandwidth B = 16 is close to the grid-based estimate of four times higher resolution (B = 64). The quantization effect on the higher-resolved grid is still observable; this is especially disturbing when the camera is static (frames 1-11). In contrast, the refined estimate is free from these artifacts. A video file comparing the grid-based and refined estimates on the whole sequence can be downloaded from www.gris.uni-tuebingen.de/~tschairer/omnivision/.
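To make the error accumulation concrete, a short sketch of chaining pairwise estimates into absolute orientations follows; accumulate is an illustrative helper using SciPy's Rotation type, not the authors' code.

```python
from scipy.spatial.transform import Rotation

def accumulate(pairwise):
    """Chain pairwise frame-to-frame rotation estimates into absolute
    orientations relative to the first frame. Each per-pair error
    propagates into all later frames, as noted in the text."""
    absolute = [Rotation.identity()]
    for r in pairwise:              # r: rotation from frame i to frame i+1
        absolute.append(r * absolute[-1])
    return absolute
```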
5. CONCLUSION

We presented a method for refining spherical Fourier-based rotation estimates. Experiments have shown that our method is robust to translational movements and changes in the scene. The increase in accuracy was validated using higher-resolution data as well as ground-truth measurements. We also presented image data for visual comparison. As a key technique, the orientation estimation using the Fourier-based approach with an additional refinement step makes interesting real-time applications possible. Further improvements could likely be achieved by fusing estimates of the camera movement over several consecutive frames.

6. REFERENCES

[1] C. Fermüller and Y. Aloimonos, "Ambiguity in structure from motion: Sphere versus plane," Int. J. Comput. Vision, vol. 28, no. 2, pp. 137–154, 1998.

[2] S. Fleck, F. Busch, P. Biber, and W. Straßer, "Graph cut based panoramic 3D modeling and ground truth comparison with a mobile platform - the Wägele," Image Vision Comput., vol. 27, no. 1-2, pp. 141–152, 2009.

[3] M. Fiala, "Structure from motion using SIFT features and the PH transform with panoramic imagery," in Proc. Canadian Conference on Computer and Robot Vision (CRV), 2005.

[4] J. Lim and N. Barnes, "Directions of egomotion from antipodal points," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2008, pp. 1–8.

[5] P.J. Kostelec and D.N. Rockmore, "FFTs on the rotation group," Tech. Rep., Santa Fe Institute Working Papers Series, 2003.

[6] E. De Castro and C. Morandi, "Registration of translated and rotated images using finite Fourier transforms," IEEE Trans. Pattern Anal. Mach. Intell., vol. 9, no. 5, pp. 700–703, 1987.

[7] H. Foroosh, J.B. Zerubia, and M. Berthod, "Extension of phase correlation to subpixel registration," IEEE Trans. Image Processing, vol. 11, pp. 188–200, 2002.

[8] J.R. Driscoll and D.M. Healy, Jr., "Computing Fourier transforms and convolutions on the 2-sphere," Adv. Appl. Math., vol. 15, no. 2, pp. 202–250, 1994.

[9] A. Makadia and K. Daniilidis, "Rotation recovery from spherical images without correspondences," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 7, pp. 1170–1175, 2006.

[10] C. Gramkow, "On averaging rotations," International Journal of Computer Vision, vol. 42, no. 1/2, pp. 7–16, 2001.
Fig. 2. Orientation estimates (rotation angle α [deg] plotted over the frame number). (a) Estimates and ground-truth (black) of the rotation angles of the first (left) and second (right) experiment. The proposed method (red) corrects the large quantization errors of the maximum SOFT estimate (blue). In the left plot, deviations (cyan) from zero in the other two rotation angles (θ, ψ) also vanish using the refinement method (magenta). The estimation was performed using a bandwidth of B = 16. (b) Estimates of the rotation angles using a bandwidth of B = 16 (blue: maximum grid value, red: refined) and B = 64 (black) for the same image sequence as in Fig. 3.
Fig. 3. Images from a sequence including translational movements. Reference frame shown in the first, frame 36 in the second and frame 45 in the third row (left: input, middle: grid-based, right: refined).