Rotation estimation and vanishing point extraction by omnidirectional vision in urban environment
The International Journal of Robotics Research 31(1) 63–81 © The Author(s) 2011 Reprints and permission: sagepub.co.uk/journalsPermissions.nav DOI: 10.1177/0278364911421954 ijr.sagepub.com
Jean-Charles Bazin1, Cédric Demonceaux2, Pascal Vasseur3 and Inso Kweon4

Abstract
Rotation estimation is a fundamental step for various robotic applications such as automatic control of ground/aerial vehicles, motion estimation and 3D reconstruction. However, it is now well established that traditional navigation equipment, such as global positioning systems (GPSs) or inertial measurement units (IMUs), suffers from several disadvantages. Hence, some vision-based works have been proposed recently. Whereas interesting results can be obtained, the existing methods have non-negligible limitations such as difficult feature matching (e.g. repeated textures, blur or illumination changes) and a high computational cost (e.g. analysis in the frequency domain). Moreover, most of them utilize conventional perspective cameras and thus have a limited field of view. In order to overcome these limitations, in this paper we present a novel rotation estimation approach based on the extraction of vanishing points in omnidirectional images. The first advantage is that our rotation estimation is decoupled from the translation computation, which accelerates the execution time and results in a better control solution. This is made possible by our complete framework dedicated to omnidirectional vision, whereas conventional vision has a rotation/translation ambiguity. Second, we propose a top-down approach which maintains the important constraint of vanishing point orthogonality by inverting the problem: instead of performing a difficult preliminary line clustering step, we directly search for the orthogonal vanishing points. Finally, experimental results on various data sets for diverse robotic applications have demonstrated that our novel framework is accurate, robust, maintains the orthogonality of the vanishing points and can run in real time.
Keywords catadioptric vision, omnidirectional vision, parallel lines, rotation estimation, vanishing points
1. Introduction This paper aims to estimate the complete rotation for robotic applications from omnidirectional vision data in real time. Rotation estimation refers to the computation of the rotation angles of a system and is a fundamental step for various robotic tasks such as ground/aerial vehicle control (Jones et al. 2006; Rondon et al. 2010), humanoid stabilization, motion estimation and 3D reconstruction (Campbell et al. 2005). It is now well established that traditional navigation equipment, such as global positioning systems (GPSs) or inertial measurement units (IMUs), suffers from several disadvantages. For example, GPS is sensitive to signal dropout and hostile jamming. The drawback of the IMU is that its error accumulates over time and may cause large orientation/localization errors. In order to overcome these disadvantages, many researchers have suggested a vision-based approach to the navigation problem which aims to estimate the localization and/or orientation when GPS or inertial guidance is not available (Ettinger et al. 2003; Wang et al.
2005; Rondon et al. 2010). Most of these works utilize conventional cameras and thus have a relatively small field of view, which drastically limits the amount of information that can be obtained from the environment. In contrast, omnidirectional vision systems have a much wider field of view. That is why they are gaining popularity for both ground (Chang and Hebert 1998; Winters et al. 2000) and aerial (Hrabar and Sukhatme 2003, 2004; Benhimane and Malis 2006; Demonceaux et al. 2006) robotic applications.

1 Ikeuchi Laboratory, Institute of Industrial Science, The University of Tokyo, Tokyo, Japan
2 Le2i UMR 5158, University of Burgundy, France
3 LITIS EA 4108, University of Rouen, France
4 RCV Lab, KAIST, Korea

Corresponding author: Jean-Charles Bazin, Ikeuchi Laboratory, 3rd Department, Institute of Industrial Science, The University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan. Email:
[email protected]
Rotation estimation from conventional and omnidirectional images has been studied extensively and the existing methods can be divided into three main categories. The first is based on feature correspondence (e.g. Harris corner, scale-invariant feature transform [SIFT], lines) and epipolar geometry (see Akbarzadeh et al. (2006) and Agarwal et al. (2009) for conventional cameras, and Corke et al. (2004), Mičušík et al. (2004), Doubek and Svoboda (2002), Lhuillier (2007), Tardif et al. (2008), Torii et al. (2009), Scaramuzza and Siegwart (2008), and Scaramuzza et al. (2009) for omnidirectional vision). This category also includes all the simultaneous localization and mapping (SLAM)-based methods such as those of Davison (2003) and Karlsson et al. (2005) for conventional cameras and Lemaire and Lacroix (2007) for omnidirectional vision. Whereas impressive results can be obtained, this common approach still suffers from several important limitations and difficulties: matching feature points is not an easy task (repeated textures, blur, illumination changes, high distortions due to the mirror, etc.) and is time consuming, different apparent motions exist in dynamic scenes, the essential matrix is unstable in planar environments, the homography is not defined for non-planar scenes (it is defined only in the particular case of pure rotation), and so on. Moreover, these works are indirect in the sense that they estimate the complete ego-motion (rotation and translation), although only the rotation might be required (e.g. humanoid walk stabilization or vehicle control). The second category works without feature correspondences and contains several different approaches. For conventional cameras, Dellaert et al. (2000) presented a probabilistic method that samples from the distribution over all correspondence sets using a Markov chain Monte Carlo algorithm. However, the technique assumes that the number of features is known and does not treat outliers and occlusion. For omnidirectional (catadioptric) vision, Makadia and Daniilidis (2006) converted the image into the frequency domain by a spherical Fourier transform. Then the rotation estimation is refined from the conservation of harmonic coefficients in the rotational shift theorem. Since the whole image is analyzed, the method is computationally expensive. Moreover, it is sensitive to translation and dynamic environments. The third category relies on the points at infinity since their motion depends only on the rotation. Scaramuzza and Siegwart (2008) proposed an appearance-based approach where a part of the image pointing in front of the car is tracked. However, it estimates only the yaw angle, is sensitive to occlusion (e.g. another vehicle driving in front of the car), and assumes that the tracked image part is far away (to be translation invariant), that the car follows a planar motion and that the camera is perfectly vertical. An important set of points at infinity is the vanishing points (VPs), which are the intersections of the projections of parallel world lines in the image. VPs and the infinite homography have been studied for conventional cameras (Rother 2000; Kosecka and Zhang 2002) and
for omnidirectional vision (Antone and Teller 2000). Existing methods based on lines and VPs will be reviewed in Section 3. From a control point of view, points at infinity, and thus the infinite homography, yield rotation decoupling properties that respect the vehicle dynamics and result in a better behaved control solution (Rives and Azinheira 2004). The approach proposed in this paper belongs to this third category. Our work takes advantage of the structure of urban (indoor and outdoor) environments, that is the presence of numerous lines (e.g. building facades, walls, sidewalks, etc.) in orthogonal directions, which creates orthogonal VPs. These scenes are often referred to as Manhattan world (Coughlan and Yuille 2000, 2003) and imposing the VP orthogonality makes it possible to respect the structure of urban environments. In this paper, we show how to overcome the limitations of the existing methods, especially how to explicitly impose the VP orthogonality and reach real-time execution. We have privileged omnidirectional vision because it provides two important properties for our approach. First, thanks to the wide field of view, many parallel lines lying in different directions can be observed. Second, the VPs usually lie inside the omnidirectional image. Therefore, it is possible to detect the VPs more robustly and accurately. Existing methods usually detect the VPs and then compute the rotation (bottom-up approach). In contrast, we introduce a top-down approach that, given a rotation estimation, computes the consistency of the associated vanishing points (Bazin et al. 2008a). As we explain in detail, this new framework provides several advantages: it runs in real time, ensures orthogonality of vanishing points, can use a priori rotation information (e.g. from the current GPS/IMU data or the previously processed frame) and is robust. The rest of this paper is composed of four main parts. First, we recall the popular concept of equivalent sphere for omnidirectional vision. In the second part, we review the existing methods related to VP-based rotation estimation. Third, we introduce our proposed top-down approach. Finally, we present numerous experimental results on real omnidirectional videos for rotation estimation. In addition, we perform omnidirectional video stabilization and also apply our method to 3D reconstruction by combining a 2D laser and an omnidirectional camera.
2. Omnidirectional vision and equivalent sphere projection Omnidirectional cameras refer to the vision sensors that can observe in all (or almost all) directions. Compared with traditional cameras, they provide several advantages such as a much wider field of view, a larger amount of information shared between images, the handling of the traditional ambiguity of rotation-translation inherent to traditional cameras, etc. The acquired images usually contain a large distortion, but this effect can be easily overcome by
the proposed framework, as we explain in the remainder of this paper.

Fig. 1. Examples of omnidirectional views obtained by a polydioptric camera. These images are part of the Google Street View data which is copyright and provided by Google.

Fig. 2. Examples of catadioptric views obtained by a hand-held catadioptric camera.

In this paper, we focus on two kinds of central omnidirectional vision systems: camera clusters and catadioptric cameras. A camera cluster (also called a polydioptric device) is composed of several synchronized cameras each pointing in different directions. The acquired pictures can then be stitched together to build a panoramic image, as illustrated in Figure 1. Google Street View is probably the most famous application based on a polydioptric device nowadays (see http://maps.google.com/). A popular commercial device in the computer vision and robotics communities is
the Point Grey’s Ladybug composed of six aligned cameras. Our second omni-sensor of interest refers to the central catadioptric vision systems. They are composed of a convex mirror with a specific shape, a lens and a conventional camera. Their main advantage is that a wide field of view can be acquired using only one camera. Typical catadioptric images are shown in Figure 2. Mirrors can have various shapes (parabolic, hyperbolic) and can be purchased easily. It has been shown that, when the camera calibration parameters are known (Hartley and Zisserman 2004), images can be represented in a convenient Gaussian sphere. This spherical representation has been used since (Barnard
1983) for pinhole images and has been recently extended to the concept of equivalent sphere to handle various types of central cameras such as fisheye, camera cluster (polydioptric), catadioptric and so on (Ying and Hu 2004; Mei and Rives 2007). When the camera is not perfectly central (i.e. does not have a single center of projection), the sphere equivalence makes an approximation that is acceptable in most situations (Kim et al. 2010). For example, Geyer and Daniilidis (2001) have demonstrated the equivalence for central catadioptric systems with a two-step projection via a unitary sphere centered on the focus of the mirror (the single viewpoint). For camera clusters, Kim et al. (2007, 2008) considered the separate perspective views acquired by the cameras and projected them onto the sphere independently. In this paper, we prefer working on the combined panoramic image since it avoids view discontinuities (e.g. the lines are not cut and the features can be tracked along one image). Then it is possible to project the panoramic image onto the sphere by a linear mapping between the 2D image coordinate values (u, v) of the panoramic image and the two spherical angles (α, β) (Banno and Ikeuchi 2009). Among the existing sphere representations, we selected the unified projection model proposed by Mei (2007), a slightly modified version of the models developed by Geyer and Daniilidis (2001) and Barreto and Araujo (2005). This model encompasses all of the central projection devices including perspective cameras, fisheye lenses (Ying and Hu 2004), aligned camera cluster systems and catadioptric cameras. The projection from the original image onto the sphere depends on some calibration parameters intrinsic to the camera and also related to the camera type. Readers are invited to refer to Mei (2007) for details about the projection equation. This sphere equivalence is interesting in several aspects. First, it allows us to work in a general framework, i.e. the sphere, independently of the sensor used: for example a catadioptric camera with hyperbolic or parabolic mirror, a fisheye lens, a camera cluster, etc. A second reason is that it greatly simplifies the formalism needed to take the distortions inherent to wide vision into account. A third important aspect is that the sphere space provides some interesting projection properties. First, a line segment Li in the world is projected onto a great circle in the equivalent sphere space. A great circle is the intersection between the sphere and a plane passing through the sphere center and can be represented by a unit normal vector ni. Second, the great circles associated with a pencil of 3D parallel lines intersect in two antipodal points in the sphere which correspond to the vanishing points, as depicted in Figure 3. This intersection is computed by v = ni × nj where ni and nj are the normal vectors of two world parallel lines.
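To make these sphere-space relations concrete, the following minimal Python sketch (an illustration, not the authors' C++ implementation; the panorama-to-sphere mapping convention is an assumption) builds the great circle normal of a line from two of its spherical points and computes the vanishing point of two parallel lines as v = ni × nj.

```python
# Illustrative sketch of the equivalent-sphere relations of Section 2.
# Assumptions: an equirectangular panorama layout for the (u, v) -> (alpha, beta)
# mapping; points and normals are numpy arrays.
import numpy as np

def panorama_to_sphere(u, v, width, height):
    """Map a panoramic pixel (u, v) to a unit vector on the equivalent sphere
    via a linear mapping to azimuth/elevation (convention is illustrative)."""
    alpha = 2.0 * np.pi * u / width - np.pi        # azimuth in [-pi, pi]
    beta = np.pi * v / height - np.pi / 2.0        # elevation in [-pi/2, pi/2]
    return np.array([np.cos(beta) * np.cos(alpha),
                     np.cos(beta) * np.sin(alpha),
                     np.sin(beta)])

def great_circle_normal(p1, p2):
    """Unit normal of the great circle passing through two spherical points
    of the same world line."""
    n = np.cross(p1, p2)
    return n / np.linalg.norm(n)

def vanishing_point(n_i, n_j):
    """Vanishing point (up to the antipodal ambiguity) of two parallel world
    lines, i.e. the intersection of their great circles: v = n_i x n_j."""
    v = np.cross(n_i, n_j)
    return v / np.linalg.norm(v)
```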
Fig. 3. Equivalent sphere projection: a world line (L1) is projected onto a great circle (C1) in the sphere and the projections of parallel lines (L1 and L2) intersect in two antipodal points (I1 and I2).

3. Existing methods based on vanishing points This section reviews the existing methods that estimate rotation from lines and VPs. Some approaches worked on the raw edge pixels, rather than on the extracted lines, such as those of Denis et al. (2008), Martins et al. (2005), and Antone and Teller (2000). In our work, we privileged lines because they provide higher-level information. The common approach consists of three consecutive main steps: first, the lines are extracted (denoted as step 1); second, the parallel lines are grouped together and the corresponding vanishing points are computed (denoted as step 2); and third, the rotation is estimated from the infinite homography (denoted as step 3). Steps 1 and 3 can be efficiently handled by common algorithms. The lines can be extracted by our algorithm (Bazin et al. 2007), which will be explained in Section 5.1. Regarding step 3, once a set of VPs has been extracted and tracked, their relative rotation can be computed by Horn's method based on unit quaternions (Horn 1987). We now review the existing methods related to step 2. Their input is the set of lines extracted in step 1 and represented by a list of normal vectors. Most of the methods operate on the equivalent sphere, where parallel lines are projected onto great circles that intersect in two antipodal points (cf. Section 2). Existing methods can be divided into three main categories. The first category relies on an exhaustive or quasi-exhaustive study of the line intersections. For all line pairs, Magee and Aggarwal (1984) compute a cross-product which corresponds to a direction parameterized by the azimuth and elevation angles (θ, φ) in sphere space. The dominant vanishing point is obtained by sampling the (θ, φ) parameter space and selecting the bin (θmax, φmax) that contains the highest number of entries. This basic approach is sensitive to the sampling of the parameter space, which might lead to multiple detections, and does not directly impose the orthogonality constraint on vanishing points since they are independently detected. Several similar methods are based on the Hough transform, where the VP detection is performed on a quantized Gaussian sphere (Barnard 1983; Quan and Mohr 1989). However, these methods are sensitive to the parameter sampling (φ and θ), which is a traditional Hough limitation, and also do not permit the VP orthogonality to be
imposed. These limitations have been confirmed by Shufelt (1999), who showed that the Hough transform also leads to spurious VPs. The second category is based on the simple yet powerful RANSAC framework (Fischler and Bolles 1981), which avoids the exhaustive search of the previous category. RANSAC is a very popular method to detect the inliers/outliers under an unknown model and to estimate this model by maximizing a consensus set. This approach has been applied by Rother (2002), Aguilera et al. (2005), and Wildenauer and Vincze (2007) in the context of line clustering. Two lines are randomly selected to create a vanishing point hypothesis. This hypothesized VP is tested for all of the lines and the number of lines clustered by this VP (consensus set) is counted. This process is repeated over several iterations and returns the VP (and the associated line clustering) that maximizes the consensus set. Whereas interesting results can be obtained, the detected VPs are usually not orthogonal because they are independently detected. Moreover, the RANSAC-based methods are non-deterministic: different runs on the same data might produce different results. The third category relies on schemes that alternate between the clustering and VP estimation steps (Antone and Teller 2000; Bosse et al. 2003). The most popular method of this category is the expectation–maximization (EM) algorithm. Given some initial VPs, EM alternates between performing an expectation (E) step, which computes the expectation of the line clustering evaluated for the current VPs, and a maximization (M) step, which computes the VPs from the data clustering found at the E step. This process iterates with the new VPs until convergence. However, it is well known that EM-based methods require a precise initialization; otherwise they return unreliable results. Moreover, they need some probabilistic distributions that are complicated to obtain and are often based on heuristics. In conclusion, this review shows that existing methods are too computationally expensive, are sensitive to sphere quantization (Hough), cannot use a priori information and/or do not impose orthogonality constraints. This last aspect is of key importance for urban environments in order to correctly respect the scene structure.
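For concreteness, the sketch below illustrates the RANSAC-style scheme of the second category described above (hypothesize a VP from two randomly selected lines, then count the lines it clusters); the iteration count and tolerance are placeholder values and this is only a simplified illustration, not the exact procedure of Rother (2002) or Wildenauer and Vincze (2007).

```python
# Simplified RANSAC-style VP detection on the equivalent sphere (illustration).
# `normals` is an (N, 3) numpy array of unit great-circle normals of the lines.
import numpy as np

def ransac_vanishing_point(normals, n_iter=500, tau_deg=2.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    tau = np.radians(tau_deg)
    best_vp, best_inliers = None, np.array([], dtype=int)
    for _ in range(n_iter):
        i, j = rng.choice(len(normals), size=2, replace=False)
        vp = np.cross(normals[i], normals[j])     # VP hypothesis from 2 lines
        norm = np.linalg.norm(vp)
        if norm < 1e-9:                           # degenerate pair, skip it
            continue
        vp /= norm
        # a line passes through vp if vp lies (close to) its great circle,
        # i.e. |arccos(n . vp) - pi/2| = |arcsin(n . vp)| is small
        residuals = np.abs(np.arcsin(np.clip(normals @ vp, -1.0, 1.0)))
        inliers = np.flatnonzero(residuals <= tau)
        if len(inliers) > len(best_inliers):
            best_vp, best_inliers = vp, inliers
    return best_vp, best_inliers
```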
4. Proposed method for rotation estimation 4.1. Our approach As reviewed, the existing methods consist in independently extracting the VPs and then computing the rotation (bottom-up approach). In order to overcome their limitations, we propose to invert the problem: we first make a hypothesis on the rotation and then we evaluate the consistency of the associated hypothesized vanishing points (top-down approach) (Bazin et al. 2008a). Before writing the formulation of the problem, let us fix some notation. Let nj be the normal associated with the jth detected line in sphere space (cf. Section 5.1) and N the
number of detected lines. Here α, β, γ correspond to the roll, pitch, and yaw angles and Rα,β,γ is the associated 3 × 3 rotation matrix. This rotation corresponds to a unique set of orthogonal VPs. Indeed, by applying a rotation R to an initial orthonormal coordinate system (e1, e2, e3), we obtain a new orthonormal coordinate system (v1, v2, v3) that can represent the orthogonal VPs in sphere space. This equivalence is illustrated in Figure 4. Given the set of extracted lines, we aim to estimate the rotation (and the corresponding VPs) and determine which line is associated with which VP. When a line nj is associated with a VP vi, the line–VP pair (nj, vi) is considered an inlier, and otherwise an outlier. To distinguish inliers/outliers, we follow the popular 'residual tolerance method' (Fischler and Bolles 1981). Concretely, we define that the line–VP pair (nj, vi) is an inlier if its geometric distance is lower than a residual tolerance τ, i.e. |arccos(nj · vi) − π/2| ≤ τ. This τ can be set easily, as is done in most inlier/outlier detection methods (e.g. RANSAC (Fischler and Bolles 1981)). Experiments have shown that its value does not play an important role as long as it is physically meaningful, such as τ ∈ [0.5°, 3°]. We fixed τ = 2° for all of the experiments shown in this paper. The VP consistency is defined as the number of lines passing through the hypothesized VPs, i.e. the number of line–VP pair inliers. The goal is to find the rotation that maximizes the VP consistency, which can be considered a consensus set maximization. Our problem can now be mathematically formulated as

arg max_{α,β,γ} Σ_{i=1}^{M} Σ_{j=1}^{N} δ(Rα,β,γ ei, nj)    (1)

with

δ(u, v) = 1 if |arccos(u · v) − π/2| ≤ τ, and 0 otherwise    (2)
where M corresponds to the number of VPs, set to M ∈ {1, 2, 3} and discussed in Section 4.3.1. Here ei represents the ith axis of the Cartesian coordinate system (i.e. e1 = (1, 0, 0)). After computing the geodesic distance between nj and vi = Rα,β,γ ei, the Dirac δ function returns 0 or 1 depending on whether the pair (nj, vi) is consistent (inlier) or not (outlier). In other words, it counts the number of inliers, like consensus set maximization approaches. When the maximization of Equation (1) is completed, we obtain:
• the rotation angles (αmax, βmax, γmax) and the associated rotation matrix Rαmax,βmax,γmax;
• the lines that are in the same direction as the M rotated axes; these M sets of lines correspond to the clusters of parallel lines;
• the M vanishing points vi = Rαmax,βmax,γmax ei, with i ∈ [1 . . . M]; note that they are mutually orthogonal by construction, since the basis (e1, e2, e3) is orthogonal.
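As an illustration of Equations (1) and (2), the following sketch counts the line–VP inliers for a hypothesized rotation; the Euler-angle convention and the function names are assumptions for illustration, not the authors' implementation.

```python
# Consistency score of Equations (1)-(2): count line-VP inlier pairs for a
# hypothesized rotation. `normals` is an (N, 3) array of unit line normals.
import numpy as np

def rotation_matrix(alpha, beta, gamma):
    """Roll-pitch-yaw rotation (one possible Euler convention, in radians)."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    Rx = np.array([[1, 0, 0], [0, ca, -sa], [0, sa, ca]])
    Ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
    Rz = np.array([[cg, -sg, 0], [sg, cg, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def vp_consistency(alpha, beta, gamma, normals, M=3, tau=np.radians(2.0)):
    """Number of line-VP inliers for the hypothesized VPs v_i = R e_i."""
    R = rotation_matrix(alpha, beta, gamma)
    vps = R[:, :M]                               # columns R e_1, ..., R e_M
    dots = np.clip(normals @ vps, -1.0, 1.0)     # (N, M) dot products
    residuals = np.abs(np.arccos(dots) - np.pi / 2.0)
    return int(np.sum(residuals <= tau))         # delta summed over all pairs
```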
For fast execution purposes, one might prefer the alternative algebraic condition |u · v| ≤ τ, where τ then defines a dot product similarity threshold. In the experiments shown in this paper, we used the geometric condition of Equation (2). A discussion about the original world coordinate system (e1, e2, e3) and the number of observable directions is given in Section 4.3.3.

Fig. 4. An initial orthonormal coordinate system (e1, e2, e3) is transformed, by a rotation R, into a new orthonormal coordinate system (v1, v2, v3) that can represent the orthogonal VPs in sphere space. It depicts the important fact that searching for orthogonal VPs amounts to searching for a particular rotation.
4.2. Algorithm In an initial approach, we had maximized Equation (1) by a simple exhaustive search on the α, β, and γ angles. At each rotation sample, we computed the number of lines satisfying the inlier function in an accumulator. A priori rotation information (e.g. the results of the previous frame) provides initial values αest, βest and γest. We defined the associated search space of each estimated angle angleest as {angleest − D + kT} with integer k ∈ [0, 2D/T], i.e. the interval centered at angleest with an offset of ±D and a sampling rate of T. Therefore, the total number of dot products to compute is MN(2D/T)³, where N is the number of detected lines. As T is fixed, we refer to this method as the fixed sampling algorithm (FS). Experiments have shown that FS can run very fast for some parameter settings (such as T = 2° and D = 5°) but is too slow for parameters defining a large interval with a very fine sampling (e.g. T = 0.1° and D = 10°). That is why, in our final implementation, we have preferred a multi-scale sampling approach: the function (1) is maximized by successively refining the discretization of the rotation angles until a desired precision is reached (cf. Algorithm 1). This method allows us to avoid an exhaustive search on very large and fine intervals while maintaining a given accuracy. At any accuracy level J, the search space of an angle is split into K intervals, thus the complexity is MN(K + 1)³. The maximum accuracy level needed to reach an accuracy Tgoal is Jmax = ceil(ln(Tgoal/D) / ln(2/K)) = ceil(log_{2/K}(Tgoal/D)), where ceil is the round toward infinity operator. Therefore, the total complexity is Σ_{J=1}^{Jmax} MN(K + 1)³.
Algorithm 1 (αmax, βmax, γmax) = FnRotationEstimation(ListN, (αest, βest, γest))
Input: ListN : list of normals in the current frame
Input: (αest, βest, γest) : a priori rotation estimation
Doriginal = offset
K = sampling
Tgoal = accuracy to reach
J = 0 (level of accuracy, initialization)
while J == 0 or T > Tgoal do
    MaxValue = 0
    J = J + 1
    D = Doriginal ∗ (2/K)^(J−1)
    T = Doriginal ∗ (2/K)^J
    for α = αest − D : T : αest + D do
        for β = βest − D : T : βest + D do
            for γ = γest − D : T : γest + D do
                Value = Σ_{i=1}^{M} Σ_{j=1}^{N} δ(Rα,β,γ ei, ListNj)
                if Value > MaxValue then
                    MaxValue = Value
                    (αmax, βmax, γmax) = (α, β, γ)
                end if
            end for
        end for
    end for
    (αest, βest, γest) = (αmax, βmax, γmax)
end while
Output: (αmax, βmax, γmax) : estimated rotation
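The sketch below mirrors Algorithm 1 (multi-scale sampling). It reuses the vp_consistency helper from the earlier sketch; working in radians and the exact grid construction are illustrative assumptions, not the authors' C++ implementation.

```python
# Multi-scale rotation search in the spirit of Algorithm 1 (illustration).
import numpy as np
from itertools import product

def estimate_rotation(normals, est, D_original=np.radians(10.0), K=10,
                      T_goal=np.radians(1.0), M=3, tau=np.radians(2.0)):
    alpha, beta, gamma = est          # a priori rotation (e.g. previous frame)
    J, T = 0, np.inf
    while J == 0 or T > T_goal:
        J += 1
        D = D_original * (2.0 / K) ** (J - 1)   # half-width of search interval
        T = D_original * (2.0 / K) ** J         # sampling step at this level
        grid = np.arange(-D, D + 0.5 * T, T)    # K + 1 samples per angle
        best_value, best_angles = -1, (alpha, beta, gamma)
        for da, db, dg in product(grid, grid, grid):
            value = vp_consistency(alpha + da, beta + db, gamma + dg,
                                   normals, M=M, tau=tau)
            if value > best_value:
                best_value = value
                best_angles = (alpha + da, beta + db, gamma + dg)
        alpha, beta, gamma = best_angles        # recenter the next, finer level
    return alpha, beta, gamma
```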
Once the parallel lines are clustered by Algorithm 1, the associated VPs can be, if necessary, refined using this line clustering and least-squares fitting (e.g. the M-step of the EM algorithm, in the same way as Antone and Teller (2000)) to filter the great circle noise.
To obtain αest, βest and γest, we use a dynamic model to predict the rotation at the current frame. Using the quaternion representation, this can be mathematically written as

qt = qt−1 ⊗ q((Ω + ω) Δt)    (3)

where q is the quaternion representation of any rotation Rα,β,γ, ⊗ represents the quaternion multiplication, Ω is the angular velocity (rotation speed), Δt is the time interval between t − 1 and t, and ω represents some velocity noise (Davison et al. 2007). For the experiments, we set the angular velocity Ω = 0 and the noise ω = D/Δt, with D = 10°. With this model, the search space is centered at the results of the previous frame with an offset D. In the case of sudden motions, some modifications are possible. The simplest one consists of increasing the offset D. It is also possible to combine it with the approach of Bazin et al. (2008b) where we investigated the extraction and matching of VPs in highly dynamic sequences. A more complex prediction model could also be applied, taking into account the vehicle type (e.g. car, airplane or bicycle), the vehicle dynamics/constraints (e.g. non-holonomic constraints), the application scenario and the available extra sensors. For example, if an IMU is available, it can provide an estimate of the angular velocity Ω, which can thus be inserted into the prediction model (e.g. Equation (3) or an advanced Kalman filter). This kind of dynamic model-based approach can help smooth the rotation angle estimation. Moreover, an appropriate model has a low noise ω, which permits us to reduce the search interval and thus diminishes the execution time. Intensive experiments on various sequences have demonstrated the validity of the proposed approach (cf. Section 5).
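A minimal sketch of the prediction step of Equation (3) is given below, with a hand-written quaternion product to stay self-contained; the (w, x, y, z) ordering and the helper names are assumptions for illustration.

```python
# Constant-velocity orientation prediction in the spirit of Equation (3).
import numpy as np

def quat_mult(q1, q2):
    """Hamilton product of quaternions given as (w, x, y, z)."""
    w1, x1, y1, z1 = q1
    w2, x2, y2, z2 = q2
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def quat_from_rotvec(rotvec):
    """Quaternion rotating by |rotvec| radians about rotvec/|rotvec|."""
    angle = np.linalg.norm(rotvec)
    if angle < 1e-12:
        return np.array([1.0, 0.0, 0.0, 0.0])
    axis = rotvec / angle
    return np.concatenate([[np.cos(angle / 2.0)], np.sin(angle / 2.0) * axis])

def predict_orientation(q_prev, angular_velocity, dt):
    """q_t = q_{t-1} (x) q(Omega * dt). With Omega = 0, as in the paper's
    setting, the prediction is simply the previous orientation and the
    velocity noise only widens the search offset D."""
    return quat_mult(q_prev, quat_from_rotvec(np.asarray(angular_velocity) * dt))
```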
4.3. Discussion 4.3.1. Observability of lines The optimization function (1) counts the number of lines that are in the same direction as the M rotated axes. For general motion, the minimum number of VPs that have to be observed to define a rotation is M = 2 (Horn 1987). For planar motion, M = 1 VP is sufficient because the rotation has only one degree of freedom in this case. In terms of line support, the only assumption deals with the possibility of detecting at least this minimal number of parallel line bundles (i.e. two for general motion and one for planar motion), each bundle being composed of at least two lines (to define their intersection, i.e. their associated VP), in the same way as Antone and Teller (2000). The simplest solution for rotation estimation is to detect only the minimum number of VPs. Nevertheless, it is well known that any estimation is more robust when more data are available. This is the main reason why we prefer to detect, if available, the maximum number of orthogonal VPs: M = 3 VPs. In the case where fewer VPs exist in the scene, the sets of lines associated with the extra VPs will simply be empty, which does not corrupt the algorithm. In some circumstances (time constraint, knowledge of the scene structure, constrained motion, etc.), the value of M can be manually set to a specific value: 1 for planar motion (cf. Section 5.2.3), 2 when two main orthogonal directions are known a priori (cf. Figure 8) or 3 for general cases (cf. Figure 6).

Fig. 5. Three possible configurations (2nd, 3rd, and 4th figure) can align the original world coordinate system with the three main directions of the image (1st figure).
4.3.2. Line projection ambiguity All of the world lines belonging to the same 3D projective plane are projected onto the same great circle. Therefore, if this projective plane passes through one of the VPs, then some lines non-parallel to the lines defining the main VPs might be 'luckily' clustered in this VP. This line projection ambiguity is also common to the existing VP-based works (cf. Section 3). It is not an issue in the environment targeted in our application (i.e. a typical urban scene) and it has not affected our algorithm during our numerous experiments with real data. The only consequence is that a few lines might be, in some rare situations, not correctly clustered; and since they satisfy the inlier function of Equation (2), they do not corrupt the VP estimation.
4.3.3. Original reference frame The optimization function (1) aims to compute the rotation matrix that best aligns an original world coordinate system (e1, e2, e3) with the dominant directions of the image. It is usually considered that e1 and e2 define the ground plane and e3 the vertical. Therefore, one might expect that the vertical direction of the image will be aligned with e3. In practice, that does not always have to be the case. Indeed, as depicted in Figure 5, there exist three possible configurations that can align the original world coordinate system with the three main directions of the image, having the same orientations (if sign/direction modification is allowed, 24 permutations exist (Martins et al. 2005)). In the case of two main directions, six configurations are possible. We tested our algorithm with the three possible direct bases: (e1, e2, e3), (e2, e3, e1), and (e3, e1, e2). As expected, we obtained different alignments (since the bases are different), exactly the same number of lines in the directions of the rotated axes and the same vanishing points. Therefore, these multiple configurations do not have any consequences for our applications.
Fig. 6. Experimental results obtained by the proposed method for the extraction of vanishing points and parallel lines. Each conic corresponds to a detected line and all parallel lines have the same color.
5. Experimental results In this section we present numerous experimental results concerning the rotation estimation using the proposed approach. First, we explain how to extract lines in omnidirectional images, which provides the input data of our VP extraction algorithm. Then we perform experiments about rotation estimation in catadioptric and polydioptric images. Finally, we present some additional results in the context of video stabilization and 3D reconstruction to illustrate the robustness and generality of our method.
5.1. Line extraction For the experiments, we extracted the lines in calibrated omnidirectional images with our algorithm (Bazin et al. 2007), which we slightly modified. It can be considered an extension of polygonal approximation to the sphere space. The main idea is the following: after detecting edges in the image and building chains of connected edge pixels, we project these chains onto the equivalent sphere and check whether they satisfy the great circle constraint, that is to say whether they correspond to the projection of world lines (cf. Section 2). For this, we perform a split-and-merge algorithm based on the distance between the spherical chain points and the plane defining a great circle. Let Ps be a spherical point and n the normal vector of the plane corresponding to a great circle. Instead of using the approach of Bazin et al. (2007), where the algebraic distance was used (i.e. |Ps · n|), it is possible to compute the geodesic (angular) distance (i.e. |arccos(Ps · n) − π/2|). This geometric distance corresponds to a more meaningful measure and respects the distortions. The algorithm returns a list of great circle normals corresponding to the detected lines. It has the advantages of running fast with accurate results. Moreover, it can handle large occlusions of lines and can be applied to any central sensor.
5.2. Rotation estimation
This section presents experimental results of rotation/VP estimation for catadioptric sequences and the omnidirectional Google Street View dataset. Complete results are presented in the accompanying video available on the authors' website: http://www.cvl.iis.u-tokyo.ac.jp/∼jcbazin/IJRR2011. 5.2.1. Camera–IMU calibration To measure the accuracy of the proposed method, we used an external IMU, the MTi IMU from Xsens. It outputs a drift-free 3D orientation, obtained from gravity and the Earth's magnetic field, and will be considered a ground truth sensor. More precisely, it returns the orientation between its own frame and the Earth coordinate system. In contrast, our method computes the orientation between the camera frame and the coordinate system composed of the detected vanishing directions. Therefore, it is not possible to directly compare the results obtained from these two sensors and thus they must be
calibrated. For the camera–IMU calibration, we follow the method presented in Bazin et al. (2009) based on Lobo and Dias (2007). The main idea is to observe the vertical direction at different poses, and then the relative rotation between the two sensors can be obtained by Horn (1987) in closed form.

Fig. 7. Comparison of roll (top), pitch (middle) and yaw (bottom) angles between the calibrated IMU (considered ground truth, dashed line) and the proposed method (solid line with Tgoal = 1°, Doriginal = 10°, K = 10). The 'jumps' of the yaw angle are simply due to the discontinuities (−180°, +180°).
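As a sketch of the closed-form camera–IMU alignment step of Section 5.2.1: Horn (1987) solves the underlying absolute-orientation problem with unit quaternions; the SVD-based solution below yields the same optimal rotation and is shown here only because it is compact. The variable names and data layout are assumptions.

```python
# Closed-form rotation aligning corresponding unit directions (e.g. the
# vertical observed at several poses in the camera and IMU frames).
import numpy as np

def align_rotations(dirs_cam, dirs_imu):
    """Return R minimizing sum_i ||a_i - R b_i||^2, with a_i the rows of
    dirs_cam and b_i the rows of dirs_imu (both (N, 3) unit vectors)."""
    B = np.asarray(dirs_cam).T @ np.asarray(dirs_imu)        # 3x3 correlation
    U, _, Vt = np.linalg.svd(B)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # enforce det(R)=+1
    return U @ S @ Vt
```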
5.2.2. Catadioptric sequence To analyze our method, we have acquired a catadioptric video sequence composed of 520 frames at 5 frames per second and the associated IMU data. The video was acquired in a parking lot (i.e. a typical urban scene). A human experimenter held the camera by hand and walked freely, so that the camera underwent a general motion in all directions (horizontal and height) and orientations (roll, pitch, and yaw). The total travel path was about 40 m and the rotation amplitude was large due to the free motion: about 40° for roll, 60° for pitch, and 360° for yaw angles. For each frame of the sequence, we estimate the best rotation by optimizing Equation (1). For the first frame, an initial solution is obtained by the Hough transform (Quan and Mohr 1989) (cf. Section 3): the Hough transform is applied for all of the lines and the most dominant VP is found at the bin containing the highest number of entries, then the associated lines are removed and the process is iterated for the following VPs. Figure 6 depicts typical results of vanishing point extraction for this sequence.
To qualitatively measure the accuracy of our approach, we compare the rotation angles estimated by our proposed method and the calibrated ground truth data obtained by the drift-free IMU. Both permit us to obtain the relative orientation between two time steps (e.g. two consecutive frames). To analyze the error accumulation, we compared the absolute rotations instead. Figure 7 compares the evolution of the absolute rotations. It shows that our estimation does not deviate from the drift-free IMU data, which means that our approach does not accumulate error. It must be noted that, according to Xsens specifications, the dynamic accuracy of the MTi sensor is about 2° and a synchronization delay might exist between the camera and the IMU. For quantitative comparison, Table 1 contains the error between the calibrated IMU data and our proposed method with different sets of sampling parameters. These results demonstrate the accuracy of our approach. The vanishing points obtained by our method are mutually orthogonal by construction, whereas data analysis by Bazin et al. (2007) has shown that the average angle is 89.7° with a standard deviation of 4.7°. We have also applied the proposed algorithm to three other catadioptric sequences with no gyroscope data. Some representative results are displayed in Figure 8. The sequences gungdong1 and gungdong2 have been acquired by a hand-held catadioptric camera in free motion (3D translation and full rotation) in a dense urban environment. The rotation is estimated by tracking M = 2 VPs. The sequence chungdae1 has been recorded on a road. By
Table 1. Mean and standard deviation error (mean/SD) in degrees between the calibrated IMU data and the proposed methods of fixed sampling (FS) and multi-scale sampling (MS) with the same sets of parameters as in Table 2. For comparison with existing methods, we also applied the approach of Bazin et al. (2007).

Algorithm              Parameters        Roll        Pitch       Yaw
FS                     T = 5, D = 10     1.9 / 2.2   2.0 / 1.7   4.3 / 3.1
MS                     K = 4             1.9 / 2.2   2.0 / 1.7   4.3 / 3.1
MS                     K = 10            1.3 / 1.1   1.5 / 1.1   4.0 / 2.9
FS                     T = 2, D = 5      1.9 / 2.0   2.2 / 3.2   4.5 / 3.8
MS                     K = 4             2.0 / 3.2   1.7 / 2.0   4.4 / 3.6
MS                     K = 10            1.7 / 1.7   2.0 / 2.8   3.8 / 2.7
FS                     T = 1, D = 5      1.7 / 1.7   2.0 / 2.8   3.8 / 2.7
MS                     K = 4             1.3 / 1.2   1.5 / 1.5   4.0 / 2.9
MS                     K = 10            1.7 / 1.7   2.0 / 2.8   3.8 / 2.7
FS                     T = 1, D = 10     1.2 / 1.0   1.3 / 1.0   3.9 / 2.8
MS                     K = 4             1.3 / 1.0   1.3 / 1.0   3.7 / 2.7
MS                     K = 10            1.3 / 1.0   1.3 / 1.0   3.8 / 2.7
Bazin et al. (2007)                      4.6 / 2.2   4.3 / 2.1   4.6 / 3.5
Fig. 8. Vanishing point extraction by the proposed algorithm on the sequences gungdong1 (top row), gungdong2 (middle row), and chungdae1 (bottom row). These sequences contain dynamic environments and blurred images. The conics have been enlarged for a better visualization.
Fig. 9. Combining our rotation estimation algorithm with an OpenGL program allows us to simulate, in real time, the attitude of a robot by applying the estimated rotation to a virtual airplane.
Table 2. Average execution time in frames per second (fps) on non-optimized C++ code with respect to the search space sampling and the number of lines by fixed sampling (FS) and multi-scale sampling (MS) algorithms (execution time for line extraction by Bazin et al. (2007) not included).

Algorithm   Parameters                      N = 50   N = 100   N = 150
FS          T = 5°, D = 10°                 1,901    1,220     885
MS          K = 4 (T1 = 5, Tf = 5)          1,901    1,221     884
MS          K = 10 (T1 = 2, Tf = 2)         177      114       83
FS          T = 2°, D = 5°                  1,090    704       512
MS          K = 4 (T1 = 2.5, Tf = 1.25)     951      608       443
MS          K = 10 (T1 = 1, Tf = 1)         177      114       83
FS          T = 1°, D = 5°                  177      114       83
MS          K = 4 (T1 = 2.5, Tf = 0.625)    628      407       296
MS          K = 10 (T1 = 1, Tf = 1)         177      114       83
FS          T = 1°, D = 10°                 26       17        12
MS          K = 4 (T1 = 5, Tf = 0.625)      472      304       221
MS          K = 10 (T1 = 2, Tf = 0.4)       89       57        42
assuming the planar motion of a car on the road, only one VP needs to be tracked. Our algorithm manages to track the vanishing points correctly in the three sequences. We have also experimented that our approach can handle line outliers (i.e. some lines that are classified to a wrong vanishing point), blurry images (as long as an edge operator can extract some line edge pixels) and dynamic environments (car, pedestrian, etc.) which demonstrates the robustness of our approach. The execution time of our method is displayed in Table 2, using a computer equipped with a quad CPU at 2.4 GHz (only one core was used) and 3.2 GB RAM. The parameters (T = 1◦ , D = 5◦ ) or (T = 1◦ , D = 10◦ ) are realistic values for vehicle control and 3D reconstruction, and lead to an average processing time between 57 fps (K = 10) and 407 fps (K = 4) with N = 100 lines on non-optimized C++ code. Since all of the rotation samples at a level J can be
studied independently, it is possible to implement the proposed framework on a GPU to process the rotation samples in parallel, which would accelerate the execution time. A GPU approach could also be adopted for the line extraction step, since each edge chain can be analyzed independently. For demonstration purposes, we have also developed an OpenGL visualizer connected to our rotation estimation program. It shows that we are able to simulate, in real time, the robot attitude by applying the estimated rotation to a virtual airplane (cf. Figure 9 and the accompanying video). 5.2.3. Google Street View sequence We also applied our rotation estimation framework to a Google Street View dataset kindly provided by Google. This dataset contains some long image sequences, and the associated calibration parameters, collected by an omnidirectional camera embedded on the roof of a moving car as part of the Street View
feature in Google Maps (see http://maps.google.com/). The car travels on city streets and also turns with large rotations at crossroads. Some representative experimental results for line extraction and VP estimation are shown in Figure 10. Figure 11 illustrates the number of lines clustered with respect to their vanishing points along 200 frames of a Google Street View sequence. The number of lines varies between about 100 and 250 in the images, which shows that lines are indeed prominent features in urban environments. This figure also depicts the evolution of the number of lines for each VP cluster, which corresponds to the level of VP observability, i.e. how dominant a certain VP is in the images. This number of clustered lines also constitutes interesting information for detecting rare situations such as suddenly unstructured scenes or rapid large occlusions.

Fig. 10. Line and vanishing point extraction by the proposed algorithm on a Google Street View sequence (same images as Figure 1). Each conic corresponds to a detected line and all parallel lines have the same color. The conics have been enlarged for a better visualization.

Fig. 11. Evolution of the number of lines clustered to their associated VPs along 200 frames of a Google Street View sequence.

Fig. 12. Motion estimation (top) on a part of a Google Street View sequence from the estimation of rotation and car non-holonomic constraints. The ground truth trajectory provided by Google is displayed at the bottom.

Ground truth rotation is unknown for this dataset but the true trajectory is provided. Therefore, in order to compare our results with true data, we computed the camera trajectory from the rotation estimated by our framework and car non-holonomic constraints. Given the rotation obtained by the VPs, the car position at time t, denoted (px(t), py(t)), can be estimated by non-holonomic constraints:

px(t) = px(t − 1) + dt · cos(θt + Δθt/2)    (4)
py(t) = py(t − 1) + dt · sin(θt + Δθt/2)    (5)
where dt is the distance traveled between t − 1 and t, θt is the rotation at time t and Δθt = θt − θt−1. Initially, px(0) = py(0) = 0. In our case, since the distance dt is known only up to scale, we set dt = 1 for every t (i.e. constant velocity). Figure 12 shows a typical example of trajectory estimation by these non-holonomic constraints. Additional results are displayed in Figure 13. These sequences contain between 300 and 1,200 frames and cover between 250 and 900 m. The purpose of this experiment is not to compete with the general motion estimation algorithms (see, e.g., Akbarzadeh et al. 2006; Torii et al. 2009), but rather to verify our rotation estimation results against the true trajectory provided by Google. However, for completeness, we also applied a general motion estimation algorithm (Mouragnon et al. 2006). It is composed of the following steps: feature point detection and tracking (SIFT or Harris corners+KLT), projection onto the equivalent sphere space (Banno and Ikeuchi 2009), motion estimation based on Nister's three-point algorithm (Nistér and Stewénius 2006), and finally bundle adjustment (Jeong et al. 2010). The result
obtained by this approach for the sequence of Figure 12 is shown in Figure 14. The overall aspect is reasonable and the scale is relatively correct. In contrast, concerning the technique combining the VP-based rotation and the non-holonomic constraints, the fact that dt is simply set to a constant value of 1 in Equation (5) explains the non-overlapping streets at the top of Figure 12. This odometry experiment (1) allows us to interpret and verify the estimated rotation angles more easily and also (2) reflects the trajectory structure. For example, it is important to note that the trajectory estimated by our approach contains orthogonal and parallel parts, which corresponds to the structure of the scene. In contrast, results obtained by general motion algorithms, such as those shown in Figure 14, show that the error accumulates over time so that the estimated trajectory is not consistent with the true motion: the parallelism and orthogonality of the streets are not correctly preserved. One may note that these Google Street View sequences are very challenging (e.g. illumination changes, dynamic scenes, etc.) and contain many occluders (e.g. trees, parked/moving cars, pedestrians, and shadows), which complicates the extraction of major edges. Despite these difficulties, the results are very satisfying. The complete VP extraction results for these sequences are available in the accompanying video.

Fig. 13. Trajectory estimation (a,b,c) by our VP-based rotation estimation algorithm and car non-holonomic constraints on some parts of the Google Street View dataset. The ground truth trajectory provided by Google is displayed in the bottom row (d,e,f).
Fig. 14. Motion estimation result obtained by the general approach (Mouragnon et al. 2006) based on the three-point algorithm (Nistér and Stewénius 2006), associated with Figure 12.
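For reference, a minimal sketch of the odometry integration of Equations (4) and (5) used above, with the up-to-scale step length dt set to 1 as in the experiment; the function name and input format are illustrative.

```python
# Integrate the non-holonomic motion model of Equations (4)-(5).
import numpy as np

def integrate_trajectory(yaw_angles, step=1.0):
    """yaw_angles: sequence of theta_t in radians, one value per frame."""
    px, py = [0.0], [0.0]
    for t in range(1, len(yaw_angles)):
        d_theta = yaw_angles[t] - yaw_angles[t - 1]       # delta theta_t
        px.append(px[-1] + step * np.cos(yaw_angles[t] + d_theta / 2.0))
        py.append(py[-1] + step * np.sin(yaw_angles[t] + d_theta / 2.0))
    return np.array(px), np.array(py)
```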
5.3. Rotational video stabilization This section presents how the proposed rotation estimation framework can be applied for video stabilization and also demonstrates the quality of our approach in this stabilization context. Our goal of video stabilization is to maintain the video in a certain orientation. It is particularly suited for hand-held sequences because of the inherent shakes and jitters which make the visualization very inconvenient. We first review the few existing methods, then introduce our approach and finally present some experimental results.
Fig. 15. Rotational stabilization by the proposed method on a hand-held catadioptric video. (a) Reference orientation. (b,c,d) Initial acquired images. (e,f,g) Their rotational alignment with respect to the orientation of (a). The colored arrows illustrate one initial (red) or aligned (green) direction. For a clearer visualization, only one direction is plotted, but the rotational alignment is not just a 1D rotation but rather a complete 3D rotation transformation, so that, for example, the verticals are also aligned. This vertical alignment is especially visible in (c), where the pitch is correctly rectified in (f) with respect to (a).
This stabilization topic has received only little attention for omnidirectional images. Albrecht et al. (2010) aimed to stabilize an image cropped from the sphere by a virtual perspective camera. However, the output image is only a small part of the sphere and the method needs to track a particular target that must always be visible in the image. Moreover, the authors showed that this pure target-tracker approach was not robust enough and needed to be combined with an IMU to obtain satisfying results. Recently, Torii et al. (2010) estimated the camera trajectory and then rectified the images so that they are aligned with the first frame. Interesting results can be obtained but, since the trajectory is estimated by traditional epipolar geometry, it does not work for pure or quasi-pure rotational motion because this leads to a degenerate case. Moreover, it requires the motion to be parallel to the ground and the first image to be chosen 'with care', meaning it must be aligned with the vertical direction. In contrast, our method performs the rectification directly from the estimated rotations. Our procedure to rectify two images (referred to as image1 and image2) is composed of three main steps. First, we estimate the VPs and the rotations associated with the two images by our algorithm presented in Section 4. Second, we compute the relative rotation between these images and we apply the inverse of the relative rotation to the spherical points of image2. The third and last step consists of computing the color of the corresponding pixels. Indeed, when we back-project the spherical points onto the image plane, the coordinates are not integers and therefore there is no direct one-to-one mapping. For this purpose, a traditional bicubic interpolation method is applied. Figure 15 illustrates the results obtained by the proposed method on a catadioptric video sequence. This simple technique is very effective and provides several advantages.
First, a translation component is not required and thus the method can work in pure or quasi-pure rotation motions, as illustrated in Figure 15(a,b) acquired at the same position (pure rotation). Second, it overcomes the assumption that the first image must be vertically aligned. Indeed, since we can extract the vertical VP, we can rectify any images so that they become aligned with the vertical direction (cf. Figure 15(c,f) and Figure 16). Third, it should be noted that, even in the case of very large rotational motion, the corresponding vanishing directions are perfectly aligned which proves that the rotation effect contained in the images has been correctly rectified. This is illustrated in Figure 15 where the images (a) and (b) have a very high yaw rotation of about 90◦ . A similar result is shown in Figure 16 for a sequence acquired by a Point Grey’s Ladybug camera. This experiment contains some interesting challenges. First, this camera is not exactly central and the environment is a tight 3D indoor scene. Moreover, it contains only a few dominant lines. Despite these difficulties and the strong rotation of the original image, the rectified image is perfectly vertical. The complete original and rectified sequences are available in the attached video.
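A minimal sketch of the rotational part of the rectification described above is given below; it assumes that each estimated rotation maps the common VP frame into the corresponding camera frame, and it leaves out the sphere-to-image resampling (bicubic interpolation) step.

```python
# Rotational rectification of the spherical points of image2 towards the
# orientation of image1 (illustrative convention, see the note above).
import numpy as np

def stabilize_sphere_points(R1, R2, sphere_points2):
    """R1, R2: 3x3 rotations estimated for image1 and image2;
    sphere_points2: (N, 3) unit vectors on the equivalent sphere of image2."""
    R_rel = R2 @ R1.T               # relative rotation: image1 -> image2 frame
    # apply the inverse relative rotation to each point (row-vector form)
    return sphere_points2 @ R_rel
```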
5.4. 3D reconstruction with a rotating laser This section presents the role that our rotation estimation framework can play in a 3D reconstruction application, by combining an omnidirectional camera and a 2D invisible laser range finder (LRF). We embedded the laser and the camera on a Pioneer2 wheeled robot manufactured by MobileRobots, turning at a fixed location (pure rotation) in the middle of an indoor scene. The SICK LRF used in the experiments uses a beam of invisible light to detect the depth of points over a typical range of 180° every 0.5°,
in only one 2D plane. Therefore, two steps are necessary to build the entire scene in 3D: first, several sets of laser points are acquired by turning the Pioneer2 and, second, the motion of the LRF is estimated to gather the laser data in a common coordinate system. However, estimating the LRF motion from vertical laser data only is generally not possible. To overcome this, we estimate the rotation of the attached catadioptric camera by the proposed VP-based approach and perform camera–laser extrinsic calibration (Zhang and Pless 2004).

Fig. 16. Vertical rectification. Top row: original image with a strong disorientation. Middle row: VP extraction. Bottom row: rectified image where the camera becomes aligned with the gravity and the vertical lines of the world become vertical straight lines.
Fig. 17. Our Pioneer2 robot in the lobby (left) and a look-up on its catadioptric camera and laser range finder (right).
Fig. 18. Rotation estimation by the proposed method, which allows us, in turn, to estimate the laser rotation given the extrinsic camera– laser calibration for 3D reconstruction purpose.
Figure 17 shows our robot in the middle of a lobby and the embedded equipment. The Pioneer2 robot was controlled to turn 360◦ at a fixed position and to acquire an omnidirectional image with the associated SICK laser data at every 1◦ in a ‘scan-and-go’ fashion. Figure 18 illustrates the VP extraction and Figure 19 shows the typical results of the final 3D reconstruction. One might notice how well the reconstructed 3D scene resembles the real scene. A virtual
tour of the 3D reconstruction is available in the attached video.
6. Conclusion In this paper, we have presented a top-down framework for rotation estimation from omnidirectional images. We
have privileged omnidirectional vision because many parallel lines in different directions can be observed, especially in urban environments, as shown by our experiments. The proposed top-down approach provides several advantages and contributions: it runs in real time, ensures the orthogonality of vanishing points and can use a priori rotation information. It can also be applied to a wide range of vision sensors (catadioptric, polydioptric, etc.) thanks to the equivalent sphere concept. Finally, we performed several experiments on various and challenging video sequences: catadioptric/polydioptric, hand-held/embedded on a car, etc. The experimental results have demonstrated the accuracy and the robustness of the proposed approach with respect to line outliers, dynamic environments and motion blur. We have also successfully applied our method to the stabilization of omnidirectional videos and 3D reconstruction.

Fig. 19. Three-dimensional reconstruction (and the associated real images when available) obtained by combining a laser and an omnidirectional camera.

Funding This work has been supported by the National Strategic R&D Program for Industrial Technology, Korea.
Acknowledgments

This work has been initiated within the STAR project of the Hubert Curien (Egide) partnership between RCV Lab at KAIST-Korea and MIS at UPJV-France. The authors gratefully acknowledge the contribution of Roger Blanco Ribera for the development of the OpenGL simulation program, the precious help of Pierre-Yves Laffont for the laser experiments, and the interesting discussions with Davide Scaramuzza about the nonholonomic constraints during JC Bazin's research stay at ETH Zurich supported by BK21. Most of this work was performed when JC Bazin was with RCV Lab, KAIST, Korea, and C Demonceaux and P Vasseur were with MIS Laboratory, UPJV, France.
References

Agarwal S, Snavely N, Simon I, Seitz SM and Szeliski R (2009) Building Rome in a day. In International Conference on Computer Vision (ICCV'09).
Aguilera DG, Lahoz JG and Codes JF (2005) A new method for vanishing point detection in 3D reconstruction from a single view. In Proceedings of ISPRS Commission, pp. 197–210.
Akbarzadeh A, Frahm J-M, Mordohai P, Clipp B, Engels C, Gallup D, et al. (2006) Towards urban 3D reconstruction from video. In International Symposium on 3D Data Processing Visualization and Transmission (3DPVT'06), pp. 1–8.
Albrecht T, Tan T, West G and Ly T (2010) Omnidirectional video stabilisation on a virtual camera using sensor fusion. In International Conference on Control, Automation, Robotics and Vision (ICARCV'10).
Antone ME and Teller SJ (2000) Automatic recovery of relative camera rotations for urban scenes. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'00), pp. 282–289.
Banno A and Ikeuchi K (2009) Omnidirectional texturing based on robust 3D registration through Euclidean reconstruction from two spherical images. In Computer Vision and Image Understanding (CVIU'09).
Barnard ST (1983) Interpreting perspective images. Artificial Intelligence Journal 21: 435–462.
Barreto JP and Araujo H (2005) Geometric properties of central catadioptric line images and their application in calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence 27: 1327–1333.
Bazin JC, Demonceaux C, Vasseur P and Kweon I (2009) Motion estimation by decoupling rotation and translation in catadioptric vision. In Computer Vision and Image Understanding (CVIU'09).
Bazin JC, Kweon I, Demonceaux C and Vasseur P (2007) Rectangle extraction in catadioptric images. In ICCV Workshop on Omnidirectional Vision, Camera Networks and Non-classical Cameras (OMNIVIS'07).
Bazin JC, Kweon I, Demonceaux C and Vasseur P (2008a) A robust top down approach for rotation estimation and vanishing points extraction by catadioptric vision in urban environment. In Proceedings IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS'08).
Bazin JC, Kweon I, Demonceaux C and Vasseur P (2008b) Spherical region-based matching of vanishing points in catadioptric images. In ECCV Workshop on Omnidirectional Vision, Camera Networks and Non-classical Cameras (OMNIVIS'08).
Benhimane S and Malis E (2006) A new approach to vision-based robot control with omni-directional cameras. In IEEE International Conference on Robotics and Automation (ICRA'06).
Bosse M, Rikoski R, Leonard J and Teller S (2003) Vanishing points and 3D lines from omnidirectional video. The Visual Computer 19: 417–430.
Campbell J, Sukthankar R, Nourbakhsh I and Pahwa A (2005) A robust visual odometry and precipice detection system using consumer-grade monocular vision. In IEEE International Conference on Robotics and Automation (ICRA'05), pp. 3421–3427.
Chang P and Hebert M (1998) Omni-directional visual servoing for human-robot interaction. In Proceedings IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS'98), Vol. 3, pp. 1801–1807.
Corke P, Strelow D and Singh S (2004) Omnidirectional visual odometry for a planetary rover. In Proceedings IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS'04), pp. 4007–4012.
Coughlan J and Yuille A (2000) The Manhattan world assumption: Regularities in scene statistics which enable Bayesian inference. In NIPS.
Coughlan J and Yuille A (2003) Manhattan world: Orientation and outlier detection by Bayesian inference. In Neural Computation (NC'03).
Davison A (2003) Real-time simultaneous localisation and mapping with a single camera. In IEEE International Conference on Computer Vision (ICCV'03), Vol. 2, pp. 1403–1410.
Davison A, Calway A and Mayol W (2007) Visual SLAM. In BMVC 2007 Visual SLAM Tutorial.
Dellaert F, Seitz S, Thorpe C and Thrun S (2000) Structure from motion without correspondence. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'00), pp. 557–564.
Demonceaux C, Vasseur P and Pégard C (2006) Robust attitude estimation with catadioptric vision. In Proceedings IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS'06).
Denis P, Elder JH and Estrada FJ (2008) Efficient edge-based methods for estimating Manhattan frames in urban imagery. In Proceedings of the European Conference on Computer Vision (ECCV'08), pp. 197–210.
Doubek P and Svoboda T (2002) Reliable 3D reconstruction from a few catadioptric images. In ECCV Workshop on Omnidirectional Vision, Camera Networks and Non-classical Cameras (OMNIVIS'02), pp. 71–78.
Ettinger SM, Nechyba MC, Ifju PG and Waszak M (2003) Vision-guided flight stability and control for micro air vehicles. Advanced Robotics 17: 617–640.
Fischler MA and Bolles RC (1981) Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24: 381–395.
Geyer C and Daniilidis K (2001) Catadioptric projective geometry. International Journal of Computer Vision 45: 223–243.
Hartley RI and Zisserman A (2004) Multiple View Geometry in Computer Vision, 2nd edn. Cambridge: Cambridge University Press.
Horn BKP (1987) Closed-form solution of absolute orientation using unit quaternions. Journal of the Optical Society of America A 4: 629–642.
Hrabar S and Sukhatme G (2003) Omnidirectional vision for an autonomous helicopter. In IEEE International Conference on Robotics and Automation (ICRA'03), pp. 557–564.
Hrabar S and Sukhatme G (2004) A comparison of two camera configurations for optic-flow based navigation of a UAV through urban canyons. In Proceedings IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS'04).
Jeong Y, Nister D, Steedly D, Szeliski R and Kweon I (2010) Pushing the envelope of modern methods for bundle adjustment. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'10), pp. 1474–1481.
Jones E, Fulkerson B, Frazzoli E, Kumar D, Walters R, Radford J, et al. (2006) Autonomous off-road driving in the DARPA grand challenge. In IEEE/ION Position, Location, And Navigation Symposium 2006, pp. 25–27.
Karlsson N, Bernardo ED, Ostrowski J, Goncalves L, Pirjanian P and Munich ME (2005) The vSLAM algorithm for robust localization and mapping. In IEEE International Conference on Robotics and Automation (ICRA'05).
Kim J, Hwangbo M and Kanade T (2010) Spherical approximation for multiple cameras in motion estimation: its applicability and advantages. Computer Vision and Image Understanding 114: 1068–1083.
Kim J-H, Hartley R, Frahm J and Pollefeys M (2007) Visual odometry for non-overlapping views using second-order cone programming. In Proceedings of the Asian Conference on Computer Vision (ACCV'07), pp. 353–362.
Kim J-H, Li H and Hartley R (2008) Motion estimation for multi-camera systems using global optimization. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'08).
Kosecka J and Zhang W (2002) Video compass. In Proceedings of the European Conference on Computer Vision (ECCV'02), pp. 657–673.
Lemaire T and Lacroix S (2007) SLAM with panoramic vision. Journal of Field Robotics 24(1–2).
Lhuillier M (2007) Toward flexible 3D modeling using a catadioptric camera. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'07), pp. 1–8.
Lobo J and Dias J (2007) Relative pose calibration between visual and inertial sensors. The International Journal of Robotics Research 26: 561–575.
Magee MJ and Aggarwal JK (1984) Determining vanishing points from perspective images. In Computer Vision, Graphics and Image Processing, pp. 256–267.
Makadia A and Daniilidis K (2006) Rotation recovery from spherical images without correspondences. IEEE Transactions on Pattern Analysis and Machine Intelligence 28: 1170–1175.
Martins A, Aguiar P and Figueiredo M (2005) Orientation in Manhattan: Equiprojective classes and sequential estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence 27: 822–826.
Mei C (2007) Laser-augmented Omnidirectional Vision for 3D Localisation and Mapping. PhD Thesis.
Mei C and Rives P (2007) Single view point omnidirectional camera calibration from planar grids. In IEEE International Conference on Robotics and Automation (ICRA'07), pp. 3945–3950.
Mičušík B, Martinec D and Pajdla T (2004) 3D metric reconstruction from uncalibrated omnidirectional images. In Proceedings of the Asian Conference on Computer Vision (ACCV'04), pp. 545–550.
Mouragnon E, Lhuillier M, Dhome M, Dekeyser F and Sayd P (2006) Real time localization and 3D reconstruction. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), pp. 363–370.
Nistér D and Stewénius H (2006) A minimal solution to the generalized 3-point pose problem. Journal of Mathematical Imaging and Vision 27(1): 67–69.
Quan L and Mohr R (1989) Determining perspective structures using hierarchical Hough transform. Pattern Recognition Letters 9: 279–286.
Rives P and Azinheira JR (2004) Linear structures following by an airship using vanishing point and horizon line in a visual servoing scheme. In IEEE International Conference on Robotics and Automation (ICRA'04), pp. 255–260.
Rondon E, Garcia-Carrillo L-R and Fantoni I (2010) Vision-based altitude, position and speed regulation of a quadrotor rotorcraft. In Proceedings IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS'10), pp. 628–633.
Rother C (2000) A new approach for vanishing point detection in architectural environments. In Proceedings of the British Machine Vision Conference (BMVC'00), pp. 382–391.
Rother C (2002) A new approach for vanishing point detection in architectural environments. Image and Vision Computing 20: 647–656.
Scaramuzza D, Fraundorfer F and Siegwart R (2009) Real-time monocular visual odometry for on-road vehicles with 1-point RANSAC. In IEEE International Conference on Robotics and Automation (ICRA'09), pp. 4293–4299.
Scaramuzza D and Siegwart R (2008) Appearance-guided monocular omnidirectional visual odometry for outdoor ground vehicles. IEEE Transactions on Robotics 24: 1015–1026.
Shufelt JA (1999) Performance evaluation and analysis of vanishing point detection techniques. IEEE Transactions on Pattern Analysis and Machine Intelligence 21: 282–288.
Tardif J, Pavlidis Y and Daniilidis K (2008) Monocular visual odometry in urban environments using an omnidirectional camera. In Proceedings IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS'08), pp. 2531–2538.
Torii A, Havlena M and Pajdla T (2009) From Google Street View to 3D city models. In ICCV Workshop on Omnidirectional Vision, Camera Networks and Non-classical Cameras (OMNIVIS'09), pp. 2188–2195.
Torii A, Havlena M and Pajdla T (2010) Omnidirectional image stabilization for visual object recognition. International Journal of Computer Vision 91(2): 157–174.
Wang LK, Hsieh S, Hsueh EC, Hsaio F and Huang K (2005) Complete pose determination for low altitude unmanned aerial vehicle using stereo vision. In Proceedings IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS'05), pp. 316–321.
Wildenauer H and Vincze M (2007) Vanishing point detection in complex man-made worlds. In International Conference on Image Analysis and Processing (ICIAP'07), pp. 615–622.
Winters N, Gaspar J, Lacey G and Santos-Victor J (2000) Omnidirectional vision for robot navigation. In ICCV Workshop on Omnidirectional Vision, Camera Networks and Non-classical Cameras (OMNIVIS'00), pp. 21–28.
Ying X and Hu Z (2004) Can we consider central catadioptric cameras and fisheye cameras within a unified imaging model? In Proceedings of the European Conference on Computer Vision (ECCV'04), Vol. 1, pp. 442–455.
Zhang Q and Pless R (2004) Extrinsic calibration of a camera and laser range finder improves camera intrinsic calibration. In Proceedings IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS'04), Vol. 3, pp. 2301–2306.
Appendix: Index to Multimedia Extensions

The multimedia extension page is found at http://www.ijrr.org

Table of Multimedia Extensions

Extension   Type    Description
1           Video   Additional results about VP extraction and rotation estimation in catadioptric and polydioptric image sequences, especially: VP/rotation estimation experiments with a hand-held catadioptric camera; VP/rotation estimation experiments with a polydioptric camera on the Google Street View dataset; rotation visualization with a virtual aircraft; automatic vertical stabilization.