Sensor Fusion in Computer Vision

Alper Yilmaz
Dept. of Civil and Env. Eng. and Geodetic Science, Ohio State University, Columbus, Ohio, 43210
Email: [email protected]

Abstract— Sensor fusion has been an active area of research in the field of computer vision for over two decades. Early approaches to sensor fusion focused on recovering the three-dimensional scene structure from two short baseline cameras, a setup considered similar to the human vision system. Recently, with the availability of sensors of various modalities, computer vision researchers have started looking into these new sensory data for automated understanding of scene content. In this paper, an outline of some of the most common sensor fusion techniques is provided. Although no priority is given to one technique over the others, we selected only a handful of techniques related to sensor fusion in the context of urban city modeling.

I. INTRODUCTION

Sensor fusion is the process of combining two or more sources of data to solve a problem better than the sources could individually. Depending on the modalities of the sensors, sensor fusion can be categorized into two classes: fusion using complementary sensors and fusion using competing sensors. Complementary sensors have different modalities, such as the combination of a Laser Imaging Detection and Ranging (Lidar) sensor and an electro-optical sensor such as a digital camera. In contrast, competing sensors form a sensor suite with the same modality, such as two digital cameras that provide photographic images of the same building from two different viewpoints.

Due to its similarity to human vision, the use of competing sensors, especially photographic images obtained from two digital cameras, has been widespread among Computer Vision (CV) researchers for more than two decades [1]. There are two common uses of multiple competing sensors. The first, which was the initial motivation for such a setup, is to acquire different views of an object. Depending in particular on the proximity of the camera centers, various techniques have been proposed for solving a wide range of problems, from structure recovery to object tracking, in a stereo camera setup. The second is to increase the field of view, which has been used for wide area surveillance applications. Despite the inclination among CV researchers toward imitating human vision, emerging problem areas as well as the complexity of perception in human vision have led CV researchers to use additional sensory data from complementary sensors, such as three-dimensional (3D) scans

generated by sensors such as CyberScan, accompanied by digital images of the generated 3D models. The use of sensors other than regular cameras, which provide a photographic view of the scene, has given rise to new research initiatives. The birth of medical imaging is one such example, where images are obtained from non-traditional imaging techniques; for instance, medical computerized tomography images are generated by the inverse Radon transform of x-ray scans of the body taken from various angles [2].

A common rule of thumb in computer vision is to propose a general solution which does not require scene-specific information. An example of this view is relaxing the requirement for camera information, such as the external and internal camera parameters and the camera gain function. Relaxing the requirement of camera parameters, despite its contradiction with human vision, has resulted in the investigation of ways to analyze the implicit geometry between different views of a scene, such as exploiting the Fundamental Matrix between two views or the Trifocal Tensor between three views. While one can argue that camera parameters are crucial, results from the Computer Vision literature have shown that avoiding them does not deteriorate the expected result. I believe the solution to a Digital Photogrammetry or Computer Vision problem should be general and relax the requirement of camera parameters, while camera parameters should be used to improve the results when they are available.

In this paper, I give an overview of sensor fusion techniques developed and used in the field of Computer Vision. In this regard, I do not limit the discussion to urban city modeling only, as I believe most of the methods developed in the literature have application to a wide range of problems as long as the nature of the sensory data is similar. Under this observation, in the following section I start the discussion with shape recovery from images acquired using competing sensors. In this section, I briefly describe the short-baseline camera setup and the generation of a disparity map from an image pair. I extend the discussion on short baseline cameras to shape from motion, which has applications in urban city modeling using ground sensors. Following the short baseline approaches, I close Section II with the wide baseline camera setup, where two or more distant cameras are used to recover the object shape. In Section III, I discuss various Computer Vision techniques which are used to fuse sensory data from a set of complementary sensors.


The motivation of Section III is the fusion between images and the 3D point cloud generated from a Lidar sensor, which is commonly used in urban city modeling. Based on this observation, the theme of that section is the reconstruction of 3D object shapes from an image and a 3D model which, in computer vision, is usually in the form of a generic model generated from complementary sensor data. Finally, conclusions are drawn in Section IV.

II. COMPETING SENSORS

Photographic images from two or more competing sensors have long been considered in Computer Vision research. Early studies on the fusion of these images started by mimicking the human vision system and were composed of two proximal cameras, also known as a short-baseline stereo camera pair, with similar orientations and, hence, similar fields of view (FOV) [1]. The work on short baseline cameras has led the field to consider moving cameras, where two images captured at consecutive time instants can be treated as images from two short-baseline cameras. The research on short baseline cameras has in turn led to methods that relax the proximity constraint by placing the camera pair at distant locations, such that the cameras provide a wide baseline image pair. Compared to the short baseline setup, wide baseline cameras see different faces of the objects, which provides better recovery of the shape.

A. Short-Baseline Stereo Camera Pair

The short baseline camera setup, shown in Figure 1a, is generally used to compute the disparity of each pixel by establishing correspondences in the captured image pair. Motivated by the perception of depth in human vision, the disparity provides the depth of a pixel relative to the other pixels in the image by exploiting the fact that the closer an object is to the camera, the higher its disparity is in the image pair. The disparity between corresponding pixels in an image pair is given by d = |x_l - x_r|. Assuming the projection centers of both cameras are at the image center, using perspective projection and similar triangles, the depth of the 3D point can be computed as

  Z = f \frac{T}{d},   (1)

where T is the distance between the camera centers and f is the focal length, assumed to be the same for both cameras. In this equation, the main complication is the computation of the disparity between two pixels, which are associated by a feature matching technique. Despite being an unresolved problem in the general sense, feature matching for a short baseline image pair can easily be performed by template matching techniques [3, p. 57]. Generally, templates in the form of a window around each pixel in one image are chosen and searched for in the other image. Despite the intuition behind using templates, projective distortion in the image pair may result in poor disparity estimates. Recent approaches address this problem by applying global constraints, such as uniqueness and continuity of the disparity, as well as occlusion reasoning to preserve discontinuities [4].
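As an illustration of how the template search and equation (1) interact, the following minimal Python/NumPy sketch matches a window along a scanline of a rectified image pair and converts the resulting disparity to depth. The function names, window size, and search range are illustrative choices and are not part of [3] or [4].

```python
import numpy as np

def match_along_scanline(left, right, row, x_l, win=5, max_disp=64):
    """Search the same scanline of the right image for the window that best
    matches (smallest SSD) the window centred at (row, x_l) in the left image.
    Assumes a rectified short-baseline pair and windows that stay inside
    both images."""
    half = win // 2
    template = left[row - half:row + half + 1,
                    x_l - half:x_l + half + 1].astype(float)
    best_x, best_cost = x_l, np.inf
    for x_r in range(max(half, x_l - max_disp), x_l + 1):
        window = right[row - half:row + half + 1,
                       x_r - half:x_r + half + 1].astype(float)
        cost = np.sum((template - window) ** 2)   # sum of squared differences
        if cost < best_cost:
            best_cost, best_x = cost, x_r
    return best_x

def depth_from_disparity(x_l, x_r, f, T):
    """Eq. (1): Z = f * T / d with disparity d = |x_l - x_r|."""
    d = abs(x_l - x_r)
    return np.inf if d == 0 else f * T / d   # zero disparity: point at infinity
```

Running the matcher for every pixel of a rectified pair yields a disparity map, which the second function converts to relative depth once hypothetical values of f and T are supplied.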

B. Structure From Motion

An intuitive extension of short baseline stereo is its application to a video clip acquired by a moving camera. In this scenario, consecutive frames from the clip can be considered an image pair with a short baseline. Compared to a short baseline image pair, the difference between consecutive frames of a video is very small due to high sampling rates, such as 30 frames per second. Frames from a video also provide richer content and allow exploiting a history of correspondences from past image pairs. This property permits the use of both point tracking techniques [5] and optical flow computation [6] to find a dense set of disparities. Despite the availability of a dense disparity set, recovering the object shape using equation (1) is sensitive to noise, which occurs frequently in disparity maps generated from a very short baseline camera pair, such as consecutive frames of a video. To overcome the noise problems related to recovering shape from pairwise disparities, a simultaneous solution can be computed by exploiting all the disparities at once [7]. This is achieved by organizing the point correspondences computed over F frames into trajectories of P points and forming a 2F x P measurement matrix H. The spatial coordinates of each point in H are mean-normalized to discard the translation effect, such that the only remaining motion is rotation. The camera rotation R and the object shape S are obtained by factorizing H as

  H_{2F \times P} = R_{2F \times 3} S_{3 \times P},   (2)

using singular value decomposition. This approach has recently been applied to recovering the shape of buildings from close-range imagery [8].
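The following sketch illustrates the rank-3 factorization of (2) with NumPy's SVD. It recovers R and S only up to an affine ambiguity and omits the metric upgrade step of [7], so it should be read as a schematic of the idea rather than the full method; the function name is an illustrative assumption.

```python
import numpy as np

def factorize_tracks(H):
    """Rank-3 factorization of a 2F x P measurement matrix (Eq. 2).

    H stacks the image coordinates of P tracked points over F frames.
    Returns the motion matrix R (2F x 3) and shape matrix S (3 x P),
    defined up to an affine ambiguity (metric upgrade omitted).
    """
    # Remove the per-row mean so the translation component is discarded.
    H0 = H - H.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(H0, full_matrices=False)
    # Keep the three dominant singular values (rank-3 approximation).
    root = np.diag(np.sqrt(s[:3]))
    R = U[:, :3] @ root       # 2F x 3 camera (rotation) matrix
    S = root @ Vt[:3, :]      # 3 x P object shape
    return R, S
```

Splitting the square root of the singular values between the two factors is one conventional choice; any invertible 3 x 3 matrix inserted between R and S gives an equally valid affine factorization.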

[Fig. 1. (a) Short baseline setup; (b) wide baseline setup. The diagram labels the camera centers, baselines, image planes, epipoles, and a world point.]

C. Wide Baseline Camera Setup

In wide baseline stereo, recovering the object shape relies on images taken from distant viewpoints. Compared to short baseline images, wide baseline images expose the object shape better because of the higher triangulation angle between the rays from a point in object space to the corresponding points in the images (see Figure 1b for the wide baseline camera setup). Recovering the object shape from a wide baseline image pair requires establishing point correspondences between the two images. Practically, point correspondences can be established by computing the appearance similarity between interest points in the two images. Interest points are detected based on the texture density in the proximity of a pixel [9]. Recent advances in interest point detection have also provided novel techniques to encode the appearance in the locality of the interest points, which can then be matched using various distance measures [10], [11].
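A minimal sketch of the matching step is given below: descriptor vectors in the style of [10], [11] are compared with Euclidean distance and accepted with a nearest-neighbour distance-ratio test. The ratio threshold and the function signature are illustrative assumptions rather than part of the cited detectors.

```python
import numpy as np

def match_descriptors(desc_a, desc_b, ratio=0.8):
    """Match two sets of interest-point descriptors (one descriptor per row)
    using the nearest-neighbour distance-ratio test. Returns a list of
    (index_in_a, index_in_b) pairs."""
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)   # Euclidean distances
        nn = np.argsort(dists)[:2]                   # two closest candidates
        if len(nn) == 2 and dists[nn[0]] < ratio * dists[nn[1]]:
            matches.append((i, int(nn[0])))
    return matches
```

The ratio test rejects ambiguous matches whose best and second-best candidates are almost equally similar, which is what makes the wide baseline association tractable in practice.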

Given the point correspondences, and hence the disparities, the two images can be geometrically related by means of the coplanarity of the vectors defined by the camera centers and the world point, together with the relative camera rotation R, translation S, and internal parameters M. This relation is analytically described by

  x_{left}^{\top} M_{left}^{-\top} R S M_{right}^{-1} x_{right} = 0,   (3)

or, more compactly,

  x_{left}^{\top} F x_{right} = 0,   (4)

where F is called the fundamental matrix [12]. From corresponding points x_left and x_right, the fundamental matrix can be estimated by solving a homogeneous equation system using various techniques such as RANSAC [13]. The fundamental matrix also yields the epipole e, the point of intersection of all the epipolar lines, which is the projection of one camera center onto the other camera's image plane, as shown in Figure 1b. Given the epipoles in the image pair, we can recover the camera parameters as P_left = [e_left x F | e_left] and P_right = [I_3 | 0]. These camera parameters in turn provide the 3D points, and hence the object shape, by triangulating the image points [14, p. 310]. The reliability of the depth estimation can be increased by adding an additional image of the object and recovering the trifocal tensor, which is an extension of the fundamental matrix to three views. Trifocal tensors have recently been used to reconstruct building shapes from a sequence of images [15], [16]. Despite being a better camera setup, a wide baseline image pair poses a harder depth estimation problem due to occlusions and differences between the views. Problems related to occlusion can be reduced by increasing the number of images covering different views of the object. The difference between the views, however, will remain a bottleneck in associating correspondences in image pairs.
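The sketch below estimates F by a normalized least-squares (eight-point style) solution of the homogeneous system implied by (4). It is a simplified stand-in for the robust estimators such as RANSAC discussed in [13], and the helper names are illustrative.

```python
import numpy as np

def normalize(pts):
    """Translate/scale 2D points so their centroid is at the origin and the
    mean distance from it is sqrt(2) (Hartley-style normalization)."""
    centroid = pts.mean(axis=0)
    scale = np.sqrt(2) / np.mean(np.linalg.norm(pts - centroid, axis=1))
    T = np.array([[scale, 0.0, -scale * centroid[0]],
                  [0.0, scale, -scale * centroid[1]],
                  [0.0, 0.0, 1.0]])
    pts_h = np.column_stack([pts, np.ones(len(pts))])
    return (T @ pts_h.T).T, T

def fundamental_matrix(x_left, x_right):
    """Estimate F from >= 8 correspondences so that x_left^T F x_right = 0 (Eq. 4)."""
    xl, Tl = normalize(np.asarray(x_left, float))
    xr, Tr = normalize(np.asarray(x_right, float))
    # Each correspondence contributes one row of the homogeneous system A f = 0.
    A = np.column_stack([
        xl[:, 0] * xr[:, 0], xl[:, 0] * xr[:, 1], xl[:, 0],
        xl[:, 1] * xr[:, 0], xl[:, 1] * xr[:, 1], xl[:, 1],
        xr[:, 0], xr[:, 1], np.ones(len(xl))])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    # Enforce the rank-2 constraint so all epipolar lines meet at the epipole.
    U, s, Vt = np.linalg.svd(F)
    F = U @ np.diag([s[0], s[1], 0.0]) @ Vt
    return Tl.T @ F @ Tr        # undo the normalization
```

Wrapping this least-squares solve in a hypothesize-and-verify loop over random minimal subsets would give the RANSAC variant referenced in the text.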

III. COMPLEMENTARY SENSORS

Complementary sensors are used to reveal information about a scene that is not observed by the other sensors in the suite. The sensors used for a particular task are generally selected such that they provide different insights into the problem at hand, simplify complexities that may be encountered, and increase the robustness of the solution.

In computer vision, sensor fusion between complementary sensors is not a common practice. In the author's view, a main reason behind this unpopularity lies in the definition of computer vision as understanding the scene from photographic images. Despite the unpopularity of complementary sensors, the algorithms proposed in the field are usually applicable to different sensory data. The generality of the proposed techniques has led vision researchers to look into similar problems in sensory data generated by sensors other than digital cameras. An example of this is the direct application of object detection, object tracking, and image segmentation algorithms designed for photographic images to infrared images, which encode the heat emitted in the scene. In military and commercial surveillance applications [17], it has become common practice to use a combination of both sensors and fuse the information to overcome domain-specific problems; for instance, object shadows observed in photographic images are not observed in infrared images.

A. Fusion Using Mutual Information

Mutual information between two data sets quantifies the dependence between them by measuring how much knowing one data set reduces the uncertainty of the other. For instance, if two complementary sensors provide data of two different scenes, then X and Y are independent, such that p(x, y) = p(x)p(y), and their mutual information is 0. Formally, the mutual information between random variables X and Y is defined as

  M(X; Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)},   (5)

where p(x, y) is the joint probability distribution function (PDF) of X and Y, and p(x) and p(y) are the marginal PDFs. Mutual information also has a close relation to entropy:

  M(X; Y) = H(X) - H(X|Y)                       (6)
          = H(X) + H(Y) - H(X, Y),              (7)

where H(X) and H(Y) are the marginal entropies and H(X, Y) is the joint entropy [18].
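As a concrete illustration of (5)-(7), the following sketch estimates the mutual information of two co-registered images from their joint histogram. The bin count and the discretization into histogram bins are illustrative simplifications, not part of the cited formulation.

```python
import numpy as np

def mutual_information(img_a, img_b, bins=32):
    """Estimate M(X;Y) of Eq. (5) from the joint intensity histogram of two
    images covering the same scene (e.g. a photographic and an infrared image)."""
    joint, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(), bins=bins)
    p_xy = joint / joint.sum()                 # joint PDF p(x, y)
    p_x = p_xy.sum(axis=1, keepdims=True)      # marginal p(x)
    p_y = p_xy.sum(axis=0, keepdims=True)      # marginal p(y)
    nz = p_xy > 0                              # skip empty bins to avoid log(0)
    return np.sum(p_xy[nz] * np.log(p_xy[nz] / (p_x @ p_y)[nz]))
```

The value is maximal when the two inputs are perfectly predictable from one another and drops to zero when they are independent, which is what makes it a useful alignment score.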

Mutual information has been a common tool for registering data from two complementary sensors for over a decade [19]. Due to its relevance to the registration of Lidar point clouds and aerial images of urban sites, I briefly highlight the application of mutual information in the context of registering a 3D surface with an image. The 3D surface in the following discussion is generated via CyberScan hardware, which provides a triangulated dense point cloud for face images and can be considered a terrestrial Lidar scanner. Let the 3D surface be denoted by u; the relation between u and any image of it is given by

  v(T(x)) = F(u(x), q) + \mu,   (8)

where T is the transformation of the image to generate a similar pose, F is the imaging function, \mu is the imaging noise, which is assumed to be Gaussian, and q is the set of external parameters required for imaging, such as the lighting direction. Since the lighting conditions and surface reflectance are usually unknown, the imaging function can be simplified to a relation between the image intensities and surface normals.¹ This relation can be expressed in terms of the mutual information observed under the transformation T between the image generated from the 3D model, u_F = F_q(u(x)), and the input image, v_T = v(T(x)):

  M(u_F, v_T) = H(u_F) + H(v_T) - H(u_F, v_T).   (9)

The entropy of the random variables z = u_F or z = v_T can be approximated by

  H(z) \approx \frac{-1}{N_B} \sum_{z_i \in B} \log \frac{1}{N_A} \sum_{z_j \in A} G_\gamma(z_i - z_j),   (10)

where A and B are two sample sets randomly selected from the image or the 3D point cloud, N_A and N_B denote the numbers of samples, G_\gamma is the assumed Gaussian distribution of the variables, and \gamma is its constant covariance. The mutual information given in (9) can be maximized by taking its derivative with respect to T² and iteratively solving using Newton's method.

One of the main issues with mutual information is the definition of the probability distribution functions for the two different inputs which emphasize similar information. The general rule of thumb in the Computer Vision literature has been the use of a Gaussian distribution. In the method discussed above, a Gaussian distribution would be a valid choice; however, images of urban sites usually contain multiple surface types with different reflectance properties, such as roads, vegetation, and buildings. In this case, either random sampling is directed by a segmentation procedure to guarantee that the chosen samples come from similar texture, or different probability distribution functions should be used to model the different surface properties simultaneously. One such choice is the use of Parzen windows [18], where a non-parametric distribution is fit to the data while preserving the different surface reflectance characteristics.

¹ This assumption requires that the light source and the observer be far away from the objects in the scene, which suits the urban modeling scenario very well.
² The first term on the right of the equality vanishes since it has no dependency on the transformation parameters.
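A minimal sketch of the entropy approximation in (10) is given below, using a one-dimensional Gaussian Parzen window over two randomly drawn sample sets. Treating the samples as scalar intensities and \gamma as a scalar variance is an illustrative simplification of the formulation in [19]; the function name is assumed.

```python
import numpy as np

def entropy_estimate(samples_B, samples_A, gamma):
    """Stochastic entropy approximation of Eq. (10):
    H(z) ~= -(1/N_B) * sum_i log( (1/N_A) * sum_j G_gamma(z_i - z_j) ),
    with the density at each z_i in B estimated by a Gaussian Parzen window
    built from the second sample set A."""
    B = np.asarray(samples_B, float)
    A = np.asarray(samples_A, float)
    norm = 1.0 / np.sqrt(2.0 * np.pi * gamma)
    # Pairwise Gaussian kernel values G_gamma(z_i - z_j), i in B, j in A.
    K = norm * np.exp(-0.5 * (B[:, None] - A[None, :]) ** 2 / gamma)
    densities = K.mean(axis=1)            # (1/N_A) * sum_j G_gamma(z_i - z_j)
    return -np.mean(np.log(densities))    # -(1/N_B) * sum_i log(...)
```

Plugging such estimates of H(u_F), H(v_T), and the joint entropy into (9) gives a differentiable objective whose gradient with respect to the transformation parameters drives the iterative alignment.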

B. Deformable Models

Deformable models exploit the fact that objects belonging to the same class exhibit in-class shape similarity. In Computer Vision, 3D deformable models are used for 3D tracking of human faces [20], expression synthesis, face recognition, and expression recognition. In all these applications, a common and important step is to align a 3D model with an image. Due to its relevance to sensor fusion for city modeling, in the following discussion I describe the alignment between an image and a 3D model.

Three-dimensional deformable models can be in the form of generic models or subspaces. A generic 3D model is composed of a parametric shape that maps a domain to a 3D point cloud such that changes in the parameters create new shapes. Generic models usually do not represent the complete object; for example, not all the faces of a building may be represented in 3D. Alignment between an image and a generic model is performed by iteratively modifying the parameters of the model such that a cost function is minimized. In contrast, subspace-based modeling generates a set of 3D models forming an orthogonal basis, such that any 3D object of the class can be generated from a linear combination of the basis models. Alignment between an image and a subspace model is performed by iteratively modifying the weight of each basis such that a cost function is minimized.

1) Generic 3D models: Deformable generic models are parametric shapes that evolve from an initial configuration to the shape of an object. The evolution is generally governed by the gradient magnitude of the image [21]. In three dimensions, the shape is composed of a set of triangulated points s and a parametric model q that captures the possible modes of deformation that subsets of the model points can undergo. For example, applying such a generic deformable model to the human face, which has a complex structure, requires rigid transformations as well as non-rigid transformations that move certain parts of the face. Due to the complexity of modeling a human face, parametric deformation models are generated manually [20]. In contrast, deformation parameters for rigid objects, such as buildings in an urban site, can be generated simply by means of rigid transformations (scaling, translation, and rotation). An accurate 3D model x_{3D} is estimated by applying deformations to the 3D point cloud:

  x_{3D}(q; u) = q_c + R\, s(u; q_s),   (11)

where q = (q_c^\top, q_\theta^\top, q_s^\top)^\top, q_c is the translation, q_\theta contains the rotation parameters defining R, and u identifies a specific surface point of the model. In this model, the parameters are estimated by a first-order Lagrangian:

  \dot{x}_{3D}(q; u) = L(u)\,\dot{q},   (12)

where L = \frac{\partial x_{3D}}{\partial q} is the model Jacobian [22]. Estimation of the parametric model can be guided by the gradient magnitude (or edges) in the image by projecting the model onto the image plane:

  \dot{x}_{2D}(q; u) = \frac{\partial x_{2D}}{\partial x_{camera}} R_{camera} L(u)\,\dot{q},   (13)

where

  x_{camera}(u) = R_{camera}\, x_{3D}(u), \qquad x_{2D}(u) = \frac{f}{z_{3D}} (x_{3D}, y_{3D})^{\top},

and the partial derivative on the right side of the equation arises from expanding \partial x_{2D} / \partial q by parts, since x_{2D} is a function of x_{camera}. This equation can be solved by Lagrange multipliers [23]. Another form of 3D model used for estimating the structure of an object is the superquadric [24], which is a generic analytical model that does not require the use of complementary sensors. I will not detail this approach here; interested readers can refer to [25].

2) Subspace models: Subspace models benefit from generating a new basis from the data by projecting the multi-dimensional data onto a set of basis functions which emphasize the differences between the samples. A common form of subspace analysis performed in the Computer Vision field is the use of an "eigenspace" generated from the n-dimensional data samples D_{n \times M} (one sample per column) by decomposing the related covariance matrix, C = D D^{\top}, into its eigenvectors u and eigenvalues \lambda:

  C u = \lambda u.   (14)

Eigenvalues define the spread of the projection of the data on the corresponding eigenvector; hence, the larger the eigenvalue, the more important the corresponding eigenvector [26]. In sensor fusion, eigenspace decomposition is used to generate a generic model of an "object class" from a set of data belonging to that class. An example would be the eigenspace of the 3D face scans of different people generated by CyberScan.
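The following sketch builds such an eigenspace with NumPy, computing the eigenvectors of C = D D^T through the SVD of the mean-centred data matrix. Mean-centring and the use of the SVD instead of forming the covariance matrix explicitly are implementation choices assumed here, not prescriptions from the text; the function names are illustrative.

```python
import numpy as np

def build_eigenspace(D, k):
    """Eigenspace of an object class (Eq. 14). Columns of D are M training
    samples (e.g. flattened 3D face scans). Returns the mean sample, the k
    leading eigenvectors of C = D0 D0^T (D0 mean-centred), and their
    eigenvalues, obtained from the SVD of D0."""
    mean = D.mean(axis=1, keepdims=True)
    D0 = D - mean
    U, s, _ = np.linalg.svd(D0, full_matrices=False)
    return mean, U[:, :k], (s ** 2)[:k]

def project(sample, mean, basis):
    """Subspace coordinates (basis weights) of a new scan."""
    return basis.T @ (sample.reshape(-1, 1) - mean)
```

Aligning an image to such a subspace model then amounts to searching over these basis weights (and a pose) so that the reconstructed shape best explains the image evidence.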

IV. CONCLUSION

I have briefly discussed a subset of sensor fusion methods from the Computer Vision literature. While most of these methods are used for problems other than urban city modeling, an effort has been made to select only techniques relevant to generating realistic city models and to justify their relevance to the city modeling domain. The discussion, most of the time, does not go beyond giving the key ideas of the discussed methods. While I have kept the explanations as intuitive as possible, I suggest that enthusiastic readers consult the cited papers for more information and contact me should the need arise. An important property common to almost all the methods "discussed" and "not discussed" is the avoidance of the explicit use of camera parameters. I should note at this point that Computer Vision researchers implicitly model the geometry. Especially in city modeling, imagery is usually tagged with camera orientation parameters. This additional information, when used correctly, will greatly increase the robustness of these methods; hence, a future direction would be the intelligent integration of this additional information into the solution.

ACKNOWLEDGMENT

The author would like to thank Prof. Toni Schenk from the Ohio State University for scholarly discussions which led to the preparation of this paper.

REFERENCES

[1] H. Longuet-Higgins, "A computer algorithm for reconstructing a scene from two projections," Nature, vol. 293, pp. 133-135, 1981.
[2] A. Kak and M. Slaney, Principles of Computerized Tomographic Imaging. IEEE Press, 1988.
[3] W. Forstner and A. Pertl, Photogrammetric Standard Methods and Digital Image Matching Techniques for High Precision Surface Measurements. Elsevier Science Publishers, 1986.
[4] C. Zitnick and T. Kanade, "A cooperative algorithm for stereo matching and occlusion detection," PAMI, vol. 22, no. 7, pp. 675-684, 2000.
[5] K. Shafique and M. Shah, "A non-iterative greedy algorithm for multi-frame point correspondence," PAMI, vol. 27, no. 1, pp. 51-65, 2005.
[6] B. D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," in International Joint Conference on Artificial Intelligence, 1981.
[7] C. Tomasi and T. Kanade, "Shape and motion from images under orthography: A factorization method," IJCV, vol. 9, no. 2, pp. 137-154, 1992.
[8] T. Migita, A. Amano, and N. Asada, "Incremental procedure for recovering entire building shape from close range images," in Int. Arch. of the Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. XXXIV, part 5/W12, 2003, pp. 237-240.
[9] C. Harris and M. Stephens, "A combined corner and edge detector," in 4th Alvey Vision Conference, 1988, pp. 147-151.
[10] D. Lowe, "Distinctive image features from scale-invariant keypoints," IJCV, vol. 60, no. 2, pp. 91-110, 2004.
[11] H. Bay, T. Tuytelaars, and L. V. Gool, "SURF: Speeded up robust features," in ECCV, 2006.
[12] O. Faugeras, "What can be seen in three dimensions with an uncalibrated stereo rig," in ECCV, 1992, pp. 563-578.
[13] P. H. S. Torr and D. W. Murray, "The development and comparison of robust methods for estimating the fundamental matrix," IJCV, vol. 24, no. 3, pp. 271-300, 1997.
[14] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge Univ. Press, 2003.
[15] G. Schindler, P. Krishnamurthy, and F. Dellaert, "Line-based structure from motion for urban environments," in Int. Symp. on 3D Data Processing, Visualization and Transmission, 2006.
[16] A. Akbarzadeh et al., "Towards urban 3D reconstruction from video," in Int. Symp. on 3D Data Processing, Visualization and Transmission, 2006.
[17] A. Yilmaz, K. Shafique, and M. Shah, "Target tracking in airborne forward looking imagery," Journal of Image and Vision Computing, vol. 21, no. 7, pp. 623-635, 2003.
[18] R. Duda, P. Hart, and D. Stork, Pattern Classification, 2nd ed. Wiley, 2000.
[19] P. Viola and W. Wells, "Alignment by maximization of mutual information," in ICCV, 1995, pp. 16-23.
[20] D. DeCarlo and D. Metaxas, "Optical flow constraints on deformable models with applications to face tracking," IJCV, vol. 38, no. 2, pp. 99-127, 2000.
[21] M. Kass, A. Witkin, and D. Terzopoulos, "Snakes: Active contour models," IJCV, vol. 1, pp. 321-332, 1988.
[22] D. Terzopoulos and D. Metaxas, "Dynamic 3D models with local and global deformations: Deformable superquadrics," PAMI, vol. 13, no. 7, pp. 703-714, 1991.
[23] M. Schabracq, C. Cooper, C. Travers, and D. Maanen, Occupational Health Psychology: The Challenge of Workplace Stress. British Psychological Society, 2001.
[24] D. Metaxas and D. Terzopoulos, "Constrained deformable superquadrics and nonrigid motion tracking," in CVPR, 1991, pp. 337-343.
[25] D. Metaxas, Physics-Based Modelling of Nonrigid Objects for Vision and Graphics. PhD thesis, University of Toronto, 1992.
[26] E. Weisstein, CRC Concise Encyclopedia of Mathematics, 2nd ed. CRC Press, 2002.