Integrating Shape from Shading and Range Data Using Neural Networks

Mostafa G.-H. Mostafa†, Sameh M. Yamany, and Aly A. Farag
Computer Vision and Image Processing Lab, EE Department,
University of Louisville, Louisville, KY 40292
{mostafa, yamany, farag}@cvip.uofl.edu, www.cvip.uofl.edu

† On leave from the Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt. This work is supported in part by DoD under contract USNV N00014-97-11076.
Abstract
This paper presents a framework for integrating multiple sensory data, sparse range data and dense depth maps from shape from shading, in order to improve the 3D reconstruction of visible surfaces of 3D objects. The integration process is based on propagating the error difference between the two data sets by fitting a surface to that difference and using it to correct the visible surface obtained from shape from shading. A feedforward neural network is used to fit a surface to the sparse data. We also study the use of the extended Kalman filter for supervised learning and compare it with the backpropagation algorithm. A performance analysis is done to obtain the best neural network architecture and learning algorithm. It is found that the integration of sparse depth measurements greatly enhances the 3D visible surface obtained from shape from shading in terms of metric measurements.
1 Introduction
Three-dimensional (3D) reconstruction of a sensed scene is crucial for machine vision and robotics systems in order for these systems to autonomously interact with their environment. The problem of 3D object reconstruction from two-dimensional (2D) images has attracted much attention due to its theoretical challenge and its many practical applications. Individual vision modules (stereo, shading, texture, etc.) cannot accurately reconstruct the 3D structure of the imaged scene due to insufficient data constraints and the presence of sensor noise. In general, the performance of 3D vision systems is greatly enhanced when various sources of information about the 3D scene (e.g., stereo, range, shading, etc.) are incorporated[1, 2]. The synergistic use of overlapping and complementary data sources provides information that is otherwise not available from individual sources[1]. For example, vision modules based
on intensity images (stereo, shading, etc.) perform poorly in the case of scenes with shadows. This problem does not exist in the information provided by sensors such as a range finder. In this paper, we present a framework for improving the visible surface representation obtained from shape from shading by integrating it with range data from a laser range finder or structured light, which provide depth information about the scene that is not available from shape from shading. The integration process is based on propagating the error difference between the two data sets by fitting a surface to that difference. A multilayer feedforward neural network is used to fit a surface to the sparse data. A performance analysis is done to obtain the best neural network architecture. We also present a comparison between the standard backpropagation (BP) learning algorithm and the use of the extended Kalman filter for supervised learning. This article is organized as follows. First, related work is described in Section 2. Our integration approach and its analysis are presented in Section 3. Section 4 presents our results and discussions. Finally, the paper is summarized and concluded in Section 5.
2 Related Work
Computer vision systems that are based on a single sensor have an inherent weakness: generally, they cannot reduce uncertainty. Sensor uncertainty arises from many sources. On one hand, sensors are naturally noisy. On the other hand, uncertainty may arise from the scene itself, e.g., features missing due to occlusion, or from the sensor type, since most sensors cannot measure all the relevant attributes of the sensed scene, e.g., a camera cannot measure depth information. Recent research has proposed computer vision systems that are based on multiple sensory data to overcome this uncertainty and ambiguity[1]. Most of the vision systems that employ data fusion techniques for 3D scene reconstruction and interpretation can be divided into two categories: systems that integrate multiple cues (i.e., overlapping data) from the same sensing modality[3, 4, 5, 2] and
systems that integrate complementary data from multiple sensors[6, 7, 8]. Data integration is mainly used in dynamic systems, e.g., mobile robots, to dynamically reconstruct a 3D representation of their environment. Redundant information, e.g., multiple intensity images, and/or complementary information, such as intensity, range, infrared, etc., are used in these systems. Ayache and Faugeras[3] employed the extended Kalman filter to fuse multiple views taken by a mobile robot over time in order to build a 3D representation of the robot's environment. Fusion of visual maps corresponding to different positions of the mobile robot results in a better estimation of its displacement between the various viewpoint positions. Porrill[9] and Pollard et al.[4] used Gauss-Markov estimation together with geometric constraints for the feature-level fusion of multiple stereo views to increase the accuracy of a wire-frame model of the 3D object. Cryer et al.[10] presented a method for integrating the high-frequency information of the scene image from shape from shading and the low-frequency information from stereo. Pankanti and Jain[2] integrated perceptual grouping, stereo, shape from shading, and shape from texture modules under a framework that facilitates a plausible interpretation of a scene. Hemayed and Farag[11] integrated edge-based stereo with structured light to improve the reconstruction of 3D models of objects. Wang and Aggarwal[6] integrated information from both active and passive sensing modalities for 3D modeling of objects. Their approach is to extract potential points of interest from the intensity image and then selectively sense the range at these feature points. Richardson and Marsh[7] integrated acoustic measurements with photometric stereo. Allen[8] integrated passive stereo vision and active exploratory tactile sensing to improve object recognition tasks of a robot.
3 Our Approach
Shape From Shading (SFS) is a low-level vision module that produces a dense depth map. It estimates the visible surface shape from local variations in image intensity. In the literature, two general classes of SFS algorithms have been developed: 1) local algorithms[12], which attempt to estimate shape from local variations in image intensity, and 2) global algorithms, which can be further divided into global minimization approaches[13] and global propagation approaches[14]. The global propagation approach attempts to propagate information across a shaded surface starting from points with known surface orientation (singular points), while in global minimization, a solution is obtained by minimizing an energy function. Tsai and Shah[12] have described a linear technique to solve the SFS problem. We have used the Tsai-Shah algorithm, which is more suitable for our application because of its speed and accuracy.
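As a rough illustration of this class of algorithms, the sketch below runs Tsai-Shah-style Newton iterations on the discrete image irradiance equation f(Z) = E - R(p, q) = 0; the Lambertian reflectance form, the sign conventions, the wrap-around handling of image borders, and the numerical derivative (used in place of the closed-form linearization of [12]) are our simplifications, not the authors' implementation.

```python
import numpy as np

def sfs_linear(E, light, iters=20, eps=1e-4):
    """Newton iterations on f(Z) = E - R(p, q) = 0, per pixel.

    E     : intensity image normalized to [0, 1].
    light : unit light-source direction (sx, sy, sz), known a priori.
    """
    sx, sy, sz = light
    Z = np.zeros_like(E, dtype=float)    # initial depth estimate

    def reflectance(p, q):
        # Lambertian shading for the surface normal (-p, -q, 1);
        # sign conventions for p and q vary across formulations.
        return np.maximum(0.0, (-p * sx - q * sy + sz)
                          / np.sqrt(1.0 + p * p + q * q))

    for _ in range(iters):
        # Discrete depth differences approximating the surface gradient
        # (borders wrap around here for brevity).
        p = Z - np.roll(Z, 1, axis=1)
        q = Z - np.roll(Z, 1, axis=0)
        f = E - reflectance(p, q)
        # Perturbing Z[y, x] shifts both p and q by +eps: numerical df/dZ.
        df = ((E - reflectance(p + eps, q + eps)) - f) / eps
        df = np.where(np.abs(df) < 1e-8, 1e-8, df)
        Z = Z - f / df                   # Newton step toward f(Z) = 0
    return Z
```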
The reason for selecting SFS is that smooth surfaces hardly have any of the textured features needed, for example, for shape from stereo. In this case, the only feasible feature is the distribution of light on these surfaces, which leads us to investigate the shape from shading (SFS) approach. Due to its many approximations (e.g., the direction of the light source must be known a priori, the assumption of constant albedo, etc.), shape from shading has some problems. The main problem with the 3D visible surfaces obtained using shape from shading is that they lack metric information. Shape from shading also suffers from discontinuities in the presence of highly textured surfaces and varying albedo. The main idea of this work is the integration of the dense depth map obtained from shape from shading with sparse range data for the reconstruction of 3D visible surfaces with accurate metric measurements. This integration has two advantages. First, it helps remove the ambiguity of the 3D visible surface discontinuities produced by shape from shading. Second, it compensates for the missing metric information in the shape from shading. Laser range finders and structured light devices, contrary to laser range scanners, produce a sparse depth map of the scene. Since these sparse depth points do not contain all the depth information about the surface, they cannot be used directly to represent the visible surfaces accurately. Instead, we propose to propagate the error differences between the available range data and the shape from shading throughout the remaining measurements where only shape from shading data are available. This is done in three steps, as depicted in Fig. 1. First, the consistency of the two sensory measurements is established after registration, and the error difference in the depth measurements between the available range data and the shape from shading is calculated. Second, we fit a surface to that error difference. Finally, the resultant surface is used to correct the shape from shading. The surface fitting process, which is cast as a function approximation, is carried out in this paper using neural networks, as discussed in the following section.
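To make the three steps concrete, here is a minimal sketch of the integration, assuming the sparse range samples are already registered to the SFS image grid; the name fit_error_surface is a hypothetical stand-in for the neural-network surface fit described in Section 3.1, assumed to return a callable model.

```python
import numpy as np

def integrate_sfs_range(z_sfs, range_points, fit_error_surface):
    """Correct a dense SFS depth map using sparse range measurements.

    z_sfs        : (H, W) dense depth map from shape from shading.
    range_points : iterable of (row, col, depth) registered range samples.
    fit_error_surface(coords, errors) -> model, with model(coords)
                   predicting the depth error at arbitrary pixels.
    """
    # Step 1: error difference between the range data and the SFS
    # at the sparse locations where range measurements exist.
    coords = np.array([(r, c) for r, c, _ in range_points], dtype=float)
    errors = np.array([z - z_sfs[int(r), int(c)]
                       for r, c, z in range_points])

    # Step 2: fit a surface to the sparse error differences.
    model = fit_error_surface(coords, errors)

    # Step 3: propagate the fitted error over the whole grid and use it
    # to correct the shape-from-shading surface.
    rows, cols = np.mgrid[0:z_sfs.shape[0], 0:z_sfs.shape[1]]
    grid = np.stack([rows.ravel(), cols.ravel()], axis=1).astype(float)
    return z_sfs + model(grid).reshape(z_sfs.shape)
```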
3.1 Surface Approximation Using Neural Networks
Surface interpolation/fitting has been one of the most intensely studied problems in low-level computer vision. It plays a central role in the construction of a continuous 2½-D sketch from sparse visual data. The computational theories used in conjunction with surface interpolation include variational principles[15] and regularization theory[16, 17].
Figure 1: Functional block diagram for the integration process of shape from shading and range data.

The common element of these computational theories is the minimization of a global energy function composed of many local energy components. This minimization has usually been implemented using iterative algorithms. A weak point of these methods is that they have adjustable parameters and do not perform well when the data are too sparse. Another common problem is that they can find only locally optimal solutions instead of the global minimum of the energy function[18].

Surface interpolation is considered a function approximation problem. Consider a nonlinear input-output mapping defined by the functional relationship $\mathbf{Z} = f(\mathbf{x})$, where the vector $\mathbf{x}$ is the input and the vector $\mathbf{Z}$ is the output. The vector-valued function $f(\cdot)$ is assumed to be unknown. Given a limited set of input-output examples $\mathcal{Z} = \{(\mathbf{x}_i, d_i);\ i = 1, \ldots, N\}$, the requirement is to find the function $F(\cdot)$ that approximates $f(\cdot)$ over all inputs. That is,

$$\|F(\mathbf{x}) - f(\mathbf{x})\| < \epsilon \quad \text{for all } \mathbf{x}, \tag{1}$$

where $\epsilon$ is a small threshold. This function approximation problem is a perfect candidate for supervised learning, with $\mathbf{x}_i$ playing the role of the input vector and $d_i$ serving as the desired response. In supervised learning of a multilayer neural network, the set of examples $\mathcal{Z}$ is used to train the neural network as a model of the unknown system. If $y_i$ is the output of the network produced in response to an input vector $\mathbf{x}_i$, the difference between $d_i$ (associated with $\mathbf{x}_i$) and the network output $y_i$ provides the error signal $e_i = d_i - y_i$, which is equivalent to the difference in Eq. (1). The total error energy in the network output, $E(n) = \frac{1}{2}\sum_{j \in C} e_j^2(n)$, where $C$ is the set of output neurons, is used to minimize the squared difference between the outputs of the unknown system and the neural network in a statistical sense. For a given training set,

$$E_{av} = \frac{1}{N}\sum_{n=1}^{N} E(n) = \frac{1}{N}\sum_{n=1}^{N}\left[\frac{1}{2}\sum_{j \in C} e_j^2(n)\right] \tag{2}$$

is the average squared error energy, which represents the cost function as a measure of learning performance. The objective of the learning process is to adjust the network free parameters (the synaptic weights) to minimize this cost function $E_{av}$[19]. The backpropagation learning algorithm[20, 19] is used for the supervised training of multilayer networks. It applies a correction $\Delta w_{ij}(n) = -\eta \nabla E(n) + \alpha \Delta w_{ij}(n-1)$ to the network synaptic weights $w_{ij}(n)$, where $\eta$ is the learning-rate parameter and $\alpha$ is a momentum constant. Although the BP algorithm is now considered the most popular algorithm for training multilayer networks, it has two main problems. First, appropriate learning parameters $\eta$ and $\alpha$ need to be carefully chosen for each training set, and the tuning of these free parameters is not trivial. Second, the rate of convergence tends to be relatively slow due to the stochastic nature of the algorithm, which in turn makes it computationally expensive. These limitations may be overcome by viewing the supervised training of the network as an optimum filtering problem. The optimum approach to the solution of this problem is to recursively utilize the information contained in the training data, traced back to the first iteration of the learning process, a situation where Kalman filter theory can be utilized. The benefit of using this theory is that it is formulated in terms of state-space concepts, which provides efficient utilization of the information contained in the data. Extended Kalman Filter-based (EKF-based) training has been shown to be an effective and powerful technique[21]. It belongs to the class of second-order training methods; it adapts the weights of the network in a pattern-by-pattern fashion, accumulating important training information in approximate error covariance matrices and providing individually adjusted updates for each of the network's weights.
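To ground the update rule above, the following is a minimal sketch of one pattern-by-pattern backpropagation step with momentum for a small tanh network; the 2-7-1 shape, the initialization, and the constants are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-7-1 network: pixel coordinates in, one depth value out.
W1, b1 = rng.normal(0.0, 0.5, (7, 2)), np.zeros(7)
W2, b2 = rng.normal(0.0, 0.5, (1, 7)), np.zeros(1)
params = [W1, b1, W2, b2]
vel = [np.zeros_like(p) for p in params]   # previous corrections dw(n-1)

eta, alpha = 0.01, 0.9   # learning-rate and momentum constants

def bp_step(x, d):
    """One update: dw(n) = -eta * grad E(n) + alpha * dw(n-1)."""
    h = np.tanh(W1 @ x + b1)             # hidden activations
    y = W2 @ h + b2                      # network output
    e = d - y                            # error signal e = d - y
    # Gradients of E(n) = 0.5 * sum(e^2) with respect to each parameter.
    gy = -e                              # dE/dy
    gW2, gb2 = np.outer(gy, h), gy
    gh = (W2.T @ gy) * (1.0 - h ** 2)    # backpropagate through tanh
    gW1, gb1 = np.outer(gh, x), gh
    for p, v, g in zip(params, vel, (gW1, gb1, gW2, gb2)):
        v[:] = -eta * g + alpha * v      # weight correction with momentum
        p += v                           # apply the correction in place
    return 0.5 * float(e @ e)            # instantaneous error energy E(n)
```

Averaging the returned E(n) over a training epoch gives the cost of Eq. (2). For comparison, here is a sketch of EKF-based training in the spirit of [21], treating the flattened weight vector as the filter state; the numerical Jacobian and the noise constants r and q are our simplifications, not details given in the paper.

```python
import numpy as np

def ekf_step(w, P, x, d, h, r=0.01, q=1e-6, eps=1e-6):
    """One EKF weight update for a scalar-output network y = h(w, x).

    w : flat weight vector (the filter state).
    P : approximate error covariance matrix accumulated over patterns.
    """
    y = h(w, x)
    # Linearize the network around the current weights (numerical Jacobian).
    H = np.empty((1, w.size))
    for i in range(w.size):
        dw = np.zeros_like(w)
        dw[i] = eps
        H[0, i] = (h(w + dw, x) - y) / eps
    S = float(H @ P @ H.T) + r               # innovation variance
    K = (P @ H.T) / S                        # Kalman gain
    w = w + (K * (d - y)).ravel()            # individually adjusted updates
    P = P - K @ H @ P + q * np.eye(w.size)   # covariance update
    return w, P
```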
3.2 Performance Analysis
The first analysis performed on the system is to choose the network training algorithm that is fast and produces a smaller root mean square (RMS) error. The performance of both the backpropagation and Kalman filter training algorithms is studied. As shown in Fig. 2, training a multilayer feedforward neural network using the EKF outperforms training the same network using the standard BP algorithm. The mean square error reaches its global minimum after only 4-10 iterations. In contrast, using BP, the same network needs at least 1000 iterations to learn the same pattern (one epoch with the EKF algorithm takes as much time as 30 epochs with the BP algorithm). Another important issue is that, in contrast to the BP algorithm, no adjustable parameters need to be tuned in the case of the EKF algorithm. This shows the superiority of the Kalman filter learning algorithm over the backpropagation algorithm. The Kalman filter learning algorithm is used in the rest of the analysis.

Figure 2: Comparison between the Kalman filter and backpropagation learning algorithms in the learning phase (RMS error versus training iterations).

Table 1 shows samples of the results of the analysis we have carried out to choose the network topology. For this analysis, we used a ground truth dense depth map registered with intensity images obtained from a laser range scanner. We used only 1% of the range data and applied our approach to the sampled data and the shape from shading. The RMS error between the integrated surface and the ground truth is used as a measure of the network performance. The table shows the RMS error obtained from the integration process using different network topologies. Two different types of surfaces, a sphere as a smooth surface and a chess piece as a free-form surface, are investigated. As the table shows, the best performance is obtained with the network topology 2-7-3-1. That is, two neurons in the input layer (the pixel coordinates in the image), two hidden layers with 7 and 3 neurons, respectively, and one neuron in the output layer (the depth measurement).

Table 1: Root mean square (RMS) error between the integrated surfaces and the ground truth for different neural network topologies, calculated for both smooth and free-form surfaces.

Network topology   Smooth surf.   Free-form surf.
2-3-3-1            0.132328       0.274981
2-5-3-1            0.132320       0.502865
2-7-3-1            0.131924       0.245706
2-9-3-1            0.131947       0.521841
2-7-5-1            0.132058       0.501612
2-9-5-1            0.131991       0.506040
2-9-7-1            0.132123       0.475516
2-5-3-3-1          0.131795       0.536847
2-7-3-3-1          0.131835       0.529776
2-9-3-3-1          0.131798       0.535099
2-7-5-3-1          0.131757       0.543467
2-9-5-3-1          0.131807       0.532535
2-9-7-3-1          0.131849       0.525692
2-9-7-5-1          0.131977       0.504032

To show the importance of integrating the shape from shading with the sparse range data, we calculated the RMS error between the ground truth data and the results of the integration process with and without shape from shading. Figure 3 shows the results of these calculations. It shows that shape from shading is important in the case of a free-form surface (chess piece), while it makes no difference in the case of a smooth surface (sphere). This was expected, as the shape from shading is successful in recovering the curvature of the smooth surface up to a scale factor, which means that only a few range measurements are required to recover the scaling factor. In the case of free-form surfaces, the surface approximation process tends to smooth the surface where range data are not available, which produces a large RMS error, while shape from shading, to some extent, captures the roughness of the surface and corrects the over-smoothing produced by the surface approximation process. The figure also shows that shape from shading is important even at high sampling rates (about 60% of the range data), i.e., with plenty of range data.

Figure 3: RMS error between the integrated surfaces and the ground truth range data, with and without the SFS module, as a function of the sampled fraction of the range data (curves: chess piece with and without SFS, sphere with/without SFS).
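For concreteness, the sketch below shows how such a 2-7-3-1 feedforward approximator might be assembled; the tanh hidden units and the linear output unit are our assumptions, since the paper does not state its activation functions.

```python
import numpy as np

def make_mlp(layer_sizes=(2, 7, 3, 1), seed=0):
    """Weight matrices for a fully connected feedforward network; the
    default shape is the best-performing 2-7-3-1 topology of Table 1."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, 0.5, (m, n)), np.zeros(m))
            for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(net, x):
    """Map pixel coordinates x = (row, col) to a depth (error) value."""
    for i, (W, b) in enumerate(net):
        x = W @ x + b
        if i < len(net) - 1:         # tanh hidden layers, linear output
            x = np.tanh(x)
    return x

# Usage: net = make_mlp(); z = forward(net, np.array([12.0, 34.0]))
```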
4 Results and Discussions
Figure 4 shows the different types of surfaces used in the analysis, a smooth surface (sphere), a smooth surface with a hole (torus), and a free-form surface (chess piece), together with their shape from shading and the integration results. The sampled range data are shown as cross signs in the intensity images. For the sphere, the integration process reduced the RMS error from 0.582, when using the SFS alone, to 0.132 after integrating the range data. For the torus, the RMS error was reduced from 1.253 to 0.31, and for the chess piece from 1.681 to 0.246. The figure also shows the results of reconstructing the 3D surface of a vase by integrating the SFS and a few range data points from a laser range finder. The figure shows that the integration has greatly improved the 3D reconstruction of the visible surface of the vase, which has both sharp edges and smooth surfaces.

Figure 4: Intensity image with range data marked as cross signs (left), 3D visible surfaces obtained from the shape from shading (middle), and the final surfaces obtained from the integration process (right) for different objects.

As another application, Figure 5 shows the results of using this framework in a system for orthodontic measurement [22].
The overall purpose of this system is to develop a model-based vision system for orthodontics, replacing traditional approaches, that can be used in diagnosis, treatment planning, surgical simulation, and for implant purposes. Images are acquired using an intra-oral video camera, and range data are obtained using a 3D digitizer arm. The shape from shading technique is then applied to the intra-oral images. The required accurate orthodontic measurements cannot be deduced from the resulting shape alone, hence the need for some reference range data to be integrated with the shape from shading results. The output of the integration algorithm for each teeth-segment image is a description of the teeth surface in that segment. A fast and accurate 3D registration technique is then performed to register the surfaces from the different views together [22].

Figure 5: (a) Intra-oral intensity images with range data marked as cross signs, (b) 3D visible surfaces obtained from the shape from shading, (c) the final surfaces obtained from the integration process, and (d) a visible surface mesh obtained from registering the two views in (c).
5 Summary and Conclusions
We have presented a framework for integrating shape from shading with range data. The integration method is based on fitting a surface to the error difference between the shape from shading and the sparse range data, available either from a laser range finder or from structured light. This approximated surface is used to correct the shape from shading result. A feedforward multilayer neural network is used as the surface approximator. Two learning algorithms, backpropagation and the extended Kalman filter, were used to train the network. The extended Kalman filter outperformed the backpropagation algorithm in terms of fast convergence and minimum root mean square (RMS) error. The best network topology is obtained by finding the topology that gives the minimum RMS error between the ground truth data and the results of the integration process. The surface approximation using the neural network is found to work exceptionally well in learning and retrieving the smoothness of a surface from a few range data points. The integration of a few ground truth range data points not only greatly improves the 3D visible surface reconstruction obtained from shape from shading, but also produces a 3D visible surface representation with nearly accurate metric measurements. It is found that shape from shading is important for obtaining a better surface reconstruction even when many range data are available (about 60% of the surface) in the case of free-form surfaces.
Acknowledgments
The authors would like to thank the Vision Lab at USF for making their range database available on the Internet.
References
[1] M. A. Abidi and R. C. Gonzalez, eds., Data Fusion in Robotics and Machine Intelligence, Academic Press, 1992.
[2] S. Pankanti and A. K. Jain, "Integrating vision modules: Stereo, shading, grouping, and line labeling," IEEE Trans. on Patt. Anal. and Mach. Intell. 17, pp. 831-842, September 1995.
[3] N. Ayache and O. D. Faugeras, "Building, registering, and fusing noisy visual maps," The Int. Journal of Robotics Research 7(6), pp. 45-65, 1988.
[4] S. B. Pollard, T. P. Pridmore, J. Porrill, J. E. W. Mayhew, and J. P. Frisby, "Geometric modeling from multiple stereo views," The Int. Journal of Robotics Research 8(4), pp. 3-32, 1989.
[5] J. E. Cryer, P. Tsai, and M. Shah, "Combining shape from shading and stereo using human vision model," Tech. Rep. CS-TR-92-25, University of Central Florida, Orlando, Florida, 1992.
[6] Y. F. Wang and J. K. Aggarwal, "On modeling 3-D objects using multiple sensory data," in Proc. IEEE Int. Conf. on Robotics and Automation, pp. 1098-1103, 1987.
[7] J. M. Richardson and K. A. Marsh, "Fusion of multisensor data," The Int. Journal of Robotics Research 7(6), pp. 78-96, 1988.
[8] P. K. Allen, "Integrating vision and touch for object recognition tasks," The Int. Journal of Robotics Research 7(6), pp. 15-33, 1988.
[9] J. Porrill, "Optimal combination and constraints for geometrical sensor fusion," The Int. Journal of Robotics Research 7(6), pp. 66-77, 1988.
[10] J. E. Cryer, P. Tsai, and M. Shah, "Combining stereo and shading," Pattern Recognition 28(7), pp. 1033-1043, 1995.
[11] E. E. Hemayed and A. A. Farag, "Integrating edge-based stereo and structured light for robust surface reconstruction," in Proc. IEEE Int. Conf. on Intelligent Vehicles, (Stuttgart, Germany), October 1998.
[12] P. S. Tsai and M. Shah, "A fast linear shape from shading," in IEEE Conf. on Computer Vision and Pattern Recognition, pp. 734-736, 1992.
[13] Q. Zheng and R. Chellappa, "Estimation of illuminant direction, albedo, and shape from shading," IEEE Trans. on Patt. Anal. and Mach. Intell. 13(7), pp. 680-702, 1991.
[14] B. K. P. Horn and M. J. Brooks, eds., Shape from Shading, The MIT Press, Cambridge, Massachusetts, 1989.
[15] W. E. L. Grimson, "An implementation of a computational theory of visual surface interpolation," Computer Vision, Graphics, and Image Processing 22, pp. 29-69, 1983.
[16] T. Poggio, V. Torre, and C. Koch, "Computational vision and regularization theory," Nature 317, pp. 314-319, 1985.
[17] D. Terzopoulos, "The computation of visible-surface representations," IEEE Trans. on Patt. Anal. and Mach. Intell. 10(4), pp. 417-438, 1988.
[18] R. Szeliski, "Fast surface interpolation using hierarchical basis functions," IEEE Trans. on Patt. Anal. and Mach. Intell. 12(6), pp. 513-528, 1990.
[19] S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall, 1999.
[20] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature 323, pp. 533-536, 1986.
[21] S. Singhal and L. Wu, "Training feed-forward networks with the extended Kalman algorithm," in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, p. 1187, 1989.
[22] M. Ahmed, S. M. Yamany, E. E. Hemayed, S. Roberts, S. Ahmed, and A. A. Farag, "3D reconstruction of the human jaw from a sequence of images," in IEEE Conf. on Computer Vision and Pattern Recognition, (Puerto Rico), June 1997.