Three Dimensional Model Building In Computer Vision (II)

E. E. Hemayed, S. M. Yamany, W. B. Seales, A. A. Farag

TR-CVIP 97, Sept. 1997
Contents

1 Stereo Vision
  1.1 Camera Calibration
    1.1.1 Calibration Using A Non-Linear Technique
    1.1.2 Calibration Using A Two-Stage Technique
    1.1.3 Portable Camera Calibration Utilities
  1.2 Surface Reconstruction
    1.2.1 Occluding Contour Based Reconstruction
    1.2.2 A Correlation-based Trinocular Vision System
    1.2.3 Results and Comments
  1.3 Active Stereo Vision

2 Shape From Shading
  2.1 Three-Dimensional Shape From Shading
  2.2 Surface-Based Registration
    2.2.1 Finding the Closest Point
    2.2.2 Minimization Using Genetic Algorithms

3 3D Laser Scanner
  3.1 Triangulation Using Deformable Contours
    3.1.1 Slicing The 3D Data Set
    3.1.2 Curve Fitting Using Deformable Contour
    3.1.3 Slice Linking
    3.1.4 Results and Comments
List of Figures

0.1 Overview of the 3D model builder.

1.1 Camera geometry with perspective projection and radial distortion
1.2 (a) Different types of edges; n: surface normal discontinuities, d: depth discontinuities, dn: surface normal and depth discontinuities, r: changes in surface reflectance, s: shadows. (b) The points on the extremal boundary shift when the point of view changes.
1.3 The fundamental components of occluding contour based reconstruction.
1.4 The parameterization of the local 3D surface in the neighborhood of point M.
1.5 Each camera C_i sees a different extremal boundary curve R_i on a smooth surface, so there are three 3D surface points M_i for every point match.
1.6 CVIP trinocular vision system
1.7 The different phases of surface reconstruction
1.8 The epipolar constraint
1.9 The modified gradient direction
1.10 The correlation windows
1.11 The three images of one view of the first object
1.12 The reconstruction results (Left: Occluding contour based results, Right: Correlation based results)
1.13 The three images of one view of the second object
1.14 The reconstruction results (Left: Occluding contour based results, Right: Correlation based results)

2.1 System Overview
2.2 Gene structure used for data encoding
2.3 Extracting 3D points from a sequence of 2D images using SFS
2.4 Wire-frame and rendered surface of the registration results
2.5 Part of the Jaw model obtained from 15 data sets

3.1 Overview of the 'SFL' approach
3.2 The drawback of using the minimum distance as the only closeness criterion
3.3 Deformable contour
3.4 The deformable contour of two different boundaries
3.5 A simplified example of slice linking
3.6 A partial view of the strong sides table
3.7 A simplified example of the mesh that closes the top of the surface
3.8 Two slices of a computer generated model
3.9 The wireframe and the solid model of the theoretical model
3.10 The wireframe and the solid model of the phone handset
3.11 The cloud of data and the wireframe of a human jaw
3.12 The solid model and the rapid prototype model of a human jaw
SUMMARY:
The 3D model builder (Figure 0.1) consists of three phases: Data Acquisition, Data Preprocessing, and Surface Reconstruction. The data acquisition phase provides the computer with information about the physical object. The input to this phase can come from four different scenarios: stereo vision, shape from shading, a 3D laser digitizer, or Computerized Tomography (CT). The data preprocessing phase is incorporated in each technique to facilitate the process of surface reconstruction. In a stereo vision system, features from a sequence of images are extracted and used in the surface fitting phase. Shape from shading estimates the depth of the image pixels based on the grey level of these pixels. The data obtained from the laser digitizer contains redundant information that has to be eliminated. The CT slices are segmented to mark the object that needs to be reconstructed. The third phase in the 3D model builder is to fit a surface to the processed data. This phase is known in the computer vision field as triangulation or surface fitting. In the shape from shading technique, one may use multiple views of the same object to get a complete description of the surface. In order to get a 3D model of the whole object, the different views of the object are registered. This process of registration is applied to the 3D model of each view.
Figure 0.1: Overview of the 3D model builder.

In this report, we present three approaches to 3D model building: stereo vision, shape from shading, and the 3D laser digitizer (scanner). The fourth approach, CT (or MRI), is mainly used in the biomedical field; the reader can refer to our work in that area. The report is organized as follows: Chapter 1 covers Stereo Vision, Chapter 2 covers Shape from Shading, and Chapter 3 covers the 3D Laser Scanner.
Chapter 1
Stereo Vision

One way in which humans perceive depth is through a process called binocular stereopsis, or stereo vision. Stereo vision uses the images viewed by each eye to recover depth information in a scene. A point in the scene is projected into different locations in each eye, where the difference between the two locations is called the disparity. Using geometric relationships between the eyes and the computed disparity value, the depth of the scene point can be calculated. Stereo vision, as used in computer systems, is similar. Our work in the stereo vision area involves mainly three parts: the camera calibration process, surface reconstruction, and active vision. In the following sections, we discuss our work in each field.
1.1 Camera Calibration

Geometric camera calibration is a fundamental prerequisite for any vision system that relies on quantitative measurements of the observed scene. Camera calibration is the process of determining the internal camera geometric and optical characteristics (intrinsic parameters) and the 3D position and orientation of the camera frame relative to a certain world coordinate system (extrinsic parameters). Currently we are using two different camera calibration techniques. The first technique, calibration using a non-linear method, was developed by INRIA [1]; the second, calibration using a two-stage approach, was developed by Tsai [3]. Both techniques use the pinhole camera model; however, the latter also takes the radial lens distortion into account. The following sections present a discussion of the two techniques.
1.1.1 Calibration Using A Non-Linear Technique
The classical approach to camera calibration relies on several primary assumptions. First, images are taken of an object, or calibration target, for which exact metric information is known. Second, it is assumed that the features of this calibration target can be accurately identified in the image. And third, it is assumed that the camera parameters remain fixed, and hence calibration is performed only once as a pre-imaging step. Any changes to the camera setup require re-calibration. Under these assumptions, the classical method solves for the transformation that best maps the known three-dimensional features of the target onto their detected two-dimensional positions in the input image. We have used a non-linear technique for calibration based on the work of Robert [1], which relaxes the second assumption above, that of accurately identifying target points in the calibration image. Using this method, we calibrate without a strong dependence on very accurately detected image features. The algorithm uses a measure of confidence in the location of features as part of the optimization process. This overcomes the error associated with definitively localizing features early in the process and then propagating that error to the final solution. Thus it is the collective arrangement of more weakly-detected features that constrains the process. This approach has the benefit of making the choice of calibration target nearly irrelevant, and gives freedom from needing to develop highly-specialized feature detectors. Thus the algorithm incorporates the feature detection in the equation-solving process to avoid the inaccuracies of direct feature detection. More specifically, this non-linear method formulates the calibration based on a pinhole camera model. An error criterion is defined based on a starting point for the calibration parameters generated from user input
or automated algorithms using pattern-specific information. This error criterion measures the confidence that each known fiducial point from the 3-D pattern model is actually projecting onto an image location which is the image of a fiducial point. The error criterion upon which the optimization is based is formulated as follows. An image is taken of a known 3-D calibration pattern. This image is then converted into a normalized edge gradient image, in which a value of zero represents a strong edge and a value of one represents no edge at a pixel location. The edge response from the gradient magnitude operator represents the set of potential locations where selected points from the calibration pattern should project. The exact shape of the 3-D pattern and the set of points used on it can be arbitrary, as long as a uniform sampling of the pattern is provided. The error criterion to minimize for this image is then:
E = \sum_{i=1}^{n} \mathrm{gradient}(\tilde{P} X_i)    (1.1)
where gradient() is the value in the gradient image at the 2-D image point \tilde{P} X_i, which is the projection of the 3-D feature point X_i. If the data points selected on the 3-D pattern project correctly, they should eventually reach the bottom of the energy wells representing the edges in the image gradient landscape. Thus the objective is to find the parameter vector \tilde{P}, composed of the explicit camera parameters, which minimizes E. The implementation minimizes the sum of the squares of the M weighted nonlinear error measurement functions (one measurement error per feature point per image) in N variables (the calibration parameters) by a modification of the Levenberg-Marquardt algorithm. This non-linear optimization includes the adjustment of the location of the image points according to the error criterion, and converges after a sufficient number of iterations.
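As an illustration (not the implementation used in this report), the following minimal Python/NumPy sketch shows how the error criterion of Eqn. (1.1) can be minimized with a Levenberg-Marquardt solver. The parameterization (focal lengths, principal point, Rodrigues rotation, translation), the function names, and the interpolator grad_interp (a callable returning the normalized gradient-image value at a sub-pixel location) are our own assumptions for the sketch; lens distortion and skew are omitted.

    import numpy as np
    from scipy.optimize import least_squares

    def project(params, X):
        # Pinhole projection of 3-D points X (N x 3) under the hypothetical
        # parameter vector params = (fx, fy, cx, cy, rx, ry, rz, tx, ty, tz),
        # where (rx, ry, rz) is a Rodrigues rotation vector.
        fx, fy, cx, cy = params[:4]
        rvec, t = params[4:7], params[7:10]
        theta = np.linalg.norm(rvec)
        if theta < 1e-12:
            R = np.eye(3)
        else:
            k = rvec / theta
            K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
            R = np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)
        Xc = X @ R.T + t                      # world -> camera frame
        u = fx * Xc[:, 0] / Xc[:, 2] + cx     # perspective projection
        v = fy * Xc[:, 1] / Xc[:, 2] + cy
        return np.stack([u, v], axis=1)

    def residuals(params, X, grad_interp):
        # One residual per fiducial point: the normalized gradient-image value
        # at its projection (zero means the point sits exactly on an edge).
        uv = project(params, X)
        return np.array([float(grad_interp(p)) for p in uv])

    def calibrate(X, grad_interp, params0):
        # Levenberg-Marquardt minimization of the sum of squared residuals.
        return least_squares(residuals, params0, args=(X, grad_interp), method='lm')

Because each residual is a gradient-image value rather than a distance to an individually detected feature, the points are pulled toward edges collectively, which is the essence of the weak-feature formulation described above.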
1.1.2 Calibration Using A Two-Stage Technique
In this technique, Tsai uses the pinhole camera model that is widely used in the literature, but he additionally considers the radial distortion introduced by the lens. In this section, we present the Tsai technique in two steps: first, a description of the camera model used in the technique, and second, a description of the technique itself.
The Camera Model
Fig. 1.1 illustrates the camera geometry with perspective projection and radial lens distortion, where (x_w, y_w, z_w) is the 3D world coordinate system with origin o_w, (x, y, z) is the camera 3D coordinate system with origin O (the optical center), (X, Y) is the image coordinate system centered at O_1, and f is the focal length. Consider a point P located at (x_w, y_w, z_w) relative to the 3D world coordinates. The projection of this point into the image plane is computed as follows:
Rigid body transformation: A transformation (rotation R followed by translation T) is applied to transform P to the camera 3D coordinate system (x, y, z):

\begin{bmatrix} x \\ y \\ z \end{bmatrix} = R \begin{bmatrix} x_w \\ y_w \\ z_w \end{bmatrix} + T    (1.2)
where R is a 3x3 rotation matrix and T is a translation vector; they represent the extrinsic camera parameters.
Perspective projection: Point P is projected to the ideal (undistorted) image coordinates (X_u, Y_u) using perspective projection and the focal length f:
Figure 1.1: Camera geometry with perspective projection and radial distortion
X_u = f \frac{x}{z}, \qquad Y_u = f \frac{y}{z}    (1.3)
Radial lens distortion: Due to the lens distortion, the projected point is shifted to a new position (X_d, Y_d), which is the distorted or true image coordinate:
X_d + D_x = X_u, \qquad Y_d + D_y = Y_u    (1.4)
where

D_x = X_d (\kappa_1 r^2 + \kappa_2 r^4 + \cdots), \qquad D_y = Y_d (\kappa_1 r^2 + \kappa_2 r^4 + \cdots), \qquad r = \sqrt{X_d^2 + Y_d^2}    (1.5)
Tsai considered only the first term (\kappa_1), since adding more terms would not help and might cause numerical instability. The modeling of lens distortion can be found in [6].

Computer image coordinate: Finally, the real image coordinate (X_d, Y_d) is transformed to the computer image coordinate (X_f, Y_f) as follows:
X_f = s_x \, d_x'^{-1} X_d + C_x, \qquad Y_f = d_y^{-1} Y_d + C_y    (1.6)
where (X_f, Y_f) are the row and column numbers of the image pixel in computer frame memory, (C_x, C_y) are the row and column numbers of the center of the computer frame memory, and
d_x' = d_x \frac{N_{cx}}{N_{fx}}    (1.7)
Here

d_x: center-to-center distance between adjacent sensor elements in the X (scan line) direction,
d_y: center-to-center distance between adjacent CCD sensor elements in the Y direction,
N_{cx}: number of sensor elements in the X direction,
N_{fx}: number of pixels in a line as sampled by the computer.

Based on the described camera model, the camera parameters are categorized into two classes: the extrinsic parameters, which are the same as in other calibration techniques (R, T), and the intrinsic parameters:

f: effective focal length,
\kappa_1: lens distortion coefficient,
s_x: uncertainty scale factor for x, due to TV camera scanning and acquisition timing error,
(C_x, C_y): computer image coordinates of the origin in the image plane.
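To make the camera model concrete, the following Python sketch (our own illustration, not part of Tsai's calibration code) maps a 3-D world point to computer image coordinates by chaining Eqns. (1.2), (1.3), (1.4)-(1.5) with only the first distortion term, and (1.6). The function name, the argument names, and the fixed-point iteration used to apply the distortion are assumptions of the sketch.

    import numpy as np

    def tsai_project(Pw, R, T, f, kappa1, sx, dx_p, dy, Cx, Cy):
        # Pw: 3-D world point; R, T: extrinsic parameters; dx_p stands for d_x'.
        x, y, z = R @ np.asarray(Pw) + T            # Eqn. (1.2): rigid body transform
        Xu, Yu = f * x / z, f * y / z               # Eqn. (1.3): perspective projection
        Xd, Yd = Xu, Yu
        for _ in range(10):                         # solve Xd(1 + kappa1*r^2) = Xu
            r2 = Xd**2 + Yd**2                      # from Eqns. (1.4)-(1.5)
            Xd = Xu / (1.0 + kappa1 * r2)
            Yd = Yu / (1.0 + kappa1 * r2)
        Xf = sx * Xd / dx_p + Cx                    # Eqn. (1.6): computer image coords
        Yf = Yd / dy + Cy
        return Xf, Yf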
The Two-Stage Camera Calibration Procedures
As s_x, the uncertainty scale factor in X, is not known a priori, the calibration points should be non-coplanar. With this condition in mind, the camera calibration procedure is stated by Tsai as follows:
Stage 1 - Compute 3D Orientation, Position (x, y), and Scale Factor:

Compute the image coordinates (X_d', Y_d'), where (X_d', Y_d') are defined the same as (X_d, Y_d) in Eqn. 1.6 except that s_x is set to 1:
1. Detect the pixel location of calibration point i; call it (X_{fi}, Y_{fi}).
2. Obtain N_{cx}, N_{fx}, d_x', d_y according to Eqn. 1.7 using information about the camera and frame memory supplied by the manufacturer.
3. Take (C_x, C_y) to be the center pixel of the frame memory.
4. Compute (X_{di}, Y_{di}) using
X_{di} = s_x^{-1} d_x' (X_{fi} - C_x), \qquad Y_{di} = d_y (Y_{fi} - C_y)    (1.8)
for i = 1, \ldots, N, where N is the total number of calibration points.

Compute the seven unknowns T_y^{-1} s_x r_1, T_y^{-1} s_x r_2, T_y^{-1} s_x r_3, T_y^{-1} s_x T_x, T_y^{-1} r_4, T_y^{-1} r_5, and T_y^{-1} r_6. These unknowns are computed by setting up the following linear equation,
\begin{bmatrix} Y'_{di} x_{wi} & Y'_{di} y_{wi} & Y'_{di} z_{wi} & Y'_{di} & -X'_{di} x_{wi} & -X'_{di} y_{wi} & -X'_{di} z_{wi} \end{bmatrix}
\begin{bmatrix} T_y^{-1} s_x r_1 \\ T_y^{-1} s_x r_2 \\ T_y^{-1} s_x r_3 \\ T_y^{-1} s_x T_x \\ T_y^{-1} r_4 \\ T_y^{-1} r_5 \\ T_y^{-1} r_6 \end{bmatrix} = X'_{di}    (1.9)
for each calibration point i \in [1, N] with 3D world coordinates (x_{wi}, y_{wi}, z_{wi}) and the modified image coordinates (X'_{di}, Y'_{di}) computed above. With N \gg 7, an overdetermined system of linear equations can be established and solved for the seven unknowns.

Compute (r_1, \ldots, r_9, T_x, T_y) from a_1 = T_y^{-1} s_x r_1, a_2 = T_y^{-1} s_x r_2, a_3 = T_y^{-1} s_x r_3, a_4 = T_y^{-1} s_x T_x, a_5 = T_y^{-1} r_4, a_6 = T_y^{-1} r_5, a_7 = T_y^{-1} r_6:

1. Compute |T_y| using the following formula:
|T_y| = (a_5^2 + a_6^2 + a_7^2)^{-1/2}    (1.10)
2. Determine the sign of T_y:
   (a) Pick an object point i whose computer image coordinates (X_{fi}, Y_{fi}) are away from the image center (C_x, C_y); its world coordinates are (x_{wi}, y_{wi}, z_{wi}).
   (b) Pick the sign of T_y to be +1.
   (c) Compute the following: r_1 = (T_y^{-1} r_1) T_y, r_2 = (T_y^{-1} r_2) T_y, r_4 = (T_y^{-1} r_4) T_y, r_5 = (T_y^{-1} r_5) T_y, T_x = (T_y^{-1} T_x) T_y, x = r_1 x_w + r_2 y_w + T_x, y = r_4 x_w + r_5 y_w + T_y, where T_y^{-1} r_1, T_y^{-1} r_2, T_y^{-1} r_4, T_y^{-1} r_5, T_y^{-1} T_x were determined in the previous step.
   (d) If (x and X have the same sign) and (y and Y have the same sign), then sgn(T_y) = +1; otherwise sgn(T_y) = -1.
3. Determine s_x using the formula:
s_x = (a_1^2 + a_2^2 + a_3^2)^{1/2} |T_y|    (1.11)
4. Compute the 3D rotation matrix R, that is, r_1 to r_9. First, r_1 to r_6 and T_x are computed using the following formulas:
r_1 = a_1 T_y / s_x, \quad r_2 = a_2 T_y / s_x, \quad r_3 = a_3 T_y / s_x, \quad r_4 = a_5 T_y, \quad r_5 = a_6 T_y, \quad r_6 = a_7 T_y, \quad T_x = a_4 T_y    (1.12)
The third row of R is computed as the cross product of the first two rows of R.
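The following NumPy sketch summarizes Stage 1 as described above; it is illustrative only. It assumes the modified image coordinates (X'_d, Y'_d) of Eqn. (1.8) have already been computed with s_x = 1, it picks the test point for the sign of T_y as the one farthest from the image center, and the function and variable names are our own.

    import numpy as np

    def tsai_stage1(Xd, Yd, Xw):
        # Xd, Yd: modified image coordinates X'_di, Y'_di; Xw: N x 3 world points.
        xw, yw, zw = Xw[:, 0], Xw[:, 1], Xw[:, 2]
        # Overdetermined linear system of Eqn. (1.9), one row per calibration point.
        A = np.stack([Yd * xw, Yd * yw, Yd * zw, Yd,
                      -Xd * xw, -Xd * yw, -Xd * zw], axis=1)
        a, *_ = np.linalg.lstsq(A, Xd, rcond=None)           # a_1 ... a_7
        Ty = 1.0 / np.sqrt(a[4]**2 + a[5]**2 + a[6]**2)      # Eqn. (1.10), |Ty|
        sx = np.sqrt(a[0]**2 + a[1]**2 + a[2]**2) * Ty       # Eqn. (1.11)
        # Sign of Ty: test with the calibration point farthest from the image centre.
        i = np.argmax(Xd**2 + Yd**2)
        x = (a[0] * xw[i] + a[1] * yw[i]) * Ty / sx + a[3] * Ty
        y = a[4] * xw[i] * Ty + a[5] * yw[i] * Ty + Ty
        if np.sign(x) != np.sign(Xd[i]) or np.sign(y) != np.sign(Yd[i]):
            Ty = -Ty
        # Eqn. (1.12): first two rows of R and Tx; third row is their cross product.
        row1 = np.array([a[0], a[1], a[2]]) * Ty / sx
        row2 = np.array([a[4], a[5], a[6]]) * Ty
        Tx = a[3] * Ty
        R = np.vstack([row1, row2, np.cross(row1, row2)])
        return R, Tx, Ty, sx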
Stage 2 - Compute Effective Focal Length, Distortion Coefficients, and z Position:

Compute an approximation of f and T_z by ignoring lens distortion. This can be done by establishing the following linear equation with f and T_z as unknowns:
\begin{bmatrix} y_i & -d_y Y_i \end{bmatrix} \begin{bmatrix} f \\ T_z \end{bmatrix} = w_i \, d_y Y_i    (1.13)

where

y_i = r_4 x_{wi} + r_5 y_{wi} + r_6 z_{wi} + T_y, \qquad w_i = r_7 x_{wi} + r_8 y_{wi} + r_9 z_{wi}    (1.14)
for each calibration point i \in [1, N]. With several calibration points, this yields an overdetermined system of linear equations that can be solved for the unknowns f and T_z.
Compute the exact solution for f, T_z, \kappa_1 by solving the following equation with f, T_z, \kappa_1 as unknowns:

d_y Y + d_y Y \kappa_1 r^2 = f \, \frac{r_4 x_w + r_5 y_w + r_6 z_w + T_y}{r_7 x_w + r_8 y_w + r_9 z_w + T_z}    (1.15)

where

r = \sqrt{(s_x^{-1} d_x' X)^2 + (d_y Y)^2}    (1.16)
The solution can be obtained using a standard optimization scheme. The approximate values of f and T_z are used as the initial guess; the initial guess for \kappa_1 is zero.
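As a companion sketch (again only illustrative), the linear approximation of f and T_z in Eqns. (1.13)-(1.14) can be obtained by ordinary least squares. We read Y_i as the image row coordinate relative to C_y, so that d_y Y_i is the distorted coordinate of Eqn. (1.8); that reading, the inclusion of the z_{wi} term, and the function name are assumptions of the sketch.

    import numpy as np

    def tsai_stage2_linear(Xw, Y, R, Ty, dy):
        # Xw: N x 3 world points; Y: image row coordinates Y_i relative to Cy.
        r4, r5, r6 = R[1]
        r7, r8, r9 = R[2]
        yi = Xw @ np.array([r4, r5, r6]) + Ty        # Eqn. (1.14)
        wi = Xw @ np.array([r7, r8, r9])
        dyY = dy * np.asarray(Y)
        A = np.stack([yi, -dyY], axis=1)             # [y_i  -d_y Y_i][f Tz]^T = w_i d_y Y_i
        (f, Tz), *_ = np.linalg.lstsq(A, wi * dyY, rcond=None)
        return f, Tz

The values returned here would then serve as the initial guess for the non-linear solution of Eqns. (1.15)-(1.16).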
1.1.3 Portable Camera Calibration Utilities
We are re-implementing our calibration tools as platform-independent software using Java and associated toolkits for usable graphical user interfaces. This effort is designed to provide a uniform and portable framework through which we can test various calibration techniques. The design of this tool is geared toward three primary goals: providing a useful and portable GUI, allowing new calibration algorithms to be plugged into the tool easily for evaluation and use, and allowing remote-site control over the calibration process. We are currently designing a GUI which will allow the display of data related to the geometry of up to three cameras which are to be calibrated. The GUI will support options such as display of epipolar geometry overlaid on calibration images, visualization of points projected through the existing camera geometry, positions of the epipoles, and estimation of camera parameter variations as the system iterates. These functions are non-trivial to implement, and the Java toolkits provide a uniform and platform-independent API which we hope will help in popularizing this tool within the research community. The design allows experimental calibration algorithms to be plugged in and executed. Thus we can evaluate alternatives as well as our existing algorithms, which we are re-coding for the Java implementation. The tool will use Java's network API in order to acquire images or image databases remotely for analysis and calibration. This is part of the design in order to improve collaboration between sites and to foster use of equipment by remote hosts as well as local users. Currently the tool is being designed and the GUI is being implemented. The two existing calibration algorithms are being re-implemented as the initial algorithms in a suite of options we expect will grow as the project continues.
1.2 Surface Reconstruction

In the surface reconstruction area, we studied and implemented two classical stereo vision techniques, MPG [10, 12] and a rule-based approach [11, 13]. In addition, Seales [14] developed and implemented a robust technique for surface reconstruction using the occluding contours. Currently, we are working on developing new approaches and on better utilizing classical stereo approaches for surface reconstruction. Another technique, which has been partially developed but not yet published, is a correlation-based trinocular vision system. The following sections present the currently developed technique and the occluding contour based reconstruction. Results and comments on both techniques are given at the end of the section.
1.2.1 Occluding Contour Based Reconstruction
Occluding contours are a special class of edges, Fig. 1.2(a). They arise from the extremal boundaries of the object. They are viewpoint dependent, Fig. 1.2(b), and are characterized by the fact that the optical rays of their points are tangential to the surface of the object. In this work, we exploit the special geometry that produces the occluding contour to recover and use important surface information: surface normals and curvatures. The fundamental components of our working system, as shown in Fig. 1.3, are edge detection, stereo matching, edge classification and reconstruction, multiple-frame fusion, and global surface fitting. Our system can be described as follows. First, we calibrate a trinocular system of cameras using a well-known
Figure 1.2: (a) Different types of edges; n: surface normal discontinuities, d: depth discontinuities, dn: surface normal and depth discontinuities, r: changes in surface reflectance, s: shadows. (b) The points on the extremal boundary shift when the point of view changes.
Figure 1.3: The fundamental components of occluding contour based reconstruction.
method [2]. Using this calibrated rig, we obtain a sequence of grey-level images of an unknown object undergoing a motion, e.g. rotation and translation, on a controllable table. The purpose of the motion is to get different views of the object and hence more reconstructed points, which will help the surface fitting process. For each view of this trinocular sequence of images, we apply an edge detector [7]. A cubic B-spline is fitted to the detected edges. With the camera geometry, a spline-based trinocular stereo algorithm is applied in order to recover stereo matches [8]. Based on the camera geometry, the matched edges are classified into two classes, occluding contours and fixed edges. The occluding contours are used to recover the 3D surface of each view, while the fixed edges are used to recover the motion transformation between the different views. The motion transformation is then used to fuse the multiple views and obtain a global 3D description of the object. Finally, a surface is fitted to the reconstructed 3D points to obtain a meaningful description of the surface as a mesh of triangles. In the following paragraphs we present the key component of our algorithm, which is the edge classification and reconstruction.
Edge Classification and Reconstruction
We employ the Vaillant-Faugeras method for reconstructing local surface information at the occluding contour [9]. The reconstructed surface is used to classify the edges. This classification is based on the fact that at a fixed edge, a correspondence among points on matching curves means a single 3D point has been projected into each of the three images. However, at an occluding contour, each image point is the result of the projection of a different 3D extremal boundary point. The concept behind the Vaillant technique can be presented as shown in Fig. 1.4. In the figure, the camera is viewing an extremal boundary R on a smooth surface S in 3D. The projection of R in the image plane is r. Let the camera's optical center be C; then Cm is the optical ray that goes through the image point m. Cm intersects the extremal boundary R at the 3D point M. Define t to be the tangent vector to r at m in the image plane; then Cm and t define the tangent plane \Pi on S at M. The normal to the tangent plane is n and is defined by the Euler angles \theta and \phi as
n(\theta, \phi) = \frac{Cm \times t}{\| Cm \times t \|}    (1.17)
Let the function p(\theta, \phi) be the perpendicular distance from the origin of the 3D coordinate system to the tangent plane \Pi; then we have
p(\theta, \phi) = n(\theta, \phi) \cdot OC    (1.18)
The equation of the tangent plane can be written as
\chi \cdot n(\theta, \phi) - p(\theta, \phi) = 0    (1.19)

where \chi is the vector of the 3D surface point (x, y, z)^T and n(\theta, \phi) is the normal to S at \chi. In [9] the authors show that \chi = \chi(\theta, \phi) is a local parameterization of the surface S at every non-parabolic point. The mapping (\theta, \phi) \to p(\theta, \phi) defines \chi with the parametric equations

x = \cos\theta \cos\phi \; p(\theta, \phi) - \frac{\sin\theta}{\cos\phi} \frac{\partial p(\theta, \phi)}{\partial \theta} - \cos\theta \sin\phi \; \frac{\partial p(\theta, \phi)}{\partial \phi}    (1.20)

y = \sin\theta \cos\phi \; p(\theta, \phi) + \frac{\cos\theta}{\cos\phi} \frac{\partial p(\theta, \phi)}{\partial \theta} - \sin\theta \sin\phi \; \frac{\partial p(\theta, \phi)}{\partial \phi}    (1.21)

z = \sin\phi \; p(\theta, \phi) + \cos\phi \; \frac{\partial p(\theta, \phi)}{\partial \phi}    (1.22)
These fundamental equations are used to estimate, from the images, the first and second derivatives of p(\theta, \phi), which are used to reconstruct zero-order differential properties (the point M), first-order properties (the surface normal), and second-order
Figure 1.4: The parameterization of the local 3D surface in the neighborhood of point M.

properties (the surface curvature) at the occluding contour. The result of the reconstruction is a set of three 3D curves, one curve for each camera. Fig. 1.5 shows the reconstructed 3D curves. Once the matched curves are reconstructed, a curvature test is applied to these curves to classify them as fixed or occluding edges. The key idea of this test is the fact that the three cameras see different points at the extremal boundary, Fig. 1.5. In our algorithm, we hypothesize that every edge has originated from the projection of an extremal boundary. Then, those edges that yield a very large reconstructed surface curvature are rejected as occluding edges, while those that yield a small reconstructed surface curvature are labeled as occluding contours.
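For illustration, Eqns. (1.20)-(1.22) translate directly into the short NumPy routine below. In practice, p(\theta, \phi) and its partial derivatives would be estimated numerically from neighboring occluding-contour matches; the function name and the assumption that those derivatives are already available are ours.

    import numpy as np

    def surface_point(theta, phi, p, p_theta, p_phi):
        # Recover the 3-D surface point from the support function p(theta, phi)
        # and its first derivatives, following Eqns. (1.20)-(1.22).
        ct, st = np.cos(theta), np.sin(theta)
        cp, sp = np.cos(phi), np.sin(phi)
        x = ct * cp * p - (st / cp) * p_theta - ct * sp * p_phi
        y = st * cp * p + (ct / cp) * p_theta - st * sp * p_phi
        z = sp * p + cp * p_phi
        return np.array([x, y, z])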
1.2.2 A Correlation-based Trinocular Vision System
During the last two decades, many stereo vision algorithms have been developed. These approaches can be classified into two categories, edge-based and region-based stereo. The former is based on matching feature points from each image, while the latter is based on matching regions from each image. The main difference between these two categories is the nature of the reconstructed data: the edge-based approach results in a sparse disparity map, while the region-based approach results in a dense disparity map. However, the edge-based approach is faster and more accurate than the region-based approach. Integrating the two approaches greatly improves the performance of stereo algorithms; in our case, however, we prefer an edge-based approach in order to achieve more accuracy. The problem of the sparse nature of the reconstructed data is solved by taking different views of the object and integrating them together. To achieve accurate integration between these views, we use a calibrated rotary table. Fig. 1.6 shows the rotary table, its controller, and the CVIP trinocular vision system. Our reconstruction technique can be divided into five phases, see Fig. 1.7. The first phase is the feature extraction process, which is followed by feature matching and the disambiguation process. The output of these phases is a set of 2D point triples, where each triple represents a candidate match. Using the camera parameters, the matched points are transformed to 3D space. Following this task, an interpolation technique is applied to the reconstructed data set to interpolate the missing points. Finally, a triangular mesh is fitted to the 3D data set, and an STL file for the reconstructed object is obtained that can be used to build a rapid prototype model of the object.
Figure 1.5: Each camera C_i sees a different extremal boundary curve R_i on a smooth surface, so there are three 3D surface points M_i for every point match.
Figure 1.6: CVIP trinocular vision system
Figure 1.7: The different phases of surface reconstruction
Feature Extraction
Based on psychological studies of the eye, it has been observed that the eye is highly sensitive to the edges of the scene. This observation has led to the development of many edge detection algorithms. Among those algorithms are the zero-crossings of the Laplacian of a Gaussian (LoG) [24]. The contours produced by this technique are closed, as needed by some applications; however, the extracted edges have rounded corners and shifted edge locations. The other common edge detector is Canny [25], which has become one of the most widely used edge finding algorithms; however, the connectivity of the extracted edges at junctions is poor, and corners are rounded. Both techniques alter the feature locations, which affects the transformation from 2D to 3D space. Therefore, we did not use them as our feature extraction techniques. Instead, we used a recent edge detector called 'SUSAN', an abbreviation for 'Smallest Univalue Segment Assimilating Nucleus'. This technique has the advantage of giving thinned edges with accurate locations. For more details about SUSAN, readers are referred to [23]. Other features can be used as well, such as line segments or high curvature points.
Feature Matching and Disambiguation
Matching the features of different images of a single scene is one of the bottlenecks in computer vision. Over the years many algorithms for feature matching have been proposed [15]. Among those techniques, the gradient orientation of the edges [20, 19] and correlation based matching [21] have attracted the attention of many researchers for their robustness (to some extent). The gradient orientation of the edges is robust in measuring the shape of the projected edges. The gradient orientation is mainly used as a measure of figural continuity, which says that the contours on a scene surface will project into each image as continuous contours with approximately the same shape. On the other hand, the gradient is sensitive to noise, and the local shape of the point is not enough for getting unique matches. The other common matching technique is to match two points based on a correlation score that is computed in the neighborhood of each point. This approach is less sensitive to noise and to some extent is a good measure of the local shape of a point, but it is highly susceptible to changes in intensity from one image to another. Based on the previous observations, we decided to use both approaches, the gradient and the correlation, as measures for feature matching; each one covers the drawback of the other. However, both techniques are sensitive to distortion from different viewing positions and to the presence of occluding boundaries in the scene. For solving the former problem, we modified the classical approaches for computing the gradient
Figure 1.8: The epipolar constraint

Given a point m_1 in the first image, its corresponding point in the second image is constrained to lie on the epipolar line of m_1. The epipolar constraint simplifies the matching search process to 1D (along the epipolar line) instead of 2D (over the whole image). Under the pinhole camera model, the epipolar constraint can be represented in terms of the fundamental equation, defined for two points m_1 and m_2, the projections of M in Image 1 and Image 2 respectively, as
m_1^T F m_2 = 0    (1.23)

where F, the fundamental matrix, is defined as

F = A_2^{-T} \, [t]_\wedge \, R \, A_1^{-1}    (1.24)

where A_1 and A_2 are the intrinsic matrices of the first and second cameras, respectively, (t, R) is the displacement from the first camera to the second camera coordinate system, and the operator \wedge is defined as in [18].
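The following sketch (ours, not the INRIA code) builds F from Eqn. (1.24) and returns the epipolar line in the second image for a pixel in the first, using the convention m_1^T F m_2 = 0 adopted above. The function names are hypothetical.

    import numpy as np

    def fundamental_matrix(A1, A2, R, t):
        # F = A2^{-T} [t]_^ R A1^{-1}, Eqn. (1.24).
        tx = np.array([[0, -t[2], t[1]],
                       [t[2], 0, -t[0]],
                       [-t[1], t[0], 0]])          # antisymmetric matrix of t
        return np.linalg.inv(A2).T @ tx @ R @ np.linalg.inv(A1)

    def epipolar_line(F, m1):
        # Epipolar line (a, b, c) in image 2 of pixel m1 = (u, v) in image 1:
        # all candidate matches m2 satisfy a*u2 + b*v2 + c = 0.
        l = F.T @ np.array([m1[0], m1[1], 1.0])
        return l / np.hypot(l[0], l[1])            # normalize for point-line distance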
The gradient operator: The gradient operator is represented by a pair of masks H_1, H_2, which measure the gradient of the image I(u, v) in two orthogonal directions. The gradient vector magnitude and direction are given by
g(u, v) = \sqrt{g_1^2(u, v) + g_2^2(u, v)}    (1.25)

\theta_g(u, v) = \tan^{-1} \frac{g_2(u, v)}{g_1(u, v)}    (1.26)
where g_1(u, v) \triangleq \langle U, H_1 \rangle_{(u, v)} and g_2(u, v) \triangleq \langle U, H_2 \rangle_{(u, v)} are the bidirectional gradients, and \langle U, H \rangle_{u, v} is defined, for an image I, as

\langle U, H \rangle_{u, v} \triangleq \sum_i \sum_j h(i, j) \, I(i + u, j + v)    (1.27)
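A compact illustration of Eqns. (1.25)-(1.27), using the Sobel masks of Eqns. (1.28)-(1.29) given below, is sketched here (our own code, not the report's implementation). The report's modified measure then takes the direction relative to the local epipolar line rather than the image's horizontal axis, which amounts to subtracting the epipolar line's orientation from \theta_g.

    import numpy as np
    from scipy.ndimage import correlate

    def gradient_magnitude_direction(I):
        # Sobel masks H1, H2 (Eqns. 1.28-1.29).
        H1 = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
        H2 = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float)
        g1 = correlate(I.astype(float), H1)   # <U, H1>_{u,v}, Eqn. (1.27)
        g2 = correlate(I.astype(float), H2)   # <U, H2>_{u,v}
        g = np.hypot(g1, g2)                  # gradient magnitude, Eqn. (1.25)
        theta_g = np.arctan2(g2, g1)          # gradient direction, Eqn. (1.26)
        return g, theta_g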
Figure 1.9: The modified gradient direction

In our implementation we used a common gradient operator, the Sobel operator [22]. The two masks H_1, H_2 are defined as

H_1 = \begin{bmatrix} -1 & 0 & 1 \\ -2 & \boxed{0} & 2 \\ -1 & 0 & 1 \end{bmatrix}    (1.28)

H_2 = \begin{bmatrix} -1 & -2 & -1 \\ 0 & \boxed{0} & 0 \\ 1 & 2 & 1 \end{bmatrix}    (1.29)

where the boxed element indicates the location of the origin. The gradient direction at a point m is defined as the slope of the edge curve at that point with respect to the horizontal axis of the image plane. The gradient direction is considered a good measure of the local shape of the object's boundary. However, comparing the gradient directions of two different views of an edge point is sensitive to distortion from different viewing positions. In order to minimize that sensitivity, we modified the gradient direction measure to be the slope of the edge point relative to the epipolar line at the point, not the horizontal axis. Fig. 1.9 shows the modified gradient direction.

Correlation: One of the common feature matching techniques is the correlation operation. The correlation is performed between a feature point in one of the two images and all the feature points in the second image. Obviously, this is a time-consuming task; to reduce its complexity, a search window is defined in the second image, and the search for matches is applied only to the points inside it. The correlation algorithm can be summarized as follows. For each feature point m_1 in the first image:

1. Define a correlation window of size (2N+1) x (2M+1) centered at this point. Two sides of the correlation window are chosen to be parallel to the epipolar lines and the other two sides are chosen to be perpendicular to the epipolar lines, see Fig. 1.10. Defining the window in this manner improves the accuracy of the correlation process.
2. Select a rectangular search area of size (2W+1) x (2H+1) around this point in the second image.
3. For each of the feature points that lie in the search window in the second image, compute the correlation score. We used a classical correlation score [17] that is defined as

Score(m_1, m_2) = \frac{\sum_{i=-N}^{N} \sum_{j=-M}^{M} \left[ I_1(u_1+i, v_1+j) - \bar{I}_1(u_1, v_1) \right] \left[ I_2(u_2+i, v_2+j) - \bar{I}_2(u_2, v_2) \right]}{(2N+1)(2M+1) \sqrt{\sigma^2(I_1) \, \sigma^2(I_2)}}    (1.30)
Figure 1.10: The correlation windows

where \bar{I}_k(u, v) is the average of I_k (k = 1, 2) at point (u, v) and is given by
\bar{I}_k(u, v) = \frac{\sum_{i=-N}^{N} \sum_{j=-M}^{M} I_k(u + i, v + j)}{(2N + 1)(2M + 1)}    (1.31)
and \sigma(I_k) is the standard deviation of the image I_k in the (2N+1) x (2M+1) neighborhood of (u, v), which is given by

\sigma(I_k) = \sqrt{\frac{\sum_{i=-N}^{N} \sum_{j=-M}^{M} I_k^2(u + i, v + j)}{(2N + 1)(2M + 1)} - \bar{I}_k^2(u, v)}    (1.32)
4. A pair that has a correlation score higher than a predefined threshold is considered a candidate match. Thus, for each point in the first image, we have a set of candidate matches from the second image. In our implementation, N = M = 7 for the correlation window, and W = 100, H = 80 for the search window.
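The correlation score of Eqns. (1.30)-(1.32) is essentially a zero-mean normalized cross-correlation; a simple version is sketched below (our own code, not the system's implementation). The report orients the window along the epipolar lines, which the sketch omits by using an axis-aligned window, and (u, v) are taken here as column and row indices.

    import numpy as np

    def correlation_score(I1, I2, m1, m2, N=7, M=7):
        # Score of Eqn. (1.30) between m1 = (u1, v1) in I1 and m2 = (u2, v2) in I2.
        u1, v1 = m1
        u2, v2 = m2
        W1 = I1[v1 - M : v1 + M + 1, u1 - N : u1 + N + 1].astype(float)
        W2 = I2[v2 - M : v2 + M + 1, u2 - N : u2 + N + 1].astype(float)
        n = (2 * N + 1) * (2 * M + 1)
        s1, s2 = W1.std(), W2.std()          # sigma(I_1), sigma(I_2), Eqn. (1.32)
        if s1 < 1e-12 or s2 < 1e-12:
            return 0.0                        # flat window: no reliable score
        return float(np.sum((W1 - W1.mean()) * (W2 - W2.mean())) / (n * s1 * s2))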
Point Construction
Multiple-point matches are reconstructed by the INRIA reconstruction tool using linear least squares to estimate point position and covariance values for two or more points in correspondence. The covariance values give an estimate of the certainty of the point position relative to certainties in calibration values such as the location of the camera system's optical centers. The utility has decoupled the matching from the reconstruction, allowing matching to be performed separately (often in the absence of complete calibration information). The reconstruction can make use of full calibration or can perform a relative reconstruction using the fundamental matrix [4] and other point matches as a basis.
Interpolation
Figure 1.11: The three images of one view of the first object

The output of the previous phases of surface reconstruction is a set of sparse points in space that includes most of the details of the object. However, for the purpose of building the 3D model of that object, more points need to be computed in addition to what we have. The common approach for computing these points is to use an interpolation technique. The interpolation technique that we used utilizes the grey level information included in the images. The grey level is directly related to the orientation and the surface properties of the object, as well as the light source properties. Therefore, for the same surface and light properties, the grey level depends only on the orientation of the surface. This is mainly the concept behind shape from shading. Therefore, we are working on integrating the shape from shading technique with our stereo technique, which will result in more reconstructed points that fill the gaps between the sparse points.
Triangulation
We developed a novel technique to fit a surface to a cloud of data [16]. The idea of our technique is to slice the data set into parallel cross sections and to fit a curve to each cross section. Once the curves are fitted to the slices, a linking process runs over every two consecutive slices to link them together. The output of the linking process is a mesh of triangles. This mesh is closed and has consistent normal directions, which are two necessary conditions for rapid prototyping machines. The details of this technique are provided in Chapter 3.
1.2.3 Results and Comments
We applied the two techniques, occluding contour based and correlation based reconstruction, to many objects. In this report we present only the reconstruction of two objects. Fig. 1.11 shows the three images of one view of the first object; the reconstruction results are shown in Fig. 1.12. Fig. 1.13 shows the three images of one view of another object; the reconstruction results are shown in Fig. 1.14. The results are presented as sparse points, without a mesh of triangles fitted to them. We are now working on the interpolation process to get more points into the 3D data set, after which the triangulation process can be applied. The reconstructed surfaces of both techniques need data authentication to get rid of false matches. Also, the results of both techniques need more points before a surface can be fitted to either of them. However, the occluding contour reconstruction is more robust in the case of smooth-surfaced objects, but it diminishes any corners that may exist in the object; this results from fitting B-splines to the extracted edges in the first phase of the reconstruction technique. This drawback does not exist in the other technique. Integrating these two techniques, in addition to shape from shading, will improve the performance of the reconstruction. The integrated technique will utilize the robustness of the occluding contour approach, the feature preservation of the correlation technique, and the dense nature of shape from shading.
Figure 1.12: The reconstruction results: (Left: Occluding contour based results, Right: Correlation based results)
Figure 1.13: The three images of one view of the second object
Figure 1.14: The reconstruction results: (Left: Occluding contour based results, Right: Correlation based results)
1.3 Active Stereo Vision

The goal of active stereo vision is to improve the quality of depth values recovered from stereo reconstruction by dynamically modifying the camera parameters. The zoom, focus, and aperture of a camera can be controlled to improve the resulting image, and this improvement can result in substantial improvements in the processing required to perform stereo matching. In order to approach the problem of active stereo vision, we have studied algorithms to calibrate cameras which can dynamically change zoom, focus, and aperture. Our work in calibrating active cameras is most similar to that of Willson [5], who analyzed a camera capable of changing zoom, focus, and aperture. He estimates camera parameters by performing Tsai's camera calibration algorithm [3] independently on images taken for various settings of zoom and focus. The resulting parameter values are used as a starting point for an optimization process that adjusts parameter values over the complete zoom and focus range. The key difference between our approach and that of Willson is our relaxed assumption about localized features. The basis of his optimization process is feature detection and localization using a specialized circle-detector, which gives sub-pixel accuracy. As we discussed in the earlier section on non-linear calibration, we do not assume that individual features are detected to such a high degree of accuracy. Instead, we have extended the INRIA non-linear algorithm to a multi-image, multi-staged optimization in order to calibrate the camera across a large range of zoom values.
Chapter 2
Shape From Shading

Our work in this field involved the utilization of the classical approach of shape from shading in different applications. We used SFS to automate the process of ceramic tile design [33, 34]. We also used SFS to obtain a record of the patient's occlusion with the help of a novel registration technique using genetic algorithms [26, 27, 28]. In this chapter, we present a description of our work on building a 3D model of the jaw impression. The system, shown in Figure 2.1, consists of the following modules. First, a sequence of images is obtained for segments of the jaw; these segments can be overlapping. Shape from shading (SFS) is applied to obtain a 3D representation of the segments. Second, a genetic-based approach is used to merge/register the segments. Finally, a priori information is used to align the segments on the correct geometry of the jaw. The resulting 3D model can be manipulated with CAD tools to provide the data structure for further processing.
2.1 Three-Dimensional Shape From Shading

The process of recovering surface orientation from gray level shading, Shape From Shading (SFS), was first defined by Horn [31]. Two general classes of algorithms have been developed: local algorithms, which attempt to estimate shape from local variations in image intensity, and global algorithms, which can be further divided into global minimization approaches and global propagation approaches. The global propagation approach attempts to propagate information across a shaded surface starting from points with known surface orientation (singular points), while in global minimization a solution is obtained by minimizing an energy function [33]. Among the local algorithms, we present in this section the one developed by Pentland [32] because of its speed and accuracy. Over a small region we can always approximate the reflectance map by a linear function of the partial derivatives (p, q). This approximation can be determined by taking a first-order Taylor series expansion of R(p, q) about the central point (p_0, q_0), to obtain
E(x, y) \approx R(p_0, q_0) + (p - p_0) \left. \frac{\partial R(p, q)}{\partial p} \right|_{p=p_0, q=q_0} + (q - q_0) \left. \frac{\partial R(p, q)}{\partial q} \right|_{p=p_0, q=q_0} = k_1 + k_2 p + k_3 q    (2.1)

For a Lambertian reflectance function, k_1 = \cos\sigma, k_2 = \cos\tau \sin\sigma, and k_3 = \sin\tau \sin\sigma, where \tau and \sigma are the tilt and the slant of the illuminant, respectively. This linear approximation of the reflectance function becomes accurate over a larger area as the illuminant becomes more oblique, and over a smaller area as the illuminant
moves closer to the viewer's direction. Equation (2.1) may be transformed into the Fourier domain in order to obtain a convenient and efficient solution; k_1 is a d.c. term and can be ignored.
F_E(f, \theta) = k_2 F_p(f, \theta) + k_3 F_q(f, \theta)    (2.2)
where f is the radial frequency and \theta is the orientation. Since p and q are the partial derivatives of the surface height Z, their Fourier transforms are simply
F_p(f, \theta) = 2\pi f \cos\theta \, e^{j\pi/2} F_z(f, \theta)    (2.3)

F_q(f, \theta) = 2\pi f \sin\theta \, e^{j\pi/2} F_z(f, \theta)    (2.4)
Applying the previous equations to equation (2.2) and substituting the values of k_2 and k_3, we obtain
F_E(f, \theta) = j \, 2\pi f \sin\sigma \, (\cos\tau \cos\theta + \sin\tau \sin\theta) \, F_z(f, \theta)    (2.5)

where j is the imaginary unit. Hence

F_E(f, \theta) = j \, 2\pi f \sin\sigma \, \cos(\tau - \theta) \, F_z(f, \theta)    (2.6)

Letting \cos(\tau - \theta) = S_d, we have

F_E(f, \theta) = j \, 2\pi f S_d \sin\sigma \, F_z(f, \theta)    (2.7)

and therefore

F_z(f, \theta) = \frac{1}{j \, 2\pi f S_d \sin\sigma} F_E(f, \theta)    (2.8)
This algorithm gives a non-iterative, closed-form solution using the Fourier transform. The problem lies in the linear approximation of the reflectance map when the non-linear terms are large [33, 34]. Figure 2.3 presents the application of this technique to extract depth information from different 2D jaw images. Once the 3D information for a series of 2D images \{S_j \mid j = 1, 2, \ldots, N\} is obtained, 3D data registration is applied to combine and align these data sets into one model. In the following section, we present a new technique for data registration using genetic algorithms.
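A minimal NumPy sketch of this closed-form solution, Eqn. (2.8), is given below; it is our own illustration, not the authors' implementation. The function name and the handling of the DC term and of frequencies where cos(\tau - \theta) vanishes (both are simply suppressed) are assumptions of the sketch.

    import numpy as np

    def pentland_sfs(E, tilt, slant):
        # E: 2-D grey-level image (float); tilt/slant of the illuminant in radians.
        # Returns a relative height map Z (arbitrary scale) via Eqn. (2.8).
        FE = np.fft.fft2(E)
        h, w = E.shape
        # Radial frequency f and orientation theta for every Fourier sample.
        fy = np.fft.fftfreq(h)[:, None]
        fx = np.fft.fftfreq(w)[None, :]
        f = np.hypot(fx, fy)
        theta = np.arctan2(fy, fx)
        Sd = np.cos(tilt - theta)
        denom = 1j * 2.0 * np.pi * f * Sd * np.sin(slant)
        # Suppress the d.c. term and near-degenerate frequencies.
        denom[np.abs(denom) < 1e-8] = np.inf
        Fz = FE / denom
        return np.real(np.fft.ifft2(Fz))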
2.2 Surface-Based Registration
A parametric 3-D shape S, either a curve segment or a surface, is a vector function x : [a, b] \to