
International Journal of Computer Vision 47(1/2/3), 63–77, 2002
© 2002 Kluwer Academic Publishers. Manufactured in The Netherlands.

Genetic-Based Stereo Algorithm and Disparity Map Evaluation

MINGLUN GONG AND YEE-HONG YANG
Department of Computing Science, University of Alberta, Edmonton, AB, Canada T6G 2E8
[email protected] [email protected]

Abstract. In this paper, a new genetic-based stereo algorithm is presented. Our motivation is to improve the accuracy of the disparity map by removing the mismatches caused by both occlusions and false targets. In our approach, the stereo matching problem is considered as an optimization problem. The algorithm first takes advantage of multi-view stereo images to detect occlusions and thereby removes mismatches caused by visibility problems. By optimizing the compatibility between corresponding points and the continuity of the disparity map using a genetic algorithm, mismatches caused by false targets are removed. The quadtree structure is used to implement the multi-resolution framework. Since nodes at different levels of the quadtree cover different numbers of pixels, selecting nodes at different levels gives an effect similar to adjusting the window size at different locations of the image. The experimental results show that our approach can generate more accurate disparity maps than two existing approaches. In addition, we introduce a new disparity map evaluation technique, which is based on a similar technique employed in the image segmentation area. Compared with two existing evaluation approaches, the new technique can evaluate the disparity maps generated without additional knowledge of the scene, such as the correct depth information or novel views.

Keywords: disparity map evaluation, multi-resolution image, stereovision, genetic algorithm, Markov random fields

1. Introduction

A stereovision system normally takes two or more images obtained using parallel cameras as inputs and produces a disparity map. A desired disparity map should be smooth and should contain sufficient detail for subsequent processing.

1.1. Previous Work in Stereovision

Previous work in stereovision can be coarsely classified into feature-based methods (Grimson, 1981, 1985; Hoff and Ahuja, 1989) and intensity-based methods (Chen and Medioni, 1999; Kanade and Okutomi, 1994; Levine et al., 1973; Zitnick and Kanade, 2000). The latter category can be further divided into local approaches and global approaches.

Feature-based methods first detect features, e.g., edges and corners, in the source images. The matching process is then conducted with these features. Generally, feature-based approaches can provide robust, but sparse, depth information. A complex interpolation process is needed to obtain a complete disparity map. Local intensity-based methods select a window centered at each pixel. Pixels within the window are used to compute the correlation or the sum of squared differences between the input images. The disparity that produces the best match is set as the disparity of the pixel. The selection of the window size is a critical problem. The window size should be large enough to include enough intensity variation for reliable matching. On the other hand, the size should be small enough to avoid disparity variation inside the window.


The generated disparity maps will be noisy if the window size is too small, and blurry if the window size is too large. Levine et al. (1973) change the window size locally according to the intensity pattern. Kanade and Okutomi believe that the optimal window size depends not only on the intensity variation but also on the disparity variation, which is itself unknown. Therefore, they propose an iterative algorithm (Kanade and Okutomi, 1994): starting from an initial estimate of the disparity map, the algorithm iteratively updates the disparity value of each pixel by choosing the size and shape of the window, until it converges. The above local intensity-based methods can produce dense disparity maps. However, these approaches are sensitive to noise since disparities are computed based on local information only. To address this problem, global intensity-based methods try to apply global constraints in the matching process, i.e., the disparity of a pixel is influenced by the disparities of its neighbors. Global intensity-based methods can be formulated as the process of extracting a surface from a three-dimensional u-v-d volume, i.e., the so-called disparity space (Intille and Bobick, 1994; Yang et al., 1993). The value of a voxel in the disparity space indicates the probability that the disparity of pixel (u, v) is d. Consequently, the desired disparity map should be a surface that satisfies two constraints: (1) the surface is smooth; (2) it passes through voxels of high probability. Different algorithms have been proposed using this formulation. In Chen and Medioni's approach (Chen and Medioni, 1999), the disparity surface is extracted from the volume using a propagation-type algorithm. The algorithm proceeds in two steps: seed voxel selection and surface tracing. Time efficiency is achieved by executing the algorithm in a coarse-to-fine fashion. In Zitnick and Kanade's approach (Zitnick and Kanade, 2000), a three-dimensional function is used to update the volume iteratively so that the probabilities of different disparities can influence each other. After the process converges, the disparity with the maximum probability is selected as the disparity for the corresponding pixel. Zitnick and Kanade also suggest that occluded areas can be explicitly detected by setting a threshold: if the maximum probability value is smaller than the threshold, the pixel is classified as occluded. Recovering disparity information from two views has its inherent limitations. First of all, the length of the baseline, i.e., the distance between the two cameras, has to be carefully chosen.

Using a short baseline, distances cannot be estimated accurately due to narrow triangulation. On the other hand, a longer baseline means a larger disparity range must be searched to find the match, which increases the possibility of a false match. Hence, a tradeoff has to be made between precision and correctness. Okutomi and Kanade try to address this tradeoff using a multiple-baseline approach (Okutomi and Kanade, 1993). They place multiple cameras along a line with their optical axes perpendicular to the line. The sum-of-squared-differences value is computed for each stereo pair and is used as the dissimilarity measure. The matching algorithm tries to minimize the sum of the different dissimilarity measures obtained from the multiple stereo pairs. Another commonly known issue of analyzing binocular disparity is that some part of the scene may be visible in only one of the images. As a result, the disparity information is not recoverable since it is not possible to find a correct match. Satoh and Ohta proposed the Stereo by Eye Array (SEA) algorithm (Satoh and Ohta, 1994, 1995) to solve this problem. In this approach, nine cameras are placed in a 3 × 3 array on a plane, with the same baseline length between neighboring cameras in both the vertical and horizontal directions. The matching algorithm uses the geometric relation among the captured images to distinguish occlusions from false targets.

1.2. Previous Work in Genetic Algorithm

The genetic algorithm (GA) proposed by John Holland (1975) is a general-purpose global optimization technique based on randomized search that incorporates some aspects of iterative algorithms. The GA is often regarded as an alternative method for solving complex optimization problems, especially combinatorial optimization problems or problems whose derivatives cannot be computed numerically. GAs and their applications remain a topic of active research (Srinivas and Patnaik, 1994). Some researchers have applied GAs to stereovision (Han et al., 2001; Saito and Mori, 1995). In the approach proposed by Saito and Mori (1995), different window sizes are used to calculate several candidate disparity values at each pixel location. The GA is then applied to select the disparity of each pixel from these candidates. To simplify the calculation, the disparity map is partitioned into 8 × 8 blocks and the GA is applied to analyze each block.


This approach is limited because it assumes that the correct disparity value is one of the candidates calculated with the different window sizes. This assumption may not hold when the images are noisy or when part of the scene has a plain color and lacks features for matching. Han et al. (2001) divide the reference image into regions using a modified nonlinear Laplace (MNL) filter. Chromosomes are then determined adaptively based on the extracted regions. Finally, disparities are calculated by selecting, according to a fitness value, among several candidate genes using a genetic algorithm. The quality of the disparity map generated by this approach depends highly on the region extraction result of the MNL filter.

1.3. Motivation of Our Approach

The approach presented in this paper is an intensity-based approach. Our motivation is to improve the accuracy of the generated disparity map by removing the mismatches caused by both occlusions and false targets. Therefore, multiple-view stereo images are used to detect occlusions, and global constraints are applied in the matching process to eliminate false targets. A problem associated with using global constraints is how to process around the boundaries. We noticed this problem when we experimented with the iterative algorithm implemented by Zitnick and Kanade, which involves the summation over a three-dimensional local support area (Zitnick and Kanade, 2000). The boundary problem not only manifests itself at the boundary of the disparity maps, where it shows up as a black border, but also introduces errors when the true disparity is close to the given minimum or maximum disparity, because the summation is also conducted in the disparity dimension. A possible solution to the second problem is to enlarge the disparity search range. However, this will increase the memory requirement, the time needed for calculation, and also the possibility of false matches. In this paper, to avoid the above-mentioned boundary problem, a novel approach is used to apply the global constraints in the matching process. Here, we borrow the idea from neighborhood-based segmentation and use Markov Random Fields (MRFs) to model the interactions between neighboring pixels. Within the MRF framework, the stereo matching process is equivalent to finding the optimum state of the MRF. Because of the Gibbs equivalence, the probability that the MRF is in a particular state can be calculated using local energies.


Consequently, a given stereo matching result can be evaluated by modeling local interactions, and the problem of finding the best stereo matching can be viewed as finding the solution to a combinatorial optimization problem.

2. Genetic-Based Stereo Matching

The outline of our algorithm is as follows. First, the three-dimensional disparity space is populated with dissimilarity values based on the source images. In the case of multi-view stereo images, the space is filled with the output of an occlusion detection function proposed in SEA. A fitness function is defined based on the MRF to evaluate a given disparity map. A genetic algorithm is used to extract the best surface from the disparity space with respect to this fitness measure. In addition, the quadtree structure is used to represent possible disparity maps. Since nodes at different levels of the quadtree contain different numbers of pixels, selecting nodes at different resolution levels gives an effect similar to adjusting the window size at different locations of the image. In the following subsections, detailed issues are discussed first, which include the initialization of the disparity space, the encoding mechanism for all possible disparity maps, the formulation of the fitness function, and the appropriate crossover and mutation operators. An outline of the optimization process is given later.

2.1. Initialization of the Disparity Space

The disparity space is initialized based on the input stereo images. If only two stereo images, I and I^ref, are available, this space is filled directly with the dissimilarity measure between corresponding pixels of I and I^ref, which is calculated by:

S(r, c, d) = e^{ref}(r, c, d) = \sum_{-w \le i, j \le w} \rho\bigl(I(r+i, c+j),\, I^{ref}(r+i, c+j+d)\bigr)    (1)

where w is the window radius, I(x, y) the color of pixel (x, y), and ρ the color dissimilarity function. The color dissimilarity function can be any function that produces low values for correct matches, such as the squared difference or the absolute difference. In this paper, the absolute difference function is used and the window radius is set to 1.
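
To make Eq. (1) concrete, the following is a minimal sketch (in Python with NumPy; not part of the original paper) of how the disparity space can be populated for a binocular pair. The function name, the handling of pixels shifted outside the image, and the use of the absolute difference are assumptions consistent with the description above.

```python
import numpy as np

def fill_disparity_space(img, img_ref, d_min, d_max, w=1):
    """Populate S[r, c, d - d_min] with the windowed absolute-difference
    dissimilarity of Eq. (1); w is the window radius (1 in the paper)."""
    rows, cols = img.shape[:2]
    S = np.empty((rows, cols, d_max - d_min + 1), dtype=np.float64)
    for di, d in enumerate(range(d_min, d_max + 1)):
        # Per-pixel color dissimilarity rho for this disparity hypothesis.
        shifted = np.full(img.shape, 255.0)          # large cost where no match exists
        if d >= 0:
            shifted[:, : cols - d] = img_ref[:, d:]
        else:
            shifted[:, -d:] = img_ref[:, : cols + d]
        rho = np.abs(img.astype(np.float64) - shifted)
        if rho.ndim == 3:                            # sum over color channels
            rho = rho.sum(axis=2)
        # Sum rho over the (2w+1) x (2w+1) window centred on each pixel.
        pad = np.pad(rho, w, mode="edge")
        win = np.zeros_like(rho)
        for i in range(-w, w + 1):
            for j in range(-w, w + 1):
                win += pad[w + i : w + i + rows, w + j : w + j + cols]
        S[:, :, di] = win
    return S
```

With S filled this way, S[r, c, d − d_min] plays the role of S(r, c, d) in the rest of the algorithm.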


If multi-view stereo images captured by a camera matrix are available, occlusion detection functions proposed in SEA can be applied. Basically, eight stereo pairs are formed using the central camera, I^{(0,0)}, and each of the eight peripheral cameras, I^{(m,n)}. For each pixel in the center image, the dissimilarity values of all eight stereo pairs are computed. The disparity space is then filled using the following function:

S(r, c, d) = \sigma\bigl(e^{(-1,-1)}(r, c, d), e^{(-1,0)}(r, c, d), e^{(-1,1)}(r, c, d), e^{(0,-1)}(r, c, d), e^{(0,1)}(r, c, d), e^{(1,-1)}(r, c, d), e^{(1,0)}(r, c, d), e^{(1,1)}(r, c, d)\bigr)    (2)

where e^{(m,n)}(r, c, d) is the dissimilarity between corresponding pixels of I^{(0,0)} and I^{(m,n)}, and σ the occlusion detection function. Different occlusion detection functions can be applied; Satoh and Ohta give a systematic comparison of different functions (Satoh and Ohta, 1996). The "sorting summation" function is used in this paper since it can detect most of the mismatches caused by occlusions. This function normally does not perform as well as some other functions in removing mismatches caused by false targets. However, our algorithm relies on the genetic-based optimization process to get rid of these mismatches.
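
The precise definition of the "sorting summation" function is given in Satoh and Ohta (1996). The short sketch below is only an assumed illustration: it sorts the eight per-pair dissimilarities at each (r, c, d) and sums the smallest few, so that viewing directions in which the point is occluded (and therefore badly matched) contribute little or nothing. The choice of keeping four values is an assumption of this sketch, not taken from the paper.

```python
import numpy as np

def sorting_summation(pair_costs, keep=4):
    """Illustrative occlusion-detection function sigma for Eq. (2).

    pair_costs: array of shape (8, rows, cols, n_disp) holding the
    dissimilarities e^(m,n)(r, c, d) of the eight centre/peripheral pairs.
    The eight costs are sorted per (r, c, d) and only the `keep` smallest
    are summed, discarding directions that are likely occluded."""
    sorted_costs = np.sort(pair_costs, axis=0)       # ascending along the pair axis
    return sorted_costs[:keep].sum(axis=0)
```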

2.2. Encoding Scheme for Disparity Maps

Fundamental to all GAs is the encoding scheme for representing the solutions of the corresponding optimization problems. Normally, the method used to encode the solutions depends on the application to which the GA is applied. In our approach, the quadtree structure is used to provide the multi-resolution scheme. Therefore, the possible solutions, i.e., the disparity maps, are encoded by encoding the corresponding quadtrees. The same encoding scheme has recently been applied to color image segmentation (Gong and Yang, 2001). The quadtrees used to represent the disparity maps must satisfy the following two constraints:

• Every leaf (i.e., a node with no child) k of the quadtree has an associated disparity value x, which implies that the disparity values of all the pixels covered by k are x. A leaf k is said to cover pixel p if k contains p.

• An interior node in the quadtree cannot have all its descendants assigned to the same disparity value; otherwise, all the descendants of the node should be removed and the node itself should become a leaf.

Now we need a way to encode all the possible quadtrees. In this paper, an array representation of a complete quadtree is used, so that we can encode different quadtrees using a one-dimensional integer string. In such a representation, the index of node k in the string can be computed by:

k = \sum_{i=0}^{h-1} 4^i + y \times 2^h + x    (3)

where h, x, and y are the height and the x- and y-coordinates of node k, respectively. The content at location k in the integer string is determined by:

D[k] = \begin{cases} x & \text{if } k \text{ is a leaf and its disparity is } x \\ -1 & \text{otherwise} \end{cases}    (4)
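
Equations (3) and (4) translate directly into code. The sketch below (an illustration, not the authors' implementation) also adds a covering-leaf lookup, which is needed repeatedly by the genetic operators; it assumes a 2^H × 2^H image and non-negative disparity labels.

```python
def level_offset(h):
    # Number of nodes on levels 0..h-1 of a complete quadtree: sum_{i<h} 4^i.
    return (4 ** h - 1) // 3

def node_index(h, x, y):
    # Eq. (3): position of node (h, x, y) in the one-dimensional string.
    return level_offset(h) + y * (1 << h) + x

def make_string(H):
    # String for a 2^H x 2^H image; -1 marks "not a leaf", as in Eq. (4).
    return [-1] * level_offset(H + 1)

def covering_leaf(D, px, py, H):
    """Return (h, x, y) of the leaf that covers pixel (px, py)."""
    for h in range(H + 1):
        x, y = px >> (H - h), py >> (H - h)
        if D[node_index(h, x, y)] >= 0:
            return h, x, y
    raise ValueError("no leaf covers this pixel; the string is malformed")

# Example: a tree whose root is a single leaf covering the whole 4 x 4 image.
D = make_string(H=2)
D[node_index(0, 0, 0)] = 7           # root is a leaf with disparity 7
assert covering_leaf(D, 3, 1, H=2) == (0, 0, 0)
```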

2.3. Fitness Evaluation

The fitness of a given chromosome, which is represented as a string, controls the evolution process. The fitter the chromosome, the greater its probability of surviving from one generation to the next. In the context of stereo matching, we need a way to evaluate different disparity maps. Here we use an energy function, which is based on the MRF, as the fitness evaluation tool. The smaller the value of the energy function, the better the disparity map. The function has the form:

f(D) = \sum_{k \in P} \left[ \sum_{(r,c) \in k} S(r, c, D[k]) + \lambda T_k \right]    (5)

where D denotes the disparity map, P the set that contains all the leaves in the quadtree, and D[k] the disparity value of leaf k. λ is the weight of the penalty term: a small value of λ favors a detailed disparity map, while a large value encourages a coarser result. T_k is the length of the boundary of leaf k, which is defined as:

T_k = \sum_{g \in P \cap Q_k} \bigl(1 - \delta(D[k] = D[g])\bigr) \times t_{k,g}    (6)

where δ(true) = 1 and δ(false) = 0, Q_k is the set that contains all the nodes that share a boundary with node k, and t_{k,g} is the length of the shared boundary between leaf k and leaf g.
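
The energy of Eqs. (5) and (6) can also be evaluated at the pixel level, which is convenient for a sketch. Every boundary edge between two leaves with different disparities is counted once in T_k and once in T_g, hence the factor of two below; this assumes t_{k,g} is measured in pixel-edge units and reuses covering_leaf and node_index from the encoding sketch above. The code is illustrative only.

```python
import numpy as np

def render_disparity_map(D, H):
    """Expand the quadtree string D into a 2^H x 2^H per-pixel disparity map."""
    side = 1 << H
    disp = np.empty((side, side), dtype=np.intp)
    for py in range(side):
        for px in range(side):
            h, x, y = covering_leaf(D, px, py, H)
            disp[py, px] = D[node_index(h, x, y)]
    return disp

def energy(D, H, S, lam):
    """Pixel-level evaluation of the fitness f(D) of Eq. (5).

    S is the disparity space, indexed as S[r, c, d]; the disparity values
    stored in D are assumed to index its third axis directly (i.e., they are
    already offset by the minimum disparity)."""
    disp = render_disparity_map(D, H)
    rows, cols = disp.shape
    rr, cc = np.indices((rows, cols))
    data_term = S[rr, cc, disp].sum()
    # Number of 4-neighbour pixel pairs whose disparities differ.
    boundary = (disp[:, 1:] != disp[:, :-1]).sum() + (disp[1:, :] != disp[:-1, :]).sum()
    return data_term + 2.0 * lam * boundary
```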


2.4. Initial Population Generation

The GA needs a number of initial disparity maps as the initial population to start with. The choice of the population size is very important. If the selected population size is too small, the algorithm may result in premature convergence without finding an appropriate solution. On the other hand, a large population size leads to a long computation time. Our experiments show that premature convergence is likely to occur when the population size is smaller than 30. In this paper, the population size is set to 40, which appears to be sufficient to avoid this problem. The initial disparity maps are generated purely randomly. First, a recursive procedure is invoked using the root of the quadtree as the input parameter. Inside the procedure, whether or not the given node k is selected as a leaf depends on a random number. If the decision is to select it as a leaf, then the best disparity, i.e., the one that minimizes the sum of the dissimilarities between all the pixels in the node and their corresponding pixels, is assigned to node k. Otherwise, the procedure invokes itself recursively on the four child nodes of k, until the bottom level is reached.
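
A minimal sketch of this recursive initialization is given below (reusing make_string and node_index from the encoding sketch). The leaf probability p_leaf and the helper best_disparity, which returns the disparity minimizing the summed dissimilarity of the pixels covered by a node, are assumptions of this sketch; the paper only states that the decision depends on a random number.

```python
import random

def random_quadtree(D, H, best_disparity, h=0, x=0, y=0, p_leaf=0.5, rng=random):
    """Fill the string D with one random initial disparity map."""
    if h == H or rng.random() < p_leaf:
        # Select this node as a leaf and give it its best disparity.
        D[node_index(h, x, y)] = best_disparity(h, x, y)
        return
    for dy in (0, 1):
        for dx in (0, 1):
            random_quadtree(D, H, best_disparity,
                            h + 1, 2 * x + dx, 2 * y + dy, p_leaf, rng)

# Building the initial population (size 40 in the paper):
# population = []
# for _ in range(40):
#     D = make_string(H)
#     random_quadtree(D, H, best_disparity)
#     population.append(D)
```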


2.5. Crossover Operator

In our scenario, it would be inefficient to apply the commonly used crossover operators, such as the two-point crossover, because they do not guarantee that the crossover results of two quadtrees will still be legal quadtrees. To address this problem, a recently proposed crossover operation for color image segmentation, called graft crossover, is applied (Gong and Yang, 2001). For completeness, details of this operator are discussed in the following. Given two strings, which represent two quadtrees, we compare them and find all the leaves that appear in only one of the two quadtrees. The crossover process is terminated if no such leaves are found. Otherwise, we randomly select one of these leaves as a seed node. For example, after comparing the two quadtrees shown in Fig. 1(a) and (b), we can pick leaf u since it appears only in the left quadtree. Then the cover node, i.e., the predecessor of the seed node that appears in both quadtrees, is determined (it is node v in our example). Finally, we swap all the nodes that are descendants of the cover node. Figure 1(c) and (d) show the results after the graft crossover at cover node v.

Figure 1. Graft crossover for quadtrees, (a) and (b) the parents; (c) and (d) the offspring.

By construction, this algorithm guarantees that the results of crossover will still be legal quadtrees. After crossover, the energies of the two quadtrees may change and one of the offspring's strings may have a lower energy than either of its parents.
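
The following is a compact sketch of graft crossover operating directly on the array encoding (reusing node_index from the encoding sketch). It is an illustration of the operator described above, not the authors' code; both strings are modified in place.

```python
import random

def leaves(D, H, h=0, x=0, y=0):
    """Set of (h, x, y) leaf nodes of the quadtree encoded in D."""
    if h == H or D[node_index(h, x, y)] >= 0:
        return {(h, x, y)}
    out = set()
    for dy in (0, 1):
        for dx in (0, 1):
            out |= leaves(D, H, h + 1, 2 * x + dx, 2 * y + dy)
    return out

def exists(D, h, x, y):
    """A node appears in the tree iff none of its proper ancestors is a leaf."""
    while h > 0:
        h, x, y = h - 1, x // 2, y // 2
        if D[node_index(h, x, y)] >= 0:
            return False
    return True

def subtree_indices(h, x, y, H):
    for hh in range(h, H + 1):
        span = 1 << (hh - h)
        for yy in range(y * span, (y + 1) * span):
            for xx in range(x * span, (x + 1) * span):
                yield node_index(hh, xx, yy)

def graft_crossover(D1, D2, H, rng=random):
    only_one = leaves(D1, H) ^ leaves(D2, H)   # leaves present in exactly one tree
    if not only_one:
        return False
    h, x, y = rng.choice(sorted(only_one))     # seed node
    # Cover node: closest predecessor of the seed appearing in both quadtrees
    # (the root always qualifies).
    while h > 0:
        h, x, y = h - 1, x // 2, y // 2
        if exists(D1, h, x, y) and exists(D2, h, x, y):
            break
    for k in subtree_indices(h, x, y, H):      # swap the two subtrees
        D1[k], D2[k] = D2[k], D1[k]
    return True
```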

2.6. Mutation Operator

The mutation operation is important for the GA since the crossover operator cannot generate offspring that have genes that do not appear in the initial population. In our approach, three mutation operations, namely splitting, merging, and alteration, are adopted (Gong and Yang, 2001). The mutation operator randomly selects a number of pixels from the original image. In our experiments, the number of pixels selected is equal to half of the perimeter of the image. For each pixel, we search for the leaf k in the quadtree that covers this pixel. One of the splitting, merging, and alteration operations is then applied with equal probability. When trying to merge leaf k with its siblings, we need to find all of its siblings first. The merging operation is inhibited if any of its siblings has children. Otherwise, the local energy is computed, and the merging operation happens only if the energy is lower after merging than before merging. When trying to split leaf k, the process is similar: the local energies are computed for the disparity maps before and after splitting, and the splitting operation occurs only when the energy decreases after splitting. When trying to alter leaf k, all the neighbors of k are found first. The alteration operation changes the label of leaf k to the label of one of its neighbors, provided this reduces the energy.
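
A simplified sketch of the mutation pass is shown below. It reuses covering_leaf and node_index from the encoding sketch and takes two assumed helpers: best_disparity(h, x, y), the disparity minimizing a node's data term, and energy(D), the fitness of Eq. (5). For brevity it evaluates the full energy and, for alteration, copies the disparity of the leaf covering one adjacent pixel; the paper instead compares local energies and considers all neighbouring leaves.

```python
import random

def mutate(D, H, best_disparity, energy, n_pixels, rng=random):
    side = 1 << H
    for _ in range(n_pixels):
        px, py = rng.randrange(side), rng.randrange(side)
        h, x, y = covering_leaf(D, px, py, H)
        op = rng.choice(("split", "merge", "alter"))
        before, backup = energy(D), list(D)
        if op == "split" and h < H:
            D[node_index(h, x, y)] = -1
            for dy in (0, 1):
                for dx in (0, 1):
                    cx, cy = 2 * x + dx, 2 * y + dy
                    D[node_index(h + 1, cx, cy)] = best_disparity(h + 1, cx, cy)
        elif op == "merge" and h > 0:
            ph, pxq, pyq = h - 1, x // 2, y // 2
            sibs = [(2 * pxq + dx, 2 * pyq + dy) for dy in (0, 1) for dx in (0, 1)]
            if any(D[node_index(h, sx, sy)] < 0 for sx, sy in sibs):
                continue               # merging inhibited: a sibling has children
            for sx, sy in sibs:
                D[node_index(h, sx, sy)] = -1
            D[node_index(ph, pxq, pyq)] = best_disparity(ph, pxq, pyq)
        elif op == "alter":
            size = 1 << (H - h)
            x0 = x * size
            nx = x0 - 1 if x0 > 0 else min(x0 + size, side - 1)
            nh, nx2, ny2 = covering_leaf(D, nx, py, H)
            D[node_index(h, x, y)] = D[node_index(nh, nx2, ny2)]
        if energy(D) >= before:
            D[:] = backup              # keep the change only if the energy drops
```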


2.7. Minimization Process

After the above issues are addressed, the GA can be implemented as an iterative procedure. When the population evolves from one generation to the next, two strings are picked randomly each time to undergo crossover and mutation, until all the strings in the population are processed. The elitist strategy (Goldberg, 1989) is applied when selecting strings for the next generation: the energy value f(D_i^m) of the best string D_i^m of generation m is compared with the energy value f(D_j^{m+1}) of the worst string D_j^{m+1} of generation m + 1. If f(D_i^m) < f(D_j^{m+1}), then string D_i^m is substituted for D_j^{m+1}. By means of this strategy, the minimum energy value of the population never increases during the process of evolution. The above process is repeated until the termination condition is satisfied. In our approach, the evolution process is terminated when the energy difference between the best string and the worst string in the population is smaller than 0.01 percent of the average energy value.
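
The overall loop can be outlined as below. The operator signatures match the earlier sketches (energy, crossover, and mutate are passed in as callables), the random pairing is an approximation of "picking two strings each time until all strings are processed", and the 1000-generation cap is a safety limit added only for this sketch.

```python
import random

def evolve(population, energy, crossover, mutate, max_generations=1000, rng=random):
    """Minimize the energy over the population with elitist replacement."""
    for _ in range(max_generations):
        elite = list(min(population, key=energy))    # best string of generation m

        # Pair the strings randomly and apply crossover and mutation.
        order = list(range(len(population)))
        rng.shuffle(order)
        for i, j in zip(order[0::2], order[1::2]):
            crossover(population[i], population[j])
            mutate(population[i])
            mutate(population[j])

        # Elitist strategy: the previous best replaces the current worst
        # whenever it has lower energy, so the minimum energy never increases.
        worst = max(range(len(population)), key=lambda k: energy(population[k]))
        if energy(elite) < energy(population[worst]):
            population[worst] = elite

        # Terminate when the energy spread falls below 0.01 % of the average.
        energies = [energy(d) for d in population]
        if max(energies) - min(energies) < 1e-4 * (sum(energies) / len(energies)):
            break
    return min(population, key=energy)
```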

3. Disparity Map Evaluation

Our genetic-based stereo matching algorithm contains only one parameter, which is λ. As mentioned above, a small value of λ favors a detailed disparity map and a large value encourages a coarser result. On the one hand, this feature gives the user the flexibility to fine-tune the stereo matching results. On the other hand, the user has to face the problem of choosing a suitable value of λ for a given dataset. To alleviate this problem, in this section we discuss evaluation techniques that can be used to automatically pick the best disparity map.

3.1. Existing Evaluation Methodology

Two different approaches have been used previously to measure the quality of the generated disparity maps. The first is to compare them with ground truth depth maps, and the second is to measure novel view prediction errors (Szeliski and Zabih, 1999). The first approach calculates the correct matching rate of the generated disparity maps. It assumes that the correct depth information of the scene is available. This evaluation methodology can be used when the scene is a synthetic one, or when equipment such as a laser range scanner can be used to survey the scene. However, in general, ground truth depth maps are difficult to obtain. Furthermore, since the correct matching rate is a global measure, a noisy disparity map may have a higher correct matching rate than a visually better disparity map. The second approach tests how well the disparity maps can predict novel views. It is a weaker requirement, since incorrect disparity maps can also predict correct novel views. This methodology is appropriate when the stereo results are to be used for image-based rendering. Nevertheless, it requires additional views of the scene. In addition, since an image-based rendering algorithm is involved in generating the novel views, that algorithm may also introduce errors into the measurement results.

3.2. The Parameter-Free Measures

To pick the best one among the generated disparity maps, we need to evaluate them without additional knowledge of the scene, such as depth information or novel views. Therefore, neither of the approaches listed above can be applied. We notice the similarity between the areas of image segmentation and stereovision. In the former, the desired segmentation should be smooth and should have a low color error. In the latter, the preferred disparity map should also be smooth and should produce low color dissimilarities between matching pixels. Based on these observations, we propose to adopt parameter-free measures from the image segmentation area to evaluate disparity maps. Liu and Yang propose a parameter-free measure for image segmentation (Liu and Yang, 1994), which is defined as:

F(I) = \frac{\sqrt{R}}{10^3 (M \times N)} \sum_{i=1}^{R} \frac{e_i^2}{\sqrt{A_i}}    (7)

where R is the number of regions and M × N the image size. A_i and e_i are the area and the error of the ith region, respectively. The error of a region is defined as the sum of the Euclidean distances between the color vectors of the original image and the segmented image for each pixel in the region.


Recently, Borsotti et al. (1998) give two enhanced functions that correspond more closely to visual judgment. These two measures are defined as:

P(I) = \frac{1}{10^4 (M \times N)} \sqrt{\sum_{A=1}^{Max} [R(A)]^{1+1/A}} \times \sum_{i=1}^{R} \frac{e_i^2}{\sqrt{A_i}}    (8)

Q(I) = \frac{\sqrt{R}}{10^4 (M \times N)} \sum_{i=1}^{R} \left[ \frac{e_i^2}{1 + \log A_i} + \left( \frac{R(A_i)}{A_i} \right)^2 \right]    (9)

where R(A) is the number of regions having exactly area A, and Max is the area of the largest region. We notice that, for image segmentation, the above measures do not sufficiently penalize regions that have a large color error. Therefore, we propose a new measure by modifying the "error of the region" term in Eq. (7). The new criterion is defined as (Gong and Yang, 2001):

G(I) = \frac{\sqrt{R}}{10^6 (M \times N)} \sum_{i=1}^{R} \frac{E_i^2}{\sqrt{A_i}}    (10)

where E_i is defined as the sum of the squared Euclidean distances between the color vectors of the original image and the segmented image for each pixel in the region. To apply the above measures in the stereo matching field, we need to decide how to define regions and how to define the error of different regions. In image segmentation applications, two pixels are classified into the same region as long as they have the same label and there exists a 4-connected path on which all the pixels also have the same label. We notice that this definition is not suitable for stereovision applications. For example, according to this definition, 37 regions exist in the "head and lamp" ground truth image; the background alone is broken into 21 regions because of the occlusions by foreground objects. To avoid this problem, we classify two pixels into the same region as long as they have the same disparity value and there exists a 4-connected path on which all the pixels have the same or higher disparity values. Using the new definition for regions, only 8 regions exist in the same ground truth image. In image segmentation, the error is defined according to the color difference between the original image and the segmented image for each pixel in the region.


For stereo matching, we can use different color dissimilarity functions to define the errors. Here we use the same function as the one used to fill the disparity space. This simplifies the error calculation since we can look up the disparity space directly. Therefore, we have:

e_i = \sum_{(r,c) \in R_i} S(r, c, D(r, c)), \qquad E_i = \sum_{(r,c) \in R_i} \bigl(S(r, c, D(r, c))\bigr)^2    (11)

where D(r, c) is the disparity value of pixel (r, c) in the disparity map.

Obviously, a suitable measure for stereovision should satisfy the following requirements:

• It should give the best evaluation to the ground truth.
• For two disparity maps that have a similar number of regions, it should give a better evaluation to the one that has a higher correct matching rate.
• For two disparity maps that have a similar correct matching rate, it should give a better evaluation to the one that is less noisy, i.e., the one with fewer regions.

The above parameter-free measures are evaluated in Section 4.4 based on these principles.
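
To illustrate how the region definition above and the F and Q measures can be computed for a disparity map, here is a short sketch (using NumPy and SciPy; the helper names are ours, and the disparity values are assumed to index the third axis of the disparity space S directly). The natural logarithm is used in Q; the base only rescales one term.

```python
import numpy as np
from scipy import ndimage

def stereo_regions(disp):
    """Label regions: two pixels belong to the same region iff they have the
    same disparity and are joined by a 4-connected path of pixels whose
    disparities are the same or higher."""
    labels = np.zeros(disp.shape, dtype=np.int32)
    n_regions = 0
    for d in np.unique(disp):
        comp, n = ndimage.label(disp >= d)           # 4-connected components of {>= d}
        for c in range(1, n + 1):
            mask = (comp == c) & (disp == d)
            if mask.any():
                n_regions += 1
                labels[mask] = n_regions
    return labels, n_regions

def f_and_q(disp, S):
    """F (Eq. 7) and Q (Eq. 9) with region errors e_i taken from Eq. (11)."""
    labels, R = stereo_regions(disp)
    rows, cols = disp.shape
    rr, cc = np.indices((rows, cols))
    pixel_cost = S[rr, cc, disp]
    areas = np.array([(labels == k).sum() for k in range(1, R + 1)], dtype=float)
    errors = np.array([pixel_cost[labels == k].sum() for k in range(1, R + 1)])
    F = np.sqrt(R) / (1e3 * rows * cols) * np.sum(errors ** 2 / np.sqrt(areas))
    R_of_A = np.array([(areas == a).sum() for a in areas], dtype=float)   # R(A_i)
    Q = np.sqrt(R) / (1e4 * rows * cols) * np.sum(
        errors ** 2 / (1.0 + np.log(areas)) + (R_of_A / areas) ** 2)
    return F, Q
```

Given several candidate maps produced with different values of λ, the one with the smallest F or Q value would then be selected.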

4. Experimental Results

Four datasets are used to test the presented algorithm. Section 4.1 describes the datasets used. Section 4.2 shows the disparity maps generated using our approach and two other existing approaches. Section 4.3 compares the accuracy of the results quantitatively. The parameter-free measures are tested in Section 4.4.

4.1. Data Sets

Figure 2 shows the first dataset, "head and lamp", which is a multi-image stereo set from the University of Tsukuba. It contains nine color images and is used in Satoh and Ohta (1995). The ground truth is available for the center image, in which every pixel has been labeled by hand. However, during our experiments we found that this ground truth is not perfect. The following changes are made to bring it closer to the true disparities:

• A border is added to make the ground truth the same size as the original images.
• Two errors are corrected in the area around the books (enclosed by two rectangles).
• The disparity in the head area is simplified, since these tiny regions are very likely caused by noise.


Figure 2. "Head and lamp" multi-view stereo (a) center image (b) ground truth (c) updated ground truth.

Figure 3. "Plant" multi-view stereo (a) center image (b) left (reference) image (c) top (reference) image.

After the above changes, the number of regions in the ground truth drops from 15 to 8. The average dissimilarity over the different regions also drops by 0.15% to 7.176. These results justify our changes. The second dataset, "plant", is also a multi-view stereo set from the University of Tsukuba captured by the camera matrix. As shown in Fig. 3, many occlusions exist in this dataset since the scene is very complex and the length of the baseline is quite large. The third dataset, "slanted plane", is a binocular stereo pair containing two grayscale images. It was created by Szeliski and Zabih (1999) to test the performance of stereo algorithms on slanted surfaces. The scene, together with the ground truth, is shown in Fig. 4, where a grid pattern is used in the ground truth to mark areas that are not visible in the reference image. Figure 5 shows the last dataset, "random dot stereogram". It is a synthetic stereo image pair with 50 percent density and is used in Zitnick and Kanade (2000). Again, the grid pattern areas shown in the ground truth are occluded.

Figure 4. "Slanted plane" binocular stereo (a) left (reference) image (b) right image (c) ground truth.


4.2. Disparity Maps Generated

Figure 5. "Random dot stereogram" (a) reference (left) image (b) right image (c) ground truth.

The SEA approach (Satoh and Ohta, 1995) and the cooperative algorithm (Zitnick and Kanade, 2000) are used for comparison. All the cooperative algorithm results shown below are generated using the "ZK Stereo" program written by Zitnick (2000). The default values are used for most of the parameters. In particular, the window radius for the initial match is set to 1, the local support radius for the row and column dimensions is set to 2, and the local support radius for the disparity dimension is set to 1. The values for the minimum and maximum disparity depend on the particular stereo dataset. In order to make sure that the algorithm converges, the number of iterations is changed to 15, which is the upper bound according to the program's documentation (Zitnick, 2000). The program provides two different ways to compute the initial match values.

The sum-of-absolute-differences measure is used because it gives better results and also because we use the same measure to fill the disparity space when only two stereo images are available. Figure 6 shows the experimental results for the "head and lamp" dataset. The true disparity range for this stereo dataset is [5, 14]. As shown in Fig. 6(a)–(c), given different values of λ, our approach gives disparity maps of different coarseness. The results of using the SEA algorithm with different window sizes are shown in Fig. 6(d) and (e). Figure 6(f) is generated by the cooperative algorithm, which is quite good considering that it uses only the two original images.1 However, in order to generate this result, we have to enlarge the disparity range to [2, 17].

Figure 6. "Head and lamp" multi-view stereo (a) our approach (λ = 1) (b) our approach (λ = 4) (c) our approach (λ = 14) (d) SEA (3 × 3 window) (e) SEA (5 × 5 window) (f) cooperative algorithm.


Figure 7. “Plant” multi-view stereo (a) our approach (λ = 1) (b) SEA (3 × 3 window) (c) SEA (5 × 5 window) (d) cooperative algorithm (disparity [12, 51]).

Poor results are produced if a tighter bound is used. Since the disparity range is expanded by 60 percent, it is reasonable to believe that the computation time and the space required also increase by 60 percent. Figure 7 shows the experimental results for the "plant" dataset. The true disparity range for this stereo dataset is [15, 48]. Figure 7(b) shows that the SEA approach can detect most of the occlusions. However, many mismatches caused by false targets still exist; for instance, incorrect depth variations show up in the rectangle region. Figure 7(c) shows that these mismatches cannot be corrected by simply increasing the window size. As shown in Fig. 7(a), our approach successfully removes these mismatches. The result generated by the cooperative algorithm, shown in Fig. 7(d), looks smooth. However, closer inspection of the picture shows that many details are lost. For example, the leaves are broken, "tears" show up in the rectangle area, and part of the background is covered in the oval area. These problems might be caused by occlusions, which are not recoverable using only two images.

Figure 8 shows the experimental results for the "slanted plane" dataset. The true disparity range for this stereo dataset is [4, 28]. Although mismatches caused by occlusions exist in the results of our approach, compared with the result of the local matching approach with the same window size (Fig. 8(d)), our results show that increasing the value of λ can effectively reduce mismatches caused by false targets. Figure 8(e) shows the result of the cooperative algorithm using an enlarged disparity range. Since interpolation is applied, the result looks smooth. However, after we remove the smoothing effect by rounding all the disparity values to the closest integers, as shown in Fig. 8(f), small errors start to show up. Figure 9 shows the experimental results for the synthetic scene. The true disparity range for this stereo dataset is [0, 20]. The result of our approach shows that, in areas without occlusion, most of the disparities given are accurate. Figure 9(b) shows the boundary problem of the cooperative algorithm when a tight disparity range is given. Figure 9(c) shows that even when the disparity range is increased to [−3, 23], the brightest vertical bar still does not show up. This might be caused by the boundary problem, since this area has the highest disparity.


Figure 8. "Slanted plane" binocular stereo (a) our approach (λ = 1) (b) our approach (λ = 4) (c) our approach (λ = 14) (d) local matching (3 × 3 window) (e) cooperative algorithm (disparity [1, 31]) (f) cooperative algorithm after removing the smoothing effect.

Figure 9. “Random dot stereogram” (a) our approach (λ = 15) (b) cooperative algorithm (disparity [0, 20]) (c) cooperative algorithm (disparity [−3, 23]).

4.3. Accuracy Comparison

The "head and lamp" and "slanted plane" datasets are real scenes and have the ground truth. Therefore, they are used to numerically evaluate our approach against the SEA algorithm and the cooperative algorithm. Table 1 gives the comparison of the SEA algorithm and our approach for the "head and lamp" dataset.

Table 1. Comparison for the "head and lamp" dataset.

                                             SEA (3 × 3 window)   SEA (5 × 5 window)   Our approach (λ = 1)   Our approach (λ = 4)   Our approach (λ = 14)
% of correct disparities                     77.9215              82.2709              85.7033                88.859                 89.2207
% of the disparities within truth value ±1   92.589               94.7275              97.6119                97.1499                96.9555
# of regions                                 2226                 801                  51                     28                     17
Average dissimilarity                        6.67499              6.81076              6.78444                6.82959                6.9256


The result shows that our approach gives a higher correct matching rate than the SEA algorithm even when a small value of λ is given. In addition, the disparity maps generated by our approach have much fewer regions than those generated by the SEA algorithm. The result also shows that when the value of λ increases, the number of regions drops and the average dissimilarity increases. This complies with our expectations. Table 2 gives the comparison of the cooperative algorithm and our approach for the "slanted plane" dataset. The correct matching rate of the cooperative algorithm is calculated based on the disparity output of the "ZK Stereo" program. The result shows that our approach gives more accurate results when λ ≥ 4. The average dissimilarities of our results are smaller than that of the cooperative algorithm for all values of λ. The result also shows that the number of regions drops to 61 when λ = 14. This number can be as low as 23 if we do not count the regions within the occluded area.


Table 2. Comparison of accuracy for the "slanted plane" dataset (in areas without occlusion only).

                                             Cooperative algorithm   Our approach (λ = 1)   Our approach (λ = 4)   Our approach (λ = 14)
% of correct disparities                     69.1316                 75.4092                81.4174                86.6738
% of the disparities within truth value ±1   97.2354                 96.3646                98.3563                99.4789
# of regions                                 176                     478                    179                    61
Average dissimilarity                        33.4677                 26.0137                26.6831                27.6075

4.4. Parameter-Free Measures

Here, the "head and lamp" and "slanted plane" datasets are used to test the different parameter-free measures.

Figure 10. Relations between different measures and correct matching rate for the "slanted plane" dataset.




Figure 11. Relations between different measures and correct matching rate for the "head and lamp" dataset.

For both datasets, our approach is used to generate 30 different disparity maps, with the value of λ varying from 0.5 to 15 in intervals of 0.5. For the "head and lamp" dataset, the SEA algorithm is used to generate 5 disparity maps with the window size varying from 1 × 1 to 9 × 9. For the "slanted plane" dataset, since only two images are available, the local matching approach is used instead, with the window size varying from 1 × 1 to 9 × 9. Figure 10 shows the relations between the different measures and the correct matching rate for the "slanted plane" dataset. It shows that measures F, P, and Q give the best evaluation to the ground truth. The F measure and Q measure tend to give better evaluations to the disparity maps that have higher correct matching rates. The reason that the G measure does not give the best evaluation to the ground truth may be that it penalizes regions with large dissimilarities too much; although it is suitable for image segmentation, it seems not suitable for stereovision applications.

Figure 11 shows the same relations for the "head and lamp" dataset. Since the updated ground truth has fewer regions and a smaller average dissimilarity, all four measures give it a better evaluation than they give to the original ground truth. The results also show that the measures F, P, and Q give the updated ground truth the best evaluation. In addition, when two disparity maps have a similar correct matching rate, all four measures favor the disparity maps generated by our approach over those generated by the SEA algorithm, since our results have fewer regions. The above results suggest that the F measure and Q measure are suitable evaluation tools for disparity maps. Therefore, the user can choose to generate a set of results using our algorithm with different values of λ, and use either the F or the Q measure to pick the best one. The disparity map that is picked tends to have a higher correct matching rate than the others.


However, in practice, we sometimes find that the disparity map picked may not be visually the best. For instance, in the "head and lamp" dataset, both the F and Q measures give better evaluations to the disparity map shown in Fig. 6(c) than to the other two results, although the disparity map shown in Fig. 6(b) is visually the best.

5. Conclusions

In this paper, we proposed a novel genetic-based stereo matching algorithm. The algorithm can be applied both to binocular stereo data and to multi-view stereo data captured by a camera matrix. When multi-view stereo images are available, they can be fully utilized to detect occlusions and to improve accuracy. A fitness function, which considers both intensity similarity and disparity smoothness, is introduced to evaluate a given disparity map. The genetic algorithm is used to find the best disparity map with respect to the fitness function. The quadtree structure is used to implement the multi-resolution framework, which gives an effect similar to adjusting the window size at different locations of the image. To apply the genetic algorithm under the quadtree structure, an encoding mechanism for the quadtree structure is used. Graft crossover, splitting mutation, merging mutation, and alteration mutation, which are suitable for the quadtree structure, are applied.

Compared with the SEA algorithm, our approach is more flexible since it can also work with binocular stereo datasets and produces acceptable results. Our approach adopts the idea of detecting occlusions using the geometric relation among the captured images. However, unlike the SEA algorithm, rather than using the output of the occlusion detection function to determine disparities directly, our approach uses this output to fill the disparity space, which makes it possible to remove further mismatches.

Compared with the iterative cooperative algorithm, our approach uses the MRF framework to apply the global constraints in the matching process. The boundary problem is avoided since no summation is needed in the process. In addition, our approach can fully utilize multi-view stereo images to remove mismatches caused by occlusions. The cooperative algorithm cannot be applied to multi-view stereo directly since the uniqueness assumption it uses does not hold; in other words, it is possible that more than one match is projected to the same pixel in one of the reference images.

To the best of our knowledge, we are the first to introduce segmentation evaluation techniques into the stereovision area. Compared with two existing disparity evaluation approaches, the new technique can evaluate the disparity maps generated without additional knowledge of the scene, such as correct depth information or novel views. These parameter-free measures make it possible to automatically pick the best disparity map from the results generated by our algorithm with different values of λ. However, in practice, we find that the measures tend to pick the disparity map that has a higher correct matching rate, rather than the one that is visually better. The appropriate measure deserves further investigation.

The computational costs of genetic-based algorithms are relatively high. Our experiments show that the new algorithm takes about 500–800 generations to converge. However, since the dissimilarity values are precalculated and stored in the disparity space, the computation time is reduced. On our 1.5 GHz Pentium 4 computer with 256 MB RAM running Windows 2000, less than two minutes are needed to calculate the disparity map for the "head and lamp" example, which is a 384 × 288 multi-view stereo set.

In summary, the proposed algorithm naturally integrates the ideas of occlusion detection using a camera matrix, matching with an adaptive window size, and incorporating support from neighboring pixels. The experimental results show that the algorithm can generate better disparity maps than two existing approaches. Future research in this direction is warranted.

Acknowledgments

The authors would like to acknowledge financial support from NSERC and the University of Alberta. The authors would also like to thank Y. Ohta of the University of Tsukuba for supplying the "head and lamp" and "plant" stereo images, R. Szeliski of Microsoft Research for the "slanted plane" stereo images, and L. Zitnick of Carnegie Mellon University for the "random dot stereogram" stereo images. We also thank L. Zitnick for sharing his software on the web.

Note

1. The cooperative algorithm cannot be adjusted to make use of multi-view stereo images because of the uniqueness assumption it uses.


References

Borsotti, M., Campadelli, P., and Schettini, R. 1998. Quantitative evaluation of color image segmentation results. Pattern Recognition Letters, 19(8):741–747.
Chen, Q. and Medioni, G. 1999. Volumetric stereo matching method: Application to image-based modeling. Conference on Computer Vision and Pattern Recognition, vol. 1, IEEE Computer Society, Fort Collins, CO, USA, pp. 29–34.
Goldberg, D.E. 1989. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley: Reading, MA.
Gong, M. and Yang, Y.-H. 2001. Genetic-based multiresolution color image segmentation. In Vision Interface, Canadian Image Processing and Pattern Recognition Society: Ottawa, Ontario, Canada, pp. 71–80.
Grimson, W.E.L. 1981. A computer implementation of a theory of human stereo vision. Philosophical Transactions of the Royal Society of London, B 292:217–253.
Grimson, W.E.L. 1985. Computational experiments with a feature based stereo algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 7(1):17–34.
Han, K.-P., Song, K.-W., Chung, E.-Y., Cho, S.-J., and Ha, Y.-H. 2001. Stereo matching using genetic algorithm with adaptive chromosomes. Pattern Recognition, 34(9):1729–1740.
Hoff, W. and Ahuja, N. 1989. Surfaces from stereo: Integrating feature matching, disparity estimation, and contour detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(2):121–136.
Holland, J.H. 1975. Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control and Artificial Intelligence. University of Michigan Press: Ann Arbor, MI, USA.
Intille, S.S. and Bobick, A.F. 1994. Disparity-space images and large occlusion stereo. European Conference on Computer Vision, Stockholm, Sweden, May 2–6, pp. 179–186.
Kanade, T. and Okutomi, M. 1994. Stereo matching algorithm with an adaptive window: Theory and experiment. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(9):920–932.


Levine, M.D., O'Handley, D.A., and Yagi, G.M. 1973. Computer determination of depth maps. Computer Graphics and Image Processing, 2(4):131–150.
Liu, J. and Yang, Y.H. 1994. Multiresolution color image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(7):689–700.
Okutomi, M. and Kanade, T. 1993. A multiple-baseline stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(4):353–363.
Saito, H. and Mori, M. 1995. Application of genetic algorithms to stereo matching of images. Pattern Recognition Letters, 16(8):815–821.
Satoh, K. and Ohta, Y. 1994. Passive depth acquisition for 3D image displays. IEICE Transactions on Information and Systems, E77D(9):949–957.
Satoh, K. and Ohta, Y. 1995. Occlusion detectable stereo using a camera matrix. Asian Conference on Computer Vision, International Association for Pattern Recognition, Singapore, Dec. 5–8, 1995, pp. 331–335.
Satoh, K. and Ohta, Y. 1996. Occlusion detectable stereo—systematic comparison of detection algorithms. International Conference on Pattern Recognition, vol. 1, International Association for Pattern Recognition, Los Alamitos, CA, USA, Aug. 25–29, 1996, pp. 280–286.
Srinivas, M. and Patnaik, L.M. 1994. Genetic algorithms: A survey. Computer, 27(6):17–26.
Szeliski, R. and Zabih, R. 1999. An experimental comparison of stereo algorithms. International Workshop on Vision Algorithms, Kerkyra, Greece, Sept. 21–22, 1999, pp. 1–19.
Yang, Y., Yuille, A., and Lu, J. 1993. Local, global, and multilevel stereo matching. Conference on Computer Vision and Pattern Recognition, IEEE, New York City, NY, USA, June 15–17, 1993, pp. 274–279.
Zitnick, L.C. 2000. Software "ZK Stereo," available at http://www.cs.cmu.edu/~clz/stereo.html.
Zitnick, L.C. and Kanade, T. 2000. A cooperative algorithm for stereo matching and occlusion detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(7):675–684.