A Novel Algorithm for Automatic 3D Model-based Free-form Object Recognition

Ajmal S. Mian, M. Bennamoun and R. A. Owens
School of Computer Science and Software Engineering
The University of Western Australia
35 Stirling Highway, Crawley, WA 6009, Australia
{ajmal, bennamou, robyn}@csse.uwa.edu.au

Abstract – Recognition of free-form objects from unknown viewpoints is a difficult task, especially in the presence of occlusions. In this paper we address this problem and present a novel algorithm for automatic 3D model-based free-form object recognition. We use a robust tensor-based representation for matching surface patches from the scene with a model library. A 4D hash table is built from the tensors during the offline phase in order to make the online matching efficient. Our algorithm can accurately identify objects in occluded environments and calculate their pose while matching the scene and the model at a very low resolution. Preliminary experiments on a library of real objects show that our algorithm is applicable to free-form objects, accurate, efficient and robust to occlusions.

Keywords: 3D object recognition, free-form objects, 3D object representation, matching.

1 Introduction

Recognition of free-form objects in 3D is an important problem in computer vision. It has a large number of applications including robot vision, industrial quality control, automated search in hazardous environments and automated surgery. One paradigm for recognition is model-based recognition, in which the 3D models (see Fig. 1(c)) of objects of interest are saved in a model library during an offline phase. During the online recognition phase, a view of a scene object (see Fig. 1(a) and (b)) is matched against the model library in order to find the identity and pose (location and orientation) of the scene object. 3D object recognition is an active area of research and reliable recognition algorithms are still being sought. Existing techniques have various limitations, mainly in terms of their applicability to free-form objects, accuracy, efficiency and robustness to occlusions. The following is a review of some related work with respect to these criteria. Johnson's spin image matching [9] is sensitive to the resolution and sampling of the models and the scene.


Figure 1: (a) A point cloud of a 2.5D view of the isis. (b) A surface-fitted 2.5D view of the isis. (c) A complete 3D model of the isis.

The spin image representation is non-unique because it maps a 3D surface onto a 2D histogram. This results in many ambiguous matches, which must be passed through a number of filtration stages to prune the incorrect ones. The matching time in [9] also grows linearly with the size of the model library. An improvement to this technique [2] overcomes the sensitivity of spin images to resolution and sampling; however, the two remaining problems, the non-uniqueness of the representation and the inefficiency of the algorithm, were left unsolved. COSMOS [5] requires the calculation of principal curvatures, which are not only sensitive to noise but also require the underlying surface to be smooth and twice differentiable. Moreover, it cannot be used for the recognition of occluded objects. The splash representation [15] makes assumptions about the shape of the objects and is also sensitive to noise and occlusions. Recognition techniques based on matching B-spline curves, such as [4][17], remain sensitive to the unsolved knot problem: the B-spline representation is not unique, since the location of the knots cannot be accurately calculated and is not repeatable for a given object. SAI matching [8] is limited to surfaces which are free of topological holes, and is inefficient. The derivation of point signatures in [3] is vulnerable to surface sampling and noise and may therefore result in ambiguous representations. CEGI-based techniques, such as [11], cannot be applied to free-form objects. Recognition based on HOT curves [10] relies on the accurate localization of inflection points, which are sensitive to noise.

The recognition of 3D objects from 2D images is an appealing approach due to the wide availability of low cost digital cameras. However, most of these techniques, such as [18], are based on a number of assumptions about the type and shape of the 3D objects, and they are also sensitive to illumination, shadows and occlusions. Moreover, the availability of low cost, rapid range image acquisition systems [19] removes the main advantage of these 2D image based recognition techniques.

In this paper, we present a novel algorithm (Australian Patent Application No. 2004902436) for automatic 3D model-based free-form object recognition based on our robust tensor representation [12], designed to overcome the deficiencies of the existing algorithms. The combination of the strengths of our representation [12] (modified and adapted in this context to represent the whole object, as opposed to a single view as in [12]) and the appropriate use of a 4D hash table for matching constitute the basic ingredients of this novel 3D recognition algorithm. Our algorithm is fully automatic and requires no user intervention at any stage, including the offline 3D modeling, the online object identification and the pose estimation.

Briefly, our algorithm proceeds as follows. During an offline phase, the 3D models of the objects of interest are stored in a model library along with their tensor representations. In a tensor representation, local surface patches of the models are represented by third order tensors (details in Section 2.1). A 4D hash table is built from these tensors for efficient matching (Section 2.2). During the online recognition phase, the tensors of the scene object are extracted and efficiently matched with the tensors of the models in the library using a voting scheme (Section 3). Matching these tensors amounts to matching the local representations of the surface patches of the scene and model objects, and hence results in the recognition of the scene object. Matching tensors are also used for the 3D registration of the model with the scene to calculate the pose of the identified object. We performed our experiments on a model library of seven real objects, and our results show that our algorithm is applicable to free-form objects, accurate, efficient and robust to occlusions (Section 4).

2 Offline Representation Phase

During the offline phase, 3D models of the objects of interest are stored in a model library along with their tensor representations. Our algorithm can accurately recognize objects in the scene at a very low resolution (see Section 4). This is an attractive feature, as it allows the use of cheap low resolution scanning sensors. An additional advantage is that the models can be stored in the model library at low resolution, hence reducing memory utilization (see Fig. 2). Low resolution models can be obtained in two ways. First, the objects can be modeled at a low resolution using a low resolution range scanner. Second, if high resolution 3D models of the objects are available, a mesh reduction algorithm can be used to reduce their resolution. Since we had already modeled some objects at high resolution using our automatic 3D modeling algorithm [13][14], we used Garland's mesh simplification algorithm [7] to reduce their resolution to 600 faces per model. The original number of faces of each model is written above it in Fig. 2.

Figure 2: The 3D model library. The high resolution models (first and third rows) were constructed using our automatic 3D modeling algorithm [13][14]; the number of faces in each high resolution model is written above it. A mesh reduction algorithm [7] was applied to simplify these models to 600 faces per model (second and fourth rows). Only the low resolution models, along with their tensor representations, were stored in the model library for the purpose of recognition.

The tensor representation of the 3D models and the construction of the hash table are also performed during the offline phase. Our automatic correspondence algorithm in [12] gives a detailed description of the tensor computation for a single 2.5D view (see Fig. 1(a) and (b)) of an object. In this paper we present a variant of the same tensor representation to build the representation of complete 3D models (see Fig. 1(c)); complete model representations are indispensable for achieving object recognition from any viewpoint. A brief description of the hash table construction is also necessary in order to fully understand our automatic 3D object recognition algorithm. Sections 2.1 and 2.2 briefly describe the tensor computation and the construction of the hash table, respectively.
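As an aside, the mesh reduction step described above is straightforward to reproduce with modern open-source tools. The following is a minimal sketch using the Open3D library's quadric decimation (the same technique as Garland's algorithm [7], though not the tool used in this work); the file names are hypothetical.

```python
# Minimal sketch: simplifying a high-resolution model to 600 faces with
# quadric error decimation (the technique of [7]). Uses the open-source
# Open3D library, not the original CMU tool; file names are hypothetical.
import open3d as o3d

mesh = o3d.io.read_triangle_mesh("balljoint_high_res.ply")  # hypothetical input
low = mesh.simplify_quadric_decimation(target_number_of_triangles=600)
low.compute_vertex_normals()  # per-vertex normals, needed later for pairing
o3d.io.write_triangle_mesh("balljoint_600.ply", low)
print(len(low.triangles))     # should be close to 600
```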

2.1 Building the Tensor Representation

In order to compute the tensors of a model or a scene, it is first converted to a triangular mesh (see Fig. 3(a)) and a normal is calculated for each vertex of the mesh. The vertices of the mesh are then paired such that the two vertices in any pair satisfy the following two constraints. First, the distance between them must lie between dmin and dmax. Second, the angle between their normals must be no more than 60°. These constraints have two advantages. First, they ensure that the vertices in any pair are visible from a single viewing angle. Second, they avoid the C(n, 2) combinatorial explosion of vertex pairs (where n is the total number of vertices in the mesh). dmin should be selected large enough that the tensor computation is not sensitive to noise. In our experiments we set dmin equal to twice, and dmax equal to three times, the average resolution of the models. The vertices in each pair are then used to define a local 3D basis over the surface of the mesh as follows. The midpoint between the two vertices defines the origin. The average of the normals of the two vertices defines the z-axis, and their cross product defines the x-axis. Finally, the cross product of the z-axis with the x-axis defines the y-axis. This 3D basis is used to define a 3D grid centered at the origin (see Fig. 3). Two parameters must be chosen: the number of bins in the grid and the size of each bin. Choosing more bins encloses more of the surface inside the grid, while the size of an individual bin governs the level of granularity at which the surface is represented. In our experiments we selected a 10×10×10 grid on the basis of extensive surface matching experiments [13], and set the bin size equal to half the mean resolution of the model meshes in the library. Once the 3D grid is defined, the surface area of the mesh crossing each bin of the grid is computed and stored in a third order tensor: each element of the tensor is the mesh surface area present inside its corresponding bin of the 3D grid (Fig. 3(b)). The area of intersection of the mesh with the grid bins is calculated using the Sutherland-Hodgman polygon clipping algorithm [6]. A polygon inside the grid contributes to the computation of a tensor only if its normal makes an angle of less than 90° with the z-axis of the grid. A pair of vertices can be ordered in two ways; therefore, a 3D basis and a corresponding tensor are calculated for each ordering of the pair. Most of the bins in the 3D grid are likely to be empty, resulting in many zero entries in the corresponding tensors.

These tensors are therefore reduced to sparse arrays, which cuts memory utilization by approximately 85%. Moreover, tensors with more than 95% zero elements are discarded, since they are unlikely to give a correct match.

Figure 3: (a) A mesh of the balljoint model. (b) A 10×10×10 grid defined over the surface of the balljoint. The triangles shown are those which contribute toward the corresponding tensor.
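To make the pairing and basis construction concrete, the following is a minimal sketch in Python/NumPy of the two pairing constraints and the local 3D basis described above, assuming vertices and unit normals are given as NumPy arrays; the helper names are our own, and the area-accumulation step (polygon clipping against the grid bins) is omitted.

```python
# Sketch of the vertex pairing constraints and the local 3D basis
# described above. Vertices and unit normals are NumPy arrays; dmin and
# dmax follow the paper (2x and 3x the average mesh resolution).
import numpy as np

def valid_pair(p1, n1, p2, n2, dmin, dmax, max_angle_deg=60.0):
    """The two pairing constraints: distance in [dmin, dmax] and
    normals within 60 degrees of each other."""
    d = np.linalg.norm(p2 - p1)
    if not (dmin <= d <= dmax):
        return False
    cos_a = np.clip(np.dot(n1, n2), -1.0, 1.0)
    return np.degrees(np.arccos(cos_a)) <= max_angle_deg

def local_basis(p1, n1, p2, n2):
    """Origin and 3x3 basis (rows: x, y, z) from an ordered vertex pair.
    Reversing the pair order flips the x- and y-axes, yielding the second
    basis (and tensor) for the pair. Degenerate when the normals are
    parallel; such pairs can be skipped."""
    origin = (p1 + p2) / 2.0            # midpoint defines the origin
    z = n1 + n2
    z /= np.linalg.norm(z)              # average of the normals -> z-axis
    x = np.cross(n1, n2)
    x /= np.linalg.norm(x)              # cross product of normals -> x-axis
    y = np.cross(z, x)                  # z cross x completes the basis
    return origin, np.vstack([x, y, z])

# The 10x10x10 grid is then centered at the origin and aligned with this
# basis; the mesh area clipped against each bin (Sutherland-Hodgman [6])
# fills the corresponding element of the third order tensor.
```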

2.2 Hash Table Construction

The tensors of the models are used to fill a 4D hash table. Three dimensions of the hash table correspond to the x, y, z indices of the tensor elements, whereas the fourth dimension is defined by the definition angle θdef of the tensors. We define θdef of a tensor as the angle between the normals of the pair of vertices used to construct the 3D grid from which that tensor is computed. θdef is quantized into bins of width ∆θdef. Choosing a smaller ∆θdef reduces the number of possible matches for a tensor, but also increases the risk of missing a correct match due to noise and sampling errors in θdef. During our multiview correspondence experiments [14] we found that ∆θdef = 5° gives good results. The 4D hash table is filled as follows: for each tensor of every model, an entry of the tuple (tensor number, model number) is made in all the bins of the hash table corresponding to the x, y, z indices of the non-zero elements of the tensor and to the θdef of the tensor.
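A minimal sketch of this construction follows, assuming each tensor has already been reduced to a sparse dictionary mapping non-zero (x, y, z) bin indices to surface areas; the data layout and helper names are illustrative, not the paper's implementation.

```python
# Sketch of the 4D hash table. Each tensor is assumed stored as a sparse
# dict mapping non-zero (x, y, z) bin indices to surface areas, together
# with its definition angle theta_def in degrees; names are illustrative.
from collections import defaultdict

DELTA_THETA = 5.0  # bin width for theta_def, as found in [14]

def theta_bin(theta_def):
    return int(theta_def // DELTA_THETA)

def build_hash_table(models):
    """models[model_id] is a list of (sparse_tensor, theta_def) pairs."""
    table = defaultdict(list)  # (x, y, z, theta_bin) -> [(tensor_id, model_id)]
    for model_id, tensors in enumerate(models):
        for tensor_id, (sparse_tensor, theta_def) in enumerate(tensors):
            tb = theta_bin(theta_def)
            for xyz in sparse_tensor:          # non-zero elements only
                table[xyz + (tb,)].append((tensor_id, model_id))
    return table
```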

3 Online Recognition Phase

During the online recognition phase, a view of the scene is acquired at a low resolution (using a low cost scanner) and converted into a triangular mesh. Next, a subsample of vertices (approximately 200), taken in pairs, is selected from the scene such that the pairs are uniformly distributed over the mesh representing the scene. The pairs must satisfy the two constraints given in Section 2.1, and each pair of vertices is used to compute a tensor as described in Section 2.1. Next, the x, y, z indices of the non-zero elements, along with the θdef of each tensor, are used to cast votes for all the tuples (tensor number, model number) present at the corresponding index positions (x, y, z, θdef) in the 4D hash table. The model that receives the most votes is hypothesized to be present in the scene.
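The voting stage can be sketched as follows, reusing the hash table and theta_bin helper from the sketch in Section 2.2; the bookkeeping details (returning the per-tuple votes for the later verification stage) are our own choice.

```python
# Sketch of the voting stage. Each scene tensor votes for every
# (tensor number, model number) tuple found at the hash-table entries
# indexed by its non-zero elements and its theta_def bin.
from collections import Counter

def cast_votes(scene_tensors, table):
    """scene_tensors: list of (sparse_tensor, theta_def) from the scene."""
    votes = Counter()  # (tensor_id, model_id) -> vote count
    for sparse_tensor, theta_def in scene_tensors:
        tb = theta_bin(theta_def)
        for xyz in sparse_tensor:
            for entry in table.get(xyz + (tb,), []):
                votes[entry] += 1
    model_votes = Counter()
    for (tensor_id, model_id), v in votes.items():
        model_votes[model_id] += v
    # the model with the most votes is hypothesized to be in the scene
    best_model = model_votes.most_common(1)[0][0] if model_votes else None
    return best_model, votes
```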


The recognition hypothesis is verified as follows. The correlation coefficient Cc between each scene tensor Ts and its corresponding model tensor Tm is calculated over their overlapping bins only (Eqn. 1); Cc is restricted to the overlapping region of the two tensors in order to cater for occlusions.

Cc = correl_coeff(Tm(Ims), Ts(Ims))    (1)

In Eqn. 1, Ims is the intersection of the non-zero elements of Tm and Ts, and Tm(Ims) and Ts(Ims) are the values of the model and scene tensors, respectively, in their region of overlap. Corresponding tensors whose Cc is less than a prespecified threshold tc are discarded. We selected tc = 0.5 on the basis of our extensive experimental analysis of pairwise surface matching [13]. The remaining corresponding pairs of tensors are sorted in order of decreasing Cc. The pair with the highest Cc is then used to transform the model into the scene coordinates using the rotation matrix R (Eqn. 2) and the translation vector t (Eqn. 3).

R = Bm^T Bs    (2)

t = Os − Om R    (3)

In Eqn. 2, Bm and Bs are the matrices of the coordinate bases used to define the model and scene tensors, respectively, and Bm^T is the transpose of Bm. In Eqn. 3, Om and Os are the coordinates of the origins of Tm and Ts, respectively. Once the model and the scene are registered, the surface match is verified by refining the registration with the ICP algorithm [1]. If the model and scene surfaces have a significant overlap, the hypothesis is accepted; otherwise, the next pair of corresponding tensors is used to register the model with the scene. More details of the surface matching algorithm (although used in the context of 3D modeling) can be found in [12][14]. Once a recognition hypothesis has been accepted, the final rotation matrix and translation vector represent the pose of the model object in the scene.
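The verification step of Eqns. 1-3 can be sketched as follows, again assuming the sparse dict-of-bins tensor format and the origins and bases from local_basis() in the earlier sketches; this is an illustration of the equations, not the paper's MATLAB implementation.

```python
# Sketch of the verification step (Eqns. 1-3), assuming the sparse
# dict-of-bins tensor format and the outputs of local_basis().
import numpy as np

def correlation_coefficient(tm, ts):
    """Cc of Eqn. 1, computed over I_ms, the overlapping non-zero bins."""
    common = sorted(set(tm) & set(ts))       # I_ms
    if len(common) < 2:
        return -1.0                          # too little overlap to score
    a = np.array([tm[b] for b in common])    # Tm(I_ms)
    b = np.array([ts[b] for b in common])    # Ts(I_ms)
    return np.corrcoef(a, b)[0, 1]           # Pearson correlation

def pose_from_tensors(Bm, Bs, Om, Os):
    """R of Eqn. 2 and t of Eqn. 3 (model-to-scene, row-vector convention)."""
    R = Bm.T @ Bs
    t = Os - Om @ R
    return R, t
```

A tensor pair whose Cc exceeds tc = 0.5 would then be used to transform the model into scene coordinates, with ICP [1] refining the alignment before the overlap test.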

4 Results and Discussion

We performed our experiments on a model library of seven real objects, namely the dinosaur, the isis, the balljoint, the dinopet, the dog, the hip and the dragon (see Fig. 2). We modeled these objects at high resolution using our automatic 3D modeling algorithm [13][14] (range data taken from [16]). Next, we applied a mesh reduction algorithm [7] to these models and reduced their resolution to 600 triangular faces per model, which is an extremely low resolution. Fig. 2 shows the 3D models of the seven objects at high resolution (rows one and three) and their corresponding low resolution models (rows two and four). We stored only the low resolution models, along with their tensor representations, in the model library; the high resolution models are shown for illustration purposes only.

The average recognition time of our algorithm was 70 seconds with a MATLAB implementation on a P-4 PC. This time is expected to drop significantly once the algorithm is re-implemented in C++. Note that the timing given in [9] is for matching a single spin image with the model library, as opposed to the total recognition time. Since 100 spin images were matched in [9], the total recognition time will be well above 100 times the matching time of a single spin image. Therefore, if we take the whole recognition time into account, our algorithm is more efficient, particularly keeping in view that the code in [9] was written in C. Moreover, our algorithm uses a hash table for efficient matching, as opposed to the one-to-one matching scheme of the spin images in [9]. The matching time in [9] increases linearly with the size of the model library; since our algorithm uses a hash table based voting scheme, its recognition time is unlikely to be significantly affected by the size of the model library.

Ten different low resolution views (taken from different angles) of each of the seven objects of Fig. 2 were fed one by one to our automatic 3D object recognition algorithm for recognition and pose calculation. The resolution of these views was 400 faces per view. Table 1 shows the confusion matrix for our recognition results. Notice that there is only one false-positive, i.e. the isis recognized as the dinosaur. There were 9 false-negatives overall, out of a total of 70 trials. The balljoint and the hip were recognized with 100% accuracy. The overall recognition rate, defined as the number of true-positives divided by the total number of trials, was 86%. Note that this recognition rate was achieved at a very low resolution. Fig. 4 shows one example view of each object which was not recognized correctly during our experiments.

Table 1: Confusion matrix for the recognition results of 10 unoccluded views of each object (rows: scene object; columns: recognized model; unrecognized views are not counted).

Scene \ Model   Dino   Isis   Balljoint   Dinopet   Dog   Hip   Dragon
Dino              8      -        -          -        -     -      -
Isis              1      9*       -          -        -     -      -
Balljoint         -      -       10          -        -     -      -
Dinopet           -      -        -          8        -     -      -
Dog               -      -        -          -        7     -      -
Hip               -      -        -          -        -    10      -
Dragon            -      -        -          -        -     -      8

* Twice recognized with incorrect pose.

Figure 4: Example views of the five objects (400 faces per view) which were not recognized during our experiments.

We also tested our algorithm on occluded views of the objects. We define occlusion according to Eqn. 4.

occlusion = 1 − Av/At    (4)

In Eqn. 4, Av is the visible surface area of the occluded object in the scene and At is the total surface area of the object that would be visible without occlusion. Note that in [9], At is defined as the total surface area of the model, as opposed to the total surface area of the view as in our case. The occlusion equation of [9] therefore gives at least 50% occlusion even for full views (not models) of objects that are topologically equivalent to a sphere; for full views of free-form objects, the value of occlusion is even greater because of their greater self-occlusion. The results of our occlusion tests are compiled in Fig. 5. Note that the recognition rate is 100% at zero occlusion because these tests were performed on a subsample of the unoccluded views which were correctly recognized in Table 1. Fig. 5(a) shows the recognition rate versus the percentage occlusion, while Fig. 5(b) shows the maximum percentage occlusion at which each individual object was correctly recognized. For illustration, Fig. 6 shows three views of objects with 60% occlusion (Eqn. 4). According to the definition in [9], the views of the balljoint, dinopet and dog in Fig. 6 are 87.05%, 89.5% and 88.33% occluded respectively, as opposed to 60% according to our definition.

Figure 5: Recognition rate vs. occlusion. The recognition rate is 100% at zero occlusion because these tests were performed on a subsample of unoccluded views which were correctly recognized.

Figure 6: Example views of 3 objects with 60% occlusion.
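For illustration, the following toy computation contrasts the two occlusion definitions discussed above; the areas are made-up numbers, and the formula attributed to [9] follows the qualitative description given in the text.

```python
# Toy comparison of the two occlusion definitions. The areas below are
# made-up; the second definition follows the description of [9] above
# (occlusion measured against the full model surface area).
def occlusion_view(av, at_view):
    """Eqn. 4: occlusion relative to the full unoccluded view."""
    return 1.0 - av / at_view

def occlusion_model(av, at_model):
    """Occlusion relative to the total model surface area, as in [9]."""
    return 1.0 - av / at_model

av, at_view, at_model = 40.0, 100.0, 280.0   # hypothetical areas (cm^2)
print(occlusion_view(av, at_view))    # 0.6  -> 60% by Eqn. 4
print(occlusion_model(av, at_model))  # ~0.86 -> far higher by [9]'s definition
```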

Our final test checked the robustness of our recognition algorithm to noise. We injected different levels of Gaussian noise into different views of the objects and recorded the recognition rate of our algorithm. Fig. 7 shows the results of this experiment. Note that the recognition rate is 100% at zero noise because these tests were performed on a subsample of the noiseless views which were correctly recognized in Table 1. The recognition rate remained 100% for noise with standard deviation σ below 2.0 cm; at σ = 2.0 cm it dropped only to 90%, and from there onwards it decreased rapidly. Fig. 8 shows the views of two objects before and after the addition of noise with σ = 2.0 cm. The insensitivity of our algorithm to noise is due to the robustness of our tensor representation to noise (as demonstrated in the case of 3D modeling in [13]).

Figure 7: Recognition rate vs. noise. The recognition rate is 100% at zero noise because these tests were performed on a subsample of noiseless views which were correctly recognized.

Figure 8: Views of the dog and the dinopet (first row) and the same views after the addition of Gaussian noise with σ = 2.0 cm (second row).

In order to investigate the nine views categorized as false-negatives, the single view categorized as a false-positive, and the two views of the isis which were identified correctly but with incorrect poses, we repeated our recognition experiment for these 12 views at a higher resolution. We increased the resolution of the models in the library to 1200 faces per model and the resolution of the views to 600 faces per view. In this case, all 12 views were correctly recognized (true-positive), except for one view of the isis which was again identified correctly but with an incorrect pose. Thus, our algorithm reached a 100% object identification rate, but only a 98.5% pose estimation rate, at a resolution of 1200 faces per model and 600 faces per view.


Note that this resolution is still lower than the resolution of the models and views used in [9].

The following is a brief qualitative comparison of our algorithm with [9], based on our preliminary results. Our representation is superior to the spin image representation because it is unique, whereas spin images are not: due to the mapping of 3D surfaces into 2D histograms, two different surfaces can have the same spin image, while in our case only similar surfaces can have matching tensors. Our preliminary results show that our algorithm can recognize objects under greater occlusion than [9]. They also show that our algorithm is robust to noise, whereas the algorithm in [9] was not tested for noise. A more rigorous, quantitative comparison will be provided in future publications.

5 Conclusion

In [12] we demonstrated the use of our tensor representation for automatic pairwise correspondence and registration between two different views of an object. In [14] we used the representation to solve the correspondence problem between unordered views of an object for automatic 3D modeling. In this paper, we successfully used a variant of our tensor representation for 3D model-based free-form object recognition. Our algorithm is fully automatic, with no user intervention at any stage, including the offline 3D modeling phase and the online object identification and pose estimation phase. Preliminary results show that our recognition algorithm achieves high accuracy even at very low resolution, and it is therefore efficient in terms of both memory utilization and performance. Moreover, it requires only a cheap low resolution scanner and is therefore economical. Our experiments on real free-form objects show that the algorithm is accurate, efficient and robust to occlusions.

Acknowledgments The authors would like to thank The University of Stuttgart for providing range data and Carnegie Mellon University for providing mesh reduction software. This research is sponsored by ARC grant number DP0344338.

References

[1] P. J. Besl and N. D. McKay, "A Method for Registration of 3-D Shapes," IEEE TPAMI, vol. 14, no. 2, pp. 239–256, 1992.
[2] O. Carmichael, D. Huber and M. Hebert, "Large Data Sets and Confusing Scenes in 3-D Surface Matching and Recognition," 3DIM, pp. 358–367, October 1999.
[3] C. S. Chua and R. Jarvis, "Point Signatures: A New Representation for 3D Object Recognition," IJCV, vol. 25, no. 1, pp. 63–85, 1997.
[4] F. S. Cohen and J. Wang, "Part I: Modeling Image Curves Using Invariant 3-D Object Curve Models - A Path to 3-D Recognition and Shape Estimation from Image Contours," IEEE TPAMI, vol. 16, no. 1, pp. 1–12, 1994.
[5] C. Dorai and A. K. Jain, "COSMOS: A Representation Scheme for 3D Free-Form Objects," IEEE TPAMI, vol. 19, no. 10, pp. 1115–1130, 1997.
[6] J. Foley, A. van Dam, S. Feiner and J. Hughes, Computer Graphics: Principles and Practice, Addison-Wesley, 2nd Edition, 1990.
[7] M. Garland and P. Heckbert, "Surface Simplification Using Quadric Error Metrics," SIGGRAPH, pp. 209–216, 1997.
[8] M. Hebert, K. Ikeuchi and H. Delingette, "A Spherical Representation for Recognition of Free-Form Surfaces," IEEE TPAMI, vol. 17, no. 7, pp. 681–690, 1995.
[9] A. E. Johnson and M. Hebert, "Using Spin Images for Efficient Object Recognition in Cluttered 3D Scenes," IEEE TPAMI, vol. 21, no. 5, pp. 674–686, 1999.
[10] T. Joshi, J. Ponce, B. Vijayakumar and D. Kriegman, "HOT Curves for Modeling and Recognition of Smooth Curved 3D Objects," IEEE CVPR, pp. 876–880, 2002.
[11] S. B. Kang and K. Ikeuchi, "The Complex EGI: A New Representation for 3D Pose Determination," IEEE TPAMI, vol. 15, pp. 707–721, 1993.
[12] A. S. Mian, M. Bennamoun and R. A. Owens, "Matching Tensors for Automatic Correspondence and Registration," ECCV, part 2, pp. 495–505, 2004.
[13] A. S. Mian, M. Bennamoun and R. A. Owens, "Performance Analysis of an Improved Tensor Based Correspondence Algorithm for Automatic 3D Modeling," IEEE ICIP, October 2004.
[14] A. S. Mian, M. Bennamoun and R. A. Owens, "From Unordered Range Images to 3D Models: A Fully Automatic Multiview Correspondence Algorithm," TPCG, pp. 162–166, 2004.
[15] F. Stein and G. Medioni, "Structural Indexing: Efficient 3-D Object Recognition," IEEE TPAMI, vol. 14, no. 2, pp. 125–145, 1992.
[16] The University of Stuttgart, "Stuttgart Range Image Database," http://range.informatik.uni-stuttgart.de/htdocs/html/.
[17] J. Wang and F. S. Cohen, "Part II: 3-D Object Recognition and Shape Estimation from Image Contours Using B-Splines, Shape Invariant Matching, and Neural Network," IEEE TPAMI, vol. 16, no. 1, pp. 13–23, 1994.
[18] I. Weiss and M. Ray, "Model-Based Recognition of 3D Objects from Single Images," IEEE TPAMI, vol. 23, no. 2, pp. 116–128, 2001.
[19] L. Zhang, B. Curless and S. M. Seitz, "Rapid Shape Acquisition Using Color Structured Light and Multi-pass Dynamic Programming," 3DPVT, pp. 24–36, 2002.
