Scalable Representation and Learning for 3D Object Recognition Using Shared Feature-Based View Clustering

Sungho Kim and In So Kweon

Dept. of EECS, Korea Advanced Institute of Science and Technology, 373-1 Gusong-Dong, Yuseong-Gu, Daejeon, Korea
{sunghokim, iskweon}@kaist.ac.kr

Abstract. In this paper, we present a new scalable 3D object representation and learning method for recognizing many objects. Scalability is an important issue in object recognition for reducing memory and recognition time. The key idea of the scalable representation is to combine a feature-sharing concept with view clustering in a part-based object representation (specifically, a CFCM: common-frame constellation model). Within this representation scheme, we also propose a fully automatic learning method: appearance-based automatic feature clustering and sequential construction of view-tuned CFCMs from labeled multi-view images of multiple objects. We applied this learning scheme to 40 objects with 216 training views. Experimental results show that the scalable learning yields nearly constant recognition performance as the number of objects increases.
1 Introduction
Object recognition has matured at the identification level with local feature-based approaches. Local features are extracted by the following process: interest point detection [1], region selection [2], and region description [3][4]. Based on these local features, several object recognition methods have been introduced, such as probabilistic voting [5], constellation model-based approaches [6], and SVM- or AdaBoost-based classifiers [7]. State-of-the-art methods such as SIFT [3] show very high detection and recognition accuracy in general environments. However, as the number of objects increases, scalability becomes more important: conventional object representations require memory and recognition time that grow linearly with the number of objects. The problem is even more severe for 3D objects, since storing all their multiple views is almost impractical. Recently, some feasible approaches have been proposed to alleviate the scalability problem. Torralba et al. [8] modified AdaBoost to recognize multiclass objects using a feature-sharing concept, and demonstrated that shared features outperform independently learned features. Murphy-Chutorian and Triesch adopted feature clustering to address the problem [9]; their method recognizes objects by nearest-neighbor voting over a clustered-feature database. Lowe proposed a local feature-based view-clustering scheme to represent multiple views of 3D objects [10].
Fig. 1. Key idea of scalable object representation: we apply feature sharing and view clustering to multiple views of 3D objects for scalable object representation
However, these approaches tackle the scalability problem only partially, at either the feature level or the multiple-view level. How can we reduce the DB size for many objects and views without degrading recognition performance? In this paper, we present a new object representation and learning method that combines a feature-sharing concept [9] with a view-clustering concept [10] in a part-based object representation [6], as shown in Fig. 1. In Section 2, we introduce the scalable 3D object representation scheme. Sections 3 and 4 explain the proposed learning method for this representation. Section 5 details the recognition method used for validation. In Section 6, we demonstrate the scalability of the proposed method, and we conclude in Section 7.
2 Scalable 3D Object Representation
As discussed above, simply storing all possible views of many 3D objects requires huge memory and long recognition time. The main cause is the redundancy introduced during DB generation, so we must remove these redundancies effectively to obtain a minimal DB. To this end, we adopt a part-based object representation, specifically a common-frame constellation model (CFCM) [6], instead of a holistic appearance representation. The CFCM scheme offers two useful advantages, in computation and in redundancy removal.

Computational efficiency: An object can be represented as a set of visual parts. The well-known mathematical model is the fully parameterized constellation model shown in Fig. 2(a) (top) [11], where each circle denotes an object part containing appearance information and part pose. If each part is $x_i$ and the number of parts is $N$, the object can be modeled as a full-covariance joint pdf $p(x_1, x_2, \ldots, x_N)$, whose required number of parameters (degrees of freedom) grows as $O(N^2)$ because every pair of parts is coupled. However, if we fix the object ID and viewpoint, each part can share the viewing parameters $\theta = [\text{objectID}, \text{pose}]$, as in Fig. 2(a) (bottom) [6]. The representation then reduces to a product form conditioned on the object parameter, $\prod_{i=1}^{N} p(x_i \mid \theta)$, so the order drops to $O(N)$, which is useful during object recognition. We call this part-based representation a CFCM, since all parts share the object parameters (object ID, pose).

Easy redundancy removal: In a CFCM, the sources of redundancy are easy to identify: one is the object parts, and the other is the object parameters of object ID and viewpoint.
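To make the two parameter counts concrete, the computational-efficiency argument above can be restated in display form. This is our paraphrase; the Gaussian form simply spells out the "full-covariance joint pdf" wording:

```latex
% Fully connected model [11]: all N parts jointly, with a full
% covariance matrix coupling every pair of parts -> O(N^2) parameters.
p(x_1, x_2, \ldots, x_N) = \mathcal{N}(x;\, \mu, \Sigma)

% CFCM [6]: given the shared parameter \theta = [\mathrm{objectID},
% \mathrm{pose}], parts are conditionally independent -> O(N) factors.
p(x_1, \ldots, x_N \mid \theta) = \prod_{i=1}^{N} p(x_i \mid \theta)
```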
Fig. 2. (a) (top) Fully connected constellation model: it can model objects with up to 5–7 parts. (bottom) Common-frame constellation model: it can model objects with hundreds of parts. (b) Any 3D object can be represented by shared part-based CFCMs that are view clustered.
Since the training images comprise many views of 3D objects, redundant parts and views are inevitable, and we reduce them by applying clustering to both parts and views. Based on these motivations, the proposed scalable object representation framework is shown in Fig. 2(b). The bottom table is the feature (part appearance) library; each entry is an appearance vector obtained by vector clustering. The appearance feature of an individual part can be any descriptor, such as SIFT [3], PCA [12], or moments [13]. A 3D object is represented as a set of view-clustered CFCMs. Each CFCM contains object parts, and each part stores its pose (size, orientation, and position within the CFCM) together with a link index into the appearance library; this pose information is available from [1][3]. Likewise, each library entry stores links back to all the parts in the CFCMs that reference it, a fact we exploit to generate hypotheses during object recognition. The next two sections explain the details of learning by feature clustering and by view clustering, respectively.
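Before moving on, the following sketch illustrates this layout by modeling the shared appearance library and the view-clustered CFCMs as plain data structures. The class and field names are our own hypothetical rendering of Fig. 2(b), not code from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class Part:
    """One visual part in a CFCM: pose plus a link into the shared library."""
    x: float                    # position in the CFCM's common frame
    y: float
    size: float                 # part scale
    orientation: float          # dominant orientation
    app_id: int                 # index into the shared appearance library

@dataclass
class CFCM:
    """One view-clustered common-frame constellation model."""
    object_id: int
    view_id: int                # which clustered view this model covers
    parts: list = field(default_factory=list)

@dataclass
class AppearanceLibrary:
    """Shared appearance clusters; each entry links back to its users."""
    vectors: list = field(default_factory=list)    # cluster-center vectors
    backlinks: list = field(default_factory=list)  # per entry: [(cfcm, part_idx)]

    def link(self, app_id, cfcm, part_idx):
        # The back-link is what lets one matched library feature generate
        # (object ID, pose) hypotheses at recognition time.
        self.backlinks[app_id].append((cfcm, part_idx))
```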
3 Visual Feature and Clustering

3.1 Generalized Robust Invariant Feature
We detect visual parts based on object structures. First, high-curvature points and radial-symmetry centers are extracted using the Harris corner detector and the DoG (difference of Gaussians) detector, respectively. Second, the part size is determined at the local maxima of convexity, where the DoG response is compared across scale space (see Fig. 3); this combination extracts complementary object parts. The dominant orientation of a visual part is calculated using a weighted steerable filter. Finally, the detected convex part is encoded using a set of localized histograms (21 in total) of edge orientation (4 bins), edge density (1 bin), and hue (4 bins).
Fig. 3. We can detect structure-based object parts: (a) radial-symmetry parts, (b) corner-like parts, (c) the proposed complementary visual part detector [14]
This is a generalized form of a contextual descriptor [3][4], with feature dimension 189 (21 × (4+1+4)). More details are given in [14]. We call this feature G-RIF (generalized robust invariant feature) and use that term throughout this paper.
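To make the 21 × (4 + 1 + 4) = 189 layout concrete, here is a sketch of the descriptor assembly. The spatial arrangement of the 21 sub-regions, the bin ranges, and all names are our assumptions (the paper defers those details to [14]), so this illustrates the structure rather than reproducing the exact G-RIF computation:

```python
import numpy as np

def grif_descriptor(edge_orient, edge_mag, hue, centers, radius):
    """Concatenate 21 localized histograms of edge orientation (4 bins),
    edge density (1 bin), and hue (4 bins) into a 189-dim vector.
    edge_orient, edge_mag, hue: 2D arrays over the normalized part patch.
    centers: the 21 sub-region centers (hypothetical layout)."""
    h, w = edge_mag.shape
    ys, xs = np.mgrid[0:h, 0:w]
    blocks = []
    for cy, cx in centers:
        mask = (ys - cy) ** 2 + (xs - cx) ** 2 <= radius ** 2
        # 4-bin edge-orientation histogram, weighted by edge magnitude
        o_hist, _ = np.histogram(edge_orient[mask], bins=4,
                                 range=(0, np.pi), weights=edge_mag[mask])
        density = np.array([edge_mag[mask].sum()])   # 1-bin edge density
        h_hist, _ = np.histogram(hue[mask], bins=4, range=(0.0, 1.0))
        blocks.extend([o_hist, density, h_hist])     # 4 + 1 + 4 = 9 dims
    v = np.concatenate(blocks).astype(np.float64)    # 21 * 9 = 189 dims
    return v / (np.linalg.norm(v) + 1e-12)           # normalize for matching
```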
3.2 Automatic Feature Clustering
A feature library, or codebook, can be generated by clustering the training features. Several clustering methods exist, such as the k-means algorithm and vector quantization [15], but these are based on iterative optimization starting from random cluster centers with a predetermined number of clusters. In our database, the feature dimension is large (189) and the number of features exceeds several hundred thousand; in this regime, a conventional energy-minimization approach is impractical due to its convergence time. The main problems of the k-means algorithm on such data are:

– How to set the number of clusters.
– How to set the initial cluster centers.
– How to compare distances between data and cluster centers efficiently.

We propose a simple and practical clustering algorithm suited to high-dimensional visual features, solving the above problems by exploiting the properties of part structures and a nearest-neighbor search with a k-d tree [16]. As shown in Fig. 4(a) (top), visually similar parts can be clustered using only a distance threshold (ε) between normalized feature vectors; as the threshold grows, progressively rougher structures are grouped together. Because part structures play a central role in part-based object recognition, we first find rough structure centers by sequentially performing the ε-nearest-neighbor search, as in Fig. 4(a) (bottom), removing clustered features from the search space as we go. The cluster centers are then optimized by k-means clustering, which corrects the features lying on cluster boundaries. By merging the ε-nearest-neighbor search, the k-d tree-based distance calculation, and the k-means algorithm, we solve all three problems simultaneously. Fig. 4(b) compares the convergence of k-means clustering under the proposed initialization and under random initialization: thanks to the good initial estimate of the cluster centers, the proposed automatic clustering converges almost completely within two iterations.
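A minimal sketch of this procedure, assuming row-normalized feature vectors, is given below; the function and parameter names are ours, and SciPy's k-d tree stands in for the k-d tree of [16]:

```python
import numpy as np
from scipy.spatial import cKDTree

def cluster_features(feats, eps, n_iter=2):
    """Sequential eps-NN seeding followed by k-means refinement.
    feats: (n, d) array of normalized feature vectors; eps is the
    distance threshold (the epsilon of Fig. 4(a))."""
    feats = np.asarray(feats, dtype=np.float64)
    unassigned = np.ones(len(feats), dtype=bool)
    tree = cKDTree(feats)
    centers = []
    # 1. Sequential eps-NN seeding: greedily absorb every unclustered
    #    feature within eps of a seed. This fixes the cluster count and
    #    the initial centers without any random choices.
    for i in range(len(feats)):
        if not unassigned[i]:
            continue
        members = [j for j in tree.query_ball_point(feats[i], eps)
                   if unassigned[j]]
        unassigned[members] = False
        centers.append(feats[members].mean(axis=0))
    centers = np.vstack(centers)
    # 2. A few k-means steps to correct features on cluster boundaries;
    #    per Fig. 4(b), ~2 iterations suffice with this initialization.
    for _ in range(n_iter):
        _, labels = cKDTree(centers).query(feats)  # nearest center per feature
        for k in range(len(centers)):
            mask = labels == k
            if mask.any():
                centers[k] = feats[mask].mean(axis=0)
    return centers, labels
```

Normalizing the feature vectors beforehand makes the single threshold eps meaningful across the whole training set, which is what removes the need to preset the cluster count.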
Fig. 4. (a) ε-NN search results from training parts (top) and the automatic sequential clustering process (bottom). (b) Convergence comparison (clustering error vs. iteration) between the proposed automatic clustering (proposed initialization + k-means) and the conventional k-means algorithm with random initialization.
4 Sequential Construction of Scalable Object Model
As stated above, we represent a 3D object by a set of view-tuned CFCMs, whose visual parts are conditioned on the view-tuned parameters; the term view-tuned means view clustering in a similarity-transform space. Fig. 5(a) shows the overall object learning structure. Given labeled multi-view, multi-object images, we have to find the view-tuned CFCMs. In a CFCM, each part is represented by its pose and an appearance index into the shared feature libraries learned in the previous section.
[Fig. 5(a), learning flowchart: labeled training images → extract local features → set the first view as the reference CFCM, represented using the shared feature library; for each next image: extract local features → search matching pairs by Hough voting and perform a similarity transform → Decision 1: enough matches? → calculate the spatial match error (Decision 2) → merge into a learned CFCM, or create a new CFCM represented with new feature-library entries.]
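Reading the flowchart as a loop, the sequential construction might look like the sketch below. The helper callables are injected stand-ins for steps the figure only names (part extraction, Hough matching, similarity fitting, CFCM creation/merging), and the two thresholds for Decisions 1 and 2 are our assumptions, so this is a hypothetical rendering rather than the authors' implementation:

```python
def build_cfcms(training_views, extract_parts, hough_match,
                similarity_fit, make_cfcm, min_matches, max_spatial_err):
    """Sequentially cluster labeled views of one object into CFCMs.

    extract_parts(view)      -> detected parts (e.g. G-RIF features)
    hough_match(parts, cfcm) -> candidate part correspondences
    similarity_fit(pairs)    -> (similarity transform, residual error)
    make_cfcm(parts)         -> a new view-tuned CFCM from these parts
    CFCM objects are assumed to expose merge(parts, T)."""
    cfcms = []
    for view in training_views:
        parts = extract_parts(view)
        if not cfcms:
            cfcms.append(make_cfcm(parts))      # set as reference CFCM
            continue
        merged = False
        for cfcm in cfcms:
            pairs = hough_match(parts, cfcm)
            if len(pairs) < min_matches:        # Decision 1: enough matches?
                continue
            T, err = similarity_fit(pairs)
            if err <= max_spatial_err:          # Decision 2: spatial error OK?
                cfcm.merge(parts, T)            # reuse shared feature library
                merged = True
                break
        if not merged:
            cfcms.append(make_cfcm(parts))      # new CFCM, new library entries
    return cfcms
```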