MULTI-REFERENCE OBJECT POSE INDEXING AND 3-D MODELING FROM VIDEO USING VOLUME FEEDBACK

Alireza Nasiri Avanaki, Babak Hamidzadeh, Faouzi Kossentini, and Rabab Ward
University of British Columbia, Department of Electrical and Computer Engineering
2356 Main Mall, Vancouver, B.C., Canada, V6T 1Z4

ABSTRACT

A system for 3-D reconstruction of a rigid object from monocular video sequences is introduced. Initially, the object pose in each image is estimated by locating similar (unknown) texture, assuming a flat depth map for all images. Shape-from-silhouette [1] is then applied to construct a 3-D model, which is used to obtain better pose estimates with a model-based method. Before the process is repeated by building a new 3-D model, the pose estimates are adjusted to reduce error by maximizing a quality measure for shape-from-silhouette volume reconstruction. Translation of the object in the input sequence is compensated in two stages. The volume feedback is terminated when the updates to the pose estimates become small. The final output is a pose index (the last set of pose estimates) and a 3-D model of the object. Good performance of the system is shown by experiments on a real video sequence of a human head. Our method has the following advantages: 1. No model is assumed for the object. 2. Feature points are neither detected nor tracked, so no problematic feature matching or lengthy point tracking is required. 3. The method generates a high-level pose index for the input images, which can be used for content-based retrieval. Our method can also be applied to 3-D object tracking in video.

1. INTRODUCTION AND BACKGROUND

In this paper, we present a novel method that reconstructs the volume of a rigid opaque object from a sequence of images showing the object in various poses from unknown camera viewpoints. Volume reconstruction of an object from regular (monocular) footage is of critical importance in applications such as 3-D object tracking in video, digital museums, and the generation of virtual environments (e.g., photo-realistic computer-generated animation). An index of object pose for the input image sequence is also generated by the proposed method. Such an index can be used as an aid to existing content-based retrieval engines to answer higher-level, more complex queries. Three-dimensional perception of an object is closely related to our ability to distinguish between various poses of the object. Several methods, such as shape-from-silhouette [1], light field rendering [2], and space carving using photo-consistency [3], have addressed reconstruction of an object from images taken from known poses (camera viewpoints). The pose information required by object reconstruction methods can be provided by three classes of methods: model-based, template matching, and feature-based. These methods, however, have limitations, as described below.


Model-based methods cannot be applied to a general object; for their intended objects, however, they are optimized to perform efficiently in terms of computational complexity. Template matching is more a technique for pose recognition than for pose estimation, because it requires the appearance of the object in various poses to be recorded in advance. Feature-based methods [4] are based on the premise that some points on the object are special: feature points that can be easily recognized and tracked through multiple images. However, the number of available feature points is usually not sufficient to describe the object structure at the desired resolution. Therefore, one needs to use less "easily recognized" points to increase this number, in which case tracking these points across multiple images becomes a problem. In spite of a large volume of research, solutions to this problem are still computationally intensive and prone to error. Other pose estimation methods assume that depth map information is available for one frame [5] or for all frames of the input sequence [6]. Although these methods can be used for pose estimation of a general object, they fall into the model-based category, since the depth map actually constitutes information about the 3-D structure of the object.

To avoid the problems of feature-based methods and the limitations of model-based methods, we propose a novel method comprised of a 3-D reconstruction module (shape-from-silhouette, for simplicity) and a model-based pose estimation module [5]. Preceded by an initial pose estimation (a flat depth map fed to the technique of [5]), these two modules form part of a volume feedback loop (Fig. 1). The system is made functional by adding a block that improves the pose estimates using a measure of shape-from-silhouette reconstruction quality introduced in [7]; its performance is demonstrated in Section 4. Without assuming any model for the object, our system performs pose indexing and 3-D volume reconstruction at the same time from monocular video. That makes the proposed method functionally comparable to feature-based methods and better than the others. Compared to feature-based methods, the advantages of our method are: 1. It does not require detection or tracking of feature points. 2. While the 3-D model constructed by feature-based methods is a set of feature point coordinates (a "point cloud"), the object model reconstructed by our system is a rigid volume whose precision is independent of feature point density. Compared to [7], the present system has the following advantages: 1. Multi-reference flat depth map pose estimation allows the processing of input sequences with arbitrary frame order.



2. Translation of the object in the input sequence is now compensated in two stages (Sections 2.1 and 2.2), whereas the previous system did not permit object translation in the input sequence. The abilities of the improved system are showcased by reconstructing the volume of a human head from a real video sequence of the head in different poses.

The remainder of this paper is organized as follows. Section 2 provides an overview of the system and describes its modules. A brief complexity analysis of the system is given in Section 3. Experimental results demonstrating good performance of the system are reported in Section 4. A summary and a few lines of future work conclude the paper in Section 5.

2. SYSTEM DESCRIPTION

We first introduce a few terms, then describe the modules of the system (Fig. 1) along with its overall operation. The relative pose of the object in frame B with respect to frame A is the amount of rotation needed to take the object from its pose in frame A (the reference) to its pose in frame B (the target). We call this parameter the relative pose between frames B and A. For each frame, an absolute pose is defined as the pose of the object with respect to some global (fixed) pose reference.

First, the input sequence is compensated for possible translations of the silhouettes of the object at different poses (Section 2.1). Next, an initial estimate of the object pose in each frame is made, after re-ordering the frames (Section 2.3). Then a volume of the object is reconstructed by shape-from-silhouette using the initial pose estimates. This volume provides the depth information required for the enhanced translation-compensated relative pose estimation (TRPE, Section 2.2). The new estimates are used to reconstruct a new volume of the object. This volume goes through a quality-control mechanism [7] and is fed to the TRPE to close the feedback loop. Convergence is checked by comparing the resulting vector of relative poses to that of the previous iteration: if the two vectors are close, the loop is terminated.

2.1. Preliminary Translation Compensation (PTC)

To compensate for possible translations of the silhouettes of the object at different poses, the center of mass of each silhouette is computed. Each frame is then copied into a template so that the center of mass of its silhouette falls at the center of the template. The template should be large enough to accommodate all images without cropping. This step prevents opposing silhouettes from carving out details generated by each other (Fig. 2).
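A minimal NumPy sketch of this centering step follows; the function name, template handling, and integer rounding are illustrative assumptions, not part of the system description above.

```python
import numpy as np

def center_on_template(frame, mask, template_shape):
    """PTC sketch: place `frame` in a zero template so that the center of mass
    of its binary silhouette `mask` coincides with the template center."""
    ys, xs = np.nonzero(mask)                      # foreground pixel coordinates
    cy, cx = ys.mean(), xs.mean()                  # silhouette center of mass
    th, tw = template_shape
    dy = int(round(th / 2.0 - cy))                 # shift that centers the silhouette
    dx = int(round(tw / 2.0 - cx))
    template = np.zeros((th, tw) + frame.shape[2:], dtype=frame.dtype)
    h, w = mask.shape
    # assumes the template is large enough that the shifted frame fits (dy, dx >= 0)
    template[dy:dy + h, dx:dx + w] = frame
    return template
```

The same shift would also be applied to the silhouette mask itself, so that subsequent carving operates on centered silhouettes.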

Fig. 2. The silhouettes of frames 1 ($M_1$) and 11 ($M_{11}$) of the head sequence (Section 4), and $M_{11}$ subtracted from flipped $M_1$. Details (black/white parts) carved by $M_{11}$/$M_1$ are destroyed by the other silhouette unless the silhouettes are translated to have a common center of mass.

2.2. Translation-compensated Relative Pose Estimation (TRPE)

The idea of TRPE, a variant of the technique introduced in [7], is to rotate the reference 3-D patch (the reference frame projected onto the reference depth map) until its image becomes "similar" to the translation-compensated target frame. The relative pose between the reference and the target frames is then estimated as the amount of rotation required for this matching. TRPE operates on consecutive frames; that is, each frame is used as the reference for the next one. The very first frame therefore has no reference; we take it as the global pose reference, with an absolute pose of 0.

The relative pose between frames $i$ (reference) and $j$ (target) is estimated by $\hat{P}_{ij} = \arg\min_r \{\tau(r)\}$, where $r$ denotes the amount of rotation with respect to the initial pose of the volume, which is set to $\sum_{k=1}^{i} P_{k-1,k}$, and $P_{k-1,k}$ is the relative pose between frames $k-1$ and $k$ from the last round of estimation. (In general, the pose variables $\hat{P}$, $P$, and $r$ are $3 \times 1$ vectors; for simplicity, scalar poses are assumed in this work.) The texture difference function, minimized to maximize similarity, is

$$\tau(r) = \frac{\sum_{p \in M_{\mathrm{Ref}} \cap M_{\mathrm{Target}}} \left| I_{\mathrm{Ref}}(p) - I_{\mathrm{Target}}(p) \right|}{\left\| M_{\mathrm{Ref}} \cap M_{\mathrm{Target}} \right\|},$$

where $M_X$ is the silhouette (binary mask) of image $X$, $I_X(p)$ is the intensity at pixel $p$ of image $X$, and $\|Y\|$ is the number of "on" pixels in binary mask $Y$.

Compensation for the translation between the reference patch and the target is performed by computing a single optic flow vector for all foreground pixels, using the first method described in [8]; the target is then shifted by the computed translation vector. The usual problems of optic flow computation are avoided, since only a single vector is computed from the motion information of a large number of pixels.

The above formulation is for gray-scale images. To extend the method to color images, $\hat{P}_{ij}$ is computed for each color component (a "color pose"); the average of the color poses is then used as the object relative pose estimate. To reduce the effect of noise introduced in the imaging process (e.g., due to miscalibration), we apply a simple filter to the difference image $|I_{\mathrm{Ref}}(p) - I_{\mathrm{Target}}(p)|$ that eliminates low-intensity pixels from the sum.
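A minimal sketch of the difference measure and of a brute-force search over candidate rotations is shown below. The `render` callback (returning the reference patch's image and silhouette after rotating it by r), the candidate rotation list, and the `noise_floor` threshold are illustrative assumptions; the optimizer and the filter threshold are not specified above.

```python
import numpy as np

def texture_difference(i_ref, m_ref, i_target, m_target, noise_floor=0.0):
    """tau(r): mean absolute intensity difference over the silhouette overlap,
    with low-intensity difference pixels dropped from the sum (noise filter)."""
    overlap = m_ref & m_target
    if not overlap.any():
        return np.inf                                  # no common support: worst match
    diff = np.abs(i_ref.astype(float) - i_target.astype(float))[overlap]
    kept = diff[diff >= noise_floor]                   # eliminate low-intensity pixels
    return kept.sum() / overlap.sum()                  # normalize by ||M_Ref ∩ M_Target||

def estimate_relative_pose(render, i_target, m_target, candidate_rotations):
    """Pick the rotation r that minimizes tau(r); render(r) yields the image and
    silhouette of the reference 3-D patch rotated by r."""
    scores = [texture_difference(*render(r), i_target, m_target)
              for r in candidate_rotations]
    return candidate_rotations[int(np.argmin(scores))]
```

For color frames, the same score would be computed per channel and the resulting color poses averaged, as described above.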

2.3. Multi-reference FDMPE (Flat Depth Map Pose Estimation)

To obtain an initial estimate of the object pose in each frame, one can apply TRPE (Section 2.2) between consecutive frames, assuming a flat depth map for all frames [7]. Other depth assumptions can be made without loss of generality, but the flat depth map seems to be the most impartial. This simple FDMPE method, however, can only be used when the order of the input frames is known in advance, such as for frames extracted from a video clip (assuming a high frame rate).


[Fig. 1 block diagram: the segmented image sequence passes through PTC and multi-reference FDMPE, which produce an ordered sequence and initial pose estimates; SFS builds a volume that feeds TRPE, followed by SFS & QC; if the pose estimates have not converged, the volume and poses are fed back to TRPE, otherwise the object volume and relative poses are output.]

Fig. 1. Block diagram of the proposed system. Thick lines indicate data flow of both the volume and the set of relative poses. PTC, SFS, and QC stand for preliminary translation compensation, shape-from-silhouette, and quality control; the other abbreviations are defined in Section 2.
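For reference, the SFS block in Fig. 1 can be pictured as carving a voxel grid with the silhouettes: every voxel that projects outside any silhouette is removed. The sketch below is a generic orthographic carving routine under illustrative assumptions (rotation about a single vertical axis, a unit-cube voxel grid, nearest-pixel projection); the actual SFS implementation follows [1].

```python
import numpy as np

def shape_from_silhouette(masks, poses_deg, grid_size=64):
    """Carve a cubic voxel grid using binary silhouettes and one rotation angle
    (in degrees, about the vertical axis) per frame, orthographic projection."""
    n = grid_size
    volume = np.ones((n, n, n), dtype=bool)
    # voxel center coordinates, normalized to roughly [-0.5, 0.5)
    coords = (np.indices((n, n, n)).reshape(3, -1).T - n / 2.0) / n
    for mask, angle in zip(masks, poses_deg):
        a = np.deg2rad(angle)
        x = coords[:, 0] * np.cos(a) + coords[:, 2] * np.sin(a)   # rotate into camera frame
        y = coords[:, 1]
        h, w = mask.shape
        u = np.clip(((x + 0.5) * w).astype(int), 0, w - 1)        # image column
        v = np.clip(((y + 0.5) * h).astype(int), 0, h - 1)        # image row
        inside = mask[v, u]                                       # silhouette membership test
        volume &= inside.reshape(n, n, n)                         # carve voxels outside the cone
    return volume
```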

When we do not have any information about the order of the input frames, we use multiple reference frames for each target and estimate the relative pose (under the flat depth map assumption) between them. In the multi-reference FDMPE, for each target we select one frame as the reference. The selected frame should be the one most similar to the target (i.e., the one with the minimum difference measure). In addition, to compute the absolute poses required for volume reconstruction, one should be able to connect any arbitrary pair of frames through a chain of reference-target frames. To achieve this, we may have to select the second most similar frame as the reference (i.e., relax the first rule above). Algorithm 1 is devised according to these rules.

Algorithm 1 Multi-reference FDMPE.
1: Sort the pairs of frames in ascending order of their difference measures and put the result in the list L, so that the first frame pair in L (i.e., L(1,1) and L(1,2)) has the minimum difference measure.
2: Construct the frame list F of length N (the number of frames); each element has a list of possibly connected frames (CF, initialized empty) and a binary flag for the connectivity check (CCF, initialized to 1 for one arbitrary element and 0 for the others).
3: for K = 1 to length(L) do
4:   Append L(K,2) to F(L(K,1)).CF.
5:   Append L(K,1) to F(L(K,2)).CF.
6:   if F(L(K,1)).CCF OR F(L(K,2)).CCF then
7:     Set the CCFs of F(L(K,1)), F(L(K,2)), and every frame on their CF lists to 1.
8:   end if {Connecting L(K,2) and L(K,1): done.}
9:   if all frames are connected (AND of the CCFs of all elements of F) then
10:    Break.
11:  end if
12: end for
13: Re-order the sequence so that connected frames come consecutively.
14: In addition to the re-ordered sequence, output the flat depth map pose estimates between all pairs of consecutive frames.
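The connection bookkeeping in Algorithm 1 can be sketched in a few lines of Python. The version below replaces the CCF flags with a union-find structure and assumes two callbacks that are not defined above: `diff[i][j]`, the FDMPE difference measure between frames i and j, and `estimate_pose(i, j)`, the flat-depth-map relative pose of frame j with respect to frame i.

```python
def multi_reference_fdmpe(diff, estimate_pose):
    """Sketch of Algorithm 1: connect all frames through the most similar pairs,
    re-order them, and output flat-depth-map pose estimates along the chain."""
    n = len(diff)
    pairs = sorted((diff[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    parent = list(range(n))                        # union-find instead of CCF flags

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]          # path compression
            x = parent[x]
        return x

    edges = []
    for _, i, j in pairs:                          # ascending difference measure
        edges.append((i, j))                       # record the reference-target link
        parent[find(i)] = find(j)
        if len({find(k) for k in range(n)}) == 1:  # all frames connected: stop early
            break

    adj = {k: [] for k in range(n)}
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    order, seen, stack = [], set(), [0]            # walk the connection graph from frame 0
    while stack:
        f = stack.pop()
        if f in seen:
            continue
        seen.add(f)
        order.append(f)
        stack.extend(adj[f])
    poses = [estimate_pose(order[k - 1], order[k]) for k in range(1, len(order))]
    return order, poses
```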

3. COMPLEXITY ANALYSIS

The proposed system (Fig. 1) has one feedback loop (the volume feedback) preceded by one processing block (flat depth map pose estimation). The processing time also depends on how many times the loop is iterated before convergence. Multi-reference FDMPE takes O(N^2) to complete, since it performs computations for all possible pairs of the N input frames. When the input frames are already ordered (as assumed in [7]), the run time of this block reduces to O(N), since computations are performed only for pairs of consecutive frames. TRPE takes the main part of the processing time of the loop; its processing time is O(N), since the frames are already ordered and computations are needed only for pairs of consecutive frames. Assuming the number of iterations needed for convergence is roughly constant, the total processing time of the system is approximated by O(N) + O(N^2) = O(N^2). This is higher than the O(N) complexity of the system reported in [7]; the extra computational burden buys the ability to process input sequences with unknown frame order.

4. EXPERIMENTAL RESULTS

In the experiment, eleven frames of a sequence showing a human head are fed to the system as input. Frames 1, 6, and 11 are shown in Fig. 3. All frames and their silhouettes (semi-automatically computed based on color) can be obtained from [9]. The results show the robustness of the system to the following effects. Shadow: unlike the frames of the sequences used in the experiments reported in [7], shadow is present in parts of the frames of the head sequence. Imperfect segmentation: the employed segmentation method classifies some background pixels as part of the object silhouette. Translation: the object (head) in this sequence undergoes translation as well as rotation.

Fig. 3. Frames 1, 6, and 11 of the input (head) sequence.

Except for the sum of the relative poses (i.e., the absolute pose of the last frame), we do not have any information about the true values of the poses, either relative or absolute, because the input sequence was shot in an uncontrolled environment with a regular consumer camera. Therefore, the performance cannot be checked by comparing the pose estimates against true values. To see the improvement of the pose estimates through the iterations, the sum of all relative poses is monitored (Fig. 4, left). The sum approaches its true value of 180 degrees; at the final iteration it differs from the true value by only 7 degrees, i.e., an average error of only 0.7 degree per relative pose. The reconstruction quality factor [7] also increases over the course of the iterations (Fig. 4, right); the best reconstruction quality occurs at the last iteration.



Fig. 4. The sum of relative poses (left) and the reconstruction quality factor (right) versus the iteration number. The true value of the sum is 180 degrees.

Note that, due to the imperfect segmentation of the input sequence, the final value of the quality factor for this sequence is approximately 20% lower than the corresponding values for the synthetic sequences of the experiments reported in [7]. One can also observe the convergence to the true volume qualitatively (Fig. 5). The reconstructed volumes at four iterations are used to render the first frame of the sequence. Details of the object (notice the nose and lip area) that are absent from the reconstructed volume in the early iterations appear gradually and are fully present in the volume built at the final iteration. Compared to the experiments with input sequences of a sphere and a prism [7], the head sequence requires more iterations to converge, because the human head is a more complicated volume with more detail.

Fig. 5. The first frame of the sequence rendered from the volumes reconstructed at iterations 0 (initial estimation), 1, 7, and 9 (final). Note the gradual improvement of details, especially in the lips and nose area.

5. CONCLUSIONS

We have introduced a novel method that estimates the different poses of an object from several images and, at the same time, reconstructs the volume of the object using shape-from-silhouette and volume/pose feedback. The method assumes no model for the object and uses no feature points. It can therefore be applied to volume reconstruction of arbitrary objects, and it does not suffer from the problems of feature point detection, low feature point density, or feature point tracking.

Compared to the work in [5] and [6], it has the added advantage of not requiring any depth map at the outset. Instead, it builds up the depth map for every input frame, starting from flat depth maps assumed for all frames. The method can therefore be used to reconstruct an object from a regular video sequence.

Shape-from-silhouette can only reconstruct an object's volume up to its visual hull [10]. Because of this inherent limitation, our method cannot reconstruct the volume of a concave object accurately. For a concave object, however, the resulting volume can be refined using photo-consistency conditions [11] to obtain a better approximation of the true object volume. Our work also requires input images taken under orthographic projection (or weak perspective). This limitation is not serious as long as the camera stays focused on the object [12] throughout the sequence. The method does not work for texture-less objects; note that even the human visual system can hardly distinguish pose variations of a texture-less object under uniform illumination. Another possible extension of this work is to use other visual cues (e.g., shading) along with silhouette information to enhance reconstruction accuracy, especially for concave objects.

In this paper, the question of how to reconstruct an object's volume from a given set of input images has been addressed. When short processing time is the main concern, or the input size is large, the question of how much input is needed for a given reconstruction precision becomes important. This question should be addressed in future work.

6. REFERENCES

[1] R. Szeliski, "Rapid octree construction from image sequences," CVGIP: Image Understanding, vol. 58, no. 1, pp. 23-32, 1993.
[2] M. Levoy and P. Hanrahan, "Light field rendering," in Computer Graphics Proceedings, 1996, pp. 31-42.
[3] A. Broadhurst et al., "A probabilistic framework for space carving," in Proc. of IEEE International Conf. on Computer Vision, 2001, pp. 388-393.
[4] C. Lu et al., "Fast and globally convergent pose estimation from video images," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 22, no. 6, pp. 610-622, 2000.
[5] A. N. Avanaki, B. Hamidzadeh, and F. Kossentini, "Multi-objective retrieval of object pose from video," in Proceedings of IEEE Conference on Tools with Artificial Intelligence, November 2000, pp. 242-249.
[6] M. Harville et al., "3-D pose tracking with linear depth and brightness constraints," in Proc. of IEEE International Conf. on Computer Vision, 1999, pp. 206-213.
[7] A. N. Avanaki, B. Hamidzadeh, and F. Kossentini, "Pose estimation and 3-D modeling from video by volume feedback," in Proceedings of SPIE: Visual Communications and Image Processing, 2003, pp. 1131-1142.
[8] J. Barron and R. Klette, "Quantitative color optical flow," in Proc. of 16th International Conf. on Pattern Recognition, 2002, vol. 4, pp. 251-255.
[9] A. N. Avanaki, The segmented head sequence used in this paper, http://www.ece.ubc.ca/~alin/objects.htm, 2003.
[10] A. Laurentini, "The visual hull concept for silhouette-based image understanding," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 16, no. 2, pp. 150-162, 1994.
[11] G. Eckert et al., "Shape refinement for reconstructing 3d objects using an analysis-synthesis approach," in Proc. of IEEE International Conf. on Image Processing, 2001, pp. 903-906.
[12] R. Horaud et al., "Object pose: the link between weak perspective, paraperspective, and full perspective," International Journal of Computer Vision, vol. 22, no. 2, pp. 173-189, 1997.

