
Maximum a posteriori-based depth sensing with a single-shot maze pattern

RUODAI LI, FU LI,* YI NIU, GUANGMING SHI, LILI YANG, AND XUEMEI XIE

Key Laboratory of Intelligent Perception and Image Understanding (Chinese Ministry of Education), School of Electronic Engineering, Xidian University, Xi'an 710071, China
*[email protected]

Abstract: This study addressed the general problem of correspondence retrieval for single-shot depth sensing where the coded features cannot be detected perfectly. The traditional correspondence retrieval technique can be regarded as maximum likelihood estimation with a uniform distribution prior assumption, which may lead to mismatches for two types of insignificant features: 1) incomplete features that cannot be detected completely because of edges, tiny objects, abundant depth variations, and so on; and 2) distorted features disturbed by environmental noise. To overcome the drawback of the uniform distribution assumption, we propose a maximum a posteriori estimation-based correspondence retrieval method that uses the significant features as priors to estimate the weak or missing features. We also propose a novel monochromatic maze-like pattern, which is more robust to ambient illumination and the colors in scenes than the traditional patterns. Our experimental results demonstrate that the proposed system performs better than the popular RGB-D cameras and traditional single-shot techniques in terms of accuracy and robustness, especially with challenging scenes.

© 2017 Optical Society of America

OCIS codes: (150.5670) Range finding; (150.6910) Three-dimensional sensing.

References and links
1. L. Zhang, B. Curless, and S. M. Seitz, "Rapid shape acquisition using color structured light and multi-pass dynamic programming," in First International Symposium on 3D Data Processing Visualization and Transmission (IEEE, 2002), pp. 24–36.
2. P. Vuylsteke and A. Oosterlinck, "Range image acquisition with a single binary-encoded light pattern," IEEE Trans. Pattern Anal. Mach. Intell. 12, 148–164 (1990).
3. J. Salvi, J. Pagès, and J. Batlle, "Pattern codification strategies in structured light systems," Pattern Recognit. 37, 827–849 (2004).
4. M. Trobina, "Error model of a coded-light range sensor," Technical Report, Communication Technology Laboratory, ETH Zentrum, Zurich (1995).
5. M. Gupta and S. K. Nayar, "Micro Phase Shifting," in IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2012), 813–820.
6. V. Srinivasan, H. C. Liu, and M. Halioua, "Automated phase-measuring profilometry of 3-D diffuse objects," Appl. Opt. 23(18), 3105–3108 (1984).
7. S. Zhang, High-resolution, High-speed 3-D Dynamically Deformable Shape Measurement Using Digital Fringe Projection Techniques (INTECH Open Access Publisher, 2010).
8. G. Shi, L. Yang, F. Li, Y. Niu, R. Li, Z. Gao, and X. Xie, "Square wave encoded fringe patterns for high accuracy depth sensing," Appl. Opt. 54, 3796–3804 (2015).
9. T. Weise, B. Leibe, and L. Van Gool, "Fast 3D Scanning with Automatic Motion Compensation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR, 2007), 1–8.
10. K. L. Boyer and A. C. Kak, "Color-Encoded Structured Light for Rapid Active Ranging," IEEE Trans. Pattern Anal. Mach. Intell. 9(1), 14–28 (1987).
11. T. Monks, J. Carter, and C. Shadle, "Colour-encoded structured light for digitisation of real-time 3D data," in International Conference on Image Processing and its Applications (IET, 1992), 327–330.
12. J. Salvi, J. Batlle, and E. Mouaddib, "A robust-coded pattern projection for dynamic 3D scene measurement," Pattern Recognit. Lett. 19, 1055–1065 (1998).
13. R. Sagawa, Y. Ota, Y. Yagi, R. Furukawa, N. Asada, and H. Kawasaki, "Dense 3D reconstruction method using a single pattern for fast moving object," in 2009 IEEE 12th International Conference on Computer Vision (IEEE, 2009), 1779–1786.
14. R. Sagawa, R. Furukawa, and H. Kawasaki, "Dense 3D Reconstruction from High Frame-Rate Video Using a Static Grid Pattern," IEEE Trans. Pattern Anal. Mach. Intell. 36(9), 1733–1747 (2014).


https://doi.org/10.1364/OE.25.025332 Received 4 May 2017; revised 23 Aug 2017; accepted 23 Sep 2017; published 5 Oct 2017


15. E. M. Petriu, Z. Sakr, H. J. W. Spoelder, and A. Moica, "Object recognition using pseudo-random color encoded structured light," in Proceedings of the 17th IEEE Instrumentation and Measurement Technology Conference (2000), 1237–1241.
16. J. Salvi, S. Fernandez, T. Pribanic, and X. Llado, "A state of the art in structured light patterns for surface profilometry," Pattern Recognit. 43, 2666–2680 (2010).
17. S. Y. Chen, Y. F. Li, and J. Zhang, "Vision Processing for Realtime 3-D Data Acquisition Based on Coded Structured Light," IEEE Trans. Image Process. 17(2), 167–176 (2008).
18. P. M. Griffin, L. S. Narasimhan, and S. R. Yee, "Generation of uniquely encoded light patterns for range data acquisition," Pattern Recognit. 25, 609–616 (1992).
19. J. Pages, C. Collewet, F. Chaumette, and J. Salvi, "An approach to visual servoing based on coded light," in Proceedings 2006 IEEE International Conference on Robotics and Automation (IEEE, 2006), pp. 4118–4123.
20. M. Takeda and K. Mutoh, "Fourier transform profilometry for the automatic measurement of 3-D object shapes," Appl. Opt. 22(24), 3977–3982 (1983).
21. W. Chen, X. Su, Y. Cao, Q. Zhang, and L. Xiang, "Study on Fourier transforms profilometry based on bi-color projecting," Proc. SPIE 6027, 60271J (2006).
22. C. Albitar, P. Graebling, and C. Doignon, "Robust Structured Light Coding for 3D Reconstruction," in IEEE 11th International Conference on Computer Vision (IEEE, 2007), pp. 1–6.
23. X. Maurice, P. Graebling, and C. Doignon, "Epipolar based structured light pattern design for 3-D reconstruction of moving surfaces," in 2011 IEEE International Conference on Robotics and Automation (ICRA, 2011), pp. 5301–5308.
24. Q. Li, F. Li, G. Shi, S. Gao, R. Li, L. Yang, and X. Xie, "One-shot depth acquisition with a random binary pattern," Appl. Opt. 53(30), 7095–7102 (2014).
25. Z. Zhang, "A flexible new technique for camera calibration," IEEE Trans. Pattern Anal. Mach. Intell. 22, 1330–1334 (2000).
26. N. Otsu, "A Threshold Selection Method from Gray-Level Histograms," IEEE Trans. Syst. Man Cybern. 9, 62–66 (1979).
27. R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohi, J. Shotton, S. Hodges, and A. Fitzgibbon, "KinectFusion: Real-time dense surface mapping and tracking," in 2011 10th IEEE International Symposium on Mixed and Augmented Reality (ISMAR) (IEEE, 2011), 127–136.
28. M. Zollhöfer, M. Nießner, S. Izadi, C. Rehmann, C. Zach, M. Fisher, C. Wu, A. Fitzgibbon, C. Loop, and C. Theobalt, "Real-time non-rigid reconstruction using an RGB-D camera," ACM Trans. Graph. 33, 156 (2014).

1. Introduction

Depth sensing techniques are used widely in many fields, such as robot vision, medical sciences, human–computer interaction, and entertainment. Numerous depth sensing techniques have been developed, such as time-of-flight (ToF), stereo vision, and structured light illumination (SLI). Compared with other methods, SLI techniques have advantages in terms of their high accuracy and low cost.

A general framework for SLI methods involves projecting a well-designed coding pattern onto the scene and capturing the reflection of the projected patterns. The captured coding pattern may be twisted according to the depth variations of the objects in the scene. The depth maps are acquired in two steps. The first step involves extracting the features from the captured image. The second step involves matching the corresponding locations of the detected features with the projected features. The second step is known as correspondence retrieval and it plays an important role in depth sensing. The final depth map is restored based on the retrieved features via triangulation, so the accuracy of correspondence retrieval directly affects the depth sensing performance.

A typical implementation of correspondence retrieval involves minimizing the Euclidean distance of the codes between the extracted features and projected features. This strategy works well for the significant features that can be detected perfectly in the captured pattern. The most challenging issue is how to retrieve the insignificant features, which can be divided roughly into two types: 1) incomplete features that cannot be detected completely because of edges, tiny objects, and abundant variations in depth; and 2) distorted features disturbed by environmental noise. The insignificant features appear as bit errors in the feature vector, where some bits may be lost due to the edges, or changed by noise. Unfortunately, the Euclidean distance-based method is sensitive to these bit errors, and after a bit error occurs, the image feature may be mismatched with another feature in the designed pattern.


Bit errors are inevitable in the Euclidean distance-based correspondence retrieval methods because these methods ignore the inherent correlations between adjacent features and they view the correspondences among features as independent events. In fact, according to the prior knowledge that an image feature x_1 corresponds to a pattern feature θ_1, the correspondence θ_2 of the neighboring feature x_2 is more likely to be located near θ_1. This observation motivated us to determine the correspondences of insignificant features based on prior knowledge of the correspondence results for the significant features. Mathematically, estimating the correspondence of features independently via the Euclidean distance can be regarded as the maximum likelihood estimation (MLE) process, which assumes that the prior probabilities of the correspondences of features follow a uniform distribution. Thus, we fully explore the inherent spatial correlations between the significant features and insignificant features. In this study, we propose a maximum a posteriori estimation-based correspondence retrieval (MAPECR) method, where the significant features with high detection confidence are used to generate the prior distribution of the correspondences of the insignificant features. Instead of simply minimizing the Euclidean distance of the codes, we maximize the posterior probability based on the matching probability and the prior probability.

In addition to improving the correspondence retrieval method, we address the insignificant feature problem by designing a more robust pattern. Traditional single-shot patterns can be categorized into three classes: color-encoding, continuous intensity-based, and shape-encoding schemes. 1) In color-encoding methods, color extraction is vital for the depth sensing quality. Unfortunately, in addition to the projected pattern, the colors of image pixels depend on many factors such as the reflectance of scenes, shadings, color cross-talk between the projector spectra and sensor filters, and sensor noise [1]. Reliably extracting colors from images is difficult because misclassifications may occur, which limits the applicability of color-encoding methods to scenes with neutral colors. 2) Among the single-shot methods with continuous intensity, the Fourier transform profilometry (FTP) scheme is the most popular. The FTP method employs frequency filtering in the Fourier domain to extract the phases, which contain the depth information for the scenes. The filtering process is sensitive to spectrum aliasing, which may occur due to disturbance by uneven illumination and abrupt depth changes in the scenes. 3) Shape-encoding methods design features with different heights, widths, orientations, or compactness to ensure their unique identity. Unfortunately, the features are not invariant to perspective distortion. Thus, depth variations may distort the features and lead to incorrect correspondence retrieval [2].

We propose a maze-like pattern to suppress the disturbance due to the colors in scenes, ambient illumination, and depth variations. The pattern is monochromatic with only three gray levels: black, white, and gray. The pattern comprises parallel white horizontal lines and intersecting short directional gray lines, where the intersections lie on preset epipolar lines. According to the epipolar constraint, the corresponding intersections in the image are definitely located on the corresponding epipolar lines, regardless of the target scene.

The hybrid coding pattern based on both shapes and gray levels overcomes the drawbacks of the traditional single coding schemes. First, feature detection using monochromatic patterns is less sensitive to the chromatic surfaces of scenes than color-encoding patterns. Second, only three gray levels are used in the maze pattern, so the pattern is more robust to ambient light than those with continuous intensity. Third, by exploiting the epipolar constraint, the intersections in the captured image can be detected accurately regardless of the depth variations in the scenes. By improving the robustness of the designed pattern to various


disturbances in the environment, the incorrect and incomplete features can be decreased to improve the performance of the correspondence retrieval process.

The remainder of this paper is organized as follows. In Section 2, we introduce related research in the SLI field. In Section 3, we explain the details of the traditional correspondence retrieval techniques. In Section 4, we describe the principles of the proposed MAPECR method. The codification of the maze pattern is introduced in Section 5. Details of the implementation of the depth sensing method and the maze pattern are given in Section 6. In Section 7, we present the experimental results. We give our conclusions in Section 8.

2. Related work

In recent decades, many methods have been proposed for constructing patterns with better accuracy for feature detection and greater robustness to environmental noise. According to the classification proposed by Salvi [3], SLI techniques can be categorized into two classes: temporal-based and spatial-based schemes.

The temporal-based schemes encode positional information for a projector's pixels by multiple projected patterns in a time series. The binary code method [4] and phase shifting method [5–7] are the best known of the temporal-encoding schemes. Hybrid techniques have also been investigated. Shi [8] encoded a square wave in three sinusoidal fringe patterns for depth sensing, where the periods of the square wave and sinusoidal wave are coprime in order to unwrap the phases. Weise [9] proposed a method that combines the stereo matching method and phase shifting method. This type of method has the advantages of high accuracy and robustness. However, multiple shots must be conducted to capture the pattern series, so the temporal-based techniques are not suitable for dynamic scenes.

The spatial-based schemes project only one pattern and the depth maps are acquired with just a single shot, which is favorable for depth sensing in dynamic scenes. Most of the spatial-based methods are color-encoding schemes, where the De Bruijn sequence and M-array are the most commonly used coding maps. A De Bruijn sequence of order m over an alphabet of n symbols is a circular string of length n^m, which contains each substring of length m exactly once [3]. As a one-dimensional (1D) code, the De Bruijn sequence is usually employed in stripe or slit patterns [1, 10, 11], and grid patterns [12–15]. In order to overcome the loss of slits, Monks [11] built a graph where each node represents an edge between a chromatic slit and the black band. The correspondence is retrieved by assigning costs for the deletion, replacement, and insertion of nodes into the graph, and by finding the minimum cost required to transform the graph back into the original encoded graph. Zhang [1] proposed a multi-pass dynamic programming method for dealing with the problems of deletions and disorders of the slits. Sagawa [13, 14] employed belief propagation to detect the grid and decode the De Bruijn codes. The M-array, which is also called a perfect map, is an extension of the De Bruijn sequence in two-dimensional space. M-arrays are random arrays of r × v dimensions in which a sub-matrix of n × m dimensions appears only once in the whole pattern [16]. In previous studies [17–19], chromatic blocks were utilized to construct M-arrays, where the correspondence was retrieved by minimizing the Euclidean distances of the sub-matrices in the patterns and the images.
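For illustration only (not part of the proposed method), the following Python sketch generates a De Bruijn sequence B(n, m) and checks the unique-window property that the stripe and grid codifications cited above rely on; the parameters n = 3 and m = 3 are arbitrary example values.

# Illustrative sketch: De Bruijn sequence B(n, m) over n symbols with window
# length m, and a check of the unique-substring property used in stripe/slit
# codification. Parameters below are arbitrary examples.
def de_bruijn(n, m):
    """Return a De Bruijn sequence of order m over the alphabet {0, ..., n-1}."""
    a = [0] * n * m
    seq = []

    def db(t, p):
        if t > m:
            if m % p == 0:
                seq.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, n):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return seq

if __name__ == "__main__":
    n, m = 3, 3                               # e.g., 3 colors, windows of 3 stripes
    s = de_bruijn(n, m)                       # circular string of length n**m = 27
    windows = {tuple((s + s[:m - 1])[i:i + m]) for i in range(len(s))}
    assert len(s) == n ** m and len(windows) == n ** m   # every window appears exactly once
    print(s)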
Takeda [20] developed the FTP method by projecting a single grating pattern with sinusoidal intensity. Phase decoding is performed in the frequency domain rather than the spatial domain. The carrier frequency containing the phase information is separated from the other frequency components and the depth map is finally deduced from the phases. Chen [21] improved Takeda's method by adding a π-phase shifting pattern to the projected FTP pattern, which increases the maximum reconstructible slope of the original method [20]. Shape-encoding methods also use various codes such as the De Bruijn sequence, M-array, and pseudo-random codes to construct patterns. Vuylsteke [2] proposed a chessboard-like


pattern by using a De Bruijn sequence of order six and length 63. A unique identification comprises six adjacent unit codes in every 2 × 3 rectangular window rather than a 1 × 6 window. This method decreased the depth variation disturbances by using a compact pattern. Albitar [22] constructed an M-array pattern using three unit shapes: circle, disc, and stripe. Maurice [23] designed a pattern with cuneiform features and arranged the features along the epipolar lines to form a sub-perfect map. Li [24] projected a pseudo-random code-based speckle pattern, which can be transformed into a fringe pattern with sine intensity after integration. An initial depth map is acquired by speckle matching and a better depth map is then obtained based on the corresponding phases in the integrated fringe pattern. These methods [2, 22–24] all use the Euclidean distance-based correspondence retrieval method.

3. Insights from the traditional correspondence retrieval technique

According to the epipolar constraint, for any point x in the captured image, its corresponding point lies on the epipolar line in the projected pattern. To eliminate the 1D ambiguity, SLI methods generally assign each feature a unique code among the features on the epipolar line. Let (θ, Θ) denote a feature in the pattern, where θ denotes the 1D position along the epipolar line and Θ denotes the code of the feature. The features are projected onto the scene and the twisted features are then captured by the camera. For a captured feature, in addition to the position x and code X, the decoding confidence α is also an important attribute. The captured feature is denoted by (x, X, α).

In terms of probability, the correspondence retrieval process for feature (x, X, α) in the captured image aims to find the feature (θ, Θ) in the projected pattern that maximizes the conditional probability p(θ | x, X, α). In the following sections, we abbreviate p(θ | x, X, α) as p(θ | x). Traditional methods retrieve correspondences by minimizing the Euclidean distance of the codes between the features in the captured image and the projected pattern. Let m(X_i, Θ_i) denote the Euclidean distance of the codes for features (x_i, X_i, α_i) and (θ_i, Θ_i). m(X_i, Θ_i) is defined by:

m(X_i, Θ_i) = ‖X_i − Θ_i‖_2.    (1)

In this condition, p(θ_i | x_i) is equal to the matching probability p(x_i | θ_i):

p(θ_i | x_i) = p(x_i | θ_i),    (2)

where p(x_i | θ_i) is the normalized m(X_i, Θ_i). p(x_i | θ_i) only depends on the codes X_i and Θ_i, so the correspondence retrieval of each feature is independent. Let θ and x denote the vectors of all the features in the pattern and image, respectively. The global optimum correspondence is derived by:

θ = argmax_θ p(θ | x) = argmax_θ ∏_i p(x_i | θ_i),    (3)

which is the MLE.

4. MAPECR

MLE-based correspondence retrieval methods assume that the correspondence of each feature is independent. In fact, the correspondences of features are spatially correlated according to the prior knowledge that most scenes have smooth depth variations. For an image feature x_i, the


correspondence θ_i should be determined by both the matching probability p(x_i | θ_i) and the prior probability p(θ_i), as formulated by:

θ_i = argmax_{θ_i} p(θ_i | x_i) = argmax_{θ_i} p(x_i | θ_i) p(θ_i).    (4)

This is the proposed MAPECR method. The MLE method can be regarded as a special case of the maximum a posteriori method, which implicitly assumes that the prior probability follows a uniform distribution. In the following section, we discuss the estimation of the prior probability p(θ) from the correspondence prior x_k ↔ θ_k of a neighboring feature x_k. The prior probability is denoted by p(θ | x_k).

For a platform where the optic axes of both the projector and camera are perpendicular to the baseline, we assume that a correspondence pair x_k ↔ θ_k determines a three-dimensional (3D) point with depth Z_k in the scene. As shown in Fig. 1, if the depth Z near Z_k remains unchanged, the corresponding point θ̄ of the neighboring point x can be determined by:

θ̄ = (f_p / f_c)(x − x_k) + θ_k,    (5)

where f_p and f_c are the focal lengths of the projector and camera, respectively.

Fig. 1. Estimation of the location of the corresponding feature in a pattern for scenes with fixed depth.

If Z varies around Z_k, as shown in Fig. 2, the correspondence θ will also vary around θ̄. Let Δθ denote the pixel offset of θ from θ̄ and ΔZ the depth difference between Z and Z_k. By the similar triangle principle:

Δθ / L = f_p / Z_k,    (6)

L / B = ΔZ / (Z_k + ΔZ),    (7)

we derive the relationship between Δθ and ΔZ:

Δθ = f_p B / (Z_k² / ΔZ + Z_k).    (8)

For a small ΔZ, Eq. (9) holds:


Z_k² / ΔZ >> Z_k.    (9)

Then, the offset Δθ with respect to θ̄ is approximately proportional to the depth variation ΔZ:

Δθ ≈ f_p B ΔZ / Z_k².    (10)
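As a small numerical illustration of Eqs. (5) and (10), the following sketch uses hypothetical focal lengths, baseline, and depths rather than the calibration values of this paper:

# Minimal numerical sketch of Eqs. (5) and (10); the focal lengths, baseline,
# and depths below are hypothetical example values.
f_p, f_c = 1200.0, 1600.0    # projector / camera focal lengths (pixels)
B = 100.0                    # baseline (mm)

def predicted_position(x, x_k, theta_k):
    """Eq. (5): predicted pattern position (theta-bar), assuming the depth near Z_k is unchanged."""
    return f_p / f_c * (x - x_k) + theta_k

def offset_from_depth_change(dZ, Z_k):
    """Eq. (10): first-order offset of the true correspondence from theta-bar."""
    return f_p * B * dZ / Z_k ** 2

theta_bar = predicted_position(x=520.0, x_k=500.0, theta_k=380.0)
d_theta = offset_from_depth_change(dZ=5.0, Z_k=1000.0)   # roughly 0.6 pixel at 1 m
print(theta_bar, d_theta)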

Fig. 2. Estimation of the location of the corresponding feature in a pattern for scenes with nonuniform depth.

The smoothness prior of scenes indicates the correlation of the correspondences between features. We assume that the correlation varies inversely with the distance between features. For simplicity, we choose the Euclidean distance to measure the distances, which implies a Gaussian prior. Therefore, we use a Gaussian distribution to characterize the prior distribution of the candidate feature θ in the pattern: θ ~ N(μ_k, σ_k), where μ_k = θ̄. The probability distribution p(θ | x_k) of θ with the prior correspondence pair x_k ↔ θ_k is formulated by:

p(θ | x_k) = 1/√(2πσ_k²) exp(−(θ − μ_k)² / (2σ_k²)).    (11)

In Eq. (11), σ_k is the only undetermined parameter. The remaining problem is determining the variance σ_k, which indicates the reliability of the estimated prior probability p(θ | x_k). Let Δθ_k denote the distance between θ_k and θ̄. When Δθ_k is smaller, the spatial correlation between θ_k and θ̄ is stronger, and thus σ_k should be smaller. As shown in Fig. 3, the Gaussian curve becomes "fatter" as Δθ_k increases. When Δθ_k → ∞ and σ_k → ∞, the prior probability distribution degenerates into a uniform distribution and no prior knowledge can be derived. The maximum a posteriori estimation then degenerates into the MLE method.


Fig. 3. By fixing p_0 (the shaded area), the variance σ_k is positively related to the distance Δθ_k.

The variance σ_k can be solved from Eq. (12) by referring to the Gaussian integral lookup table:

p_0 = ∫_{θ̄−Δθ_k}^{θ̄+Δθ_k} 1/√(2πσ_k²) exp(−(θ − θ̄)² / (2σ_k²)) dθ,    (12)

where p_0 is the preset value of the probability p(|θ − θ̄| ≤ Δθ_k | x_k). The mathematical meaning of p_0 is illustrated by the shaded areas in Fig. 3. The physical meaning of |θ − θ̄| ≥ Δθ_k is occlusion. Figure 4 shows the case where |θ − θ̄| = Δθ_k, which means that θ_k is the corresponding feature of both features x and x_k. This is the critical condition for occlusion. We can set the value of p_0 according to the prior knowledge of occlusion in the target scene, i.e., a smaller p_0 for scenes with more occlusion and a larger p_0 for those with less occlusion.
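As an aside, σ_k can also be obtained from Eq. (12) in closed form with the inverse Gaussian CDF instead of a lookup table. A minimal sketch, with hypothetical p_0 and Δθ_k values:

# Sketch: solving Eq. (12) for sigma_k with the inverse normal CDF rather than a
# lookup table. p0 and d_theta_k below are hypothetical example values.
from scipy.stats import norm

def sigma_from_p0(d_theta_k, p0):
    """P(|theta - theta_bar| <= d_theta_k) = p0  =>  sigma_k = d_theta_k / Phi^{-1}((1 + p0) / 2)."""
    return d_theta_k / norm.ppf((1.0 + p0) / 2.0)

sigma_k = sigma_from_p0(d_theta_k=8.0, p0=0.9)   # larger d_theta_k or smaller p0 -> larger sigma_k
print(sigma_k)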

Fig. 4. Critical condition for occlusion.

By using μ_k and σ_k, the probability p(θ | x_k) with the prior correspondence x_k ↔ θ_k is determined. The final prior probability p(θ) is formulated as a weighted average of the p(θ | x_k) based on a Gaussian mixture model, as shown in Eq. (13):

p(θ) = ∑_k ω_k p(θ | x_k),    (13)


where ω_k denotes the weights. Large weights should be assigned to reliably decoded features, so ω_k is defined as proportional to the decoding confidence α_k:

ω_k = α_k t,    (14)

where t is a constant for all ω_k and α_k. The decoding confidence α_k is obtained in the feature detection process. t and ω_k can be calculated by adding the normalization constraint:

∑_k ω_k = 1.    (15)
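To make Eqs. (4), (5), and (11)–(15) concrete, the following simplified Python sketch assembles the Gaussian mixture prior from a few confident correspondences and takes the MAP decision for one insignificant feature. It is an illustration rather than the authors' implementation: the Hamming distance stands in for the normalized code-matching term of Eq. (2), and all numeric values are hypothetical.

# Simplified MAPECR sketch: build the Gaussian-mixture prior from confident
# correspondences (Eqs. 5, 11-15) and pick the MAP candidate (Eq. 4).
# Calibration values below are hypothetical.
import numpy as np

f_p, f_c, B = 1200.0, 1600.0, 100.0   # hypothetical projector/camera focal lengths and baseline
z_half = 1.6449                        # ~Phi^{-1}((1 + p0) / 2) for p0 = 0.9, cf. Eq. (12)

def map_correspondence(x, X, candidates, anchors):
    """x, X: position and binary code of one insignificant image feature.
    candidates: list of (theta, Theta) pattern features on its epipolar line.
    anchors: list of (x_k, theta_k, alpha_k) confident correspondences."""
    thetas = np.array([theta for theta, _ in candidates], dtype=float)

    # matching probability from the code distance (Hamming distance as a stand-in
    # for the normalized Euclidean term of Eqs. (1)-(2))
    dist = np.array([np.sum(np.asarray(X) != np.asarray(Theta)) for _, Theta in candidates], dtype=float)
    likelihood = np.exp(-dist)
    likelihood /= likelihood.sum()

    # Gaussian mixture prior of Eqs. (11)-(15)
    alphas = np.array([a for _, _, a in anchors], dtype=float)
    weights = alphas / alphas.sum()                              # Eqs. (14)-(15)
    prior = np.zeros_like(thetas)
    for (x_k, theta_k, _), w in zip(anchors, weights):
        theta_bar = f_p / f_c * (x - x_k) + theta_k              # Eq. (5)
        sigma_k = max(abs(theta_bar - theta_k), 1.0) / z_half    # Eq. (12)
        prior += w * np.exp(-(thetas - theta_bar) ** 2 / (2 * sigma_k ** 2)) / sigma_k
    prior /= prior.sum()

    return candidates[int(np.argmax(likelihood * prior))]        # Eq. (4): MAP decision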

Thus, the prior probability p(θ_i) is obtained for each feature x_i. By using the prior probability p(θ_i) and the matching probability p(x_i | θ_i), the maximum a posteriori estimate of the correspondence of each feature is determined by Eq. (4). It should be noted that the proposed MAPECR method is a general solution that can be applied to various spatial-based depth sensing methods. In the following sections, we describe a novel pattern and the retrieval of the correspondence by using the proposed MAPECR method in order to verify its effectiveness.

5. Pattern codification

The proposed pattern is shown in Fig. 5(a). The pattern resembles a maze, so we refer to it as a maze pattern. The pattern codification process is described in terms of feature design and code generation.

Fig. 5. (a) The maze pattern. (b) The crossing features appear as the intersections of the horizontal lines and epipolar lines (red lines) in the captured image.

5.1. Feature design

The maze pattern comprises horizontal parallel white lines and directional short gray lines. Every horizontal line is assigned a unique binary codeword, which is represented by its intersecting directional lines. Lines in the 45-degree direction represent 0 and those in the –45-degree direction represent 1. The crossings are placed along the predefined epipolar lines with approximately uniform intervals. This design exploits the constraints of epipolar geometry. It should be noted that the proposed pattern codification is based on SLI platforms where the camera and projector are placed vertically. When used for horizontally placed platforms, the maze pattern should comprise vertical parallel white lines and directional short gray lines instead.

The epipolar geometry is shown in Fig. 6. Let p denote a point in the pattern. The plane defined by p and the baseline is the epipolar plane. The epipolar plane intersects the image planes of the camera and the projector in two lines, which are called a pair of epipolar lines. The


epipolar constraint requires that the corresponding point of p lies on its corresponding epipolar line in the image, regardless of the target scene. Therefore, the crossings can be located accurately as the intersections of the corresponding epipolar lines and the horizontal lines in the captured images, as illustrated by the captured image shown in Fig. 5(b). In order to pre-calculate the epipolar lines, the intrinsic and extrinsic parameters of the platform need to be calibrated in advance. Zhang's method [25] is used and it satisfies our requirement for precision.

One of the advantages of the maze pattern is that the feature detection process is insensitive to depth variation disturbances in scenes. In addition to its robustness to depth variations, the proposed pattern has three other advantages. First, the monochromaticity of the pattern contributes to feature detection because it is not affected by disturbances from the colors in the scenes. Second, the white horizontal lines, gray directional lines, and black background can be separated easily with two intensity thresholds. Only three distinguishable gray intensities are employed, so the pattern still has the capacity to resist interference from the ambient illumination. Third, the one-pixel-wide line features can be detected at the pixel accuracy level in the captured images, which facilitates accurate depth sensing.


Fig. 6. The epipolar geometry.

5.2. Code generation

Errors are inevitable in the decoding process, so codes with large Hamming distances are often employed to facilitate error correction. Many methods [17–19, 22] use code tables with large global Hamming distances, and thus there are large Hamming distances between any two codes in the code table. However, the bit length of the codewords increases and the areas of the designed features are inevitably large, which increases the sensitivity of feature detection to depth variations. In fact, there is no requirement for a large global Hamming distance. According to the discussion in Section 4, the corresponding features are more likely to be located near the estimated position. This motivated us to design a code table where each feature only requires large Hamming distances between its neighboring codes within a certain region. Thus, the code length can be decreased without losing the capacity for error correction. Using this method, we obtain a better compromise between the error-correcting capacity and the sensitivity to depth variations.


The designed code table is based on the following rules. (1) Every code is unique in the code table. (2) For each code, the average Hamming distance is as large as possible with respect to its nearest k codes.

Before introducing the method for generating the code table, we explain the determination of some important pattern parameters. Let V denote the vertical resolution of the pattern, W the interval of the horizontal lines, and R the number of pixels in the neighboring region. Then, the number N of horizontal lines, the codeword length T, and the number k are determined by Eqs. (16)–(18), respectively:

N = ⌊V / W⌋,    (16)

T ≥ ⌈log₂ N⌉,    (17)

k = ⌈R / W⌉,    (18)

where ⌊·⌋ and ⌈·⌉ are the floor and ceiling operations, respectively.

We use an N × T matrix to store the primary code matrix (PCM). The PCM can be generated easily using a greedy algorithm. The generation process begins by randomly initializing the first two rows. The other codes are generated by traversing all the available unique codes with T bits and checking their average Hamming distance against the previous k/2 codes. The available code with the largest average Hamming distance is added as a new row in the PCM. Before assigning the T-bit codes in the PCM to the horizontal lines, we extend the PCM into an N × M cyclic code matrix (CCM) by replicating all the columns in the PCM. Assuming there are K directional lines intersecting each horizontal line, M is determined by:

M = ⌈K / T⌉ × T.    (19)
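The greedy PCM construction described above can be sketched as follows; the parameter values are illustrative (V and W loosely follow the projector resolution and line interval reported in Section 7, while R and K are assumptions), not the exact settings of our implementation.

# Greedy primary code matrix (PCM) sketch following the rules above: each new
# T-bit code maximizes the average Hamming distance to the previous k/2 codes.
# Parameter values are illustrative assumptions.
import math
import random
from itertools import product

V, W, R, K = 1140, 13, 100, 60               # pattern height, line interval, neighborhood, crossings per line
N = V // W                                   # Eq. (16): number of horizontal lines
T = math.ceil(math.log2(N))                  # Eq. (17): codeword length
k = math.ceil(R / W)                         # Eq. (18): neighboring codes to separate

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def generate_pcm(n_rows, t_bits, k_neigh, seed=0):
    """Greedy PCM: every new row maximizes the mean Hamming distance
    to the previous k_neigh/2 rows; all rows are unique."""
    rng = random.Random(seed)
    pool = list(product((0, 1), repeat=t_bits))
    rng.shuffle(pool)
    pcm = [pool.pop(), pool.pop()]            # randomly initialized first two rows
    while len(pcm) < n_rows and pool:
        recent = pcm[-max(k_neigh // 2, 1):]
        best = max(pool, key=lambda c: sum(hamming(c, r) for r in recent))
        pcm.append(best)
        pool.remove(best)
    return pcm

pcm = generate_pcm(N, T, k)
ccm = [row * math.ceil(K / T) for row in pcm]  # Eq. (19): cyclic code matrix (columns replicated)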

Every row of the CCM stores a periodic code and every column stores a certain bit of the periodic code.

The detailed process for generating a maze pattern is as follows. Firstly, horizontal lines with uniform intervals are generated in the pattern. Secondly, K points with uniform intervals are selected on an arbitrary horizontal line. Then, K epipolar lines through the K points are calculated and drawn with the calibrated parameters. The K epipolar lines (in the vertical direction) and the horizontal lines form a grid. Thirdly, directional lines are centered at the grid crossings. The directions of these lines are determined by the codes in the CCM. According to the order of the lines in the grid, each row of the CCM is assigned to a horizontal line, where the directional lines in the 45-degree direction represent zeros and those in the –45-degree direction represent ones. Each column in the CCM is placed on an identical epipolar line.

6. Implementation details

In this section, we explain the detailed implementation of depth sensing with the maze pattern, including the feature detection, correspondence retrieval, and depth interpolation processes.

6.1. Feature detection implementation

As defined in Section 3, a feature in an image has a position, code, and decoding confidence. Thus, feature detection involves position detection and code detection. For our maze pattern,


the features are the horizontal lines and the intersecting directional short lines. In this section, we introduce a method for suppressing disturbance caused by ambient illumination and for accurately detecting the positions and codes of the horizontal lines from the captured images of arbitrary scenes.

6.1.1. Position detection

As shown in Fig. 7, the position detection procedure comprises three steps: background equalization, thresholding segmentation, and energy minimization.

Fig. 7. Feature detection procedure.

First, we employ a well-known background equalization technique, the grayscale top-hat transform, to separate the foreground from the background. The top-hat transform T(·) of a grayscale image f is defined by:

T(f) = f − f ∘ b,    (20)

where ∘ denotes the grayscale opening operation and b denotes the flat structuring element (SE). The term f ∘ b can be regarded as the intensity of the background, and the foreground can be extracted by subtracting the background intensity from the image f. In experiments, we found that the grayscale top-hat transform performed well when the size of the SE was comparable to the interval of the lines, whereas the performance declined when the size of the SE was much larger than the interval of the lines. Therefore, the recommended size for the SE is 1.5 times the interval of the horizontal lines in the captured image.

After subtracting the background illumination, we separate the white horizontal lines, gray directional lines, and black background. The intensities of these three components differ greatly, so they can be separated simply by using two intensity thresholds. Considering that the albedo of the scene may be uneven, two global fixed thresholds may fail to segment the entire image. Instead, we split the image into sub-blocks and apply the adaptive thresholding method [26] within each block to achieve segmentation. The recommended size of the blocks is 1.5–2 times the interval of the horizontal lines in the captured image.

After gray-level segmentation, we obtain a binary mask of horizontal stripes that indicates the coarse locations of the horizontal lines. The remaining task is estimating the one-pixel-wide lines from the stripes provided by the binary mask. Considering that a stripe is formed by spreading the energy of a perfect one-pixel-wide line into its adjacent neighboring pixels, it may be assumed that the original position of the line is the one-pixel-wide connection of the pixels with the peak intensity values along the direction perpendicular to the line. First, we collect all the vertical peak pixels as the initial positions of the horizontal lines, which are obviously affected by noise. The next problem is eliminating noise and obtaining smooth one-pixel-wide lines. Each horizontal line is extracted by minimizing an energy function. Let the vector y denote the vertical coordinates of the objective horizontal line and the


vector y_0 denote the vertical coordinates of the initial positions. The objective function is given by:

y = argmin_y λ‖y − y_0‖_2² + ‖Dy‖_2²,    (21)

where D denotes the differential matrix and λ is an adjustment parameter for tuning the smoothness of the line. Assuming that the objective horizontal line contains n pixels, both y and y_0 are n × 1 vectors, and D is an n × n first-order difference matrix, with 1 and −1 in adjacent columns of each row:

D = \begin{bmatrix} 1 & -1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & -1 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & 1 & -1 \\ 0 & 0 & 0 & \cdots & 0 & 0 \end{bmatrix}_{n \times n}.    (22)

The first term in Eq. (21) ensures that y covers most of the initial pixels inside the horizontal mask. The second term maintains the smoothness of the final line. As λ → ∞, the detected line forms an exact connection among all the initial pixels, which is noise sensitive. As λ → 0, the detected line is a perfect horizontal line based on the least squares estimation of Dy = 0. We empirically set λ = 0.1. The objective function of Eq. (21) is a quadratic optimization problem and its closed-form solution is given by Eq. (23):

y = (λI + DᵀD)⁻¹ λ y_0.    (23)
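A condensed sketch of the position detection pipeline follows: top-hat background equalization (Eq. (20)) and the closed-form line smoothing of Eq. (23). The block-wise Otsu segmentation and peak picking are reduced to a simple per-column maximum, and the structuring-element size and λ are illustrative assumptions.

# Sketch of position detection: top-hat background equalization (Eq. 20) and
# closed-form line smoothing (Eqs. 21-23). Block-wise Otsu segmentation and
# peak picking are simplified to a per-column maximum inside one stripe band.
import numpy as np
from scipy import ndimage

def equalize_background(img, line_interval):
    """Grayscale top-hat transform with a flat SE about 1.5x the line interval."""
    se = int(round(1.5 * line_interval))
    return ndimage.white_tophat(img, size=(se, se))

def smooth_line(y0, lam=0.1):
    """Eq. (23): y = (lam*I + D^T D)^-1 * lam * y0, with D the first-difference matrix."""
    n = len(y0)
    D = np.eye(n) - np.eye(n, k=1)
    D[-1, :] = 0.0                       # keep D square, as in Eq. (22)
    A = lam * np.eye(n) + D.T @ D
    return np.linalg.solve(A, lam * np.asarray(y0, dtype=float))

def detect_line_in_band(equalized, row_lo, row_hi):
    """Rough initial line positions: per-column intensity peak inside one stripe band."""
    band = equalized[row_lo:row_hi, :]
    y0 = row_lo + np.argmax(band, axis=0)
    return smooth_line(y0)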

6.1.2. Code detection

After the horizontal lines are extracted, we decode each bit b_i of the codewords and obtain the corresponding bit confidence γ_i. First, the equalized image is filtered using an X-shaped template and a response map is obtained. The response value r(x, y) is formulated as:

r(x, y) = ∑_{i=−s}^{s} I(x + i, y − i) − ∑_{i=−s}^{s} I(x + i, y + i),    (24)

where s denotes the size of the template. Intuitively, the 0 bit code appears as a large positive response value and the 1 bit code appears as a large negative response value. In fact, due to the ambient illumination noise and the shape distortion caused by depth variations, locating positions by detecting the local maxima or minima in the response map is unreliable. Instead, we exploit the "epipolar alignment" property of the maze pattern introduced in Section 5. The crossings of the horizontal lines and directional lines lie on the predefined epipolar lines; thus, the crossings can be identified accurately as the intersections of the horizontal lines and epipolar lines. Then, each bit code b_i is detected by assessing the sign of the response value at the crossing:


b_i = 0 if r(i) > 0, and b_i = 1 if r(i) < 0,    (25)

where r(i) is the response value of bit code b_i. The bit confidence γ_i is defined by:

γ_i = |r(i)| / r_max,    (26)

where r_max is the maximum absolute response value in the response map. Obviously, γ_i ∈ [0, 1]. All b_i on a certain horizontal line comprise a periodic code B = {b_{i+jT}, i = 1, 2, ..., T, j = 0, 1, ...} with period T. The bit confidence of bit b_{i+jT} is denoted by γ_{i+jT}.

6.2. MAPECR implementation

The general MAPECR method was discussed in Section 4. In this section, we briefly explain the specialized implementation of the MAPECR method based on the maze pattern. For the maze pattern, every extracted horizontal line is regarded as an independent feature (x, X, α). First, we construct the position x, code X, and decoding confidence α according to the extracted coordinate (u_i, v_i) of each pixel on the horizontal line, the periodic code B = {b_{i+jT}, i = 1, 2, ..., T, j = 0, 1, ...} of each horizontal line, and the bit confidence γ_i of each

bit b_i. The position x of a horizontal line is derived by averaging the vertical coordinates of the pixels on the line:

x = (1/N) ∑_{i=1}^{N} v_i,    (27)

where N denotes the number of pixels on the line. The lines in the captured image are twisted, but they still extend approximately along the horizontal direction. Horizontal lines are distinguishable by their average vertical coordinates.

Each horizontal line in the designed maze pattern is assigned a primary code with T bits. In the captured image, we need to obtain the T-bit primary code X = {c_i, i = 1, 2, ..., T} from the periodic code B = {b_{i+jT}, i = 1, 2, ..., T, j = 0, 1, ...}. b_{i+jT} is assigned to the i-th bit c_i if b_{i+jT} has the maximum bit confidence γ_{i+jT} among all j. Mathematically:

c_i = b_{i+ĵT},    (28)

β_i = γ_{i+ĵT},    (29)

where

ĵ = argmax_j γ_{i+jT}.    (30)

β_i denotes the bit confidence of c_i. The decoding confidence α is obtained from the β_i, where i = 1, 2, ..., T. The primary code may be incorrect even if only one bit is not trusted. Therefore, the decoding confidence α of X is equivalent to its minimum bit confidence β_i:


α = β_î,    (31)

where

î = argmin_i β_i.    (32)
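A sketch of the decoding chain of Eqs. (24)–(32), from the X-shaped filter response to the primary code X, bit confidences β_i, and decoding confidence α, is given below. The image-indexing convention (first axis = y), the crossing coordinates, and the template size are illustrative assumptions.

# Sketch of the code/confidence decoding in Eqs. (24)-(32). The crossing
# coordinates are assumed to be already located via the epipolar-alignment
# property; the indexing convention (first array axis = y) is an assumption.
import numpy as np

def response_map(equalized, s=3):
    """Eq. (24): difference of intensities summed along the two diagonals of an X template."""
    img = equalized.astype(float)
    r = np.zeros_like(img)
    for i in range(-s, s + 1):
        r += np.roll(img, (i, -i), axis=(0, 1))    # aligns I(x+i, y-i) at (x, y)
        r -= np.roll(img, (-i, -i), axis=(0, 1))   # aligns I(x+i, y+i) at (x, y)
    return r

def decode_line(r, crossings, T):
    """crossings: (row, col) coordinates of the crossings on one horizontal line,
    ordered along the epipolar grid. Returns the primary code X, bit confidences
    beta, and decoding confidence alpha (Eqs. 25-32)."""
    resp = np.array([r[row, col] for row, col in crossings])
    bits = (resp < 0).astype(int)                  # Eq. (25)
    gamma = np.abs(resp) / np.abs(r).max()         # Eq. (26)
    X, beta = np.zeros(T, dtype=int), np.zeros(T)
    for i in range(T):                             # Eqs. (28)-(30): keep the most confident period
        j_hat = int(np.argmax(gamma[i::T]))
        X[i], beta[i] = bits[i + j_hat * T], gamma[i + j_hat * T]
    alpha = beta.min()                             # Eqs. (31)-(32)
    return X, beta, alpha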

The decoding confidence α indicates the decoding quality of the primary code X, which is very useful in the correspondence retrieval process. We consider an extreme case where α is equal to 0. One possible explanation is that at least one bit in the decoded primary code is not trusted at all. Another more reasonable explanation is that at least one bit is missing from the decoded primary code and the correspondence of the feature cannot be found independently.

After the features (x, X, α) are constructed, the features x_k with relatively large decoding confidences α_k are selected. The initial correspondence pairs x_k ↔ θ_k are determined by the MLE method, as given in Eq. (3). Using these x_k ↔ θ_k, the prior conditional probability p(θ | x_k) of candidate feature θ for image feature x_i is derived by solving for μ_k and σ_k with Eqs. (5) and (12), respectively. The final prior probability p(θ) is the weighted average of all p(θ | x_k), which is obtained by Eq. (13). The weighting coefficients are calculated by Eqs. (14) and (15). Finally, the correspondence θ of the image feature x_i is determined by the maximum a posteriori estimation using Eq. (4).

6.3. Interpolation implementation

After the correspondence is retrieved, the depths on the horizontal lines are calculated by triangulation. For many applications, sparse depth maps are sufficient, but we may need to interpolate the sparse depth maps into dense depth maps for advanced applications. The methods used for interpolation, denoising, rendering, etc. are types of post-processing approaches in depth sensing. Many methods have been developed and they achieve excellent performance [27, 28]. In this study, we focus on depth sensing. Thus, we briefly explain the interpolation method for acquiring dense depth maps from the raw depth maps.

Let the vector z denote the objective dense depth map and the vector z_0 denote the raw sparse depth map. The dense depth map is obtained by minimizing the energy function in Eq. (33):

z = argmin_z λ‖Hz − z_0‖_2² + ‖D_1 z‖_2² + ‖D_2 z‖_2²,    (33)

where H denotes the sampling matrix, D_1 and D_2 denote the differential matrices in the horizontal and vertical directions, respectively, and λ denotes an adjustment parameter for tuning the smoothness of the dense depth map. The sampling term Hz retains the elements in z that correspond to the non-zero elements of z_0 and sets the other elements to zero. The differential terms D_1 z and D_2 z give the horizontal and vertical gradients of z, respectively. The closed-form solution of Eq. (33) is given by Eq. (34):

z = (λHᵀH + D_1ᵀD_1 + D_2ᵀD_2)⁻¹ λ Hᵀ z_0.    (34)
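A small sketch of this interpolation step, assuming a sparse depth image with zeros at unmeasured pixels, is given below. Dense matrices are used for clarity on a tiny image; a practical implementation would use sparse matrices or an iterative solver.

# Sketch of Eqs. (33)-(34): dense depth from a sparse depth map by solving
# (lam*H^T H + D1^T D1 + D2^T D2) z = lam * H^T z0. Dense matrices are used
# for clarity on a tiny image only.
import numpy as np

def interpolate_depth(sparse_depth, lam=1.0):
    h, w = sparse_depth.shape
    n = h * w
    z0 = sparse_depth.ravel().astype(float)
    H = np.diag((z0 > 0).astype(float))               # sampling matrix: keeps measured pixels

    def diff_matrix(axis):
        D = np.zeros((n, n))
        for r in range(h):
            for c in range(w):
                i = r * w + c
                if axis == 0 and c + 1 < w:           # horizontal gradient
                    D[i, i], D[i, i + 1] = 1.0, -1.0
                if axis == 1 and r + 1 < h:           # vertical gradient
                    D[i, i], D[i, i + w] = 1.0, -1.0
        return D

    D1, D2 = diff_matrix(0), diff_matrix(1)
    A = lam * H.T @ H + D1.T @ D1 + D2.T @ D2
    z = np.linalg.solve(A, lam * H.T @ z0)            # Eq. (34)
    return z.reshape(h, w)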

The interpolation quality depends on the accuracy of the input raw depth maps, so the performance of the proposed depth sensing method can also be evaluated using the interpolated depth maps.


7. Experiments

Quantitative and qualitative experiments were conducted to evaluate the performance of the proposed method. The experimental platform comprised a digital light processing (DLP) projector with 912 × 1140 resolution (Texas Instruments DLPLCR4500) and a camera with 2048 × 2048 resolution (Point Grey Grasshopper3). The optical axes of the camera and the projector were adjusted so they were approximately parallel. The intrinsic and extrinsic parameters of the camera and the projector were calibrated in advance using Zhang's method [25]. The baseline was 100 mm. In the designed pattern, the horizontal line was 1 pixel wide and the distance between two adjacent horizontal lines was 13 pixels. The experimental platform is shown in Fig. 8.

Fig. 8. The experimental platform.

We conducted a quantitative experiment based on planes and qualitative experiments with challenging scenes in order to compare the performance of the proposed method with an excellent MLE-based method [24] and the popular RGB-D cameras, Kinect v1 and Swiss Ranger SR4000. The comparison is given in Section 7.1. We also used the MAPECR method and the MLE method to recover seven scenes with the maze pattern; the reconstruction results are compared in Section 7.2. In Section 7.3, we present the reconstructions of human facial expressions obtained from video frames.

7.1. Comparison with state-of-the-art techniques

A quantitative experiment and challenging qualitative experiments were conducted to compare the proposed method with three excellent techniques: Kinect v1, Swiss Ranger SR4000, and the SLI method [24]. Kinect is the highly popular RGB-D camera developed by Microsoft. The other RGB-D camera, Swiss Ranger SR4000, is based on the ToF technique, which acquires depth maps by measuring the time of flight of a light signal between the camera and objects. The SLI method [24] is a typical MLE-based method: a carefully designed speckle pattern can be transformed into a fringe pattern with sine intensities after integration. An initial depth map was acquired by speckle matching and a better depth map was then obtained by finding the phase correspondence in the integrated fringe pattern. The three selected techniques cover most of the mainstream active depth sensing techniques designed for dynamic scenes.

In the quantitative evaluation, we used the proposed method to measure a planar board at distances ranging from 0.9 m to 1.4 m. A plane was fitted to the measured 3D points using the least squares method. The mean absolute errors were calculated between the recovered plane and


the fitted plane. As shown in Fig. 9, the proposed method achieved better accuracy than the other techniques.
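For reference, the evaluation protocol (least-squares plane fit and mean absolute point-to-plane error) can be sketched as follows; this is a generic illustration, not the evaluation code used for Fig. 9.

# Sketch of the quantitative evaluation: fit a plane to the measured 3D points
# and compute the mean absolute point-to-plane error. Generic illustration only.
import numpy as np

def plane_fit_mae(points):
    """points: (N, 3) array of measured 3D points on a nominally planar board."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)   # normal = smallest singular vector
    normal = vt[-1]
    distances = np.abs((points - centroid) @ normal)
    return distances.mean()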

Fig. 9. Mean absolute errors obtained by the depth sensing techniques versus the distance to the planar board.

In the qualitative experiment, challenging scenes with abundant depth variations and elongated regions were recovered using the proposed method and the other three techniques. Figures 10 and 11 show the depth maps for a statue and a human hand, respectively, where parts (a)–(e) are the depth maps acquired by Kinect, SR4000, block matching [24], phase matching [24], and the proposed method, respectively.

The statue in Fig. 10 contains rich details in terms of the hair, nose, lips, and eyelids. In Fig. 10(a), the depth maps recovered by Kinect blur the eyelids and lips. There are also blocking effects on the recovered faces. Kinect provides a depth map with relatively low resolution (640 × 480), so it performs poorly on scenes with rich details. As shown in Fig. 10(b), SR4000 obtained depth maps with higher accuracy because of its special depth sensing strategy. However, this technique is only capable of producing depth maps with very low resolution, i.e., SR4000 provides depth maps with a resolution of 176 × 144. Therefore, the recovered depth maps appear coarse. The depth maps obtained by block matching shown in Fig. 10(c) contain stronger blocking effects and they show almost no details of the faces. The results in Fig. 10(d) are superior to those in Fig. 10(c). However, the interpolated phases are not precise, so obvious artificial waves are apparent on the faces, which decreases the accuracy of the reconstruction. In addition, Figs. 10(c) and 10(d), which are based on the MLE method, contain many errors at the edges of the statue due to the inherent flaws of MLE-based methods. Figure 10(e) shows the depth maps obtained using the proposed method, where the hair and other details are recovered accurately, thereby verifying the high accuracy and high resolution of the proposed method. The edges were also recovered well using the MAPECR method.

The hands shown in Fig. 11 contain many elongated regions. As shown in Figs. 11(a) and 11(b), Kinect and SR4000 only recovered a small number of points on the hands because of their low resolution. Similar to Figs. 10(c) and 10(d), many errors are apparent at the edges and many of the recovered points on the fingers are missing in Figs. 11(c) and 11(d). However,


according to Fig. 11(e), the proposed method recovered the details of the fingers perfectly. The reconstruction at the edges and the slight depth variation were also satisfactory. Both of these challenging experiments demonstrated the superior performance of the proposed method compared with the current best techniques.

Fig. 10. Depth maps acquired for a statue by: (a) Kinect, (b) SR4000, (c) block matching, (d) phase matching, and (e) the proposed method.

Fig. 11. Depth maps acquired for a hand by: (a) Kinect, (b) SR4000, (c) block matching, (d) phase matching, and (e) the proposed method.

7.2. MAPECR versus MLE

To verify the improved performance of the proposed MAPECR technique, we compared it with the traditional MLE method. Seven experiments were conducted, as shown in Figs. 12–18. The experimental scenes comprised simple geometries (Figs. 12–16), a human hand (Fig. 17), and a complex scene with multiple geometries (Fig. 18) in order to thoroughly evaluate the performance of both methods using scenes with smooth depth variations, elongated regions, abrupt depth changes, and depth discontinuities. The objects were placed about 1 m from the platform. In Figs. 12–18, (a) shows the captured images of the scenes with the projected pattern, (b) shows the depth maps acquired by the MLE-based method, (c) shows the depth maps obtained by the MAPECR method, and (d) shows the depth maps interpolated from the sparse depth maps in (c).

In the experiments based on simple geometries in Figs. 12–15, the depth maps recovered by both methods had similarly excellent quality. However, for the experiments in Figs. 16–18, there were obvious losses in the depth maps acquired by the MLE method, such as the spire of the pyramid in Fig. 16, the fingers and part of the arm in Fig. 17, and the fist and


dodecahedron in Fig. 18. As discussed in Section 3, MLE methods can successfully recover the scenes only when the codes of the features are extracted completely. For the scenes with smooth outlines and depth variations in Figs. 12–15, the primary codes on the horizontal lines were detected with high quality; thus, the MLE method achieved excellent performance. However, in very small regions, such as those in Figs. 16–18, the complete primary codes could not be detected, so losses occurred in the recovered depth maps. By contrast, the MAPECR method achieved consistently excellent performance for all of these scenes, as shown in Figs. 12–18(c). These results validate the effectiveness of the proposed method for scenes with very small regions, edges, and depth discontinuities. The dense depth maps in Figs. 12–18(d) show the high accuracy of the proposed method.

Fig. 12. (a) Captured image of a dodecahedron. (b) Depth map acquired by the MLE method. (c) Depth map acquired by the MAPECR method. (d) Dense depth map interpolated from (c).

Fig. 13. (a) Captured image of a simple geometry. (b) Depth map acquired by the MLE method. (c) Depth map acquired by the MAPECR method. (d) Dense depth map interpolated from (c).

Fig. 14. (a) Captured image of a pot. (b) Depth map acquired by the MLE method. (c) Depth map acquired by the MAPECR method. (d) Dense depth map interpolated from (c).


Fig. 15. (a) Captured image of a vase. (b) Depth map acquired by the MLE method. (c) Depth map acquired by the MAPECR method. (d) Dense depth map interpolated from (c).

Fig. 16. (a) Captured image of a pyramid. (b) Depth map acquired by the MLE method. (c) Depth map acquired by the MAPECR method. (d) Dense depth map interpolated from (c).

Fig. 17. (a) Captured image of a hand. (b) Depth map acquired by the MLE method. (c) Depth map acquired by the MAPECR method. (d) Dense depth map interpolated from (c).

Fig. 18. (a) Captured image of a complex scene. (b) Depth map acquired by the MLE method. (c) Depth map acquired by the MAPECR method. (d) Dense depth map interpolated from (c).

7.3. Reconstruction of dynamic scenes

In order to evaluate the performance of the proposed method with dynamic scenes, we recovered a human face with different expressions from a video frame sequence. The human face is a complex scene containing rich details and uneven albedo. The first row in Fig. 19 shows the captured frames with the projected maze pattern and the second row shows the


interpolated depth maps. Clearly, the details of the eyelids, nose, and lips were recovered well, and even very small changes in expression were presented vividly. In this experiment, we preset a fixed prior parameter p_0 and recovered all of the frames with this fixed parameter. The depth restoration was performed offline and independently for each frame. Specifically, we captured and stored frames at 30 frames/s, and the average execution time of our C implementation was about 0.6 s/frame (1.7 frames/s) on a 3.6 GHz Intel i7 CPU. Since roughly an 18-fold acceleration would be required for real-time operation, this study only provides a proof of concept for the proposed dynamic depth sensing strategy. This experiment demonstrated that our method is suitable for recovering dynamic scenes with high accuracy.

Fig. 19. Depth maps of a human face with different expressions.

8. Conclusion

In this study, we proposed a maximum a posteriori-based correspondence retrieval technique to solve the challenging correspondence retrieval problem for insignificant features in single-shot depth sensing. We made two main contributions in this study: 1) the MAPECR method and 2) the maze pattern. The traditional Euclidean distance-based correspondence retrieval method is modeled as MLE, which assumes that the correspondence of each feature is independent. Instead, we proposed the MAPECR method, which exploits the inherent spatial correlations among features during correspondence retrieval. The correspondences of insignificant features are determined by both the Euclidean distance of the codes and prior knowledge of the correspondence results for neighboring significant features. We also designed a monochromatic maze pattern with three gray levels, where the features are placed along epipolar lines. This pattern is robust to disturbances from chromatic objects, ambient illumination, and depth variations. The proposed maze pattern is simply a proof of concept for the MAPECR method. The MAPECR method can actually be regarded as an alternative to the MLE-based method and it can be used in various single-shot depth sensing techniques with different patterns.

Acknowledgements

This work was supported in part by the NSFC (No. 61672404, 61632019, 61100155, 61472301, 61572387 and 61301288), the Fundamental Research Funds of the Central Universities of China (Nos. JBG160228, K5051399020, K5051202050, SA-ZD160203 and JB140207), and the Natural Science Basic Research Plan in Shaanxi Province of China (Program No. 2016ZDJC-08).