Robust Dense Endoscopic Stereo Reconstruction for Minimally Invasive Surgery

Sylvain Bernhardt(a), Julien Abi-Nahed(b), and Rafeef Abugharbieh(a)

(a) Biomedical Signal and Image Computing Lab, University of British Columbia, Vancouver, Canada
(b) Qatar Robotic Surgery Centre, Qatar Science and Technology Park, Qatar
[email protected]

Abstract. Robotic assistance in minimally invasive surgical interventions has gained substantial popularity over the past decade. Surgeons perform such operations by remotely manipulating laparoscopic tools whose motion is executed by the surgical robot. One of the main tools deployed is an endoscopic binocular camera that provides stereoscopic vision of the operated scene. Such surgeries have notably garnered wide interest in renal procedures such as partial nephrectomy, which is the focus of our work. This operation consists of the localization and removal of tumorous tissue in the kidney. During this procedure, the surgeon would greatly benefit from an augmented reality view displaying additional information from the different imaging modalities available, such as pre-operative CT and intra-operative ultrasound. In order to fuse and visualize these complementary data inputs in a pertinent way, they need to be accurately registered to a 3D reconstruction of the imaged surgical scene topology captured by the binocular camera. In this paper we propose a simple yet powerful approach for dense matching between the two stereoscopic camera views and for reconstruction of the 3D scene. Our method adaptively and accurately finds the optimal correspondence between each pair of images according to three strict confidence criteria that efficiently discard the majority of outliers. Using experiments on clinical in-vivo stereo data, including comparisons to two state-of-the-art 3D reconstruction techniques in minimally invasive surgery, our results illustrate the superior robustness and better suitability of our approach for realistic surgical applications.

Keywords: stereo vision, rectification, dense matching, 3D reconstruction, stereo camera, partial nephrectomy, augmented reality, robot-assisted surgery

1 Introduction

The past decade has witnessed an ever-increasing number of reports on robot-assisted surgical interventions, where the surgeon remotely controls a robot that reproduces the motion of his/her hands on laparoscopic tools. Medical robots, such as the da Vinci Surgical System (Intuitive Surgical, Inc., Sunnyvale, CA, USA), have for example been widely used in renal surgery due to similar or even better clinical outcomes than those of standard procedures [1]. The fact that the surgeon's view of the operated scene is digitized via a stereo camera has made augmented reality (AR) in minimally invasive surgery (MIS) a very active research area since the early 2000s [2][3]. The aim is to grant the surgeon the ability to see "beyond" the visible surface by overlaying visual information from other available intra-operative and pre-operative data onto the endoscopic camera feed. However, registration of such data with the 3D scene remains a difficult problem since, particularly in abdominal MIS, the environment is mostly composed of soft tissue and organs that deform significantly due to the surgeon's actions as well as patient breathing and cardiovascular activity. One approach to solving this problem is to use the stereo stream from the camera to perform dense matching and provide a 3D model of the surgical scene, which can then serve as a registration base for the other imaging data, e.g. as in [4].

Many methods for dense stereo matching have been proposed over the last decades [8][9]; however, there are two main distinctions between the kind of data typical in MIS and the traditional reference datasets for dense stereo matching, such as the Middlebury images [10]. The first is that our binocular camera provides a video output, i.e. sequences of images with very little difference between two successive frames. Therefore, temporal smoothness gains more importance and can be enforced. Furthermore, since MIS scenes are generally not static when captured, a significant amount of motion blur is typically introduced, which makes the stereo matching problem more difficult. The second main difference is related to the content of the images. Datasets traditionally used in computer vision studies represent static scenes with a variety of rather simply shaped objects laid out at different depths.
Moreover, the surfaces are most of the time matte and the lighting is uniform, which does not induce complex lighting artifacts. On the other hand, intra-abdominal tissue is soft and presents complex reflections, due to the non-Lambertian nature of the surfaces, as well as irregular shapes, highly variable textures and various distortions. Additionally, there is a constant presence of surgical tools that severely occlude the scene with textureless plastic or highly reflective metallic parts (see figure 1). Overall, image sequences in MIS are very challenging to reconstruct and defy the robustness of current stereo matching techniques.

Few methods for dense reconstruction of stereo endoscopic images have been proposed. In [2], a method was presented for detecting and virtually removing the tools from the reconstructed scene. Later, Vagvolgyi et al. [12] proposed a method to overlay a kidney model onto the stereo display by registering the model to the kidney surface reconstructed from stereo data. More recently, Stoyanov et al. [5] presented a method to perform near real-time stereo reconstruction in MIS based on belief propagation. Similar work was recently proposed in [6], where hybrid recursive matching was used.


Fig. 1. Example image depicting a scene captured during a partial nephrectomy procedure. Commonly encountered artifacts are highlighted in orange.

All these previous methods are based on existing stereo matching algorithms that were not designed for MIS data. For example, they all try to enforce spatial smoothness constraints, which are supposed to ensure homogeneous disparities in regular areas. However, practical MIS images are more challenging, so mismatches are more likely to occur. If spatial smoothness is enforced, errors tend to spread into small clusters of homogeneous outliers, which are harder to discard than isolated ones. The risk of discarding actual inliers is also greater when outliers form a group, since a single pixel standing out from the rest of the depth map does not represent a realistic scene in world space. The work presented in this paper aims to address these issues by providing a 3D model of the surgical scene with emphasis on accuracy and robustness. The primary goal is to discard outliers and yet provide enough information across all frames of the video stream such that registration remains possible. To achieve this, we first detect only the most reliable matches by enforcing a series of strict criteria reflecting matching certainty. We then enforce limited spatial smoothness to handle the few isolated outliers that still survive.

2 Methodology

2.1 Pre-processing

The output from our surgical binocular camera is an interlaced high-definition video (1080i). To alleviate the problem of jagged edges in the de-interlaced images, each extracted frame is downsampled by a factor of 2 to 960 × 540 pixels. For each pair of frames, we then perform sparse matching using the SIFT descriptor [13]. Occasional mismatches are discarded during the robust calculation of the fundamental matrix F using RANSAC, as in [7]. The epipoles are calculated from F and used to rectify the images. This process aligns the two images into the same plane in world space (see figure 2a-b). Then, according to epipolar geometry, every feature or pixel in one frame has its correspondence on the same row in the other frame, which greatly facilitates the matching (see figure 2b). We use polar rectification, as it is simple and guarantees minimal distortion of the images [14]. The matching of a feature yields the disparity d, which is inversely proportional to the feature depth Z in world space (see figure 2c and equation 1, where f is the focal length and B the baseline between the two optical centers):

Z = −Bf / d    (1)
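Equation 1 translates directly into code. The sketch below is illustrative only: the focal length and baseline values are placeholders rather than our camera's calibration, and the NaN handling for unmatched points is our own convention.

```python
import numpy as np

def depth_from_disparity(d, f, B):
    """Equation 1: Z = -B*f/d.

    f: focal length in pixels, B: baseline between the two optical centers.
    Unmatched points (zero disparity) are mapped to NaN instead of infinity.
    """
    d = np.asarray(d, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(d != 0, -B * f / d, np.nan)

# Placeholder calibration values (illustrative only).
f, B = 1000.0, 5.0
print(depth_from_disparity(np.array([-50.0, -100.0, 0.0]), f, B))
```

Note that with this sign convention, larger (more negative) disparities correspond to shallower points, which matches the depth maps of figure 5 where whiter is shallower.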

Fig. 2. Rectification and relation between depth and disparity. (a) shows the projection of a point P in world space onto the two image planes in pL and pR . In (b), the vertical alignment of this point in the rectified images gives the disparity d between its left and right locations, respectively xL and xR . The disparity d is also inversely proportional to the depth, as shown in (c) the top view of (b).

Finally, the images are converted to grayscale through a weighted average of the three color channels, as recommended in [15], where it was shown that the use of color in dense stereo matching is not beneficial.

2.2 Robust matching via confidence criteria

Our dense matching is based on calculating the similarity between patches from the left and right images using normalized cross-correlation (NCC) as a metric, which is effective even in the presence of brightness changes. For a given point and window in the left image, the similarity profile is calculated across the same row in the right image, bounded by a range centered at the same position. The local maxima represent the locations of candidate matches, with the best candidate chosen as the one with the highest similarity value. As robustness is paramount in our application, we ensure that each matched point pair satisfies strict confidence criteria reflecting three measures of matching uncertainty. First, blurry, unlit and textureless parts of the image present very little structural information, which makes the matching difficult if not impossible. To mitigate this problem, a simple and effective gradient dispersion metric γ is


considered in order to estimate the spatial structure. Let γL be its value for the considered patch Ip and γR the one for the best candidate patch Ic. Our first criterion is that both of these values must be greater than a threshold γmin (see equation 2). If this condition is not met, the matching is declared too risky.

γL = stdev(∇²Ip) > γmin  and  γR = stdev(∇²Ic) > γmin    (2)

Second, the quality of a match also appears in the difference in similarity score between the two highest peaks of the NCC profile, as illustrated in figure 3. The blue curve represents the matching from the left to the right image, and the green one from right to left. Graph (a) of the figure corresponds to a patch size of 7×7 pixels. As can easily be seen, the profile presents many peaks of approximately the same score, which means the content of the patch is not discriminative enough and hence the window may be too small. In graph (b), the window size is 13×13 pixels, which allows the patch to contain more complex patterns. As a result, the correct solution stands out in the profile, since the gap between the best and the other peaks is significant. Our second criterion is thus that this difference δ has to exceed a threshold δmin. If this condition is not satisfied, the matching is declared not discriminative enough.
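The first two criteria can be sketched as follows. The choice of Laplacian kernel and the local-maxima detection are our assumptions (the text specifies only the standard deviation of ∇²I and the peak gap δ), so this is an illustration rather than the exact implementation:

```python
import numpy as np

GAMMA_MIN = 0.5  # gamma_min threshold reported in section 3

def gradient_dispersion(patch):
    """Criterion 1: standard deviation of the patch Laplacian.

    A 5-point discrete Laplacian is assumed here; low values flag
    blurry, unlit or textureless patches."""
    lap = (np.roll(patch, 1, 0) + np.roll(patch, -1, 0)
           + np.roll(patch, 1, 1) + np.roll(patch, -1, 1) - 4 * patch)
    return float(lap[1:-1, 1:-1].std())  # drop the wrap-around border

def peak_gap(profile):
    """Criterion 2: gap delta between the two highest local maxima
    of an NCC similarity profile."""
    peaks = sorted((profile[i] for i in range(1, len(profile) - 1)
                    if profile[i - 1] <= profile[i] >= profile[i + 1]),
                   reverse=True)
    if len(peaks) < 2:
        return float("inf")  # a lone peak is trivially discriminative
    return peaks[0] - peaks[1]
```

A match is accepted only if both `gradient_dispersion` values exceed GAMMA_MIN and `peak_gap` exceeds δmin.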

Fig. 3. Two example NCC profiles for the same point but with two different window sizes. The blue curve represents the matching from the left to right images, and the green one from right to left. On the y-axis is the similarity score and on the x-axis the disparity. The best peak is designated by a vertical line and its score difference with the second best peak is δ. The disparities of the best candidates from left to right and right to left are also displayed as dL→R and dR→L , respectively. (b) here shows better discrimination.

Third, let us consider dL→R and dR→L. If the forward correspondence is correct, then the inverse matching (from right to left) should yield the dual result: dR→L = −dL→R. Therefore, if both previous criteria are satisfied, inverse matching


is performed, starting from the best candidate in the right image. Our third criterion is thus that dR→L and dL→R should cancel out within a threshold ε (see equation 3). If this is not the case, the matching is considered incorrect.

|dL→R + dR→L| < ε    (3)
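A one-dimensional sketch of the NCC matching together with the left-right consistency check (criterion 3) might look as follows; the row-based simplification and the helper names are ours, not the paper's:

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation between two equally sized patches."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def best_disparity(left_row, right_row, x, w, search):
    """Best disparity for pixel x of left_row, maximizing NCC over a
    disparity range [-search, search] in right_row (half-window w)."""
    ref = left_row[x - w: x + w + 1]
    scores = {d: ncc(ref, right_row[x + d - w: x + d + w + 1])
              for d in range(-search, search + 1)
              if x + d - w >= 0 and x + d + w + 1 <= len(right_row)}
    return max(scores, key=scores.get)

def consistent(d_lr, d_rl, eps=5):
    """Criterion 3 (equation 3): |d_LR + d_RL| < eps."""
    return abs(d_lr + d_rl) < eps
```

For a correct match the inverse disparity cancels the forward one, so `consistent` returns True; a failed check flags the match as incorrect.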

If the criteria are not all satisfied, the window is enlarged by a step dw in the hope of obtaining a more discriminative match. If the window size reaches the threshold wmax and the criteria are still not satisfied, this particular point in the image is discarded (see figure 4). This matching method is iterated through the image until completion.
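The control flow of figure 4 can be sketched as a window-growing loop. The `evaluate` callback below is hypothetical: it stands in for the NCC matching plus the three confidence criteria at a given window width.

```python
def match_point(evaluate, w_min=7, w_max=30, dw=4):
    """Grow the matching window until the confidence criteria pass.

    evaluate(w) is assumed to return (disparity, all_criteria_ok)
    for window width w. Returns the accepted disparity, or None if
    the point is discarded (figure 4)."""
    w = w_min
    while w <= w_max:
        disparity, ok = evaluate(w)
        if ok:
            return disparity
        w += dw
    return None  # too uncertain: discard this point
```

With the parameter values reported in section 3 (wmin = 7, wmax = 30, dw = 4), a point is attempted at widths 7, 11, 15, 19, 23 and 27 before being discarded.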

Fig. 4. Block diagram of our proposed dense matching method.

2.3 Post-processing and temporal smoothness

Even though the previous criteria are very restrictive, it is still possible for outliers to slip through. Fortunately, since spatial smoothness has not been enforced, the few surviving outliers are usually isolated and easy to identify. Our post-processing step therefore consists of finding isolated pixels that are significantly different from their neighborhood. More specifically, in a window of size ∆w around each point that has been matched, we consider the number N of disparities whose difference from the actual point disparity is less than a threshold ∆d. If the ratio N/∆w² is smaller than a threshold µ, the matching for this point is discarded. Once outliers are weeded out, one- or two-pixel-wide holes are filled with the median of their surrounding values. It is important to note that no further attempt at filling larger empty areas is needed, as subsequent frames fill larger gaps (i.e. uncertain matching areas) locally across time. For computational efficiency, and because the retained matches are highly reliable, the disparities found for one image pair are used as priors for the next pair of frames by reducing the search range of a point matching around its previous disparity value, thus enforcing the temporal smoothness of the depth estimates.
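The isolated-outlier filter can be sketched as below, with NaN marking unmatched pixels. One deviation from the text, labeled in the comments: we normalize by the actual (possibly border-truncated) window size rather than strictly ∆w², to avoid penalizing border pixels.

```python
import numpy as np

def remove_isolated(disp, dw=5, dd=4, mu=0.4):
    """Discard matched pixels with too little neighborhood support.

    For each valid pixel, count the neighbors in a dw x dw window whose
    disparity lies within dd of its own; if that support ratio falls
    below mu, the match is discarded (set to NaN)."""
    out = disp.copy()
    r = dw // 2
    for y in range(disp.shape[0]):
        for x in range(disp.shape[1]):
            if np.isnan(disp[y, x]):
                continue
            win = disp[max(0, y - r): y + r + 1, max(0, x - r): x + r + 1]
            support = np.sum(np.abs(win - disp[y, x]) < dd)  # NaNs count as no support
            # Normalizing by win.size (not dw**2, as in the text) so that
            # border pixels with truncated windows are not over-penalized.
            if support / win.size < mu:
                out[y, x] = np.nan
    return out
```

With the parameters of section 3 (∆w = 5, ∆d = 4, µ = 0.4), a pixel needs agreement from at least 40% of its window to survive.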

3 Results and discussion

All experiments were carried out on frames extracted from in-vivo videos recorded by our own high-definition endoscopic stereo camera during four different partial nephrectomies assisted by a da Vinci robot. The sequences have a resolution of 1920 × 1080 pixels at a frame rate of 25 fps, and the image color space is YCbCr. Our experiments have shown that γmin = 0.5, δmin = 0.1, ε = 5, ∆d = 4, ∆w = 5, µ = 0.4, a square window of initial width wmin = 7, maximum width wmax = 30 and step dw = 4, for frames of size 960 × 540, yield good results for a wide range of MIS scenes. We compared our algorithm to the two latest methods for dense reconstruction from stereo in MIS – [5] and [6] – over various pairs of frames from our data. Although these techniques often yielded accurate results, they still presented significant outliers in difficult areas such as those in figure 1. In contrast, our method successfully discarded the vast majority of difficult areas (see figure 5), while still matching the easier parts of the image.

Fig. 5. Comparison with other methods. (a) Original image; (b) Depth map from [5]; (c) Depth map from [6] and (d) our method. Our method successfully dismisses error-prone areas while the two other methods present outliers (highlighted in red). In the depth maps, whiter is shallower.


Although certain difficult areas may be discarded by our robust matching process, resulting in occasional localized loss of reconstruction information, the temporal nature of our matching ensures that successful reconstruction is attained within a few frames. For example, in the central parts of the frame, most pixel reconstructions are updated within 10 frames, which represents less than 0.5 seconds, as illustrated in figure 6. Therefore, the region of interest can always be successfully reconstructed locally in time.

Fig. 6. Depth update with respect to image region. This figure presents four different sequences. For each graph, the pixel value represents the largest number of successive frames before the corresponding pixel reconstruction is updated again in the sequence. The colormap scales from 0 to the total number of frames. Pixels with the maximum value (red) are those that have never been matched, either because they fall outside the rectified image or because of difficult regions. However, in all four example sequences shown, the central parts of the image are always blue, i.e. these pixels are matched very regularly.

4 Conclusions

The purpose of this work was to provide an accurate and robust dense stereo matching method suited to surgical scene reconstruction. By enforcing strict matching confidence criteria and relaxing spatial smoothness, we have shown that our method is capable of discarding most outliers in frame pairs from real in-vivo clinical data where current state-of-the-art techniques fail.


Since all matchings are independent of each other, our method is highly parallelizable and will reach its full potential once implemented on the GPU, which is our plan for the near future.

Acknowledgements

The authors would like to thank Dr. Danail Stoyanov for providing the code of his method, Sebastian Roehl for submitting results from his program on our images, the Hamad Medical Corporation hospital for providing the surgical stereo sequences, and Mitchell Vu for his help in compiling results.

References

1. P. Babbar and A. K. Hemal, Robot-assisted partial nephrectomy: current status, techniques, and future directions, International Urology and Nephrology, vol. 44, no. 1, pp. 99-109, Feb. 2012.
2. F. Mourgues, F. Devernay, and E. Coste-Maniere, 3D Reconstruction of the Operating Field for Image Overlay in 3D-Endoscopic Surgery, in International Symposium on Augmented Reality, 2001, pp. 191-192.
3. T. Sielhorst, M. Feuerstein, and N. Navab, Advanced Medical Displays: A Literature Review of Augmented Reality, Journal of Display Technology, vol. 4, no. 4, pp. 451-467, 2008.
4. L.-M. Su, B. P. Vagvolgyi, R. Agarwal, C. E. Reiley, R. H. Taylor, and G. D. Hager, Augmented reality during robot-assisted laparoscopic partial nephrectomy: toward real-time 3D-CT to stereoscopic video registration, Journal of Urology, vol. 73, no. 4, pp. 896-900, Apr. 2009.
5. D. Stoyanov, M. Visentini-Scarzanella, P. Pratt, and G.-Z. Yang, Real-time stereo reconstruction in robotically assisted minimally invasive surgery, in International Conference on Medical Image Computing and Computer-Assisted Intervention, 2010, vol. 13, no. Pt 1, pp. 275-282.
6. S. Roehl et al., Dense GPU-enhanced surface reconstruction from stereo endoscopic images for intraoperative registration, The International Journal of Medical Physics Research and Practice, vol. 39, no. 3, pp. 1632-1645, Mar. 2012.
7. R. I. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, Second Edition, Cambridge University Press, ISBN: 0521540518, 2004.
8. D. Scharstein, R. Szeliski, and R. Zabih, A taxonomy and evaluation of dense two-frame stereo correspondence algorithms, in IEEE Workshop on Stereo and Multi-Baseline Vision, 2002, pp. 131-140.
9. M. Z. Brown, D. Burschka, and G. D. Hager, Advances in computational stereo, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 8, pp. 993-1008, 2003.
10. Middlebury stereo datasets, http://vision.middlebury.edu/stereo/data/
11. D. Stoyanov, M. Elhelw, B. P. Lo, A. Chung, F. Bello, and G.-Z. Yang, Current Issues of Photorealistic Rendering for Virtual and Augmented Reality in Minimally Invasive Surgery, in International Conference on Information Visualization, 2003, pp. 350-358.
12. B. P. Vagvolgyi, L.-M. Su, R. H. Taylor, and G. D. Hager, Video to CT Registration for Image Overlay on Solid Organs, in Augmented Reality in Medical Imaging and Augmented Reality in Computer-Aided Surgery (AMIARCS), 2008, pp. 78-86.


13. D. G. Lowe, Object recognition from local scale-invariant features, in International Conference on Computer Vision, 1999, pp. 1150-1157.
14. M. Pollefeys, R. Koch, and L. Van Gool, A simple and efficient rectification method for general motion, in International Conference on Computer Vision, 1999, pp. 496-501.
15. M. Bleyer and S. Chambon, Does Color Really Help in Dense Stereo Matching?, in International Symposium on 3D Data Processing, 2010, pp. 1-8.