Image and Vision Computing 23 (2005) 999–1008 www.elsevier.com/locate/imavis
Extraction of visual features with eye tracking for saliency driven 2D/3D registration

Adrian J. Chung*, Fani Deligianni, Xiao-Peng Hu, Guang-Zhong Yang

Royal Society/Wolfson Foundation Medical Image Computing Laboratory, Department of Computing, Imperial College, 180 Queen’s Gate, SW7 2BZ London, UK

Received 30 January 2004; received in revised form 19 May 2005; accepted 1 July 2005
Abstract

This paper presents a new technique for deriving information on visual saliency from experimental eye-tracking data. The strengths and potential pitfalls of the method are demonstrated with feature correspondence for 2D to 3D image registration. With this application, an eye-tracking system is employed to determine which features in endoscopy video images are considered to be salient by a group of human observers. By using this information, a biologically inspired saliency map is derived by transforming each observed video image into a feature space representation. Features related to visual attention are determined by using a feature normalisation process based on the relative abundance of image features within the background image and those dwelled upon along the visual search scan paths. These features are then back-projected to the image domain to determine spatial areas of interest for each unseen endoscopy video image. The derived saliency map is employed to provide an image similarity measure that forms the heart of a new 2D/3D registration method with much reduced rendering overhead, since only selective regions of interest determined by the saliency map are processed. Significant improvements in pose estimation efficiency are achieved without apparent reduction in registration accuracy when compared to that of using an intensity-based similarity measure.

Keywords: 2D/3D registration; Saliency; Image correlation; Eye tracking
1. Introduction

New techniques for minimally invasive surgery [1,2] have brought significant benefits to patient care, including reduced trauma, shortened hospitalisation, and improved diagnostic accuracy and therapeutic outcome. Endoscopy is the most common procedure in minimal access surgery, but requires a high degree of manual dexterity from the operator owing to the complexity of the instrument controls, restricted vision and lack of tactile perception. Training methods for these skills involve the use of computer simulation, which needs to be as realistic as possible for specialist training to be effective. One important factor is
the level of visual realism presented to the trainee, which is made possible by recent advances in image-based modelling and rendering [3]. Genuine surface texture information can be extracted by matching real endoscopic video with geometry derived from tomographic images of the same patient. This process relies on the accurate registration of 2D video images to 3D tomographic datasets. 2D/3D registration is a much researched topic, and many of its applications arise from medical imaging and surgical planning [4–10]. In general, methods of medical image registration can be classified into landmark-, segmentation-, and voxel-based techniques [11,12]. Voxel-based techniques [13] have been shown to be effective for the registration of both unimodal and multimodal images, depending on the similarity measures and optimisation strategy used. One of the typical 2D/3D registration problems addressed is the registration of fluoroscope images against 3D CT or MRI datasets, for which a voxel similarity measure is typically used. Registering 2D bronchoscopy video to 3D tomographic
datasets, however, poses unique problems. Unlike fluoroscope images, endoscope video of internal organs features textured surfaces with shading varying greatly with illumination conditions, distance from the light source, surface reflectivity, degree of subsurface scattering, and inter-reflection properties. Practically, these effects are difficult to model accurately. The photo-consistency method [14] assumes a Lambertian illumination model but relies on having multiple cameras with a known rigid relationship between them, which is difficult to achieve in most endoscopy applications. The flexibility of colonoscopes and bronchoscopes also rules out the use of landmark-based systems such as those with an infrared tracker as described in [10]. The bronchoscope tip does not form a rigid relationship with instruments outside of the body as required by the technique. Recent advances in electromagnetic sensor technology have allowed the fabrication of positional trackers small enough to be inserted into the biopsy channel of the bronchoscope [15]. This, however, limits the tracking functionality to purely exploratory interventions in which no other catheters are used. Although EM tracking of the bronchoscope tip can be highly accurate (sub-millimetre in the best case), positions are given relative to the EM field emitter and not relative to the surrounding parts of the anatomy, which have often been subjected to non-rigid deformation after pre-operative acquisition of the 3D geometry. Furthermore, the current state-of-the-art catheter-based EM tracker still cannot offer full six degrees of freedom. Therefore, image-based pose estimation with 2D/3D registration is essential to patient-specific endoscope simulation. In practice, two parallel approaches for 2D/3D endoscope registration are currently being explored. Videos captured by a flexible endoscope can be registered against 3D pre-operative data by using naturally occurring anatomical features as landmarks. Alternatively one can perform registration by using image pixels directly. Normalised cross-correlation [16,17] and mutual information-based [18,19] similarity measures are commonly adopted in this situation. With this approach, the 3D dataset is commonly projected onto a 2D image while catering for occlusion, which in effect, reduces the problem to a 2D–2D registration problem. Similarity measures such as correlation ratio [20], intensity gradients, texture similarity measures [13], and joint intensity distributions [21] have been adopted previously. In practice, for each video frame with which the 3D dataset is to be registered, a large number of pose parameters must be evaluated. A different image needs to be rendered from the 3D CT dataset for each unique set of pose parameters and then compared with the video frame. This is computationally expensive and usually represents the bottleneck in the entire registration process. Although the rendering process can be accelerated via the use of specialised graphics hardware, the computational burden is
still excessive, especially when photo-realistic rendering is considered. In order to reduce the rendering overhead, this paper proposes the use of visual saliency for selective rendering so that registration efficiency can be maximised. With this approach, criteria determining the portion of the image to be rendered are essential. To this end, a saliency map derived from a model of the human visual system is employed. Since the similarity measure is not applied to the entire image, pixel-oriented rendering methods, such as ray tracing [22], can be effectively exploited. For large datasets, the proposed scheme can outperform hardware z-buffer-based methods and also allows for more sophisticated illumination models. The following subsections briefly discuss visual search and the function of saliency in the modelling of this human activity, followed by a detailed explanation of the 2D/3D registration problem. Subsequent sections then describe how the modelling of visual search through the analysis of eye-tracking data can be used to improve the performance of 2D/3D registration algorithms.

1.1. Saliency in visual search

Visual search is the act of searching for a target within a scene. If the scene subtends between 2 and 30°, the eyes will move across it to find the target, and for larger scenes the head moves as well. The number of visual search tasks performed in a single day is so large that visual search has become a reactive rather than deliberative process for most normal tasks. In practice, the eye movements associated with visual search can be detected with high accuracy by using the relative position of the pupil and the corneal reflection from an infra-red light source [26,27]. During a visual search, a saccade moves the gaze of the eye to the current area of interest. This area of interest normally needs to be dwelled on for longer than 100 ms in order for the brain to register the underlying visual information; such a point is called a fixation. The objective of eye-tracking systems is to determine when and where fixations occur. The spatio-temporal characteristics of human visual search, together with the intrinsic visual features of the fixation points, can provide important clues to the salient features that visual comparisons are based upon.

The importance of saliency in human visual recognition has long been recognised. It has been shown that the human visual system does not apply processing power to visual information in a uniform manner. The intermediate and higher visual processes of the primate vision system appear to select a subset of the available sensory information before further detailed processing is applied [23]. This selection takes the form of a focus of attention that scans the scene in a saliency-driven, task-oriented manner [24]. While attention can be controlled in a voluntary manner, it is also attracted in an unconscious manner to conspicuous, or salient, visual locations.
The detection of visual saliency for recognition has traditionally been approached by using pre-defined image features dictated by domain-specific knowledge. The process is hampered by the fact that visual features are difficult to describe and the assimilation of near-subliminal information is cryptic. Our previous research has shown that it is possible to use eye tracking to extract the intrinsic visual features used by observers without the need for explicit feature definition [25].
1.2. The 2D/3D registration problem

The problem of 2D/3D registration can be formulated as a parameter estimation problem. A camera, C, has both intrinsic and extrinsic parameters. The intrinsic parameters include focal length, lens distortion, and optical origin. For a typical endoscope, these parameters are all fixed and can be determined pre-operatively by using techniques such as [28,29]. The extrinsic parameters determine the pose of the camera in terms of position, given by three Euclidean coordinates, (Vx, Vy, Vz), and orientation, defined by three Euler angles, (θ0, θ1, θ2). These correspond to the six degrees of freedom of rigid body motion. An object, A, is transformed (via perspective projection) into a 2D image, I, using the camera, C. The problem of 2D/3D registration is to determine the extrinsic parameters (V, θ) uniquely from the image, I, and the position, orientation and 3D geometry of A. A closely related problem in computer vision is that of pose estimation, where the goal is to determine the position and orientation of the object, A, given only its 3D geometry, the image, I, and the position and orientation of the camera, C.

2D/3D registration is often approached as an optimisation problem whereby the camera pose is determined through minimisation of a cost function O(V, θ). In practice there is no closed-form solution for finding the minimum of O, thus an iterative solution is required. In each iteration of the minimiser, the object, A, is perspectively projected using the current estimate of the extrinsic camera parameters, (V, θ), to produce a 2D image, I′. A specially constructed distance measure, d, compares I with I′, indicating the amount of dissimilarity as a scalar value, and thus the cost function can be defined by O(V, θ) = d(I, I′). There are a variety of ways to construct the function, d, and in this paper a weighted cross-correlation measure on image intensities has been used. Typically a large number of iterations will be required to find each pose, so the overall computation time will be dominated by the rendering of image I′ and the evaluation of d(I, I′). Both these processes can be made more efficient by computing only a small subset of I′ which, as is described in the following sections, is determined through the analysis of visual search behaviour when subjects are presented with examples of images I and I′.
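To make this formulation concrete, the sketch below casts pose recovery as minimisation of O(V, θ) = d(I, I′) with an off-the-shelf Powell optimiser; render_view and dissimilarity are hypothetical placeholders for a renderer and a distance measure, not the implementation used in this work.

import numpy as np
from scipy.optimize import minimize

def registration_cost(pose, video_frame, render_view, dissimilarity):
    """Cost O(V, theta) = d(I, I') for a 6-DOF pose (Vx, Vy, Vz, theta0, theta1, theta2)."""
    position, orientation = pose[:3], pose[3:]
    rendered = render_view(position, orientation)   # I', rendered from the 3D model
    return dissimilarity(video_frame, rendered)     # scalar dissimilarity d(I, I')

def register(video_frame, render_view, dissimilarity, initial_pose):
    """Estimate the extrinsic camera parameters by iterative, derivative-free minimisation."""
    result = minimize(
        registration_cost,
        x0=np.asarray(initial_pose, dtype=float),
        args=(video_frame, render_view, dissimilarity),
        method="Powell",            # derivative-free optimiser, as used later in the paper
    )
    return result.x                 # recovered (Vx, Vy, Vz, theta0, theta1, theta2)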
2. Method

2.1. Participants

Six participants, three male and three female, were selected from our research group and consented to have their eye movements tracked for this study. All were in their 20s and had experience viewing endoscopy images.

2.2. Apparatus
A phantom lung model was made from silicone rubber and its inner surfaces were textured using acrylics. The inner airways ranged from 12 cm down to 5 cm in diameter. The phantom was scanned with a Siemens Somatom Volume Zoom 4-channel multidetector CT with a slice thickness of 3 mm and in-plane resolution of 1 mm. The inner surface was reconstructed as a polygonal mesh to be used for rendering virtual endoscopic views for camera pose estimation. The phantom was filmed with a miniature NTSC camera (model CCN-2712 YS, YTech Design Ltd) fitted with a small light source attached as close as possible to the optical axis of the camera. The location of the assembly within the phantom was recorded by using a Polhemus electromagnetic (EM) motion tracker (FASTRAK). The configuration of phantom, camera and EM tracker facilitates the modelling of a typical endoscopy procedure at a significantly larger scale, enabling validation via accurate positional tracking.

Selected video frames from the phantom lung model were combined with greyscale images rendered from the 3D CT dataset to create nine problem slides to present to the participants. Each slide consisted of four images in a 2 × 2 format. Three of the slides presented one rendered image with three video frames, and the remaining six slides presented one video frame with three rendered images. In each slide, two of the images, one video and one rendered, were taken from identical viewpoints, while the remaining images were taken from different viewpoints. All images had a pixel resolution of 320 × 240. The slides were displayed sequentially on a 21-in. computer monitor (1280 × 1024 pixels) producing an image 39.4 cm × 27.6 cm at a distance of 65.0 cm in front of the participant’s eyes, as shown in Fig. 1. An ASL-504 remote eye-tracking system (specified accuracy 0.5°, visual range 50° (H) and 40° (V), sampling rate 50 Hz, Applied Science Laboratories, MA) was used to track eye movements in real time.

2.3. Procedure

Each of the six participants was given the task of choosing the pair of images on each slide that most closely match in pose, i.e. share the same viewpoint and viewing orientation. Whilst they carried out this task, the eye-tracking system followed their eye movements and recorded fixation coordinates for each problem slide.
Fig. 1. A group of images is presented to a participant whose eye movements are monitored in real-time. (a) System setup with an ASL 504 remote eye-tracking system, and (b) the resulting scan paths and fixations superimposed on the original endoscope video image frames.
Only the fixation data were used for the automatic extraction of salient visual features, as is described in this section.

2.3.1. Feature space representation

A feature space representation of each video frame and each rendered image from the 3D CT data was derived by using edge-strength contrasts between different scales. The use of contrast, especially in intensity, has been shown to be a significant determining factor in the saliency map during a search task [30]. The multiscale contrast method described in [24] has been adapted for representing the feature space. The method employed a biologically inspired filter bank that modelled the behaviour of the orientation-selective cells found in the primary visual cortex of primates [31], primarily the bar cells whose optimal stimuli are bar-shaped patterns. These cells act as both local spatial frequency analysers and local edge and bar detectors, which can be approximated by 2D Gabor filters.

The Gabor filter was implemented as a product of a two-dimensional Gaussian with an oriented cosine wave:

G_{\xi,\eta,\theta,\psi}(x, y) = e^{\mu} \cos\left( \frac{2\pi x'}{\lambda} + \psi \right), \qquad \mu = -\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2},    (1)

x' = (x - \xi)\cos\theta - (y - \eta)\sin\theta, \qquad y' = (x - \xi)\sin\theta - (y - \eta)\cos\theta

G gives the filter values for a Gabor filter centred on a point (ξ, η). ψ is the phase offset of the harmonic factor (i.e. the cosine term), taking the value of zero for symmetric filters and π/2 for anti-symmetric filters. The direction of the line of symmetry is affected by θ, which controls the filter orientation. λ is the wavelength and γ is the aspect ratio. The value σ is functionally dependent on λ and a bandwidth parameter, b:

\sigma = \frac{\lambda\,(2^b + 1)}{\pi\,(2^b - 1)} \sqrt{\frac{\ln 2}{2}}
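As an illustration of Eq. (1) and the bandwidth relation above, the following minimal sketch samples a single Gabor kernel and assembles a four-orientation bank; the kernel size and parameter values are illustrative assumptions rather than the settings used in the experiments.

import numpy as np

def gabor_kernel(size, wavelength, theta, gamma=1.0, bandwidth=1.0, psi=0.0):
    """Sample the Gabor filter of Eq. (1) on a size x size grid centred at the origin."""
    # sigma derived from the wavelength and the bandwidth parameter b, as in the text
    sigma = (wavelength * (2**bandwidth + 1)) / (np.pi * (2**bandwidth - 1)) \
            * np.sqrt(np.log(2.0) / 2.0)
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    # rotated coordinates (kernel centred at xi = eta = 0)
    x_p = x * np.cos(theta) - y * np.sin(theta)
    y_p = x * np.sin(theta) - y * np.cos(theta)     # only y_p**2 enters the envelope
    envelope = np.exp(-(x_p**2 + (gamma**2) * y_p**2) / (2.0 * sigma**2))
    carrier = np.cos(2.0 * np.pi * x_p / wavelength + psi)
    return envelope * carrier

# A bank of four orientations (0, 45, 90, 135 degrees), as used in the paper;
# size 31 and wavelength 6.0 are illustrative choices within the ranges given below.
bank = [gabor_kernel(31, wavelength=6.0, theta=np.deg2rad(a)) for a in (0, 45, 90, 135)]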
Steps for deriving the feature space can be summarised as follows:

(1) A four-level Gaussian pyramid was created by down-sampling the image repeatedly by a factor of two. Image I_{k-1} is derived from image I_k by convolving with a Gaussian and then discarding pixels with odd-valued coordinates:

I_{k-1}(i, j) = 4 \sum_{m=-2}^{2} \sum_{n=-2}^{2} w(m, n)\, I_k(2i + m, 2j + n)    (2)

where w is the 2D Gaussian kernel.

(2) Gabor filtering at four orientations (0, 45, 90, 135°) was applied at each of the four scales. The parameter values for b, λ and γ were fixed for each feature component generated. The phase, ψ, was set to zero.

(3) At each scale, the Gabor responses were totalled over the four orientations to derive an orientation-independent measure of edge strength:

f(s, P) = \sum_{\theta \in \{0, 45, 90, 135\}} (G_{P,\theta} \ast I_s)    (3)

where G_{P,\theta}, the Gabor filter response centred on point P with orientation θ, is convolved with the image I at scale s in the Gaussian pyramid.

(4) The ‘center-surround’ operation defined in [24] was applied to obtain pairwise differences in edge strengths for corresponding points at different scale levels in the Gaussian pyramid.
Thus a six-component feature vector, F, was derived for each point, P:

F_0(P) = | f(0, P/8) - f(3, P) |
F_1(P) = | f(0, P/8) - f(2, P/2) |
F_2(P) = | f(1, P/4) - f(3, P) |    (4)
F_3(P) = | f(0, P/8) - f(1, P/4) |
F_4(P) = | f(1, P/4) - f(2, P/2) |
F_5(P) = | f(2, P/2) - f(3, P) |

Here, s = 0 is the coarsest resolution level in the Gaussian pyramid. The factor 1/2^k is necessary to ensure that the chosen points at different levels correspond.

(5) The above steps were repeated for different parameter settings for the Gabor filter. The bandwidth, b, varied between 0.6 and 2.0, the wavelength, λ, varied from 4.0 to 10.0, and the aspect ratio, γ, varied between 0.6 and 1.5. Although each video image consisted of 320 × 240 nearly square pixels, the camera’s wide-angle lens introduced significant radial distortion, thus necessitating the inclusion of the aspect ratio as a feature space parameter. By using the center-surround differences to derive six components from each parameter triple, (b, λ, γ), a combined feature space of 288 components was derived.
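The sketch below strings steps (1)–(5) together for a single parameter triple (b, λ, γ); the kernels argument stands for the four oriented Gabor kernels from the previous sketch, and resampling every pyramid level back to full resolution is one plausible reading of the 1/2^k point correspondence in Eq. (4), not necessarily the authors' exact implementation.

import numpy as np
from scipy.ndimage import convolve, gaussian_filter, zoom

def gaussian_pyramid(image, levels=4):
    """Step (1): repeatedly blur and subsample by two; index 0 is the coarsest level."""
    pyramid = [np.asarray(image, dtype=float)]
    for _ in range(levels - 1):
        blurred = gaussian_filter(pyramid[0], sigma=1.0)
        pyramid.insert(0, blurred[::2, ::2])      # discard odd-valued coordinates
    return pyramid                                # [I_0 (coarsest), ..., I_3 (finest)]

def edge_strength(level_image, kernels):
    """Steps (2)-(3): sum the Gabor responses over the four orientations."""
    return sum(convolve(level_image, k, mode="nearest") for k in kernels)

def centre_surround_features(image, kernels):
    """Step (4): six pairwise across-scale differences, sampled on the full-resolution grid."""
    responses = [edge_strength(level, kernels) for level in gaussian_pyramid(image)]
    # Resample every level back to the finest resolution so that points correspond.
    h, w = image.shape
    full = [zoom(r, (h / r.shape[0], w / r.shape[1]), order=1) for r in responses]
    pairs = [(0, 3), (0, 2), (1, 3), (0, 1), (1, 2), (2, 3)]
    return np.stack([np.abs(full[a] - full[b]) for a, b in pairs], axis=-1)

# Stacking the six-component output over all 48 parameter triples would give the
# 288-component feature space described above.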
2.3.2. Feature space projection

Each fixation point was projected to the multidimensional feature space, F, via the method described in [25]. Image features are thus represented as response intervals of each feature component, a_0 \le F_j(P) \le a_1. The dwell time for a particular feature response interval, [a_0, a_1], is defined as

T_j(a_0, a_1) = \sum_{(x_i, y_i) \in U} \; \sum_{(x, y) \in O(x_i, y_i)} t_i\, g_i(x, y)\, b_j(a_0, a_1, (x, y))    (5)

where U is the set of fixation points, O(x_i, y_i) is the foveal field of fixation about fixation point (x_i, y_i), t_i is the fixation dwell time, g_i(x, y) is the Gaussian function centred at (x_i, y_i), and

b_j(a_0, a_1, (x, y)) = \begin{cases} 1, & \text{if } a_0 \le F_j((x, y)) \le a_1 \\ 0, & \text{otherwise} \end{cases}    (6)

In practice, the range spanned by each feature component, F_j, is divided into eight intervals so that T_j(a_0, a_1) can be represented by a histogram.

To eliminate projection bias of the a priori probability distribution F_j(P), the background bias over the entire image,

H_j(a_0, a_1) = \frac{T_j(a_0, a_1)}{\sum_{(x, y)} b_j(a_0, a_1, (x, y))}    (7)

was used to normalise F_j(P). Feature selection was carried out to find which feature components, j, and corresponding intervals, [a_0, a_1], were most representative of salient features. The chosen triple (j, a_0, a_1) can be considered to represent a general visual search strategy adopted by human observers. A straightforward modal analysis was applied to H_j across all feature components and participants. Each component was scored according to how widely the modes were spread over the population of participants. The three most prominent modal values in H_j were selected for the set of nine slides used in the eye-tracking experiments. The components with the least variance in modal value (i.e. the lowest standard deviation when the location of the extrema is considered as a random variable) across the sample of participants were selected as the candidate features. The mean of the modal values determined the interval [a_0, a_1].

These chosen feature response intervals were subsequently back-projected into the image space by using b_j. This binary image was convolved with a Gaussian filter with a kernel width that is consistent with the foveal field. A threshold was then applied, with low-valued pixels set to zero. The resulting image is used as the saliency map. The threshold value was chosen to generate saliency maps that have a fixed percentage of non-zero pixels. For quantitative assessment, saliency maps with 10–70% non-zero pixels were applied to the weighted correlation measure used in the 2D/3D registration algorithm.

2.3.3. Weighted correlation

To assess the efficacy of the proposed method, the similarity measure described in [16] was adapted by applying a weighting factor, w, to the estimated normalised cross-correlation of the video image, A, and the rendered image generated from the 3D CT dataset, B:

\mathrm{Corr}(w, A, B) = \frac{t}{\sigma_A \sigma_B (t^2 - s)} \left( \sum_i w_i A_i B_i - t\, m_A m_B \right)    (8)

where

\sigma_A^2 = \frac{t}{t^2 - s} \left( \sum_i w_i A_i^2 - t\, m_A^2 \right)    (9)

\sigma_B^2 = \frac{t}{t^2 - s} \left( \sum_i w_i B_i^2 - t\, m_B^2 \right)    (10)

m_A = \frac{\sum_i w_i A_i}{t}, \qquad m_B = \frac{\sum_i w_i B_i}{t}, \qquad t = \sum_i w_i, \qquad s = \sum_i w_i^2    (11)
Fig. 2. The Gabor filter response is shown when applied to a video image of the phantom, using four different orientations: (a) 0°, (b) 45°, (c) 90°, (d) 135°. The result of summing the four images is shown in (e).
Corr(w, A, B) reduces to the classical correlation for a constant weighting distribution. Note also that pixels with w_i = 0 will be completely ignored during the actual computation. The similarity measure is thus

S(w, A, B) = \frac{\mathrm{MSE}(A, B)}{1 + \mathrm{Corr}(w, A, B)}    (12)

where MSE(A, B) is the mean square error over corresponding pixels of A and B.
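A compact sketch of Eqs. (8)–(12) is given below; it assumes the saliency map has already been converted into per-pixel weights w (zero outside the salient region) and is offered as an illustrative reading of the formulas rather than the code used in the experiments.

import numpy as np

def weighted_correlation(w, A, B):
    """Weighted normalised cross-correlation of Eqs. (8)-(11)."""
    w, A, B = (np.asarray(x, dtype=float).ravel() for x in (w, A, B))
    mask = w > 0                        # pixels with w_i = 0 are ignored entirely
    w, A, B = w[mask], A[mask], B[mask]
    t, s = w.sum(), (w**2).sum()        # t = sum w_i, s = sum w_i^2
    m_A, m_B = (w * A).sum() / t, (w * B).sum() / t
    var_A = t / (t**2 - s) * ((w * A**2).sum() - t * m_A**2)
    var_B = t / (t**2 - s) * ((w * B**2).sum() - t * m_B**2)
    cov = t / (t**2 - s) * ((w * A * B).sum() - t * m_A * m_B)
    return cov / np.sqrt(var_A * var_B)

def similarity(w, A, B):
    """Eq. (12): mean square error scaled down as the weighted correlation grows."""
    mse = np.mean((np.asarray(A, float) - np.asarray(B, float))**2)
    return mse / (1.0 + weighted_correlation(w, A, B))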
3. Results

3.1. Feature selection

Fig. 2 illustrates an example when the described Gabor filter is applied to a video frame with four different orientations (0°, 45°, 90°, 135°). The resulting four images were summed to produce the orientation-independent Gabor edge response image, as shown in Fig. 2(e). This process was repeated for each of the four scales. The differences between scales were used to create the saliency map.
Fig. 3. This diagram illustrates the fixation points derived from eye-tracking data collected from four human observers when viewing an example video frame of the phantom lung model. The fixation points are represented by circles with radii proportional to fixation duration. Comparing the sets of fixation points indicates a high degree of variability in scan paths. There are few features upon which both observers (a) and (b) have fixations. Observer (c) spent significantly more time per fixation compared with the other observers, whereas the few fixations yielded from observer (d) are of relatively short duration.
Fig. 4. The saliency map derived from six participants and superimposed on an unseen image. (a) Saliency map with 20% non-zero pixels, (b) 30, (c) 40, (d) 50%.
Fig. 3 illustrates one representative image that was used in this experiment, where the associated fixation points of four different observers are highlighted with dark circles. The size of each circle is proportional to the dwell time of the fixation. Fig. 4 shows the resulting saliency map. It is evident that the saliency map only covers a small part of the image, which can subsequently be used as the mask for 2D/3D registration.

3.2. Pose estimation

The best solution pose, S_i, for each video image in a sample of 600 images, i ∈ I, was obtained via traditional normalised cross-correlation. These poses were validated with positional information from the EM tracker. Additionally, each solution was inspected visually and any clearly misregistered S_i was discarded, resulting in a set of 505 solution poses. For each video image used, a set of saliency maps was generated in which the percentage of non-zero pixels ranged from 20 to 70%. From this set of solution poses, S_i, 1284 starting poses were generated by adding scalar offsets randomly chosen from a uniform distribution to each of the pose parameters. The registration algorithm for pose estimation used the Powell minimisation method [32] and was initialised with each of the starting poses in turn. The Powell minimisation method locates the minima in a multidimensional parametric space by solving a series of one-dimensional minimisation problems. Although the cost function must be differentiable, the algorithm never
evaluates the partial derivatives analytically, so they need not be expressed as a closed-form equation. Each execution of the algorithm entailed the registration of the 3D CT phantom model data with the 2D video image corresponding to the solution pose from which the initial pose was generated. With each random initial pose, the registration algorithm was executed first using unweighted correlation. The pose parameters to which the program converged were then recorded. This procedure was repeated by using the weighted correlation similarity measure with each of the saliency maps. For each run, the identical initial pose was used as for the unweighted correlation. In total the saliency-enabled registration algorithm was executed 5954 times. To objectively determine how the saliency map affected registration accuracy, the distance between the best pose and the pose to which the algorithm converged was measured. Plots of Euclidean distance and total absolute error in orientation (Euler angle discrepancy) yielded results that indicate not only improved accuracy over classic intensity correlation, but also, paradoxically, that the returned poses became more accurate when fewer pixels were considered to be salient (see Fig. 5). A closer examination of the optimisation landscape, however, revealed that the saliency map was introducing more local minima near to the starting pose. Reducing the saliency coverage merely increased the density of local minima, so the algorithm was more likely to converge to a pose near to the initial pose, which had been chosen to be near to the best solution.
[Fig. 5 panels: scatter plots of angle error (deg) against distance error (mm), comparing registration with no saliency map against the 70, 50, 30 and 20% saliency maps.]
Fig. 5. Scatter plots of the error in position and orientation for solution poses returned by the optimiser for different saliency map image coverage. The 70% map shows poses distributed far from the expected solution. By considering fewer pixels, the algorithm is observed to converge to poses closer to the initial pose. In the 30 and 20% maps, there is evidence of clustering around a point consistent with the distance of the mean initial pose from the expected solution.
To better judge the effect of saliency maps on the registration accuracy, the following criteria were selected to determine when a pose is considered to be correctly registered:

• The pose must be within 7 mm of the best solution. (The passages of the scaled-up phantom range from 50 to 90 mm in diameter.)
• The pose orientation angles may differ from the best solution by no more than 3°.

These criteria were qualitatively validated by visually inspecting a sample of poses output by the algorithm. Fig. 6 illustrates this process whereby a rendered image of the 3D CT data in the result pose is superimposed on a video image of the phantom model. The situation shown on the left is clearly misregistered, but the result on the right is classified
as being correctly registered. Although subjective, this classification scheme can be applied consistently to every pose in an unbiased manner. In this way, the registration success rate was estimated for saliency maps covering a varying percentage of the video image (Fig. 7). Saliency maps with between 30 and 60% non-zero pixels led to an almost two-fold improvement in the registration success rate over traditional intensity-based correlation.
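For illustration only, the acceptance test described above could be written as follows; the pose layout and the helper name are assumptions.

import numpy as np

def correctly_registered(pose, best_pose, max_dist_mm=7.0, max_angle_deg=3.0):
    """Accept a pose when it lies within 7 mm and 3 degrees of the best solution.
    A (x, y, z, theta0, theta1, theta2) layout is assumed; Euler angle wrap-around is ignored."""
    pose, best_pose = np.asarray(pose, float), np.asarray(best_pose, float)
    dist = np.linalg.norm(pose[:3] - best_pose[:3])
    angle_diff = np.abs(pose[3:] - best_pose[3:])
    return dist <= max_dist_mm and np.all(angle_diff <= max_angle_deg)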
4. Discussion

We have demonstrated that saliency-based 2D/3D registration utilising only 25% of the image content can achieve an accuracy comparable to traditional correlation-based 2D/3D registration. In this study, salient image
Fig. 6. Left: Virtual image of 3D dataset (edges highlighted) overlaid with a frame of video, revealing misregistration. Right: 3D dataset and video frame properly registered.
[Fig. 7 axes: factor improvement versus % salient pixels in map.]
Fig. 7. This graph shows the varying improvement in convergence success rate according to saliency coverage.
features were extracted based on the analysis of human visual behaviour but without the use of domain-specific knowledge. Eye-tracking data collected from participants with a training data set were used to automatically determine salient features for all seen and unseen video frames. It is worth noting that the effectiveness of this method was dictated by the completeness of the chosen feature space library. Although features based on multi-scale contrast and Gabor filter response (Eq. (3)) have been shown to be important in biological vision systems, there are other parallel approaches for generating saliency maps that utilise information theory [33] and multiscale wavelets [34]. The use of selective weighting is similar to [36], and we have shown that the use of a saliency map provides an effective means of achieving reliable 2D/3D registration.

The features an observer may fixate upon in one image depend on a number of factors including prior knowledge, expectation, and the nature of the question being asked. The use of visual search with eye tracking is a new idea for determining visual saliency. To demonstrate the feasibility of this approach, eye-tracking data acquired from a relatively small group of observers were used. It is known that individuals can employ different visual search strategies when confronted with an image; the extraction of common visual search behaviour can be improved by expanding the study population. It is also worth noting that the described experiment utilised only one feature component selected from a large set of candidates. Strategies for combining saliency maps [35] can further improve the quality of the resulting saliency maps when different search strategies are taken into account.

In this paper, the method for assessing the success rate of the 2D/3D registration is based on subjective observations. Our results showed that contrast-based saliency maps can improve normalised cross-correlation as an image similarity measure in 2D/3D registration when no special lighting model has been adopted. Features selected based on Gabor
filter response tended to be more immune to lighting conditions when used in judging image similarity. It must be noted, however, that the effectiveness of the normalised cross-correlation measure in 2D/3D registration depends greatly on how accurately one can model the illumination conditions in the rendered images. One particularly important aspect of the illumination model, for example, is the intensity attenuation with distance from the light source. A carefully tuned attenuation function can greatly improve the convergence of the optimisation process. In practice, however, the attenuation parameters may have to be adjusted specifically for each situation. With careful manual tuning of the lighting model, it may be possible to further improve the registration accuracy.

Another important issue to consider is that of soft tissue deformation. Whether saliency maps can yield greater immunity to global deformation of the 3D model requires further investigation. The work presented in this paper is a first attempt at using an observer-derived saliency map for 2D/3D image registration tasks. It represents a novel way of enhancing computer vision algorithms by implicitly modelling human visual search behaviour.
References [1] M.J. Mack, Minimally invasive and robotic surgery, The Journal of the American Medical Association 285 (5) (2001) 568–572. [2] P. Gieles, Image guided surgery: digital imaging as a support to patient treatment, Medica Mundi 40 (2). [3] P.E. Debevec, C.J. Taylor, J. Malik, Modeling and rendering architecture from photographs: a hybrid geometry- and image-based approach, in: Computer Graphics Proceedings, Annual Conference Series (Proceedings of the SIGGRAPH’96), 1996, pp. 11–20. [4] D. Wagner, R. Wegenkittl, E. Gro¨ller, Endoview: A phantom study of a tracked virtual bronchoscopy in: V. Skala (Ed.), Journal of WSCG vol. 10 (2002), pp. 493–498. URL: http://visinfo.zib.de/EVlib/ Show?EVL-2002-66. [5] D. Dey, D. Gobbi, P. Slomka, K. Surry, T. Peters, Automatic fusion of freehand endoscopic brain images to three-dimensional surfaces: creating stereoscopic panoramas, IEEE Transactions on Medical Imaging 21 (1) (2002) 23–30. [6] F. Tendick, D. Polly, D. Blezek, J. Burgess, C. Carignan, G. Higgins, C. Lathan, K. Reinig, Final Report of the Technical Requirements for Image-Guided Spine Procedures, Imaging Science and Information Systems (ISIS) Center, Department of Radiology, Georgetown University Medical Center, Georgetown University Medical Center, 2115 Wisconsin Avenue, NW, Suite 603, Washington, DC 20007, 1999, Ch. 3, Operative Planning and Surgical Simulators. [7] D. Dey, P.J. Slomka, D.G. Gobbi, T.M. Peters, Mixed reality merging of endoscopic images and 3-d surfaces. 796-803, in: Medical Image Computing and Computer-Assisted Intervention—MICCAI 2000, Third International Conference, Pittsburgh, Pennsylvania, USA, October 11–14, 2000, Proceedings, Lecture Notes in Computer Science, vol. 1935, Springer, Berlin, 2000, pp. 796–803. [8] A.F. Ayoub, J.P. Siebert, D. Wray, K.F. Moos, A vision-based three dimensional capture system for maxillofacial assessment and surgical planning, British Journal of Oral Maxillofacial Surgery 36 (1998) 353–357.
[9] L. Joskowicz, Fluoroscopy-based navigation in computer-aided orthopaedic surgery, in: Proceedings of the IFAC Conference on Mechatronic Systems, Darmstadt, Germany, 2000. [10] L. Joskowicz, C. Milgrom, A. Simkin, L. Tockus, Z. Yaniv, Fracas: a system for computer-aided image-guided long bone fracture surgery, Journal of Computer-Aided Surgery 3 (6). [11] J.B.A. Maintz, M.A. Viergever, A survey of medical image registration, Medical Image Analysis 2 (1) (1998) 1–36 URL: http:// www.cs.uu.nl/people/twan/personal/media97.pdf. [12] D.L.G. Hill, P.G. Batchelor, M. Holden, D.J. Hawkes, Medical image registration, Physics in Medicine and Biology 46 (2001) R1–R45. [13] G.P. Penney, J. Weese, J.A. Little, P. Desmedt, D.L.G. Hill, D.J. Hawkes, A comparison of similarity measures for use in 2D– 3D medical image registration, Lecture Notes in Computer Science 1496 (1998) 1153 URL: http://link.springer-ny.com/link/service/ series/0558/bibs/1496/14961153.htm; http://link.springer-ny.com/ link/service/series/0558/papers/1496/14961153.pdf. [14] M.J. Clarkson, D. Rueckert, D.L.G. Hill, D.J. Hawkes, Using photoconsistency to register 2d optical images of the human face to a 3d surface model, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (11) (2001) 1266–1280. [15] D.D. Frantz, A.D. Wiles, S.E. Leis, S.R. Kirsch, Accuracy assessment protocols for electromagnetic tracking systems, Physics in Medicine and Biology 48 (2003) 2241–2251. [16] K. Mori, Y. Suenaga, J. Toriwaki, J. Hasegawa, K. Kataa, H. Takabatake, H. Natori, A method for tracking camera motion of real endoscope by using virtual endoscopy system, in: Proceedings of SPIE, 3978, 2000–02, pp. 134–145. [17] K. Mori, D. Deguchi, J. Sugiyama, Y. Suenaga, J. Toriwaki Jr., H. Takabatake, H. Natori, Tracking of a bronchoscope using epipolar geometry analysis and intensity-based image registration of real and virtual endoscopic images, Medical Image Analysis 6 (2002) 321–336. [18] J.P. Helferty, W.E. Higgins, Technique for registering 3d virtual ct images to endoscopic video, in: IEEE International Conference on Image Processing, 2001. [19] P. Viola, W.M. Wells III, Alignment by maximization of mutual information, International Journal of Computer Vision. [20] A. Roche, G. Malandain, X. Pennec, N. Ayache, The correlation ratio as a new similarity measure for multimodal image registration, Lecture Notes in Computer Science 1496 (1998) 1115 URL: http:// link.springer-ny.com/link/service/series/0558/bibs/1496/14961115. htm; http://link.springer-ny.com/link/service/series/0558/papers/ 1496/14961115.pdf. [21] M.E. Leventon, W.E.L. Grimson, Multi-modal volume registration using joint intensity distributions, Lecture Notes in Computer Science 1496 (1998) 1057 URL: http://link.springer-ny.com/link/service/ series/0558/bibs/1496/14961057.htm; http://link.springer-ny.com/ link/service/series/0558/papers/1496/14961057.pdf. [22] T. Whitted, An improved illumination model for shaded display, in: Computer Graphics (Special SIGGRAPH’79 Issue), vol. 13, 1979, pp. 1–14.
[23] J. Tsotsos, S. Culhane, W. Wai, Y. Lai, N. Davis, F. Nuflo, Modelling visual attention via selective tuning, Artificial Intelligence 78 (1/2) (1995) 507–545. [24] L. Itti, C. Koch, E. Niebur, A model of saliency-based visual attention for rapid scene analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (11) (1998) 1254–1259 URL: citeseer.nj.nec. com/itti98model.html. [25] X.-P. Hu, L. Dempere-Marco, G.-Z. Yang, Hot spot detection based on feature space representation of visual search, IEEE Transactions on Medical Imaging, in press. [26] G.-Z. Yang, L. Dempere-Marco, X.-P. Hu, A. Rowe, Visual search: psychophysical models and practical applications, Image and Vision Computing 20 (4) (2002) 291–305. [27] A.A. Faisal, M. Fislage, M. Pomplun, R. Rae, H. Ritter, Observation of human eye movements to simulate visual exploration of complex scenes, Technical Report, University of Bielfeld, 1998 [28] R.Y. Tsai, A versatile camera calibration technique for high accuracy 3d machine vision metrology using off-the-shelf tv cameras and lenses, IEEE Journal of Robotics and Automation RA-3 (4) (1987) 323–344. [29] Z. Zhang, A flexible new technique for camera calibration, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (11) (2000) 1330–1334. [30] W. Einha¨user, P. Ko¨nig, Does luminance-contrast contribute to a saliency map for overt visual attention?, European Journal of Neuroscience 17 (5) (2003) 1089–1097. [31] N. Petkov, P. Kruizinga, Computational models of visual neurons specialised in the detection of periodic and aperiodic oriented visual stimuli: bar and grating cells, Biological Cybernetics 76 (2) (1997) 83–96. [32] W.H. Press, S.A. Teukolsky, W.T. Vetterling, B.P. Flannery, Numerical Recipes in C, second ed., Cambridge University Press, Cambridge, MA, 1992. [33] M. Ja¨gersand, Saliency Maps and Attention Selection in Scale and Spatial Coordinates: An Information Theoretic Approach 1993 URL: citeseer.nj.nec.com/12842.html. [34] A. Shokoufandeh, I. Marsic, S.J. Dickinson, View-based object recognition using saliency maps, Image and Vision Computing 17 (5/6) (1999) 445–460. [35] L. Itti, C. Koch, A comparison of feature combination strategies for saliency-based visual attention systems, in: SPIE Human Vision and Electronic Imaging IV (HVEI 1999), San Jose, CA, vol. 3644, 1999, pp. 373–382. [36] D. Deguchi, K. Mori, Y. Suenaga, J. ichi Hasegawa, J. ichiro Toriwaki, H. Natori, H. Takabatake, New calculation method of image similarity for endoscope tracking based on image registration in endoscope navigaton in: H. Lemke, M. Vannier, K. Inamura, A. Ferman, K. Doi, J. Reiber (Eds.), Computer Assisted Radiology and Surgery (CARS) 2003 International Congress Series 1256, (London, June 25–28, 2003), Elsevier, Amsterdam, 2003, pp. 460–466.