Visual Feature Extraction via Eye Tracking for Saliency Driven 2D/3D Registration

Adrian J. Chung∗, Fani Deligianni†, Xiao-Peng Hu‡, Guang-Zhong Yang§
Royal Society/Wolfson Foundation Medical Image Computing Laboratory, Department of Computing, Imperial College London
∗e-mail: [email protected]  †e-mail: [email protected]  ‡e-mail: [email protected]  §e-mail: [email protected]
Abstract
This paper presents a new technique for extracting visual saliency from experimental eye tracking data. An eye-tracking system is employed to determine which features a group of human observers considered to be salient when viewing a set of video images. With this information, a biologically inspired saliency map is derived by transforming each observed video image into a feature space representation. By using a feature normalisation process based on the relative abundance of visual features within the background image and those dwelled upon along the eye-tracking scan paths, features related to visual attention are determined. These features are then back-projected to the image domain to determine spatial areas of interest for unseen video images. The strengths and weaknesses of the method are demonstrated with feature correspondence for 2D/3D registration of endoscopy video with computed tomography data. The biologically derived saliency map is employed to provide an image similarity measure that forms the heart of the 2D/3D registration method. It is shown that by only processing selective regions of interest as determined by the saliency map, rendering overhead can be greatly reduced. Significant improvements in pose estimation efficiency can be achieved without apparent reduction in registration accuracy when compared with a non-saliency based similarity measure.

CR Categories: I.2.10 [Vision and Scene Understanding]: Representations, data structures and transforms; I.4.7 [Feature Measurement]: Projections; I.5.2 [Design Methodology]: Feature evaluation and selection; J.3.2 [Medical information systems]

Keywords: 2D/3D registration, saliency, image correlation, eye tracking
1 Introduction
1.1 Saliency in Visual Search
It has long been recognised that the human visual system does not apply processing power to visual information in a uniform manner. The intermediate and higher level visual processes of the primate vision system appear to select a subset of the available sensory information before further detailed processing is applied [Tsotsos et al. 1995]. This selection takes the form of a focus of attention that scans the scene in a saliency-driven, task-oriented manner [Itti et al. 1998]. Although this focus of attention can be consciously directed, it is also attracted in an unconscious manner to conspicuous, or salient, visual locations. Traditionally, the detection of visual saliency for recognition has been approached through predefined image features determined by domain-specific knowledge. This process is hampered by the fact that visual features are difficult to describe and concrete representations for near-subliminal information are lacking. Despite this, it has been demonstrated that eye tracking can be used to extract the intrinsic visual features used by observers without the need for explicit feature definition [Hu et al. 2003].

During a visual search task, the eyes move across the scene over angular distances of between 2° and 30°. Searching for a target over a larger area will also involve head movements. In a single day the number and variety of visual search tasks performed is so large that the activity tends to be a reactive rather than a deliberative process. Eye movements associated with visual search can be detected with high accuracy by using the relative position of the pupil and the corneal reflection from an infra-red light source [Yang et al. 2002; Faisal et al. 1998]. During a visual search, a saccade moves the gaze of the eye to the current area of interest. For the brain to register the underlying visual information, the eye must dwell on this area of interest for longer than 100 ms; such a point is called a fixation. The spatio-temporal characteristics of human visual search, together with the intrinsic visual features of the fixation points, can provide important clues to the salient features that visual comparisons are based upon. In this paper, the underlying search strategy employed during a visual comparison task is explored by using eye tracking, and the results are applied to the specific problem of 2D/3D registration.
1.2 2D/3D Registration in Medical Imaging
Modern medical diagnosis and surgical planning often entail combining spatial information from a wide variety of sources, such as computed tomography, magnetic resonance imaging, fluoroscopy, endoscopy, and ultrasound. The fusing of these multimodal datasets into the same spatial representation or coordinate space involves a process known as registration. In particular, the registration of 2D data sources (e.g. fluoroscopy, endoscopy video) with 3D datasets (e.g. CT, MRI) has been a widely researched topic, and many applications have arisen in medical imaging and surgical planning [Wagner et al. 2002; Dey et al. 2002; Tendick et al. 1999; Dey et al. 2000; Ayoub et al. 1998; Joskowicz 2000]. In general, methods of medical image registration can be classified into landmark, segmentation, and voxel based techniques [Maintz and Viergever 1998; Hill et al. 2001]. The voxel based approach relies on statistical measures applied directly to the image pixels captured by the 2D video.
Figure 1: Images are presented to human observers in groups of four and their eye movements are tracked. The resulting scan paths and fixations have been superimposed on the original images. (a) One rendered image and three video images with the corresponding eye track for one human observer, and (b) one video image with three rendered images overlaid with the eye track for a different human observer.
Normalised cross correlation [K.Mori et al. 2000-02; Mori et al. 2002] and mutual information [Helferty and Higgins 2001; Viola and Wells III 1997] based similarity measures have commonly been adopted in this situation. With this approach, the 3D dataset is projected to form a 2D image, which in effect reduces the problem to a 2D-2D registration problem. Similarity measures such as the correlation ratio [Roche et al. 1998], intensity gradients, texture similarity measures [Penney et al. 1998], and joint intensity distributions [Leventon and Grimson 1998] have been adopted previously. Registration of the 3D dataset with a given 2D video image will, in practice, entail the evaluation of the similarity measure over a large number of pose parameters. For each unique set of pose parameters, a new image needs to be rendered from the 3D dataset and then compared with the video image. This is computationally expensive and usually represents the bottleneck in the entire registration process. In order to reduce the rendering overhead, this paper examines the use of visual saliency for selective rendering so that registration efficiency can be maximised. With this approach, criteria determining the portion of the image to be rendered are essential. To this end, a saliency map derived from a model of the human visual system is employed. Since the similarity measure is not applied to the entire image, pixel oriented rendering methods, such as ray tracing [Whitted 1979], can be effectively exploited. For large datasets, the proposed scheme can outperform hardware z-buffer based methods and also allows for more sophisticated illumination models.
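Because a pixel-oriented renderer shades one pixel at a time, the rendering cost can be made proportional to the number of salient pixels rather than to the full frame. The following is a minimal sketch of this idea; `cast_ray` is a hypothetical per-pixel shading routine standing in for the ray tracer and is not part of the original system.

```python
import numpy as np

def render_salient_pixels(cast_ray, pose, saliency_mask):
    """Shade only the pixels flagged as salient.

    cast_ray(pose, x, y) -- hypothetical routine returning the shaded
                            intensity of pixel (x, y) for the given pose.
    saliency_mask        -- 2D array, non-zero where a pixel is salient.
    """
    image = np.zeros(saliency_mask.shape, dtype=float)
    ys, xs = np.nonzero(saliency_mask)
    for y, x in zip(ys, xs):          # cost scales with the salient area only
        image[y, x] = cast_ray(pose, x, y)
    return image
```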
1.3 The 2D/3D Registration Problem

2D/3D registration is essentially a parameter estimation problem. The extrinsic parameters of a camera, C, define its position, given by three Euclidean coordinates, (vx, vy, vz), and its orientation, defined by three Euler angles, (θ0, θ1, θ2). These correspond to the six degrees of freedom of rigid body motion. An object, A, is transformed (via perspective projection) into a 2D image, I, by the camera, C. The problem of 2D/3D registration is to determine the extrinsic parameters (v, θ) uniquely from the image, I, and the position, orientation and 3D geometry of A.

2D/3D registration is usually solved through optimisation. The object, A, is perspectively projected using an initial estimate of the extrinsic parameters, (v, θ), to produce a 2D image, I′. A specially designed distance measure, d, compares I with I′, and the values of (v, θ) are then uniquely determined by minimising d(I, I′) over the pose parameter domain. In this paper, a weighted cross-correlation measure on image intensities has been used for d.
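This formulation can be sketched as a generic optimisation loop; the paper later uses Powell minimisation for this step. `render_view` and `distance` are placeholders for the projection of A under a candidate pose and the measure d, and are assumptions made for illustration rather than the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def register(video_image, render_view, distance, initial_pose):
    """Estimate the six extrinsic parameters (vx, vy, vz, theta0, theta1, theta2)
    by minimising d(I, I') over the pose parameter domain."""
    def cost(pose):
        rendered = render_view(pose)            # I': projection of A under the candidate pose
        return distance(video_image, rendered)  # d(I, I')
    result = minimize(cost, np.asarray(initial_pose, dtype=float), method="Powell")
    return result.x                             # estimated (v, theta)
```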
2 Method

2.1 Experiment Setup

Eye movements were tracked in real time using an ASL-504 remote eye tracking system (accuracy 0.5°, visual range 50° (H) by 40° (V), sampling rate 50 Hz; Applied Science Laboratories, MA). Images were displayed sequentially on a 24-inch computer monitor (1280×1024 pixels), producing an image 39.4 cm by 27.6 cm at a distance of 65.0 cm in front of the eye. A miniature NTSC camera fitted with a point light source was used to film a phantom lung model, specially constructed from silicone rubber and textured with acrylic. A Polhemus electromagnetic (EM) motion tracker (FASTRAK) tracked the location of the camera within the phantom, which had been scanned with a Siemens Somatom Volume Zoom 4-channel multidetector CT with a slice thickness of 3 mm and in-plane resolution of 1 mm. A polygonal mesh was reconstructed from the resulting 3D dataset and used for rendering virtual endoscopic views. The inner passages of the phantom range from 12 cm to 5 cm in diameter. Video frames from the phantom lung model were selected and combined with images rendered from the 3D CT dataset to create nine slides in a 2×2 format. Six volunteers were given identical visual search tasks. Each was asked to choose the pair of images on each slide that most closely match in pose (that is, share the same view point and viewing orientation) while their eye movements were tracked. Two example slides are shown in Figure 1, overlaid with the eye tracks obtained from two different observers.
2.2 Feature Space Representation
Figure 2: Fixation points derived from eye tracking data collected from three human observers viewing an example video frame of the phantom lung model. The fixation points are represented by circles with radii proportional to fixation duration (scale key: 100, 200, 300, 400, 600 and 800 ms). Comparing the sets of fixation points indicates a high degree of variability in scan paths. There are few features in common upon which both observers (a) and (b) fixate their gaze. Observer (c) spent significantly more time per fixation compared with the other observers, whereas the few fixations yielded by observer (b) are mostly of short duration.

For each video frame and image rendered from 3D CT data, a feature space representation was derived by using edge strength contrasts between different scales. Contrast, in both intensity and edge strength, has been a significant determining factor in the saliency map for a search task [Einhäuser and König 2003]. To represent the feature space, the method of edge contrast across different scales [Itti et al. 1998] was adopted. A biologically inspired filter bank is used to model the behaviour of the orientation-selective cells found in the primary visual cortex of primates [Petkov and Kruizinga 1997], primarily the bar cells whose optimal stimuli are bar-shaped patterns. Laboratory experiments reveal that these bar cells act as local spatial frequency analysers and local edge and bar detectors. This behaviour can be approximated by 2D Gabor filters. The Gabor filter was implemented as a product of a two-dimensional Gaussian with an oriented cosine wave:

    G_{ξ,η,θ,ψ}(x, y) = e^μ cos(2πx′/λ + ψ)                         (1)
    μ  = −(x′² + γ²y′²) / (2σ²)
    x′ = (x − ξ) cos θ − (y − η) sin θ
    y′ = (x − ξ) sin θ + (y − η) cos θ

G gives the filter values for a Gabor filter centred on a point (ξ, η). The phase offset of the harmonic factor (i.e. the cosine term), ψ, often takes one of two values: zero for symmetric filters, π/2 for antisymmetric filters. The value θ controls the orientation of the line of symmetry. λ is the wavelength and γ is the aspect ratio. The value σ is functionally dependent on λ and a bandwidth parameter, β:

    σ = (λ/π) √(ln 2 / 2) (2^β + 1)/(2^β − 1)
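Equation (1) and the σ(λ, β) relation can be sampled directly on a pixel grid. The sketch below (Python/NumPy, kernel extent chosen arbitrarily) illustrates the filter definition; it is not the authors' implementation.

```python
import numpy as np

def gabor_kernel(size, wavelength, theta, gamma=1.0, bandwidth=1.0, psi=0.0):
    """Sample Eq. (1) on a (2*size+1) x (2*size+1) grid centred on (xi, eta) = (0, 0)."""
    # sigma is tied to the wavelength lambda and the bandwidth parameter beta
    sigma = (wavelength / np.pi) * np.sqrt(np.log(2.0) / 2.0) * \
            (2.0 ** bandwidth + 1.0) / (2.0 ** bandwidth - 1.0)
    yy, xx = np.mgrid[-size:size + 1, -size:size + 1]
    xr = xx * np.cos(theta) - yy * np.sin(theta)   # x'
    yr = xx * np.sin(theta) + yy * np.cos(theta)   # y'
    mu = -(xr ** 2 + gamma ** 2 * yr ** 2) / (2.0 * sigma ** 2)
    return np.exp(mu) * np.cos(2.0 * np.pi * xr / wavelength + psi)
```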
Steps for deriving the feature space can be summarised as follows:

1. By down-sampling the image repeatedly by a factor of two, a four-level Gaussian pyramid was created. Image I_{k−1} is derived from image I_k by convolving with a Gaussian and then discarding pixels with odd-valued coordinates:

       I_{k−1}(i, j) = 4 Σ_{m=−2..2} Σ_{n=−2..2} w(m, n) I_k(2i + m, 2j + n)        (2)

   where w is the 2D Gaussian kernel.

2. Gabor filtering at four orientations (0°, 45°, 90°, 135°) was applied at each of the four scales.

3. The Gabor responses were totalled over the four orientations, for each scale level, to derive an orientation-independent measure of edge strength:

       f_σ(P) = Σ_{θ ∈ {0°, 45°, 90°, 135°}} (G_{P,θ} ∗ I_σ)                        (3)

   where G_{P,θ}, the Gabor filter response centred on point P with orientation θ, is convolved with the image I at scale σ in the Gaussian pyramid.

4. The "center-surround" operation [Itti et al. 1998] was applied to obtain a 6-component feature vector, F:

       F = (f_0 ⊖ f_1, f_1 ⊖ f_2, f_2 ⊖ f_3, f_0 ⊖ f_2, f_1 ⊖ f_3, f_0 ⊖ f_3)

   The operator ⊖ evaluates the difference in edge strengths for corresponding points at different scale levels in the Gaussian pyramid.

5. The above steps were repeated for different parameter settings of the Gabor filter. The bandwidth, β, was varied between 0.6 and 2.0, the wavelength, λ, from 4.0 to 10.0, and the aspect ratio, γ, between 0.6 and 1.5. Each image was 320×240 pixels. This resulted in a feature space composed of 288 components.
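Steps 1 to 4 can be sketched as below, reusing the gabor_kernel function from the previous sketch. The Gaussian blur width, the kernel extent and the use of bilinear zoom to compare maps across scales are assumptions made for illustration.

```python
import numpy as np
from scipy.ndimage import convolve, gaussian_filter, zoom

def gaussian_pyramid(image, levels=4):
    """Step 1 (Eq. 2): blur with a Gaussian, then discard odd-indexed pixels."""
    pyramid = [np.asarray(image, dtype=float)]
    for _ in range(levels - 1):
        blurred = gaussian_filter(pyramid[-1], sigma=1.0)
        pyramid.append(blurred[::2, ::2])
    return pyramid

def edge_strength(image, wavelength, bandwidth, gamma):
    """Steps 2-3 (Eq. 3): total the Gabor responses over the four orientations."""
    total = np.zeros_like(image)
    for theta in (0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4):
        kernel = gabor_kernel(7, wavelength, theta, gamma, bandwidth)
        total += convolve(image, kernel, mode="nearest")
    return total

def centre_surround_features(image, wavelength, bandwidth, gamma):
    """Step 4: six across-scale differences of edge strength per pixel."""
    f = [edge_strength(level, wavelength, bandwidth, gamma)
         for level in gaussian_pyramid(image)]
    diffs = []
    for a, b in [(0, 1), (1, 2), (2, 3), (0, 2), (1, 3), (0, 3)]:
        fb = zoom(f[b], 2.0 ** (b - a), order=1)     # upsample the coarser map
        h = min(f[a].shape[0], fb.shape[0])
        w = min(f[a].shape[1], fb.shape[1])
        diffs.append(f[a][:h, :w] - fb[:h, :w])
    h = min(d.shape[0] for d in diffs)
    w = min(d.shape[1] for d in diffs)
    return np.stack([d[:h, :w] for d in diffs], axis=-1)
```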
2.3 Feature Space Projection

The method to project each fixation point to the multi-dimensional feature space, F, has previously been used to extract saliency from radiography images [Hu et al. 2003]. The dwell time for a particular feature, ϑ ⊂ F_ϕ, is defined as:

    T_ϕ(ϑ) = Σ_{(x_i, y_i) ∈ Ω} Σ_{(x, y) ∈ O(x_i, y_i)} t_i g_i(x, y) Ψ_ϕ(ϑ, (x, y))        (4)

where Ω is the set of fixation points, O(x_i, y_i) is the foveal field of fixation about the fixation point (x_i, y_i), t_i is the fixation dwell time, g_i(x, y) is the Gaussian function centred at (x_i, y_i), and

    Ψ_ϕ(ϑ, (x, y)) = 1 if F_ϕ((x, y)) ∈ ϑ, and 0 otherwise.                                  (5)

In practice, the range spanned by each feature component, F_ϕ, is divided into a finite number of bins and a histogram is used to represent T_ϕ(ϑ).
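Equations (4) and (5), together with the binning just described, amount to a weighted histogram over feature values. A minimal sketch, with the foveal radius and Gaussian width chosen arbitrarily:

```python
import numpy as np

def dwell_time_histogram(feature_map, fixations, n_bins=32, fovea_radius=15.0):
    """Accumulate Eq. (4): dwell time per feature-value bin, weighted by a
    Gaussian foveal window g_i around each fixation (x_i, y_i, t_i)."""
    h, w = feature_map.shape
    yy, xx = np.mgrid[0:h, 0:w]
    edges = np.linspace(feature_map.min(), feature_map.max(), n_bins + 1)
    hist = np.zeros(n_bins)
    for (fx, fy, t) in fixations:
        d2 = (xx - fx) ** 2 + (yy - fy) ** 2
        inside = d2 <= fovea_radius ** 2                          # foveal field O(x_i, y_i)
        g = np.exp(-d2[inside] / (2.0 * (fovea_radius / 2.0) ** 2))
        values = feature_map[inside]
        bins = np.clip(np.digitize(values, edges) - 1, 0, n_bins - 1)  # Psi: bin membership
        np.add.at(hist, bins, t * g)                              # accumulate t_i g_i Psi
    return hist
```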
Figure 3: The saliency map derived from six volunteers and superimposed on a previously unseen image. (a) Saliency map with 20% non-zero pixels; (b) with 50% non-zero pixels.

The projection bias of the a priori probability distribution, F_ϕ(P), is eliminated through normalisation using the background bias evaluated over the entire image:

    H_ϕ(ϑ) = T_ϕ(ϑ) / Σ_{(x, y)} Ψ_ϕ(ϑ, (x, y))        (6)
2.4 Weighted Correlation

To assess the efficacy of the proposed method, an intensity-based similarity measure, shown to be effective in bronchoscopy video registration [K.Mori et al. 2000-02], was adapted by applying a weighting factor, w, to the estimated normalised cross correlation of the video image, A, and the rendered image generated from the 3D CT dataset, B. With

    μ_A = (Σ w_i A_i)/t,   μ_B = (Σ w_i B_i)/t,   t = Σ w_i,   s = Σ w_i²,                      (7)

and

    σ_A² = t/(t² − s) (Σ w_i A_i² − t μ_A²),   σ_B² = t/(t² − s) (Σ w_i B_i² − t μ_B²),          (8)

the weighted correlation is defined as

    Corr(w, A, B) = (t Σ w_i A_i B_i − t² μ_A μ_B) / (σ_A σ_B (t² − s))                          (9)

Corr(w, A, B) reduces to the classical correlation for a constant weighting distribution. Note also that pixels with w_i = 0 are completely ignored during the actual computation. The similarity measure is thus:

    S(w, A, B) = MSE(A, B) / (1 + Corr(w, A, B))                                                 (10)

where MSE(A, B) is the mean square error over corresponding pixels of A and B.
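A direct transcription of Eqs. (7)–(10) in NumPy is given below as a sketch; the original implementation is not available.

```python
import numpy as np

def weighted_correlation(w, A, B):
    """Eqs. (7)-(9): normalised cross correlation with per-pixel weights w."""
    w = np.asarray(w, dtype=float).ravel()
    A = np.asarray(A, dtype=float).ravel()
    B = np.asarray(B, dtype=float).ravel()
    t = w.sum()
    s = (w ** 2).sum()
    mu_a = (w * A).sum() / t
    mu_b = (w * B).sum() / t
    var_a = t / (t ** 2 - s) * ((w * A ** 2).sum() - t * mu_a ** 2)
    var_b = t / (t ** 2 - s) * ((w * B ** 2).sum() - t * mu_b ** 2)
    cov = t / (t ** 2 - s) * ((w * A * B).sum() - t * mu_a * mu_b)
    return cov / np.sqrt(var_a * var_b)

def similarity(w, A, B):
    """Eq. (10): the measure minimised during registration."""
    mse = np.mean((np.asarray(A, dtype=float) - np.asarray(B, dtype=float)) ** 2)
    return mse / (1.0 + weighted_correlation(w, A, B))
```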
3 Results

3.1 Feature selection

Feature selection was carried out to find which feature components, ϕ, and corresponding values, ϑ, were most representative of salient features. The chosen pair (ϕ, ϑ) can be considered to represent a general visual search strategy adopted by human observers. Each component was scored according to how widely its modes were spread over the population of volunteers. The three most prominent modal values in H_ϕ were selected for the set of nine slides used in the eye-tracking experiments. The components with the least variance in modal value (i.e. the lowest standard deviation when the location of each extremum is treated as a random variable) across the sample of volunteers were selected as the candidate features. The mean of the modal values determined the value of ϑ. These chosen values of ϑ were subsequently back-projected into the image space by using Ψ_ϕ. This binary image was convolved with a Gaussian filter with a kernel width consistent with the foveal field. A threshold was then applied, with low-valued pixels set to zero, and the resulting image was used as the saliency map. The threshold value was chosen to generate saliency maps with a fixed percentage of non-zero pixels. For quantitative assessment, saliency maps with 10% to 70% non-zero pixels were applied to the weighted correlation measure used in the 2D/3D registration algorithm.
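The back-projection and thresholding just described can be sketched as follows; the foveal Gaussian width is an assumed value, and a quantile is used to hit the requested percentage of non-zero pixels.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def saliency_map(feature_map, value_range, fovea_sigma=10.0, keep_fraction=0.3):
    """Back-project the selected feature value range (Psi), blur with a
    foveal-width Gaussian, and keep only the strongest fraction of pixels."""
    lo, hi = value_range
    binary = ((feature_map >= lo) & (feature_map <= hi)).astype(float)
    blurred = gaussian_filter(binary, sigma=fovea_sigma)
    threshold = np.quantile(blurred, 1.0 - keep_fraction)   # fixes the % of non-zero pixels
    return np.where(blurred >= threshold, blurred, 0.0)
```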
One representative image used in this experiment is illustrated in Figure 2, where the fixation points of three different observers are highlighted with dark circles. The size of each circle is proportional to the dwell time of the fixation. Figure 3 shows the resulting saliency map. It is evident that the salient region covers only a small part of the image, which can subsequently be used as the mask for 2D/3D registration.
3.2 Pose estimation
For each video image, i ∈ I, in a sample of 600 images, the best solution pose, S_i, was obtained via traditional normalised cross correlation and validated against positional information from the EM tracker. After visual inspection, any clearly misregistered S_i was discarded, which resulted in a set of 505 solution poses. A set of saliency maps was generated for each video image used, by varying the percentage of non-zero pixels between 20% and 70%. Scalar offsets, randomly chosen from a uniform distribution, were then added to the pose parameters of solutions selected at random from the set, S, to produce 1284 starting poses. The Powell minimisation method [Press et al. 1992] was employed in the registration algorithm for pose estimation and was initialised with each of the starting poses in turn. Each execution of the algorithm involved the registration of the 3D CT phantom model data with the 2D video image corresponding to the solution pose from which the initial pose was generated.
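The generation of randomly perturbed starting poses can be sketched as below; the offset ranges are illustrative and not the values used in the experiments.

```python
import numpy as np

def perturbed_start_poses(solution_pose, n, max_shift=5.0, max_angle=0.05, seed=0):
    """Add uniform random offsets to a solution pose (vx, vy, vz, theta0, theta1, theta2)
    to produce n starting poses for the Powell-based registration runs."""
    rng = np.random.default_rng(seed)
    shifts = rng.uniform(-max_shift, max_shift, size=(n, 3))   # translation offsets (mm)
    angles = rng.uniform(-max_angle, max_angle, size=(n, 3))   # Euler angle offsets (rad)
    return np.asarray(solution_pose, dtype=float) + np.hstack([shifts, angles])
```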
The registration algorithm was initialised with one of the randomly perturbed poses and executed using unweighted correlation. The pose parameters to which it converged were recorded. The program was then re-executed using the weighted correlation similarity measure with each of the saliency maps. For each run, the identical initialisation pose was used as for the unweighted correlation. The saliency-enabled registration algorithm was executed a total of 5954 times.
To investigate how saliency affected registration accuracy, the distance between the best pose and the pose to which the algorithm converged was measured. A preliminary examination of the Euclidean distance and the total absolute error in orientation (Euler angle error) at first seemed to indicate improved accuracy over classic intensity correlation. Paradoxically, the returned poses improved in accuracy as fewer and fewer pixels were considered salient. On closer examination it was found that the use of saliency had introduced more local minima into the optimisation landscape. As a result, the algorithm was more likely to converge to a pose near to where it had started, and these initial poses had been chosen quite close to the best solution. This bias can be avoided by choosing the initial poses from a fixed distribution, independent of the best pose. Doing so, however, caused almost all of the trial runs to fail to converge correctly, making it infeasible to assess the effect of the saliency map in this way. Furthermore, in video tracking applications the initial pose is expected to be fairly close to the best solution. For a more meaningful investigation of the effect of saliency on registration accuracy, the following criteria were selected to determine when a pose was considered to be correctly registered:

• The pose must be within 7 mm of the best solution. (The passages of the scaled-up phantom range from 50 mm to 90 mm in diameter.)
• The pose orientation angles may differ from the best solution by no more than 3°.

These criteria were qualitatively validated by visual inspection of a sample of poses output by the algorithm. Figure 4 illustrates this process with one selected video frame (a) and two poses returned by the algorithm which were used to render images (b) and (c). The first pose is clearly misregistered but the second is properly registered. The experimental software superimposes the rendered and video images in the same window to make this distinction easier. Although subjective, this classification scheme can be applied consistently to every pose in an unbiased manner. In this way the registration success rate was estimated for saliency maps covering a varying percentage of the video image (Table 1). Saliency maps with between 30% and 60% non-zero pixels lead to an almost twofold improvement in the registration success rate over traditional intensity-based correlation.

Figure 4: (a) A video image of the phantom lung model captured with a small CCD camera. (b) An image of the phantom rendered from the CT scan dataset; it is clearly seen that the CT data is not properly registered with the video image. (c) The phantom CT data rendered after successful 2D/3D registration with the video image.
    % salient pixels    Relative improvement in registration success rate
    10                  0.86
    20                  1.49
    30                  1.82
    40                  1.94
    50                  1.67
    60                  1.94
    70                  1.69

Table 1: Relative improvement in the rate of successful convergence for saliency-based 2D/3D registration with a simple illumination model. Improvement is measured relative to unweighted 2D/3D registration under otherwise identical conditions. Performance degrades at 10% saliency but improves when more pixels are considered salient.
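Under the stated criteria, the acceptance test and the success rate can be computed as in the sketch below. Poses are assumed to be stored as (vx, vy, vz, θ0, θ1, θ2) with angles in radians; this layout and the per-angle interpretation of the 3° tolerance are assumptions.

```python
import numpy as np

def correctly_registered(pose, best_pose, pos_tol_mm=7.0, angle_tol_deg=3.0):
    """True if the pose is within 7 mm and every Euler angle within 3 degrees of the best solution."""
    pose = np.asarray(pose, dtype=float)
    best = np.asarray(best_pose, dtype=float)
    close_in_position = np.linalg.norm(pose[:3] - best[:3]) <= pos_tol_mm
    close_in_angle = np.all(np.abs(pose[3:] - best[3:]) <= np.radians(angle_tol_deg))
    return bool(close_in_position and close_in_angle)

def success_rate(converged_poses, best_poses):
    """Fraction of runs whose converged pose passes both criteria."""
    hits = [correctly_registered(p, b) for p, b in zip(converged_poses, best_poses)]
    return float(np.mean(hits))
```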
4 Discussion
We have shown that selective weighting based on saliency offers a viable method of 2D/3D registration. Saliency-based 2D/3D registration utilising as little as one quarter of the image content has been shown to achieve accuracy comparable to traditional correlation-based 2D/3D registration. Utilising eye tracking during visual search tasks is a novel approach to determining visual saliency. The feasibility of this approach has been demonstrated by acquiring eye-tracking data from a group of volunteers, which was used as the training data set for automatically determining salient features in video frames, most of which had not previously been seen by the observers. Salient image features were extracted based on an analysis of human visual behaviour but without the use of domain-specific knowledge.

There are a number of factors that affect the quality of the derived saliency maps, such as prior knowledge, expectation, the nature of the question being asked, the number of volunteers, feature selection and extraction, and the process of combining different human visual search strategies. Furthermore, the completeness of the chosen feature space library significantly influences the effectiveness of this method. It is worth noting that in deriving the saliency map, only one feature component selected from a large set of candidates was utilised. Strategies for combining saliency maps [Itti and Koch 1999] could further enhance the quality of the system, improve immunity to noise, and increase robustness to spurious eye movements. It is also worth noting that the method for assessing the success rate of the 2D/3D registration, although consistently applied, was based on subjective observations.

It is known that individuals can employ different visual search strategies for the same image, so the extraction of common visual search behaviour is better realised by expanding the study population. This should improve the robustness of the technique in the face of spurious eye movements, and also enable the exploration of how differences in experience level affect the quality of the saliency map. Variations of the core technique for deriving saliency form another avenue for future investigation. Although features based on multi-scale contrast and Gabor filter response (Eq. 3) have been shown to be important in biological vision systems, there exist alternative approaches for generating saliency maps which utilise information theory [Jägersand 1993] and multiscale wavelets [Shokoufandeh et al. 1999].

The experiments confirm that, when no special lighting model is adopted, contrast-based saliency maps can improve normalised cross correlation as an image similarity measure in 2D/3D registration. This indicates that comparing images using features selected according to Gabor filter response tends to be more immune to lighting conditions. It must be noted, however, that the effectiveness of the normalised cross correlation measure in 2D/3D registration depends greatly on how accurately one can model the illumination conditions in the rendered images. One method by which the convergence of the optimisation process can be significantly improved is to employ a carefully tuned function that attenuates reflected light intensity according to the distance from the light source. In practice, however, the attenuation parameters may have to be adjusted specifically to each situation.

In summary, a first attempt at 2D/3D image registration using observer-derived saliency maps has been presented.
It has been demonstrated that implicit modelling of human visual search behaviour can enhance computer vision algorithms and improve their computational performance.
References
Ayoub, A. F., Siebert, J. P., Wray, D., and Moos, K. F. 1998. A vision-based three dimensional capture system for maxillofacial assessment and surgical planning. Brit. J. Oral Maxillofacial Surg. 36, 353–357.

Dey, D., Slomka, P. J., Gobbi, D. G., and Peters, T. M. 2000. Mixed reality merging of endoscopic images and 3-D surfaces. In Medical Image Computing and Computer-Assisted Intervention - MICCAI 2000, Third International Conference, Pittsburgh, Pennsylvania, USA, October 11-14, 2000, Proceedings, Springer, vol. 1935 of Lecture Notes in Computer Science, 796–803.

Dey, D., Gobbi, D., Slomka, P., Surry, K., and Peters, T. 2002. Automatic fusion of freehand endoscopic brain images to three-dimensional surfaces: Creating stereoscopic panoramas. IEEE Transactions on Medical Imaging 21, 1, 23–30.

Einhäuser, W., and König, P. 2003. Does luminance-contrast contribute to a saliency map for overt visual attention? European Journal of Neuroscience 17, 5, 1089–1097.

Faisal, A. A., Fislage, M., Pomplun, M., Rae, R., and Ritter, H. 1998. Observation of human eye movements to simulate visual exploration of complex scenes. Tech. rep., University of Bielefeld.

Helferty, J. P., and Higgins, W. E. 2001. Technique for registering 3D virtual CT images to endoscopic video. In IEEE International Conference on Image Processing.

Hill, D. L. G., Batchelor, P. G., Holden, M., and Hawkes, D. J. 2001. Medical image registration. Physics in Medicine and Biology 46, R1–R45.

Hu, X.-P., Dempere-Marco, L., and Yang, G.-Z. 2003. Hot spot detection based on feature space representation of visual search. IEEE Transactions on Medical Imaging, to appear.

Itti, L., and Koch, C. 1999. A comparison of feature combination strategies for saliency-based visual attention systems. In SPIE Human Vision and Electronic Imaging IV (HVEI 1999), San Jose, CA, vol. 3644, 373–382.

Itti, L., Koch, C., and Niebur, E. 1998. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 11, 1254–1259.

Jägersand, M. 1993. Saliency maps and attention selection in scale and spatial coordinates: An information theoretic approach.

Joskowicz, L. 2000. Fluoroscopy-based navigation in computer-aided orthopaedic surgery. In Proc. of the IFAC Conference on Mechatronic Systems.

K. Mori, Y. Suenaga, J. Toriwaki, J. Hasegawa, K. Kataa, H. Takabatake, and H. Natori. 2000-02. A method for tracking camera motion of real endoscope by using virtual endoscopy system. In Proc. of SPIE, vol. 3978, 134–145.

Leventon, M. E., and Grimson, W. E. L. 1998. Multi-modal volume registration using joint intensity distributions. Lecture Notes in Computer Science 1496, 1057–??.

Maintz, J. B. A., and Viergever, M. A. 1998. A survey of medical image registration. Medical Image Analysis 2, 1, 1–36.

Mori, K., Deguchi, D., Sugiyama, J., Suenaga, Y., Toriwaki, J., Jr., C. R. M., Takabatake, H., and Natori, H. 2002. Tracking of a bronchoscope using epipolar geometry analysis and intensity-based image registration of real and virtual endoscopic images. Medical Image Analysis 6, 321–336.

Penney, G. P., Weese, J., Little, J. A., Desmedt, P., Hill, D. L. G., and Hawkes, D. J. 1998. A comparison of similarity measures for use in 2D–3D medical image registration. Lecture Notes in Computer Science 1496, 1153–??.

Petkov, N., and Kruizinga, P. 1997. Computational models of visual neurons specialised in the detection of periodic and aperiodic oriented visual stimuli: bar and grating cells. Biological Cybernetics 76, 2, 83–96.

Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. 1992. Numerical Recipes in C, 2nd edition. Cambridge University Press.

Roche, A., Malandain, G., Pennec, X., and Ayache, N. 1998. The correlation ratio as a new similarity measure for multimodal image registration. Lecture Notes in Computer Science 1496, 1115–??.

Shokoufandeh, A., Marsic, I., and Dickinson, S. J. 1999. View-based object recognition using saliency maps. Image and Vision Computing 17, 5-6 (April), 445–460.

Tendick, F., Polly, D., Blezek, D., Burgess, J., Carignan, C., Higgins, G., Lathan, C., and Reinig, K. 1999. Final Report of the Technical Requirements for Image-Guided Spine Procedures, ch. 3, Operative Planning and Surgical Simulators. Imaging Science and Information Systems (ISIS) Center, Department of Radiology, Georgetown University Medical Center, 2115 Wisconsin Avenue, N.W., Suite 603, Washington, DC 20007.

Tsotsos, J., Culhane, S., Wai, W., Lai, Y., Davis, N., and Nuflo, F. 1995. Modelling visual attention via selective tuning. Artificial Intelligence 78, 1-2 (Oct.), 507–545.

Viola, P., and Wells III, W. M. 1997. Alignment by maximization of mutual information. International Journal of Computer Vision.

Wagner, D., Wegenkittl, R., and Gröller, E. 2002. Endoview: A phantom study of a tracked virtual bronchoscopy. In Journal of WSCG, V. Skala, Ed., vol. 10.

Whitted, T. 1979. An improved illumination model for shaded display. In Computer Graphics (Special SIGGRAPH '79 Issue), vol. 13, 1–14.

Yang, G.-Z., Dempere-Marco, L., Hu, X.-P., and Rowe, A. 2002. Visual search: psychophysical models and practical applications. Image and Vision Computing 20, 4, 291–305.