2011 International Conference on Digital Image Computing: Techniques and Applications
Improved Symmetric-SIFT for Multi-modal Image Registration

Md. Tanvir Hossain, Guohua Lv, Shyh Wei Teng, Guojun Lu
Gippsland School of Information Technology, Monash University
Churchill, Victoria 3842, Australia
Email: {Tanvir.Hossain, Guohua.Lv, Shyh.Wei.Teng, Guojun.Lu}@monash.edu

Martin Lackmann
Dept. of Biochemistry & Molecular Biology, Monash University
Clayton, Victoria 3800, Australia
Email: [email protected]
Abstract—Multi-modal image registration has received significant research attention over the past decade. Symmetric-SIFT is a recently proposed local description technique that can be used for registering multi-modal images. It is based on a well-known general image registration technique named Scale Invariant Feature Transform (SIFT). Symmetric-SIFT, however, achieves its invariance to multi-modality at the cost of losing important information. In this paper, we show how this loss may adversely affect the accuracy of registration results. We then propose an improvement to Symmetric-SIFT to overcome the problem. Our experimental results show that the proposed technique can improve the number of true matches by up to 10 times and overall matching accuracy by up to 30%.
Fig. 1. Stained microscopic images (a) and (b) having very low overlap between them. The common portions of the images are marked with dotted lines.
I. INTRODUCTION

Image registration is the process of aligning two images that have most likely been acquired under different imaging conditions. A good image registration technique should be able to correctly identify the corresponding regions and determine the appropriate geometric transformation required to map the sensed image onto the reference image, despite the presence of varying imaging conditions.

A. Image Registration and Image Description

In order to register two images, it is necessary to measure the amount of dissimilarity, misalignment or lack of correspondence between the two input images. The lower the dissimilarity, the better the alignment. To make this comparison possible, different registration techniques employ different sorts of descriptors. Descriptors may be broadly categorized into two types: global and local. Global descriptors describe the entire image as a whole and, therefore, there would usually be only one such descriptor per image. Local descriptors, on the other hand, represent prominent and stable parts of an image. Thus a single image may have more than one local descriptor, each describing one of the stable parts (or key regions) of the image.

B. Multi-modal Image Registration

Two or more images are called multi-modal if each of them is captured by a different sensor (imaging device or modality). Images capturing the same object using different sensing devices might represent it by different combinations of colors or intensities. Of course, such differences may arise in addition to having
978-0-7695-4588-2/11 $26.00 © 2011 IEEE DOI 10.1109/DICTA.2011.40
other variations in imaging conditions (such as differences in time, scale, viewpoint or noise). Therefore, multi-modal image registration is far more complicated and challenging than basic registration problems.

A good number of global techniques [1], [2], [3], [4], [5] in the literature can be used for multi-modal registration. However, we are interested in local description techniques, as they are more invariant to affine transformations and less affected by the presence of outliers, occlusion, clutter, truncation and low overlap between images. Global techniques also commonly have to deal with complex optimization problems. Fig. 1 shows an example image pair for which solving the registration problem with a global technique may be very difficult due to the low overlap between the images. Despite the strengths of local description techniques, very few [6], [7], [8] have been studied for multi-modal registration so far. In this paper we concentrate on a recently proposed local description technique named Symmetric-SIFT [6], which is in essence a variation of the widely used local description technique Scale Invariant Feature Transform (SIFT) [9]. Unlike SIFT, Symmetric-SIFT has been adapted for multi-modal registration.

Multi-modal image registration has applications in a wide range of domains including remote sensing, robot navigation, aerial imagery analysis and medical imaging. CT ↔ MR, CT ↔ PET, CT ↔ SPECT, PET ↔ MR and PET ↔ US are just a few examples of multi-modal medical image registration problems [10]. The capability of registering multi-modal images efficiently and correctly is important to the medical field, as it aids in better and faster diagnosis of diseases and can also help in the optimal planning of complex radiotherapeutic and neurosurgical procedures.

C. Contribution of the Paper

The problem that we try to overcome in this work is caused by the process of descriptor merging in Symmetric-SIFT. The merging process will be described in Section IV. We show that though Symmetric-SIFT is invariant to modality changes, the invariance is achieved at the cost of a loss of information. We also show how this loss might greatly reduce the overall matching accuracy. We then propose an improved technique that not only raises key-point matching accuracy but also significantly increases the number of true matches.

The rest of the paper is organized as follows. Section II describes the fundamental operations involved in the original SIFT. Section III describes some interesting properties of multi-modal images that influence the design of Symmetric-SIFT. Section IV then outlines the Symmetric-SIFT technique and identifies certain issues with it. Section V describes our proposed solution to the identified problems and Section VI presents our experimental results. Finally, Section VII concludes this paper.

II. SIFT: A KEY-POINT-BASED IMAGE DESCRIPTOR

Both SIFT and Symmetric-SIFT are key-point-based image description techniques. Most key-point-based image registration techniques can be decomposed into the following five phases: key-point detection, key-point description, similarity matching, transformation estimation and image transformation. Key-points are those points in an image that are believed to survive a wide range of geometric and photometric transformations. The purpose of the detection phase is to identify such stable points in a given image.
In the description phase, each identified key-point is described numerically so that it can be used for comparison in the similarity matching phase. The next phase is to compute the similarity of the descriptors from the two images. The set of matching descriptors indicates the corresponding parts of the images. This information is used to derive a transformation function that maps one image onto the other, which constitutes the final phase of registration. The accuracy of registration, therefore, depends on the accuracy of the matching set.

As both SIFT and Symmetric-SIFT are key-point-based image description techniques and are very closely related to each other, we first describe how SIFT works. SIFT, originally introduced by Lowe [9], is a widely used key-point description technique in both image registration and retrieval. SIFT key-points are identified by applying the Difference of Gaussians (DoG) in scale space. Next, the dominant orientation O of the gradients is computed within a region R around each key-point. The dominant orientation is the direction in which most of the gradients of a particular key-region are oriented. The size of R is determined based on the scale at which the key-point was identified. This makes the final descriptor invariant
to scale change. Rotation invariance, on the other hand, is achieved by building the descriptor relative to the dominant orientation O. The final descriptor is built on a 4-by-4 spatial grid where each cell of the grid consists of an 8-bin orientation histogram. All gradients within a cell are quantized into one of these 8 bins. Thus a SIFT descriptor has (4×4×8 =) 128 dimensions. Descriptors from a given image pair are then analyzed to derive the key-point matching set.

III. PROPERTIES OF MULTI-MODAL IMAGES

A. Intensity Variations

The same portion of an object may be represented by different intensities in images captured from different modalities [1]. This is because different sensors may have different levels of sensitivity to a particular part of an object. Moreover, portions of an object may remain invisible to some sensors and visible to others, as some sensors cannot detect their presence whereas others can. Variation in intensities has two possible consequences, which are introduced in the following subsection.

B. Gradient Reversal and Region Reversal

It is very common in multi-modal images for the gradients of corresponding parts of the images to change their directions by exactly 180° (see Fig. 2) [1], [11]. We call this property 'Gradient Reversal' for future reference. Gradient reversal is one of the main reasons SIFT fails with multi-modal images.

Gradient reversal may also cause inaccuracies in the rotation normalization of regions. We have seen in Section II that rotation normalization is done by aligning regions based on the directions of their dominant orientations. However, the presence of gradient reversal also reverses the direction of the dominant orientation. Therefore, even if two visually similar regions are rotation normalized, they can still remain totally out of phase, a situation we denote as 'Region Reversal'. As a result, descriptors built on these regions will not match.
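The interplay between dominant orientation and gradient reversal can be made concrete with a short sketch. Here `dominant_orientation` is a hypothetical helper computing a simplified SIFT-style orientation (the peak of a 36-bin histogram, without the Gaussian weighting or peak interpolation of real SIFT); reversing every gradient flips its output by 180°:

```python
import numpy as np

def dominant_orientation(angles, mags, bins=36):
    # Simplified SIFT-style dominant orientation: the peak of a
    # magnitude-weighted 36-bin orientation histogram (no Gaussian
    # weighting or parabolic peak interpolation as in real SIFT).
    hist, edges = np.histogram(np.mod(angles, 2 * np.pi), bins=bins,
                               range=(0, 2 * np.pi), weights=mags)
    k = int(np.argmax(hist))
    return 0.5 * (edges[k] + edges[k + 1])  # centre of the peak bin

rng = np.random.default_rng(1)
ang = rng.normal(1.0, 0.2, 200)              # gradients clustered near 1 rad
mag = rng.uniform(0.5, 1.0, 200)
o1 = dominant_orientation(ang, mag)
o2 = dominant_orientation(ang + np.pi, mag)  # gradient reversal
# The dominant orientation flips by 180 degrees, so rotation
# normalization leaves the two regions out of phase (region reversal).
assert np.isclose(np.mod(o2 - o1, 2 * np.pi), np.pi, atol=0.2)
```

Because both regions are rotated to their own (now opposite) dominant orientations, the normalized regions stay 180° apart, which is exactly the region reversal described above.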
Both gradient reversal and region reversal are illustrated in Fig. 3.

In addition to exhibiting gradient reversal and region reversal, medical microscopic images also commonly contain many objects that are visually very similar. For example, microscopic images of human tissue may contain many cell particles that are very similar in structure and color. As a result, if the descriptors built to represent different objects in such images are not sufficiently distinctive, different objects in the two images may easily be matched to each other. These false matches lead to poor registration performance.

IV. SYMMETRIC-SIFT

A. Overview

Symmetric-SIFT [6] addresses the gradient reversal problem by limiting all the gradients to the range [0, π). Let us denote this intermediate descriptor 'Gradient Symmetric-SIFT' (or GS-SIFT). As gradient reversal may not always happen, the existence of region reversal is also unknown.
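The folding step at the heart of GS-SIFT fits in one line. In this sketch `fold_orientation` is a hypothetical helper name, but the idea is the one just described: a gradient and its 180°-reversed counterpart map to the same value, so gradient reversal no longer matters:

```python
import numpy as np

def fold_orientation(theta):
    # Map a gradient direction into [0, pi), as GS-SIFT does.
    return np.mod(theta, np.pi)

theta = 0.7
# A gradient and its reversed counterpart become indistinguishable:
assert np.isclose(fold_orientation(theta), fold_orientation(theta + np.pi))
```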
Fig. 2. (a) T1- and (b) T2-weighted brain MR images sampled from our test data set. The arrows in the insets show how gradient direction may change with modality.

Fig. 3. (a-i) and (a-ii) are two example images of the same object, assumed to have been captured by different sensing devices. The arrows indicate the directions of the gradients; note how the gradients reverse their directions due to the change in modality (gradient reversal). (b-i) and (b-ii) show the same sample image regions rotation-normalized based on their dominant orientations; note how they still remain out of phase by 180° even after rotation normalization (region reversal).
Fig. 4. (a-i–iii) The GS-SIFT merging process in Symmetric-SIFT. (b-i–iv) The ambiguity in Symmetric-SIFT's merging process.
Symmetric-SIFT, therefore, initially builds two intermediate GS-SIFT descriptors, one of which assumes the presence of region reversal while the other does not. These two GS-SIFT descriptors are out of phase with each other by 180°. Both intermediate descriptors are then merged together to form a single descriptor. Fig. 4 illustrates the merging process using 4-by-4 sample GS-SIFT descriptors to convey the basic idea simply. In this figure, (a-i) shows a sample GS-SIFT descriptor and (a-ii) shows the opposite GS-SIFT descriptor built by rotating the initial descriptor by 180°. (a-iii) then shows the merged Symmetric-SIFT descriptor, whose upper and lower halves are calculated by taking the summation and the absolute difference, respectively, between the corresponding values from (a-i) and (a-ii).

B. The Problem

In Fig. 4, we show that the merged Symmetric-SIFT descriptor (b-iv) can be produced either from (b-i and ii) or from (b-i and iii). However, for multi-modal images, we want (b-i) to match with either itself or with (b-ii), but certainly
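On the simplified 4-by-4 grids of Fig. 4, the merging step (and the information loss it entails) can be sketched as follows; `merge_symmetric` is our own illustrative name, and the real descriptor is of course 128-dimensional:

```python
import numpy as np

def merge_symmetric(d):
    # Merge a (simplified 4x4) GS-SIFT descriptor grid with its
    # opposite, 180-degree rotated variant: keep half of the
    # element-wise sum and half of the element-wise absolute
    # difference, as in Fig. 4 (a-iii).
    d = np.asarray(d, dtype=float)
    r = np.rot90(d, 2)            # the opposite GS-SIFT descriptor
    upper = (d + r)[:2]           # summation half
    lower = np.abs(d - r)[:2]     # absolute-difference half
    return np.vstack([upper, lower])

d = np.arange(16.0).reshape(4, 4)
# Both the original grid and its 180-degree rotated twin merge to the
# SAME descriptor: the merge cannot tell them apart, which is the
# ambiguity (and information loss) discussed in the text.
assert np.allclose(merge_symmetric(d), merge_symmetric(np.rot90(d, 2)))
```

Both the sum and the absolute difference are symmetric in their two arguments, so any pair of grids that swaps into the same sum/difference values merges to the same final descriptor.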
not with (b-iii). The merging process of Symmetric-SIFT is, therefore, ambiguous. In fact, many other combinations that merge to the same final descriptor can be derived. This is why Symmetric-SIFT is prone to generating too many false positives, preventing it from attaining higher key-point matching and registration accuracy.

V. PROPOSED TECHNIQUE TO IMPROVE REGISTRATION ACCURACY

In spite of the issue described in Section IV, Symmetric-SIFT has been shown to have better accuracy than other techniques for multi-modal images [2], [9]. Moreover, the strengths inherited from the proven SIFT technique inspired us to build on the results obtained by Symmetric-SIFT. The problem with Symmetric-SIFT described in Section IV-B can be overcome if we can avoid the merging process. Sections V-A and V-B describe the two steps involved in our proposed technique. The first step estimates the overall orientation difference between the given images by analyzing the initial matching set. This estimated value is used in the second step to rotation normalize the descriptors. This way of normalization does not cause region reversal. As a result, we can skip the descriptor merging step (of Symmetric-SIFT) in the second step and achieve better matching accuracy.

Fig. 5. Sample test image pairs: (a) T1 vs T2 weighted brain MR images; (b) near-infrared vs visual images; (c) and (d) medical microscopic images.

A. Step I: Identifying the Phase Difference

Our proposed technique starts by analyzing the initial key-point matching set obtained by Symmetric-SIFT. Suppose we have two multi-modal images I1 and I2 with a rotational difference of α degrees between them. Let the initial matching set be {D_i1 → D_j1, D_i2 → D_j2, D_i3 → D_j3, ..., D_in → D_jn}, where i and j denote the indices of the key-points from I1 and I2, respectively, and n is the total number of matches found. Let {θ_i1, θ_i2, θ_i3, ..., θ_in} be the dominant orientations of {D_i1, D_i2, D_i3, ..., D_in} and {θ_j1, θ_j2, θ_j3, ..., θ_jn} be the dominant orientations of {D_j1, D_j2, D_j3, ..., D_jn}. Next we define {α_1 = θ_i1 − θ_j1, α_2 = θ_i2 − θ_j2, α_3 = θ_i3 − θ_j3, ..., α_n = θ_in − θ_jn} to be the differences between the dominant orientations of each key-point pair in the initial matching set. Assuming a rigid transformation between I1 and I2, the following relation holds in an ideal situation where all the matches in the initial matching set are true:

α_1 = α_2 = ... = α_n = α_D = the difference in rotation between I1 and I2.

However, as the α values tend to vary in a real scenario, we propose to compute median(α) and consider it to be the phase difference α_D between the two given images I1 and I2. We have also conducted experiments using mean(α) and mode(α). As all three measures produce comparable results, in this paper we present results based on the median.

B. Step II: Rebuilding the Descriptors

In this step we start by building GS-SIFT descriptors for I1. We do the same for I2, except that in this case the regions are rotated by α_D degrees before any descriptor is built.
It is important to note that, instead of using dominant orientations, we simply use α_D to align the regions. As this approach does not require rotation normalization via dominant orientations, the region reversal problem becomes irrelevant and the GS-SIFT merging process is no longer required. Once all the descriptors are rebuilt, we perform the final matching.
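The two steps above can be sketched together under stated simplifications. `estimate_phase_difference` implements Step I (the median of dominant-orientation differences; circular wrap-around of angles is ignored in this sketch), and `gs_cell_histogram` illustrates Step II at the level of a single folded histogram cell, whereas the real Step II rotates the image regions themselves; both names are our own:

```python
import numpy as np

def estimate_phase_difference(theta_i, theta_j):
    # Step I: alpha_k = theta_ik - theta_jk for each initial match;
    # the median is robust to the false matches in the initial set.
    alphas = np.mod(np.asarray(theta_i) - np.asarray(theta_j), 2 * np.pi)
    return np.median(alphas)

def gs_cell_histogram(angles, mags, alpha=0.0, bins=8):
    # Step II (one cell): rotate I2's gradient directions back by
    # alpha_D, fold into [0, pi) as in GS-SIFT, and histogram them.
    # No dominant-orientation normalization, hence no region reversal
    # and no need for descriptor merging.
    a = np.mod(np.asarray(angles) - alpha, np.pi)
    hist, _ = np.histogram(a, bins=bins, range=(0, np.pi), weights=mags)
    n = np.linalg.norm(hist)
    return hist / n if n else hist

# Toy data: I2 is rotated by 0.5 rad; the last pair is a false match.
theta_i = np.array([1.0, 2.0, 3.0, 0.8, 2.6])
theta_j = np.array([0.5, 1.5, 2.5, 0.3, 0.0])
alpha_d = estimate_phase_difference(theta_i, theta_j)
assert abs(alpha_d - 0.5) < 1e-9

rng = np.random.default_rng(0)
ang1 = rng.uniform(0, 2 * np.pi, 100)   # gradients of a region in I1
mag = rng.uniform(0.0, 1.0, 100)
# The same region seen in I2 (rotated by alpha_d) yields the same cell:
assert np.allclose(gs_cell_histogram(ang1, mag),
                   gs_cell_histogram(ang1 + alpha_d, mag, alpha=alpha_d))
```

The gross outlier in the toy matching set does not disturb the median estimate, which is the property that makes Step I usable on an initial matching set that still contains false matches.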
VI. PERFORMANCE STUDY

For our experiments, we used 101 pairs of T1-T2 coronal brain MR images and another 87 pairs of T1-T2 transverse brain MR images collected from the McConnell Brain Imaging Centre's online brain image data store [12]. We also used another 26 custom image pairs from different sources [7], [13], [14], [15]. These include 23 multi-modal, one aerial, one natural and one synthetic pair of images. The multi-modal images include near-infrared vs visual image pairs. We have also used 8 pairs of medical microscopic images. These are captured by different types of microscopes and are, therefore, multi-modal in nature, with large scale differences. Examples of some test image pairs are presented in Fig. 5.

A. Evaluation Measures

The accuracy of a registration technique depends highly on the accuracy of the key-point matching set. The higher the proportion of identified true matches, the better the chance of achieving high registration accuracy. Therefore, we evaluate our results based on the accuracy of the matching set, where

Accuracy = (number of true matches / number of total matches) × 100%.  (1)

The ground truths for all our test cases are trivial, as the transformations are known. A match with at most 4 pixels of error is accepted as a true match, which is consistent with the existing literature [16]. We also evaluate our technique with Recall versus 1-Precision curves [13]. Some of these graphs are included in this paper to illustrate the superiority of the proposed technique.

Why compare against Symmetric-SIFT? The technique proposed in this paper is a local technique. To our understanding, Symmetric-SIFT performs the best among the very few multi-modal local techniques and has, therefore, been used as the benchmark technique in this paper.
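Eq. (1), together with the 4-pixel acceptance criterion, can be computed as in the sketch below; the data structures and the name `matching_accuracy` are our own assumptions:

```python
import math

def matching_accuracy(matches, ground_truth, tol=4.0):
    # Eq. (1): percentage of true matches in the matching set.
    # A match is true if the matched location lies within `tol`
    # pixels (4 px here, as in the evaluation protocol) of the
    # ground-truth location of the same key-point.
    if not matches:
        return 0.0
    true_matches = 0
    for kp, (x, y) in matches.items():
        gx, gy = ground_truth[kp]
        if math.hypot(x - gx, y - gy) <= tol:
            true_matches += 1
    return 100.0 * true_matches / len(matches)

matches = {1: (10.0, 10.0), 2: (50.0, 52.0), 3: (80.0, 95.0)}
truth = {1: (11.0, 12.0), 2: (50.0, 50.0), 3: (80.0, 80.0)}
# Key-points 1 and 2 are within 4 px of the truth; key-point 3 is not.
assert abs(matching_accuracy(matches, truth) - 200.0 / 3.0) < 1e-9
```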
B. Results

Table I compares the average matching accuracy between the original Symmetric-SIFT and our proposed technique for a
TABLE I
COMPARISON OF OUR PROPOSED TECHNIQUE AND SYMMETRIC-SIFT ON THE BRAIN AND CUSTOM IMAGE DATA SETS.

Type of image pair   Total # of image   Method           Average matching accuracy (%)
                     pairs examined                      1.5x     2.0x     2.5x
T1-T2 Transverse     87                 Symmetric-SIFT   79.98    79.07    76.80
                                        Proposed         84.98    83.90    81.49
T1-T2 Coronal        101                Symmetric-SIFT   81.06    79.74    78.95
                                        Proposed         85.79    84.19    83.16
Custom images        26                 Symmetric-SIFT   80.18    79.13    77.87
                                        Proposed         84.61    82.95    80.86
TABLE II
COMPARISON OF OUR PROPOSED TECHNIQUE AND SYMMETRIC-SIFT ON THE MEDICAL MICROSCOPIC IMAGE DATA SET.

Specimen(a)   Scale difference   Matching accuracy (%)
                                 Symmetric-SIFT   Proposed
A1            1x                 65.00            73.02
B1            1x                 60.87            88.73
A1            2x                 33.33            48.89
B1            2x                 16.67            61.54
B2            2x                 58.82            87.50
B1            4x                 0                25.00
B2            4x                 20.00            47.62
B3            4x                 0                25.00

a. Denotes which of specimens A or B is used. Subscripts denote a particular area within a given specimen.
Fig. 6. Recall versus 1-Precision curves for (a) T1 vs T2 brain MR images, (b) near-infrared vs visual images and (c)-(d) medical microscopic images. For (a), (b) and (c) the input image pairs have a scale difference of 2x, whereas for (d) the scale difference is 4x.
Fig. 7. Comparison between Symmetric-SIFT and our proposed technique in terms of the number of true matches identified.
range of scale differences on the brain and custom image data sets. Table II, on the other hand, shows the results for the medical microscopic images. The results show that the accuracy of the matching set can improve by up to 30% when our proposed technique is used. Fig. 6 further compares the performance of our proposed technique against that of Symmetric-SIFT with the help of Recall versus 1-Precision curves. Though we present graphs for just a few pairs of images, the trend is the same across all the images in our test data set.

There are two key reasons our technique performs better. Firstly, rotation normalization based on the derived α_D value is less susceptible to error than using dominant orientations. Secondly, as there is no descriptor merging step involved, the information loss that occurs in Symmetric-SIFT is avoided entirely.

We would also like to emphasize that our proposed technique has a significant impact on the number of true matches identified. We found that the number of true matches may increase by up to 10 times. Fig. 7 demonstrates this trend of improvement for transverse MR image pairs having 15° of rotational difference. The T1-T2 image slices used in our experiments are ordered according to their positions in the original 3D stacks. Image slices obtained from the middle of a stack contain more brain tissue than those at either end of the stack, which is why the number of true matches is lower at both ends of the graph. Although we present a randomly picked case for demonstration, the trend of an increased number of true matches holds over all our test cases.

VII. CONCLUSION AND DISCUSSION

Key-point matching accuracy is vital to the performance of image registration. According to our experimental results, our proposed technique can increase the average matching accuracy by up to 30% and the number of true matches by up to 10 times. Moreover, though we have used our technique in conjunction with Symmetric-SIFT, the concept can, in principle, also fit other appropriate local description techniques.

Although we have to rebuild all the descriptors in the second phase of our proposed technique, many steps done in the first phase are not repeated during the second phase. We are working to further optimize our technique and reduce its computational complexity.

The technique proposed in this paper works for images having a global rotation. However, the relative rotation between two descriptors is likely to be negligible if their spatial distance is sufficiently small. Therefore, our technique can easily be extended to images with local rotations (or non-rigid transformations) by clustering the descriptors spatially and estimating local rotations. Again, though all the experiments in this paper are based on 2D images, being a variant of SIFT, our proposed technique can also be adapted for 3D image registration [17].

REFERENCES
[1] J. Pluim, J. Maintz, and M. Viergever, "Image registration by maximization of combined mutual information and gradient information," in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2000. Springer, 2000, pp. 103–129.
[2] J. Orchard, "Efficient least squares multimodal registration with a globally exhaustive alignment search," IEEE Transactions on Image Processing, vol. 16, no. 10, pp. 2526–2534, 2007.
[3] A. Roche, G. Malandain, X. Pennec, and N. Ayache, "The correlation ratio as a new similarity measure for multimodal image registration." Springer, 1998, p. 1115.
[4] D. Russakoff, C. Tomasi, T. Rohlfing, and C. Maurer, Jr., "Image similarity using mutual information of regions." Springer, 2004, pp. 596–607.
[5] P. Viola and W. Wells III, "Alignment by maximization of mutual information," International Journal of Computer Vision, vol. 24, no. 2, pp. 137–154, 1997.
[6] J. Chen and J. Tian, "Real-time multi-modal rigid registration based on a novel symmetric-SIFT descriptor," Progress in Natural Science, vol. 19, no. 5, pp. 643–651, 2009.
[7] A. Kelman, M. Sofka, and C. Stewart, "Keypoint descriptors for matching across multiple image modalities and non-linear intensity variations," in 2007 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2007, pp. 1–7.
[8] R. Ma, J. Chen, and Z. Su, "MI-SIFT: mirror and inversion invariant generalization for SIFT descriptor," in Proceedings of the ACM International Conference on Image and Video Retrieval. ACM, 2010, pp. 228–235.
[9] D. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[10] J. Maintz and M. Viergever, "A survey of medical image registration," Medical Image Analysis, vol. 2, no. 1, pp. 1–36, 1998.
[11] A. Collignon, D. Vandermeulen, P. Suetens, and G. Marchal, "3D multi-modality medical image registration using feature space clustering," in Computer Vision, Virtual Reality and Robotics in Medicine. Springer, 1995, pp. 193–204.
[12] "BrainWeb: Simulated brain database." [Online]. Available: http://mouldy.bic.mni.mcgill.ca/brainweb/
[13] K. Mikolajczyk and C. Schmid, "A performance evaluation of local descriptors," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1615–1630, 2005.
[14] Y. S. Kim, J. H. Lee, and J. B. Ra, "Multi-sensor image registration based on intensity and edge orientation information," Pattern Recognition, vol. 41, no. 11, pp. 3356–3365, 2008. [Online]. Available: http://www.sciencedirect.com/science/article/B6V144SF301J-4/2/a977f79c7ba245a24ae638678132c87e
[15] C. Fredembach and S. Susstrunk, "Illuminant estimation and detection using near-infrared," in Proceedings of SPIE, vol. 7250, 2009, p. 72500E.
[16] G. Yang, C. Stewart, M. Sofka, and C. Tsai, "Registration of challenging image pairs: Initialization, estimation, and decision," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1973–1989, 2007.
[17] W. Cheung and G. Hamarneh, "N-SIFT: N-dimensional scale invariant feature transform for matching medical images," in Biomedical Imaging: From Nano to Macro, 2007. ISBI 2007. 4th IEEE International Symposium on, April 2007, pp. 720–723.