
Evaluation of Interest Point Detectors for non-planar, transparent scenes

Chrysi Papalazarou¹, Peter M. J. Rongen², Peter H. N. de With¹,³

¹ Eindhoven University of Technology, 5600 MB Eindhoven, the Netherlands
² Philips Healthcare, 5680 DA Best, the Netherlands
³ CycloMedia Technology, 4180 BB Waardeburg, the Netherlands

Abstract. The detection of stable, distinctive and rich feature point sets has been an active area of research in the field of video and image analysis. Transparency imaging, such as X-ray, has also benefited from this research. However, an evaluation of the performance of the various available detectors for this type of image is lacking. The differences with natural imaging stem not only from the transparency but, in the case of medical X-ray, also from the non-planarity of the scenes, a factor that complicates the evaluation. In this paper, a method is proposed to perform this evaluation on non-planar, calibrated X-ray images. The repeatability and accuracy of nine interest point detectors are demonstrated on phantom and clinical images. The evaluation shows that the Laplacian-of-Gaussian and Harris-Laplace detectors give the best overall performance for the datasets used.

Key words: Interest point detection, features, X-ray, evaluation, depth estimation

1 Introduction

Interest point detection is the first step in many high-level vision tasks, either directly or through intermediate tasks like surface reconstruction, self-calibration, etc. Much research has been devoted to the task of interest point detection as a discipline, and many detection algorithms have been proposed. Several researchers have addressed the performance of subsets of these detectors in a number of evaluation papers [1–5]. In this paper, we examine a special type of content, namely transparency images, more specifically X-ray images made with a rotating C-arm. This imaging system resembles a moving camera system with known calibration parameters. A comparison of the configurations of a pinhole camera model and an X-ray source-detector system is shown in Figure 1. The evaluation of these images is challenging for two main reasons: (1) the transparency creates overlapping structures that can mislead the detection (a problem dual to occlusions in natural imaging), and (2) scenes imaged with a rotating C-arm do not comply with the planarity demand. The transparency, along with the high noise content and relatively low contrast of X-ray, complicates the detection step itself. The non-planarity forces us to consider a different evaluation scheme than what is usually applied in generic imaging.

Fig. 1. Comparison of configuration between camera (left) and X-ray system (right).

Most evaluations of interest point detection have been performed either on planar (or assumed to be planar) scenes, or on objects with a known configuration. An exception is [5], where intersecting epipolar constraints are used between triplets of calibrated views: reference, test, and auxiliary view. However, that work is motivated by object recognition applications, and uses evaluation criteria derived from a large feature database.

In our work, we propose an evaluation method that does not assume planarity, and uses multiple views, along with the calibration information, to create a reference model for each sequence. Thus, no prior statistical knowledge or database creation is needed, and the detectors can be evaluated per dataset. Additionally, in our approach the error in the reference set creation is taken into account by comparing the detectors under controlled error conditions, as will be further explained later. The evaluation does not create an absolute ground truth, and thus we cannot draw conclusions on the ability of the detectors to capture the true position of interest points. It is our opinion, however, that since such a ground truth cannot be obtained for complex scenes, a relative comparison can serve to select the best detection scheme for a given task. This evaluation is tuned to the application for which the detection step is used, namely the creation of a sparse depth map using feature point correspondences. The goal of the evaluation is to identify the detector or detectors that provide the richest and most robust feature points under various imaging conditions.

The remainder of this paper is structured as follows: first, the proposed evaluation method is described and justified in Section 2. Next, the interest point detectors used are briefly introduced (Section 3) and some implementation details are given (Section 4). The experiments performed on a number of phantom and clinical images are described, and the results of the evaluation presented, in Section 5, followed by conclusions in Section 6.

2 Description of the evaluation scheme

In scenes that can be considered planar (e.g. faraway landscapes), a typical method of interest point evaluation comprises the following steps: (1) detection of N points $x_i$, $i = 1..N$, on the image, (2) 2D transformation of the image with a known transform (e.g. affine), such that the ground-truth position $\tilde{x}'_i$ of the feature points on the second image can be predicted, (3) detection of points $x'_i$ in the second image, (4) measurement of the error $|\tilde{x}'_i - x'_i|$. The planarity of the scene allows the application of one global 2D transformation that determines the displacement of all image points. If this assumption does not hold, world points at different depths undergo different transformations, so that no single homography describes their positions. A more general approach, employing invariance methods for the evaluation of corner extraction, has been proposed by [4]. These invariants assume specific configurations of the object, e.g. polyhedra, to construct a manifold that constrains the true positions of corners. Although it is also possible to use objects of a known configuration for X-ray, for example a dodecahedron phantom (the X-ray equivalent of a checkerboard pattern), we opted to perform the evaluation on anatomical phantom images without any ground truth. There are two main motivations behind this:

1. It is precisely the performance of the detectors in “real” (or similar to real) images that we want to assess. A detector that performs well on an artificial pattern does not necessarily preserve its performance for more complex (clinical) content. This is especially important, as the different properties of medical X-ray images are motivating this research.
2. The configuration of a model object contains some inherent errors (e.g. size, placement), which in the case of depth estimation may be in the order of the errors that we want to measure, preventing its use as an absolute ground truth. An example of this is given in Figure 2, which shows the predicted position of points on a dodecahedron phantom that is assumed to be positioned at the iso-center.
The predicted positions are overlaid on the original image and compared with the positions obtained from calibration. Since calibration is performed assuming accurate 2D detection, it is clear that the inherent error in the model interferes with any attempt to measure the absolute accuracy.

Fig. 2. Comparison between predicted (theoretical) positions of lead bullets of dodecahedron phantom, with the positions extracted after calibration.

Instead, we construct an evaluation method which is inspired by the final task of 3D localization of the feature points. In this method, a subset of the available views is used to create a reference set of interest points for each detector. This is done by tracking corresponding feature points in successive views, and using these correspondences to obtain their 3D coordinates. The 3D back-projection step is performed by using a combination of intersection and resection, as described in [6], and by employing the available calibration parameters. Next, the reference points are projected onto each of the frames to be tested. Two evaluation criteria are routinely employed in interest point detection (see e.g. [1], [7]): repeatability and accuracy. Here, we have modified their definitions to fit the rest of our scheme. An interest point is considered to be repeated when its distance to the nearest projected reference point is smaller than a threshold. We define repeatability in this framework as the ratio between the number of repeated points and the size of the reference set. Accuracy is then expressed as the Euclidean distance between the matched points and the corresponding reference points. The evaluation scheme is outlined in Figure 3.

The creation of the reference set contains an error, since the true position of the points is known neither in 2D nor in 3D. This is an inherent limitation introduced by the non-coplanarity of the feature points. To control the effect of this error, we apply a selection to the reference points based on their 2D re-projection error. A point with a large re-projection error in the reference set is considered unstable and rejected. This introduces an additional evaluation criterion in our scheme: the number of points that can be back-projected with a small re-projection error, which indicates the detector’s ability to select suitable points. This metric is referred to below as the detector’s “3D reliability”. Under the assumption that the 3D error is reflected in the 2D re-projection error, this metric also implies that the influence of the 3D error is normalized between the different detectors. This means that the evaluation of detectors is performed among points that have similar re-projection errors, thus enabling a fair relative comparison on the grounds of 2D repeatability and accuracy. The different sources of error are illustrated in Figure 4.
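The three criteria described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in the paper: the function names `select_reference` and `evaluate`, the plain-tuple point representation and the example coordinates are all hypothetical. It also assumes one particular reading of the repeatability definition, namely that a reference point counts as repeated when at least one detected point lies within the matching threshold.

```python
import math

def select_reference(points, reproj_errors, tol):
    """Keep the projected reference points whose 2D re-projection error is below tol.

    The length of the returned list corresponds to the "3D reliability" count.
    """
    return [p for p, e in zip(points, reproj_errors) if e < tol]

def evaluate(reference_2d, detected_2d, match_thr):
    """Return (repeatability, accuracy) for one test frame.

    Assumption: a reference point is "repeated" if its nearest detected
    point is closer than match_thr (one possible reading of the text).
    """
    if not reference_2d or not detected_2d:
        return 0.0, float("nan")
    matched = []
    for ref in reference_2d:
        d = min(math.dist(ref, det) for det in detected_2d)
        if d < match_thr:
            matched.append(d)
    repeatability = len(matched) / len(reference_2d)
    # Accuracy: mean Euclidean distance over the matched pairs (lower is better).
    accuracy = sum(matched) / len(matched) if matched else float("nan")
    return repeatability, accuracy

# Hypothetical data: four projected reference points with re-projection errors.
projected = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0), (30.0, 0.0)]
errors = [1.0, 2.0, 4.0, 9.0]
reference = select_reference(projected, errors, tol=5.0)
detected = [(0.0, 3.0), (10.0, 1.0), (50.0, 50.0)]
rep, acc = evaluate(reference, detected, match_thr=5.0)
```

In this toy example three reference points survive the tolerance of 5 pixels, two of them are matched within 5 pixels, and the mean distance over those two matches is the accuracy figure. Varying `tol` and `match_thr` mirrors the two parameters that are varied in the experiments.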

3 Suitable interest point detection

In this work, we start from a definition of a suitable interest point as a point that can be reliably linked to a specific world point. It is of no special importance whether the point examined is expressed as a corner, a blob, or in any other form. The specific characteristics of the neighborhood also do not play an important role, as long as this neighborhood (or some description of it) allows the point to be reliably tracked, in a way that is consistent with its world position. This definition bears a resemblance to the notion of “good features to track” by Shi and Tomasi [8], in the sense that the task defines the quality of the features. Our approach incorporates the additional demand that, as we are ultimately aiming at recovering 3D structure, 3D constraints stemming from the application (i.e. calibration parameters and physical limitations in world position) are also considered.

Fig. 3. The proposed evaluation scheme

Fig. 4. Illustration of the different error types

To this end, we have tested a number of interest point detectors known from literature. These include: (1) the conventional Harris corner detector [9], (2) the morphology-based SUSAN detector [10], (3) Laplacian-of-Gaussian [11], (4) multi-scale Harris, (5) Determinant of Hessian [12], (6) normalized curvature [11], (7) Harris-Laplace [13], (8) Ridge corner points [14], [15], and (9) Ridgeness features transformed with the Euclidean distance transform.

The evaluation presented here concentrates as much as possible on the task of detection itself. Therefore, all other steps of the evaluation, including matching, sub-pixel estimation and back-projection, are kept the same for the different detectors tested. This excludes combinations of detectors and descriptors, such as the popular SIFT transform [16], from this evaluation. The detection part of SIFT and its successors [12], [17] uses an approximation of the Laplacian-of-Gaussian, which is included in this evaluation. The main power of SIFT, however, lies in the description of the feature points, which allows for a different matching scheme. The evaluation of description schemes is not within the scope of this paper.

4 Implementation details of the detectors

In the following, we provide some implementation aspects and settings of the detectors. The main consideration during the implementation has been to tune all detectors as well as possible to the high noise content and low contrast of X-ray images, which make the tradeoff between the number of detected feature points and their stability more difficult than in the case of natural images.

4.1 Parameter settings

The Harris detector was used with smoothing $\sigma = 2$ and the k factor was chosen equal to 0.08. For SUSAN, we used the original implementation provided in [10]. The multi-scale Gaussian derivatives were calculated for a number of successive scales, defined as $\sigma_D^k = s \cdot \sigma_0 \cdot q^{k-1}$, $k = 1, \ldots, n_{sc}$, where $n_{sc}$ is the number of scales and $q$ the factor between them. The parameter $s$ defines the relation between integration and derivation scale. We used $s = 0.7$, $q = 1.3$, $\sigma_0 = 1.5$ pixels and $n_{sc} = 9$ for detectors (4) through (7), as indicated above. For detectors (8) and (9) (ridge features) we used larger scales, $\sigma_0 = 5$ and $q = 1.5$, to capture the coarser elongated structures. For detector (9), the same ridgeness feature representation is used as for (8), and a binary mask is created for each scale by a soft histogram-based thresholding. Then, the Euclidean distance transform is applied to the mask image and this is used as the feature representation. This operation aims at capturing high-curvature points on the centerlines of coarse ridges.
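To make the scale progression concrete, the following minimal Python sketch evaluates the scale formula for the stated settings. The function name `derivation_scales` is hypothetical, and the ridge-detector call assumes that $s$ and $n_{sc}$ keep the same values as for detectors (4) through (7), which the text does not state explicitly.

```python
def derivation_scales(s=0.7, sigma0=1.5, q=1.3, n_sc=9):
    """Derivation scales sigma_D^k = s * sigma0 * q**(k - 1) for k = 1..n_sc."""
    return [s * sigma0 * q ** (k - 1) for k in range(1, n_sc + 1)]

# Settings quoted for detectors (4) through (7).
scales = derivation_scales()

# Settings quoted for the ridge detectors (8) and (9).
# Assumption: s and n_sc are unchanged; only sigma0 and q are given in the text.
ridge_scales = derivation_scales(sigma0=5.0, q=1.5)

for k, sigma in enumerate(scales, start=1):
    print(f"k={k}: sigma_D = {sigma:.3f} px")
```

With these settings the derivation scale starts at $0.7 \cdot 1.5 = 1.05$ pixels and grows by a factor of 1.3 per level.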

4.2 Maxima selection

Each of the detectors provides a feature representation of the image, from which the most prominent points are selected. We select the 500 largest extrema as feature points. This ensures comparability between the detectors and does not affect the repeatability measurement in our scheme. For the case of multi-scale features, scale-space maxima are used [16], where the neighborhood for the maxima selection is larger for small scales, to avoid concentration of noisy feature points very close to each other. We use the empirical relation $n_b = \lceil \sigma_{n_{sc}} \cdot \sigma_0 / \sigma_k^2 \rceil$ to obtain appropriate neighborhood sizes. For all tested detectors, a selection is applied to the candidate feature points. Points located on significant edges, where the localization is expected to be poor, are rejected by examining the ratio of the eigenvalues of the Hessian matrix in a patch of size $3\sigma$ around the candidate point, as in [16]. A final rejection is applied to points with a low rms value of the eigenvalues, as these points correspond to little or no local structure. The parameters of this selection were the same for all detectors.

4.3 Matching

In the experiments, we used a combination of correlation-based matching and geometric constraints. Candidate matches are searched for in a block of the target image, defined by the epipolar constraints [18] and a clipping volume that restricts the 3D position of the features. The epipolar constraints are specified using the known calibration parameters, while the clipping volume constraint stems from the configuration of the X-ray system, where the object is always positioned between the X-ray source and the detector.

4.4 Sub-pixel accuracy

The accuracy of the detected interest points is extended by applying spline interpolation on the feature representations. For multi-scale detectors, the interpolation is applied in the 3D volume of the feature representation, using a patch of size equal to the scale of the feature. The sub-pixel accuracy is not used in the matching step, as that would involve a heavy computational cost due to dense interpolation of the feature representations. Instead, it is applied after the matching to obtain better accuracy in the back-projection step.
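The idea of refining an integer extremum after matching can be illustrated with a much simpler stand-in for the spline interpolation described above: a three-point parabola fit applied separately along each image axis. This sketch is not the method used in the paper; the function names and the toy response map are hypothetical, and the fit is only exact for locally quadratic responses.

```python
def refine_peak_1d(f_prev, f_peak, f_next):
    """Sub-sample offset of an extremum from three equally spaced samples.

    Fits a parabola through the samples at -1, 0, +1 and returns the
    position of its vertex, clamped to [-0.5, 0.5].
    """
    denom = f_prev - 2.0 * f_peak + f_next
    if denom == 0.0:
        return 0.0  # flat neighborhood: keep the integer position
    offset = 0.5 * (f_prev - f_next) / denom
    return max(-0.5, min(0.5, offset))

def refine_peak_2d(response, row, col):
    """Refine an interior integer extremum (row, col) of a 2D response map."""
    d_row = refine_peak_1d(response[row - 1][col], response[row][col], response[row + 1][col])
    d_col = refine_peak_1d(response[row][col - 1], response[row][col], response[row][col + 1])
    return row + d_row, col + d_col

# Toy response: a paraboloid whose true maximum lies at (1.2, 0.9).
response = [[-((r - 1.2) ** 2) - (c - 0.9) ** 2 for c in range(3)] for r in range(3)]
refined_row, refined_col = refine_peak_2d(response, 1, 1)
```

Because the refinement only touches the neighborhood of already matched points, its cost stays small, which matches the motivation given above for applying it after the matching step rather than densely over the whole feature representation.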

5 Experimental results

The above-described evaluation was applied to two sequences of phantom images and two of clinical images, examples of which are shown in Figure 5. The sequences were made using an X-ray system equipped with a rotational C-arm and were 122 frames long. For each sequence, sub-sequences were selected to create the reference set, where the length of the sub-sequence was chosen such that a minimum number of 10 points is back-projected with an error below a predetermined tolerance. This tolerance was varied as part of the evaluation. Different sub-sequences of 20 frames (6 in total) were used to create the reference points. The number of reference points for the different datasets is shown in Figure 6.

Fig. 5. Examples of the data. Phantom images: (a) Knee phantom with wires, (b) Chest phantom with catheter. (c), (d): Clinical images of head angiograms.

Fig. 6. Number of back-projected points for a back-projection error tolerance of 5 pixels and a matching distance of 5 pixels.

After the reference set was created, two parameters were varied: the tolerance in the creation of the reference points (back-projection error), and the threshold of the matching (termed recall-precision in [19]). The repeatability and accuracy were measured for each frame in the sequence, excluding the ones used for the reference creation, and averaged over the number of sub-sequences. The results are shown in Figures 7 and 8.

Fig. 7. Repeatability for: (a) Knee phantom, (b) Chest phantom, (c) Sequence 1 of head angiograms, (d) Sequence 2 of head angiograms.

Fig. 8. Accuracy for: (a) Knee phantom, (b) Chest phantom, (c) Sequence 1 of head angiograms, (d) Sequence 2 of head angiograms.

In Figure 6, the difference in the number of back-projected points among datasets is evident. Since the same parameters were used in all sequences, this reflects the relative sensitivity of all the tested detectors to the content. The two sequences for which many points can be tracked within a small sub-sequence are the chest phantom and the first of the clinical sequences. In those, however, the points are less repeated over the entire sequence, as can be seen in Figure 7. The other two sequences, knee phantom and second clinical sequence, achieve a far smaller number of 3D points, but a larger repeatability. This is possibly due to the fact that in these sequences high-contrast objects (the wires in the first and the tooth implants in the second) are visible throughout the sequence and can be found reliably. For both sequences, the Laplacian-of-Gaussian scores high, while they individually achieve maximum performance with the Determinant-of-Hessian and single-scale Harris, respectively. The other two datasets (chest and clinical 2) respond best to Harris-Laplace, but here also the LoG scores high. However, the repeatability scores in this case lie almost a factor of 2 below the knee and clinical 1 datasets.

In the accuracy (2D error) measurements, the chest phantom and clinical 1 datasets show little variation between detectors, with a slightly better performance (lower error) of the Determinant-of-Hessian. The detection error increases approximately linearly with matching tolerance, as expected from the evaluation scheme: increases in the matching threshold are added to the mean detection error. In the knee and clinical 2 datasets, again the LoG scores well (best score in clinical 2). In the knee dataset, the Harris and SUSAN detectors score surprisingly well, Harris even better than its multi-scale version. This may be explained by the content of the sequence. It is mainly the points on the wires that can be tracked and back-projected reliably in this sequence. Their high contrast and distinctiveness, along with the relatively smooth background, enable the morphology-based SUSAN detector to achieve good accuracy; however, its low repeatability for the same dataset excludes it from being the best option. Additionally, it must be noted that the single-scale detectors only perform well when the kernel size is tuned to the size of the detected objects. This may be the reason that SUSAN and Harris score much lower in e.g. the chest sequence, where objects of different scales are present.

6 Conclusions

This paper has presented an analysis of feature point evaluation for non-planar scenes captured by a rotating X-ray system. We have addressed the implications of the non-planarity and proposed an evaluation method to overcome them. While the proposed method cannot guarantee absolute error measurements, the creation of a reference set from the image data with a controlled back-projection error serves to make these measurements comparable. This allows for the selection of the best detection algorithm for a given task. In the context of sparse depth estimation, a good feature point detector should: (1) lead to a reliable back-projection for as many points as possible, (2) allow the establishment of reliable correspondences under projective transformation of the image, and (3) localize the detected points accurately. According to these criteria, we have found the Laplacian-of-Gaussian and Harris-Laplace detectors to score high overall in repeatability for the 4 datasets used. Little variation occurred between the multi-scale detectors in terms of localization accuracy. The variation between results for different types of content suggests an interesting topic for future work: to investigate the response of these, or more, detectors for different types of clinical images, as this may allow content-dependent selection of the detection method for each clinical application.

References

1. Schmid, C., Mohr, R., Bauckhage, C.: Evaluation of interest point detectors. International Journal of Computer Vision 37(2) (2000) 151–172
2. Remondino, F.: Detectors and descriptors for photogrammetric applications. In: ISPRS III. (2006)
3. Mokhtarian, F., Mohanna, F.: Performance evaluation of corner detectors using consistency and accuracy measures. Computer Vision and Image Understanding 102(1) (2006) 81–94
4. Heyden, A., Rohr, K.: Evaluation of corner extraction schemes using invariance methods. Pattern Recognition, International Conference on 1 (1996) 895
5. Moreels, P., Perona, P.: Evaluation of features detectors and descriptors based on 3D objects. International Journal of Computer Vision 73(3) (2007) 263–284
6. Chen, Q., Medioni, G.G.: Efficient iterative solution to M-view projective reconstruction problem. In: CVPR. (1999) 2055–2061
7. Farin, D.: Automatic video segmentation employing object/camera modeling techniques. PhD thesis. (2005)
8. Shi, J., Tomasi, C.: Good features to track. In: Computer Vision and Pattern Recognition, 1994. Proceedings CVPR ’94., 1994 IEEE Computer Society Conference on. (1994) 593–600
9. Harris, C., Stephens, M.: A combined corner and edge detector. In: Proc. of 4th Alvey Vision Conference. (1988) 147–151
10. Smith, S.M., Brady, J.M.: SUSAN - a new approach to low level image processing. International Journal of Computer Vision (1997) 45–78
11. Lindeberg, T.: Feature detection with automatic scale selection. International Journal of Computer Vision 30 (1998) 79–116
12. Bay, H., Tuytelaars, T., Van Gool, L.: SURF: Speeded up robust features. In: European Conference on Computer Vision. (May 2006)
13. Mikolajczyk, K., Schmid, C.: An affine invariant interest point detector. In: Proceedings of the 7th European Conference on Computer Vision, Copenhagen, Denmark, Springer (2002) 128–142
14. Shilat, F., Werman, M., Gdalyahu, Y.: Ridge’s corner detection and correspondence. Computer Vision and Pattern Recognition 0 (1997) 976
15. Maintz, J.B.A., van den Elsen, P.A., Viergever, M.A.: Evaluation of ridge seeking operators for multimodality medical image matching. Pattern Analysis and Machine Intelligence, IEEE Transactions on 18(4) (1996) 353–365
16. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision 60 (2004) 91–110
17. Mikolajczyk, K., Schmid, C.: Scale & affine invariant interest point detectors. International Journal of Computer Vision 60(1) (2004) 63–86
18. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision. Second edn. Cambridge University Press, ISBN: 0521540518 (2004)
19. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. IEEE Trans. Pattern Anal. Mach. Intell. 27(10) (2005) 1615–1630