A Unifying Framework for Mutual Information Methods for Use in Non-linear Optimisation

Nicholas Dowson and Richard Bowden
Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, GU2 7JW, UK
{n.dowson;r.bowden}@surrey.ac.uk
http://www.ee.surrey.ac.uk/personal/n.dowson
Abstract. Many variants of MI exist in the literature. These vary primarily in how the joint histogram is populated. This paper places the four main variants of MI, namely standard sampling, Partial Volume Estimation (PVE), In-Parzen Windowing and Post-Parzen Windowing, into a single mathematical framework. Jacobians and Hessians are derived in each case. A particular contribution is that the non-linearities implicit to standard sampling and Post-Parzen Windowing are explicitly dealt with; these non-linearities are a barrier to their use in optimisation. A side-by-side comparison of the MI variants is made using eight diverse data sets, considering computational expense and convergence. In the experiments, PVE was generally the best performer, although standard sampling often performed nearly as well (if a higher sample rate was used). The widely used sum of squared differences metric performed as well as MI unless large occlusions and non-linear intensity relationships occurred. The binaries and scripts used for testing are available online.
1 Introduction
Our aim is to place the common variants of Mutual Information (MI) into a single mathematical framework and to provide their analytic derivatives for use in non-linear optimisation methods. Furthermore, an evaluation of the MI variants is provided, allowing other researchers to choose a particular variant in an informed manner. We demonstrate that the four most commonly used variants, namely standard sampling, Partial Volume Estimation, In-Parzen Windowing and Post-Parzen Windowing, vary primarily in how the joint histogram is sampled. The Jacobians and Hessians are derived for all these methods using an approach similar to that of Thevenaz and Unser [1], who considered In-Parzen Windowing only. In the cases of standard sampling, Post-Parzen Windowing and higher-order Partial Volume Estimation, this is novel. Using the established framework, the methods are compared in terms of computational cost and convergence to the ground truth for eight data sets. Generally, papers using MI choose the method reported to best suit their application, without the scope to consider other methods. The obvious exceptions are several papers that discuss artefacts on the MI cost function surface [2, 3] and optimisation strategies [4].
However, only empirical analyses are made there (unlike the analytic comparisons here). Also, neither artefacts nor optimisation strategies are the focus here. Rather, we focus on providing a common framework and a side-by-side comparison of MI methods, which to the authors' knowledge is currently unpublished elsewhere. The binaries and scripts used for testing are available online.

Registration, or aligning one image (the template) relative to another image (the reference), is a common problem in many machine vision applications: e.g. tracking, image mosaicking, object matching, and multi-modal registration in medical imaging. A widespread strategy for registration is to minimise (or maximise) a similarity metric between a template and the region of overlap in a reference image using an optimisation algorithm. MI has proved to be superior to many other metrics. Since its concurrent introduction and popularisation by Viola and Wells [5], Studholme et al. [6] and Collignon et al. [7], it has been widely adopted.

Shannon proposed MI [8] in his theory of information as a measure of the entropy of the information shared between two signals with quantised amplitudes over a period of time. It is a simple extension to consider 2D or 3D images rather than 1D signals, consisting of quantised intensities over a 2D/3D space.

MI has been applied using many different optimisation methods with varying degrees of success, including the simplex algorithm [9], Powell's method [7, 10], gradient descent [11], hierarchical brute-force searches [12], and hierarchical approaches [1, 13]. Pluim et al.'s survey [14] cites many more examples. Several optimisation methods were systematically compared by Maes in [15]. Due to space constraints, such a comparison is beyond the scope of this paper, but all the optimisation methods in [16] (Ch. 10) and the Levenberg-Marquardt algorithm [17] have been implemented.

The advantages of MI include invariance to changes in lighting conditions, robustness to noise, sharp maxima and computational simplicity [18]. In a comparative study of registration methods, an MI-based algorithm outperformed 15 other algorithms [19]. However, MI is a non-linear function and is prone to artefacts in its cost function surface. To overcome this, other forms of MI have been developed. One approach, used by Wells et al. [11], is to convolve the histogram with a Parzen window [20], to account for uncertainty in the intensity values. Thevenaz and Unser have a more sophisticated Parzen windowing method using B-splines [1], which are applied during the construction of the histogram, giving more accurate results. In addition, Partial Volume Interpolation was introduced by Maes et al. [10], which increments several histogram bins for each sample based on the distance of the sample point from the surrounding pixels. Chen and Varshney extended this concept to Generalised Partial Volume Estimation, which uses extended spatial support [21].

The remainder of the paper is organised as follows. Section 2 reviews MI along with the four common sampling methods. After this, the first and second derivatives are derived in Section 3, and some analysis is performed in Section 4. Next, two example applications are discussed with a corresponding set of experiments in Section 5. The conclusion follows in Section 6.
2 Mutual Information

2.1 Registration
To start, a brief formalisation of the registration process is required. Let $f_R$ represent a reference image, and let $f_T$ represent a template image. Both images are functions of position $x \in \mathbb{R}^2$, although only trivial changes in the analysis below are required if the dimension of $x$ is altered to represent volumetric data. Since $f_R$ and $f_T$ are represented as lattices of values at integral positions of $x$, interpolation is used to obtain values at non-integral positions. There is insufficient space to discuss the choice of an interpolation method, but this is an important design issue; the interested reader is referred to a survey by Amidror [22]. For convenience and computational efficiency, $f_R$ is treated as infinite in extent and sampling is only performed within the bounds of the lattice of $f_T$. Regions outside the defined lattice of $f_R$ are defined as 0. Hence $f_T$ is considered constant with respect to any warp, and expensive boundary checking is avoided. The registration process aims to align $f_R$ and $f_T$ by minimising a distance function $D$ for some warp function $w$ with parameters $v$:

$$v_{reg} = \arg\min_v D[f_R(x), f_T(w(x, v))]$$

For computational reasons (because $f_T$ is usually a smaller data set than $f_R$), it is easier to reformulate the problem as one of applying an inverse warp. Also, the function being minimised is MI, denoted by convention as $I$:

$$v_{reg} = \arg\min_v -I[f_R(w^{-1}(x, v)), f_T(x)]$$

To maintain notational clarity, $w^{-1}(x, v)$ is referred to hereafter as $x_w$. The negative sign is required because MI has larger values for better matches, and we wish to maintain the convention of referring to function minimisation.
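As a concrete illustration of this formulation, the sketch below evaluates the inverse-warp objective for the special case of a translation warp, leaving the distance $D$ pluggable so that $-I$ (defined in Section 2.2) can be substituted. This is our own minimal sketch, not the authors' implementation; all function names and the translation-only warp are illustrative assumptions.

\begin{verbatim}
import numpy as np
from scipy.ndimage import map_coordinates
from scipy.optimize import minimize

def sample_warped_reference(f_ref, shape, v):
    """Sample f_R at the inversely warped template coordinates x_w.
    Illustration only: the warp is restricted to a translation
    v = (dy, dx); regions outside f_R's lattice evaluate to 0,
    as in the text."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]].astype(float)
    return map_coordinates(f_ref, [ys + v[0], xs + v[1]],
                           order=1, cval=0.0)

def registration_objective(v, f_ref, f_tmp, distance):
    """D[f_R(w^{-1}(x, v)), f_T(x)] for a pluggable distance,
    e.g. -I or SSD."""
    return distance(sample_warped_reference(f_ref, f_tmp.shape, v),
                    f_tmp)

# v_reg = argmin_v D; for example with SSD as the distance:
# ssd = lambda a, b: np.sum((a - b) ** 2)
# v_reg = minimize(registration_objective, np.zeros(2),
#                  args=(f_ref, f_tmp, ssd), method='Powell').x
\end{verbatim}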
2.2 Histogram Estimation
A measure of the information mutual to $f_T$ and the corresponding region in $f_R$ is obtained from the joint intensity histogram $h(r, t, v)$ of the two images. Here $r \in [0; r_{mx}]$ and $t \in [0; t_{mx}]$, with $r, t \in \mathbb{Z}$, index the intensities of which $f_R$ and $f_T$ respectively consist. The histogram may be normalised to give an approximation of the probability distribution function (PDF) of intensities, i.e. $p(r, t, v) = \frac{1}{N_x} h(r, t, v)$, where $N_x$ is the number of samples in the histogram. MI is defined here in terms of $p$ rather than $h$ for clarity, and the dependence on $v$ is explicitly indicated:

$$I(v) = \sum_{r,t} p_{rt}(r, t, v) \log \frac{p_{rt}(r, t, v)}{p_r(r, v)\, p_t(t)} \qquad (1)$$

A more common form of (1) has three entropy terms: $I = H_r + H_t - H_{rt}$. The two forms are exactly equivalent, and the more condensed form above is used for conciseness. The PDFs $p_r$ and $p_t$ are easily obtained from the joint PDF, since $p_r = \sum_t p_{rt}$ and $p_t = \sum_r p_{rt}$.
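For intuition, Eq. (1) can be evaluated directly from a normalised joint histogram. The short sketch below is a minimal illustration under our own naming conventions (not the authors' code), and also checks the equivalence with the entropy form $I = H_r + H_t - H_{rt}$.

\begin{verbatim}
import numpy as np

def mutual_information(p_rt):
    """Evaluate Eq. (1) from a normalised joint PDF p_rt, a 2-D
    array indexed by (r, t). Minimal sketch; names are our own."""
    p_r = p_rt.sum(axis=1)          # p_r = sum over t of p_rt
    p_t = p_rt.sum(axis=0)          # p_t = sum over r of p_rt
    nz = p_rt > 0                   # convention: 0 log 0 = 0
    return np.sum(p_rt[nz] * np.log(p_rt[nz] /
                                    np.outer(p_r, p_t)[nz]))

# Equivalence to the entropy form I = H_r + H_t - H_rt:
# H = lambda q: -np.sum(q[q > 0] * np.log(q[q > 0]))
# assert np.isclose(mutual_information(p),
#                   H(p.sum(1)) + H(p.sum(0)) - H(p))
\end{verbatim}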
Note the treatment of $r$ and $t$ as discrete variables (or indices), indicating the finite bin size of the histogram $h$ from which $p$ is derived. MI is not invariant to the bin size $\Delta i$, which limits its bounds, as does the number of sample points: $I \le \log\left(\min\left(\frac{r_{mx}}{\Delta i}, \frac{t_{mx}}{\Delta i}, k_{mx} N_x\right)\right)$, where $k_{mx}$ indicates the number of histogram bins populated per sample. The joint histogram is defined in terms of two window functions $\psi()$, which act as membership functions:

$$h(r, t, v) = \sum_x \psi\left(r - \frac{f_R(x_w)}{\Delta i}\right) \psi\left(t - \frac{f_T(x)}{\Delta i}\right) \qquad (2)$$

where each sample taken from $f_R$ and $f_T$ is added to one histogram bin:

$$\psi(\epsilon) = \begin{cases} 1 & |\epsilon| \le \frac{1}{2} \\ 0 & \text{otherwise} \end{cases}$$
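As a minimal sketch of Eq. (2) under the top-hat window, each $(f_R(x_w), f_T(x))$ sample pair increments exactly one bin, i.e. standard sampling. The variable names and the scaling convention below are our own assumptions; replacing the hard binning with a smoother $\psi$, such as a B-spline, yields the Parzen and partial-volume variants discussed above.

\begin{verbatim}
import numpy as np

def joint_histogram(r_samples, t_samples, n_bins, delta_i):
    """Populate h(r, t) per Eq. (2) with the top-hat window:
    psi is 1 for the nearest bin index and 0 elsewhere, so each
    sample pair increments exactly one bin (standard sampling).
    Intensities are assumed non-negative, with delta_i chosen so
    indices fall within [0, n_bins - 1]."""
    r_idx = np.clip(np.round(r_samples / delta_i).astype(int),
                    0, n_bins - 1)
    t_idx = np.clip(np.round(t_samples / delta_i).astype(int),
                    0, n_bins - 1)
    h = np.zeros((n_bins, n_bins))
    np.add.at(h, (r_idx.ravel(), t_idx.ravel()), 1)
    return h

# p = joint_histogram(r, t, 32, delta_i) / r.size, which then
# feeds Eq. (1) via mutual_information(p).
\end{verbatim}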