
Automatic Registration of Multispectral Images through Maximization of Mutual Information

Pietro Guccione (a), Luigi Mascolo (a), Giuseppe Cifarelli (b), Cristoforo Abbattista (b) and Mario Tragni (b)

(a) Politecnico di Bari, via Orabona 4, Bari, Italy
(b) Planetek srl, via Massaua 12, Bari, Italy

ABSTRACT

In this paper we propose a method for the fine registration of high-resolution multispectral images. The algorithm assumes that a coarse registration, based on ancillary information, has already been performed. Residual distortions are known to remain after this step, due to the combined effects of Earth rotation and curvature, viewing geometry, sensor operation, variations in platform velocity, and atmospheric and terrain effects. The main idea of the algorithm comes from the information-theoretic approach used to register volumetric medical images of different modalities. Registration is achieved by adjusting the relative position and orientation of the images until their mutual information is maximized. The idea is that the joint information is maximized when the two images are at their best registration. This approach works directly on image data, but in principle it can be applied in any transformed domain. While the original algorithm was designed to perform registration in a limited search space (i.e. translation and orientation), in the remote sensing framework the class of transformations is extended to allow scaling, shearing or a general polynomial model. The target function is maximized using both the stochastic gradient descent algorithm and simulated annealing, since the former is known to occasionally get stuck in local maxima. We applied the algorithm to a pair of SPOT-5 images, registering chips of 256×256 pixels at a time. Accuracy was assessed by comparing the results with the output of a commercial software package that adopts a form of Normalized Cross-Correlation. On 143 chips taken throughout the image, the final translation accuracy was well below 1 pixel and the rotation accuracy about 0.015°.

Keywords: Image Registration, Mutual Information, Gradient Descent, Simulated Annealing

1. INTRODUCTION

Image registration is one of the most important preprocessing steps for studying changes of a scene over time. Two major application fields are remote sensing, where registration is essential for analysing environmental changes across seasons, and biomedical imaging, where it is used to compare analyses obtained with different techniques or at different times. In the context of remote sensing, image registration is a particularly challenging task, as it underpins the prediction of regional climate changes on seasonal and interannual time scales and the understanding of the interactions between human activity and changes in major Earth ecosystems. In this framework, the integration of data coming from different missions plays a key role. The integration problem concerns the creation of seamless mosaics of data coming from sensors that can differ in band, resolution and acquisition geometry; further differences in the data are due to changes of the scene over time, including illumination (for passive sensors) and long-term and short-term changes.

A major example comes from the Landsat program, which has provided remote sensing information for over 40 years, maintaining a continuous and seamless stream of data (http://ldcm.usgs.gov/). Started in 1972, the program continues today with the new Landsat 8, a push-broom sensor able to collect images in the visible, near infrared (NIR) and shortwave infrared wavelength regions, as well as in a panchromatic band with a resolution of up to 15 m. Another example is given by the SPOT mission (http://www.astrium-geo.com/en/143-spot-satellite-imagery). The last one, SPOT-5, launched in 2002, includes two high geometric resolution optical instruments, a high-resolution stereoscopic viewing instrument for relief mapping and a low-resolution instrument that provides continuity of environmental monitoring around the globe. SPOT-5 offers a resolution improved up to 2.5 m and a wide imaging swath, covering 60 × 60 km, or 60 × 120 km in twin-instrument mode. The huge amount of images generated by these two missions can hardly be managed for regional studies without pre-processing steps such as registration.

P. Guccione and L. Mascolo are with the Department of Electrical and Information Engineering, Technical University of Bari. e-mail: [email protected]. C. Abbattista, G. Cifarelli and M. Tragni are with Planetek srl, Italy.

Image and Signal Processing for Remote Sensing XX, edited by Lorenzo Bruzzone, Proc. of SPIE Vol. 9244, 92440C · © 2014 SPIE · CCC code: 0277-786X/14/$18 · doi: 10.1117/12.2065296

Basically, registration algorithms attempt to recover the transformation parameters that describe how one image maps to another. In the remote sensing context this often means registering one image to another that has previously been registered on a common reference such as, for example, an Earth projection on a 2-D Cartesian coordinate system. The different kinds of distortion in remote sensing images make it extremely difficult for two images, acquired by different sensors (or even by the same sensor), to be 'perfectly registered' to each other or to a fixed coordinate system. These distortions usually result from the combined effects of Earth rotation and curvature, sensor operations (such as scan time skew and scan nonlinearities), variations in platform orbit (position, velocity and attitude), and atmospheric and topographic effects [1, 2]. For this reason, it is very difficult to determine the exact registration using only ancillary data. Accurate registration therefore involves a two-step procedure [3]. First, the systematic errors due to the different acquisition geometry are corrected using the satellite ephemeris and a simplified Earth surface (smooth ellipsoid).
The output of this registration procedure is a polynomial model that gives, for each sample coordinate in the reference image, the corresponding (possibly fractional) coordinates in the image to be registered. The complexity of this polynomial depends on the particular pair of images to register, the number of corresponding pixels, the geometry of the two images, and so on. One of the most widely used models is the Rational Function Model [3–5], used for example in the Ikonos mission. Orbit-based registration yields an accuracy within a few pixels. For example, navigation errors range from 3–10 pixels (i.e., 0.1–0.3 km) for Landsat-5 data, and from one to three pixels (i.e., 50–100 m) for Landsat-7 data [6].

In this paper we are interested, instead, in registration based on data content, i.e. on image features and/or landmarks. The idea is to exploit the natural features of the image to obtain a pixel-by-pixel registration, even when the transformation function is very complicated. Many algorithms have been proposed to perform registration based on image similarity. Among these, correlation function registration, point mapping registration [7] and feature matching (control points, corners, segments) [8–10] are the most popular; they differ in the domain in which registration takes place (spatial or transformed domain [9]) or in the function to be optimized. In our algorithm, registration is achieved by maximizing the Mutual Information (MI) of the two images. The proposed method exploits concepts derived from Information Theory and combines the information of the two images, providing a more compact representation of the joint information. A key point of the method is that the joint information involves the entire image and not only its most relevant parts, such as segments, landmarks or topographic elements [8, 11]. The algorithm works iteratively until the optimal set of transformation parameters, up to a pre-defined precision, is reached.
At each iteration, a general transformation is applied to the test image, which is then interpolated on the reference grid. The computed mutual information drives the choice of the new set of transformation parameters to apply in the next iteration by means of a maximization method such as gradient descent.

The paper is organized as follows. In Section 2 the key concepts of the algorithm are explained, while in Section 3 the algorithm is detailed and the experimental choices are justified. In Section 4 the experiments are described and the registration accuracy is reported. Finally, in Section 5 some conclusions are drawn.

2. PROBLEM STATEMENT

The four key factors to deal with in the registration problem are [6, 12]:

1. The Feature space: the domain from which the information for matching is taken;


2. The Search space: the class of transformations that establish the correspondence between the two images;
3. The Search strategy: in short, the algorithm used to find the optimal transformation;
4. The Similarity metric: the function or figure of merit used to look for the optimal match in the feature space.

A robust registration method must address all of these issues. The problem of registering remotely sensed images is formalized as follows. Let us denote by $p(z)$ and $r(z)$ the two images to be registered, where $z$ is the set of coordinates (rows and columns, for example). Without loss of generality, we consider $r(z)$ the reference image, while $p(z)$ is the image to be registered on the grid of $r(z)$. The reference image is assumed to be already registered on a reference grid of coordinates, so that the registration process brings the test image $p(z)$ onto the same grid after a final interpolation. We remark that a good registration algorithm includes, as its final step, the optimal choice of the 2D interpolation function. This choice depends on considerations such as the relative band of the data, the precision of the interpolation and the computational complexity, and is outside the scope of this paper.

The key concept of registration is the mapping of the test image onto the grid coordinates of the reference image by applying a set of (generally nonlinear) transformations. The most general way to represent a transformation is through a function $T(z)$ that maps the set of coordinates $z$ to another set of coordinates on the same domain. As an example, shift, rotation and scaling are represented, for a 2-dimensional set of coordinates $z = (i, j)$, as:

$$
\begin{aligned}
\hat{i} &= \alpha\,(i\cos\theta - j\sin\theta - i_0) \\
\hat{j} &= \alpha\,(i\sin\theta + j\cos\theta - j_0)
\end{aligned}
\tag{1}
$$

where $\alpha$ is the scaling ($\alpha = 1$ means no scaling), $(i_0, j_0)$ are the shifts and $\theta$ is the rotation angle. From Eq. (1), the generalization to more complex transformations, for example a higher-order polynomial model, is straightforward. More complex transformations make the result more precise but imply a larger search space in which the solution must be found. In the Landsat specifications, for example, only affine transformations (shift and rotation) are used, due to the limited spread of the orbital tube and to the acquisition geometry.

Provided that $T$ is the transformation from the coordinate frame of the test image to the reference frame, the test image in the reference set of coordinates can be written as $p(T(z))$. To simplify the notation, some of the subsequent equations use $T$ to denote both the transformation and its set of parameters (for example $\{\alpha, i_0, j_0, \theta\}$).

In the literature, the Fourier transform, wavelets and other domains have been successfully exploited as the domain from which the information for matching is taken. As an example, registration based on image features or landmarks is achieved in the wavelet domain in Netanyahu [6] and LeMoigne [13]. Let us suppose, then, that both images have been transformed into a new domain where registration is more promising. After the transformation, we denote by $P(x)$ and $R(x)$ the pair of images in the new domain of coordinates $x$:

$$
p(z) \xrightarrow{D} P(x), \qquad r(z) \xrightarrow{D} R(x) \tag{2}
$$

where $D$ is the generic transformation applied. We look for an estimate of the transformation parameters that registers the reference and test images by maximizing their mutual information $I$ in the transformed domain:

$$
\hat{T} = \arg\max_{T} \left\{ I\left[ R(x), D\{p(T(z))\} \right] \right\} \tag{3}
$$

where the mutual information is defined in terms of the entropy of the (transformed) images:

$$
I\left[ R(x), D\{p(T(z))\} \right] = h(R(x)) + h(D\{p(T(z))\}) - h(R(x), D\{p(T(z))\}) \tag{4}
$$

The basic concept behind the algorithm is that the images are modeled as a stochastic field of which $R$ and $P$ are two realizations. Moreover, the algorithm does not use any prior knowledge of the image statistics, nor of the domain in which the method is applied, and it exploits the whole image rather than only some details, as point mapping registration methods do. Finally, it is more robust than correlation, since it allows for a nonlinear relation between the images: entropy is based on both the occurrence and the information carried by the stochastic field. The entropy can be interpreted as a measure of uncertainty, variability, or complexity of the underlying function.

The mutual information defined in Eq. (4) has three components. The first term on the right is the entropy of the reference image, and is not a function of $T$. The second term is the entropy of the part of the test image into which the reference image projects. Maximization encourages transformations that project $P$ into complex parts of $R$. The third term, the (negative) joint entropy, contributes when the two images are functionally related and encourages transformations where $R$ explains $P$ well. Together the last two terms identify transformations that find complexity and explain it well [14].
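As an illustration of Eq. (4), the following minimal Python sketch estimates the mutual information of two images from their 1D and 2D histograms. It is not the authors' implementation: the function names, the flat-list image representation and the uniform binning are assumptions made for the example, and the discrete (histogram) entropy is used in place of the differential entropy.

```python
import math
from collections import Counter


def entropy(counts, total):
    """Shannon entropy (in nats) of a histogram given as bin counts."""
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def mutual_information(ref, test, bins=64, levels=256):
    """Histogram estimate of I = h(R) + h(P) - h(R, P), as in Eq. (4).

    ref, test: equal-length sequences of integer gray values in [0, levels).
    bins, levels: assumed uniform binning (hypothetical defaults).
    """
    if len(ref) != len(test):
        raise ValueError("images must have the same number of samples")
    width = levels // bins                      # gray levels per bin
    r_bins = [v // width for v in ref]
    p_bins = [v // width for v in test]
    n = len(ref)
    h_r = entropy(Counter(r_bins).values(), n)  # marginal entropy of the reference
    h_p = entropy(Counter(p_bins).values(), n)  # marginal entropy of the test image
    h_rp = entropy(Counter(zip(r_bins, p_bins)).values(), n)  # joint entropy
    return h_r + h_p - h_rp


# Toy usage: an image compared with itself and with a constant image.
reference = [0, 64, 128, 192] * 100
mi_same = mutual_information(reference, reference)
mi_flat = mutual_information(reference, [0] * len(reference))
```

In this toy case the image is maximally informative about itself, while a constant image carries no information about the reference, so `mi_same` is larger than `mi_flat`.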

3. ALGORITHM DESCRIPTION

The algorithm scheme is outlined in Fig. 1 and described in the following.

[Block diagram. Reference image r(z) → domain transform → R(x). Test image p(z) → geometric transformation and interpolation → p(T(z)) → domain transform → D{p(T(z))}. Both feed the mutual information block, I[R(x), D{p(T(z))}], followed by the optimization algorithm, which updates T^(i) to T^(i+1) and outputs the optimal set of transformations.]

Figure 1. A block diagram of the registration algorithm. Please note that the optimization algorithm has not been specified since, in principle, different algorithms can be used to maximize the mutual information. In the same way, the domain transformation has not been specified.

As previously said, the mutual information may be optimized in different domains. In the present work we chose to compute the mutual information in the space domain, so no domain transformation has been applied. The choice was dictated by two considerations. First, the original method of registration through maximization of mutual information was successfully applied in the space domain in the context of biomedical images [15]. Second, a domain transformation (such as the FFT) adds a computational burden that must be repeated at every iteration. The application of Fourier or wavelet transforms to remotely sensed images for registration through the maximization of mutual information may nevertheless be a matter for future investigation. In our case we have:

$$
\begin{cases}
x = z \\
R(x) = r(z) \\
P(x) = p(z) \\
D\{p(T(z))\} = P(T(x))
\end{cases}
\tag{5}
$$

Transforming 2-dimensional coordinates is a simple matter of applying a polynomial model. In the present case we applied three transformations, i.e. shift in both directions, rotation with respect to the image center and scaling, as already detailed in Eq. (1). The set of unknown parameters to find is then:

$$
T = \{\alpha, i_0, j_0, \theta\}
$$

After the geometric transformation has been applied, the test image is interpolated in the reference frame. For the interpolation we used a cubic interpolator, since it is a good compromise between computational complexity (4 multiplications in each dimension to obtain a single output sample) and interpolation accuracy (the interpolator's transfer function preserves the input signal band, within 1 dB of attenuation, up to about 0.6 of the Nyquist frequency).
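The mapping of Eq. (1) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, the angle is assumed to be in radians, and measuring coordinates from the image center (so that the rotation is about the center) is an assumption about how that detail is realized. The cubic interpolation of the test image at the resulting fractional coordinates is not shown.

```python
import math


def transform_coords(i, j, alpha, i0, j0, theta):
    """Apply Eq. (1): rotate (i, j) by theta, shift by (i0, j0), scale by alpha."""
    c, s = math.cos(theta), math.sin(theta)
    i_hat = alpha * (i * c - j * s - i0)
    j_hat = alpha * (i * s + j * c - j0)
    return i_hat, j_hat


def transformed_grid(rows, cols, alpha, i0, j0, theta):
    """Fractional coordinates T(z) for every sample of a rows x cols grid.

    Assumption: coordinates are measured from the image center before
    applying Eq. (1) and moved back afterwards.
    """
    ci, cj = (rows - 1) / 2.0, (cols - 1) / 2.0
    grid = []
    for i in range(rows):
        row = []
        for j in range(cols):
            i_hat, j_hat = transform_coords(i - ci, j - cj, alpha, i0, j0, theta)
            row.append((i_hat + ci, j_hat + cj))
        grid.append(row)
    return grid


# With alpha = 1, zero shift and zero rotation the grid maps onto itself.
identity = transformed_grid(3, 3, alpha=1.0, i0=0.0, j0=0.0, theta=0.0)
```

Each entry of the returned grid is the (generally fractional) position at which the test image would be sampled by the interpolator.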

3.1 Estimation of the PDF and Mutual Information

The Mutual Information requires the estimation of the entropy, whose definition, for a continuous source $X$, is:

$$
h(x) = -\int f(x)\,\log(f(x))\,dx \tag{6}
$$

while the joint entropy of two sources $(X, Y)$ is defined as:

$$
h(x, y) = -\iint f(x, y)\,\log(f(x, y))\,dx\,dy \tag{7}
$$

where $(X, Y)$ are not the coordinates, but the random variables on which the entropy is computed. The estimation of the entropy and, hence, of the mutual information goes through the estimation of the Probability Density Function (PDF) of the image. Among the different methods in the literature, we chose the histogram and the Parzen window methods. While generating a histogram is straightforward, some detail is given for the Parzen window (PW). In the PW algorithm the PDF $f(x)$ is approximated by the superposition of functions centered on the samples $x_j$ drawn from the pool $A$ (the image or a subset of it, in our case):

$$
f(x) \approx f^{*}(x) = \frac{1}{N} \sum_{x_j \in A} g(x - x_j) \tag{8}
$$

where $N$ is the number of trials (i.e. samples) in $A$ and $g$ is a window function that integrates to 1. In our subsequent analysis we used a Gaussian density function as the window. One of the problems in PDF estimation is the determination of the optimal parameter, in our case the width $\sigma$ of the Gaussian window. Too high a value of $\sigma$ oversmooths the estimate, while too low a value produces a multimodal PDF, which may not be representative of the pool of data. On the other hand, even when generating a histogram we have to define the interval of values $[x_{inf}, x_{sup}]$ and a step $\Delta x$ or, equivalently, an array of intervals $[x_{k-1}, x_k]$ in which we count the number of occurrences of $A$. The optimization of the final registration with respect to such parameters is an open issue that we have not addressed (see for example Viola [15]). In practice, we used a reasonable value for the interval step of the histogram and adjusted $\sigma$ so that the PDF estimate from the PW method was very close to the one obtained with the histogram.

In order to search for the maximum of the MI, the Gradient Descent Algorithm (GDA) requires knowledge of the derivative of the MI. Since we do not have a closed-form expression for the MI, the derivative is computed by an approximation, given the required precision with which the mapping parameters must be known. So, let us consider the minimum parameter step for the approximation of the derivative:

$$
dT = \{d\alpha, di_0, dj_0, d\theta\}
$$

It is desirable that such steps are as small as possible; for example, for translation we should have $(di_0, dj_0)$ [...] AND $k < N_{k\,max}$, with [...] a pre-defined precision and $N_{k\,max}$ the maximum number of iterates.

First of all, we fix a starting point for the mapping parameters, $T^{(0)} = \{\alpha, i_0, j_0, \theta\}$ (a practical initialization is described in Sec. 4.4). Then, in an iterative loop, the parameters are updated, i.e. $T^{(k)} \to T^{(k+1)}$, by a term proportional to the gradient $dI/dT$ until a stop criterion is reached (convergence of the parameters or maximum number of iterates). When using this procedure, some care must be taken to ensure that the mapping parameters remain valid. For example, a value of $\alpha < 0$ no longer represents a valid scaling factor, and a value too far from 1 may also indicate a wrong solution. The parameter $\lambda$ is called the learning rate. The learning rate depends on the sensitivity of the objective function to the parameters and must be determined experimentally. It is common to decrease its value as the maximum is approached (although this can be difficult to evaluate).

The main computational problem of the algorithm is the cost of the PDF estimation, since this cost is quadratic in the sample size. A less costly approach, proposed in the past [16, 17], consists in using a smaller sample size, even though additional noise is introduced into the derivative estimates. The stochastic version of the GDA uses noisy derivative estimates instead of the true derivative to optimize a function, and convergence can be proven for particular linear systems, provided that the derivative estimates are unbiased and the learning rate is properly changed (decreased) over time. Successful registration has been achieved using small sample sizes, starting from 200 samples for each image, but this number depends on the image statistics and should be carefully assessed, since the noise introduced by the sampling can effectively lock the algorithm into local maxima. Especially in remotely sensed images such local maxima are numerous, since images with regularly repeated topographic elements or ambiguous details are common (for example an urban street grid). For this reason we explored another method to escape from such local maxima while still using stochastic estimates of the PDF: the Simulated Annealing Algorithm.
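A minimal sketch of the iterative update described above, assuming a central finite-difference approximation of the derivative, a geometrically decaying learning rate and a stop test on the size of the parameter update. These choices, the function names and the toy objective are assumptions made for illustration; in the registration problem the objective would be the (possibly noisy) MI estimate as a function of the mapping parameters.

```python
def finite_difference_gradient(objective, params, steps):
    """Central-difference estimate of the gradient, one parameter at a time.

    steps plays the role of the minimum parameter steps dT (assumed usage).
    """
    grad = []
    for n, step in enumerate(steps):
        plus = list(params)
        minus = list(params)
        plus[n] += step
        minus[n] -= step
        grad.append((objective(plus) - objective(minus)) / (2.0 * step))
    return grad


def gradient_ascent(objective, start, steps, rates,
                    eps=1e-3, max_iter=300, decay=0.99):
    """Maximize objective by moving along the estimated gradient.

    rates: one learning rate per parameter; decay shrinks them over time
    (hypothetical schedule). Stops when every update is below eps or after
    max_iter iterations.
    """
    params = list(start)
    iterations = 0
    for k in range(max_iter):
        grad = finite_difference_gradient(objective, params, steps)
        update = [rate * (decay ** k) * g for rate, g in zip(rates, grad)]
        params = [p + u for p, u in zip(params, update)]
        iterations = k + 1
        if max(abs(u) for u in update) < eps:
            break
    return params, iterations


def toy_objective(p):
    """Smooth stand-in for the MI, with a single maximum at (1, -2)."""
    return -((p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2)


estimate, n_iter = gradient_ascent(toy_objective, [0.0, 0.0],
                                   steps=[0.01, 0.01], rates=[0.1, 0.1],
                                   eps=1e-6)
```

On a multimodal objective the same loop can stop at a local maximum, which is the behaviour that motivates the alternative search strategy.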

3.3 Simulated Annealing Algorithm

Simulated Annealing (SA) is an alternative way to escape from local maxima/minima. SA is a generic probabilistic method for the global optimization problem of locating a good approximation to the global optimum of a given function in a large search space. It is often used when the search space is discrete, but it also works for continuous spaces. The name and inspiration come from annealing in metallurgy, a technique involving heating and controlled cooling of a material to increase the size of its crystals and reduce their defects (both being attributes of the material that depend on its thermodynamic free energy) [18].

In an SA problem, the objective function (in our case the MI, $I$) is given and defined on a set $S$, the search space. The global maxima are assumed to be a subset $S^{\dagger}$ of $S$. Given an element $i \in S$, we define the neighbor set of $i$, $S(i)$, and a set of positive coefficients $q_{ij}$ such that

$$
\sum_{j \in S(i)} q_{ij} = 1 \tag{10}
$$

It is assumed that if $j \in S(i)$, then $i \in S(j)$, and that $q_{ii} = 1$ (the values $i$, $j$ here have no relation to the coordinates introduced in Sec. 2 and Eq. (1)). Finally, we define a non-increasing function $T(\cdot)$, called the cooling schedule, and an initial state $x(0)$. The value $T(t)$ is called the temperature at time $t$. The SA algorithm consists of a discrete-time inhomogeneous Markov chain, $x(t)$. Given the current state of the chain at time $t$, $x(t) = i$, we choose a neighbor $j \in S(i)$ at random: the probability of selecting the neighbor $j$ is just $q_{ij}$. Then, the next state is determined as follows:

if $I(j) \geq I(i)$ then $x(t+1) = j$
if $I(j) < I(i)$ then
    $x(t+1) = j$ with probability $\exp[-(I(i) - I(j))/T(t)]$
    $x(t+1) = i$ otherwise

The previous algorithm can be formalized as the following state transition probability:

$$
p\left[ x(t+1) = j \mid x(t) = i \right] =
\begin{cases}
q_{ij} \exp\left[ -\dfrac{1}{T(t)} \cdot \max\{0, I(i) - I(j)\} \right], & j \neq i,\ j \in S(i) \\
0, & j \notin S(i)
\end{cases}
$$

It has been shown that as $T \to 0$ the probability distribution of the Markov chain remains concentrated on the set of global maxima $S^{\dagger}$ of $I$ (*). Convergence analyses carried out in the past [18] also provide optimal cooling schedules, which usually decrease following exponential functions. In our case the search space $S$ is the interval in which the mapping parameters $T$ can change, and it must be properly defined in advance. Each element $i \in S$ is a specific set of parameters, i.e. a given choice of $T$.

(*) Strictly speaking, the original algorithm looks for the global minima, as they are related to the free energy of the system, and the probability distribution of the Markov chain is just the Gibbs distribution. However, it is not difficult to recast the original problem as a search for a global maximum.
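The acceptance rule above can be sketched as follows for a maximization problem. This is a hedged illustration rather than the authors' code: the function names, the uniform choice between two neighbors, the clamped integer search space, the exponential cooling constants and the bookkeeping of the best state visited are all assumptions introduced for the example.

```python
import math
import random


def simulated_annealing(objective, start, neighbor, temperature, n_steps, rng=None):
    """Maximize objective with the acceptance rule of Sec. 3.3.

    neighbor(state, rng) draws a candidate from the neighbor set;
    temperature(t) is a positive, non-increasing cooling schedule.
    Returns the best state visited and its value (extra bookkeeping,
    not part of the Markov chain itself).
    """
    rng = rng or random.Random()
    state, value = start, objective(start)
    best_state, best_value = state, value
    for t in range(n_steps):
        candidate = neighbor(state, rng)
        cand_value = objective(candidate)
        # Uphill moves are always accepted; downhill moves are accepted
        # with probability exp[-(I(i) - I(j)) / T(t)].
        if cand_value >= value or \
                rng.random() < math.exp(-(value - cand_value) / temperature(t)):
            state, value = candidate, cand_value
            if value > best_value:
                best_state, best_value = state, value
    return best_state, best_value


def step_neighbor(x, rng):
    """Move one step left or right inside the search interval [-20, 20]."""
    return max(-20, min(20, x + rng.choice((-1, 1))))


def cooling(t):
    """Exponential cooling schedule (hypothetical constants)."""
    return 10.0 * 0.99 ** t


# Toy usage: integer search space with a single maximum at x = 7.
best, best_val = simulated_annealing(lambda x: -(x - 7) ** 2, 0,
                                     step_neighbor, cooling, 2000,
                                     rng=random.Random(0))
```

For registration, the state would be a choice of the mapping parameters inside the predefined search intervals, and the objective the MI estimate.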


4. EXPERIMENTAL RESULTS

The proposed algorithm basically involves two sub-procedures: (i) the statistical part, i.e. the estimation of the PDF, of the mutual information and of its derivative; (ii) the optimization. The image interpolation, required to test every trial transformation $T$, is not an objective of the paper and is therefore not a matter of experimental investigation. To assess the performance of the algorithm, we analyzed these two parts separately.

4.1 Dataset description

Two kinds of images have been used for testing: 4 sets of simulated images and a SPOT-5 dataset. The simulated images were collected indoors and outdoors. Each set is composed of a reference image and one (or more) images whose transformation (shift and rotation) is known with absolute precision. The images within each set were chosen with a high level of detail and varied illumination conditions, in order to simulate characteristics similar to those of a satellite acquisition. The camera used for image acquisition is a Nikon D60 with a resolution of 10.5 Mpx; the images, originally of 3900×2613 pixels, were downsampled to 390×262 (factor 10) and converted to 8-bit gray scale.

The SPOT-5 dataset consists of two images of the Venice lagoon area, Italy. The first image, acquired on 09 March 2007 at 12h52m57s and of size 11000×11000 pixels, had previously been registered on UTM coordinates (zone 32N) and is considered the reference dataset (see Fig. 2). The second image was acquired on 13 January 2006 at 07h48m22s and has size 10001×10001 pixels. Both images have a sampling step of 5 m. From the reference image, 300 chips of size 256×256 pixels were cut. The corresponding chips (of the same size) on the test image were identified using ancillary information from the satellite ephemeris.

Figure 2. The panchromatic (8-bit) SPOT-5 reference image. Image size is about 60×60 km.


4.2 Estimation of PDF and MI

Initially, we performed the PDF estimation using the histogram. Table 1 reports the main parameters (the symbols refer to those adopted in Sections 3.2 and 3.3). Then, the Parzen window method was adopted using different values of $\sigma$. The estimation parameters were optimized using the SPOT-5 dataset.

Table 1. Parameters used in the registration algorithm

Parameter                      Value
Chip size                      256×256 pixels
Image values                   integers, 256 gray-valued
# bins histogram 1D            64
# bins histogram 2D            64×64
Parzen window parameter        σ = 3
Translation search interval    [−10, 10] pixels
Rotation search interval       [10, 16]°
Scaling search interval        [0.97, 1.03]
Stop criterion for GDA         =0.001
Maximum # iterates k           300
Learning rate for GDA          transl: 1.4, rotation: 0.1, scaling: 1e-3 if iterates