A tight multiframe registration problem with application to Earth observation satellite design

Martín Rais
Dpt. Matemàtiques i Informàtica / CMLA, Universitat de les Illes Balears / ENS Cachan, Spain / France
Email: [email protected]

Carole Thiebaut and Jean-Marc Delvit
Centre National d'Études Spatiales, DCT/SI/QI, France

Jean-Michel Morel
CMLA, École Normale Supérieure de Cachan, France
Abstract—Image registration can play a key role in increasing the SNR of Earth satellite images by fusing many successive noisy thin image stripes in real time, before transmitting the final fused image to the ground. Yet these stripes cannot be fused directly by addition, because their positions are altered by micro-vibrations and other geometric perturbations. These must be compensated with high subpixel accuracy, in real time, using limited onboard computational resources. In this paper we study the fundamental performance limits for this problem and propose a real-time solution that nonetheless gets close to the theoretical limits. We introduce a scheme combining temporal convolution with online noise estimation, optical flow, and a non-conventional multiframe method for measuring global translation. Finally, we compare our results with the theoretical bounds and with other state-of-the-art methods. The results are conclusive on both the accuracy and complexity fronts.

Keywords—Multiframe Registration, Optical Flow, Cramér-Rao bounds, Earth Observing System.
I. INTRODUCTION
Push-broom imagers placed on satellites are used to perform Earth observation at high resolution [1], [2]. Because they are close to the Earth (≈ 800 km), they move fast and the image acquisition time is very brief. The solution to avoid motion blur is to use Time Delay Integration (TDI) on the satellite's imaging sensor, which increases the effective exposure time by synchronizing the pixels with the motion of the camera or the object. This sensor works by shifting the partial measurements of its multiple rows to the adjacent rows, synchronously with the motion of the image across the array of elements. This synchronous accumulation yields an SNR unobtainable with conventional CCD arrays or single-line-scan devices. However, it places stability constraints on the imaging device. To relax these constraints, which limit the accumulation time, we evaluated whether on-board registration and accumulation is feasible. Envisaging the use of a CMOS TDI makes it possible to correct the accumulation procedure by resampling the incoming signal. This avoids motion blur while accumulating many more frames, typically 500 instead of 20. Yet it raises new technical issues. A prerequisite is that the perturbations of the line of sight be estimated very accurately. To this aim, small low-resolution CMOS sensors set along this line of sight could sense the same landscape underneath the TDI's detection line, possibly under slight perturbations and with a low signal-to-noise ratio (SNR) [3]. The problem is then to perform subpixel image registration, up to the noise limit, of a set of consecutive low-SNR images of the same landscape, known as a "TDI row" or "line". This is a non-trivial task due to external conditions such as pointing errors and satellite vibration. Luckily, at the studied time scales (> 500 Hz), the satellite micro-vibrations can be considered a linear shift over the image sequence, so that only one global translation estimate needs to be computed for the whole line. Finally, due to limited hardware resources, the desired algorithm must be undemanding in both memory and processing.

Several interesting questions arise from this task. Is it possible to predict the theoretical optimal performance of the registration method, so as to fix a theoretical bound? Since this implies an accurate noise estimate, can the noise be estimated on board at low cost? Can the smoothness of the trajectory be exploited to make an accurate and very low complexity registration? How can the results be validated?

Registration of two images is an old subject for which several very good surveys exist [4], [5]. An iterative method to gain subpixel accuracy was already proposed in 1981 [6]. A classic alternative involves FFT-based phase correlation [7], later improved to subpixel accuracy in [8], [9] and [10]. The optimal bounds for registering two images also have a long history, going back at least to 1983 [11]. We shall use their latest developments [12], [13], [14], which are extended here to a multi-image uniform translation. It so happens that the optimal registration bounds for an image sequence are much more accurate than for only two images, and the challenge is to reach them.

This paper is organized as follows. Optimal lower bounds for the multi-image registration method are obtained in Section II. Section III describes the proposed algorithm.
Finally, the method is evaluated and compared with other algorithms and with the theoretical bounds in Section IV, and conclusions are given afterwards.

II. INFLUENCE OF NOISE ON THE REGISTRATION
A. Predicting the error on the estimation caused by the noise

In the last decade, two main works have handled the calculation of the error on the estimated shift caused by noise. Robinson and Milanfar [12] used the Cramér-Rao bound (CRB) to derive a performance limit for image registration. The CRB imposes a lower limit on the mean squared error of any estimator of a deterministic parameter. Pham et al. [13] continued on the same idea to derive a theoretical limit for image registration, followed by a study of the bias of the gradient-based 2D shift estimation from the Lucas-Kanade [6] optical flow method. Finally, Sabater et al. [14] studied how accurate the disparity computed from a rectified stereo pair with small baseline can be, and proved a formula for the variance of the disparity error caused by noise. Contrary to the CRB, the error formula of this last paper is not a bound but an actual evaluation of the variance.

In this section we assume, for the sake of simplicity, that the noise is uniform and Gaussian. This assumption is valid if the SNR is not too low, and can be secured by applying a Variance Stabilizing Transform (VST) to the Poisson noise [15].

In [13], the CRB is given for the registration of a pair of images. We now consider its extension to a set of N images, or "line", when a global uniform translation (v_x, v_y) per frame is assumed. Assuming that I_0, ..., I_n are uniformly translated samples of a noiseless image I on a discrete image domain S, each corrupted by additive white Gaussian noise of variance \sigma^2, we have

I_0(x, y) = I(x, y) + n_0(x, y)
I_1(x, y) = I(x - v_x, y - v_y) + n_1(x, y)
\vdots
I_n(x, y) = I(x - n v_x, y - n v_y) + n_n(x, y).

Let \hat{v} = [\hat{v}_x, \hat{v}_y]^T be an estimator of the global shift v = [v_x, v_y]^T. Under this noise model, the probability \Pr(I \mid \hat{v}) of the unknown scene I given \hat{v} is

\prod_{i=0}^{n} \prod_{S} \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(I_i(x,y) - I(x - i v_x, y - i v_y))^2}{2\sigma^2} \right).    (1)

Then \log \Pr(I \mid \hat{v}) becomes

-\frac{1}{2\sigma^2} \sum_{i=0}^{n} \sum_{S} (I_i(x,y) - I(x - i v_x, y - i v_y))^2 + \mathrm{const}.    (2)

Setting I_x = \partial I / \partial x and I_y = \partial I / \partial y, the 2 × 2 Fisher Information Matrix (FIM) F [16], given by -E\left[ \frac{\partial^2}{\partial \hat{v}^2} \log \Pr(I \mid \hat{v}) \right], for our problem therefore is

F(v) = \frac{1}{\sigma^2} \left( \sum_{i=1}^{N} i^2 \right) \begin{pmatrix} \sum_S I_x^2 & \sum_S I_x I_y \\ \sum_S I_x I_y & \sum_S I_y^2 \end{pmatrix},    (3)

so that the CRB for performing registration on a set of N consecutive images using an unbiased estimator becomes

\mathrm{var}(\hat{v}_x) \ge F^{-1}_{11} = \frac{\sigma^2 \sum_S I_y^2}{\mathrm{Det}(F)} \left( \sum_{i=1}^{N} i^2 \right)^{-1}    (4)

\mathrm{var}(\hat{v}_y) \ge F^{-1}_{22} = \frac{\sigma^2 \sum_S I_x^2}{\mathrm{Det}(F)} \left( \sum_{i=1}^{N} i^2 \right)^{-1}    (5)

where \mathrm{Det}(F) = \sum_S I_x^2 \sum_S I_y^2 - \left( \sum_S I_x I_y \right)^2.

Comparing with the CRB on a pair of images given in [13], this result is remarkable and slightly puzzling. Assuming that the Cramér-Rao estimates are tight, it means that with N image samples the variance can be divided by an order of magnitude of N^3, and not N as would be expected from a simple average of N independent noisy estimates. If we imagine now that a more reliable algorithm could match each image to each other, so that the number of "independent" shift estimates would be N(N+1)/2, the variance of an average of all estimates obtained in that way would still be, under the (inexact) assumption that all these estimates are independent, of the order of 1/N^2 times the variance of a single estimate. But the promised CRB factor is still more favorable, of the order of 1/N^3.

B. Noise estimation

The CRB given above requires a global estimate of the noise. Since the translation between consecutive frames is assumed small with respect to the pixel size, the noise can easily be estimated by subtracting consecutive frames. Setting I_t^* = I_2^* - I_1^*, where I_2^* and I_1^* carry white Gaussian noise of standard deviation \sigma, implies that

\frac{1}{|S|} \sum_{x,y} (I_t^*(x,y))^2 \approx 2\sigma^2    (6)

where |S| is the number of pixels of the image. This estimate can be improved by averaging over all pairs of consecutive frames. Since the shot noise occurring in satellite images is Poisson, the Anscombe VST [15] is applied beforehand to obtain approximately white Gaussian noise. We therefore set

\hat{\sigma}^2 := \frac{1}{N-1} \sum_{i=1}^{N-1} \left[ \frac{1}{2|S|} \sum_{x,y} \left( I_{i+1}^*(x,y) - I_i^*(x,y) \right)^2 \right]    (7)

where N is the number of images to register, |S| is the number of pixels of each image, and I_i^* denotes the VST-transformed noisy image obtained from I_i.
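As a concrete illustration, the noise estimate of Eqs. (6)-(7) and the multiframe CRB of Eqs. (3)-(5) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the onboard implementation: the function names are ours, and the use of `np.gradient` as the discrete image derivative is an assumption of this sketch.

```python
import numpy as np

def anscombe(img):
    # Anscombe VST [15]: turns Poisson noise into approximately
    # Gaussian noise with unit variance.
    return 2.0 * np.sqrt(img + 3.0 / 8.0)

def estimate_noise_variance(frames):
    # Eqs. (6)-(7): for each pair of consecutive frames, half the
    # mean squared difference estimates sigma^2; average over all
    # N-1 pairs.  Valid while the inter-frame shift is small with
    # respect to the pixel size.
    pair_estimates = [np.mean((b - a) ** 2) / 2.0
                      for a, b in zip(frames, frames[1:])]
    return float(np.mean(pair_estimates))

def multiframe_crb(ref, sigma2, n_frames):
    # Eqs. (3)-(5): Cramer-Rao lower bounds on var(vx), var(vy) for
    # N frames under a uniform per-frame translation.
    Iy, Ix = np.gradient(ref)                       # discrete derivatives
    sxx, syy, sxy = (Ix ** 2).sum(), (Iy ** 2).sum(), (Ix * Iy).sum()
    det = sxx * syy - sxy ** 2                      # Det(F)
    k = sum(i * i for i in range(1, n_frames + 1))  # sum_{i=1}^{N} i^2
    return sigma2 * syy / (det * k), sigma2 * sxx / (det * k)
```

Note how both bounds shrink like \left( \sum_{i=1}^{N} i^2 \right)^{-1} \approx 3/N^3: going from 2 to 20 frames divides the variance bound by 2870/5 = 574, which is precisely the surprising multiframe gain discussed above.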
C. Conditions based on the SNR of the images

To measure the SNR efficiently and to ensure a reliable computation of the global shift, a good estimate of the signal energies \sum I_x^2 and \sum I_y^2 is crucial. Their evaluation is reliable if and only if the following ratios are high enough:

\theta_x^i = \frac{\sum_{x,y} \left( \frac{\partial I_i^*}{\partial x}(x,y) \right)^2}{2|S|\hat{\sigma}^2}    (8)

\theta_y^i = \frac{\sum_{x,y} \left( \frac{\partial I_i^*}{\partial y}(x,y) \right)^2}{2|S|\hat{\sigma}^2}    (9)

where |S| is the total number of pixels of the computed gradient and \hat{\sigma}^2 is given by Eq. (7). These ratios are computed and verified for every image I_i^* of the sequence. Note that \theta_x^i, \theta_y^i \ge 1, and that a value strictly larger than 1 indicates the presence of signal, while a value close to 1 indicates that only noise is observed.
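The SNR check of Eqs. (8)-(9) can likewise be sketched as follows. The two-point difference used here as a discrete derivative is an assumption of this sketch; it is what makes the 2|S|\hat{\sigma}^2 normalization exact, since differencing white noise of variance \sigma^2 yields noise of variance 2\sigma^2.

```python
import numpy as np

def snr_ratios(frame, sigma2):
    # Eqs. (8)-(9): ratio between the measured gradient energy and
    # the gradient energy that pure noise of variance sigma2 would
    # produce.  A two-point difference of white noise has variance
    # 2*sigma2, hence the 2*|S|*sigma2 normalization.
    Ix = np.diff(frame, axis=1)
    Iy = np.diff(frame, axis=0)
    theta_x = float((Ix ** 2).sum() / (2.0 * Ix.size * sigma2))
    theta_y = float((Iy ** 2).sum() / (2.0 * Iy.size * sigma2))
    return theta_x, theta_y
```

On a pure-noise frame both ratios are close to 1, so lines overflying featureless scenes (e.g. calm waters) can be detected and discarded before any registration is attempted.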
III. AN ITERATIVE SOLUTION

Noise estimation is necessary, in particular to discard lines with no signal (for example those overflying calm waters). On the other hand, noise reduction is crucial to increase the registration accuracy. We found that a paradoxical operation, a temporal convolution along the line, spectacularly increased the registration accuracy. The paradox is that this temporal convolution indeed decreases the noise, but it also introduces motion blur. Fortunately, the motion is small enough (less than a tenth of a pixel), which is the key favorable ingredient of this strategy.

How many frames should be accumulated? Several external constraints make this choice non-trivial, namely the amount of noise in the input images, the total translation of the line, the desired precision and computational cost of the algorithm, and the available memory. A simple way to estimate the right number of accumulated frames is to begin by evaluating the SNR (Section II-C) using two frames for the temporal convolution, and to grow the temporal filter exponentially until proper SNR values are found or until the line is discarded for low SNR.

To estimate the shift between frames, the optical flow shift estimator presented in [6] can be used. Based on the so-called optical flow constraint equation [17], this estimator minimizes the squared error to obtain both translations v_x, v_y, resulting in

\begin{pmatrix} \sum_S I_x^2 & \sum_S I_x I_y \\ \sum_S I_x I_y & \sum_S I_y^2 \end{pmatrix} \begin{pmatrix} v_x \\ v_y \end{pmatrix} = \begin{pmatrix} \sum_S I_x I_t \\ \sum_S I_y I_t \end{pmatrix},    (10)

which gives the solution

\hat{v}_x = \frac{1}{\mathrm{Det}} \left( \sum_S I_x I_t \sum_S I_y^2 - \sum_S I_y I_t \sum_S I_x I_y \right)    (11)

\hat{v}_y = \frac{1}{\mathrm{Det}} \left( \sum_S I_y I_t \sum_S I_x^2 - \sum_S I_x I_t \sum_S I_x I_y \right)    (12)

where \mathrm{Det} = \sum_S I_x^2 \sum_S I_y^2 - \left( \sum_S I_x I_y \right)^2 and I_t = I_2 - I_1.

This estimator introduces a bias that must be corrected. As found in [13], iterating the estimation of the shift while resampling the second image yields an unbiased motion estimator. Since only the derivative of the first image is needed, it is computed once, permitting a fast implementation. However, this iterative process requires interpolating the images, which raises the computational cost. We found that bicubic [18] and B-spline [19] interpolation gave a good speed-accuracy compromise.

The iterative procedure above estimates the shift between two frames. For our multiframe problem, the naive solution would be to compute all consecutive shifts and average them, but this would accumulate errors. In a smarter technique, the shift is computed for every frame, always with respect to the first frame. However, since the accuracy of the optical flow method degrades as the shift grows, the estimate of the previous shift is used to realign the current frame by an integer number of pixels. Thus, the shift between the realigned current frame and the reference first frame always remains below half a pixel, and the optical flow shift estimator is used to estimate this residual shift. In this way, all subpixel shift estimates with respect to the first frame are nearly independent and can be efficiently averaged. Furthermore, the image gradient is only computed for the first image, which considerably accelerates the algorithm. This procedure is shown in Algorithm 1, where PixelShift shifts the image by an integer number of pixels, OF is the iterative optical flow shift estimation described in [13], and DeacumOffs subtracts from each element of the vector its predecessor.

Algorithm 1 EstimateGlobalShift
Input: I_TC(k), 1 ≤ k ≤ N − 2p: temporally convolved input images; N: number of frames; p: number of frames used for temporal convolution.
Output: offset: global estimated translation.
  I_x, I_y ← ComputeGradient(I_TC(1))
  previous ← [0, 0]
  for i = 1 to N − 2p − 1 do
    [v_x, v_y] ← round(previous)
    I_TC(i+1) ← PixelShift(I_TC(i+1), −v_x, −v_y)
    offs(i) ← [v_x, v_y] + OF(I_TC(1), I_TC(i+1), I_x, I_y)
    previous ← offs(i)
  end for
  offset ← Mean(DeacumOffs(offs))

This algorithm requires approximately 31 operations per pixel. Adding the noise estimation, the SNR validation and the temporal convolution, its complexity ranges from 40 to 107 operations per pixel depending on the noise level (which drives the number of iterations of the SNR validation). This very low complexity, together with low memory requirements, allows the method to run successfully in real time on the low computational hardware usually found on satellites.
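The shift estimator of Eqs. (10)-(12), its iterative bias correction, the temporal convolution, and the global loop of Algorithm 1 can be sketched together in NumPy. This is a simplified illustration, not the onboard code: all function names are ours, bilinear resampling stands in for the bicubic/B-spline interpolation of [18], [19], and the sign convention follows the shift model I_i(x, y) = I(x − i v_x, y − i v_y) of Section II.

```python
import numpy as np

def temporal_convolution(frames, p):
    # Temporal box filter of width 2p+1: divides the noise variance
    # by 2p+1 at the cost of a small motion blur (acceptable since
    # the per-frame shift is well below a pixel).
    return [np.mean(frames[i - p:i + p + 1], axis=0)
            for i in range(p, len(frames) - p)]

def bilinear_shift(img, vx, vy):
    # Sample img at (x + vx, y + vy) with bilinear interpolation and
    # replicated borders (stand-in for bicubic/B-spline resampling).
    h, w = img.shape
    y, x = np.mgrid[0:h, 0:w].astype(float)
    xs, ys = np.clip(x + vx, 0, w - 1), np.clip(y + vy, 0, h - 1)
    x0, y0 = np.floor(xs).astype(int), np.floor(ys).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    fx, fy = xs - x0, ys - y0
    top = img[y0, x0] * (1 - fx) + img[y0, x1] * fx
    bot = img[y1, x0] * (1 - fx) + img[y1, x1] * fx
    return top * (1 - fy) + bot * fy

def of_shift(ref, cur, Ix, Iy, iters=3):
    # Eqs. (10)-(12) with the iterative bias correction of [13]:
    # solve the 2x2 system for the residual shift, resample cur by
    # the running estimate, and repeat.  Sums are restricted to the
    # image interior to limit border effects.
    c = np.s_[2:-2, 2:-2]
    M = np.array([[(Ix[c] ** 2).sum(), (Ix[c] * Iy[c]).sum()],
                  [(Ix[c] * Iy[c]).sum(), (Iy[c] ** 2).sum()]])
    total, warped = np.zeros(2), cur
    for _ in range(iters):
        It = warped - ref
        b = np.array([(Ix[c] * It[c]).sum(), (Iy[c] * It[c]).sum()])
        total += np.linalg.solve(M, -b)
        warped = bilinear_shift(cur, total[0], total[1])
    return total

def estimate_global_shift(frames):
    # Algorithm 1 (sketch): register every frame against the first,
    # pre-aligning by the rounded previous estimate so the residual
    # stays below half a pixel; the gradient of the reference frame
    # is computed only once.  The per-step increments are averaged
    # into one global per-frame translation.
    ref = frames[0]
    Iy, Ix = np.gradient(ref)
    offs, prev = [], np.zeros(2)
    for cur in frames[1:]:
        vint = np.round(prev)                          # integer pre-shift
        pre = bilinear_shift(cur, vint[0], vint[1])
        offs.append(vint + of_shift(ref, pre, Ix, Iy))
        prev = offs[-1]
    offs = np.vstack([np.zeros(2)] + offs)
    return np.diff(offs, axis=0).mean(axis=0)          # DeacumOffs + Mean
```

On a smooth synthetic sequence with a uniform per-frame shift of (0.2, 0.1) pixels, this sketch recovers the translation to within a few hundredths of a pixel; the onboard accuracy reported in Section IV relies on the higher-order interpolation that this sketch replaces with bilinear resampling.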
IV. RESULTS
To evaluate the performance of the proposed method, a simulated test set together with its ground truth was provided by CNES (the French space agency). It simulated arrays of noisy images (with Poisson noise) obtained from a noiseless aerial raw image, which was finely resampled by high-order splines to simulate subpixel motion. It contained approximately 3000 lines, each of 64 frames, where each frame is a 50 × 50 image with a resolution of 30 cm per pixel. Noise varied between lines, while the total horizontal shift of the N images (\delta_x) could range between 0 and 4.5 pixels. Vertically, the provided simulated images contained no translation (i.e. \delta_y = 0). To measure how the algorithms performed, the Euclidean distance to the per-frame global shift \left( \frac{\delta_x}{N-1}, \frac{\delta_y}{N-1} \right) was used. To compare the alignment methods, a significant subset of lines was chosen so as to represent each type of problem and/or perturbation: lines 4 and 5 are considerably noisy; in lines 10 and 11, methods that assume periodicity of the images would fail; the rest cover various shift/noise conditions. Three algorithms were compared: the proposed method, the iterative algorithm of Pham et al. [13] applied frame by frame with all computed translations averaged, and a method based on the Fourier-based shift estimation of [10]. Using three iterations of the optical flow algorithm
and spline interpolation, Table I shows the Euclidean norm of the error for each method, compared to the Euclidean norm of the estimated CRB.

TABLE I. Total ground truth translation, Cramér-Rao bounds and error of each method for the selected lines, in pixels.

#     δx    CRB       [13]      Proposed   [10]
1     0     5.78e-6   4.27e-4   4.92e-5    2.87e-5
2     1.5   6.41e-6   3.60e-4   4.65e-5    4.63e-3
3     3     6.01e-6   9.44e-4   3.10e-5    6.46e-4
4     4.5   9.58e-6   1.10e-2   5.72e-5    1.20e-3
5     4.5   2.35e-5   4.77e-2   4.71e-4    6.41e-3
6     0     1.77e-6   1.74e-5   2.02e-5    2.32e-5
7     1.5   4.12e-6   4.21e-4   1.08e-4    3.10e-3
8     3     3.35e-6   1.11e-3   5.55e-5    2.99e-5
9     4.5   2.60e-6   1.02e-3   2.83e-5    2.83e-4
10    3     4.37e-6   3.15e-3   4.25e-5    8.93e-2
11    3     3.98e-6   2.48e-3   1.14e-4    3.91e-2
Avg   -     -         6.24e-3   9.30e-5    1.32e-2

As can be observed, the proposed
algorithm gives the best results in almost every case, except when no translation is present, in which case all methods obtain excellent results. Note the poor performance of the Fourier-based method in the last two cases, where it suffers from its implicit assumption that images are periodic. The obtained results approach the CRB to within an order of magnitude in almost every line. The obtained subpixel precision is below 10^{-4}, which makes the method feasible for satellite imaging. On current satellite images (which are ≈ 3·10^4 pixels long), this would amount to a 3-pixel distortion if all registration errors accumulated; this error is acceptable. Moreover, assuming the registration errors to be independent, we can expect a global distortion of the order of 10^{-1} pixels on average. This is potentially as accurate as the current best mechanical stabilization systems [2]. In order to analyze the results visually, the registration process is divided into three distinct steps: BP (Before Processing, original images of the line), AL (After aLigning the frames) and FI (FInal result after performing a temporal convolution on the aligned frames). The absolute value of the difference between the first and the last frame is shown, scaled by a multiplicative factor to make it visible. For every example, a comparison is made with the results of the simpler iterative algorithm of Pham et al. [13], which does not use temporal convolution and computes the global translation by averaging all frame-by-frame translations. Figure 1 shows results for an example line in which the method drastically improves the alignment of the frames. In Figure 1c, the 3-pixel difference between the first and the last frame is observable. After aligning the frames, no signal is perceived over the noise, which hints at a quality result. The compared method fails to correctly align the frames, as can be perceived in the difference image through the edges of the buildings. In Figure 2, a very noisy line is evaluated.
Besides the high amount of noise, there is a larger translation (4.5 pixels) between the first and the last frame. For this reason, small objects appear duplicated in the difference image. In this particular example, the results of temporal convolution (TC) are also shown, to give a better understanding of the case. Indeed, after temporal convolution (accumulating over eight consecutive images in this example), objects start to be clearly perceived, which strongly justifies its use. Due to the high amount of noise, the compared method has strong difficulties aligning the frames, mainly because of its lack of denoising before estimating the optical flow. For the proposed algorithm, however, the registration results on this line are remarkable.

Fig. 1. Results for line 8. The frames are correctly aligned, and mostly noise is perceived in the difference after alignment. Panels: (a) BP - First, (b) BP - Last, (c) BP - Dif, (d) AL - Dif, (e) FI - Dif, (f) Pham et al. [13] - AL - Dif, (g) Pham et al. [13] - FI - Dif.

V. CONCLUSION
We found that a multiframe optical flow method can become remarkably accurate, opening the way to flexible onboard imaging on push-broom satellites with augmented SNR. Our proposed method is computationally cheap, yet experimentation showed it to be more accurate than other methods by two or three orders of magnitude, and only one order of magnitude above the Cramér-Rao lower bounds. Fortunately, the attained accuracy is just sufficient for our imaging goals, which shows that the problem was tight. Finally, our proposed algorithm is currently being used by CNES in an on-ground demonstrator to register TDI images in real time, and is being implemented in hardware to reach high computational speed. Nevertheless, the question of finding an optimal multiframe registration algorithm, potentially even closer to the Cramér-Rao bounds, remains open.
Fig. 2. Results for noisy line 5. Temporal convolution makes objects much easier to perceive; only noise remains in the difference after registration with our method. Panels: (a) BP - First, (b) BP - Last, (c) BP - Dif, (d) TC - First, (e) TC - Last, (f) TC - Dif, (g) AL - Dif, (h) FI - Dif, (i) Pham et al. [13] - AL - Dif, (j) Pham et al. [13] - FI - Dif.

ACKNOWLEDGMENTS

During this work, the author held a fellowship of the Ministerio de Economía y Competitividad (Spain), reference BES-2012-057113, for the realization of his Ph.D. thesis. This work was also partially supported by the Centre National d'Études Spatiales (MISS Project), the European Research Council (Advanced Grant Twelve Labours), the Office of Naval Research (Grant N00014-97-1-0839), Direction Générale de l'Armement, Fondation Mathématique Jacques Hadamard, and Agence Nationale de la Recherche (Stereo project).

REFERENCES

[1] J. W. Bergstrom, R. W. Dissly, W. A. Delamere, A. McEwen, and L. Keszthelyi, "High-Resolution Multi-Color Pushbroom Imagers for Mars Surface Characterization and Landing Safety," LPI Contributions, vol. 1679, p. 4332, Jun. 2012.
[2] S. Fourest, X. Briottet, P. Lier, and C. Valorge, Satellite Imagery from Acquisition Principles to Processing of Optical Images for Observing the Earth. Cépaduès Éditions, Oct. 2012.
[3] A. Materne, O. Puig, and P. Kubik, "Capteur d'images avec correction de bougé par recalage numérique de prises de vue fractionnées," Patent FR 2976754-A1, 2012.
[4] Q. Tian and M. N. Huhns, "Algorithms for subpixel registration," Comput. Vision Graph. Image Process., vol. 35, no. 2, pp. 220-233, Aug. 1986.
[5] B. Zitová and J. Flusser, "Image registration methods: a survey," Image and Vision Computing, vol. 21, pp. 977-1000, 2003.
[6] B. D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," in Proc. 7th Int. Joint Conf. on Artificial Intelligence (IJCAI'81), vol. 2. San Francisco, CA, USA: Morgan Kaufmann, 1981, pp. 674-679.
[7] B. Reddy and B. Chatterji, "An FFT-based technique for translation, rotation, and scale-invariant image registration," IEEE Transactions on Image Processing, vol. 5, no. 8, pp. 1266-1271, Aug. 1996.
[8] H. Foroosh, J. Zerubia, and M. Berthod, "Extension of phase correlation to subpixel registration," IEEE Transactions on Image Processing, vol. 11, no. 3, pp. 188-200, Mar. 2002.
[9] K. Takita, T. Aoki, Y. Sasaki, T. Higuchi, and K. Kobayashi, "High-Accuracy Subpixel Image Registration Based on Phase-Only Correlation," 2003.
[10] M. Guizar-Sicairos, S. Thurman, and J. Fienup, "Efficient subpixel image registration algorithms," Opt. Lett., vol. 33, no. 2, pp. 156-158, Jan. 2008.
[11] V. N. Dvornychenko, "Bounds on (deterministic) correlation functions with application to registration," IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-5, no. 2, pp. 206-213, 1983.
[12] D. Robinson and P. Milanfar, "Fundamental performance limits in image registration," IEEE Trans. Image Process., vol. 13, no. 9, pp. 1185-1199, Sep. 2004.
[13] T. Pham, M. Bezuijen, L. van Vliet, K. Schutte, and C. L. Luengo Hendriks, "Performance of optimal registration estimators," in Visual Information Processing XIV, ser. Proc. SPIE, vol. 5817. SPIE, 2005.
[14] N. Sabater, J.-M. Morel, and A. Almansa, "How accurate can block matches be in stereo vision?" SIAM J. Imaging Sci., vol. 4, no. 1, pp. 472-500, Mar. 2011.
[15] F. Anscombe, "The transformation of Poisson, binomial and negative-binomial data," Biometrika, vol. 35, no. 3/4, pp. 246-254, 1948.
[16] J. W. Pratt, "F. Y. Edgeworth and R. A. Fisher on the efficiency of maximum likelihood estimation," The Annals of Statistics, vol. 4, no. 3, pp. 501-514, 1976.
[17] B. K. Horn and B. G. Schunck, "Determining optical flow," Tech. Rep., Cambridge, MA, USA, 1980.
[18] R. Keys, "Cubic convolution interpolation for digital image processing," IEEE Trans. Acoust., Speech, Signal Process., vol. 29, no. 6, pp. 1153-1160, 1981.
[19] M. Unser, A. Aldroubi, and M. Eden, "Fast B-spline transforms for continuous image representation and interpolation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 13, no. 3, pp. 277-285, Mar. 1991.