Constraint Optimisation for Robust Image Matching with Inhomogeneous Photometric Variations and Affine Noise

Al Shorin, Georgy Gimel'farb, Patrice Delmas, and Patricia Riddle

University of Auckland, Department of Computer Science, P.B. 92019, Auckland 1142, New Zealand
{al,ggim001,pdel016,pat}@cs.auckland.ac.nz
Abstract. While modelling spatially uniform or low-order polynomial contrast and offset changes is mostly a solved problem, there has been limited progress on models that can represent highly inhomogeneous photometric variations. A recent quadratic programming (QP) based matching method allows for almost arbitrary photometric deviations. However, the QP-based approach is deficient in one substantial respect: it assumes the images are already aligned geometrically, as it does not model geometry at all. This paper improves on the QP-based framework by extending it with a robust rigid registration layer, thus increasing both its generality and practical utility. The proposed method shows up to a 4-fold improvement in the quadratic matching score over a current state-of-the-art benchmark.

Keywords: Robust Image Matching, Robust Image Registration, Reweighted Iterative Least Squares, Affine Functions, Inhomogeneous Photometric Noise, QP, Hildreth-d'Esopo Algorithm
1 Introduction
Digital images capture both photometric and geometric properties of a real-world 3D scene (from this point forward, for brevity, these properties will be denoted by the letters p and g, respectively). Matching or registering semantically similar images has to account for their p- and g-dissimilarities, or noise (not to be confused with independent random noise, denoted here as residual noise). These dissimilarities can be caused by many factors, but it is convenient to think of them as being either intrinsic or extrinsic to the scene. Examples of the former include scene shadows, changing illumination, or different object poses, while the latter are most commonly introduced after image acquisition, e.g. by brightness or scale adjustments. A third dichotomy introduced here describes the complexity of the dissimilarity patterns between the target and template images: noise patterns can be either homogeneous (smooth) or inhomogeneous (non-smooth). The distinction between the two is somewhat arbitrary, but in this paper it is assumed that patterns are
[Fig. 1 layout: template, target, and difference triplets for panels (a)-(c); templates under 15◦ R, 0◦ U illumination; targets under 15◦ R 0◦ U (synthetic noise), 15◦ R 15◦ U, and 90◦ R 75◦ U. Panel titles: (a) Extrinsic homogeneous (synthetic 2nd-order polynomial), (b) Intrinsic inhomogeneous, (c) Intrinsic inhomogeneous.]
Fig. 1. Homogeneous/extrinsic (a) and inhomogeneous/intrinsic (b,c) noise patterns for image 01 from Fig. 3. Note the significant effect of the rotated illumination source in (c). The difference images are enhanced for visualisation purposes by a contrast factor of 1.8 and an offset of 30/255. The images are from the MIT database [9].
homogeneous if they can be accurately modelled by slowly varying functions of the image coordinates, such as low-order polynomials. Otherwise the pattern is non-smooth, or inhomogeneous. The rationale for this distinction comes from the upper bound on modelling complexity of the current state-of-the-art and is further described in Section 2. Intrinsic p-variations between images of 2D objects on a frontal plane can be expected to be homogeneous in nature, while those of complex 3D objects will be highly inhomogeneous. Natural 3D scenes almost certainly guarantee a great deal of inhomogeneity, for example because variable illumination conditions introduce complex shadow patterns on complex 3D surfaces. Figure 1 illustrates the notion of homogeneity: the homogeneous p-variations (a) can be represented by a 2nd-order polynomial of the image xy-coordinates, whereas the inhomogeneous patterns (b,c) arising from varying scene illumination are much more intricate. Noise complexity becomes more apparent under larger directional light changes (c). A comprehensive set of inhomogeneous patterns arising from 18 different light source positions is given in Fig. 2. These images show absolute differences between two images of the same subject: one under fixed frontal illumination and the other under a light source rotated along the horizontal and vertical
[Fig. 2 layout: 6 × 3 grid of difference images indexed by vertical elevation 0◦, 15◦, 30◦, 45◦, 60◦, 75◦ U and horizontal rotation 15◦, 45◦, 90◦ R.]
Fig. 2. The difference images between two identical faces taken under various illumination conditions: image 01 from Fig. 3 with frontal illumination vs. its 18 variants taken under different horizontal (R) and vertical (U) lighting orientations. The images were enhanced for visualisation purposes with a contrast factor of 1.8 and an offset of 30/255. Images are from the MIT database [9].
axes. The cross-correlation C = r² for each pair provides a quantifiable statistical measure of the magnitude of the inhomogeneous template-target p-differences; the correlation values decrease rapidly for large changes in light source orientation. Inhomogeneous p-variations pose a challenging problem to state-of-the-art image matching, as there is a notable paucity of sufficiently expressive models for representing such complex noise. Many popular algorithms [8,1] relate signal differences to low-order polynomial contrast factors and offsets; the residual is described by an independent random field of additive deviations with a typically centre-symmetric probability function. Clearly, a low-order polynomial is incapable of approximating complex natural p-noise patterns, and increasing the polynomial order past quadratic may not be appropriate as it inevitably introduces numerical instability. An example of the practically established upper bound can be found in the recent 2nd-order polynomial model, which deals with only 10 parameters [8]. While an adequately expressive p-noise model is required because inhomogeneous intrinsic noise is virtually unavoidable in practical applications, the situation is more manageable with g-variations, as it is often possible to control the pose and other g-properties of the scene. Hence an extrinsic-only g-noise model is frequently sufficient, and many g-variations encountered in practical problems can be closely approximated by planar rigid transformations. Finally, because p- and g-noise are often modelled separately with quite different mathematical tools, combining them can be a challenging problem. Although many known methods derive and utilise a single optimisation routine incorporating both components, this is not a trivial exercise for the algorithm presented
in this paper: the p- and g-parts involve entirely different mathematical models (QP vs. polynomial), different kernels (quadratic vs. robust), and search space cardinalities that differ by four orders of magnitude. This paper presents a method which combines the two disparate g- and p-models and proposes a novel framework for image matching under realistic image variations.
2 Previous Work
Image matching under realistic signal variations has been of interest for decades. In the early days, only zero-order polynomial models of spatially constant brightness and contrast were used to derive various correlation-based matching methods [5]; this line of work originates from pioneering research in computer vision in the 1960s [6]. More recently, it was followed by a variety of quadratic-kernel, polynomial-based p-noise models [3]. In response to failures of non-robust techniques [3], matching methods based on Huber's statistics and second-order polynomial models of spatially variant brightness and contrast [7] constituted a step forward and later branched out into a family of related algorithms [2,16,15]. The latter replace the traditional squared-error kernel with an M-estimator that is more robust in the presence of large signal differences. Robust matching algorithms often use numerical techniques such as gradient descent for iterative suboptimal minimisation. The number of search parameters grows quadratically, as (ν + 1)(ν + 2), with the polynomial order ν. In addition to the numerical instability of higher-order polynomial functions, both the convergence rate and the speed of gradient search are notably affected as the number of parameters grows. This highlights the main dilemma of today's image matching: either the model's expressiveness can be increased by incorporating a larger number of parameters to approximate natural p-noise patterns, or its robustness can be increased with a non-quadratic formulation, but not both. Methods based on polynomial p-noise models and robust estimators (matching error kernels) face computational instability and intractability should they try to increase their expressiveness to account for real-world noise patterns. In practice, these methods are forced to work with only a small number of parameters, which hinders adequate modelling of inhomogeneous noise. As the global polynomial-based approach is not easily scalable, a number of alternative formulations have been explored in the literature. One robust approach avoids dealing with the entire N-dimensional space, where N is the image lattice cardinality, and instead uses selective heuristic sampling of the space [15]. Unfortunately, this method relies on manual crafting of a heuristic, thus requiring human intervention; additionally, most of the signal space is completely ignored. Another robust alternative is image preprocessing using edge and contour detection [16]. As with other approaches, it can only account for low-order polynomial contrast and offset deviations, and hence it fails when realistic non-smooth variations appear.
Other methods avoid robust statistics altogether and employ the conventional least-squares approach; unfortunately, they are equally unable to enrich the problem space. Notable examples include correlation-based matching in the frequency domain [4], correlation-based stereo matching [1], correlation between local regions rather than individual pixels [14,17], divide-and-conquer matching in the pixel space by breaking the image into smaller patches and evaluating localised scores [18], and matching with a mixture of global and local p-parameters, where the local parameters describe specific signal relationships under illumination changes due to diffuse, ambient and specular reflections [10,13]. All these methods are restricted by a number of oversimplifications, have inadequately small parametric search spaces and, as a result, cannot capture the complexity of more general noise. Recently, a promising new direction was explored in which matching is viewed as a constraint optimisation problem [12]. This nonparametric quadratic programming based approach implements a model expressive enough to capture complex intrinsic p-noise: its large search space is shaped by 6N linear constraints on the immediate 2-neighbourhood of each pixel, and a fast QP algorithm is used to solve it. Despite showing a dramatic improvement over the state-of-the-art in terms of modelling power, it knows nothing about geometry and assumes perfect geometric alignment between the template and the target, which seriously limits its usefulness. To the best of our knowledge, a matching algorithm which can account for both affine g-noise and inhomogeneous p-noise has not been proposed before.
3 The Proposed Matching Method
Suppose a greyscale image s is encoded as s : R → Q, where the lattice R = [(i, j) : i = 0, . . . , m − 1; j = 0, . . . , n − 1] has a total of N = m × n pixels and the grey level signal is defined by the finite set Q = {0, 1, . . . , Q − 1}. Then s_{i,j} is the image grey level, i.e. intensity or brightness, at pixel (i, j). Let t : R → Q denote the template t of the target s. The approach introduced in this paper merges two independent noise model layers: the p-model implementing the QP matching algorithm [12] and the g-model employing a classic robust affine registration algorithm [8].

P-model: QP-based image matching. The least-squares error kernel reduces the image matching problem described below to a QP problem with 6N linear constraints, sufficiently descriptive for a great range of inhomogeneous p-noise [12]. Admissible p-deviations in the target image s with respect to the template image t are represented by constraining changes of the neighbourhood signals to the predetermined range E = [e_min, e_max], where 0 < e_min < 1 < e_max. If the image ŝ is obtained by relaxing the neighbourhood relationships in t within the admissible multiplicative range E, then the local constraints on ŝ can be defined as
\[
\begin{aligned}
\Delta_{\min:i,i-1;j} \;&\le\; \hat{s}_{i,j} - \hat{s}_{i-1,j} \;\le\; \Delta_{\max:i,i-1;j}\\
\Delta_{\min:i;j,j-1} \;&\le\; \hat{s}_{i,j} - \hat{s}_{i,j-1} \;\le\; \Delta_{\max:i;j,j-1}\\
0 \;&\le\; \hat{s}_{i,j} \;\le\; Q-1
\end{aligned}
\]
for all pixel pairs ((i, j); (i − 1, j)) and ((i, j); (i, j − 1)) in R, where
\[
\begin{aligned}
\Delta_{\min:i,i-1;j} &= \min_{e\in E}\{e\,(t_{i,j} - t_{i-1,j})\}, &
\Delta_{\max:i,i-1;j} &= \max_{e\in E}\{e\,(t_{i,j} - t_{i-1,j})\},\\
\Delta_{\min:i;j,j-1} &= \min_{e\in E}\{e\,(t_{i,j} - t_{i,j-1})\}, &
\Delta_{\max:i;j,j-1} &= \max_{e\in E}\{e\,(t_{i,j} - t_{i,j-1})\}.
\end{aligned}
\]
The objective function of Eq. (1) assumes centre-symmetric residual noise. Its matching score is based on the Euclidean distance between ŝ and s under the signal constraints above, and its minimiser is determined as
\[
\hat{s} = \arg\min_{\hat{s}\in H(t;E)} \sum_{(i,j)\in R} \left(\hat{s}_{i,j} - s_{i,j}\right)^2, \tag{1}
\]
where H(t; E) denotes all images ŝ that satisfy the constraints above, i.e. all images with admissible deviations from t. This QP problem is solved with the well-known Hildreth-d'Esopo algorithm, which guarantees convergence of ŝ to a solution arbitrarily close to the global minimiser. More details on the method, including its derivation, can be found in the literature [11].
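To make the optimisation concrete, the following is a minimal 1-D sketch in Python of the QP step: it builds the relaxed neighbour-difference constraints for one signal row and runs Hildreth-d'Esopo dual coordinate ascent. The function name `hildreth_qp`, the illustrative range E = [0.8, 1.2], the fixed sweep count, the restriction to one lattice direction, and the omission of the box constraints 0 ≤ ŝ ≤ Q − 1 are all simplifications for exposition, not the authors' implementation.

```python
import numpy as np

def hildreth_qp(s, t, e_min=0.8, e_max=1.2, sweeps=500):
    """Minimise ||x - s||^2 subject to
         Delta_min[k] <= x[k+1] - x[k] <= Delta_max[k],
       where the Delta bounds relax the template's neighbour differences
       by the multiplicative range E = [e_min, e_max].  Solved by
       Hildreth-d'Esopo coordinate ascent on the dual (the box
       constraints on x are omitted here for brevity)."""
    s, t = np.asarray(s, float), np.asarray(t, float)
    n = len(s)
    d = np.diff(t)                           # template neighbour differences
    lo = np.minimum(e_min * d, e_max * d)    # Delta_min = min_{e in E} e*d
    hi = np.maximum(e_min * d, e_max * d)    # Delta_max = max_{e in E} e*d

    # Stack all inequalities as A x <= b (two per neighbour pair).
    A = np.zeros((2 * (n - 1), n))
    b = np.empty(2 * (n - 1))
    for k in range(n - 1):
        A[2 * k, k + 1], A[2 * k, k] = 1.0, -1.0   #  x[k+1] - x[k] <= hi[k]
        b[2 * k] = hi[k]
        A[2 * k + 1] = -A[2 * k]                   # -(x[k+1] - x[k]) <= -lo[k]
        b[2 * k + 1] = -lo[k]

    lam = np.zeros(len(b))                   # dual variables, lam >= 0
    x = s.copy()                             # primal, kept equal to s - A^T lam
    row_norm = (A * A).sum(axis=1)
    for _ in range(sweeps):                  # Gauss-Seidel sweeps on the dual
        for i in range(len(b)):
            new = max(0.0, lam[i] + (A[i] @ x - b[i]) / row_norm[i])
            x -= A[i] * (new - lam[i])       # incremental primal update
            lam[i] = new
    return x
```

Applied row-wise, e.g. `hildreth_qp(s_row, t_row)`, this gives a feel for how the range E shapes ŝ; the full method couples both lattice directions and the signal box bounds, yielding the 6N constraints quoted above.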
G-model: Robust affine image registration. Let γ = [γ_0 γ_1 γ_2]^T and δ = [δ_0 δ_1 δ_2]^T denote the six parameters θ = (γ, δ) of the affine transform. Given the pixels s_{i,j} ∈ s and t_{i,j} ∈ t with (i, j) ∈ R, the transformation field (Δi, Δj) is defined as
\[
\Delta i = \gamma^T v = \gamma_0 + \gamma_1 i + \gamma_2 j, \qquad
\Delta j = \delta^T v = \delta_0 + \delta_1 i + \delta_2 j, \tag{2}
\]
where v = [1 i j]^T. The affine transformation expressed in terms of this transformation field (Δi, Δj) can be rewritten as
\[
s_{i+\Delta i,\, j+\Delta j} = t_{i,j} + \varepsilon_{i,j} \quad \text{for all } (i, j) \in R, \tag{3}
\]
where ε_{i,j} is centre-symmetric p-noise due to the imprecise nature of the g-model. Finding an estimate of the affine transformation field directly is hard. The truncated first-order Taylor series expansion of s_{i+Δi, j+Δj} simplifies the problem to
\[
s_{i,j} = t_{i-\Delta i,\, j-\Delta j} - \left(\frac{ds}{di}\,\gamma + \frac{ds}{dj}\,\delta\right)^{\!T} v + \varepsilon_{i,j}
\;\equiv\; t_{i-\Delta i,\, j-\Delta j} - \theta^T c_{i,j} + \varepsilon_{i,j}, \tag{4}
\]
Fig. 3. All 10 images with frontal illumination 15◦ R and 0◦ U from the MIT dataset [9].
where the gradients ds/di and ds/dj are approximated by differences between the pixel at location (i, j) and its two adjacent neighbours [7]. Zero gradient scalars are chosen as the border condition. The weights c_{i,j} in the linear combination of the affine parameters θ in Eq. (4) can be computed from the respective gradient values and pixel coordinates. The minimiser θ̂ is found by solving the robust formulation of affine registration
\[
\hat{\theta} = \arg\min_{\theta=(\gamma,\delta)} \sum_{(i,j)\in R} \rho\!\left(s_{i,j} - t_{i-\Delta i,\, j-\Delta j} + \theta^T c_{i,j}\right), \tag{5}
\]
where an arbitrary M-estimator ρ(·), e.g. the Lorentzian ρ(z) = log(1 + z²/2), can be employed. The cost function can be minimised by any appropriate numerical procedure; in this paper, the re-weighted iterative least-squares method was implemented. The method uses the idea of W-estimation, which is conditionally equivalent to the Gauss-Newton method as shown elsewhere [11]. The approximate solution offered by this method was previously shown to be satisfactory for image matching purposes [8].
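For illustration, here is a minimal Python sketch of one linearisation of Eq. (5) solved by re-weighted iterative least squares with Lorentzian weights (for ρ(z) = log(1 + z²/2), the IRLS weight is w(z) = ρ'(z)/z = 1/(1 + z²/2)). It linearises once around θ = 0 instead of re-warping t between iterations, approximates the gradients on t, and the helper name `irls_affine` is ours; treat it as a sketch of the W-estimation idea, not the exact implementation of [8] or [11].

```python
import numpy as np

def irls_affine(s, t, iters=20, eps=1e-9):
    """One linearisation of robust affine registration, Eq. (5):
       minimise sum rho(r + C theta) for the Lorentzian estimator,
       with theta = (gamma, delta) the six affine parameters."""
    s, t = np.asarray(s, float), np.asarray(t, float)
    gi, gj = np.gradient(t)                       # gradients ds/di, ds/dj (approx.)
    i, j = np.indices(t.shape).astype(float)
    # c_{i,j}: gradient values times v = [1 i j]^T, stacked for gamma then delta
    C = np.stack([gi, gi * i, gi * j,
                  gj, gj * i, gj * j], axis=-1).reshape(-1, 6)
    r = (s - t).ravel()                           # residual at theta = 0
    theta = np.zeros(6)
    for _ in range(iters):
        res = r + C @ theta
        w = 1.0 / (1.0 + res ** 2 / 2.0)          # Lorentzian W-estimation weights
        CW = C * w[:, None]
        # weighted normal equations: (C^T W C) theta = -C^T W r
        theta = np.linalg.solve(C.T @ CW + eps * np.eye(6), -(CW.T @ r))
    return theta                                  # [g0, g1, g2, d0, d1, d2]
```

In practice one would alternate such a step with re-warping t by the current θ (and typically run it coarse-to-fine) so that the Taylor approximation stays valid for large displacements.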
4 Experimental Results
Dataset. All experiments are based on the MIT dataset [9]. It was selected for the consistency of scene poses across all images in the set and for the great deal of inhomogeneous intrinsic p-noise under strictly controlled, varying illumination, as demonstrated in Fig. 2. The database contains 360 images of 10 persons (36 per subject) captured with different vertical and horizontal orientations of two dominant illumination sources: the former changes from 0◦ (direct) to 75◦ (top), and the latter from 15◦ (direct) to 90◦ (right). An additional ambient light source makes all facial features visible regardless of the dominant light source position. All the images have the same facial expression, geometry and
Fig. 4. A typical experimental setup for each of the five test cases. Images in each vertical triplet show a template (top), the template with p-noise (centre), and the template with both p-noise and an affine transformation (bottom). The bottom image is the target in the template-target pair. Images are from the MIT dataset [9].
backgrounds so that the experiments could be conducted under controlled conditions. Images of all 10 subjects taken under the frontal illumination (15◦ R and 0◦ U) are reproduced in Fig. 3.
Experiments. The proposed unified framework for image matching has been tested on a set of images subjected to appropriate p- and g-transformations, forming the five test cases described below. The introduced affine transformations combine translation, rotation, and scaling. A typical experimental setup for all five test cases is shown in Fig. 4, where a translation of (40/200, 10/200), a rotation of π/36, and a vertical scaling of 0.92 were applied to the template image. The affine transformations were identical in all experiments and were the largest possible that keep the face within the image lattice boundaries. In terms of p-variations, the test cases are identical to the ones defined for the QP-based approach [12]: all variables are the same, including the selected dataset [9], the benchmark [8], the stopping rule, and the multiplicative range E.
1. Base Case: Only affine noise is involved; no p-variations.
2. Uniform Case: A potentially lossy "darkening" p-transformation on low-value signals, truncated to stay within the valid signal range [0..255]. The function f_{i,j} = 1.3 t_{i,j} − 40/256 was used to simulate uniform p-noise.
3. Polynomial Case: 2nd-order polynomial noise was introduced with the "brightening" function f_{i,j} = (0.035 i − 0.00021 i²) t_{i,j} + 28/256. As in Case (2), transformed values outside [0..255] were truncated.
[Fig. 5 layout: columns — template t, s0, target s, benchmark ŝ with residual (s0 − ŝ), QP ŝ with residual (s0 − ŝ); rows — Base, Uni, Poly, Intr I, Intr II.]
Fig. 5. Experimental results for all five test cases (Subject 01 from Fig. 3 [9]). Residual differences of the model ŝ from the geometrically untransformed target s0 are scaled up for visualisation purposes. The smaller, i.e. the darker, the residual, the better the pixel-wise matching.
4. Intrinsic Case: An image of the same subject taken under different p-conditions was chosen randomly.
5. Intrinsic Swapped Case: Same as above, but with the template and target swapped.
Of all the test cases, the most important are the intrinsic Cases (4) and (5), because the algorithm's success must be judged on the matching task involving both inhomogeneous p-noise and affine g-variations. Recall that the current state-of-the-art fails under these conditions, either because intrinsic variations caused by, for example, shadows exceed the model complexity [8], or because geometry cannot be modelled at all [12]. Although the proposed method outperforms the benchmark in all five cases, the improvements shown in Cases (4) and (5) are of greater theoretical significance.
The typical outcome of a five-test-case run and the comparison of the proposed method with the benchmark are demonstrated in Figs. 5-7. The results for 36 target-template pairs per scenario are validated in Table 1. The proposed method outperforms the benchmark in all implemented cases: the error means have been reduced by factors ranging from 2.0 to 4.6. The improvement is statistically significant (p < 0.0002) in 4 out of 5 cases, while Base Case (1) still shows a satisfactory p-value of 5.4%. Predictably, the highest registered mean errors occurred in the polynomial case.
[Fig. 6 layout: columns — template t, s0, target s, benchmark ŝ with residual (s0 − ŝ), QP ŝ with residual (s0 − ŝ); rows — Base, Uni, Poly, Intr I, Intr II.]
Fig. 6. Experimental results for all five test cases (Subject 04 from Fig. 3 [9]). Residual differences of the model ŝ from the geometrically untransformed target s0 are scaled up for visualisation purposes. The smaller, i.e. the darker, the residual, the better the pixel-wise matching.
Table 1. Total squared differences ‖ŝ − s‖² × 10⁶ for the experiments run (κ denotes the mean improvement ratio, and the p-value is for the one-tailed hypothesis H0: µ_benchmark > µ_our algorithm).

                       Benchmark        Our algorithm     Analysis
Test case              Mean    Std      Mean    Std       κ      p-value
Base                    0.4    0.5       0.2    0.3       2.0    0.054
Uniform                13.0   11.3       3.8    6.1       3.4    < 0.0002
Poly                   16.1   13.6       6.9    6.1       2.3    < 0.0002
Intrinsic               8.6    6.8       1.8    2.5       4.6    < 0.0002
Intrinsic Swapped       8.2    6.0       2.1    2.5       4.0    < 0.0002
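For readers reproducing the analysis column, the score and test can be computed along these lines. This is a sketch under stated assumptions: `bench` and `ours` stand for the 36 per-pair scores of one test case (not reproduced here), and Welch's one-tailed t-test via `scipy.stats.ttest_ind(..., alternative='greater')` is our choice — the paper does not state which t-test variant was used.

```python
import numpy as np
from scipy import stats

def match_score(s_hat, s):
    """Total squared difference ||s_hat - s||^2, divided by 1e6 on the
    assumption that Table 1 reports scores in units of 10^6."""
    d = np.asarray(s_hat, float) - np.asarray(s, float)
    return (d ** 2).sum() / 1e6

def compare(bench, ours):
    """Mean improvement ratio kappa and one-tailed p-value for the
    benchmark mean exceeding ours (Welch's unequal-variance t-test)."""
    kappa = np.mean(bench) / np.mean(ours)
    _, p = stats.ttest_ind(bench, ours, equal_var=False, alternative='greater')
    return kappa, p
```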
As was mentioned above, this is the result of the loss of signal depth due to value truncation. It should also be emphasised that the greatest improvement in mean error ratios was achieved in Cases (4) and (5). This shows that the point of greatest theoretical interest has been successfully addressed, which constitutes the main contribution of this work.
[Fig. 7 layout: columns — template t, s0, target s, benchmark ŝ with residual (s0 − ŝ), QP ŝ with residual (s0 − ŝ); rows — Base, Uni, Poly, Intr I, Intr II.]
Fig. 7. Experimental results for all five test scenarios (Subject 05 from Fig. 3 [9]). Residual differences of the model ŝ from the geometrically untransformed target s0 are scaled up for visualisation purposes. The smaller, i.e. the darker, the residual, the better the pixel-wise matching.
5 Conclusions
The proposed new image matching algorithm successfully combines the recent photometric-only QP-based matching with robust affine registration and achieves a marked performance improvement when dealing with inhomogeneous photometric noise caused, for example, by varying illumination of a 3D scene. The proposed algorithm preserves the modularity of its p- and g-components: individually, each component provides its own robust matching methodology, and combined they improve on the known state-of-the-art approaches based on a low-order polynomial model, or on any other noise model limited by the upper bound of such a polynomial [14,13]. The proposed approach does not restrict the expressiveness of the photometric noise model, yet it remains robust.
References
1. Basri, R., Jacobs, D., Kemelmacher, I.: Photometric stereo with general, unknown lighting. International Journal of Computer Vision 72(3), 239–257 (2007)
2. Chen, J., Chen, C., Chen, Y.: Fast algorithm for robust template matching with M-estimators. IEEE Trans. on Signal Processing 51(1), 230–243 (2003)
3. Crowley, J., Martin, J.: Experimental comparison of correlation techniques. In: Proc. International Conference on Intelligent Autonomous Systems (IAS-4), pp. 86–93. Karlsruhe, Germany (March 27–30, 1995)
4. Fitch, A., Kadyrov, A., Christmas, W., Kittler, J.: Fast robust correlation. IEEE Trans. on Image Processing 14(8), 1063–1073 (2005)
5. Gruen, A.: Adaptive least squares correlation: a powerful image matching technique. South African Journal of Photogrammetry, Remote Sensing and Cartography 14(3), 175–187 (1985)
6. Kovalevsky, V.: The problem of character recognition from the point of view of mathematical statistics. In: Kovalevsky, V. (ed.) Character Readers and Pattern Recognition. Spartan, New York (1968)
7. Lai, S.: Robust image matching under partial occlusion and spatially varying illumination change. Computer Vision and Image Understanding 78(1), 84–98 (2000)
8. Lai, S., Fang, M.: Method for matching images using spatially-varying illumination change models. US patent 6,621,929 (Sep 2003)
9. M.I.T. face database [online] (accessed 24 Aug 2006) http://vismod.media.mit.edu/pub/images
10. Pizarro, D., Peyras, J., Bartoli, A.: Light-invariant fitting of active appearance models. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2008), pp. 1–6. Anchorage, Alaska (June 2008)
11. Shorin, A.: Modelling Inhomogeneous Noise and Large Occlusions for Robust Image Matching. Ph.D. thesis, University of Auckland (2010)
12. Shorin, A., Gimel'farb, G., Delmas, P., Morris, J.: Image matching with spatially variant contrast and offset: a quadratic programming approach. In: Proc. Joint IAPR Int. Workshops on Structural and Syntactic Pattern Recognition (SSPR 2008) and Statistical Pattern Recognition (SPR 2008), pp. 100–107. Lecture Notes in Computer Science, vol. 5342. Springer, Berlin (2008)
13. Silveira, G., Malis, E.: Real-time visual tracking under arbitrary illumination changes. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2007), pp. 1–6 (17–22 June 2007)
14. Tombari, F., Di Stefano, L., Mattoccia, S.: A robust measure for visual correspondence. In: Proc. 14th Int. Conf. on Image Analysis and Processing (ICIAP), pp. 376–381. Modena, Italy (Sept 2007)
15. Wei, S., Lai, S.: Robust and efficient image alignment based on relative gradient matching. IEEE Trans. on Image Processing 15(10), 2936–2943 (2006)
16. Yang, C., Lai, S., Chang, L.: Robust face image matching under illumination variations. Journal on Applied Signal Processing 2004(16), 2533–2543 (2004)
17. Zhu, G., Zhang, S., Chen, X., Wang, C.: Efficient illumination insensitive object tracking by normalized gradient matching. IEEE Signal Processing Letters 14(12), 944–947 (2007)
18. Zou, J., Ji, Q., Nagy, G.: A comparative study of local matching approach for face recognition. IEEE Trans. on Image Processing 16(10), 2617–2628 (2007)