A Klein-Bottle-Based Dictionary for Texture Representation

Jose A. Perea¹ ([email protected]) and Gunnar Carlsson² ([email protected])

¹ Department of Mathematics, Duke University, Durham NC 27708
² Department of Mathematics, Stanford University, Stanford CA 94305

Abstract

A natural object of study in texture representation and material classification is the probability density function, in pixel-value space, underlying the set of small patches from the given image. Inspired by the fact that small n × n high-contrast patches from natural images in gray scale accumulate with high density around a surface 𝒦 ⊆ R^{n²} with the topology of a Klein bottle [6], we present in this paper a novel framework for the estimation and representation of the distribution around 𝒦 of patches from texture images. More specifically, we show that most n × n patches from a given image can be projected onto 𝒦, yielding a finite sample S ⊆ K whose underlying probability density function can be represented in terms of Fourier-like coefficients, which in turn can be estimated from S. We show that image rotation acts as a linear transformation at the level of the estimated coefficients, and use this to define a multi-scale rotation-invariant descriptor. We test it by classifying the materials in three popular data sets: the CUReT, UIUCTex and KTH-TIPS texture databases.

1 Introduction

One representation for texture images which has proven to be highly effective in multi-class classification tasks is the histogram of texton occurrences [7, 13, 17, 18, 20, 29, 30, 33]. In short, this representation summarizes the number of appearances in an image of either patches from a fixed set of pixel patterns, or the types of local responses to a bank of filters. Each one of these pixel patterns (or filter responses, if that is the case) is referred to as a texton, and the set of textons as a dictionary. Images are then compared via their associated histograms, using measures of statistical similarity such as the Earth Mover's distance [23], the Bhattacharyya metric [25], or the χ² similarity test [20].

For images in gray scale and dictionaries with finitely many elements, the coding or labeling of n × n pixel patches, represented as column vectors of dimension n², can be seen as fixing a partition R^{n²} = C₁ ∪ · · · ∪ C_d of R^{n²} into d distinct classes, each associated to a texton, and letting a patch contribute to the count of the i-th bin in the histogram if and only if it belongs to C_i. For instance, if a patch is labeled according to the dictionary element to which it is closest, with respect to a given norm, then the classes C_i are exactly the Voronoi regions associated to the textons and the norm. If labeling is by maximum response to a (normalized) filter bank, which amounts to selecting the filter with which the patch has largest inner product, then the classes C_i are the Voronoi regions associated to the filters, with distances measured with the norm induced by the inner product used in the filtering stage. More generally, any partition of filter response (or feature) space induces a partition of patch space, by letting two patches be in the same class if and only if the same holds true for their filter responses (resp. associated features).

Several partition schemes have been proposed in the literature. Leung and Malik [20], and Varma and Zisserman [29, 30], have used k-means clustering on filter responses and patches from training images to derive textons, and let the partition be the associated Voronoi tessellation. Jurie and Triggs implemented in [17] an adaptive density-driven tessellation of patch space, while Varma and Zisserman [28] showed that it is possible to obtain high classification success rates with equally spaced bins in an 8-dimensional filter response space. The best classification results (to date and to our knowledge), consistently across widely different and challenging data sets, are those of Crosier and Griffin in [7]. They propose a geometrically driven partition of a feature space corresponding to multi-scale configurations of image primitives such as blobs, bars and edges.

While the relative merits of one partition over another have been extensively documented in the aforementioned studies, one idea emerges as the consensus: the statistical distribution of patches, in pixel-value space, is one of the most powerful signatures of visual appearance and material class for a texture image. Moreover, any reasonable estimate of the underlying probability density function (the different histograms above, arising from the different partitions, being examples) is highly effective in a variety of challenging classification tasks. One question that arises, however, is what the right space for fitting the distribution should be. That is, should one model how patches from a particular texture class are distributed in pixel-value space, or should the estimation problem be how they accumulate around predictable and low-dimensional submanifolds? This question is motivated by the following observations:
1. Varma and Zisserman [28] have noted that when using a regular grid-like partition of an 8-dimensional filter response space, with 5 equally spaced bins per dimension, most bins remain empty. As seen in Figure 8 of [28], 5 bins per dimension maximizes classification accuracy in the CUReT database, and increasing or decreasing the number of bins hinders performance. One conclusion which can be drawn from this observation is that filter responses from patches likely do not fill out response space completely, but instead populate a lower-dimensional (and likely highly non-linear) submanifold.

2. State-of-the-art classifiers require histograms with thousands of bins: 1,296 for the BIFs classifier of Crosier and Griffin [7], and 2,440 for the Joint classifier of Varma and Zisserman [29]. This is a consequence of the dimension of the feature space where the distribution is being estimated. Since these histograms are usually compared using the nearest neighbor approach, which in high dimensions is computationally costly at best and meaningless at worst [4], low-dimensional representations are desirable.

3. Carlsson et al. have shown [6] that even when (mean-centered and contrast-normalized) high-contrast 3 × 3 patches from natural images populate their entire state space (of dimension 7), most of them accumulate with high density around a 2-dimensional submanifold 𝒦 with the topology of a Klein bottle [12]. A detailed explanation of these results will be provided in Section 2.

We present in this paper a novel framework for estimating and representing the distribution, around low-dimensional submanifolds of pixel space, of patches from texture images. In particular, we show that:

• Propositions 3.1 and 3.2: Most patches from a texture image can be continuously projected onto a model of the Klein bottle, which we denote by K, yielding a finite sample S ⊆ K. The choice of this space comes from point 3 above.

• Theorem 3.3: It is possible to construct orthonormal bases for L²(K, R), the space of square-integrable functions from K to R, akin to step functions (which yield histogram representations) and Fourier bases. We actually show a bit more: any orthonormal basis for L²(K, C) can be recovered explicitly from one of L²(T, C), where T = S¹ × S¹ denotes the torus. From here on, we will use the notation L²(X) = L²(X, C).

• Theorem 3.8: If f : K −→ R is the underlying probability density function, then its coefficients with respect to any orthonormal basis for L²(K, R) can be estimated from S.

• Theorem 3.10: Image rotation has an especially simple effect on the estimated coefficients of f with respect to the K-Fourier basis. We use this to define a multi-scale rotation-invariant representation for images, which we call the EKFC-descriptor. Here EKFC stands for Estimated K-Fourier Coefficients.

• The EKFC-descriptor, along with a metric learnt through the Large Margin Nearest Neighbor approach [32], yields high classification success rates on the CUReT, KTH-TIPS and UIUCTex texture databases (Table 2).

The paper is organized as follows: In Section 2 we review the results from [19] and [6] on the Klein bottle model 𝒦, and discuss how the methods from computational topology [5] can be used to model low-dimensional subspaces of pixel-value space parametrizing sets of relevant patches. Section 3 is where we develop the mathematical machinery driving our approach. We describe a method for continuously projecting patches onto the Klein bottle 𝒦, how to generate (all) orthonormal bases for L²(K, R), how to estimate the associated coefficients, and the effect of image rotation on the resulting K-Fourier representation. We begin Section 4 with a detailed description of the data sets used in the paper and the implementation details of our method, as well as examples of computing the K-Fourier coefficients and the effects of image rotation. We end that section with the classification results obtained on the CUReT, KTH-TIPS and UIUCTex texture databases.

2 Topological Dictionary Learning

The study of the global geometry and statistical structure of spaces of small patches extracted from natural images was initiated in [19] by Mumford et al. One of the goals was to understand how high-contrast 3 × 3 patches are distributed in pixel-value space, R⁹, and the extent to which there is a global organization, with respect to their pixel patterns, around a recognizable structure. In order to model this distribution, a data set M of approximately 4 × 10⁶ patches from natural images was created. We now summarize their findings.

The van Hateren Natural Image database (available at http://www.kyb.tuebingen.mpg.de/?id=227) is a collection of approximately 4,000 monochrome calibrated images photographed by Hans van Hateren around Groningen (Holland), in town and the surrounding countryside [26].
Figure 1: Exemplars from the van Hateren Natural Images database.

From each image, five thousand 3 × 3 patches were selected at random, and then the entry-wise logarithm of each patch was calculated. This "linearization" step is motivated by Weber's law, which states that the ratio ∆L/L between the just noticeable difference ∆L and the ambient luminance L is constant for a wide range of values of L. Next, out of the 5,000 patches from a single image, the top 20 percent with respect to contrast were selected. Contrast is measured by means of the D-norm ‖·‖_D, where an n × n patch P = [p_{ij}] yields the vector v = [v₁, ..., v_{n²}]^T given by v_{n(j−1)+i} = p_{ij}, and one lets

    ‖P‖²_D = ∑_{r∼t} (v_r − v_t)²,

where v_r = p_{ij} is related (∼) to v_t = p_{kl} if and only if |i − k| + |j − l| ≤ 1.

One thing to note is that the D-norm ‖·‖_D can be seen as a discrete version of the Dirichlet semi-norm

    ‖f‖²_D = ∫∫_{[−1,1]²} ‖∇f(x, y)‖² dx dy,

which is the unique scale-invariant norm on images f. The vectors from the log-intensity, high-contrast patches were then centered by subtracting their mean, and normalized by dividing by their D-norm. This yields approximately 4 × 10⁶ points in R⁹, on the surface of a 7-dimensional ellipsoid.
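As a concrete aside, the D-norm machinery just described is easy to make explicit. The following sketch (ours, in Python; the names `d_matrix` and `preprocess` are hypothetical) builds the matrix D discussed next from the relation ∼, and applies the log/center/normalize pipeline; whether the sum over r ∼ t counts each unordered pair once or twice only rescales D by a constant.

```python
import numpy as np

def d_matrix(n):
    """Symmetric n^2 x n^2 matrix D with ||P||_D^2 = <v, Dv>, where
    v_{n(j-1)+i} = p_{ij} and r ~ t iff |i-k| + |j-l| <= 1 (4-neighbors;
    the r = t terms contribute nothing)."""
    D = np.zeros((n * n, n * n))
    idx = lambda i, j: n * j + i                 # 0-based column-major index
    for i in range(n):
        for j in range(n):
            for di, dj in ((0, 1), (1, 0)):      # each neighboring pair once
                if i + di < n and j + dj < n:
                    r, t = idx(i, j), idx(i + di, j + dj)
                    D[r, r] += 1; D[t, t] += 1   # (v_r - v_t)^2 expands into
                    D[r, t] -= 1; D[t, r] -= 1   # these four matrix entries
    return D

def preprocess(patch, D):
    """Entry-wise log (Weber's law), mean-center, D-normalize; returns None
    for low-contrast patches (threshold 0.01, as in subsection 4.2)."""
    v = np.log(patch.flatten(order="F").astype(float) + 1.0)
    v = v - v.mean()
    d_norm = np.sqrt(v @ D @ v)
    return v / d_norm if d_norm > 0.01 else None
```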
The combinatorics of the relation ∼ on pixels from n × n patches can be described by means of a symmetric n² × n² matrix D satisfying ‖P‖²_D = ⟨v, Dv⟩. When n = 3, there exists a convenient basis for R⁹ of eigenvectors of D called the DCT basis. This basis consists of the constant vector [1, ..., 1]^T, which is perpendicular to the hyperplane containing the centered and contrast-normalized data, and eight vectors which in patch space take the form depicted in Figure 2.
Figure 2: DCT basis in patch space.

By taking the coordinates, with respect to the DCT basis, of the points in the aforementioned 7-ellipsoid, we obtain a subset M of the 7-sphere

    S⁷ = {x ∈ R⁸ : ‖x‖ = 1},

which models the distribution of small optical patches in their state space. The statistical analysis of M presented in [19] yielded the following conclusion: high-contrast 3 × 3 patches from natural images accumulate with high density around a highly non-linear 2-dimensional submanifold of S⁷. The density observation is that 50% of the points in M occupy a region which accounts for only 9% of the volume of S⁷, while the remaining points populate the 7-sphere fully and sparsely. The predicted surface comes from a Voronoi tessellation of S⁷, in which the centers of the densest cells correspond to ideal edges, whose geometric pixel pattern can be described with two numbers: the direction of the edge, and its distance from the center.

The global structure and a parametrization for the predicted surface were determined in [6], using the persistent homology formalism: an adaptation of tools from algebraic topology to the world of data [5]. Algebraic topology, an old branch of mathematics, studies the global organization of certain classes of mathematical structures by associating algebraic invariants such as vector spaces and linear maps between them. For closed surfaces without boundary that can be embedded in a bounded subset of R^d, d ≥ 3, the algebraic invariants that measure the number of holes and orientability are complete. That is, they are enough to determine the identity of a surface, up to continuous deformations which do not change the number of holes. Please refer to [14] for a thorough treatment of homology and the classification theorem of compact surfaces.

The main result of [6] was certainly unexpected: 50% of the points in M accumulate with high density around a surface 𝒦 with the topology of a Klein bottle. The surprise comes not from the fact that it is a surface, since this was already argued in [19], but that the 2-manifold that arose was non-orientable. There are many descriptions for spaces with the topology of a Klein bottle [12]. A simple model, which we will use throughout the paper, is the space K obtained from the rectangle R = [π/4, 5π/4] × [−π/2, 3π/2] when (α, −π/2) is identified with (α, 3π/2), and (π/4, θ) is identified with (5π/4, π − θ), for every (α, θ) ∈ R. See Figure 3.
Figure 3: Klein bottle model K. (α, −π/2) is identified with (α, 3π/2) and (π/4, θ) is identified with (5π/4, π − θ) for every (α, θ).

There is a simple correspondence between K and 𝒦: for (α, θ) ∈ K let a + ib = e^{iα} and c + id = e^{iθ}, let p be the polynomial

    p(x, y) = c · (ax + by)/2 + d · √3 (ax + by)²/4,

and let P = [p_{ij}] be the 3 × 3 patch obtained via the evaluation

    p_{ij} = p(−1 + (2j − 1)/3, 1 − (2i − 1)/3),   i, j = 1, 2, 3.

It follows that each (α, θ) ∈ K determines a unique patch P, and that by taking entry-wise logarithms, centering, and D-normalizing P, we get a unique element in 𝒦. We show in Figure 4 a lattice of elements from 𝒦, arranged according to their coordinates (α, θ) ∈ K.
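To make the dictionary tangible, here is a short sketch (ours; the function name is hypothetical) that realizes this correspondence: given (α, θ) ∈ K, it evaluates the polynomial p on the n × n grid above (n = 3 in [6]) to produce the corresponding patch.

```python
import numpy as np

def klein_patch(alpha, theta, n=3):
    """Patch for (alpha, theta) in K: evaluate
    p(x, y) = c*(ax + by)/2 + d*sqrt(3)*(ax + by)^2 / 4
    at x = -1 + (2j - 1)/n, y = 1 - (2i - 1)/n, where a + ib = e^{i alpha}
    and c + id = e^{i theta}."""
    a, b = np.cos(alpha), np.sin(alpha)
    c, d = np.cos(theta), np.sin(theta)
    j = np.arange(1, n + 1)
    x = -1 + (2 * j - 1) / n                    # column coordinates
    y = (1 - (2 * j - 1) / n)[:, None]          # row coordinates
    u = a * x + b * y                           # broadcast to an n x n grid
    return c * u / 2 + d * np.sqrt(3) * u ** 2 / 4
```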
Figure 4: A lattice of patches in 𝒦.

The prominence of bars and edges as image primitives has been extensively documented in the literature. At the physiological level, it is known that the visual cortex in cats and macaques features large arrays of neurons sensitive only to stimuli from straight lines at specific orientations [15, 16]. From an information-theoretical perspective, they arise as the filters which minimize redundancy in a linear coding framework [3]; while in feature learning for image representation, bars and edges have appeared as prevalent dictionary elements [17, 30]. The Klein bottle model 𝒦 can thus be regarded as a codebook for two of the most important features in natural images, not only from a statistical standpoint, but from a physiological and computational one. Its continuous character allows one to bypass the quantization problem (provided one has a "continuous projection" such as the one described in subsection 3.1), yielding more accurate representations, while its low intrinsic dimensionality (a 2-manifold) provides the sparseness and economy desirable in any coding scheme.

These two characteristics, however, are only available if one understands the global geometric organization of sets of relevant patches or filters. It is this understanding that we refer to as Topological Dictionary Learning. If a set of relevant patches, e.g. in high-density regions of state space or with high discriminative power, can be described with a few attributes having some form of symmetry, then the existence of a low-dimensional geometric object which parametrizes them is to be expected. In this case, the topology of the underlying space can be uncovered using the persistent homology formalism [5], while the parametrization step can be aided with methods such as Circular Coordinates [9] or the Topological Nudged Elastic Band (preprint available at http://arxiv.org/abs/1112.1993). The Klein bottle model 𝒦 presented in [6] was indeed the first realization of the Topological Dictionary Learning program, but other features can also be succinctly described. Let us consider, for instance, the set of n × n patches depicting bars as the ones in Figure 5.

Figure 5: 7 × 7 patches in gray scale depicting bars at several positions and orientations.

One way of thinking about these patches is as the set of lines in R², union an extra point representing the constant patch. It follows that each line is given by its orientation and distance to the origin. For a parametrization, let us consider the upper hemisphere of S² in spherical coordinates

    x = sin(φ) cos(θ),   y = sin(φ) sin(θ),   z = cos(φ),   0 ≤ φ ≤ π/2,   0 ≤ θ < 2π,

and let (φ, θ) describe a patch P as follows: if 0 < φ ≤ π/2, let P correspond to the line ℓ in R² with parametric equation

    ℓ(t) = t (−sin(θ), cos(θ)) + cot(φ) (cos(θ), sin(θ)).

If φ = 0, then let P be the constant patch. If we flatten the upper hemisphere of S² so that we get a disk, and place on each polar coordinate the patch one obtains, then it follows that the constant patch will be at the center of the disk, and patches representing lines through the origin will be exactly at the boundary. Another thing to notice is that a boundary point (1, θ) yields the same patch as (1, θ + π), and that no other duplications occur. In summary, the set of n × n patches depicting lines at different orientations and positions, along with the constant patch, can be parametrized via a 2-dimensional disk where boundary points have been identified with their antipodes. Thus, the underlying space is the real projective plane RP², a 2-dimensional manifold inside pixel-value space R^{n²}. Notice that not only do we get a considerable reduction in dimensionality, but the fact that RP² can be modeled as a quotient of S² allows one to apply geometric techniques from the 2-sphere to sets of patches depicting lines.

As we have seen, sets of patches which can be described in terms of a few geometric features can often be parametrized with low-dimensional objects embedded in pixel-value space. In what follows, we will concentrate on the Klein bottle dictionary 𝒦, and on how to use it in the representation of distributions of patches from texture images.
3 Representing K-Distributions

Given a digital image in gray scale, it follows from the previous section that most of its n × n patches can be regarded as points in R^{n²} close to 𝒦, if n is small with respect to the scale of features in the image. Here by "most" we mean the non-constant patches exhibiting a prominent directional pattern. These points, in turn, can be projected onto K, yielding a finite set S ⊆ K which can be interpreted as a random sample from an underlying probability density function f : K −→ R. We will refer to such an S as a K-sample, and to its associated f as its K-distribution. Assuming that f is square-integrable over K, and given an orthonormal basis {φ_k}_{k∈N} for L²(K, R), there exists a unique sequence of real numbers {f_k}_{k∈N} so that

    f = ∑_{k=1}^∞ f_k φ_k.

Moreover, since ‖f‖²_{L²} = ∑_{k=1}^∞ f_k², given ε > 0 there exists J ∈ N so that k ≥ J implies |f_k| < ε, and therefore the vector (f₁, ..., f_J) can be regarded as a way of representing (an approximation to) the distribution of n × n patches from the original image.

The representation for K-distributions we propose in this paper is an estimate of the vector (f₁, ..., f_J) from the K-sample S ⊆ K. The basis for L²(K, R) will be derived from the trigonometric basis e^{inθ} for L²(T), where T denotes the circle R/2πZ. Notice that using a basis for L²(K) consisting of step functions, instead of complex exponentials, recovers a histogram representation.

We will show in this section how to go from an image to the K-sample S ⊆ K, how to construct bases for L²(K), and then how to estimate the coefficients f₁, ..., f_J from S. We show that the estimators for the f_k's can be chosen to be unbiased and with convergence almost surely as |S| → ∞ (Theorem 3.8); that taking finitely many coefficients for our representation is nearly optimal with respect to mean square error; and that when {φ_k}_{k∈N} is derived from the trigonometric basis e^{inθ}, image rotation changes our representation via a specific linear transformation (Theorem 3.10).
3.1 From images to K-samples

Given an n × n patch P = [p_{ij}] from a digital image in gray scale, we can think of it as coming from a differentiable (or almost-everywhere differentiable) function

    f_P : [−1, 1] × [−1, 1] −→ R

via the evaluation

    p_{ij} = f_P(−1 + (2j − 1)/n, 1 − (2i − 1)/n).

In this context, we will refer to f_P as an extension of P. We say that a patch P is purely directional if there exist an extension f_P of P, a unitary vector (a, b) ∈ R², and a function g : R −→ R such that f_P(x, y) = g(ax + by) for every x, y ∈ [−1, 1]. The Klein bottle model 𝒦 we have presented can therefore be understood as the set of purely directional patches in which g is a degree 2 polynomial without constant term, and so that f_P is unitary with respect to the Dirichlet semi-norm

    ‖f_P‖²_D = ∫∫_{[−1,1]²} ‖∇f_P‖² dx dy.

This interpretation allows us to formulate our two-step strategy for projecting a given patch onto K:

1. Find a unitary vector (a, b) which on average is the most parallel to ∇f_P. This way, f_P will be as constant as possible in the direction (−b, a).

2. Let the projection of P onto 𝒦 (hence K) be the patch corresponding to the polynomial

    p(x, y) = c · (ax + by)/2 + d · √3 (ax + by)²/4,   c² + d² = 1,

which is closest to f_P(x, y) as measured by ‖·‖_D.

Let us see how to carry this out. If f_P is an extension of P, then we have the associated quadratic form

    Q_P(v) = ∫∫_{[−1,1]²} ⟨∇f_P(x, y), v⟩² dx dy.

Since ⟨∇f_P(x, y), v⟩², as a function of unitary v, is maximized at (x, y) when v is parallel to ∇f_P(x, y), maximizing Q_P is in fact equivalent to finding the vectors which on average are the most parallel to ∇f_P.

Proposition 3.1. Let P be a patch, f_P an extension, and let Q_P be the associated quadratic form, which we write in matrix form as Q_P(v) = v^T A_P v for A_P symmetric.

1. If P is purely directional and f_P(x, y) = g(ax + by), for a² + b² = 1, then

    Q_P((a, b)) = max_{‖v‖=1} Q_P(v).

2. Let E_max(A_P) ⊆ R² be the eigenspace corresponding to the largest eigenvalue of A_P. If S¹ ⊆ R² is the unit circle, then E_max(A_P) ∩ S¹ is exactly the set of maximizers for Q_P over S¹.

3. If the eigenvalues of A_P are distinct, then

    E_max(A_P) ∩ S¹ = {v_P, −v_P},

which determines a unique α_P ∈ [π/4, 5π/4) so that {v_P, −v_P} = {e^{iα_P}, −e^{iα_P}}.

Proof. 1. The Cauchy-Schwarz inequality implies that for every v ∈ S¹ and every almost-everywhere differentiable f_P (not necessarily purely directional) one has that

    Q_P(v) ≤ ∫∫_{[−1,1]²} ‖∇f_P‖² dx dy.

Since equality holds when f_P(x, y) = g(ax + by) and v = (a, b), the result follows.

2. Let λ_max ≥ λ_min ≥ 0 be the eigenvalues of A_P, and let B_P be a unitary matrix such that

    B_P^T A_P B_P = [[λ_max, 0], [0, λ_min]].

If v ∈ S¹ and v = B_P (v_x, v_y)^T, then 1 = ‖v‖² = v_x² + v_y² and Q_P(v) = λ_max v_x² + λ_min v_y², and therefore

    max_{‖v‖=1} Q_P(v) = λ_max.

Finally, since Q_P(v) = λ_max for v ∈ S¹ if and only if v ∈ E_max(A_P), we obtain the result.

3. If the eigenvalues of A_P are distinct, then E_max(A_P) is one-dimensional and thus intersects S¹ at exactly two antipodal points.

Proposition 3.2. Let a + ib = e^{iα} for α ∈ R, and let

    ⟨f, g⟩_D = ∫∫_{[−1,1]²} ⟨∇f(x, y), ∇g(x, y)⟩ dx dy

denote the inner product inducing the Dirichlet semi-norm ‖·‖_D. If u = (ax + by)/2, then the vector (c*, d*) ∈ S¹ which minimizes the ‖·‖_D-error

    Φ(c, d) = ‖f_P − (cu + d√3 u²)‖_D,   c² + d² = 1,

is given by

    c* = ⟨f_P, u⟩_D / √(⟨f_P, u⟩²_D + 3⟨f_P, u²⟩²_D),
    d* = √3 ⟨f_P, u²⟩_D / √(⟨f_P, u⟩²_D + 3⟨f_P, u²⟩²_D),

whenever

    ϕ(f_P, α) = ⟨f_P, u⟩²_D + ⟨f_P, u²⟩²_D ≠ 0,

and it determines a unique θ_P ∈ [−π/2, 3π/2) so that c* + id* = e^{iθ_P}.

Proof. Since u and √3 u² are orthonormal with respect to ⟨·, ·⟩_D, it follows that

    argmin_{c²+d²=1} Φ(c, d) = argmin_{c²+d²=1} Φ(c, d)²
                             = argmax_{c²+d²=1} ⟨f_P, cu + d√3 u²⟩_D
                             = argmax_{c²+d²=1} ⟨(c, d), (⟨f_P, u⟩_D, ⟨f_P, √3 u²⟩_D)⟩,

which can be found via the usual condition of parallelism, provided ϕ(f_P, α) ≠ 0.
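Numerically, Propositions 3.1 and 3.2 translate into the sketch below (our code, under stated assumptions): A_P is assembled from a piecewise-constant gradient field, α_P is the angle of its leading eigenvector, and θ_P comes from the D-inner products above. For brevity we use plain central differences instead of the boundary extension of subsection 4.2, and constant cell-area factors, which cancel in every ratio, are omitted.

```python
import numpy as np

def project_to_klein(P):
    """Project a preprocessed n x n patch onto K; returns (alpha_P, theta_P)
    or None when the patch is rejected (a sketch of Props. 3.1 and 3.2)."""
    n = P.shape[0]
    gx = np.gradient(P, axis=1)            # x grows with the column index j
    gy = -np.gradient(P, axis=0)           # y decreases with the row index i

    # Step 1: A_P = sum over pixels of (grad f_P)(grad f_P)^T.
    A = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                  [np.sum(gx * gy), np.sum(gy * gy)]])
    evals, evecs = np.linalg.eigh(A)       # eigenvalues in ascending order
    if np.isclose(evals[0], evals[1]):
        return None                        # scalar A_P: no dominant direction
    a, b = evecs[:, 1]                     # eigenvector of largest eigenvalue
    alpha = np.arctan2(b, a) % (2 * np.pi)
    if not (np.pi / 4 <= alpha < 5 * np.pi / 4):
        alpha = (alpha - np.pi) % (2 * np.pi)
        a, b = -a, -b

    # Step 2: <f_P, u>_D and <f_P, u^2>_D with u = (ax + by)/2, using
    # grad u = (a, b)/2 and grad u^2 = (ax + by)(a, b)/2.
    grid = (2 * np.arange(1, n + 1) - 1) / n
    X, Y = np.meshgrid(-1 + grid, 1 - grid)          # X[i, j], Y[i, j]
    fu = np.sum(gx * a + gy * b) / 2
    fu2 = np.sum((a * X + b * Y) * (gx * a + gy * b)) / 2
    if np.hypot(fu, np.sqrt(3) * fu2) == 0:
        return None                        # no linear or quadratic component
    theta = np.arctan2(np.sqrt(3) * fu2, fu)         # c* + i d* = e^{i theta}
    return alpha, theta if theta >= -np.pi / 2 else theta + 2 * np.pi
```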
The association P ↦ (α_P, θ_P) ∈ K, with α_P as in Proposition 3.1 and θ_P as in Proposition 3.2, is well-defined up to a consistent choice of an extension f_P, or rather that of its gradient ∇f_P. The choice we make in this paper is to let ∇f_P be piecewise constant and equal to the discrete gradient of P. The specifics on how to compute this gradient, how to deal with the case in which A_P is a scalar matrix, and what to do when the minimizer for Φ(c, d) is not well defined, will be discussed in the Implementation Details subsection (4.2).

Let us say a few words regarding the sense in which P ↦ (α_P, θ_P) ∈ K is considered a projection. The orthogonal projection onto a linear subspace of an inner product space is uniquely characterized by its distance-minimizing property. While 𝒦 is not a linear subspace of R^{n²}, the approximation

    f_P ≈ c · (ax + by)/2 + d · √3 (ax + by)²/4

attempts to minimize the ‖·‖_D-error as described in Proposition 3.2. It is because of this property that P ↦ (α_P, θ_P) is referred to as the projection of P onto K.

Our labeling scheme P ↦ (α_P, θ_P) ∈ K can be interpreted as a continuous version of the MR8 (Maximum Response 8) representation [30], which omits the Gaussian filter and its Laplacian.
Figure 6: The MR8 (Maximum Response 8) filter bank. By taking the maximum filter response (or projection) across each column, the MR8 represents a patch in terms of eight real numbers.

Indeed, at each one of three scales, the MR8 representation registers the maximum projection (or filter response) of a patch onto six rotated versions of a linear (vertical edge) and a quadratic (vertical bar) filter, yielding the first six out of eight numbers in the descriptor. The last two are the responses to a Gaussian and to the Laplacian of a Gaussian. The Klein bottle representation, on the other hand, takes into account all possible rotations of the linear and quadratic filters, and also retains the maximizing direction. In contrast to the MR8, we make our descriptor rotation-invariant not by collapsing the directional coordinate, but by determining the effect on both the projected sample S ⊆ K and the estimated Fourier-like coefficients as one rotates the image (Theorem 3.10). Once this is understood, one can make the distance between the descriptors invariant to the effects of rotation.
3.2 Constructing Bases for L²(K)

Let T ⊆ C × C be the set of pairs of complex numbers (z, w) so that |z| = |w| = 1. It follows from our description of 𝒦 as a set of polynomials that each (a + ib, c + id) ∈ T determines a unique element in 𝒦, and that (z, w), (z′, w′) ∈ T determine the same polynomial if and only if z′ = −z and w′ = −w̄, where w̄ denotes the complex conjugate of w. That is, we have a continuous action

    µ : T −→ T,   (z, w) ↦ (−z, −w̄),

of the cyclic group of order two Z/2Z = {0, 1} on the torus T, so that 1 acts via µ, and the orbit space {{t, µ(t)} | t ∈ T} can be identified with K. This identification is clarified in Figure 7.

Figure 7: Z/2Z action on T. µ can be written in polar coordinates as µ(α, θ) = (α + π, π − θ) mod 2π. Therefore the action takes the letter "q" on the upper left to the "d" on the lower right, and the orbit space can be identified with [π/4, 5π/4] × [−π/2, 3π/2] along with the gluing rules depicted by the arrows. This recovers the model K.

The quotient map q : T −→ K, on the other hand, induces an inclusion

    q* : L²(K) ↪ L²(T),   f ↦ f ∘ q,

which is isometric if we endow L²(K) with the inner product

    ⟨f, g⟩_K = (1/2π²) ∫_{−π/2}^{3π/2} ∫_{π/4}^{5π/4} f(α, θ) ḡ(α, θ) dα dθ.

Thus L²(K) sits isometrically inside L²(T) as the set of functions which remain fixed under the induced Z/2Z action

    µ* : L²(T) −→ L²(T),   f ↦ f ∘ µ.

The importance of this observation lies in that it yields a method for constructing bases for L²(K): take the average of each orbit from the Z/2Z-action, restricted to a basis of L²(T).

Theorem 3.3. If {φ_k | k ∈ N} is a (Schauder) spanning set for L²(T), then

    S = {(φ_k + µ*φ_k)/2 | k ∈ N}

is a spanning set for L²(K). If B ⊆ S is a basis for L²(K), then the Gram-Schmidt process on B yields an orthonormal basis for L²(K). Moreover, any orthonormal basis for L²(K) can be recovered from one of L²(T) in this fashion.

Proof. The first thing to notice is that if f, g ∈ L²(T), then a change of coordinates argument shows that

    ∫₀^{2π} ∫₀^{2π} (f ∘ µ) · ḡ dα dθ = ∫₀^{2π} ∫₀^{2π} f · (ḡ ∘ µ) dα dθ,

and therefore µ* : L²(T) −→ L²(T) is a self-adjoint linear operator. Moreover, since µ* is its own inverse, it is in fact an isometry and thus continuous. Let

    Π : L²(T) −→ L²(T),   f ↦ (f + µ*f)/2;

then Π is a continuous linear map which is the identity on L²(K), with µ* ∘ Π(f) = Π(f) = Π ∘ Π(f) for all f ∈ L²(T), and therefore Π is a linear projection with Img(Π) = L²(K). Since µ* self-adjoint implies that Π is also self-adjoint, Π is in fact the orthogonal projection onto L²(K).

By continuity, Π takes spanning sets to spanning sets, which is the first part of the theorem. Since Π is the orthogonal projection onto L²(K), we have Ker(Π) = L²(K)^⊥ and the orthogonal decomposition

    L²(T) = L²(K) ⊕ Ker(Π).

This implies that any orthonormal basis O of L²(K) can be extended to an orthonormal basis O′ of L²(T) so that Π(O′) = O ∪ {0}, and thus any orthonormal basis of L²(K) can be recovered from one of L²(T) as claimed.

Since bases for L²([0, 1]), such as wavelets or Legendre polynomials, can be modified to yield bases for L²(S¹) and thus for L²(T), the previous theorem gives a way of constructing lots of bases for L²(K). In particular we have

Corollary 3.4. Let {φ_n}_{n∈N} be a basis for L²(S¹). Then {φ_n(α)φ_m(θ) | n, m ∈ N} is a basis for L²(T), and therefore

    { (φ_n(α)φ_m(θ) + φ_n(α + π)φ_m(π − θ))/2 | n, m ∈ N }

is a spanning set for L²(K).

Corollary 3.5. Let n ∈ Z and m ∈ N ∪ {0}; then the set of functions

    (e^{i(nα+mθ)} + (−1)^{n+m} e^{i(nα−mθ)})/2,   where m = 0 implies n ≡ 0 mod 2,

is an orthonormal basis for L²(K), which we will refer to as the trigonometric basis for L²(K).

When restricted to real-valued functions f : K −→ R, one can start with the basis for L²(T, R) consisting of functions of the form φ(nα)ψ(mθ), where φ and ψ are either sines or cosines, and look at its image under Π. This yields

Corollary 3.6. Let n, m ∈ N and let π_{n,m} = (1 − (−1)^{n+m})π/4. Then the set of functions

    1,   √2 cos(mθ − π_{0,m}),   √2 cos(2nα),   √2 sin(2nα),
    2 cos(nα) cos(mθ − π_{n,m}),   2 sin(nα) cos(mθ − π_{n,m}),

is an orthonormal basis for L²(K, R), which we will refer to as the trigonometric basis for L²(K, R).

Definition 3.7. Let f ∈ L²(K, R) and let

    a_m = ⟨f, √2 cos(mθ − π_{0,m})⟩_K,
    b_n = ⟨f, √2 cos(2nα)⟩_K,
    c_n = ⟨f, √2 sin(2nα)⟩_K,
    d_{n,m} = ⟨f, 2 cos(nα) cos(mθ − π_{n,m})⟩_K,
    e_{n,m} = ⟨f, 2 sin(nα) cos(mθ − π_{n,m})⟩_K,

be its coefficients with respect to the trigonometric basis for L²(K, R). If we order them with respect to their (total) frequencies and alphabetic placement, as

    a₁;  a₂, b₁, c₁, d_{1,1}, e_{1,1} (frequency = 2);  a₃, d_{1,2}, d_{2,1}, e_{1,2}, e_{2,1} (frequency = 3);  a₄, b₂, ...,

then we get the ordered sequence KF(f), which we will refer to as the K-Fourier coefficients of f. We will denote by KF_w(f) the truncated sequence of K-Fourier coefficients containing those terms with frequencies less than or equal to w ∈ N.
3.3 Estimation in the Frequency Domain

We now turn our attention to the last step in the representation scheme: given an orthonormal basis {φ_k}_{k∈N} for L²(K), a probability density function f ∈ L²(K) on K, and a random sample S ⊆ K drawn according to f, estimate the coefficients f_k = ⟨f, φ_k⟩_K from S. It turns out that this problem can be solved in great generality: for any measure space (M, η) so that L²(M) is separable, one can choose unbiased estimators for the coefficients f_k. Let us motivate the definition of these estimators with the case M = R^d and {h_α | α ∈ N^d} the harmonic oscillator wave functions on R^d.

Let S(R^d) be the set of rapidly decreasing complex-valued functions on R^d (see [22]), let S′(R^d) be its algebraic dual (the space of tempered distributions), and let h_α : R^d −→ R be given by

    h_α(x₁, ..., x_d) = h_{α₁}(x₁) · · · h_{α_d}(x_d),

where α = (α₁, ..., α_d) ∈ N^d and h_k is the k-th Hermite function on R. It is known that {h_α | α ∈ N^d} is an orthonormal basis for L²(R^d) (Lemma 3, page 142, [22]), and that for every T ∈ S′(R^d) the series

    ∑_α T(h_α) h_α

converges to T in the weak-* topology. The latter is a consequence of the N-representation theorem (V.14, page 143, [22]), and suggests the definition T(φ_k) for the coefficients of a tempered distribution with respect to an arbitrary orthonormal basis for L²(R^d).

Now, the Dirac delta function centered at c ∈ R^d,

    δ(x − c) : S(R^d) −→ C,   ψ ↦ ψ(c),

is a continuous linear functional and therefore an element in S′(R^d). An application of integration by parts shows that for every ψ ∈ S(R^d) one has

    lim_{k→∞} ∫_{R^d} ψ(x) (k/√π)^d e^{−k²‖x−c‖²} dη(x) = ψ(c),

and therefore δ(x − c) is a weak limit of Gaussians with mean c, as the variance approaches zero. What this result tells us is that δ(x − c) can be interpreted as the generalized function bounding unit volume, which is zero at x ≠ c and an infinite spike at c. Moreover, if X₁, ..., X_N are i.i.d. (independent and identically distributed) random variables with probability density function f ∈ L²(R^d, R), then the "generalized statistic"

    f̂_δ(X) = (1/N) ∑_{n=1}^N δ(X − X_n)

can be thought of as the Gaussian kernel estimator (see [24]) with infinitely small width. The coefficients of f̂_δ with respect to the Hermite basis can be computed as

    f̂_δ(h_α) = (1/N) ∑_{n=1}^N h_α(x_n),

and motivate the following estimation theorem.

Theorem 3.8. Let (M, η) be a measure space so that L²(M) is separable, and let {φ_k}_{k∈N} be an orthonormal basis. If f : M −→ R is a probability density function with

    f = ∑_{k=1}^∞ f_k φ_k,

and X₁, ..., X_N are i.i.d. random variables distributed according to f, then

    f̂_k = (1/N) ∑_{n=1}^N φ_k(X_n)

is an unbiased estimator for f_k. Moreover, f̂_k converges almost surely to f_k as N → ∞.

Proof. To see that the estimators are unbiased, notice that for each k ∈ N the expected value of the estimator f̂_k is given by

    E[f̂_k] = (1/N) ∑_{n=1}^N ∫_M f · φ_k dη = ∫_M f · φ_k dη = f_k.

The convergence result is exactly the statement of the Strong Law of Large Numbers.

Definition 3.9. Let K̂F(f) denote the sequence of estimators obtained from Theorem 3.8 and the trigonometric basis for L²(K, R), and let K̂F_w(f) denote the truncated sequence corresponding to frequencies less than or equal to w ∈ N. Given a random sample

    S = {(α₁, θ₁), ..., (α_N, θ_N)} ⊆ K

drawn according to f ∈ L²(K, R), let K̂F(f, S) denote the sequence of estimated K-Fourier coefficients

    â_m = (1/N) ∑_{k=1}^N √2 cos(mθ_k − π_{0,m}),
    b̂_n = (1/N) ∑_{k=1}^N √2 cos(2nα_k),
    ĉ_n = (1/N) ∑_{k=1}^N √2 sin(2nα_k),
    d̂_{n,m} = (1/N) ∑_{k=1}^N 2 cos(nα_k) cos(mθ_k − π_{n,m}),
    ê_{n,m} = (1/N) ∑_{k=1}^N 2 sin(nα_k) cos(mθ_k − π_{n,m}),

and let K̂F_w(f, S) be its truncated version.
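Definition 3.9 transcribes directly into code: each estimated coefficient is a sample mean over the K-sample, ordered by total frequency as in Definition 3.7. The sketch below is ours (including the (name, value) bookkeeping); for w = 6 it returns 42 entries.

```python
import numpy as np

def estimated_kf_coefficients(alphas, thetas, w=6):
    """Estimated K-Fourier coefficients KF_w(f, S) from a K-sample
    S = {(alpha_k, theta_k)}, as (name, value) pairs ordered by total
    frequency and alphabetic placement (a, b, c, d, e)."""
    a, t = np.asarray(alphas), np.asarray(thetas)
    pi_nm = lambda n, m: (1 - (-1) ** (n + m)) * np.pi / 4
    out = []
    for f in range(1, w + 1):                   # f = total frequency
        out.append((f"a_{f}", np.mean(np.sqrt(2) * np.cos(f * t - pi_nm(0, f)))))
        if f % 2 == 0:                          # b_n, c_n have frequency 2n
            n = f // 2
            out.append((f"b_{n}", np.mean(np.sqrt(2) * np.cos(2 * n * a))))
            out.append((f"c_{n}", np.mean(np.sqrt(2) * np.sin(2 * n * a))))
        for name, trig in (("d", np.cos), ("e", np.sin)):
            for n in range(1, f):               # d_{n,m}, e_{n,m}: n + m = f
                m = f - n
                out.append((f"{name}_{n},{m}",
                            np.mean(2 * trig(n * a) * np.cos(m * t - pi_nm(n, m)))))
    return out
```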
While there are reasonably good estimators for the coefficients of f in very general settings, their dependency on the choice of basis makes it hard to identify just how much information is effectively encoded in this representation. Let us try to make this idea more transparent. The first thing to notice is that the sequence of estimators

    S_J(f̂_δ) = ∑_{k=1}^J f̂_k φ_k

does not converge to f but to f̂_δ (in the weak-* topology) as J → ∞, regardless of the sample size N, provided we have the pointwise convergence

    ∑_k ψ_k φ_k(x) = ψ(x)

for all ψ ∈ S(R^d) and all x ∈ R^d. This is true, for instance, in the case of Hermite functions and trigonometric polynomials. More specifically, what Theorem 3.8 provides is a way of approximating a linear combination of delta functions instead of the actual function f, and thus even when the f̂_k are almost surely correct for large N, the convergence in probability is not enough to make them decrease fast enough as k gets larger. In short, taking more coefficients is not necessarily a good thing, so one should choose bases in which the first coefficients carry most of the information, making truncation not only necessary but meaningful.

Moreover, it is known ([31]) that truncation is also nearly optimal with respect to mean integrated square error (MISE), within a natural class of statistics for f. Indeed, if within the family of estimators

    f*_N = ∑_{k=1}^∞ λ_k(N) f̂_k φ_k

one looks for the sequence {λ_k(N)}_{k∈N} that minimizes

    MISE = E[‖f*_N − f‖²₂],

then provided the f_k, for k ≤ J(N), are large compared to var(φ_k)/N, and the f_k for k > J(N) are negligible, letting λ_k(N) = 1 for k ≤ J(N) and zero otherwise achieves nearly optimal mean square error.

The previous discussion yields yet another reason for considering trigonometric polynomials: the high-frequency K-Fourier coefficients tune the fine-scale details of the series approximation. Thus, most of the relevant information in the probability density function is encoded in the low frequencies, which can be easily approximated via estimators that converge almost surely as the sample size gets larger, and achieve nearly optimal mean square error within a large family of estimators.

3.4 The effect of image rotation
A crucial aspect toward exploiting the full power of an invariant associated to images is understanding its behavior under geometric transformations such as translation, rotation and scaling of the input. We will show in what follows that image rotation has a particularly simple effect on the distribution of patches on the Klein bottle, and that with respect to the trigonometric basis, it has a straightforward interpretation in the frequency domain: a linear transformation depending solely on the angle of rotation.

Let P be a patch from an image, and let (α_P, θ_P) ∈ K be its projection. It follows from the Klein bottle model presented in Figure 4 that if the image is rotated τ degrees in the counterclockwise direction with respect to its center, then P maps to a patch P^τ with projection (α_P + τ, θ_P) ∈ K. Therefore, if f : K −→ R is the probability density function corresponding to the distribution of projected patches from the original image, then f^τ(α, θ) = f(α − τ, θ) describes that of its rotated version.

It follows from the identity e^{in(α+τ)} = e^{inτ} e^{inα} that the trigonometric basis on K is specially well suited for understanding the effect that image rotation has on the frequency domain.

Theorem 3.10. Let a_m, b_n, c_n, d_{n,m}, e_{n,m} be the K-Fourier coefficients of f ∈ L²(K, R), and let a^τ_m, b^τ_n, c^τ_n, d^τ_{n,m}, e^τ_{n,m} be those of f^τ(α, θ) = f(α − τ, θ). Then we have the identities

    a^τ_m = a_m,

    (b^τ_n, c^τ_n)^T = [[cos(2nτ), −sin(2nτ)], [sin(2nτ), cos(2nτ)]] (b_n, c_n)^T,

    (d^τ_{n,m}, e^τ_{n,m})^T = [[cos(nτ), −sin(nτ)], [sin(nτ), cos(nτ)]] (d_{n,m}, e_{n,m})^T,

for every τ ∈ R and every n, m ∈ N. Thus, there exists an isometric linear map T_τ : ℓ²(R) −→ ℓ²(R), depending solely on τ, so that KF(f^τ) = T_τ(KF(f)) for every f ∈ L²(K, R). Here ℓ²(R) denotes the set of square-summable sequences of real numbers.

Proof. If f, g ∈ L²(K, R), then from a change of coordinates it follows that ⟨f^τ, g⟩_K = ⟨f, g^{−τ}⟩_K, and therefore

    a^τ_m = a_m,

    b^τ_n = ⟨f^τ, √2 cos(2nα)⟩_K = ⟨f, √2 cos(2n(α + τ))⟩_K = cos(2nτ) b_n − sin(2nτ) c_n,

    c^τ_n = ⟨f, √2 sin(2n(α + τ))⟩_K = cos(2nτ) c_n + sin(2nτ) b_n.

Similarly,

    d^τ_{n,m} = cos(nτ) d_{n,m} − sin(nτ) e_{n,m},
    e^τ_{n,m} = cos(nτ) e_{n,m} + sin(nτ) d_{n,m}.
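In code, T_τ is just a walk over the coefficient list applying the appropriate 2 × 2 rotation to each (b_n, c_n) and (d_{n,m}, e_{n,m}) pair. The sketch below is ours and assumes the (name, value) format of the estimation sketch in subsection 3.3.

```python
import numpy as np

def rotate_coefficients(coeffs, tau):
    """Apply T_tau (Theorem 3.10): a_m is fixed, (b_n, c_n) rotates by the
    angle 2*n*tau, and (d_{n,m}, e_{n,m}) rotates by n*tau."""
    vals = dict(coeffs)
    out = []
    for name, v in coeffs:
        kind, idx = name.split("_")
        if kind == "a":
            out.append((name, v))
            continue
        if kind in ("b", "c"):
            ang = 2 * int(idx) * tau
            x, y = vals["b_" + idx], vals["c_" + idx]
            first = kind == "b"
        else:                                   # kind in ("d", "e")
            ang = int(idx.split(",")[0]) * tau
            x, y = vals["d_" + idx], vals["e_" + idx]
            first = kind == "d"
        c, s = np.cos(ang), np.sin(ang)
        out.append((name, c * x - s * y if first else s * x + c * y))
    return out
```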
Corollary 3.11. Let T_τ : ℓ²(R) −→ ℓ²(R) be as in Theorem 3.10. Then T_{τ+β} = T_τ ∘ T_β.

Let X₁, ..., X_N be i.i.d. random variables with probability density function f : K −→ R, where X_n = (A_n, Θ_n). If we let X^τ_n = (A_n + τ, Θ_n), then it follows that X^τ₁, ..., X^τ_N are i.i.d. with probability density function f^τ. By applying Theorem 3.8 to X^τ₁, ..., X^τ_N and the trigonometric basis for L²(K, R), we get

Corollary 3.12. T_τ(K̂F(f)) is a componentwise unbiased estimator for KF(f^τ), which converges almost surely as the sample size tends to infinity.

The previous discussion can be summarized as follows: rotating an image in the counterclockwise direction, with respect to its center, by τ degrees causes a translation of its distribution of patches on the Klein bottle model K by τ units along the horizontal axis. This shift, in turn, translates into the K-Fourier coefficient domain as a linear transformation T_τ, which can be described as a block diagonal matrix whose blocks are either the number 1, or rotation matrices which act by an angle 2nτ or nτ depending on the particular frequencies. This description allows us to define a rotation-invariant distance between K-Fourier coefficients. Indeed,

Proposition 3.13. Let f, g ∈ L²(K), and let

    d_R(f, g) := inf_τ ‖f − g^τ‖_{L²}.

Then d_R defines a pseudo-metric on L²(K) so that

    d_R(f^{τ₁}, g^{τ₂}) = d_R(f, g)

for every τ₁, τ₂ ∈ R, and d_R(f, g) = 0 if and only if g = f^τ for some τ.

Proof. To see that d_R is symmetric, notice that

    ‖f − g^τ‖_{L²} = ‖f^{−τ} − g‖_{L²},

and therefore

    d_R(f, g) = inf_τ ‖f − g^τ‖_{L²} = inf_τ ‖f^{−τ} − g‖_{L²} = inf_τ ‖f^τ − g‖_{L²} = d_R(g, f).

To see that d_R(f, g) = 0 if and only if g = f^τ for some τ, notice that a continuity argument (with respect to τ) and the fact that

    inf_τ ‖f − g^τ‖_{L²} = inf_{−π≤τ≤π} ‖f − g^τ‖_{L²}

imply that the infimum is attained. The result now follows since ‖·‖_{L²} is a norm, and therefore non-degenerate. To see that the triangle inequality holds, let τ₁, τ₂ ∈ [−π, π] be so that d_R(f, h) = ‖f − h^{τ₁}‖_{L²} and d_R(h, g) = ‖h − g^{τ₂}‖_{L²}. Therefore

    d_R(f, h) + d_R(h, g) = ‖f − h^{τ₁}‖_{L²} + ‖h − g^{τ₂}‖_{L²}
                          = ‖f − h^{τ₁}‖_{L²} + ‖h^{τ₁} − g^{τ₁+τ₂}‖_{L²}
                          ≥ ‖f − g^{τ₁+τ₂}‖_{L²}
                          ≥ d_R(f, g).

The invariance of d_R under horizontal translation follows from its definition.

From the identity ‖f − g‖_{L²} = ‖KF(f) − KF(g)‖_{ℓ²}, which is a consequence of Parseval's Theorem, it follows that

    d_R(f, g) = min_τ ‖KF(f) − KF(g^τ)‖_{ℓ²},

and therefore

    min_τ ‖K̂F_w(f, S) − T_τ(K̂F_w(g, S′))‖₂

defines a rotation-invariant distance between images, when applied to (estimates of) the underlying probability density functions on K. The most important point is not that this distance exists, but that it can be efficiently computed for truncated sequences of K-Fourier coefficients:

Theorem 3.14. Let f, g ∈ L²(K, R) and w ∈ N. Then there exists a complex polynomial p(z) of degree at most 2w, with coefficients depending solely on KF_w(f) and KF_w(g), so that if τ* is a minimizer for

    Ψ(τ) = ‖KF_w(f) − KF_w(g^τ)‖₂,

then ξ* = e^{iτ*} is a root of p(z).
Proof. From the description of T_τ as a block diagonal matrix whose blocks are rotation matrices (Theorem 3.10), it follows that

    argmin_τ Ψ(τ) = argmax_τ ⟨KF_w(f), T_τ(KF_w(g))⟩ = argmax_τ ∑_{n=1}^w x_n cos(nτ) + y_n sin(nτ),

where x_n and y_n depend solely on KF_w(f) and KF_w(g). By taking the derivative with respect to τ, we get that if τ* is a minimizer for Ψ(τ), then we must have

    ∑_{n=1}^w n y_n cos(nτ*) − n x_n sin(nτ*) = 0,

which is equivalent to having

    Re( ∑_{n=1}^w n (y_n + i x_n) ξ*ⁿ ) = 0.

Let q(z) = ∑_{n=1}^w n (y_n + i x_n) zⁿ, let q̄(z) = ∑_{n=1}^w n (y_n − i x_n) zⁿ, and let

    p(z) = z^w · (q(z) + q̄(1/z)).

It follows that p(z) is a complex polynomial of degree less than or equal to 2w, and so that

    p(ξ*) = ξ*^w (q(ξ*) + q̄(1/ξ*)) = 2 ξ*^w · Re(q(ξ*)) = 0.

The roots of the polynomial p(z) can, in principle, be computed as the eigenvalues of its companion matrix. This is how the MATLAB routine roots is implemented. One can then look among the arguments of the unitary roots of p(z) for the global minimizers (in [−π, π]) of Ψ.

It is known ([21]) that roots produces the exact eigenvalues of a matrix within roundoff error of the companion matrix of p. This, however, does not mean that they are the exact roots of a polynomial with coefficients within roundoff error of those of p, but the discrepancy is well understood. Indeed, let p̃ be the monic polynomial having roots(p) as its exact roots, let ε be the machine precision, and let E = [E_{ij}] be a perturbation matrix, with E_{ij} a small multiple of ε, so that if A is the companion matrix of p, then A + E is the companion matrix of p̃. It follows [11] that if

    p(z) = zⁿ + a_{n−1} z^{n−1} + · · · + a₁ z + a₀,

then to first order the coefficient of z^{k−1} in p̃(z) − p(z) is

    ∑_{m=0}^{k−1} a_m ∑_{i=k+1}^n E_{i,i+m−k} − ∑_{m=k}^n a_m ∑_{i=1}^k E_{i,i+m−k}.

Hence even when roots(p) does not return the exact roots of p, but those of a slight perturbation p̃, their arguments can be used as the initial guess in an iterative refinement such as Newton's method. This is how we have implemented the d_R distance: we compute the eigenvalues of the companion matrix for p, and then use their arguments (as complex numbers) in a refinement step via Newton's method on Ψ².
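The following sketch (ours) implements this computation with numpy's root finder in place of MATLAB's roots (both go through the companion matrix); the Newton refinement step is omitted for brevity, and `rotate_coefficients` is the sketch from earlier in this subsection. The bookkeeping of x_n and y_n follows the block structure of T_τ.

```python
import numpy as np

def rotation_distance(kf_f, kf_g):
    """min over tau of ||KF_w(f) - T_tau(KF_w(g))||_2 via Theorem 3.14: the
    critical points of Psi are among the arguments of the unit-modulus roots
    of p(z) = z^w (q(z) + qbar(1/z)).  Returns (distance, tau*)."""
    f, g = dict(kf_f), dict(kf_g)
    x, y = {}, {}
    for name in f:                                  # accumulate x_n, y_n
        kind, idx = name.split("_")
        if kind == "b":
            n, other = 2 * int(idx), "c_" + idx     # (b_n, c_n): angle 2n*tau
        elif kind == "d":
            n, other = int(idx.split(",")[0]), "e_" + idx   # angle n*tau
        else:
            continue                                # a_m terms do not rotate
        x[n] = x.get(n, 0) + f[name] * g[name] + f[other] * g[other]
        y[n] = y.get(n, 0) + f[other] * g[name] - f[name] * g[other]
    w = max(x)
    coef = np.zeros(2 * w + 1, dtype=complex)       # coef[k] multiplies z^k
    for n in range(1, w + 1):
        coef[w + n] = n * (y.get(n, 0) + 1j * x.get(n, 0))
        coef[w - n] = n * (y.get(n, 0) - 1j * x.get(n, 0))
    roots = np.roots(coef[::-1])                    # highest degree first
    taus = [np.angle(z) for z in roots if abs(abs(z) - 1) < 1e-6] or [0.0]
    F = lambda t: sum(x.get(n, 0) * np.cos(n * t) + y.get(n, 0) * np.sin(n * t)
                      for n in range(1, w + 1))
    tau = max(taus, key=F)                          # maximize the inner product
    vf = np.array([v for _, v in kf_f])
    vg = np.array([v for _, v in rotate_coefficients(kf_g, tau)])
    return np.linalg.norm(vf - vg), tau
```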
We end this section with a rotation-invariant version of our sequence of estimated K-Fourier coefficients. Let S ⊆ K be a random sample drawn according to f : K −→ R, and let

    v̂ = (d̂_{1,1}, ê_{1,1})^T.

It follows that ‖v̂‖ = 0 is a zero-probability event, so there exists (with probability one) σ(f) ∈ [0, 2π) so that the action of T_{σ(f)} on K̂F(f, S) restricted to v̂ is given by

    v̂ ↦ (‖v̂‖, 0)^T.

Proposition 3.15. Let σ(f) be as in the previous paragraph. Then σ(f^τ) ≡ σ(f) − τ (mod 2π), and therefore

    T_{σ(f^τ)}(K̂F(f^τ, S^τ)) = T_{σ(f)}(K̂F(f, S))

for every τ.

Proof. Let v̂^τ be the portion of K̂F(f^τ, S^τ) corresponding to v̂. It follows from Theorem 3.10 and Corollaries 3.11 and 3.12 that

    v̂^τ = [[cos(τ), −sin(τ)], [sin(τ), cos(τ)]] v̂
        = [[cos(τ − σ(f)), −sin(τ − σ(f))], [sin(τ − σ(f)), cos(τ − σ(f))]] (‖v̂‖, 0)^T
        = [[cos(τ − σ(f)), −sin(τ − σ(f))], [sin(τ − σ(f)), cos(τ − σ(f))]] (‖v̂^τ‖, 0)^T,

from which we get σ(f^τ) ≡ σ(f) − τ (mod 2π), and

    T_{σ(f^τ)}(K̂F(f^τ, S^τ)) = T_{σ(f)−τ} ∘ T_τ(K̂F(f, S)) = T_{σ(f)}(K̂F(f, S)),

as claimed.
Definition 3.16. Let v̂ = (d̂_{1,1}, ê_{1,1})^T, and let σ(f) ∈ [0, 2π) be so that the action of T_{σ(f)} on K̂F(f, S) restricted to v̂ is given by v̂ ↦ (‖v̂‖, 0)^T. We let T_{σ(f)}(K̂F(f, S)) be the estimated sequence of rotation-invariant K-Fourier coefficients, for a sample S ⊆ K drawn according to f ∈ L²(K, R).
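Concretely (again assuming the helper sketched for Theorem 3.10), the rotation-invariant coefficients are obtained by rotating the whole estimated sequence by the angle that sends (d̂_{1,1}, ê_{1,1}) to (‖v̂‖, 0):

```python
import numpy as np

def rotation_invariant_coeffs(coeffs):
    """Definition 3.16: apply T_{sigma(f)}, where sigma(f) takes
    v = (d_{1,1}, e_{1,1}) to (||v||, 0); uses rotate_coefficients."""
    vals = dict(coeffs)
    sigma = -np.arctan2(vals["e_1,1"], vals["d_1,1"])
    return rotate_coefficients(coeffs, sigma)
```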
4 Results

4.1 The Data Sets

SIPI (http://sipi.usc.edu/database). The SIPI rotated textures data set, maintained by the Signal and Image Processing Institute at the University of Southern California, consists of 13 of the Brodatz images [2], digitized at different rotation angles: 0, 30, 60, 90, 120, 150 and 200 degrees. The 91 images are 512 × 512 pixels, 8 bits/pixel, in TIFF format.

Figure 8: Exemplars from the SIPI rotated textures data set.

CUReT (http://www.cs.columbia.edu/CAVE/software/curet). The Columbia-Utrecht Reflectance and Texture database [8] is one of the most popular image collections used in the performance analysis of texture classification methods. It features images of 61 different materials, each photographed under 205 distinct combinations of viewing and illumination conditions, providing a considerable range of 3D rigid motions. Following the usual setup in the literature, the extreme viewing conditions are discarded, leaving 92 gray scale images per material, of size 200 × 200 pixels, 8 bits/pixel, in PNG format. What makes this a challenging data set is both the considerable variability within each texture class, and the many commonalities across materials. One drawback, in terms of scope, is its lack of variation in scale.

Figure 9: The 61 materials in the CUReT data set.

KTH-TIPS (http://www.nada.kth.se/cvap/databases/kth-tips). The KTH database of Textures under varying Illumination, Pose and Scale, collected by Mario Fritz under the supervision of Eric Hayman and Barbara Caputo, was introduced in [13] as a natural extension to the CUReT database. The reason for creating this new data set was to provide wide variation in scale, in addition to changes in pose and illumination. The database consists of 10 materials (already present in the CUReT database), each captured at 9 different scales and 9 distinct poses. That is, the KTH-TIPS data set has 810 images, 200 × 200 pixels, 8 bits/pixel, in PNG format.

Figure 10: Three scales of each of the ten materials in the KTH-TIPS data set.

UIUCTex (http://www-cvr.ai.uiuc.edu/ponce_grp/data). The UIUC Texture database, first introduced in [18], and collected by the Ponce Research Group at the University of Illinois at Urbana-Champaign, features 25 texture classes with 40 samples each.
The challenge of this particular data set lies in its wide range of viewpoints, scales, and non-rigid deformations. All images are in gray scale, JPG format, 640 × 480 pixels, 8 bits/pixel.

Figure 11: Four exemplars from each of the twenty-five classes in the UIUC Texture data set.

4.2 Implementation Details

Using the mathematical framework developed in Section 3, we now give a detailed description of the steps involved in computing the estimated K-Fourier coefficients of a particular image, as well as the form of the EKFC descriptor.

Preprocessing. Given a digital image in gray scale and an odd integer n, we extract its n × n patches. The choice of n odd makes the implementation simpler, but is somewhat arbitrary and has no bearing on the results presented here. If the image has pixel depth d, then each patch is an n × n integer matrix with entries between 0 and 2^d − 1. We add 1 to each entry in the matrix and, following Weber's law, take entry-wise natural logarithms. Each log-matrix is then centered by subtracting its mean (as a vector of length n²) from each component, and D-normalized provided its D-norm is greater than 0.01. Otherwise the patch is discarded for lack of contrast.

The Projection. Let P be an n × n patch represented by the centered and normalized log-matrix [p_{ij}]. As described in the discussion following Proposition 3.2, we let ∇f_P be piecewise constant and equal to the discrete gradient of P, which we now define. If 2 ≤ i, j ≤ n − 1, then we let

    ∇P(i, j) = (1/2) (p_{i,j+1} − p_{i,j−1}, p_{i−1,j} − p_{i+1,j})^T,

    HP(i, j) = [[p_{i,j+1} − 2p_{i,j} + p_{i,j−1}, H_{xy}P(i, j)], [H_{xy}P(i, j), p_{i+1,j} − 2p_{i,j} + p_{i−1,j}]],

where

    H_{xy}P(i, j) = (p_{i−1,j+1} − p_{i−1,j−1} + p_{i+1,j−1} − p_{i+1,j+1})/4.

If either r ∈ {1, n} or t ∈ {1, n}, then there exists a unique (i, j) ∈ {2, ..., n−1}² which minimizes |i − r| + |j − t|, and we let

    ∇P(r, t) = ∇P(i, j) + HP(i, j) (r − i, j − t)^T.

As for the gradient of the extension f_P, we let ∇f_P(x, y) = ∇P(i, j) if

    |x − (−1 + (2j − 1)/n)| + |y − (1 − (2i − 1)/n)| < 1/n

for some (i, j) ∈ {1, ..., n}², and 0 otherwise. This definition captures directionality more accurately than simply calculating finite differences. This can be seen, for instance, in a 3 × 3 patch with extension f_P(x, y) = √3 (x + y)²/8.
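The gradient field just described can be sketched as follows (our code). One caveat: the displacement vector in the boundary formula above is hard to read off the published layout, so the sketch uses the first-order Taylor expansion consistent with x increasing along columns and y decreasing along rows, which is our best reading of the intended convention.

```python
import numpy as np

def discrete_gradient(P):
    """Discrete gradient of subsection 4.2: central differences on the
    interior, first-order Taylor extension via the Hessian at the boundary."""
    n = P.shape[0]
    G = np.zeros((n, n, 2))
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            G[i, j] = 0.5 * np.array([P[i, j + 1] - P[i, j - 1],    # d/dx
                                      P[i - 1, j] - P[i + 1, j]])   # d/dy
    for r in range(n):
        for t in range(n):
            if 1 <= r <= n - 2 and 1 <= t <= n - 2:
                continue                       # interior already done
            i = min(max(r, 1), n - 2)          # nearest interior pixel
            j = min(max(t, 1), n - 2)
            hxy = (P[i - 1, j + 1] - P[i - 1, j - 1]
                   + P[i + 1, j - 1] - P[i + 1, j + 1]) / 4
            H = np.array([[P[i, j + 1] - 2 * P[i, j] + P[i, j - 1], hxy],
                          [hxy, P[i + 1, j] - 2 * P[i, j] + P[i - 1, j]]])
            G[r, t] = G[i, j] + H @ np.array([t - j, i - r])   # Taylor step
    return G
```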
If the eigenvalues of the associated matrix A_P are distinct (see Proposition 3.1), then we let α_P ∈ [π/4, 5π/4) be the direction of the eigenspace corresponding to the largest eigenvalue. Since A_P is a 2 × 2 matrix, α_P can be computed explicitly in terms of ∇P. If A_P has only one eigenvalue, then the patch P is discarded due to its lack of a prominent directional pattern. If Φ(c, d) is constant (see Proposition 3.2), then P is discarded, since it has neither linear nor quadratic components. Otherwise, we let θ_P ∈ [−π/2, 3π/2) correspond to the minimizer (c*, d*) and compute Φ(c*, d*), which can be thought of as the distance from P to 𝒦. Notice that the triangle inequality implies that Φ(c*, d*) ≤ 2. We include (α_P, θ_P) in the sample S ⊆ K if Φ(c*, d*) ≤ r_n, for r_n as reported in Table 1.

Table 1: Maximum distance to the Klein bottle.

    Patch size (n) |   3    |   5    |   7    |   9    | n ≥ 11
    r_n            | 1.2247 | 1.3693 | 1.4031 | 1.4114 | 1.4135
The rationale behind this choice is as follows:

    ‖f_P − (c* u + d* √3 u²)‖_D = √(2 (1 − ϕ(f_P, α_P))),

and the values of r_n are set so that ϕ(f_P, α_P) ≥ 1/2^{n−1} is the inclusion criterion. Here ϕ(f_P, α_P) can be thought of as the size of the contribution of Span{(ax + by), (ax + by)²} if one were to write a series expansion for f_P(x, y) in terms of the (rotated) monomials (ax + by)^k (ay − bx)^m, k, m ∈ N.

The K-Fourier Coefficients. We select the cut-off frequency w ∈ N so that K̂F_w(f, S) includes the coefficients with large variance, and excludes the first regime where the f̂_k tend to zero. Recall that after this regime the f̂_k oscillate with large variance, since S_J(f̂_δ) → f̂_δ. For the results presented here we let w = 6, and fix it once and for all. While similar cut-off frequencies have been obtained with maximum likelihood methods (see for instance [10]), other choices might improve classification results. It follows that K̂F_w(f, S) is a vector of length 42.

Example. We illustrate the computation of the estimated K-Fourier coefficients. Let us consider the K-sample from the 3 × 3 patches in the image depicted in Figure 12.

Figure 12: Texture of bricks. This image, included in the SIPI rotated textures data set, was extracted from page 94 in [2] and digitized at 0 degrees.

We compute the estimated K-Fourier coefficients and the corresponding estimate f̂ of the underlying probability density function f : K −→ R. Here

    f̂(α, β) = ∑_{k∈J_w} f̂_k · φ_k(α, β),

where {φ_k}_{k∈N} is the trigonometric basis for L²(K, R), f̂_k is as in Definition 3.9, and J_w ⊆ N is such that φ_k has frequency less than or equal to w = 6 whenever k ∈ J_w. We summarize the results in Figure 13.

Figure 13: Estimated K-Fourier coefficients for the texture of bricks. Top: first 44 estimated K-Fourier coefficients. Bottom left: heat-map representation for the estimate f̂. The points (α, β) ∈ K are colored according to the value f̂(α, β), with dark blue representing the minimum of f̂ and dark red its maximum. For colors, please refer to an electronic version of the paper. Bottom right: a lattice of patches in the Klein bottle model 𝒦, arranged according to their coordinates (α, β) ∈ K. It follows that the hotter spots in the heat-map representation of f̂ correspond to patches with a vertical pattern.

Example. We apply the d_R distance

    min_τ ‖K̂F_w(f, S) − T_τ(K̂F_w(g, S′))‖₂,

as described in the discussion following Theorem 3.14, to the SIPI rotated textures data set. Once the distance matrix is computed, we perform classical MDS (Multi-Dimensional Scaling) to get a representation, in R³, of the similarities in the SIPI data set as measured by the d_R distance. The results are summarized in Figure 14.
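For completeness, the synthesis step behind the heat map of Figure 13 is a direct series evaluation; a sketch (ours, matching the coefficient format used earlier):

```python
import numpy as np

def density_estimate(coeffs, n_grid=200):
    """Evaluate f_hat = sum_k f_hat_k * phi_k on a grid over
    [pi/4, 5pi/4) x [-pi/2, 3pi/2).  The constant term <f, 1>_K = 1/(2 pi^2),
    fixed for any probability density, is added back explicitly."""
    al = np.pi / 4 + np.pi * np.arange(n_grid) / n_grid
    th = -np.pi / 2 + 2 * np.pi * np.arange(n_grid) / n_grid
    A, T = np.meshgrid(al, th, indexing="ij")
    pi_nm = lambda n, m: (1 - (-1) ** (n + m)) * np.pi / 4
    f = np.full(A.shape, 1.0 / (2 * np.pi ** 2))
    for name, v in coeffs:
        kind, idx = name.split("_")
        if kind == "a":
            f += v * np.sqrt(2) * np.cos(int(idx) * T - pi_nm(0, int(idx)))
        elif kind == "b":
            f += v * np.sqrt(2) * np.cos(2 * int(idx) * A)
        elif kind == "c":
            f += v * np.sqrt(2) * np.sin(2 * int(idx) * A)
        else:
            n, m = map(int, idx.split(","))
            trig = np.cos if kind == "d" else np.sin
            f += v * 2 * trig(n * A) * np.cos(m * T - pi_nm(n, m))
    return al, th, f      # plot with, e.g., plt.pcolormesh(th, al, f)
```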
histograms, the earth mover’s distance, L p norms, etc. While distances defined through various heuristics are useful, metrics learnt from the data often outperform them in classification tasks. One of the most popular metric learning algorithms is the Large Margin Nearest Neighbor (LMNN) introduced by Saul and Weinberger in [32]. In this algorithm, where data points are represented as vectors x ∈ RD , the goal is to learn a linear transformation L : RD −→ RD which rearranges the training set so that, as much as possible, the k nearest neighbors of each data point have the same label, while vectors labeled differently are separated by a large margin. If M = LT L then dM (x, y) = kM(x − y)k2 is referred to as the learnt Mahalanobis distance. Figure 14: Multi-Dimensional Scaling for dR -distance matrix on the SIPI data set. Each dot represents an image, and rotated 4.4 Classification Results versions of the same texture share the same color. For colors, We evaluate the performance of the EKFC descriptor by clasplease refer to an electronic version of the paper. sifying the images in the CUReT, KTH-TIPS and UIUCTex databases according to their material class. For each data set, The EKFC Descriptor Let M + N be the total number of n × the evaluation consists of a metric learning step on a randomly n patches in a given image, with M being the number of patches selected portion of the data, followed by a labeling of the rewhich are too far from the Klein bottle to be projected, and let maining images. We report the mean classification success S be the resulting K-sample from its n × n patches. We let the rate and standard deviation on each database, from 100 random EKFC (estimated K-Fourier coefficients) descriptor at scale n splits training/test set. be d w ( f , S) T KF σ ( f ) 1 Metric Learning For each texture class, we randomly select EKFCn = 1 − , M M a fixed number of exemplars which we label, and aggregate to d w ( f , S) is the truncated sequence of estiwhere Tσ ( f ) KF mated rotation-invariant K-Fourier coefficients, Definition 3.16. We obtain the (multi-scale) EKFC descriptor by concatenating the EKFCn ’s for several values of n. For the results presented here we used n = 3, 7, 11, 15, 19 and
4.3 LMNN for Metric Learning

One of the most basic example-based classification techniques is the k-th nearest neighbor method, in which a new instance is classified according to the most common label among its k nearest neighbors. Needless to say, this method depends largely on the notion of distance used to compare data points. For the special case of distances between probability density functions, many dissimilarity measures exist: the χ²-statistic between histograms, the earth mover's distance, L^p norms, etc. While distances defined through various heuristics are useful, metrics learnt from the data often outperform them in classification tasks. One of the most popular metric learning algorithms is Large Margin Nearest Neighbor (LMNN), introduced by Weinberger and Saul in [32]. In this algorithm, where data points are represented as vectors x ∈ R^D, the goal is to learn a linear transformation L : R^D → R^D which rearranges the training set so that, as much as possible, the k nearest neighbors of each data point have the same label, while vectors labeled differently are separated by a large margin. If M = L^T L, then d_M(x, y) = ‖L(x - y)‖_2 is referred to as the learnt Mahalanobis distance.

4.4 Classification Results

We evaluate the performance of the EKFC descriptor by classifying the images in the CUReT, KTH-TIPS and UIUCTex databases according to their material class. For each data set, the evaluation consists of a metric learning step on a randomly selected portion of the data, followed by a labeling of the remaining images. We report the mean classification success rate and standard deviation on each database, over 100 random training/test splits.

Metric Learning. For each texture class, we randomly select a fixed number of exemplars which we label, and aggregate to form a training set. We then learn a Mahalanobis metric from these labeled examples using the LMNN algorithm⁷. For the results presented here, we used 3 nearest neighbors to learn a global metric, and set aside 20% of the training set for cross-validation, in order to prevent the algorithm from over-fitting.
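Assuming a transformation L has already been produced by an LMNN solver (such as the one referenced in footnote 7), the learnt metric is used at test time as sketched below. The random L and the toy clustered data are illustrative stand-ins, and plain k-nearest-neighbor voting is shown only for simplicity; the labeling rule actually used here is the energy-based one described next.

```python
import numpy as np

def mahalanobis(L, x, y):
    """Learnt Mahalanobis distance d_M(x, y) = ||L(x - y)||_2, with M = L^T L."""
    return np.linalg.norm(L @ (x - y))

def knn_label(L, train_X, train_y, x, k=3):
    """Majority vote among the k nearest training points under d_M."""
    dists = np.array([mahalanobis(L, x, t) for t in train_X])
    nearest = train_y[np.argsort(dists)[:k]]
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]

# Toy usage with a random stand-in for the learnt transformation L and two
# well-separated classes of 215-dimensional (EKFC-like) feature vectors.
rng = np.random.default_rng(1)
L = rng.normal(size=(215, 215)) / np.sqrt(215.0)
train_X = np.vstack([rng.normal(0.0, 1.0, size=(20, 215)),
                     rng.normal(5.0, 1.0, size=(20, 215))])
train_y = np.repeat([0, 1], 20)
print(knn_label(L, train_X, train_y, train_X[5] + 0.01))  # 0: class-0 cluster
```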
Labeling Step. We use the energy-based classification described in [32], which has been observed to produce better results than k-th nearest neighbor labeling. In this framework, a test example is assigned the label which minimizes the same loss function that determines the Mahalanobis metric d_M, if the example along with the proposed label were to be added to the training set.

Classification. Once the labels have been determined via the energy-based method, we compute the percentage of test images which have been labeled correctly. The mean and variance of these percentages are computed over 100 random training/test splits. We summarize our classification results in Table 2.

⁷ http://www.cse.wustl.edu/~kilian/code/code.html
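The full evaluation protocol can be summarized as follows. learn_metric and label_fn are placeholders for the LMNN training step and the energy-based labeling above (the k-NN sketch from the previous section would also fit the label_fn signature); neither is a real library call.

```python
import numpy as np

def evaluate(X, y, n_train_per_class, learn_metric, label_fn,
             n_splits=100, seed=0):
    """Mean and std of the classification success rate over random splits."""
    rng = np.random.default_rng(seed)
    rates = []
    for _ in range(n_splits):
        train_idx, test_idx = [], []
        for c in np.unique(y):                       # stratified random split
            members = rng.permutation(np.flatnonzero(y == c))
            train_idx.extend(members[:n_train_per_class])
            test_idx.extend(members[n_train_per_class:])
        train_idx, test_idx = np.array(train_idx), np.array(test_idx)
        L = learn_metric(X[train_idx], y[train_idx])           # LMNN step
        preds = np.array([label_fn(L, X[train_idx], y[train_idx], x)
                          for x in X[test_idx]])
        rates.append(np.mean(preds == y[test_idx]))
    return np.mean(rates), np.std(rates)   # cf. the numbers in Table 2
```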
Table 2: Classification scores on the CUReT, UIUCTex and KTH-TIPS data sets. We include, for comparison, the results (as reported in Table 2 of [7]) from state-of-the-art methods for material classification from texture images. A "-" indicates that no result was reported for that data set.

Method                               | CUReT             | UIUCTex           | KTH-TIPS
                                     | 43 images per     | 20 images per     | 40 images per
                                     | class in training | class in training | class in training
-------------------------------------+-------------------+-------------------+------------------
our EKFC descriptor + LMNN metric    | 95.66 ± 0.45%     | 91.23 ± 1.13%     | 94.77 ± 1.3%
Crosier & Griffin [7]                | 98.6 ± 0.2%       | 98.8 ± 0.5%       | 98.5 ± 0.7%
Varma & Zisserman - MR8 [29]         | 97.43%            | -                 | -
Varma & Zisserman - Joint [30]       | 98.03%            | 97.83 ± 0.66%     | 92.4 ± 2.1%
Hayman et al. [13]                   | 98.46 ± 0.09%     | 92.0 ± 1.3%       | 94.8 ± 1.2%
Lazebnik et al. [18]                 | 72.5 ± 0.7%       | 96.03%            | 91.3 ± 1.4%
Zhang et al. [33]                    | 95.3 ± 0.4%       | 98.70 ± 0.4%      | 95.5 ± 1.3%
Broadhurst [1]                       | 99.22 ± 0.34%     | -                 | -
Varma & Ray [27]                     | 98.9 ± 0.68%      | -                 | -
5 Discussion and Final Remarks

We have presented in this paper a novel framework for estimating and representing the distribution, around low-dimensional submanifolds of pixel space, of patches from texture images. In particular, we have shown that most n × n patches from an image can be continuously projected onto a surface K ⊆ R^{n²} with the topology of a Klein bottle, and that the resulting set can be interpreted as sampled from a probability density function f : K → R. The K-Fourier coefficients of f can then be estimated from the sample, via unbiased estimators which converge almost surely as the sample size goes to infinity. The choice of the model K comes from the fact that small high-contrast patches from natural images accumulate around it with high density [6]. In addition, we show that the estimated K-Fourier coefficients of a rotated image can be recovered from those of the original one, via a linear transformation which depends solely on the rotation angle.

By concatenating the first 43 rotation-invariant K-Fourier coefficients from n × n patches at 5 different scales (n = 3, 7, 11, 15, 19), we obtain the multi-scale rotation-invariant EKFC descriptor. Note that its dimension, 215, is considerably lower than that of popular histogram representations (2,440 [28] and 1,296 [7]), while still achieving high classification rates on the CUReT (95.66 ± 0.45%), UIUCTex (91.23 ± 1.13%) and KTH-TIPS (94.77 ± 1.3%) data sets.

As we have seen, the Klein bottle dictionary represents patches in terms of bars and edges, which are not expected to be enough to efficiently capture all primitives in a texture image. It is thus encouraging that we have obtained such high classification results, though not better than the state of the art, across challenging texture data sets. A fundamental question going forward is how to enlarge the Klein bottle model, in a way which is motivated by the distribution of patches from texture images, so that it includes other important image primitives while keeping a low intrinsic dimensionality, and so that the framework of estimated coefficients can be applied. One way of doing so is by following a CW-complex philosophy [14], in which a space is built in stages by pasting disks of increasing dimension along their boundaries. We hope to explore this avenue in future work.

Acknowledgements. J. A. Perea was partially supported by the National Science Foundation through grant DMS 0905823. G. Carlsson was supported by the National Science Foundation through grants DMS 0905823 and DMS 096422, by the Air Force Office of Scientific Research through grants FA9550-09-1-0643 and FA9550-09-1-0531, and by the National Institutes of Health through grant I-U54-ca149145-01.

References

[1] Broadhurst, R. E. (2005). Statistical estimation of histogram variations for texture classification. In Proceedings of the fourth international workshop on texture analysis and synthesis, 25-30. Beijing, China, October 2005.
[2] Brodatz, P. (1966). Textures: A Photographic Album for Artists and Designers. Dover Publications, New York, NY, USA.

[3] Bell, A. J., & Sejnowski, T. J. (1997). The "Independent Components" of Natural Scenes are Edge Filters. Vision Res., 37(23), 3327-3338.

[4] Beyer, K., Goldstein, J., Ramakrishnan, R., & Shaft, U. (1999). When is "Nearest Neighbor" Meaningful? Database Theory - ICDT'99, 217-235.

[5] Carlsson, G. (2009). Topology and Data. Bulletin (new series) of the American Mathematical Society, 46, 255-308.

[6] Carlsson, G., Ishkhanov, T., de Silva, V., & Zomorodian, A. (2008). On the Local Behavior of Spaces of Natural Images. International Journal of Computer Vision, 76(1), 1-76.

[7] Crosier, M., & Griffin, L. D. (2010). Using Basic Image Features for Texture Classification. International Journal of Computer Vision, 88(3), 447-460.

[8] Dana, K. J., van Ginneken, B., Nayar, S. K., & Koenderink, J. (1999). Reflectance and texture of real world surfaces. ACM Trans. Graphics, 18(1), 1-34.

[9] de Silva, V., Morozov, D., & Vejdemo-Johansson, M. (2011). Persistent Cohomology and Circular Coordinates. Discrete and Computational Geometry, 45(4), 737-759.

[10] Dudok de Wit, T., & Floriani, E. (1998). Estimating probability densities from short samples: A parametric maximum likelihood approach. Physical Review E, 58(4), 5115-5122.

[11] Edelman, A., & Murakami, H. (1995). Polynomial Roots from Companion Matrix Eigenvalues. Mathematics of Computation, 64(210), 763-776.

[12] Franzoni, G. (2012). The Klein Bottle: Variations on a Theme. Notices of the American Mathematical Society, 59(8), 1076-1082.

[13] Hayman, E., Caputo, B., Fritz, M., & Eklundh, J. O. (2004). On the significance of real-world conditions for material classification. In European Conference on Computer Vision 2004, 253-266.

[14] Hatcher, A. (2002). Algebraic Topology. Cambridge University Press, New York, NY, USA.

[15] Hubel, D. H., & Wiesel, T. N. (1959). Receptive fields of single neurons in the cat's striate cortex. J. Physiol., 148(3), 574-591.

[16] Hubel, D. H., & Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey striate cortex. J. Physiol., 195(1), 215-243.

[17] Jurie, F., & Triggs, B. (2005). Creating Efficient Codebooks for Visual Recognition. In Tenth IEEE International Conference on Computer Vision, 1, 604-610.

[18] Lazebnik, S., Schmid, C., & Ponce, J. (2005). A Sparse Texture Representation Using Local Affine Regions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8), 1265-1278.

[19] Lee, A., Pedersen, K., & Mumford, D. (2003). The Nonlinear Statistics of High-Contrast Patches in Natural Images. International Journal of Computer Vision, 54(1), 83-103.

[20] Leung, T., & Malik, J. (2001). Representing and Recognizing the Visual Appearance of Materials Using Three-Dimensional Textons. International Journal of Computer Vision, 43(1), 29-44.

[21] Moler, C. (1991). Cleve's corner: ROOTS - of polynomials, that is. The MathWorks Newsletter, 5(1), 8-9.

[22] Reed, M., & Simon, B. (1980). Methods of Modern Mathematical Physics, Vol. I, 400 pages. Academic Press Inc., San Diego, CA, USA.

[23] Rubner, Y., Tomasi, C., & Guibas, L. J. (2000). The Earth Mover's Distance as a Metric for Image Retrieval. International Journal of Computer Vision, 40(2), 99-121.

[24] Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis, 176 pages. Chapman and Hall, New York, NY, USA.

[25] Thacker, N. A., Ahearne, F., & Rockett, P. I. (1997). The Bhattacharyya Metric as an Absolute Similarity Measure for Frequency Coded Data. Kybernetika, 34(4), 363-368.

[26] van Hateren, J. H., & van der Schaaf, A. (1998). Independent Component Filters of Natural Images Compared with Simple Cells in Primary Visual Cortex. Proc. R. Soc. Lond. B, 265(1394), 359-366.

[27] Varma, M., & Ray, D. (2007). Learning the discriminative power-invariance trade-off. In IEEE 11th International Conference on Computer Vision, 2007.

[28] Varma, M., & Zisserman, A. (2004). Unifying Statistical Texture Classification Frameworks. Image and Vision Computing, 22(14), 1175-1183.

[29] Varma, M., & Zisserman, A. (2005). A Statistical Approach to Texture Classification from Single Images. International Journal of Computer Vision, 62(1), 61-81.

[30] Varma, M., & Zisserman, A. (2009). A Statistical Approach to Material Classification Using Image Patch Exemplars. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(11), 2032-2047.

[31] Watson, G. S. (1969). Density Estimation by Orthogonal Series. The Annals of Mathematical Statistics, 40(4), 1496-1498.

[32] Weinberger, K. Q., & Saul, L. K. (2009). Distance Metric Learning for Large Margin Nearest Neighbor Classification. Journal of Machine Learning Research, 10, 207-244.

[33] Zhang, J., Marszalek, M., Lazebnik, S., & Schmid, C. (2007). Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study. International Journal of Computer Vision, 73(2), 213-238.