Optimal Linear Representations of Images for Object Recognition

Xiuwen Liu∗, Anuj Srivastava†, Kyle Gallivan‡
Abstract

Linear representations have frequently been used in image analysis but are seldom found to be optimal in specific applications. We propose a stochastic gradient algorithm for finding optimal linear representations of images for use in appearance-based object recognition. Using the nearest neighbor classifier, we specify a recognition performance function with continuous derivatives. By formulating the problem of finding optimal linear representations as that of optimization on a Grassmann manifold, we first derive an intrinsic gradient flow on the Grassmannian and later generalize it to a stochastic gradient algorithm for optimization. We present several experimental results to establish the effectiveness of this algorithm.
Index terms: Optimal subspaces, Grassmann manifold, object recognition, linear representations, dimension reduction, optimal component analysis.
1 Introduction
The task of recognizing objects from their 2D images generally requires excessive memory storage and computation, as images are rather high-dimensional. High dimensionality also prohibits effective use of statistical techniques in image analysis, since statistical models on high-dimensional spaces are difficult both to derive and to analyze. On the other hand, it is well understood that images are generated via physical processes that in turn are governed by a limited number of physical parameters. This motivates a search for methods that can reduce image dimensions without a severe loss of information. A commonly used technique is to project images linearly onto some pre-defined low-dimensional subspaces, and to use the image projections for analysis. For instance, let U be an n × d orthogonal matrix whose columns form an orthonormal basis of a d-dimensional subspace of IR^n (n >> d), and let I be an image reshaped into an n × 1 vector. The vector a(I) = U^T I ∈ IR^d, also called the vector of image coefficients, provides a d-dimensional representation of I. Statistical methods for computer vision tasks such as image classification, object recognition, and texture synthesis can then be developed by imposing probability models on a.

Within the framework of linear representations, several standard bases, including principal components (PCA) and the Fisher discriminant basis (FDA), have been widely used. Although they satisfy certain optimality criteria, they may not necessarily be optimal for the specific application at hand (for empirical evidence, see e.g. [3, 14]). A major goal of this paper is to present a technique for finding linear representations of images that are optimal for specific tasks and specific datasets. Our search for an optimal linear representation, or an optimal subspace, is based on a stochastic optimization process that maximizes a pre-specified performance function over all subspaces. Since the set of all subspaces (known as the Grassmann manifold [4]) is not a vector space, the optimization process has been modified to account for its curved geometry.

This paper is organized as follows. In Section 2 we set up the problem of optimizing the recognition performance over the set of subspaces, and in Section 3 we describe a stochastic gradient technique to solve it. Experimental results are shown in Section 4, followed by a discussion.
∗ Department of Computer Science, Florida State University, Tallahassee, FL 32306
† Department of Statistics, Florida State University, Tallahassee, FL 32306
‡ School of Computational Science and Information Technology, Florida State University, Tallahassee, FL 32306
2 Optimal Recognition Performance
We start with a mathematical formulation of the problem. Let U ∈ IR^{n×d} be an orthonormal basis of a d-dimensional subspace of IR^n, where n is the size of an image and d is the required dimension of the optimal subspace (generally n >> d). For an image I, considered as a column vector of size n, the vector of coefficients is given by a(I, U) = U^T I ∈ IR^d. To specify a recognition performance measure F for appearance-based applications, let there be C classes to be recognized from the images; each class has k_train training images (denoted by I_{c,1}, ..., I_{c,k_train}) and k_test test images (denoted by I'_{c,1}, ..., I'_{c,k_test}) to evaluate F. In order to utilize a gradient-based algorithm, F should have continuous directional derivatives. To ensure that, we define ρ(I'_{c,i}, U), the ratio of the between-class minimum distance and the within-class minimum distance of a test image from class c indexed by i, as
$$\rho(I'_{c,i}, U) = \frac{\min_{c' \neq c,\, j} d(I'_{c,i}, I_{c',j}; U)}{\min_{j} d(I'_{c,i}, I_{c,j}; U) + \epsilon_0}, \qquad \text{where } d(I_1, I_2; U) = \| a(I_1, U) - a(I_2, U) \|$$
(|| · || denotes the 2-norm of a vector), and ε_0 > 0 is a small number to avoid division by zero. Then, define F according to:
$$F(U) = \frac{1}{C\, k_{test}} \sum_{c=1}^{C} \sum_{i=1}^{k_{test}} h\big(\rho(I'_{c,i}, U) - 1\big), \qquad (1)$$
where h(·) is a monotonically increasing and bounded function. In our experiments, we have used h(x) = 1/(1 + exp(−2βx)), where β controls the smoothness of F. Note that I'_{c,i} is classified correctly according to the nearest neighbor rule under U if and only if ρ(I'_{c,i}, U) > 1. It follows that F is precisely the recognition performance of the nearest neighbor classifier when β → ∞.

Under this formulation, F(U) = F(UH) for any d × d orthogonal matrix H, since the distance satisfies d(I_1, I_2; U) = d(I_1, I_2; UH).^1 In other words, F depends on the subspace spanned by U but not on the specific basis chosen to represent that subspace. Therefore, our search for optimal representation(s) is on the space of d-dimensional subspaces rather than on their bases. Let G_{n,d} be the set of all d-dimensional subspaces of IR^n; it is called a Grassmann manifold^2 [4]. It is a compact, connected manifold of dimension d(n − d). An element of this manifold, i.e. a subspace, can be represented either by a basis (non-uniquely) or by a projection matrix (uniquely). Choosing the former, let U be an orthonormal basis in IR^{n×d} such that span(U) is the given subspace of IR^n. Let [U] denote the set of all orthonormal bases of span(U), i.e., [U] = {UH | H ∈ IR^{d×d}, H^T H = I_d} ∈ G_{n,d}. The problem of finding optimal linear subspaces for recognition then becomes an optimization problem:
$$[\hat{U}] = \operatorname*{argmax}_{[U] \in G_{n,d}} F([U]).$$
Since the set G_{n,d} is compact and F is a smooth function, the optimizer [Û] is well defined. Note that the maximizer of F may not be unique and hence [Û] may be set-valued rather than point-valued.
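For concreteness, the following NumPy sketch evaluates F(U) for a given orthonormal basis U, under the assumption that training and test images are stored as columns of arrays and that class labels are NumPy integer arrays. The helper name performance_F, the argument names, and the default parameter values are illustrative choices, not taken from the paper.

```python
import numpy as np

def performance_F(U, train, train_labels, test, test_labels, beta=10.0, eps0=1e-8):
    """Recognition performance F(U) of Eqn. 1, based on the ratio rho."""
    a_train = U.T @ train              # d x N_train coefficient vectors a(I, U)
    a_test = U.T @ test                # d x N_test coefficient vectors
    total = 0.0
    for i in range(a_test.shape[1]):
        c = test_labels[i]
        # d(I1, I2; U): Euclidean distance between coefficient vectors.
        dists = np.linalg.norm(a_train - a_test[:, [i]], axis=0)
        within = dists[train_labels == c].min()    # within-class minimum
        between = dists[train_labels != c].min()   # between-class minimum
        rho = between / (within + eps0)
        total += 1.0 / (1.0 + np.exp(-2.0 * beta * (rho - 1.0)))  # h(rho - 1)
    return total / a_test.shape[1]
```

Since each class contributes k_test test images, dividing by the total number of test images matches the 1/(C k_test) normalization in Eqn. 1.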
3 Optimization via Simulated Annealing
We have chosen a simulated annealing process to estimate the optimal subspace [Û]. In particular, we adopt a Monte Carlo version of simulated annealing using acceptance/rejection at every step, with the proposal distribution given by a stochastic gradient process. Gradient processes, both deterministic and stochastic, have long been used for solving non-linear optimization problems [7, 8]. Since the Grassmann manifold G_{n,d} is a curved space, as opposed to a (flat) vector space, the gradient process has to account for its intrinsic geometry.
^1 The choice of the 2-norm in d(I_1, I_2; U) allows for this equality.
^2 Grassmann and similar manifolds have been used in the literature to derive effective algorithms, but in different contexts; see e.g. [18] for subspace tracking, [13] for motion and structure estimation, [15] for independent component analysis, and [1, 6] for neural network learning algorithms.
Deterministic gradient methods, such as the Newton-Raphson method, on such manifolds with orthogonality constraints have been studied in [5]. We will start by describing a deterministic gradient process (of F) on G_{n,d} and later generalize it to a Markov chain Monte Carlo (MCMC) type simulated annealing process.
3.1 Deterministic Gradient Flow
ˆ ] to The performance function F can be viewed as a scalar-field on Gn,d . A necessary condition for [U ˆ ], the directional derivative of F , in the direction be a maximum is that for any tangent vector at [U of that vector, should be zero. To define directional derivatives on Gn,d , let J be the n × d matrix made up of first d columns of the n × n identity matrix In ; [J] denotes the d-dimensional subspace aligned to the first d axes of IRn . 1. Derivative of F at [J] ∈ Gn,d : Let Eij be an n × n skew-symmetric matrix such that: for 1 ≤ i ≤ d and d < j ≤ n, 1
Eij (k, l) =
−1
0
if k = i, l = j if k = j, l = i otherwise .
(2)
Consider the products E_{ij} J; there are d(n − d) such matrices and they form an orthogonal basis of the vector space tangent to G_{n,d} at [J]. That is, T_{[J]}(G_{n,d}) = span{E_{ij} J}. Notice that any tangent vector at [J] is of the form, for arbitrary scalars α_{ij},
$$\sum_{i=1}^{d} \sum_{j=d+1}^{n} \alpha_{ij} E_{ij} J = \begin{bmatrix} 0_d & B \\ -B^T & 0_{n-d} \end{bmatrix} J \in \mathrm{IR}^{n \times d}, \qquad (3)$$
where 0_i is the i × i matrix of zeros and B is a d × (n − d) real-valued matrix. The gradient vector of F at [J] is an n × d matrix given by A([J]) J, where
$$A([J]) = \Big( \sum_{i=1}^{d} \sum_{j=d+1}^{n} \alpha_{ij}(J)\, E_{ij} \Big) \in \mathrm{IR}^{n \times n}, \quad \text{and} \quad \alpha_{ij}(J) = \lim_{\epsilon \downarrow 0} \frac{F([e^{\epsilon E_{ij}} J]) - F([J])}{\epsilon} \in \mathrm{IR}. \qquad (4)$$
The α_{ij}'s are the directional derivatives of F in the directions given by the E_{ij}, respectively. The matrix A([J]) is a skew-symmetric matrix of the form given in Eqn. 3 (to the left of J) for some B, and points in the direction of maximum increase in F among all tangential directions at [J].

2. Derivative of F at any [U] ∈ G_{n,d}: Tangent spaces and directional derivatives at any arbitrary point [U] ∈ G_{n,d} follow similarly. For a given [U], let Q be an n × n orthogonal matrix such that QU = J. In other words, Q^T = [U V], where V ∈ IR^{n×(n−d)} is any matrix such that V^T V = I_{n−d} and U^T V = 0. Then, the tangent space at [U] is given by T_{[U]}(G_{n,d}) = {Q^T A : A ∈ T_{[J]}(G_{n,d})}, and the gradient of F at [U] is an n × d matrix given by A([U]) J, where
$$A([U]) = Q^T \Big( \sum_{i=1}^{d} \sum_{j=d+1}^{n} \alpha_{ij}(U)\, E_{ij} \Big) \in \mathrm{IR}^{n \times n}, \quad \text{and} \quad \alpha_{ij}(U) = \lim_{\epsilon \downarrow 0} \frac{F([Q^T e^{\epsilon E_{ij}} J]) - F([U])}{\epsilon} \in \mathrm{IR}. \qquad (5)$$
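As a practical aside, the matrix V used above (so that Q^T = [U V] is orthogonal) can be obtained numerically as an orthonormal basis of the orthogonal complement of span(U). The sketch below uses scipy.linalg.null_space for this; the helper name complete_basis is ours, and a QR-based completion would work equally well.

```python
import numpy as np
from scipy.linalg import null_space

def complete_basis(U):
    """Return V with orthonormal columns such that U^T V = 0 and [U V] is orthogonal."""
    return null_space(U.T)             # n x (n - d), spans the orthogonal complement

# Example check (illustrative sizes):
# n, d = 100, 5
# U, _ = np.linalg.qr(np.random.randn(n, d))
# V = complete_basis(U)
# Q = np.hstack([U, V]).T              # then Q @ U equals J, the first d columns of I_n
```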
The deterministic gradient flow on G_{n,d} is a solution of the equation:
$$\frac{dX(t)}{dt} = A(X(t))\, J, \qquad X(0) = [U_0] \in G_{n,d}, \qquad (6)$$
with A(·) as defined in Eqn. 5. Let G ⊂ G_{n,d} be an open neighborhood of [Û] and X(t) ∈ G for some finite t > 0. It can be shown that X(t) converges to a local maximum of F, but to achieve a global maximum we will have to add a stochastic component to X(t).

Numerical Approximation of Gradient Flow: Since F is generally not available in an analytical form, the directional derivatives α_{ij} are often approximated numerically using the finite differences
$$\alpha_{ij} = \frac{F([\tilde{U}]) - F([U])}{\epsilon}, \quad 1 \le i \le d \text{ and } d < j \le n, \qquad (7)$$
for a small value of ε > 0. Here, the matrix Ũ ≡ Q^T e^{εE_{ij}} J is an n × d matrix that differs from U in only the i-th column, which is now given by Ũ_i = cos(ε) U_i + sin(ε) V_j, where U_i, V_j are the i-th and j-th columns of U and V, respectively, with V defined as earlier. For a step size Δ > 0, we will denote the values of the search process at discrete times X(tΔ) by X_t. Then, a discrete approximation of the solution of Eqn. 6 is given by
$$X_{t+1} = Q_t^T \exp(\Delta A_t)\, J, \quad \text{where } A_t = \sum_{i=1}^{d} \sum_{j=d+1}^{n} \alpha_{ij}(X_t) E_{ij} \text{ and } Q_{t+1} = \exp(-\Delta A_t)\, Q_t. \qquad (8)$$
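The finite-difference computation of Eqn. 7 can be written compactly by perturbing one column of U at a time, as described above. In the hedged sketch below, F is any callable mapping a basis U to F(U) (for instance a lambda wrapping the performance_F sketch given earlier); the double loop is deliberately unoptimized and the helper name is ours.

```python
import numpy as np

def gradient_coefficients(U, V, F, eps=1e-4):
    """Finite-difference estimates of alpha_ij (Eqn. 7), returned as the d x (n-d) matrix B."""
    n, d = U.shape
    F_U = F(U)
    B = np.zeros((d, n - d))
    for i in range(d):
        for j in range(n - d):         # j indexes the columns of V (j + d in the paper's notation)
            U_tilde = U.copy()
            U_tilde[:, i] = np.cos(eps) * U[:, i] + np.sin(eps) * V[:, j]
            B[i, j] = (F(U_tilde) - F_U) / eps
    return B
```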
In general, the expression exp(ΔA_t) involves exponentiating an n × n matrix, a task that is computationally very expensive. However, given that the matrix A_t takes the skew-symmetric form shown before J in Eqn. 3, this exponentiation can be accomplished in order O(nd^2) computations (see e.g. [15]), using the singular value decomposition of the d × (n − d) sub-matrix B contained in A_t. Also, Q_t can be updated for the next time step using a similar efficient update.
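To make the O(nd^2) claim concrete, here is one way to evaluate exp(ΔA)J without ever forming an n × n matrix exponential, using a thin SVD of the d × (n − d) block B. Writing B = P diag(s) Q^T, the block structure of A gives exp(ΔA)J = [P cos(Δ diag(s)) P^T ; −Q sin(Δ diag(s)) P^T]. This identity follows from the block structure (it is not quoted from the paper), and the sketch assumes d ≤ n − d, which holds in this setting.

```python
import numpy as np

def expm_A_times_J(B, delta):
    """Compute exp(delta * [[0, B], [-B^T, 0]]) @ J in O(n d^2), assuming d <= n - d."""
    d, m = B.shape                                        # m = n - d
    P, s, Qt = np.linalg.svd(B, full_matrices=False)      # P: d x d, s: (d,), Qt: d x m
    Q = Qt.T                                              # (n - d) x d, orthonormal columns
    top = P @ np.diag(np.cos(delta * s)) @ P.T            # upper d x d block
    bottom = -Q @ np.diag(np.sin(delta * s)) @ P.T        # lower (n - d) x d block
    return np.vstack([top, bottom])                       # n x d, equals exp(delta * A) @ J

# Small-scale check against a dense matrix exponential:
# from scipy.linalg import expm
# n, d = 30, 4
# B = np.random.randn(d, n - d)
# A = np.block([[np.zeros((d, d)), B], [-B.T, np.zeros((n - d, n - d))]])
# J = np.eye(n)[:, :d]
# assert np.allclose(expm(0.1 * A) @ J, expm_A_times_J(B, 0.1))
```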
3.2 Simulated Annealing Using Stochastic Gradients
The gradient process X(t) has the drawback that it converges only to a local maximum. For global optimization, or to compute statistics under a given density on G_{n,d}, a stochastic component is often added to the gradient process to form a stochastic gradient flow, also referred to as a diffusion process [8]. We begin by constructing a stochastic gradient process on G_{n,d} and then add a Metropolis-Hastings type acceptance-rejection step to it to generate an appropriate Markov chain. One can obtain random gradients by adding a stochastic component to Eqn. 6 according to
$$dX(t) = A(X(t))\, J\, dt + \sqrt{2D} \sum_{i=1}^{d} \sum_{j=d+1}^{n} E_{ij} J\, dW_{ij}(t), \qquad (9)$$
where the W_{ij}(t) are real-valued, independent standard Wiener processes. It can be shown (refer to [19]) that, under certain conditions on F, the solution of Eqn. 9, X(t), is a Markov process with a unique stationary probability density f given by
$$f([U]) = \frac{1}{Z(D)} \exp\!\big(F([U])/D\big), \quad \text{where } Z(D) = \int_{G_{n,d}} \exp\!\big(F([U])/D\big). \qquad (10)$$
D ∈ IR plays the role of temperature and f is a density with respect to the Haar measure on the set G_{n,d}.
For a numerical implementation, Eqn. 9 has to be discretized with some step size Δ > 0. The discretized-time process is given by:
$$dA_t = A(X_t)\,\Delta + \sqrt{2\Delta D} \sum_{i=1}^{d} \sum_{j=d+1}^{n} w_{ij} E_{ij}, \qquad X_{t+1} = Q_t^T \exp(\Delta\, dA_t)\, J, \qquad Q_{t+1} = \exp(-\Delta\, dA_t)\, Q_t, \qquad (11)$$
where the w_ij's are i.i.d. standard normals. It is shown in [2] that for Δ → 0, the process {X_t} converges to the solution of Eqn. 9. The process {X_t} provides a discrete implementation of the stochastic gradient process. The stochastic component of X_t helps avoid getting stuck in a local maximum, but has the drawback of occasionally leading to states with low values of F. This drawback is removed by using an MCMC-type simulated annealing algorithm, in which we use the stochastic gradient process to generate a candidate for the next point but accept it only with a certain probability. That is, the right side of Eqn. 11 becomes a candidate Y that is selected as the next point X_{t+1} according to a criterion that depends on F.

Algorithm 1 (MCMC Simulated Annealing): Let X(0) = [U_0] ∈ G_{n,d} be any initial condition. Set t = 0.
1. Calculate the gradient matrix A(X_t) according to Eqn. 5.
2. Generate d(n − d) independent realizations, w_ij's, from the standard normal density. Using the value of X_t, calculate a candidate value Y according to Eqn. 11.
3. Compute F(Y), F(X_t), and set dF = F(Y) − F(X_t).
4. Set X_{t+1} = Y with probability min{exp(dF/D_t), 1}; else set X_{t+1} = X_t. Note that exp(dF/D_t) = f(Y)/f(X_t).
5. Set D_{t+1} = D_t/γ, set t = t + 1, and go to Step 1. Here γ > 1 is the cooling ratio for simulated annealing, with a typical value of 1.0025.

This algorithm is a particularization of Algorithm A.20 (p. 200) in the book by Robert and Casella [16]. Please consult that text for the convergence properties of X_t. Define the limiting point X* = lim_{t→∞} X_t as the optimal subspace.

A brute-force implementation of Algorithm 1 would be computationally expensive. However, note that the discrete update rules (given in Eqn. 11) can be carried out in O(nd^2) as mentioned earlier; the α_ij's (given by Eqn. 7) can also be evaluated efficiently by exploiting the facts that i) Ũ and U differ in only one column (as mentioned earlier), and ii) since ε is small, ρ(I'_{c,i}, Ũ) can be approximated efficiently using ρ(I'_{c,i}, U), and thus F([Ũ]) can be computed efficiently as well. This implementation yields an O(n²d²) complexity for each iteration. One can further reduce the computational cost using a hierarchical optimization process [22]. In the context of object recognition using linear representations, we term our technique optimal component analysis (OCA).
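Pulling the pieces together, the following sketch shows the control flow of Algorithm 1, reusing the illustrative helpers from the earlier sketches (performance_F, complete_basis, gradient_coefficients, expm_A_times_J). It uses a standard Euler discretization of the stochastic gradient step and recomputes V from the current U at every iteration, which is one valid choice of Q; step-size conventions may therefore differ in detail from Eqn. 11, and this is a sketch rather than the authors' implementation.

```python
import numpy as np

def oca_annealing(U0, F, n_iters=1000, delta=0.1, D0=1.0, gamma=1.0025, eps=1e-4, seed=0):
    """MCMC simulated annealing over the Grassmannian (a sketch of Algorithm 1)."""
    rng = np.random.default_rng(seed)
    U = U0.copy()
    n, d = U.shape
    D = D0
    F_U = F(U)
    for t in range(n_iters):
        V = complete_basis(U)                           # so that Q^T = [U V]
        # Drift (finite-difference gradient, Eqn. 7) plus Brownian perturbation.
        B = delta * gradient_coefficients(U, V, F, eps)
        B += np.sqrt(2.0 * delta * D) * rng.standard_normal((d, n - d))
        Y = np.hstack([U, V]) @ expm_A_times_J(B, 1.0)  # candidate basis Q^T exp(.) J
        F_Y = F(Y)
        # Step 4: accept with probability min{exp((F(Y) - F(X_t)) / D_t), 1}.
        if np.log(rng.uniform()) < (F_Y - F_U) / D:
            U, F_U = Y, F_Y
        D /= gamma                                      # Step 5: cooling schedule
    return U
```

In a real run, F would wrap the recognition performance on the available data, e.g. F = lambda U: performance_F(U, train, train_labels, test, test_labels).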
4 Experimental Results
We have successfully applied the proposed algorithm to the search for optimal linear bases on a variety of datasets. Due to limited space, we present results using only two datasets: the ORL face recognition dataset (available at http://www.uk.research.att.com/facedatabase.html) and the CMU PIE dataset. The ORL dataset consists of faces of 40 different subjects with 10 images each. The CMU PIE dataset is a comprehensive face dataset consisting of images of 66 subjects under a variety of conditions [17]. However, since not all images are cropped precisely, we use only those frontal images (under different lighting conditions) that were cropped manually.
4.1 Optimizing Performance Using Algorithm 1
Similar to all gradient-based methods, the choice of free parameters, such as Δ, ε, d, k_train, k_test, and U_0, may have a significant effect on the results of Algorithm 1. While limited theoretical results are available to analyze the convergence of such algorithms in IR^n, the case of simulated annealing over the space G_{n,d} is considerably more difficult. Instead of pursuing asymptotic convergence results, we have conducted extensive numerical simulations to demonstrate the convergence of the proposed algorithm under a variety of values for the free parameters. To support the choice of an MCMC-type stochastic algorithm, Figure 1 shows three examples (with different initial conditions) comparing i) the deterministic gradient algorithm, ii) the stochastic gradient algorithm (acceptance probability set to one), and iii) the proposed MCMC stochastic algorithm. Throughout this paper, U_ICA is computed using the FastICA algorithm by Hyvarinen [9] and U_FDA is calculated based on the procedure given by Belhumeur et al. [3]. While the deterministic gradient algorithm can be effective in some cases, it generally ends up in a local maximum. On the other hand, the stochastic gradient algorithm does not suffer from the problem of local maxima, although its convergence is much slower than that of the proposed algorithm. As shown in these examples and the ones in Figs. 2 and 3, the proposed algorithm converges quickly under different kinds of conditions. We attribute this mainly to the fact that, by taking the geometry of the manifold into account, the gradient flow provides the most efficient way of updating [1]. (In fact, Amari [1] has shown that such a dynamical system is Fisher efficient.) Obviously, the effectiveness of the proposed algorithm depends on the choice of parameters; through experiments we have found that it works for a wide range of parameter values. This can be attributed to the update rules in Eqn. 11, which guarantee that X_t stays on the Grassmann manifold, so that no regularization or correction is needed [15, 6].
Figure 1: Evolution of F(X_t) versus t for different initial conditions. In each panel, the dashed line corresponds to a deterministic gradient process, the dotted line to a stochastic gradient process that accepts every step, and the solid line to the proposed algorithm. For these curves, n = 10304 (that is, 92 × 112), d = 5, k_train = 5, and k_test = 5. (a) X_0 = U_PCA, (b) X_0 = U_ICA, (c) X_0 = U_FDA.

We have studied the variation of optimal performance versus the subspace rank, denoted by d, and k_train using the ORL dataset.
Figures 2(a)-(c) show the results for three different values of d with k_train = 5 and k_test = 5. In Fig. 2(a), for d = 3, it takes about 2000 iterations for the process to converge to a solution with perfect performance, while in Fig. 2(c), for d = 20, it takes less than 200 iterations. This is expected, as a larger d implies a bigger space and hence improved performance, or at least an easier path to perfect performance. Figures 2(d)-(f) show three results for three different values of k_train, with n = 154 and d = 5. Here the division of images into training and test sets was chosen randomly. In view of the nearest neighbor classifier being used to define F, it is easier to obtain a perfect solution with more training images, and the experimental results support that observation. Figure 2(d) shows the case with k_train = 1 (k_test = 9), where it takes about 3000 iterations for the process to converge to a perfect solution. In Fig. 2(f), where k_train = 8 (k_test = 2), the process converges to a perfect solution in about 300 iterations.
Figure 2: Evolution of F(X_t) versus t for different values of d and k_train. (a) d = 3 and k_train = 5. (b) d = 10 and k_train = 5. (c) d = 20 and k_train = 5. (d) d = 5 and k_train = 1. (e) d = 5 and k_train = 2. (f) d = 5 and k_train = 8.

As another example, we have applied the proposed algorithm to a subset of the CMU PIE dataset [17]. The subset we used includes the frontal images (with lighting variations) that were cropped manually. Figure 3 shows three examples of the algorithmic results. As in the previous examples, the proposed algorithm improves the performance significantly, often converging to solutions with perfect recognition performance.
4.2 Performance Comparisons with Standard Subspaces
So far we have described results on finding optimal subspaces under different conditions. In this section we focus on comparing empirically the performance of these optimal subspaces with the frequently used subspaces, namely U_PCA, U_ICA, and U_FDA. Figures 4(a) and (b) show the recognition performance (for the ORL database) for different values of d and k_train for four different kinds of subspaces: (i) the optimal subspace X* computed using Algorithm 1, (ii) U_PCA, (iii) U_ICA, and (iv) U_FDA.
Figure 3: Three examples of F(X_t) versus t on the CMU PIE dataset. Here n = 100 × 100. (a) X_0 = U_PCA and d = 10. (b) X_0 = U_ICA and d = 10. (c) X_0 = U_FDA and d = 5.

These results highlight the fact that the recognition performance of linear representations can vary significantly depending on the parameters and the datasets. To reach any conclusion, a comparison of a few standard bases is not enough, and a technique such as the one proposed here seems necessary.

The above examples represent the non-practical situation where the performance measure F can be evaluated over the whole database. In practice, however, we often have a limited number of training images, and we are interested in linear subspaces that lead to better performance on an unknown test set. To simulate this setting, we have modified ρ to
$$\rho(I_{c,i}, U) = \frac{\min_{c' \neq c,\, j} d(I_{c,i}, I_{c',j}; U)}{\min_{j \neq i} d(I_{c,i}, I_{c,j}; U) + \epsilon_0}.$$
This definition is related to the leave-one-out recognition performance on the training set (a small code sketch of this variant follows at the end of this subsection). We have applied this modified measure to the ORL dataset by randomly dividing all the images into non-overlapping training and test sets. Figure 4(c) shows the leave-one-out recognition performance on the training images, and Fig. 4(d) the performance of the optimal subspaces, along with the common linear representations, on a separate test set. The optimal subspaces found using only the training set also provide better performance on the test set in all these cases. Such results point to the possibility of improving generalization using the proposed algorithm; we have not pursued this topic further in this paper.
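A minimal sketch of this leave-one-out variant is given below; as before, the helper name and array-based interface are ours, the labels are assumed to be a NumPy integer array, and each class is assumed to contain at least two training images so that the within-class minimum is well defined.

```python
import numpy as np

def leave_one_out_F(U, train, train_labels, beta=10.0, eps0=1e-8):
    """Leave-one-out version of F, using the modified rho on the training images."""
    a = U.T @ train                                  # d x N coefficient vectors
    total = 0.0
    N = a.shape[1]
    for i in range(N):
        c = train_labels[i]
        dists = np.linalg.norm(a - a[:, [i]], axis=0)
        same = (train_labels == c)
        same[i] = False                              # exclude the image itself (j != i)
        within = dists[same].min()
        between = dists[train_labels != c].min()
        rho = between / (within + eps0)
        total += 1.0 / (1.0 + np.exp(-2.0 * beta * (rho - 1.0)))
    return total / N
```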
5 Discussion
We have proposed an MCMC-based simulated annealing algorithm to find optimal linear subspaces, assuming that the performance function F can be computed. By formulating an optimization problem on the Grassmann manifold, an intrinsic optimization algorithm is obtained. Extensive experiments demonstrate the effectiveness and feasibility of this algorithm. While the focus of this paper is on finding linear representations for recognition, the proposed algorithm is not limited to any particular performance function. In fact, we have applied the algorithm and its generalized versions to other problems, such as i) finding optimal linear filters that are both sparse and effective for recognition [10] and ii) finding optimal representations for image retrieval applications [12]. While F given by Eqn. 1 is based on the nearest neighbor rule, its maximum is achieved when images from one class form clusters that are far from the other clusters. Therefore the learned representations can improve the performance of other classifiers as well; an example is shown in [11] for support vector machines [20]. While linear representations may not be sufficient to approximate the manifolds generated by physical processes, which are often nonlinear in nature, the proposed technique can be extended to nonlinear representations using kernel methods, which have been applied to principal component analysis [21].
Figure 4: Recognition performance of different linear subspaces versus d and k_train on the ORL dataset. The solid line is that of X*, the dashed line that of U_FDA, the dotted line that of U_PCA, and the dash-dotted line that of U_ICA. (a) Recognition performance versus d with k_train = 5. (b) Recognition performance versus k_train with d = 5. (c) and (d) The leave-one-out recognition performance on the training set and the corresponding recognition performance on a separate test set, with d = 5.
Acknowledgments

We thank the three anonymous reviewers whose insightful comments helped improve the clarity and presentation of this paper, and the producers of the ORL and PIE datasets for making them publicly available. This research was supported in part by NSF DMS-0101429, ARO DAAD19-99-1-0267, NMA 201-01-2010, NSF CCR-9912415, and NSF IIS-0307998.
References

[1] S. Amari. Natural gradient works efficiently in learning. Neural Computation, 10:251–276, 1998.
[2] Y. Amit. A multiflow approximation to diffusions. Stochastic Processes and their Applications, 37(2):213–238, 1991.
[3] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):711–720, 1997.
[4] W. M. Boothby. An Introduction to Differential Manifolds and Riemannian Geometry. Academic Press, 1986.
[5] A. Edelman, T. Arias, and S. T. Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998.
[6] S. Fiori. A theory for learning by weight flow on Stiefel-Grassman manifold. Neural Computation, 13:1625–1647, 2001.
[7] S. Geman and C.-R. Hwang. Diffusions for global optimization. SIAM J. Control and Optimization, 24(24):1031–1043, 1987.
[8] U. Grenander and M. I. Miller. Representations of knowledge in complex systems. Journal of the Royal Statistical Society, 56(3), 1994.
[9] A. Hyvarinen. Fast and robust fixed-point algorithm for independent component analysis. IEEE Transactions on Neural Networks, 10:626–634, 1999.
[10] X. Liu and A. Srivastava. Stochastic search for optimal linear representations of images on spaces with orthogonality constraints. In Proceedings of the Fourth International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, in press, 2003 (available at http://calais.stat.fsu.edu/anuj/pub.html).
[11] X. Liu and A. Srivastava. Applications of stochastic algorithms for optimal linear representations. Submitted to the Neural Information Processing Systems Conference, 2003 (available at http://calais.stat.fsu.edu/anuj/pub.html).
[12] X. Liu, A. Srivastava, and D. Sun. Learning optimal representations for image retrieval applications. In Proceedings of the International Conference on Image and Video Retrieval, in press, 2003.
[13] Y. Ma, J. Kosecka, and S. Sastry. Optimization criteria and geometric algorithms for motion and structure estimation. International Journal of Computer Vision, 44(3):219–249, 2001.
[14] A. M. Martinez and A. C. Kak. PCA versus LDA. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(2):228–233, 2001.
[15] Y. Nishimori. Learning algorithm for independent component analysis by geodesic flows on orthogonal group. In Proceedings of the International Joint Conference on Neural Networks, vol. 2, 933–938, 1999.
[16] C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer Texts in Statistics, 1999.
[17] T. Sim, S. Baker, and M. Bsat. The CMU pose, illumination, and expression database. IEEE Transactions on Pattern Analysis and Machine Intelligence, in press, 2003.
[18] A. Srivastava. A Bayesian approach to geometric subspace estimation. IEEE Transactions on Signal Processing, 48(5):1390–1400, 2000.
[19] A. Srivastava, U. Grenander, G. R. Jensen, and M. I. Miller. Jump-diffusion Markov processes on orthogonal groups for object recognition. Journal of Statistical Planning and Inference, 103(1-2):15–37, 2002.
[20] V. N. Vapnik. The Nature of Statistical Learning Theory, 2nd ed. Springer-Verlag, New York, 2000.
[21] B. Schölkopf, A. J. Smola, and K.-R. Müller. Kernel principal component analysis. In Advances in Kernel Methods: Support Vector Learning, B. Schölkopf, C. J. C. Burges, and A. J. Smola (eds.), pp. 255–268, The MIT Press, Cambridge, MA, 2002.
[22] Q. Zhang, X. Liu, and A. Srivastava. Hierarchical learning of optimal linear representations. In Proceedings of the IEEE Workshop on Statistical Analysis in Computer Vision, 2003.