J Optim Theory Appl (2013) 157:853–865 DOI 10.1007/s10957-011-9832-4
Superlinear Convergence of a General Algorithm for the Generalized Foley–Sammon Discriminant Analysis Lei-Hong Zhang · Li-Zhi Liao · Michael K. Ng
Received: 30 November 2010 / Accepted: 14 March 2011 / Published online: 30 March 2011 © Springer Science+Business Media, LLC 2011
Abstract Linear Discriminant Analysis (LDA) is one of the most efficient statistical approaches for feature extraction and dimension reduction. The generalized Foley– Sammon transform and the trace ratio model are very important in LDA and have received increasing interest. An efficient iterative method has been proposed for the resulting trace ratio optimization problem, which, under a mild assumption, is proved to enjoy both the local quadratic convergence and the global convergence to the global optimal solution (Zhang, L.-H., Liao, L.-Z., Ng, M.K.: SIAM J. Matrix Anal. Appl. 31:1584, 2010). The present paper further investigates the convergence behavior of this iterative method under no assumption. In particular, we prove that the iteration converges superlinearly when the mild assumption is removed. All possible limit points are characterized as a special subset of the global optimal solutions. An illustrative numerical example is also presented.
The authors would like to thank two anonymous referees and the editor for their helpful comments and suggestions on the earlier version of this paper. Research of the second author was supported in part by FRG grants from Hong Kong Baptist University and the Research Grant Council of Hong Kong. Research of the third author was supported in part by RGC grants 7035/04P, 7035/05P and HKBU FRGs.

L.-H. Zhang (corresponding author), Department of Applied Mathematics, Shanghai University of Finance and Economics, 777 Guoding Road, Shanghai 200433, People's Republic of China. e-mail: [email protected]

L.-Z. Liao · M.K. Ng, Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Kowloon, Hong Kong, People's Republic of China. L.-Z. Liao, e-mail: [email protected]; M.K. Ng, e-mail: [email protected]
Keywords Dimensionality reduction · Linear discriminant analysis · Generalized Foley–Sammon transform · The trace ratio optimization problem · Superlinear convergence
1 Introduction

As a classical statistical approach for supervised dimensionality reduction and classification, linear discriminant analysis (LDA) is very efficient in overcoming the curse of dimensionality arising in modern high-dimensional data sets (see, e.g., [1–11]). To date, LDA has been widely applied in data mining, machine learning and bioinformatics. For a given high-dimensional data set, the main task of LDA is to find an "optimal" linear transformation serving two basic purposes: dimensionality reduction and feature extraction for classification. In general, the optimal linear transformation is defined to maximize the between-class separation and minimize the within-class cohesion simultaneously, which are commonly measured by the traces of relevant matrices (a brief review of LDA is presented in Sect. 2). A natural criterion for the optimal linear transformation, subject to an orthogonality constraint, leads to the generalized Foley–Sammon discriminant analysis, which is also called the trace ratio model (see [5, 12–16]).

However, due to the lack of a closed-form solution and efficient algorithms, the trace ratio criterion was previously approximated by the ratio trace criterion (see, e.g., [12–14, 17]), which has been extensively studied in the LDA literature (see, e.g., [3, 6, 18–26]). Nevertheless, recent publications [13, 27] indicate that the solution based on the ratio trace may deviate from the original objective and may lead to uncertainty in subsequent classification and clustering. More discussion on the comparison between these two criteria can be found, e.g., in [13–17].

Recently, there has been increasing interest in the trace ratio optimization problem. In particular, an efficient iterative method (summarized as Algorithm 1 in Sect. 2) is proposed in [13], which is proved to converge globally to the global optimal solution; moreover, under a generic condition, the local quadratic convergence rate is also proved [14]. The generic condition assumes that the dominant eigenspace of a relevant matrix is separated from the rest, in which case the set of global solutions of the trace ratio optimization problem can be completely characterized. However, without this generic assumption, the global solution set becomes much more complicated, and the convergence behavior of the iterative method remains open. The main task of the present paper is to discuss the convergence behavior of this iterative method under no assumption. In particular, when the generic assumption is relaxed, we prove that the iteration converges superlinearly, and moreover, we are able to characterize all possible limit points of this iterative method.

The rest of this paper is organized as follows: in the next section, we present a brief review of linear discriminant analysis and provide some preliminary results related to the trace ratio optimization problem; in Sect. 3, without the generic assumption, we analyze the convergence rate and present some properties of the limit point(s) of Algorithm 1; in Sect. 4, we give a numerical example to support the theoretical analysis of Sect. 3; some conclusions are drawn in Sect. 5.
2 Preliminary Results

We first briefly review linear discriminant analysis and present some preliminary results. Suppose that we have a collection of (training) samples obtained from c independent classes, and the ith class (i = 1, 2, . . . , c) contains $n_i$ samples, each of which is represented by a high-dimensional vector in $\mathbb{R}^N$. Thus, these samples form a data matrix $A \in \mathbb{R}^{N \times d}$ with $d = \sum_{i=1}^{c} n_i$. For the purpose of dimensionality reduction and feature extraction, linear discriminant analysis seeks an optimal linear transformation, say $G \in \mathbb{R}^{N \times l}$ (generally $l \ll N$). Under the condition $l \ll N$, dimensionality reduction is realized because a high-dimensional sample $a \in \mathbb{R}^N$ can be reduced and mapped to a low-dimensional "sample" $y = G^{\top} a \in \mathbb{R}^l$. The principle in defining an "optimal" G for classification [6] is to simultaneously maximize the between-class separation and minimize the within-class cohesion, which are usually measured by $\operatorname{tr}(G^{\top} S_b G)$ and $\operatorname{tr}(G^{\top} S_w G)$, respectively. The matrices $S_b, S_w \in \mathbb{R}^{N \times N}$ are both symmetric and positive semi-definite and are called the between-class scatter matrix and the within-class scatter matrix [6], respectively. As a result, a natural criterion for the optimal G is to maximize, subject to some constraint, the trace ratio $F_1(G) := \operatorname{tr}(G^{\top} S_b G)/\operatorname{tr}(G^{\top} S_w G)$. Maximizing the trace ratio criterion $F_1(G)$ subject to the orthogonality constraint,

$$
\max_{G^{\top} G = I_l} \frac{\operatorname{tr}(G^{\top} S_b G)}{\operatorname{tr}(G^{\top} S_w G)}, \tag{1}
$$
is called the generalized Foley–Sammon discriminant analysis (GFST, [5, 12–15]). In the case of a singular $S_w$, which is referred to as the undersampled problem in modern applications [28], the regularization technique [29] turns out to be one of the effective remedies: one simply adds a regularization term $\mu I_N$ ($\mu > 0$) to the singular scatter matrix. This then yields the regularized GFST model (RGFST)

$$
\max_{G^{\top} G = I_l} \frac{\operatorname{tr}(G^{\top} S_b G)}{\operatorname{tr}(G^{\top} S_w G) + \mu l}, \quad \mu > 0, \tag{2}
$$
which has been extensively discussed in [14]. From the computational point of view, both (1) and (2) need the solution of the following general optimization problem (i.e., the trace ratio optimization problem):

$$
\max_{G^{\top} G = I_l} \psi(G), \qquad \psi(G) := \frac{\operatorname{tr}(G^{\top} B G)}{\operatorname{tr}(G^{\top} W G)}, \tag{3}
$$

where $B = B^{\top} \in \mathbb{R}^{N \times N}$ is positive semi-definite and $W = W^{\top} \in \mathbb{R}^{N \times N}$ is positive definite. Let $\psi^*$ be the optimal objective value of (3); then the set $S^*$ of global solutions of (3),

$$
S^* := \bigl\{ G \in \mathbb{R}^{N \times l} \mid \psi(G) = \psi^*,\; G^{\top} G = I_l \bigr\},
$$

can also be characterized by Theorem 2.1 below (see [14] and [12]).
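As a concrete illustration of how (1) and (2) give rise to an instance of (3): for (1) one can take $B = S_b$ and $W = S_w$ (assuming $S_w$ is nonsingular), and for (2), since $G^{\top} G = I_l$ implies $\operatorname{tr}(G^{\top} S_w G) + \mu l = \operatorname{tr}(G^{\top} (S_w + \mu I_N) G)$, one can take $B = S_b$ and $W = S_w + \mu I_N$. The following minimal NumPy sketch builds such an instance from labeled data; the function names are ours, and the scatter matrices follow the common definitions in the LDA literature (cf. [6]).

```python
import numpy as np

def scatter_matrices(A, labels):
    """Between-class (Sb) and within-class (Sw) scatter matrices of the
    columns of A, grouped by `labels` (common LDA definitions, cf. [6])."""
    N = A.shape[0]
    m = A.mean(axis=1, keepdims=True)                 # global mean
    Sb, Sw = np.zeros((N, N)), np.zeros((N, N))
    for c in np.unique(labels):
        Ac = A[:, labels == c]
        mc = Ac.mean(axis=1, keepdims=True)           # class mean
        Sb += Ac.shape[1] * (mc - m) @ (mc - m).T
        Sw += (Ac - mc) @ (Ac - mc).T
    return Sb, Sw

def psi(G, B, W):
    """Trace ratio objective of problem (3)."""
    return np.trace(G.T @ B @ G) / np.trace(G.T @ W @ G)

# A toy RGFST instance (2) written as an instance of (3): B = Sb, W = Sw + mu*I_N.
rng = np.random.default_rng(0)
A = rng.standard_normal((10, 30))                     # 30 samples in R^10
labels = np.repeat([0, 1, 2], 10)                     # three classes
Sb, Sw = scatter_matrices(A, labels)
mu = 1e-3
B, W = Sb, Sw + mu * np.eye(10)
G0, _ = np.linalg.qr(rng.standard_normal((10, 2)))    # feasible G (G^T G = I_l)
print(psi(G0, B, W))
```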
Theorem 2.1 Let $\psi^*$ be the global optimal objective value of (3). Then any $G^* \in \mathbb{R}^{N \times l}$ solves (3) globally if and only if $G^*$ is an orthonormal eigenbasis corresponding to the l largest eigenvalues of the matrix

$$
E^* := (B - \psi^* \cdot W) \in \mathbb{R}^{N \times N}. \tag{4}
$$
Moreover, the sum of the l largest eigenvalues of the matrix $E^*$ is zero. Indeed, let $G^*$ be an arbitrary orthonormal eigenbasis corresponding to the l largest eigenvalues of $E^*$ given by (4). Then there is a matrix $M \in \mathbb{R}^{l \times l}$ satisfying $E^* G^* = G^* M$, and $(M, G^*)$ is an orthonormal eigenpair [30] of $E^*$. Premultiplying both sides by $(G^*)^{\top}$ yields $M = (G^*)^{\top} B G^* - \psi^* (G^*)^{\top} W G^*$, and therefore the sum of the l largest eigenvalues of $E^*$ is

$$
\operatorname{tr}\bigl((G^*)^{\top} E^* G^*\bigr) = \operatorname{tr}(M) = \operatorname{tr}\bigl((G^*)^{\top} B G^*\bigr) - \psi^* \operatorname{tr}\bigl((G^*)^{\top} W G^*\bigr) = 0,
$$

where the last equality follows from $\psi^* = \operatorname{tr}((G^*)^{\top} B G^*)/\operatorname{tr}((G^*)^{\top} W G^*)$. The eigenspace $\operatorname{span}(G^*)$ is said to be a simple eigenspace if the generic assumption
$$
\lambda_l(E^*) > \lambda_{l+1}(E^*) \tag{5}
$$
holds. Here, counting algebraic multiplicity, $\lambda_l(E^*)$ stands for the lth largest eigenvalue of $E^*$. Under assumption (5), the eigenspace $\operatorname{span}(G^*)$ is uniquely determined by its eigenvalues $\lambda_1(E^*) \ge \cdots \ge \lambda_l(E^*)$ (see [30], p. 244), and therefore, according to Theorem 2.1, the set of global solutions to (3) can be completely expressed by

$$
S^* = \bigl\{ G^* V \mid V^{\top} V = I_l,\; V \in \mathbb{R}^{l \times l} \bigr\} = \operatorname{span}(G^*) \cap \bigl\{ G \in \mathbb{R}^{N \times l} \mid G^{\top} G = I_l \bigr\}. \tag{6}
$$
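Before turning to the case where (5) fails, note that the zero-sum property in Theorem 2.1 gives a simple numerical characterization of $\psi^*$: the scalar function $f(\psi) := \sum_{i=1}^{l} \lambda_i(B - \psi W)$ is strictly decreasing (since W is positive definite) and vanishes exactly at $\psi = \psi^*$. The following toy sketch (our own check, not the algorithm of [13]) locates this root by bisection, assuming NumPy.

```python
import numpy as np

def f(psi, B, W, l):
    """Sum of the l largest eigenvalues of B - psi*W (cf. Theorem 2.1)."""
    return np.sort(np.linalg.eigvalsh(B - psi * W))[-l:].sum()

rng = np.random.default_rng(1)
N, l = 6, 2
X = rng.standard_normal((N, N)); B = X @ X.T                  # symmetric positive semi-definite
Y = rng.standard_normal((N, N)); W = Y @ Y.T + N * np.eye(N)  # symmetric positive definite

# f(0) >= 0 and f is strictly decreasing, so bracket the root and bisect.
lo, hi = 0.0, 1.0
while f(hi, B, W, l) > 0:
    hi *= 2.0
for _ in range(80):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if f(mid, B, W, l) > 0 else (lo, mid)
psi_star = 0.5 * (lo + hi)
print(psi_star, f(psi_star, B, W, l))   # the residual f(psi*) should be ~0
```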
However, if (5) is not satisfied, the global solution set is relatively complicated and does not possess the simple form of (6). An iterative scheme is proposed in [13] to solve (3), which can be simply described in Algorithm 1. For the convergence behavior of Algorithm 1, the global convergence [13, 14] and the linear convergence rate [14] have been established and are summarized in Theorem 2.2.
Given a symmetric and positive semi-definite $B \in \mathbb{R}^{N \times N}$ and a symmetric and positive definite $W \in \mathbb{R}^{N \times N}$, this algorithm computes a global solution to (3).
1. Initial step: Select any $G_0 \in \{G \in \mathbb{R}^{N \times l} \mid G^{\top} G = I_l\}$ and the tolerance $\varepsilon > 0$. Set $k = 0$.
2. Compute an orthonormal eigenbasis $G_{k+1}$ corresponding to the l largest eigenvalues of $E_{\psi_k} := B - \psi_k W$, where $\psi_k := \psi(G_k)$.
3. If $\psi_{k+1} - \psi_k < \varepsilon$, then stop (if $\psi_{k+1} - \psi_k = 0$, then $G_{k+1}$ solves (3) globally); otherwise, set $k = k + 1$ and go to Step 2.
Algorithm 1: A fast iterative scheme for (3)
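A minimal NumPy sketch of Algorithm 1 is given below; the function and variable names are ours, and a dense symmetric eigensolver is just one possible way to realize Step 2.

```python
import numpy as np

def trace_ratio(B, W, l, G0, eps=1e-14, max_iter=100):
    """Algorithm 1: maximize psi(G) = tr(G'BG)/tr(G'WG) over G'G = I_l."""
    G = G0
    psi = np.trace(G.T @ B @ G) / np.trace(G.T @ W @ G)
    for _ in range(max_iter):
        # Step 2: orthonormal eigenbasis of the l largest eigenvalues of B - psi_k*W.
        _, V = np.linalg.eigh(B - psi * W)      # eigenvalues in ascending order
        G = V[:, -l:]
        psi_new = np.trace(G.T @ B @ G) / np.trace(G.T @ W @ G)
        # Step 3: the sequence psi_k is monotonically increasing (Theorem 2.2).
        if psi_new - psi < eps:
            return G, psi_new
        psi = psi_new
    return G, psi

# Usage: any feasible starting point works, e.g. the first l columns of the identity.
# G_star, psi_star = trace_ratio(B, W, l=2, G0=np.eye(B.shape[0])[:, :2])
```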
Theorem 2.2 Let W be symmetric and positive definite, B be symmetric and positive semi-definite, and $\psi^*$ be the global optimal objective value of (3). Then the sequence $\{\psi_k\}$ generated by Algorithm 1 is monotonically increasing to $\psi^*$ and satisfies

$$
\psi^* - \psi_{k+1} \le (1 - \nu)(\psi^* - \psi_k), \quad k = 0, 1, \ldots, \tag{7}
$$

where $\nu \in (0, 1]$ is a constant determined by the ordered eigenvalues $\lambda_1(W) \ge \cdots \ge \lambda_N(W) > 0$ of W (its explicit expression is given in [14]).

It should be mentioned that (7) implies that the linear convergence of $\{\psi_k\}$ to $\psi^*$ occurs from the first step. The local quadratic convergence of both sequences $\{\psi_k\}$ and $\{G_k\}$ under assumption (5) has recently been proved in [14], where two types of inexact computation of Step 2 of Algorithm 1 are also established. As the global solution set essentially concerns the eigenspace of $E^*$, the local quadratic convergence rate of $\{G_k\}$ is measured by the distance between subspaces (see [31], Sect. 2.6.3):

Definition 2.1 Let $M_1$ and $M_2$ be two subspaces of $\mathbb{R}^N$ with the same dimension. The distance between $M_1$ and $M_2$ is defined by $\operatorname{dist}(M_1, M_2) := \|\pi_{M_1} - \pi_{M_2}\|_2$, where $\pi_{M_1}$ and $\pi_{M_2}$ are the orthogonal projections onto $M_1$ and $M_2$, respectively.

Theorem 2.3 Let $G^*$ be any global maximizer of (3) and suppose that (5) holds. Let $\{G_k\}$ and $\{\psi_k\}$ be generated from Algorithm 1. Then $\operatorname{dist}(\operatorname{span}(G_k), \operatorname{span}(G^*)) \to 0$ quadratically, and moreover, $\psi_k \to \psi^*$ quadratically as $k \to +\infty$.
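The distance in Definition 2.1 is easy to evaluate from orthonormal bases of the two subspaces; a small sketch of our own, assuming NumPy:

```python
import numpy as np

def subspace_dist(G1, G2):
    """dist(span(G1), span(G2)) = ||P1 - P2||_2 for matrices with orthonormal columns."""
    P1 = G1 @ G1.T                      # orthogonal projector onto span(G1)
    P2 = G2 @ G2.T                      # orthogonal projector onto span(G2)
    return np.linalg.norm(P1 - P2, 2)   # spectral norm

# Example: dist between span{e1, e2} and span{e1, e3} in R^4 equals 1.
I = np.eye(4)
print(subspace_dist(I[:, [0, 1]], I[:, [0, 2]]))
```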
3 Convergence Analysis Under No Assumption

Though it is extremely difficult to construct an example with the condition $\lambda_l(E^*) = \lambda_{l+1}(E^*)$ exactly, from a practical point of view we may still encounter a problem where the gap $\delta := \lambda_l(E^*) - \lambda_{l+1}(E^*)$ is very close to zero (as demonstrated by our illustrative example in Sect. 4). In this case, due to the presence of round-off errors in practical computations, the method behaves much as it would under the condition $\lambda_l(E^*) = \lambda_{l+1}(E^*)$. Therefore, it is necessary to discuss the convergence behavior when condition (5) is relaxed. We should keep in mind, however, that a fundamental difference in the convergence behavior of Algorithm 1 arises when condition (5) is removed. In particular, whenever (5) is satisfied, a perturbation analysis [14] ensures that, for any $\psi$ sufficiently close to $\psi^*$, the eigenspace corresponding to the
l largest eigenvalues of $B - \psi W$ is still an l-dimensional subspace, and therefore, in the asymptotic state, the eigenspace $\operatorname{span}(G_{k+1})$ generated by Step 2 of Algorithm 1 is unique. From this point of view, we can say that the sequence of subspaces $\{\operatorname{span}(G_k)\}$ generated by Algorithm 1 is locally uniquely determined by $\operatorname{span}(G_0)$. This is one of the keys to the proof of the local quadratic convergence in [14]. By contrast, if condition (5) is removed, it is no longer guaranteed that, for any $\psi$ sufficiently close to $\psi^*$, the eigenspace corresponding to the l largest eigenvalues of $B - \psi W$ is an l-dimensional subspace. As Step 2 of Algorithm 1 does not prescribe any specific rule for choosing a particular eigenbasis, this could lead to uncertainty in the asymptotic state of the iteration, where the eigenspace $\operatorname{span}(G_{k+1})$ generated by Step 2 may not be uniquely determined. This compromises the possibility of establishing the convergence of the sequence $\{G_k\}$.¹ Fortunately, in this section we will see that the local superlinear convergence of the sequence $\{\psi_k\}$ to $\psi^*$ can still be established; moreover, we will also show that Algorithm 1 cannot converge to an arbitrary global solution in $S^*$, but only to a special subset of $S^*$. For simplicity of discussion, we denote

$$
B_G := G^{\top} B G, \qquad W_G := G^{\top} W G,
$$
$$
S_W^* := \arg\max_{G \in S^*} \operatorname{tr}(W_G) \subseteq S^*, \qquad S_B^* := \arg\max_{G \in S^*} \operatorname{tr}(B_G) \subseteq S^*, \tag{8}
$$

and

$$
\gamma^* := \max_{G \in S^*} \operatorname{tr}(W_G) > 0.
$$
If (5) holds, then it is obvious that $S^* \equiv S_B^* \equiv S_W^*$, since for any two $\bar{G}, \hat{G} \in S^*$ there is an orthogonal matrix $V \in \mathbb{R}^{l \times l}$ such that $\bar{G} = \hat{G} V$, and hence $\operatorname{tr}(W_{\bar{G}}) = \operatorname{tr}(W_{\hat{G}})$ and $\operatorname{tr}(B_{\bar{G}}) = \operatorname{tr}(B_{\hat{G}})$; however, if $E^*$ has repeated l largest eigenvalues, i.e., (5) does not hold, then $S_B^*$ and $S_W^*$ may not be the same as $S^*$. An obvious result relating the sets $S_B^*$ and $S_W^*$ can be stated in this case.

Lemma 3.1 Let $S_B^*$ and $S_W^*$ be defined by (8). Then $S_B^* \equiv S_W^*$.

Proof Note that

$$
\operatorname{tr}(B_G) = \psi^* \operatorname{tr}(W_G) \quad \forall G \in S^*,
$$

and hence

$$
\max_{G \in S^*} \operatorname{tr}(B_G) = \psi^* \max_{G \in S^*} \operatorname{tr}(W_G).
$$
¹ A similar convergence phenomenon also appears in the classical Rayleigh quotient iteration [32] for the standard eigenvalue problem, where, in very special cases, the strong convergence of the iterates is not guaranteed and the cubic convergence rate is reduced to linear convergence.
Let $\bar{G} \in S_W^*$ be arbitrary, and suppose that $\bar{G} \notin S_B^*$; then, for any $\hat{G} \in S_B^*$, we have the contradiction

$$
\psi^* \operatorname{tr}(W_{\hat{G}}) \le \psi^* \operatorname{tr}(W_{\bar{G}}) = \operatorname{tr}(B_{\bar{G}}) < \operatorname{tr}(B_{\hat{G}}), \quad \text{or} \quad \psi(\hat{G}) = \frac{\operatorname{tr}(B_{\hat{G}})}{\operatorname{tr}(W_{\hat{G}})} > \psi^*.
$$

This shows that $S_W^* \subseteq S_B^*$, and similarly we can show that $S_W^* \supseteq S_B^*$.
Remark 3.1 Similar to Lemma 3.1, it is easy to see that

$$
\arg\min_{G \in S^*} \operatorname{tr}(W_G) \equiv \arg\min_{G \in S^*} \operatorname{tr}(B_G),
$$

and, if assumption (5) holds, they are both equal to $S^*$.

Lemma 3.2 Let $\{\psi_k\}$ and $\{G_k\}$ be generated from Algorithm 1, and let $S^*$ be the global solution set of problem (3). Then, for every $G_k$ with $k = 1, 2, \ldots$, it follows that

$$
\operatorname{tr}(W_{G_k}) \ge \gamma^* := \max_{G \in S^*} \operatorname{tr}(W_G).
$$
Proof For each $k = 0, 1, \ldots$, one has

$$
\operatorname{tr}\bigl(G_{k+1}^{\top} E_{\psi_k} G_{k+1}\bigr) = \operatorname{tr}(B_{G_{k+1}}) - \psi_k \cdot \operatorname{tr}(W_{G_{k+1}}),
$$

or equivalently

$$
\psi_{k+1} = \frac{\operatorname{tr}(B_{G_{k+1}})}{\operatorname{tr}(W_{G_{k+1}})} = \psi_k + \frac{\operatorname{tr}(G_{k+1}^{\top} E_{\psi_k} G_{k+1})}{\operatorname{tr}(W_{G_{k+1}})}. \tag{9}
$$

Let $G^* \in S_W^*$ be arbitrary. Then from Theorem 2.1 it follows that

$$
\operatorname{tr}\bigl(G_{k+1}^{\top} E_{\psi_k} G_{k+1}\bigr) \ge \operatorname{tr}\bigl((G^*)^{\top} E_{\psi_k} G^*\bigr) = (\psi^* - \psi_k) \cdot \operatorname{tr}(W_{G^*}) \ge 0. \tag{10}
$$

Consequently, (9) and (10) yield that

$$
\psi_{k+1} = \psi_k + \frac{\operatorname{tr}(G_{k+1}^{\top} E_{\psi_k} G_{k+1})}{\operatorname{tr}(W_{G_{k+1}})} \ge \psi_k + (\psi^* - \psi_k)\,\frac{\operatorname{tr}(W_{G^*})}{\operatorname{tr}(W_{G_{k+1}})}. \tag{11}
$$

Suppose that there is $k \ge 0$ such that $\operatorname{tr}(W_{G_{k+1}}) < \gamma^*$; then from (11) we have $\psi_{k+1} > \psi_k + (\psi^* - \psi_k) = \psi^*$, a contradiction. This completes the proof.
With the aid of Lemma 3.2, we are now able to prove the local superlinear convergence of the sequence $\{\psi_k\}$ to $\psi^*$.
Theorem 3.1 Let $\{\psi_k\}$ and $\{G_k\}$ be generated from Algorithm 1. Then $\{\psi_k\}$ converges to $\psi^*$ superlinearly. That is, if $\psi_k \ne \psi^*$ for $k = 0, 1, \ldots$, we have

$$
\lim_{k \to +\infty} \frac{\psi^* - \psi_{k+1}}{\psi^* - \psi_k} = 0.
$$

Proof By (11), we only need to prove that

$$
\lim_{k \to +\infty} \operatorname{tr}(W_{G_k}) = \gamma^*. \tag{12}
$$

We proceed by contradiction. To this end, noting Lemma 3.2, we assume that there exists a subsequence $\{G_{k_i}\}$ of $\{G_k\}$ such that

$$
\lim_{i \to +\infty} \operatorname{tr}(W_{G_{k_i}}) = \bar{\gamma} > \gamma^*.
$$

Since $G_{k_i} \in \{G \in \mathbb{R}^{N \times l} \mid G^{\top} G = I_l\}$ for $i = 0, 1, \ldots$, and the constraint set $\{G \in \mathbb{R}^{N \times l} \mid G^{\top} G = I_l\}$ is compact, there must be a subsequence, say $\{G_{\bar{k}_i}\}$, of $\{G_{k_i}\}$ such that

$$
\lim_{i \to +\infty} G_{\bar{k}_i} = \bar{G} \in \bigl\{G \in \mathbb{R}^{N \times l} \mid G^{\top} G = I_l\bigr\}
$$

and

$$
\lim_{i \to +\infty} \operatorname{tr}(W_{G_{\bar{k}_i}}) = \operatorname{tr}(W_{\bar{G}}) = \bar{\gamma} > \gamma^*.
$$

Moreover, by Theorem 2.2, we have

$$
\psi(\bar{G}) = \lim_{i \to +\infty} \psi(G_{\bar{k}_i}) = \lim_{i \to +\infty} \psi(G_{k_i}) = \lim_{k \to +\infty} \psi(G_k) = \psi^*,
$$

which implies that $\bar{G} \in S^*$. This then leads to the following contradiction:

$$
\max_{G \in S^*} \operatorname{tr}(W_G) \ge \operatorname{tr}(W_{\bar{G}}) > \gamma^* = \max_{G \in S^*} \operatorname{tr}(W_G).
$$

Thus, we have proven (12), which, together with (11), completes the proof.
It is interesting to note that, by Lemma 3.2 and Theorem 3.1,

$$
\bigl\| E_{\psi(G_k)} - E^* \bigr\|_2 = (\psi^* - \psi_k)\,\| W \|_2 \to 0 \quad \text{as } k \to \infty,
$$

which implies that $E_{\psi(G_k)}$ converges to the target matrix $E^*$ also at least superlinearly. Moreover, we can also prove that the sequence $\{G_k\}$ converges to the set $S_W^*$, as the following result shows.

Theorem 3.2 Let $\{\psi_k\}$ and $\{G_k\}$ be generated from Algorithm 1. Then the sequence $\{G_k\}$ converges to the set $S_W^*$ defined by (8) in the sense that

$$
\lim_{k \to +\infty} \inf_{G \in S_W^*} \| G_k - G \|_2 = 0. \tag{13}
$$
Proof Suppose that (13) does not hold; then there must exist a subsequence $\{G_{k_i}\}$ of $\{G_k\}$ such that

$$
\| G_{k_i} - G \|_2 > \epsilon > 0, \quad i = 0, 1, \ldots,
$$

for some $\epsilon > 0$ and any $G \in S_W^*$. Noting that $G_{k_i} \in \{G \in \mathbb{R}^{N \times l} \mid G^{\top} G = I_l\}$, there must be a subsequence $\{G_{\bar{k}_i}\}$ of $\{G_{k_i}\}$ such that

$$
\lim_{i \to +\infty} G_{\bar{k}_i} = \bar{G} \in \bigl\{G \in \mathbb{R}^{N \times l} \mid G^{\top} G = I_l\bigr\};
$$

moreover, by Theorem 2.2 again,

$$
\psi(\bar{G}) = \lim_{i \to +\infty} \psi(G_{\bar{k}_i}) = \lim_{i \to +\infty} \psi(G_{k_i}) = \lim_{k \to +\infty} \psi(G_k) = \psi^*,
$$

implying $\bar{G} \in S^*$. On the other hand, for any $G \in S_W^*$, it follows that

$$
\lim_{i \to +\infty} \| G_{\bar{k}_i} - G \|_2 = \| \bar{G} - G \|_2 \ge \epsilon > 0,
$$

which implies that $\bar{G} \notin S_W^*$. This, together with (12), then leads to the following contradiction:

$$
\gamma^* > \operatorname{tr}(W_{\bar{G}}) = \lim_{i \to +\infty} \operatorname{tr}(W_{G_{\bar{k}_i}}) = \lim_{k \to +\infty} \operatorname{tr}(W_{G_k}) = \gamma^*.
$$

We thus complete the proof.
Theorem 3.2 not only describes the convergence behavior of the sequence $\{G_k\}$ but also characterizes all the limit points of Algorithm 1. It is interesting to note that the computed solution of Algorithm 1 attains both the maximum within-class cohesion and the maximum between-class separation over all optimal linear transformations. On the other hand, this also implies that, in the set $S^* \setminus S_W^*$, there are points which attain both the minimum within-class cohesion and the minimum between-class separation over $S^*$. As these points cannot be reached by Algorithm 1, one may argue that important classification information will be lost. However, as we have pointed out, the capability of the linear transformation G in classification is essentially determined by the ratio $\operatorname{tr}(B_G)/\operatorname{tr}(W_G)$, rather than by $\operatorname{tr}(B_G)$ or $\operatorname{tr}(W_G)$ alone. From this point of view, we can say that all points in $S^*$ are equivalent, and the limit points of Algorithm 1 already capture all essential information for classification.
4 An Illustrative Numerical Example

To illustrate numerically the theoretical analysis in Sect. 3 without assumption (5), we construct the following example with

$$
B = \begin{bmatrix}
0.56137168709867 & 0.47499258836035 & 0.41374744863441 & 0.54338504112014 \\
0.47499258836035 & 0.82967332416529 & 0.49675027724257 & 0.26601455757527 \\
0.41374744863441 & 0.49675027724257 & 0.67528194099622 & 0.67467327690424 \\
0.54338504112014 & 0.26601455757527 & 0.67467327690424 & 0.98862766976577
\end{bmatrix}
$$

and

$$
W = \begin{bmatrix}
5.12991583361164 & 5.69497530727315 & 4.20983998866400 & 4.62284583681357 \\
5.69497530727315 & 8.18143054137715 & 4.74195985939636 & 4.84945533757169 \\
4.20983998866400 & 4.74195985939636 & 4.93230427900313 & 5.23956200886958 \\
4.62284583681357 & 4.84945533757169 & 5.23956200886958 & 6.98418375019935
\end{bmatrix},
$$
in which $N = 4$, $l = 2$, $\psi^* \approx 0.24920801768976$, and the second largest eigenvalue of $E^*$ is numerically repeated:

$$
\lambda_1(E^*) \approx 0.12663708589604, \quad \lambda_2(E^*) \approx -0.12663708589604,
$$
$$
\lambda_3(E^*) \approx -0.12663708590329, \quad \lambda_4(E^*) \approx -3.10538689454485,
$$

with $\lambda_2(E^*) - \lambda_3(E^*) < 7.2 \times 10^{-12}$. Convergence is attained after 8 iterations with a rather stringent stopping tolerance $\varepsilon = 10^{-14}$ in Algorithm 1. Figure 1 shows numerically the local superlinear convergence of $\{\psi_k\}$, which confirms the theoretical analysis in Sect. 3; the y-axis in this figure is $|\psi_{k+1} - \psi_k| / |\psi_k - \psi_{k-1}|$.

Fig. 1 Numerical testing of local superlinear convergence of the sequence $\{\psi_k\}$ without assumption (5)

On the other hand, we observe that

$$
\operatorname{tr}(W_{\bar{G}}) = 2.93331634814200 \quad \text{and} \quad \operatorname{tr}(W_{\hat{G}}) = 2.22982631110921,
$$
$$
\operatorname{tr}(B_{\bar{G}}) = 0.73100595237744 \quad \text{and} \quad \operatorname{tr}(B_{\hat{G}}) = 0.55569059477676,
$$

where $\bar{G}$ is an orthonormal eigenbasis of $E^*$ corresponding to the eigenvalues $\lambda_1(E^*)$ and $\lambda_2(E^*)$, and $\hat{G}$ is an orthonormal eigenbasis of $E^*$ corresponding to the eigenvalues $\lambda_1(E^*)$ and $\lambda_3(E^*)$. Moreover, we note that

$$
\psi(\bar{G}) - \psi(\hat{G}) = 3.249872593258374 \times 10^{-12},
$$

which implies that $\bar{G}$ and $\hat{G}$ both reach the global maximum $\psi^*$ (up to round-off).
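The quantities reported above can be reproduced, up to round-off, with a short script; the following is our own sketch (assuming NumPy), using the matrices B and W listed above and a starting point of our choosing, so the iteration count may differ slightly.

```python
import numpy as np

B = np.array([
    [0.56137168709867, 0.47499258836035, 0.41374744863441, 0.54338504112014],
    [0.47499258836035, 0.82967332416529, 0.49675027724257, 0.26601455757527],
    [0.41374744863441, 0.49675027724257, 0.67528194099622, 0.67467327690424],
    [0.54338504112014, 0.26601455757527, 0.67467327690424, 0.98862766976577]])
W = np.array([
    [5.12991583361164, 5.69497530727315, 4.20983998866400, 4.62284583681357],
    [5.69497530727315, 8.18143054137715, 4.74195985939636, 4.84945533757169],
    [4.20983998866400, 4.74195985939636, 4.93230427900313, 5.23956200886958],
    [4.62284583681357, 4.84945533757169, 5.23956200886958, 6.98418375019935]])

l, G = 2, np.eye(4)[:, :2]                                  # G0 = [e1, e2]
psi = np.trace(G.T @ B @ G) / np.trace(G.T @ W @ G)
for k in range(1, 100):                                     # Algorithm 1
    _, V = np.linalg.eigh(B - psi * W)
    G = V[:, -l:]
    psi_new = np.trace(G.T @ B @ G) / np.trace(G.T @ W @ G)
    if psi_new - psi < 1e-14:
        break
    psi = psi_new

print("iterations:", k, " psi* ~", psi_new)                 # psi* ~ 0.24920801768976
print("eigenvalues of E*:", np.linalg.eigvalsh(B - psi_new * W)[::-1])
print("tr(W_G) at the computed solution:", np.trace(G.T @ W @ G))
```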
Fig. 2 The history of $\operatorname{tr}(W_{G_k})$ without assumption (5)
Table 1 The numbers of iterations of Algorithm 1 as θ varies

θ                       10^{-12}   10^{-6}   10^{0}   10   10^{3}
Number of iterations        8          7        6       6      5
To give a clearer picture of the convergence behavior of the algorithm, in Fig. 2 we plot the sequence $\{\operatorname{tr}(W_{G_k})\}$ together with the line $y = \gamma^*$, where, according to (12), $\gamma^*$ is the limit of $\{\operatorname{tr}(W_{G_k})\}$. From this figure it can be clearly seen that the sequence $\{\operatorname{tr}(W_{G_k})\}$ stays above the line $y = \gamma^*$, which supports the conclusion of Lemma 3.2 numerically. Finally, we point out that a fluctuation is observed at the 5th iteration in Figs. 1 and 2. This fluctuation is explainable: as pointed out at the beginning of Sect. 3, without condition (5) the eigenspace $\operatorname{span}(G_{k+1})$ generated by Step 2 may not be uniquely determined, which can lead to uncertainty in the asymptotic state of the iteration.

To conclude this section, we investigate how the convergence is affected when the problem is perturbed. Because condition (5) is generic, the quadratic convergence rate is expected when a slight perturbation is imposed. For this purpose, we add to W a diagonal perturbation $D(\theta) = \operatorname{diag}\{\theta, 2\theta, 3\theta, 4\theta\} \in \mathbb{R}^{4 \times 4}$ parameterized by $\theta \in \mathbb{R}$. When θ is sufficiently small, condition (5) may not be satisfied numerically; however, as θ gets larger, it is expected that quadratic convergence will occur. Using the same stopping criterion and the same starting point, Table 1 summarizes the numbers of iterations of Algorithm 1 as θ varies. It is clearly observed that the convergence speeds up as θ becomes larger, which indicates that condition (5) is satisfied numerically and the quadratic convergence takes place.
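The perturbation study can be mimicked with a few lines (again our own sketch, reusing the matrices B and W from the script above; the starting point is assumed to be the same as in the unperturbed run, so the iteration counts may differ slightly from Table 1):

```python
import numpy as np

def iterations_to_converge(B, W, l=2, eps=1e-14, max_iter=100):
    """Run Algorithm 1 from G0 = [e1, ..., el] and count the iterations."""
    G = np.eye(B.shape[0])[:, :l]
    psi = np.trace(G.T @ B @ G) / np.trace(G.T @ W @ G)
    for k in range(1, max_iter + 1):
        _, V = np.linalg.eigh(B - psi * W)
        G = V[:, -l:]
        psi_new = np.trace(G.T @ B @ G) / np.trace(G.T @ W @ G)
        if psi_new - psi < eps:
            return k
        psi = psi_new
    return max_iter

# Perturbed problem: W is replaced by W + D(theta), D(theta) = diag(theta, 2*theta, 3*theta, 4*theta).
# B and W are the 4x4 matrices defined in the previous sketch.
for theta in [1e-12, 1e-6, 1e0, 1e1, 1e3]:
    W_theta = W + np.diag([theta, 2 * theta, 3 * theta, 4 * theta])
    print(theta, iterations_to_converge(B, W_theta))
```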
5 Conclusions

In this paper, we extend the convergence analysis in [14] for the iterative method proposed in [13]. We prove its local superlinear convergence without assumption (5); moreover, we also make clear that, in this case, the iteration actually converges to a special subset $S_W^*$ (or $S_B^*$) of the global solution set $S^*$ of problem (3). It should be mentioned that this special subset $S_W^*$ has a definite meaning in LDA; that is,
each point in SW∗ attains both the maximum of the within-class cohesion and the maximum of the between-class separation in S ∗ . These new results, together with the analysis in [14], imply that Algorithm 1 is efficient and reliable in solving the optimization problem (3) resulting from the GFST and its variants.
References

1. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, Berlin (2006)
2. Duchene, L., Leclerq, S.: An optimal transformation for discriminant and principal component analysis. IEEE Trans. Pattern Anal. Mach. Intell. 10, 978–983 (1988)
3. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. Wiley-Interscience, New York (2001)
4. Fisher, R.A.: The use of multiple measurements in taxonomic problems. Ann. Eugen. 7, 179–188 (1936)
5. Foley, D., Sammon, J.: An optimal set of discriminant vectors. IEEE Trans. Comput. 24, 281–289 (1975)
6. Fukunaga, K.: Introduction to Statistical Pattern Classification. Academic Press, San Diego (1990)
7. Hastie, T., Tibshirani, R., Friedman, J.H.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, Berlin (2001)
8. Martinez, A.M., Kak, A.C.: PCA versus LDA. IEEE Trans. Pattern Anal. Mach. Intell. 23, 228–233 (2001)
9. Martinez, A.M., Zhu, M.: Where are linear feature extraction methods applicable? IEEE Trans. Pattern Anal. Mach. Intell. 27, 1934–1944 (2005)
10. McLachlan, G.J.: Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York (2004)
11. Theodoridis, S., Koutroumbas, K.: Pattern Recognition. Academic Press, New York (1999)
12. Guo, Y.-F., Li, S.-J., Yang, J.-Y., Shu, T.-T., Wu, L.-D.: A generalized Foley–Sammon transform based on generalized Fisher discriminant criterion and its application to face recognition. Pattern Recognit. Lett. 24, 147–158 (2003)
13. Wang, H., Yan, S.-C., Xu, D., Tang, X., Huang, T.: Trace ratio vs. ratio trace for dimensionality reduction. In: Proc. International Conf. on Computer Vision and Pattern Recognition, pp. 1–8 (2007)
14. Zhang, L.-H., Liao, L.-Z., Ng, M.K.: Fast algorithms for the generalized Foley–Sammon discriminant analysis. SIAM J. Matrix Anal. Appl. 31, 1584–1605 (2010)
15. Zhang, L.-H.: Uncorrelated trace ratio LDA for undersampled problems. Pattern Recognit. Lett. 32, 476–484 (2011)
16. Ngo, T.T., Bellalij, M., Saad, Y.: The trace ratio optimization problem for dimensionality reduction. SIAM J. Matrix Anal. Appl. 31, 2950–2971 (2010)
17. Nie, F., Xiang, S., Jia, Y., Zhang, C.: Semi-supervised orthogonal discriminant analysis via label propagation. Pattern Recognit. 42, 2615–2627 (2009)
18. Howland, P., Jeon, M., Park, H.: Structure preserving dimension reduction for clustered text data based on the generalized singular value decomposition. SIAM J. Matrix Anal. Appl. 25, 165–179 (2003)
19. Howland, P., Park, H.: Generalizing discriminant analysis using the generalized singular value decomposition. IEEE Trans. Pattern Anal. Mach. Intell. 26, 995–1006 (2004)
20. Howland, P., Park, H.: Equivalence of several two-stage methods for linear discriminant analysis. In: Proceedings of the Fourth SIAM International Conference on Data Mining, Kissimmee, FL, pp. 69–77 (2004)
21. Ng, M.K., Liao, L.-Z., Zhang, L.-H.: On sparse linear discriminant analysis for high-dimensional data. Numer. Linear Algebra Appl. 18, 223–235 (2010)
22. Park, H., Drake, B.L., Lee, S., Park, C.H.: Fast linear discriminant analysis using QR decomposition and regularization. Technical Report GT-CSE-07-21 (2007)
23. Ye, J.-P.: Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems. J. Mach. Learn. Res. 6, 483–502 (2005)
24. Ye, J.-P., Xiong, T.: Computational and theoretical analysis of null space and orthogonal linear discriminant analysis. J. Mach. Learn. Res. 7, 1183–1204 (2006)
25. Ye, J.-P., Janardan, R., Park, C., Park, H.: An optimization criterion for generalized discriminant analysis on undersampled problems. IEEE Trans. Pattern Anal. Mach. Intell. 26, 982–994 (2004)
26. Ye, J.-P., Xiong, T., Li, Q., Janardan, R., Bi, J.-B., Cherkassky, V., Kambhamettu, C.: Efficient model selection for regularized linear discriminant analysis. In: The Fifteenth ACM International Conference on Information and Knowledge Management (CIKM), pp. 532–539 (2006)
27. Yan, S., Xu, D., Zhang, B., Zhang, H.: Graph embedding: A general framework for dimensionality reduction. In: Proceedings of Conference on Computer Vision and Pattern Recognition, pp. 830–837 (2005)
28. Krzanowski, W.J., Jonathan, P., McCarthy, W.V., Thomas, M.R.: Discriminant analysis with singular covariance matrices: methods and applications to spectroscopic data. Appl. Stat. 44, 101–115 (1995)
29. Friedman, J.: Regularized discriminant analysis. J. Am. Stat. Assoc. 84, 165–175 (1989)
30. Stewart, G.W.: Matrix Algorithms, Vol. II: Eigensystems. SIAM, Philadelphia (2001)
31. Golub, G.H., Van Loan, C.F.: Matrix Computations, 3rd edn. Johns Hopkins University Press, Baltimore (1996)
32. Parlett, B.N.: The Rayleigh quotient iteration and some generalizations for nonnormal matrices. Math. Comput. 28, 679–693 (1974)