IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 42, NO. 5, MAY 1994
Adaptive Principal Component Extraction (APEX) and Applications

S. Y. Kung, Fellow, IEEE, K. I. Diamantaras, Member, IEEE, and J. S. Taur

Manuscript received December 16, 1992; revised April 9, 1993. The associate editor coordinating the review of this paper and approving it for publication was Prof. J. N. Hwang. This work was supported in part by the Air Force Office of Scientific Research under Grant AFOSR-89-0501A. S. Y. Kung and J. S. Taur are with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544. K. I. Diamantaras is with Siemens Corporate Research, Princeton, NJ 08540. IEEE Log Number 9216674.
Abstract-In this paper we describe a neural network model (APEX) for multiple principal component extraction. All the synaptic weights of the model are trained with the normalized Hebbian learning rule. The network structure features a hierarchical set of lateral connections among the output units which serve the purpose of weight orthogonalization. This structure also allows the size of the model to grow or shrink without the need for retraining the old units. The exponential convergence of the network is formally proved, and a significant performance improvement over previous methods is obtained. By establishing an important connection with the recursive least squares algorithm we have been able to provide the optimal value of the learning step-size parameter, which leads to a significant improvement in convergence speed. This is in contrast with previous neural PCA models, which lack such numerical advantages. The APEX algorithm is also parallelizable, allowing the concurrent extraction of multiple principal components. Furthermore, APEX is shown to be applicable to the constrained PCA problem, where the signal variance is maximized under external orthogonality constraints. We then study various principal component analysis (PCA) applications that might benefit from the adaptive solution offered by APEX. In particular, we discuss applications in spectral estimation, signal detection, and image compression and filtering, while other application domains are also briefly outlined.
I. INTRODUCTION
In recent years, important advances in neurobiology, neural network theory, and neurocomputation modeling have shed new light on our efforts to understand biological perceptual processing. Although the problem in general is far from solved, some specific answers have been appearing regarding particular perceptual processing subsystems, such as, for example, the visual cortex of mammals. It was found that, at least in the beginning stages of the mammalian visual processor, the neurons self-organize to recognize specific features of the environment. It was conjectured that the self-organizing principle for these feature-analyzer cells is related to information-theoretic or statistical criteria, such as principal component analysis (PCA) [18], which maximize the compression of information into a few representation parameters, in this case the synaptic strengths of the neural network [28]. PCA identifies the most important features (subspaces) of a high-dimensional statistical distribution (e.g., the distribution of the data processed by the retina) in the sense that the projection error onto those feature subspaces is minimal. Alternatively, the PCA subspaces can be interpreted as the maximizers of the projection variance of the stochastic signal. The optimal solution under both criteria is the subspace spanned by the eigenvectors of the signal autocorrelation matrix associated with the largest eigenvalues (henceforth called principal eigenvectors).

In his seminal work The Organization of Behavior (1949) [17], Hebb proposed a simple, yet biologically motivated rule for adjusting the synaptic weights during a neural network learning process: when unit A and unit B are simultaneously excited, increase the strength of the connection between them. For the case where neurons are modeled as units with continuous output activation, this correlation-type rule was recast in the following, more mathematically useful form: "Adjust the strength of the connection between units A and B in proportion to the product of their simultaneous activation" [32]. Interestingly, this simple rule turns out to be closely related to PCA when the neural units are linearly modeled. More precisely, in 1982 Oja [33] showed that a normalized version of the Hebbian rule applied to a single linear unit extracts the principal component of the input sequence, i.e., it converges to the principal eigenvector of the input autocorrelation matrix. For the linear unit case, the realization of the Hebbian rule in its simplest form (just update each weight proportionally to the corresponding input-output product) is numerically unstable. The normalization proposed by Oja is not only vital for stability, but it is also a common feature of many standard eigenvalue-related algorithms, for example, the power iteration method ([15], ch. 7). Although actual neurons are not simple linear units, the relationship of the Hebbian rule with PCA is more than just a coincidence and may serve as an insight into the nonlinear case as well, which is, in general, more difficult to analyze.

Later, Sanger [41], Foldiak [12], Rubner and Schulten [40], Oja [35], and Kung and Diamantaras [23] extended this approach to extract multiple components in multi-unit linear networks. Different extensions of Oja's rule (and consequently of the Hebbian rule) constitute the essence of all those proposed neural net approaches (Oja's rule alone could only cover the single-component case). The relation between PCA and two-layer linear networks was also demonstrated in [2], [5]. In addition, the same rule is at the heart of networks proposed for component extraction in generalized versions of PCA, such as constrained PCA (CPCA) [21], oriented PCA (OPCA) [8], [11], and asymmetric PCA (APCA) [24], which can be divided
into two classes of problems, the so-called reduced-rank linear approximation [10], [24] and cross-correlation APCA [9], [24]. A detailed discussion regarding those models can also be found in [8]. Of course, PCA is a significant mathematical tool in itself, beyond its relationship with neural networks. It is a fact, for example, that principal component analysis (PCA), singular value decomposition (SVD), eigenvalue decomposition (ED), and the Karhunen-Loève transform (KLT) are all related techniques with established applications in signal processing [7], [44], image processing [19], control theory [20], high-resolution spectral estimation [22], [39], [42], [43], antenna array processing [38], pattern recognition [34], etc.

The Adaptive Principal-component Extractor (APEX) model [23] differs from most of the other models mentioned above in that it can effectively support a recursive approach for the calculation of the mth principal component given the first m - 1 ones. The motivation behind such an approach is the need to extract the principal components of a random vector sequence when the number of required PC's is not known a priori. It is also useful in environments where the autocorrelation matrix
R_x = E{x_k x_k^T}

of the signal x_k might be slowly changing with time (e.g., speech applications). Then a new PC may be added, trying to compensate for that change without affecting the previously computed PC's. This is similar to the idea of lattice filtering used in signal processing, where for every increase of the filter order one new lattice section is added to the original structure while all the old sections remain completely intact. Among the above-mentioned models, only the model proposed in [40] does not differ from APEX in this respect, because it is very similar to APEX in terms of structure. The two models, however, differ significantly in terms of convergence speed, due to the relation of APEX with the recursive least squares (RLS) algorithm. This, as we will see later, allows us to determine a very aggressive schedule for the step-size parameter, which results in an order of magnitude improvement in convergence speed over all the previous models that are based on stochastic-approximation-type learning. Furthermore, the APEX model can also be applied to the constrained PC problem discussed later in this paper.

In Section II we describe the APEX model for multiple principal component extraction, which is based completely on the normalized Hebbian rule. The network, in addition to the feed-forward connections between inputs and outputs, features a set of hierarchical lateral connections between the output units themselves. The lateral connection network enforces orthogonality between the feed-forward weights, which is essential for the extraction of the orthogonal principal components. This structure also facilitates the growing or shrinking of the network whenever an order update is required. Section III provides mathematical proofs regarding the asymptotic properties of the network, as well as the orthogonality property between the feed-forward weights. Also, the connection between APEX and the RLS algorithm is established, a connection which provides us with an optimal value
for the step-size parameter and which drastically improves the performance of the algorithm. This also reveals a severe limitation of previous neural PCA models, which rely on very small values of the step-size parameter in order to ensure convergence, leading to significantly slower convergence. Our findings are similar to the work of Bannour et al. [3]; however, their model does not use lateral connections between the output units, but rather makes explicit use of the "deflation" transformation for the extraction of multiple components (see Section III). That renders their algorithm essentially sequential. As we will see in Section IV, the APEX algorithm is parallelizable, allowing the concurrent extraction of multiple principal components. We show the convergence rate of the network to be exponential, and we are in fact able to analytically approximate it for each component. In Section IV we verify our prediction via simulation, and we also discuss the computational efficiency of the algorithm compared to other methods. In Section V we find that the APEX network is applicable to a new kind of principal component analysis called constrained PCA (CPCA) [21]. This is useful in applications where one wants to find the directions containing the most information regarding a signal under an orthogonality constraint with respect to some prespecified subspace. This subspace could contain, for example, an interference signal which must be avoided entirely; thus we project the signal onto an orthogonal direction that maximizes the variance of the projection. We explicitly derive the convergence rates of APEX for CPC (as we did for standard PCA) and show that the network converges exponentially to the required eigenvectors. In Section VI we demonstrate the applicability of the proposed model to specific problems related to signal and image processing and pattern recognition. We show detailed simulation comparisons with other techniques on the corresponding problems and establish the superiority or limitations of the neural techniques. In particular, we focus on the following applications: spectral analysis, detection of drifting-phase sinusoids in white Gaussian noise, and image compression and interference cancellation. Other application areas that take advantage of the adaptive solution offered by these models have also been considered in the literature. We provide short discussions of such applications, including mother-fetal ECG signal separation, pattern classification, etc. Finally, we conclude in Section VII.

II. THE APEX MODEL [23]
The APEX model is depicted in Fig. 1. There are n inputs {x_1, ..., x_n}, connected to m outputs {y_1, ..., y_m} through the feed-forward weights {w_ij}. Additionally, there exist lateral weights c_j, forming a vector c, that connect all of the first m - 1 units with the mth one. These connections play a very important role in the model: they work toward the orthogonalization of the synaptic weights of the mth neuron against the extracted principal components stored in the weights of the previous m - 1 neurons. Eventually, all the weights in the network will span an orthonormal basis of the m-dimensional principal component subspace, while the lateral weights tend to zero as orthogonalization is achieved. Thus, we associate the term "anti-Hebbian connections" or "orthogonalization connections" with c_j, in contrast to the term "Hebbian connections" associated with the feed-forward weights. The mathematical details of the anti-Hebbian rule (also called the Lateral Orthogonalization Rule) used to train c follow below.

Fig. 1. The APEX model. The solid lines denote the weights w, c, which are trained at the mth stage. The dashed lines correspond to the weights of the already trained neurons. Note that the lateral weights asymptotically converge to zero, so they do not appear between the already trained units.

We assume that the input vector x_k, k = 0, 1, 2, ..., is a stationary stochastic process with positive-definite autocorrelation matrix R_x. We shall arrange the normalized eigenvectors e_1, e_2, ..., e_n of R_x so that the corresponding eigenvalue sequence is in decreasing order: λ_1 ≥ λ_2 ≥ ... ≥ λ_n. The activation of each neuron is a linear function of its inputs:
y = W x    (1)
y_m = w^T x - c^T y    (2)

where x = [x_1 ... x_n]^T, y = [y_1 ... y_{m-1}]^T, W = [e_1 ... e_{m-1}]^T is the weight matrix for the first m - 1 neurons, and w is the vector of the synaptic weights w_mj connecting the inputs with the mth neuron. Only w and c are trained (W remains fixed). The kth iteration of the algorithm is

w_{k+1} = w_k + β_k (y_mk x_k - y_mk^2 w_k)    (3)
c_{k+1} = c_k + β_k (y_mk y_k - y_mk^2 c_k)    (4)

where β_k is a positive sequence of step-size parameters. Both equations are of the Oja type [33]; however, w and c participate in (2) with opposite signs. The w weights intuitively work toward finding the most principal directions of the signal, which is why we call (3) the Hebbian part of the algorithm. The c weights intuitively subtract the first m - 1 components from the mth neuron [2], so the mth output tends to become orthogonal (rather than correlated) to all the previous components. We call (4) the orthogonalization rule, which constitutes the anti-Hebbian part of the algorithm. It will be shown below that the combination of the Hebbian and orthogonalization rules works as described. This will be proved by the theoretical analysis in Section III and demonstrated by simulations in Section IV. The orthogonal learning rule also has a very important application to the problem of extracting "constrained principal components," as discussed in Section V and originally proposed in [21].
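To make the update concrete, here is a minimal NumPy sketch of one training stage, i.e., of iterating (1)-(4) for the mth neuron while the previously trained weights W stay fixed. The function name, the constant step size, and the data handling are our own illustrative choices, not part of the original algorithm specification (the optimal step-size schedule is derived in Section III).

```python
import numpy as np

def apex_stage(X, W, beta=0.01, epochs=20, seed=0):
    """Train the m-th APEX neuron on the samples in the rows of X.

    W holds the already extracted eigenvectors e_1, ..., e_{m-1} as rows
    (it may have zero rows for the first neuron).  Returns the feed-forward
    weights w (an estimate of e_m up to sign) and the lateral weights c,
    which should decay toward zero.
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(X.shape[1])
    w /= np.linalg.norm(w)
    c = np.zeros(W.shape[0])
    for _ in range(epochs):
        for x in X:
            y = W @ x                    # outputs of the trained neurons, Eq. (1)
            y_m = w @ x - c @ y          # output of the m-th neuron, Eq. (2)
            w = w + beta * (y_m * x - y_m**2 * w)    # Hebbian part, Eq. (3)
            c = c + beta * (y_m * y - y_m**2 * c)    # anti-Hebbian part, Eq. (4)
    return w, c
```

Run on zero-mean colored data, w aligns (up to sign) with the dominant eigenvector of the deflated autocorrelation matrix and ||c|| shrinks, which is exactly the behavior analyzed in the next section.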
III. PROOF OF APEX AND CONVERGENCE RATES

In this section we shall state the main convergence theorem related to APEX. First we state some definitions and results from matrix theory which will be useful for our analysis.

Definition 1: The mapping

R_x → R̃_x = (I - e_1 e_1^T) R_x (I - e_1 e_1^T)

is called a deflation transformation of R_x.

Facts:
(I - e_1 e_1^T) R_x = R̃_x. Also, the eigenvectors of R̃_x are e_1, e_2, ..., e_n, as for R_x, but the corresponding eigenvalues are 0, λ_2, ..., λ_n. In other words, the principal eigenvalue is "killed" while the second component now becomes the dominant one. Inductively, the (m-1)-times deflation of R_x,

R̃_x = ∏_{i=1}^{m-1} (I - e_i e_i^T) R_x ∏_{i=1}^{m-1} (I - e_i e_i^T)
    = (I - W^T W) R_x (I - W^T W)    (5)
    = (I - W^T W) R_x    (6)

where W = [e_1, ..., e_{m-1}]^T, "kills" the m - 1 most principal eigenvalues, so that the eigenvalues are now 0, ..., 0, λ_m, λ_{m+1}, ..., λ_n and the mth component is dominant. These facts will be useful in our analysis below.
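The two facts above are easy to verify numerically; the following short NumPy check (with an arbitrary positive-definite R_x of our own choosing) confirms (5)-(6) and the announced eigenvalue pattern.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
Rx = A @ A.T                                   # an arbitrary positive-definite R_x
lam, E = np.linalg.eigh(Rx)
lam, E = lam[::-1], E[:, ::-1]                 # eigenvalues in decreasing order

m = 3
W = E[:, :m - 1].T                             # W = [e_1 ... e_{m-1}]^T
P = np.eye(6) - W.T @ W                        # product of the projectors (I - e_i e_i^T)
Rdefl = P @ Rx @ P                             # (m-1)-times deflated matrix, Eq. (5)

print(np.allclose(Rdefl, P @ Rx))              # True: the second projector is redundant, Eq. (6)
print(np.round(np.linalg.eigvalsh(Rdefl)[::-1], 6))
# prints lambda_m >= ... >= lambda_n followed by m-1 (numerically) zero eigenvalues
```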
Next we formally state the assumptions used by the main theorem:

A.1) The input sequence {x_k} is at least wide-sense stationary, with autocorrelation matrix R_x whose eigenvalues are distinct, positive, and arranged in descending order: λ_1 > λ_2 > ... > λ_n > 0. Let us denote the corresponding normalized eigenvectors (with arbitrarily chosen sign) by e_1, e_2, ..., e_n.
A.2) The first m - 1 principal components, i.e., the eigenvectors e_1 through e_{m-1}, are stored in the synaptic weights connected to neurons 1 through m - 1.
A.3) The step-size parameter sequence β_k is such that

∑_{k=0}^{∞} β_k = ∞  and  β_k → 0, as k → ∞.

This assumption will be useful for showing the asymptotic convergence of the algorithm. Later we will discuss the case where it does not hold.

Theorem 1: Consider the algorithm defined in (3), (4), subject to the conditions A.1-A.3. Then with probability 1, w_k → e_m (or -e_m) and c_k → 0, as k → ∞.
Proof outline (for the complete proof see [8]): Using the standard approach for analyzing stochastic-approximation-type algorithms [26], [29], we derive the associated differential equations corresponding to (3), (4):

dw'(t)/dt = R_x w'(t) - R_x W^T c'(t) - σ'(t) w'(t)    (7)
dc'(t)/dt = W R_x w'(t) - W R_x W^T c'(t) - σ'(t) c'(t)    (8)

where

σ'(t) = (w'(t) - W^T c'(t))^T R_x (w'(t) - W^T c'(t)).

Under certain assumptions, which are satisfied here, the discrete-time variables w_k, c_k tend (with probability one) to the continuous solutions w'(t), c'(t) of the above differential equations as t, k tend to infinity. The time index t of the differential equations and the index k of the discrete-time equations are related by the formula

t(k) = ∑_{i=0}^{k-1} β_i.    (9)

Expanding w'(t) into the principal component basis,

w'(t) = ∑_{i=1}^{n} θ_i(t) e_i,    (10)

we find the ODEs (11) for the expansion coefficients. After some analysis we can show that, under the initial condition

w'(0)^T e_m = θ_m(0) ≠ 0,    (12)

we have

θ_m(t) → ±1 and θ_i(t) → 0, i ≠ m, as t → ∞    (13)-(15)

and consequently

w'(t) → ±e_m, as t → ∞.    (16)

We also show that

c'(t) → 0, as t → ∞    (17)

and, in fact, the rate of convergence of c'(t) and θ_i(t), for all i < m, is σ'(t) > 0.

Corollary 1: After convergence, the variance of the ith output neuron, i = 1, 2, ..., m, equals the ith principal eigenvalue λ_i.

Proof: Obvious, since w_i is equal to the ith normal eigenvector e_i and therefore λ_i = e_i^T R_x e_i = E{(w_i^T x_k)^2}.

Using Theorem 1 as an induction step, and using as an induction basis the fact that the APEX rule for the first unit is equivalent to Oja's rule, we conclude that APEX can sequentially extract all the principal components of x_k, provided that infinite time can be used by the network to converge to the exact solutions. In practice, of course, only a finite amount of time is available for training the model. That affects the accuracy of the later units, which rely on the accuracy of the previous components for the corresponding deflation, so errors can accumulate as units are added. However, through our simulation experience we observed that the first 8-16 components are usually estimated with quite reasonable accuracy and without excessive training time.

Although Theorem 1 assumes that β_k → 0 as k → ∞, for practical purposes this is usually not the case. In fact, the well-known trade-off between tracking capability and accuracy [30] is present here as in any other adaptive algorithm. One wants to keep the step size small but nonzero in order to be able to follow (slow) changes in the statistics of the input signal. It is known [4], [27] that even if β_k does not go to zero but remains equal to a small constant, the mean values of w_k and c_k still approximate the ordinary differential equations (7) and (8); therefore, in that sense, the algorithm still converges to the solution. However, in the steady state the adapted parameters now oscillate around the fixed point without ever actually converging to it. This phenomenon is due to the variance of the stochastic input that drives the system. Naturally, the amplitude of the oscillation is proportional to this variance as well as to the size of β. In many practical applications, however, a small uncertainty in the final value is tolerable, traded off for a valuable property of the algorithm, namely the capability to track the changing parameters of the observed (input) process and adapt its internal parameters to these changes. The above observations hold true for any adaptive algorithm in general, so they hold true for APEX as well. Nevertheless, when discussing APEX specifically, a lot more can be said about the step-size parameter, as will become clear next.

A. Relationship with RLS

We saw before that principal component analysis can be formulated as a mean-square minimization problem. Naturally, then, we pursue the investigation of the relationship between APEX and the recursive least squares (RLS) algorithm, which is well studied in the literature. By establishing this relationship we will extract some valuable conclusions that relate to the value of the step-size constant as well as the numerical performance of the algorithm. The first principal component optimizes the reconstruction mean-square error by picking the best 1-D subspace to project the signal onto. The criterion for the first component is

minimize J_1(w) = E{||x - w w^T x||^2}.    (18)
Expanding and doing some mathematical manipulations we obtain

J_1 = E{||x||^2} - 2 w^T R_x w + (w^T R_x w) ||w||^2.    (19)

The solution to the above minimization problem is w = e_1. For the mth component the criterion becomes

minimize J_m(w) = E{||x - w w^T (I - W^T W) x||^2}    (20)

where W = [e_1 ... e_{m-1}]^T. Indeed, expanding J_m as we did for J_1 we get

J_m = E{||x||^2} - 2 w^T (I - W^T W) R_x w + [w^T (I - W^T W) R_x (I - W^T W) w] ||w||^2.    (21)

From the facts stated in Section III, (21) becomes

J_m = E{||x||^2} - 2 w^T R̃_x w + (w^T R̃_x w) ||w||^2    (22)

therefore J_m is the same as J_1 for the deflated matrix R̃_x. So the minimizer of J_m(w) is the principal component of R̃_x, which is e_m.

In order to relate the above analysis with RLS, we first transform the cost function J_m(w) into a nonstochastic formulation, replacing the expectation by a (weighted) average

J_m(w, N) = ∑_{k=1}^{N} γ^{N-k} ||x_k - w y_k||^2    (23)

where 0 < γ ≤ 1 is a forgetting factor chosen by the user and y_k = w^T x_k - w^T W^T W x_k. We shall define

c = W w    (24)

to get

y_k = w^T x_k - c^T W x_k.    (25)

J_m(w, N) can be minimized iteratively using the RLS algorithm [30], which yields the following updating equations:

w_{k+1} = w_k + L_k (x_k - w_k y_k)    (26)
L_k = y_k (∑_{j=1}^{k} γ^{k-j} y_j^2)^{-1}.    (27)

Letting

β_k = (∑_{j=1}^{k} γ^{k-j} y_j^2)^{-1}    (28)

the RLS algorithm becomes

w_{k+1} = w_k + β_k (x_k - w_k y_k) y_k    (29)

which is the same as the APEX equation (3). The only difference is that now we have a specific choice for the value of β_k which is optimal in the sense of the criterion J_m(w, N). Moreover, the optimal choice of the step-size parameter has a profound impact on the convergence speed of the algorithm, as will be discussed in Section IV. Clearly, β can also be calculated iteratively by

1/β_k = γ/β_{k-1} + y_k^2.

The value of the forgetting factor γ induces an effective time window of size M = 1/(1 - γ); so, for example, with γ = 0.99 the number of previous values y_k which are important in β is effectively 100. Therefore γ should be chosen closer to 1 if a larger time window is to be used for the averaging, and further from 1 when the past values are less important. In practice we can use a hard-limiting window of size M, instead of an effective one induced by γ, setting

β_k = (∑_{j=k-M+1}^{k} y_j^2)^{-1}.    (30)

This formula is more natural in the case where the number of samples is M, whence the algorithm uses the same data again and again in a loop (each iteration of the loop is called a sweep). This situation may occur, for example, in classification applications where the number of samples is limited and stored in a computer database, the size of which is known a priori. Equation (30) for β is very similar to what was originally proposed for APEX [23], although in that paper we proposed to keep the value of β constant within each sweep but still inversely proportional to the output power. In [1] the authors propose to track the value 1/σ_k, where σ_k = E{y_k^2}, by following a gradient descent on the error surface J = (1/2)[1 - β_k σ_k]^2. However, neither of these methods is as general as our current treatment, since they are not applicable when M is not defined, while they fail to establish the relationship between β and the least-squares optimality criterion (23).

We proceed now to establish the full connection between APEX and RLS by showing that the updating equation for c in APEX can also be derived from recursive least squares optimization. Indeed, premultiplying (29) by W and using (24) we obtain

c_{k+1} = c_k + β_k (y_k W x_k - y_k^2 c_k)

which, since W x_k is the output vector of the previously trained neurons, is the same as (4). Thus both updating equations defining APEX can be derived via the RLS theory.
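In code, the RLS-optimal schedule amounts to one extra scalar recursion per neuron: 1/β_k accumulates the forgetting-factor-weighted output power of (28), and β_k then multiplies the updates (29) and (4). The sketch below is our own wiring of that idea, not code from the paper.

```python
import numpy as np

def apex_stage_rls(X, W, gamma=0.99, epochs=10, seed=0):
    """APEX stage with the RLS step size: 1/beta_k = gamma/beta_{k-1} + y_k^2."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(X.shape[1])
    w /= np.linalg.norm(w)
    c = np.zeros(W.shape[0])
    inv_beta = 1.0                                # any positive initial power estimate
    for _ in range(epochs):
        for x in X:
            y_prev = W @ x
            y = w @ x - c @ y_prev
            inv_beta = gamma * inv_beta + y * y   # running sum of Eq. (28)
            beta = 1.0 / inv_beta
            w = w + beta * (y * x - y * y * w)        # Eq. (29), i.e., Eq. (3)
            c = c + beta * (y * y_prev - y * y * c)   # Eq. (4)
    return w, c
```

Compared with a small fixed β, this schedule is aggressive at the start, when the accumulated output power is still small, and automatically anneals as the neuron's output power builds up.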
B. Rates of Convergence

As we discussed in the proof of Theorem 1, the decay rate for c'(t), as well as for θ_i(t), i = 1, ..., m-1, is σ'(t). Given that the discrete-time parameters approach the continuous-time ones, with the relationship between the two time indices defined in (9), we can approximate the decay rate of the discrete-time parameters by

(c_{k+1} - c_k) / (t(k+1) - t(k)) = -σ'(t(k)) c_k

where σ_k = σ'(t(k)) = E{y_mk^2}; since t(k+1) - t(k) = β_k, this gives c_{k+1} ≈ (1 - β_k σ_k) c_k. Similar difference-equation approximations for the expansion coefficients θ_i,k lead to (34).
Fig. 2. Comparison of the theoretical versus the actual decay rates via simulation.
From (34) we can derive the rate of decay of the component ratio
r_i = θ_i / θ_m,  i = m+1, ..., n.    (35)

The above theoretical predictions are justified in Fig. 2, where a comparison is carried out with simulation results.

IV. COMPUTATIONAL EFFICIENCY AND SIMULATION RESULTS

Based on the above theoretical analysis, the APEX algorithm is formally presented below.

Algorithm 1 (APEX, Sequential Version): For every neuron m = 1 to N (N ≤ n):
1) Initialize w and c to some random values.
2) Choose β either as in (28) or (30), selecting γ or M appropriately (see Section III).
3) Update w_k and c_k according to (3), (4), until a stopping criterion is met.

One stopping criterion could be that the average squared output σ_k = E{y_mk^2} approximately converges to a fixed value. This value is the mth eigenvalue of R_x, and the corresponding weights w form the mth normal eigenvector. Another viable criterion is detecting that ||c_k|| is below a certain threshold. This criterion cannot be used, though, for the first neuron, because of the absence of lateral weights in that case. On the other hand, if one wishes to track slow changes in the principal component subspaces, then probably no terminating criterion is specified and the system is simply left to follow the changes.

Since the neurons in the APEX model are hierarchically structured, each node is not affected by any nodes following it. If the neurons prior to a node have converged reasonably close to the appropriate components, then the node will converge to its corresponding component. So we heuristically contend that if we let all the nodes work in parallel following the APEX rule, the network will extract the principal components in parallel rather than one after the other. Indeed, the first node is unaffected, since it has no prior nodes, so it will extract the first component. In turn, the second neuron will start converging to the second component no later than the first one converges to the first component, assuming that the final variance of the steady state of the first neuron does not affect the convergence of the second neuron. Similarly, the mth neuron will start converging to the mth component no later than the (m-1)th neuron has converged. We expect, however, the final variance of the mth neuron to increase as m increases, accumulating the variances of all the previous units. The network for the parallel APEX version is shown in Fig. 3. In [6] it was shown that the heuristic analysis above is indeed true. Formally, the parallel version of APEX is described below.

Fig. 3. The parallel APEX network model.

Algorithm 2 (APEX, Parallel Version): For all neurons m = 1 to N (N ≤ n) in parallel do:
1) Initialize the weights w_m, c_m for each neuron to some random values.
2) Set β_m from (28) or (30). Notice that each node has its own customized step-size parameter.
3) Update w_mk and c_mk according to (3), (4), until a stopping criterion is met.

(A code sketch of both training schedules is given at the end of this subsection.)

A. APEX versus Other NN Models

The APEX algorithm compares favorably with all the previous NN methods for PCA. The claim can be supported in the following aspects:
1) Efficiency in Recursively Computing New PC's: Every time an order update of the model is required, one can grow the APEX network by one more unit, or prune it by one or more units, without any need for retraining the old nodes. This is, for example, impossible in Foldiak's model [12], where all the units are connected with each other. Referring to Sanger's model [41], growing the network by one unit would require more operations per iteration compared with APEX. This is because Sanger computes the outer-product term ∑_{j≤m} y_jk w_jk, which costs O(mn) multiplications per iteration even if the values y_jk are stored in memory, while APEX utilizes the lateral connection network to effectively compute something similar to this term with only O(n) operations.
2) Convergence Speed-up by Adopting the Optimal Step-Size Parameter: Our study regarding the best step-size parameter β results in impressive convergence speeds. Such numerical analysis is lacking from all other models except the model proposed by Bannour et al. [3].
3) Parallelizability: Bannour's model is essentially sequential. It requires the extraction of the previous m - 1 components for the successful extraction of the mth one. As we saw, the APEX model is capable of parallel multiple component extraction.
4) Extensibility to the CPC Problem: The APEX model is even more general in that it can be used for extracting the constrained principal components of a random process, as will be discussed in Section V.
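The following sketch puts Algorithms 1 and 2 side by side: the same per-sample update is either scheduled one neuron at a time (sequential) or applied to all N neurons at every sample (parallel). Data dimensions, sweep counts, and the initialization are illustrative assumptions, and a fixed number of sweeps stands in for the stopping criteria discussed above.

```python
import numpy as np

def train_apex(X, N=4, gamma=0.99, sweeps=30, parallel=False, seed=0):
    """Extract N principal components with APEX (Algorithm 1 or 2)."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    Wf = 0.1 * rng.standard_normal((N, n))   # feed-forward weights, one row per neuron
    C = np.zeros((N, N))                     # lateral weights (strictly lower triangular)
    inv_beta = np.ones(N)                    # per-neuron 1/beta, cf. Eq. (28)

    def update(m, x):
        y_prev = Wf[:m] @ x
        y = Wf[m] @ x - C[m, :m] @ y_prev
        inv_beta[m] = gamma * inv_beta[m] + y * y
        b = 1.0 / inv_beta[m]
        Wf[m] += b * (y * x - y * y * Wf[m])
        C[m, :m] += b * (y * y_prev - y * y * C[m, :m])

    if parallel:                             # Algorithm 2: all neurons at every sample
        for _ in range(sweeps):
            for x in X:
                for m in range(N):
                    update(m, x)
    else:                                    # Algorithm 1: grow the network one unit at a time
        for m in range(N):
            for _ in range(sweeps):
                for x in X:
                    update(m, x)
    return Wf

# Example on colored data whose principal directions are the coordinate axes:
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 16)) * np.linspace(3.0, 0.5, 16)
V = train_apex(X, N=3)
print(np.round(V @ V.T, 2))                  # approximately the 3 x 3 identity matrix
```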
B. Simulation Results

In this section we present two typical sets of simulations that demonstrate the numerical performance of the APEX algorithm: one for the sequential and one for the parallel model. The data comprise an artificial colored random vector sequence with n = 64. The same data are used for both the sequential and the parallel model for purposes of comparison. In order to ensure stationarity we have repeated the sequence periodically, so that the convergence performance of the algorithm can be studied. Each period of the data is called a sweep and contains 200 data samples.

In Fig. 4 we show the convergence of the first four components using the sequential APEX model. We used (28) with γ = 0.995 for determining the value of β (obviously, we could have also used (30) with M = 200, achieving similar results). The convergence is exponential, as predicted by our theoretical analysis in Section III. Notice that we have superimposed the plots for each neuron in order to save space: in fact, the second unit starts after the first one has converged, so the actual time reference for the start of curve b is k = 5000 instead of k = 0. Similarly, the third unit starts after the second one has converged, etc. Since the convergence speed for each neuron depends predominantly (see the proof of Theorem 1) on the eigenvalue gap λ_m - λ_{m+1} and not on the order of the unit, it comes as no surprise that the first neuron has slower convergence compared to the other units. Only in the parallel APEX model are the speeds of convergence of the different neurons necessarily sorted according to their order. This is indeed shown in the next set of simulations, which concern the parallel APEX model. Fig. 5 shows the results of the simulation on the same data used in Fig. 4, with the same choice of β and γ. Notice that in both figures related to the parallel model the final variance of the components is larger than in the corresponding figures for the sequential case. As mentioned before, this is due to the fact that each neuron in the parallel model receives a noisy estimate of the components prior to it, so this noise adds to the inherent variance that the algorithm produces. Notice also that in the parallel model the first unit converges faster than the second, which in turn converges faster than the third, etc., as expected from the fact that each unit needs the convergence of all previous ones prior to its own convergence.

Fig. 4. (a) Plot of the square distance ||e_m - w_mk||^2 between the actual components and the ones estimated using sequential APEX. The data are repeated in sweeps (1 sweep = 200 iterations). (b) Average y_mk^2 over each sweep for each neuron m.

Fig. 5. (a) Plot of the quantity ||e_m - w_mk||^2 for the parallel model. Same data as used in Fig. 4. (b) Average y_mk^2 over each sweep.

V. CONSTRAINED PRINCIPAL COMPONENT ANALYSIS (CPC)

In this section we will show that APEX introduces a new application domain through its ability to extract constrained principal components (CPC's). The problem arises in cases where certain subspaces are less preferred than others, thus affecting the selection of the "best" components. In CPC, we assume that there is an undesirable subspace ℒ, spanned by the rows of an orthonormal constraint matrix V. ℒ may stand for the interference subspace in interference-cancellation applications, or for a redundant subspace in applications of extracting novel components. More formally, the CPC problem can be defined as follows: given an n-dimensional stationary stochastic input vector process {x_k} and an ℓ-dimensional (ℓ < n) constraint process {v_k} such that

v_k = V x_k,  V orthonormal, spanning ℒ,

find the most representative m-dimensional (ℓ + m ≤ n) subspace ℒ_m, in the principal component sense, constrained to be orthogonal to ℒ. In other words, we are looking for the optimal linear transformation

y_k = W x_k

where W is orthonormal, such that the error

J = E{||x - x̂||^2} = E{||x - W^T y||^2}

is minimal under the constraint

W V^T = 0.    (36)

Here y_k is basically the projection of x_k onto the subspace spanned by the rows of W, which is orthogonal to ℒ. In [21] it was shown that the optimal solution to the CPC problem is

W* = [e_1 ... e_m]^T

where e_1, ..., e_m are the principal right eigenvectors of the input skewed autocorrelation matrix

R_s = (I - V^T V) R_x,

while the minimum error is E{||v_k||^2} + ∑_{i=m+1}^{n} λ_i, where λ_1 ≥ λ_2 ≥ ... ≥ λ_n are the eigenvalues of R_s in decreasing order. Much like PC analysis, the components now maximize the output variance, but under the additional constraint that they are orthogonal to V.

Fig. 6. (a) The single-output model. The connections denote the weights w_i, c_i, which are trained. (b) The multiple-output model.

A. APEX Solves CPC

The following facts can be easily shown regarding the eigenvalues/eigenvectors of R_s:
1) All eigenvalues are real and nonnegative. To see this, notice that the matrix I - V^T V is idempotent, i.e., (I - V^T V)^k = (I - V^T V) for all k ≥ 1, and let e be a right eigenvector of R_s associated with eigenvalue λ. Assume for the moment that e is complex; then

e* R_x (I - V^T V) R_x e = λ e* R_x e  ⇒  ||(I - V^T V) R_x e||^2 = λ ||R_x^{1/2} e||^2 ≥ 0

where * denotes conjugate transpose. Therefore λ is not only real but also λ ≥ 0, and e is also real.
2) There are exactly ℓ zero eigenvalues of R_s, and the remaining n - ℓ eigenvalues are positive. This is true because the matrix (I - V^T V) has a null space of dimension ℓ and R_x was assumed invertible.
3) Eigenvectors corresponding to different nonzero eigenvalues are orthogonal. Indeed, let e_i, e_j be eigenvectors of R_s corresponding to nonzero eigenvalues λ_i ≠ λ_j. Then

e_i^T R_x (I - V^T V) R_x e_j = λ_j e_i^T R_x e_j = λ_i e_i^T R_x e_j

and since λ_i ≠ λ_j this implies e_i^T R_x e_j = 0, so e_i^T R_x (I - V^T V) R_x e_j = 0. But

e_i^T R_x (I - V^T V) R_x e_j = λ_i λ_j e_i^T e_j = 0    (37)

and since λ_i, λ_j ≠ 0, it follows that e_i^T e_j = 0.

We will show now that the APEX model tackles the CPC problem defined above. Consider the APEX network in Fig. 6(a), where the first ℓ units correspond to the undesirable components v_1, ..., v_ℓ. We shall call them constraint units. We train the first non-constraint neuron with the APEX rule

w_{k+1} = w_k + β_k (y_k x_k - y_k^2 w_k)
c_{k+1} = c_k + β_k (y_k v_k - y_k^2 c_k)

where y_k = w_k^T x_k - c_k^T v_k, i.e., the lateral inputs are now the constraint outputs v_k. We can prove the following:
Theorem 2: Consider the algorithm defined above. Then with probability 1, the quantity q_k = w_k - V^T c_k converges to ±e_1 as k → ∞.

Again, for the proof the reader is referred to [8]. In the general case for CPC it is not true that c_k → 0. In the special case where the rows of V are principal eigenvectors of R_x, the problem is reduced to standard principal component analysis and Theorem 1 applies.

The CPC problem can be solved recursively by the APEX model. Since the components are orthogonal, once the first component has been extracted it can be appended to the rows of V to obtain a new orthogonal constraint matrix

V' = [V^T  e_1]^T

which will be used to obtain the next component.
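A compact numerical sketch of the constrained extraction: the constraint outputs v_k = V x_k act as the lateral inputs of the non-constraint neuron, and the quantity q = w - V^T c of Theorem 2 is compared with the principal eigenvector of the skewed matrix R_s = (I - V^T V) R_x. The data, sizes, and forgetting factor are arbitrary choices for illustration; the update mirrors (3)-(4) with v_k in place of the outputs of previously trained neurons.

```python
import numpy as np

rng = np.random.default_rng(2)
n, ell = 12, 3
V = np.linalg.qr(rng.standard_normal((n, ell)))[0].T    # orthonormal rows spanning L
A = rng.standard_normal((n, n))
Rx = A @ A.T / n
X = rng.multivariate_normal(np.zeros(n), Rx, size=5000)

w = 0.1 * rng.standard_normal(n)
c = np.zeros(ell)
inv_beta, gamma = 1.0, 0.999
for x in X:
    v = V @ x                      # outputs of the constraint units
    y = w @ x - c @ v              # output of the first non-constraint neuron
    inv_beta = gamma * inv_beta + y * y
    b = 1.0 / inv_beta
    w = w + b * (y * x - y * y * w)
    c = c + b * (y * v - y * y * c)

q = w - V.T @ c                    # quantity of Theorem 2
Rs = (np.eye(n) - V.T @ V) @ Rx    # skewed autocorrelation matrix
vals, vecs = np.linalg.eig(Rs)
e1 = np.real(vecs[:, np.argmax(np.real(vals))])
print(round(abs(q @ e1) / (np.linalg.norm(q) * np.linalg.norm(e1)), 3))   # close to 1
```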
B. Simulation Results

Fig. 7 depicts the convergence of the APEX network for the CPC problem. The data are artificially created and have dimension n = 63. There are ℓ = 8 constraint neurons, while the orthonormal constraint matrix V is randomly picked. The convergence is exponential, as in the PCA case.

Fig. 7. APEX model performance for CPC. We use 200 63-dimensional data vectors repeated in sweeps. (a) The convergence of the error ||q_k - e_m||^2 for each neuron m. (b) The output variance: average y^2 over each sweep for each neuron. The final variance approximately equals the corresponding eigenvalue λ_m.
VI. APPLICATIONS

The potential application domain of PCA-type problems is very broad. Examples can be found in biomedical signal processing, speech and image processing, antenna applications, seismic signal processing, geophysical exploration, electromagnetic problems, etc. Typical application classes are, e.g., noise cancellation in signal/image processing, adaptive linear prediction in speech processing, or reconstructing the parameters of an underlying physical model from externally observed signals. Some of the applications in this section were also covered in [25].

A. Application of SVD to Spectral Analysis

The goal of the so-called high-resolution techniques is to achieve maximal resolution of closely spaced frequencies using as small a sample size as possible. Clearly, there are relationships between the sample size, the signal-to-noise ratio, the model order, and the minimal frequency spacing that limit the resolving capabilities of any method [31]. We now deal with harmonic retrieval of very closely spaced frequencies in the presence of noise. The same method can be easily applied to the problem of direction-of-arrival estimation by an antenna array. We consider the more realistic case where the signal x_k is embedded in an additive stationary white noise η_k with variance σ^2:

x_k = ∑_{i=1}^{m} a_i e^{j(ω_i k + φ_i)} + η_k.

Let r(l) = E{x_k x̄_{k+l}} be the autocovariance sequence of x_k, where x̄_{k+l} denotes the complex conjugate of x_{k+l}. Let R_x be the infinite Toeplitz correlation matrix constructed from r(l):

R_x = R + σ^2 I
where I is the identity matrix and R is the correlation matrix of the noise-free sinusoidal signals. R_x is now full rank, and its eigenvalues have been translated by σ^2; its eigenvectors remain the same as those of R. The method proposed by Kung et al. [22] uses the SVD of R_x,
R_x = U Σ U*, where the superscript * denotes complex conjugate transposition. R_x is generally full rank. However, by scrutinizing the singular values we can obtain an estimate m̂ of the number of sources, since it can be expected that there will be a gap between the m larger eigenvalues and the rest: the former correspond to the nonzero eigenvalues of R, while the latter are estimates of σ^2. The true correlation matrix R should be of rank m. Therefore, a rank-m̂ approximation is sought and is given by U_s Σ_s U_s*, where U_s are the eigenvectors corresponding to the m̂ largest eigenvalues of R_x and Σ_s is the diagonal matrix composed of these eigenvalues. By analogy with the decomposition R_x = OC of the infinite case, the method uses for O the matrix U_s of the signal eigenvectors.
F is then given by the least-squares solution of U_s^- F = U_s^+, where U_s^- is U_s deprived of its last row and U_s^+ is U_s deprived of its first row. We will also assume that the signal is only known in a finite interval; therefore only a finite-dimensional estimate R̂_x of the correlation matrix can be computed. Let K be the total number of samples, and let x_n = [x_n, x_{n+1}, ..., x_{n+N-1}]^T denote the vector of N consecutive samples starting at time n.
TABLE I
SIMULATION RESULTS FOR HARMONIC RETRIEVAL; '*' FOLLOWING A NUMBER DENOTES THAT THE FREQUENCY CANNOT BE RESOLVED

                    Matlab (order of covariance matrix)       APEX (no. of neurons)
No. of samples        30        25         20                   30        25         20
59                 1.1086    1.1162     1.0299               1.1086    1.1164     1.0299
                   0.9716    0.9236    -1.6314*              0.9716    0.9235    -1.6328*
49                 1.0955    1.1033     1.0818               1.0955    1.1033     1.0806
                   0.9856    0.9720     0.9081               0.9857    0.9719     0.9087
44                 1.1054    1.1108     1.0842               1.1054    1.1109     1.0840
                   0.9550    0.9323     0.8764               0.9551    0.9324     0.8761
In the non-noisy case we have x_n = S D^n a. Consider the matrix X = [x_0 ... x_{K-N}]. It is well known [43] that X has rank m. If we take for the so-called covariance estimation matrix

R̂_x = (1/(K-N)) X X*

then rank(R̂_x) = m, which implies span(R̂_x) = span(S). But we also have span(R̂_x) = span(U_s). Thus S = U_s P with P nonsingular. On the other hand, we have the fundamental relationship S^- D = S^+. The relationship S = U_s P still holds when one row is suppressed. Therefore we have U_s^- P D = U_s^+ P, i.e., F = P D P^{-1}: the eigenvalues of F are the diagonal entries of D, from which the frequency estimates are obtained.
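For concreteness, here is a self-contained NumPy sketch of the retrieval procedure just outlined: build the covariance estimate from length-N data vectors, take its m principal eigenvectors (here by a direct eigendecomposition, where the paper would obtain them adaptively with APEX), solve the shift-invariance equation in the least-squares sense, and read the frequencies off the eigenvalues of F. The two closely spaced frequencies, the noise level, and the sizes are arbitrary test values.

```python
import numpy as np

rng = np.random.default_rng(3)
K, N, m = 64, 20, 2                         # samples, covariance order, number of sources
w_true = np.array([1.10, 0.95])             # closely spaced frequencies (rad/sample)
k = np.arange(K)
x = sum(np.exp(1j * (w * k + rng.uniform(0, 2 * np.pi))) for w in w_true)
x = x + 0.05 * (rng.standard_normal(K) + 1j * rng.standard_normal(K))

# Length-N data vectors x_n and the covariance estimate R_hat.
Xmat = np.array([x[i:i + N] for i in range(K - N + 1)]).T
R_hat = Xmat @ Xmat.conj().T / Xmat.shape[1]

vals, U = np.linalg.eigh(R_hat)
Us = U[:, np.argsort(vals)[::-1][:m]]       # the m principal ("signal") eigenvectors

# Shift invariance: least-squares solution of Us_minus @ F = Us_plus.
F, *_ = np.linalg.lstsq(Us[:-1, :], Us[1:, :], rcond=None)
freqs = np.sort(np.angle(np.linalg.eigvals(F)))
print(np.round(freqs, 3))                   # estimates of 0.95 and 1.10
```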
B. Signal Detection: Drifting-Phase Sinusoids

There are various detection applications in underwater acoustics and optical communications where the problem can be formulated using the following two alternative hypotheses:

H_0: x_k = n_k
H_1: x_k = A cos(ω k + ε θ_k + φ) + n_k

where x_k is a scalar observation process, n_k is white Gaussian noise, θ_k is a unit Brownian motion, and φ is an independent random initial phase uniformly distributed in [0, 2π]. Without loss of generality we can assume that n_k has unit variance. The above model is suitable for the detection of weak acoustic signals, which requires relatively long observation intervals. Although the signal might be nominally harmonic, during such long intervals the phase of the signal cannot be assumed constant [46]. A similar problem formulation arises in optical communications, where θ_k models the laser phase noise [13], although there the signals are typically modeled in continuous time. The parameters A and ω denote the amplitude and the frequency of the sinusoid, respectively, and ε is called the bandwidth of the phase drift. Obviously, when ε = 0 the problem is reduced to detecting sinusoids with random but constant phase immersed in white Gaussian noise. It is known that in this case the quadrature (standard noncoherent) detector is optimal [47]. When ε > 0, however, the optimal Neyman-Pearson detector is different and difficult to calculate analytically. Given an observation sequence x = [x_1 ... x_n], the likelihood ratio is defined as
where :Ck is a scalar observation process, 71.k is white Gaussian noise, 0 k is a unit Brownian motion2 and 4 is an independent random initial phase uniformly distributed in [O; 27~1.Without loss of generality we can assume that rik has unit variance. The above model is suitable for the detection problem of weak acoustic signals which require relatively long observation intervals. Although the signal might be nominally harmonic, during such long intervals the phase of the signal cannot be assumed constant [46]. A similar problem formulation arises in optical communications where t9k models the laser phase noise [ 131 although typically the signal are modeled in continuous time. The parameters A and w denote the amplitude and the frequency of the sinusoid, respectively, and is called the bandwidth of the phase drift. Obviously when = 0 then the problem is reduced to detecting sinusoids with random but constant phase immersed in white Gaussian noise. It is known that in this case the quadrature (standard noncoherent) detector is optimal [47]. When E > 0 however the optimal Neyman-Pearson detector is different and difficult to calculate analytically. Given an observation sequence x = [:cl . . . : c , ~ ] the likelihood ratio is defined as