Formalization of the Two-Step Approach to Overcomplete BSS

Fabian J. Theis
Institute of Biophysics, University of Regensburg, D-93040 Regensburg, Germany
email: [email protected]

ABSTRACT
We discuss overcomplete blind source separation (BSS), that is, separation with more sources than sensors. In the linear quadratic case, finding the mixing matrix already solves the BSS problem; in overcomplete BSS, however, the sources cannot be recovered from the mixtures and the mixing matrix alone, and such a recovery is not even unique. We therefore follow Bofill and Zibulevsky [5] and many others and take a two-step approach to overcomplete BSS: In the first, so-called blind mixing model recovery (BMMR) step, the mixing model has to be reconstructed from the mixtures; in the linear case this means finding the mixing matrix. Then, in the blind source recovery (BSR) step, the sources have to be reconstructed given the mixing matrix and the mixtures. We furthermore introduce some notation and terminology and describe the usual BSR step in order to enable forthcoming overcomplete BSS papers to concentrate on one of the two steps, mainly on the BMMR step. Finally, we prove that the shortest-path algorithm proposed by Bofill and Zibulevsky [5] indeed solves the maximum-likelihood conditions of the BSR step.

KEY WORDS
geometric independent component analysis, blind source separation, overcomplete independent component analysis
Elmar W. Lang
Institute of Biophysics, University of Regensburg, D-93040 Regensburg, Germany
email: [email protected]
1 Introduction

In independent component analysis (ICA), given a random vector, the goal is to find its statistically independent components. This can be used to solve the blind source separation (BSS) problem, which is, given only the mixtures of some underlying independent sources, to separate the mixed signals and thus recover the original sources. In contrast to correlation-based transformations such as principal component analysis, ICA renders the output signals as statistically independent as possible by evaluating higher-order statistics. The idea of ICA was first expressed by Jutten and Herault [13] [14], while the term ICA was later coined by Comon in [8]. However, the field became popular only with the seminal paper by Bell and Sejnowski [4], who elaborated upon the Infomax principle first advocated by Linsker [20] [21]. Amari [1] as well as Cardoso and Laheld [6] later simplified the Infomax learning rule by introducing the concept of a natural gradient, which accounts for the non-Euclidean Riemannian structure of the space of weight matrices. These algorithms generally assume that at least as many sensor signals as underlying source signals are provided. In overcomplete ICA, however, more sources are mixed into fewer observed signals. The ideas used in overcomplete ICA originally stem from coding theory, where the task is to find a representation of some signals in a given set of generators which often are more numerous than the signals, hence the term overcomplete basis. Sometimes this representation is advantageous as it uses as few 'basis' elements as possible, which is then called sparse coding. Olshausen and Field first put these ideas into an information-theoretic context, decomposing natural images into an overcomplete basis [23]. Later, Harpur and Prager [10] and, independently, Olshausen [22] presented a connection between sparse coding and ICA in the quadratic case. Lewicki and Sejnowski [19] were then the first to apply these terms to overcomplete ICA, which was further studied and applied by Lee et al. [16]. De Lathauwer et al. [15] provided an interesting algebraic approach to overcomplete ICA of 3 sources and 2 mixtures by solving a system of linear equations in the third- and fourth-order cumulants, and Bofill and Zibulevsky [5] treated a special case ('delta-like' source distributions) of source signals after Fourier transformation.
2 Basics
For m, n ∈ N let Mat(m × n) be the R-vector space of real m × n matrices, and let Gl(n) := {W ∈ Mat(n × n) | det(W) ≠ 0} be the general linear group of R^n. In the general case of linear blind source separation (BSS), a random vector X : Ω → R^m, called the mixed vector, is given; it originates from an independent random vector S : Ω → R^n, which will be denoted the source vector, by mixing with a mixing matrix A = (a_1 | ... | a_n) ∈ Mat(m × n), i.e. X = AS. Here Ω denotes a fixed probability space. Only the mixed vector is known, and the task is to recover both the mixing matrix A and the source signals S. Let a_i := A e_i denote the columns of A, where the e_i are the unit vectors. We will assume that the mixing matrix A has full rank and that any two different columns a_i, a_j, i ≠ j, are linearly independent. Note that for the quadratic case (m = n) this just
means that A is invertible, i.e. A ∈ Gl(n). In this case we can of course recover S from A and X by S = A^(-1)X. For fewer sources than mixtures (m > n) the BSS problem is said to be undercomplete, and it is easily reduced to a quadratic BSS problem by selecting only n of the m mixtures or by applying some more sophisticated preprocessing such as PCA. The separation problem we are mainly interested in here is the overcomplete case, where fewer mixtures than sources are given (m < n). The problem stated like this is ill-posed, hence further restrictions will have to be made. For the quadratic case, many different algorithms have been proposed, with the popular Bell-Sejnowski maximum entropy algorithm [4] and the FastICA algorithm by Hyvärinen and Oja [12] [11] being among the most efficient. Some have also been extended to overcomplete situations [18] [17].
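To fix ideas, the following NumPy sketch (our illustration, not part of the original exposition; all helper names are ours) sets up such an overcomplete mixture: three independent, sparsely distributed sources are mixed down to two sensor signals, using the mixing matrix that reappears in example 1 of section 7.

import numpy as np

rng = np.random.default_rng(0)
n_sources, n_sensors, n_samples = 3, 2, 1000

# independent, sparse (Laplacian) sources S; shape (n_sources, n_samples)
S = rng.laplace(size=(n_sources, n_samples))

# overcomplete mixing matrix A in Mat(m x n) with m < n: full rank and
# pairwise linearly independent columns (here the matrix of example 1)
A = np.array([[0.7071, -0.4472, -0.9487],
              [0.7071,  0.8944,  0.3162]])

X = A @ S                                      # mixed (sensor) signals, shape (2, n_samples)
assert np.linalg.matrix_rank(A) == n_sensors   # A has full rank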
3 A Two-Step Approach to the Separation
In the linear quadratic BSS case it is sufficient to recover the mixing matrix A in order to solve the separation problem, because the sources can then be reconstructed from A and X by inverting A. For the overcomplete case presented here, however, it is useful to consider two separate problems. First, in the so-called blind mixing model recovery (BMMR) step, the mixing model, in our case the mixing matrix A, has to be reconstructed. Trying to solve the equation As = x for fixed x gives an (n − m)-dimensional affine vector space, so in a second step, called blind source recovery (BSR), special solutions of this equation have to be chosen using a suitable boundary condition. Hence we follow a two-step approach to the separation of more sources than mixtures; this two-step approach has been proposed recently by Bofill and Zibulevsky [5] for delta-like source distributions. It contrasts with the single-step separation algorithm by Lewicki and Sejnowski [19], where both steps are fused into the minimization of a single, rather complex energy function. In [24] we give a geometric linear BMMR algorithm; there, we show that our approach resolves the convergence problems induced by the complicated energy function and, moreover, that it contains the quadratic case as a special case in a very obvious way.
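Purely as a structural illustration of this two-step architecture (and not the geometric algorithm of [24]), the following sketch separates the two steps into independent functions; estimate_mixing_matrix is a hypothetical hook into which any BMMR algorithm can be plugged, while recover_sources stands for the BSR step of section 5, here with the simplest (pseudoinverse) choice of solution.

import numpy as np

def estimate_mixing_matrix(X, n_sources):
    # Step 1, BMMR: reconstruct the m x n mixing matrix from the mixtures X alone.
    # Hypothetical placeholder; any matrix-recovery algorithm (e.g. geometric BMMR [24]) goes here.
    raise NotImplementedError

def recover_sources(A_est, X):
    # Step 2, BSR: choose one solution of A s = x per sample from the
    # (n - m)-dimensional solution space; here the minimum 2-norm solution.
    return np.linalg.pinv(A_est) @ X

def overcomplete_bss(X, n_sources):
    A_est = estimate_mixing_matrix(X, n_sources)
    return A_est, recover_sources(A_est, X)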
4 Blind Mixing Model Recovery
For now, let the mixing model be arbitrary, that is, let f : R^n → R^m be differentiable such that the differential Df(s) has full rank and pairwise linearly independent columns for all s ∈ R^n, and let X = f ∘ S. The goal of blind mixing model recovery (BMMR) then is, given only X, to find a differentiable f′ : R^n → R^m as above such that there exists an independent random vector S′ with X = f′ ∘ S′. In the linear case treated here this means, given only the mixtures X, the goal is to find a matrix A′ ∈ Mat(m × n) with full rank and pairwise linearly independent columns such that there exists an independent random vector S′ with X = A′S′. If we assume that at most one component of S and S′ is Gaussian, then for m = n it is known [8] that A′ is equivalent to A; here we call two matrices B, C ∈ Mat(m × n) equivalent if C can be written as C = BPL with an invertible diagonal matrix (scaling matrix) L ∈ Gl(n) and an invertible matrix with unit vectors in each row (permutation matrix) P ∈ Gl(n). In the overcomplete case, however, no such uniqueness theorem is known, and we believe that without further restrictions it would not be true anyway; Taleb, however, has claimed that uniqueness does hold. The aim of this paper is not to give an explicit BMMR algorithm (see [24] [25] for an example), but to provide the architecture within which BMMR algorithms can be considered as solutions to overcomplete BSS.

5 Blind Source Recovery

The general problem of blind source recovery (BSR) can be formulated as follows: Given a random vector X : Ω → R^m and a mixing function f as above, find an independent vector S : Ω → R^n, satisfying possibly additional assumptions, such that X = f ∘ S. Using the results from the BMMR step given above, we can assume that an estimate of the original mixing matrix A has been found. In order to solve the overcomplete BSS problem, we are therefore left with the problem of reconstructing the sources using the mixtures X and the estimated matrix A, i.e. with the BSR problem as formulated above. Since A has full rank, the equation x = As yields the (n − m)-dimensional affine vector space A^(-1){x} as solution space for s. Hence, if n > m, the source-recovery problem is ill-posed without further assumptions. An often-used assumption [19] [5] can be derived from a maximum-likelihood approach, as will be shown next. Considering X = AS, i.e. neglecting any additional noise, X can be imagined to be determined by A and S. Hence the probability of observing X given A and S can be written as P(X|S, A). Using Bayes' theorem, the posterior probability of S is then
$$P(S \mid X, A) = \frac{P(X \mid S, A)\, P(S)}{P(X)},$$
the probability of an event of S after knowing X and A. Given some samples of X, a standard approach for reconstructing S is the maximum-likelihood algorithm, which here means maximizing this posterior probability given the prior probability P(S) of S. Using the samples of X, one can then find the most probable S such that X = AS. In terms of representing the observed sensor signals X in a basis {a_i}, this is called the most probable decomposition of X in terms of the overcomplete basis of R^m given by the columns of A. Using the posterior of the sources P(S|X, A), we can obtain an estimate of the unknown sources by solving the
following relation:
$$S = \arg\max_{X=AS} P(S \mid X, A) = \arg\max_{X=AS} P(X \mid S, A)\, P(S).$$
Since X is fully determined by S and A, the likelihood P(X|S, A) is trivial, which leads to S = arg max_{X=AS} P(S). Note that the maximum under the constraint X = AS is of course not necessarily unique.
Figure 1. Shortest Path Algorithm. The shortest-path decomposition of the xλ in the above picture uses only the vectors a1 and a2 .
If we assume P(S) to be a Gaussian distribution, this leads to
$$S = \arg\max_{X=AS} \exp\left(-|S_1|^2 - \ldots - |S_n|^2\right) = \arg\min_{X=AS} |S_1|^2 + \ldots + |S_n|^2 = \arg\min_{X=AS} |S|_2,$$
where |v|_2 := (Σ_i |v_i|^2)^(1/2) denotes the Euclidean norm. Indeed, this source estimation is a linear operation and can be achieved using the Moore-Penrose inverse A^+ of A. Hence, in the Gaussian case, the solution of the source recovery is unique; in fact, for a given sample x_λ it is the Euclidean perpendicular from 0 onto the affine subspace A^(-1){x_λ} ⊂ R^n. This perpendicular exists and is unique because R^n is a Hilbert space and the subspace is convex and closed.
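A minimal sketch of this Gaussian-prior recovery (our illustration; the helper name bsr_gaussian is ours): for a full-rank A with m < n, numpy.linalg.pinv yields exactly the minimum-2-norm solution of As = x for every sample.

import numpy as np

def bsr_gaussian(A, X):
    # minimum Euclidean norm solution s = A^+ x for each column x of X
    return np.linalg.pinv(A) @ X

A = np.array([[0.7071, -0.4472, -0.9487],   # mixing matrix of example 1 (section 7)
              [0.7071,  0.8944,  0.3162]])
x = np.array([[1.0], [0.5]])                # an arbitrary example sample
s = bsr_gaussian(A, x)
assert np.allclose(A @ s, x)                # s solves the underdetermined system A s = x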
If P(S) is assumed to be Laplacian, that is P(S_i)(t) = a exp(−|t|), then we get
$$S = \arg\max_{X=AS} \exp\left(-|S_1| - \ldots - |S_n|\right) = \arg\min_{X=AS} |S_1| + \ldots + |S_n| = \arg\min_{X=AS} |S|_1,$$
where |v|_1 := Σ_i |v_i| denotes the 1-norm. Uniqueness of S holds in this case as well, as will be shown in lemma 5.1 for the case m = 2. Note that S may not be unique for other norms; for example, for the supremum norm |x|_∞ the perpendicular from 0 onto an affine subspace is not unique. The general algorithm for the source-recovery step therefore is the maximization of P(S) under the constraint X = AS; for the Laplacian prior this is a linear optimization problem, which can be tackled using various optimization algorithms [7]. In the following we will assume a Laplacian prior distribution of S, which is characteristic of a sparse coding of the observed sensor signals. In this case, the minimization has a nice visual interpretation, which suggests an easy-to-perform algorithm: The source-recovery step consists of minimizing the 1-norm |s_λ|_1 under the constraint A s_λ = x_λ for all samples x_λ. Since the 1-norm of a vector can be pictured as the length of a path with steps parallel to the axes, Bofill and Zibulevsky call this search shortest-path decomposition; indeed, s_λ represents the shortest path to x_λ in R^m along the lines given by the matrix columns a_i = A e_i of A, as will be shown in the subsequent section (see figure 1).
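Before turning to the shortest-path formulation, here is a hedged sketch of the general 1-norm recovery as a linear program (the basis-pursuit idea of [7]; not the code used in this paper, and the helper name is ours): s is split into nonnegative parts s = u − v, and the sum of all entries of u and v is minimized under Au − Av = x.

import numpy as np
from scipy.optimize import linprog

def bsr_laplacian_lp(A, x):
    # arg min |s|_1 subject to A s = x, via the standard LP reformulation s = u - v
    m, n = A.shape
    c = np.ones(2 * n)                           # objective: sum(u) + sum(v) = |s|_1
    A_eq = np.hstack([A, -A])                    # A u - A v = x
    res = linprog(c, A_eq=A_eq, b_eq=x, bounds=(0, None), method="highs")
    u, v = res.x[:n], res.x[n:]
    return u - v

A = np.array([[0.7071, -0.4472, -0.9487],
              [0.7071,  0.8944,  0.3162]])
x = np.array([1.0, 0.5])
s = bsr_laplacian_lp(A, x)
assert np.allclose(A @ s, x, atol=1e-6)          # feasible; |s|_1 is minimal among all solutions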
Figure 2. Shortest Path Algorithm, proof of µ(s) = 3 case. Given a decomposition of xλ into three vectors ai , the above figure shows that the path can be shortened, and it then only uses the two vectors closest to xλ .
5.1 Shortest-Path Algorithm
Bofill and Zibulevsky first proposed this algorithm in [5]. We present a proof of the fact that finding the shortest path indeed means minimizing the 1-norm. For simplicity, we only deal with two-dimensional mixtures, i.e. m = 2. The goal is to find arg min_{x_λ=As} |s|_1. As above, let a_1, ..., a_n denote the normalized columns of A = (a_1 | ... | a_n). The shortest-path algorithm is based on the following lemma:

Lemma 5.1 Let a_1, ..., a_n ∈ R^2, n > 1, be normalized and pairwise linearly independent. Let x_λ ∈ R^2 and ŝ ∈ R^n be such that x_λ = Σ_{i=1}^n ŝ_i a_i = Aŝ. Let j, k ∈ {1, ..., n} be such that a_j or −a_j lies closest to x_λ from below and a_k or −a_k lies closest to x_λ from above, in terms of their angle to a fixed axis (arbitrary if x_λ = 0). Then ŝ = arg min_{x_λ=As} |s|_1 holds if and only if ŝ_i = 0 for i ≠ j, k. Furthermore, this ŝ is unique.

Proof: Without loss of generality we assume that x_λ ≠ 0. Let ŝ = arg min_{x_λ=As} |s|_1, and let μ(s) = |{i | s_i ≠ 0}| be the number of non-zero entries of s. We claim that μ(ŝ) = 2. Assume the claim is not true. A simple geometric consideration, presented in figure 2, then indicates that μ(ŝ) > 3. So let ŝ_i ≠ 0 and define y := x_λ − ŝ_i a_i and t := ŝ − ŝ_i e_i. Then At = y, and by induction (μ(t) < μ(ŝ)) we know that μ(t) = 2. Hence μ(ŝ) = 3, which is a contradiction. So we conclude that μ(ŝ) = 2. As a simple consequence it then follows from figure 3 that indeed ŝ_i = 0 for i ≠ j, k.
Figure 3. Shortest Path Algorithm, proof that the shortest path decomposition among all paths with µ(s) = 2 uses the (closest to xλ ) vectors aj and ak . The triangle inequality shows this claim.
Now let ŝ_i = 0 for i ≠ j, k. Since a_j and a_k are linearly independent, they form a basis of R^2; therefore ŝ is uniquely determined by the conditions ŝ_i = 0 for i ≠ j, k and Aŝ = x_λ. As shown above, an s′ with s′ = arg min_{x_λ=As} |s|_1 fulfills these equations; uniqueness then shows that ŝ = arg min_{x_λ=As} |s|_1.

So the algorithm can be formulated as follows: For a given sample x_λ, pick the columns a_j and a_k of A that lie closest to x_λ as in lemma 5.1. Then s_λ ∈ R^n is defined by
$$(s_\lambda)_i = \begin{cases} \bigl((a_j \,|\, a_k)^{-1} x_\lambda\bigr)_1 & i = j, \\ \bigl((a_j \,|\, a_k)^{-1} x_\lambda\bigr)_2 & i = k, \\ 0 & \text{otherwise.} \end{cases}$$
It is easy to check that A s_λ = x_λ, so by lemma 5.1 this means that s_λ minimizes the 1-norm among all s with As = x_λ.
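The following NumPy sketch is our reading of this closed-form recovery for m = 2 (the function name is ours): for each sample, among all signed columns ±a_i the two directions enclosing x_λ in angle are selected, and the corresponding 2 × 2 system is solved. It assumes normalized, pairwise linearly independent columns as in lemma 5.1.

import numpy as np

def shortest_path_decompose(A, x):
    # 1-norm minimal decomposition of x in R^2 over the columns of A (lemma 5.1)
    m, n = A.shape
    s = np.zeros(n)
    if np.allclose(x, 0.0):
        return s
    ang_x = np.arctan2(x[1], x[0])
    dirs = np.hstack([A, -A])                                 # all signed columns +a_i, -a_i
    rel = (np.arctan2(dirs[1], dirs[0]) - ang_x + np.pi) % (2 * np.pi) - np.pi
    below = np.where(rel <= 0)[0]                             # signed columns below x (by angle)
    above = np.where(rel >= 0)[0]                             # signed columns above x
    jb = below[np.argmax(rel[below])]                         # closest from below
    ka = above[np.argmin(rel[above])]                         # closest from above
    j, sj = jb % n, (1.0 if jb < n else -1.0)
    k, sk = ka % n, (1.0 if ka < n else -1.0)
    if j == k:                                                # x lies on the direction of a_j
        s[j] = sj * np.linalg.norm(x)                         # columns are normalized
        return s
    coeff = np.linalg.solve(np.column_stack([sj * A[:, j], sk * A[:, k]]), x)
    s[j], s[k] = sj * coeff[0], sk * coeff[1]
    return s

A = np.array([[0.7071, -0.4472, -0.9487],                     # normalized columns (example 1)
              [0.7071,  0.8944,  0.3162]])
x = np.array([1.0, 0.5])
s = shortest_path_decompose(A, x)
assert np.allclose(A @ s, x)                                  # x is reproduced exactly
assert np.count_nonzero(s) <= 2                               # at most two active columns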
6 Indices for Experiments
In order to compare the mixture matrix A with the recovered matrix B from the matrix-recovery step, we calculate the generalized crosstalking error E(A, B) of A and B, defined by
$$E(A, B) := \min_{M \in \Pi} \|A - BM\|,$$
where Π is the group of all invertible real n × n matrices in which only one entry in each column differs from 0, and ‖·‖ is a fixed matrix norm.

Lemma 6.1 E(A, B) = 0 if and only if A is equivalent to B.

Proof: Note that Π consists of all n × n matrices of the type PL, where L is a non-degenerate diagonal matrix (scaling matrix) and P a permutation matrix. If A is equivalent to B, then by definition there exists an M ∈ Π such that A = BM, and therefore E(A, B) = 0. Vice versa, if E(A, B) = 0, then there exists M ∈ Π with ‖A − BM‖ = 0, i.e. A = BM.
In our examples, we used the norm ‖A‖ = Σ_{i,j} |a_{ij}|, but without the factor 1/n. In practice, E(A, B) can be calculated more easily by first normalizing the columns of A and B and then taking the minimum over the finite subset (2^n n! elements) Π′ ⊂ Π of all matrices with only one entry from {−1, 1} in each column. Comon also defines a distance between matrices up to permutation and scaling, which he calls the gap [9]; taking as matrix 'norm' the sum of the Euclidean norms of the columns in E(A, B) exactly gives the formula of the gap.

In order to analyze the source-recovery step, we look at the crosstalking error E_1(Cor(S, S′)) of the correlation matrix of the original sources S and the recovered sources S′, where
$$E_1(C) = \sum_{i=1}^{n}\left(\sum_{j=1}^{n}\frac{|c_{ij}|}{\max_k |c_{ik}|} - 1\right) + \sum_{j=1}^{n}\left(\sum_{i=1}^{n}\frac{|c_{ij}|}{\max_k |c_{kj}|} - 1\right)$$
is the performance index, often called crosstalking error, of an n × n matrix C introduced by Amari, Cichocki and Yang [2].

Lemma 6.2 Let C ∈ Gl(n). Then E_1(C) = 0 if and only if C ∈ Π, i.e. if C is the product of a scaling and a permutation matrix.

Proof: Π consists of the matrices with exactly one nonzero element per column and per row. As C is invertible, C has at least one nonzero element per column and row, and obviously E_1(C) = 0 if and only if C is of that type, i.e. if C ∈ Π.
Corollary 6.3 E_1(Cor(S, S′)) = 0 if and only if Cor(S, S′) is the product of a scaling and a permutation matrix, i.e. if and only if each recovered component is correlated with exactly one source component and vice versa.
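A sketch of both indices (our illustration, with the conventions used above: no 1/n factor, entrywise 1-norm, and the minimum over Π estimated on the finite set Π′ of signed permutations after normalizing columns, which is feasible only for small n; helper names are ours):

import numpy as np
from itertools import permutations, product

def amari_index(C):
    # crosstalking error E1(C) of a square (invertible) matrix C [2], without the 1/n factor
    C = np.abs(np.asarray(C, dtype=float))
    rows = (C / C.max(axis=1, keepdims=True)).sum(axis=1) - 1.0
    cols = (C / C.max(axis=0, keepdims=True)).sum(axis=0) - 1.0
    return rows.sum() + cols.sum()

def crosstalking_error(A, B):
    # generalized crosstalking error E(A, B) = min over signed permutations M of ||A - B M||
    A = A / np.linalg.norm(A, axis=0)                  # normalize columns
    B = B / np.linalg.norm(B, axis=0)
    n = A.shape[1]
    best = np.inf
    for perm in permutations(range(n)):
        for signs in product((-1.0, 1.0), repeat=n):
            BM = B[:, list(perm)] * np.array(signs)    # B times a signed permutation matrix
            best = min(best, np.abs(A - BM).sum())
    return best

For a perfect matrix recovery, crosstalking_error(A, B) vanishes, and by lemma 6.2 amari_index applied to the correlation matrix of original and recovered sources vanishes for a perfect source recovery.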
7 Experimental Results
In this section, we give some demonstrations of overcomplete BSS using the BMMR algorithm from [24] [25] and the BSR algorithm from above. The calculations have been performed on an AMD Athlon 1 GHz computer using Matlab and took at most one minute. Example 1 is a toy example, where we mixed three independent Gamma-distributed signals (distribution proportional to exp(−|x|^γ)) with zero mean, unit variance and an approximate kurtosis of 9.3 to two signals using the mixture matrix
$$A = \begin{pmatrix} 0.7071 & -0.4472 & -0.9487 \\ 0.7071 & 0.8944 & 0.3162 \end{pmatrix}.$$
Geometric matrix recovery has been performed using the geometric algorithm of [24], where the learning rate has been decreased in a manner following [3]. Some tweaking of the parameters was necessary for establishing sufficiently stable convergence.
Figure 4. Example 1: Three mixed gamma distributions with the source directions indicated by the lines and the trained neurons marked with asterisks.
The initial learning rate was 0.02 with a cooling step every 50 epochs, using the cooling parameter β = 60 as in [3]. After 10^5 iterations, the following mixing matrix was found:
$$B = \begin{pmatrix} 0.9608 & 0.7118 & -0.4079 \\ -0.2773 & 0.7024 & 0.9130 \end{pmatrix},$$
which is already very close to A and can be improved by more sophisticated training procedures. The generalized crosstalking error of A and B is E(A, B) = 0.1183. A scatter plot of the mixing space together with the trained neurons on the circle is shown in figure 4. After the source-recovery step, we compared the demixed sources with the original ones and obtained the following correlation matrix:
$$\begin{pmatrix} 0.1656 & 0.9522 & 0.1048 \\ -0.2551 & 0.0719 & 0.9219 \\ -0.8827 & -0.0974 & 0.1949 \end{pmatrix}$$
with E_1(Cor(S, S′)) = 1.9493.
Figure 5. Example 2: The three sources to the left, the two mixtures in the middle, and the recovered signals to the right. The speech texts were 'californication', 'peace and love' and 'to be or not to be that'. The signal kurtoses were 8.9, 7.9 and 7.4.
Then, we mixed three speech signals using the mixture matrix
$$A = \begin{pmatrix} 0.9923 & 0.4472 & 0.2425 \\ 0.1240 & 0.8944 & -0.9701 \end{pmatrix}$$
to two mixed signals, as shown in figure 5 (left and middle). After 10^5 iterations, we found the mixing matrix
$$B = \begin{pmatrix} 0.9670 & 0.4522 & -0.2677 \\ 0.2546 & 0.8919 & 0.9635 \end{pmatrix},$$
which is satisfactorily similar to A (E(A, B) = 0.1952). After source recovery, we obtained the following correlation matrix of demixed signals and sources:
$$\begin{pmatrix} 0.8897 & 0.2700 & -0.3910 \\ 0.1585 & 0.7957 & 0.2376 \\ 0.2075 & -0.3067 & -0.8287 \end{pmatrix}$$
with E_1(Cor(S, S′)) = 3.7559. In figure 5 (right), the demixed signals are shown. One can see a good resemblance to the original sources, but the crosstalking error is still rather high.
Figure 6. Example 2: Scatterplot of the three mixed speech signals.
Figure 7. Source Recovery Test. Plot of the mean crosstalking error E_1(Cor(S, S′)) of the correlation matrix for 100 recovery runs, 1000 samples each, for every angle α.
We claim that this is a fundamental problem of the source-recovery step, which, to our knowledge, cannot be improved any further using the above probabilistic approach. To test this, we performed an experiment using the source-recovery algorithm to recover three Laplacian signals mixed with
$$A_\alpha = \begin{pmatrix} 1 & \cos(\alpha) & \cos(2\alpha) \\ 0 & \sin(\alpha) & \sin(2\alpha) \end{pmatrix},$$
where we gave the algorithm the correct mixing matrix in advance. We then compared the crosstalking error E_1(Cor(S, S′)) of the correlation matrix of the recovered signals S′ with the original ones S for different angles α ∈ [0, π/2]. Figure 7 shows that the result is nearly independent of the angle, which makes sense because one can show that the shortest-path algorithm is invariant under coordinate transformations like A_α. This experiment indicates that there might be a general limit on how well sources can be recovered in overcomplete settings.
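A self-contained sketch of this recovery test for a single angle (our illustration, not the script behind figure 7; it uses the linear-program form of the 1-norm recovery of section 5, which by lemma 5.1 yields the same minimizer as the shortest-path algorithm):

import numpy as np
from scipy.optimize import linprog

def l1_recover(A, x):
    # arg min |s|_1 subject to A s = x (cf. section 5), LP reformulation s = u - v with u, v >= 0
    n = A.shape[1]
    res = linprog(np.ones(2 * n), A_eq=np.hstack([A, -A]), b_eq=x,
                  bounds=(0, None), method="highs")
    return res.x[:n] - res.x[n:]

rng = np.random.default_rng(0)
alpha = np.deg2rad(30.0)                                   # one angle from [0, pi/2]
A = np.array([[1.0, np.cos(alpha), np.cos(2 * alpha)],
              [0.0, np.sin(alpha), np.sin(2 * alpha)]])

S = rng.laplace(size=(3, 1000))                            # three Laplacian sources
X = A @ S                                                  # two mixtures
S_rec = np.column_stack([l1_recover(A, x) for x in X.T])   # recovery with the true A

C = np.corrcoef(S, S_rec)[:3, 3:]                          # correlation of sources and recovered signals
# the crosstalking error E1(C) of section 6 then measures the recovery quality at this angle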
8 Conclusion
We have presented a two-step approach to overcomplete blind source separation. First, the original mixing matrix is approximated; then the sources are recovered by the usual maximum-likelihood approach with a Laplacian prior. We have also given a proof of the shortest-path recovery algorithm proposed by Bofill and Zibulevsky. Finally, we have discussed a toy example and a real-world example of overcomplete separation, and have taken a first glance at an analysis of the source-recovery step. In future work, new BMMR algorithms have to be found, and the BSR step has to be analyzed in more detail, especially the question whether there is a natural information-theoretic barrier to how well data can be recovered in overcomplete settings.
References

[1] S. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.

[2] S. Amari, A. Cichocki, and H.H. Yang. A new learning algorithm for blind signal separation. Advances in Neural Information Processing Systems, 8:757–763, 1996.

[3] C. Bauer, C.G. Puntonet, M. Rodriguez-Alvarez, and E.W. Lang. Separation of EEG signals with geometric procedures. In C. Fyfe, editor, Engineering of Intelligent Systems (Proc. EIS'2000), pages 104–108, 2000.
[4] A.J. Bell and T.J. Sejnowski. An information-maximisation approach to blind separation and blind deconvolution. Neural Computation, 7:1129–1159, 1995.

[5] P. Bofill and M. Zibulevsky. Blind separation of more sources than mixtures using sparsity of their short-time Fourier transform. Proc. of ICA 2000, pages 87–92, 2000.

[6] J.F. Cardoso and B.H. Laheld. Equivariant adaptive source separation. IEEE Transactions on Signal Processing, 44(12):3017–3030, 1996.

[7] S. Chen, D.L. Donoho, and M.A. Saunders. Atomic decomposition by basis pursuit. Technical report, Dept. Stat., Stanford Univ., Stanford, CA, 1996.

[8] P. Comon. Independent component analysis - a new concept? Signal Processing, 36:287–314, 1994.

[9] P. Comon. Blind channel identification and extraction of more sources than sensors. SPIE Conf. Adv. Sig. Proc. VIII, San Diego, pages 2–13, 1998.

[10] G.F. Harpur and R.W. Prager. Development of low-entropy coding in a recurrent network. Network, 7:277–284, 1996.

[11] A. Hyvärinen. Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3):626–634, 1999.

[12] A. Hyvärinen and E. Oja. A fast fixed-point algorithm for independent component analysis. Neural Computation, 9:1483–1492, 1997.

[13] C. Jutten and J. Herault. Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24:1–10, 1991.

[14] C. Jutten, J. Herault, P. Comon, and E. Sorouchyari. Blind separation of sources, parts I, II and III. Signal Processing, 24:1–29, 1991.

[15] L. De Lathauwer, P. Comon, B. De Moor, and J. Vandewalle. ICA algorithms for 3 sources and 2 sensors. IEEE Sig. Proc. Workshop on Higher-Order Statistics, June 14-16, 1999, Caesarea, Israel, pages 116–120, 1999.
[16] T. Lee, M.S. Lewicki, M. Girolami, and T.J. Sejnowski. Blind source separation of more sources than mixtures using overcomplete representations. IEEE Signal Processing Letters, 6(4):87–90, 1999.

[17] M. Lewicki and B.A. Olshausen. A probabilistic framework for the adaptation and comparison of image codes, 1999.

[18] M.S. Lewicki and T.J. Sejnowski. Learning nonlinear overcomplete representations for efficient coding. In M.I. Jordan, M.J. Kearns, and S.A. Solla, editors, Advances in Neural Information Processing Systems, volume 10. The MIT Press, 1998.

[19] M.S. Lewicki and T.J. Sejnowski. Learning overcomplete representations. Neural Computation, 1998.

[20] R. Linsker. An application of the principle of maximum information preservation to linear systems. Advances in Neural Information Processing Systems, 1, 1989.

[21] R. Linsker. Local synaptic learning rules suffice to maximize mutual information in a linear network. Neural Computation, 4:691–702, 1992.

[22] B.A. Olshausen. Learning linear, sparse, factorial codes. Technical Report AIM-1580, 1996.

[23] B.A. Olshausen and D. Field. Sparse coding of natural images produces localized, oriented, bandpass receptive fields. Technical Report CCN-110-95, 1995.

[24] F.J. Theis and E.W. Lang. Geometric overcomplete ICA. Proc. of ESANN 2002, pages 217–223, 2002.

[25] F.J. Theis and E.W. Lang. A theoretical framework for overcomplete geometric BMMR. SIP 2002, accepted, 2002.