JADE for Tensor-Valued Observations
Joni Virta, Bing Li, Klaus Nordhausen and Hannu Oja
Abstract—Independent component analysis is a standard tool in modern data analysis and numerous different techniques for applying it exist. The standard methods, however, quickly lose their effectiveness when the data are made up of structures of higher order than vectors, namely matrices or tensors (for example, images or videos), being unable to handle the high amounts of noise. Recently, an extension of the classic fourth order blind identification (FOBI) specifically suited for tensor-valued observations was proposed and shown to outperform its vector version for tensor data. In this paper we extend another popular independent component analysis method, the joint approximate diagonalization of eigen-matrices (JADE), to tensor observations. In addition to the theoretical background, we also provide the asymptotic properties of the proposed estimator and use both simulations and real data to show its usefulness and superiority over its competitors.

Index Terms—Independent component analysis, multilinear algebra, kurtosis, limiting normality, minimum distance index.

J. Virta, K. Nordhausen and H. Oja are with the Department of Mathematics and Statistics, University of Turku, 20014 Turku, Finland (e-mail: [email protected]). B. Li is with the Department of Statistics, Pennsylvania State University, 326 Thomas Building, University Park, Pennsylvania 16802, USA.
I. INTRODUCTION

The following presentation relies on multilinear algebra and, before the actual ideas can be described, we first review some key properties of tensors and matrices needed later. A tensor of rth order X ∈ R^{p_1 × ··· × p_r} can be seen as a higher order analogy of vectors and matrices. Whereas a matrix can be viewed either as a collection of rows or as a collection of columns, a tensor of rth order has in total r modes. The m-mode vectors of a tensor are given by letting the mth index vary while keeping all other indices fixed, m = 1,...,r. A tensor X ∈ R^{p_1 × ··· × p_r} thus contains ρ_m := ∏_{s ≠ m} p_s m-mode vectors of length p_m. The opposite construct, fixing a single index i_m and varying the others, then gives what we call the m-mode faces of a tensor. The number of m-mode faces totals p_m and each is a tensor of size p_1 × ··· × p_{m−1} × p_{m+1} × ··· × p_r.

For representing tensor contraction, or summation, we use the Einstein summation convention in which a twice-appearing index in a product implies summation over the range of the index. For example, for a tensor X = {x_{i_1 i_2 i_3}} we have

x_{i_1 i_2 j} x_{i_1 i_2 k} := Σ_{i_1=1}^{p_1} Σ_{i_2=1}^{p_2} x_{i_1 i_2 j} x_{i_1 i_2 k}.   (1)

Two special cases of tensor contraction prove especially useful for us. The product X ×_m A of a tensor X ∈ R^{p_1 × ··· × p_r} with a matrix A ∈ R^{p_m × p_m}, m = 1,...,r, is defined as the p_1 × ··· × p_r-dimensional tensor with the elements

(X ×_m A)_{i_1 ... i_r} = x_{i_1 ... i_{m−1} j_m i_{m+1} ... i_r} a_{i_m j_m}.

That is, the multiplication X ×_m A linearly transforms X from the direction of the mth mode without changing the size of the tensor. The operation can alternatively be viewed as applying the linear transformation given by A separately to each m-mode vector of the tensor. The second useful product, X ×_{−m} Y, of two tensors of the same size, X, Y ∈ R^{p_1 × ··· × p_r}, is defined as the p_m × p_m-dimensional matrix with the elements

(X ×_{−m} Y)_{jk} = x_{i_1 ... i_{m−1} j i_{m+1} ... i_r} y_{i_1 ... i_{m−1} k i_{m+1} ... i_r}.   (2)

The special case X ×_{−m} X provides higher order counterparts of the products of a vector x ∈ R^{p_1} or a matrix X ∈ R^{p_1 × p_2} with itself, such as xx^T, XX^T or X^T X, and proves useful in defining the "covariance matrix" of a tensor. Finally, define the vectorization vec(X) ∈ R^{p_1 ··· p_r} of a tensor X ∈ R^{p_1 × ··· × p_r} as the stacking of the elements x_{i_1 ... i_r} in such a way that the leftmost index goes through its cycle the quickest and the rightmost index the slowest. Then it holds for a tensor X ∈ R^{p_1 × ··· × p_r} and matrices A_1 ∈ R^{p_1 × p_1},...,A_r ∈ R^{p_r × p_r} that vec(X ×_1 A_1 ··· ×_r A_r) = (A_r ⊗ ··· ⊗ A_1) vec(X), where ⊗ is the Kronecker product.

In this paper we assume that the tensor-valued i.i.d. random elements X_i ∈ R^{p_1 × ··· × p_r}, i = 1,...,n, are observed from the recently suggested [1] tensor independent component model:

X = µ + Z ×_1 Ω_1 ··· ×_r Ω_r,   (3)

where Ω_1 ∈ R^{p_1 × p_1},...,Ω_r ∈ R^{p_r × p_r} are full rank mixing matrices, µ ∈ R^{p_1 × ··· × p_r} is the location center, and Z ∈ R^{p_1 × ··· × p_r} is an unobserved random tensor. The model (3) is further equipped with the following assumptions.

Assumption 1. The components of Z are mutually independent.

Assumption 2. The components of Z are standardized in the sense that E[vec(Z)] = 0 and Cov[vec(Z)] = I.

Assumption 3. For each m = 1,...,r, at most one m-mode face of Z consists entirely of Gaussian components.

Assumption 2 implies that E[X] = µ and that

Cov[vec(X)] = (Ω_r Ω_r^T) ⊗ ··· ⊗ (Ω_1 Ω_1^T)

has the so-called Kronecker structure. Assumption 3 is a tensor analogy of the usual vector independent component model assumption of at most one Gaussian component; without it, some column blocks of some of the matrices Ω_1,...,Ω_r could be identifiable only up to a rotation. Even under the above assumptions we can still freely change the signs and orders of the columns of all Ω_1,...,Ω_r, or multiply any Ω_s by a constant and divide any Ω_t by the same constant, but this indeterminacy is acceptable in practice.
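To make the notation concrete, the following minimal R sketch (our own illustration, not part of the paper; the helper name m_mode_prod is hypothetical) computes the m-mode product X ×_m A via the m-mode flattening of X, and draws one observation from the model (3) with r = 2, µ = 0 and standardized exponential components in Z.

```r
# m-mode product X x_m A: apply A to every m-mode vector of X
m_mode_prod <- function(X, A, m) {
  d <- dim(X)
  perm <- c(m, setdiff(seq_along(d), m))
  Xm <- matrix(aperm(X, perm), nrow = d[m])     # m-mode flattening of X
  d[m] <- nrow(A)
  aperm(array(A %*% Xm, d[perm]), order(perm))  # fold back into a tensor
}

# One draw from model (3) with r = 2 and mu = 0
set.seed(1)
p1 <- 3; p2 <- 4
Z <- matrix(rexp(p1 * p2) - 1, p1, p2)          # independent, E = 0, Var = 1
Omega1 <- matrix(rnorm(p1^2), p1)               # full rank with probability 1
Omega2 <- matrix(rnorm(p2^2), p2)
X <- m_mode_prod(m_mode_prod(Z, Omega1, 1), Omega2, 2)
# Sanity check: in the matrix case X = Omega1 Z Omega2^T
stopifnot(isTRUE(all.equal(X, Omega1 %*% Z %*% t(Omega2))))
```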
The model along with its assumptions now provides a natural extension of the standard independent component model, which is obtained as a special case when r = 1. Alternatively, the model can be seen as an extension of the general location-scatter model for tensor-valued data, which is equivalent to (3) with only Assumption 2 and is often, for r = 1, 2, combined with the assumption of Gaussianity or sphericity of vec(Z). Under the location-scatter model the covariance matrix of vec(X) again has the above Kronecker structure. In addition to requiring fewer parameters to estimate than a full p_1 ··· p_r × p_1 ··· p_r covariance matrix, the assumption of Kronecker structure is a natural choice in many applications, see e.g. [2]. For the estimation of covariance parameters under the assumption of Kronecker structure in the matrix case, r = 2, see [3], [4], [5]. For the general tensor Gaussian distribution and the estimation of its parameters see [6], [7].

The extension of dimension reduction methods from vector to matrix or tensor observations is in signal processing usually approached via different tensor decompositions, such as the CP decomposition and the Tucker decomposition. A review of them, along with a plethora of references for applications, is given in [8]; see also [9] for more applications. For examples of particular dimension reduction methods incorporating matrix or tensor predictors, see e.g. [10], [11], [1] for independent component analysis, [12], [13], [14], [15] for sufficient dimension reduction and [16], [17] for principal component analysis-based techniques. More references are also given in [12], [1].

In tensor independent component analysis the objective is to estimate, based on the sample X_1,...,X_n, unmixing matrices Φ_1,...,Φ_r such that X ×_1 Φ_1 ··· ×_r Φ_r has mutually independent components. A naïve method for accomplishing this would be to vectorize the observations and resort to some standard method of independent component analysis, but in doing so the resulting estimate lacks the desired Kronecker structure. In addition, vectorizing and using standard tools meant for vector-valued data requires the stronger, componentwise version of Assumption 3, inflates the number of parameters and can make the dimension of the data too large for standard methods to handle. To circumvent this, [10], [11], [1] proposed estimating an unmixing matrix separately for each of the modes, and [1] presented an extension of the classic fourth order blind identification (FOBI) [18] for tensor observations, called TFOBI.

In the vector independent component model, x = µ + Ωz, the standardized vector x_st := Cov[x]^{−1/2}(x − E[x]) equals Uz for some orthogonal matrix U, see e.g. [19]. In FOBI the rotation U is then found using the eigendecomposition of the matrix of fourth moments B := E[x_st x_st^T x_st x_st^T]. This same approach is taken in TFOBI by performing both steps of the procedure, the standardization and the rotation, on all r modes of X. Assuming centered X, in [1] the m-mode covariance matrices,

Σ_m(X) := ρ_m^{−1} E[X ×_{−m} X],   m = 1,...,r,   (4)

are first used to standardize the observations as X_st := X ×_1 Σ_1^{−1/2} ··· ×_r Σ_r^{−1/2}.
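Continuing the sketch above, the m-mode covariance matrices (4) of a centered matrix-valued sample (r = 2) can be estimated as follows; the function name is again our own.

```r
# Estimate Sigma_m(X) of (4) from a centered p1 x p2 x n sample array;
# for r = 2 we have X x_{-1} X = X X^T and X x_{-2} X = X^T X
m_mode_cov <- function(X, m) {
  d <- dim(X); n <- d[3]
  rho <- prod(d[1:2]) / d[m]              # rho_m = product of the other dims
  G <- matrix(0, d[m], d[m])
  for (i in 1:n) {
    Xi <- if (m == 1) X[, , i] else t(X[, , i])
    G <- G + tcrossprod(Xi)               # X_i x_{-m} X_i
  }
  G / (n * rho)
}
```

Here m_mode_cov(X, 1) and m_mode_cov(X, 2) play the roles of the estimates of Σ_1 and Σ_2.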
The tensor Z is then found by rotating X_st from all r modes, and the rotation matrices can be found from the eigendecompositions of the m-mode matrices of fourth moments:

B_m := ρ_m^{−1} E[(X_st ×_{−m} X_st)(X_st ×_{−m} X_st)].

Another widely used independent component analysis method for vector-valued data, the joint approximate diagonalization of eigen-matrices (JADE) [20], also uses fourth moments to estimate the required final rotation, but utilizes them in the form of cumulant matrices (assuming E[x] = 0)

C^{ij}(x) := E[x_i x_j · xx^T] − E[x_i x_j] E[xx^T] − E[x_i x] E[x_j x^T] − E[x_j x] E[x_i x^T].   (5)

The final rotation from x_st to z is in JADE obtained by jointly diagonalizing the matrices

C^{ij}(x_st) = E[x_{st,i} x_{st,j} · x_st x_st^T] − δ_ij I − E^{ij} − E^{ji},   (6)

where E^{ij} is a matrix with a single one as element (i, j) and zeroes elsewhere and δ_ij is the Kronecker delta. Compared to FOBI, which only uses p(p + 1)/2 sums of fourth joint moments of x_st, JADE thus has a clear advantage in using all possible fourth joint cumulants of x_st in the estimation of the rotation matrix. Because of the well-known fact that JADE outperforms FOBI in most cases, it is natural to expect that an extension of JADE to tensor-valued data would similarly be superior to TFOBI. This is indeed the case, and in the following sections we formulate the tensor joint approximate diagonalization of eigen-matrices (TJADE), which is obtained from JADE by applying very much the same extensions as required when moving from FOBI to TFOBI. We first briefly discuss the standard vector-valued independent component model and review the theory and assumptions behind the original JADE in Section II. The corresponding aspects of TJADE are presented in Section III and the asymptotic properties of both methods in Section IV. Simulations comparing TJADE to TFOBI and to both the original JADE and the original FOBI are presented in Section V along with a real data example, and we close in Section VI with some discussion. The proofs can be found in the Appendix.

II. ORIGINAL JADE

The original JADE assumes that the vector-valued observations are generated by the vector independent component model

x_i = µ + Ω z_i,   i = 1,...,n,   (7)

where the mixing matrix Ω ∈ R^{p×p} has full rank, µ ∈ R^p, and the i.i.d. random vectors z_i ∈ R^p have mutually independent components standardized to have zero means and unit variances. To ensure the existence of the JADE solution we have to further assume that at most one of the independent components has zero excess kurtosis [19]. Assuming next that the data are centered, that is, E[x] = 0, we standardize the vectors as x_st = Σ^{−1/2} x. The standardized vectors can be shown to satisfy x_st = Uz for some orthogonal matrix U, see for example [19]. To estimate U, JADE uses the cumulant matrices C^{ij}(x_st), i, j = 1,...,p, in (6). Under the independent component model the cumulant matrices can be shown to satisfy, for all i, j = 1,...,p,

C^{ij}(x_st) = U (Σ_{k=1}^{p} u_{ik} u_{jk} κ_k E^{kk}) U^T,   (8)
where κ_k := E[z_k^4] − 3 is the excess kurtosis of the kth component and the u_{ab} are the elements of U. The expression in (8) is an eigendecomposition of C^{ij}(x_st) and thus any single matrix C^{ij}(x_st) could be used to find U. However, to use all the information available in the fourth joint cumulants, JADE simultaneously (approximately) diagonalizes them all, that is, finds U^T as

U^T = argmax_{U : U^T U = I} Σ_{i=1}^{p} Σ_{j=1}^{p} ||diag(U C^{ij}(x_st) U^T)||^2.   (9)

Optimization problems of the type (9) are so-called joint diagonalization problems, for which many algorithms exist; see [21] for discussion and one particular algorithm. In [19] a thorough analysis of the statistical properties of JADE is given, and it is shown there that the JADE estimator is an independent component functional, that is, the resulting components are invariant up to sign-change and permutation under affine transformations of the original data.
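As a brief illustration (our own sketch, not the reference JADE implementation), the empirical cumulant matrices (6) can be formed directly from a standardized sample and jointly diagonalized with the Jacobi-rotation based function rjd of the R package JADE [27]. Since C^{ij}(x_st) = C^{ji}(x_st), the matrices with i ≤ j suffice.

```r
library(JADE)

# Estimate the JADE rotation U^T of (9) from a standardized n x p sample
jade_rotation <- function(Xst) {
  n <- nrow(Xst); p <- ncol(Xst)
  Cs <- array(0, c(p, p, p * (p + 1) / 2))
  k <- 0
  for (i in 1:p) for (j in i:p) {
    k <- k + 1
    # empirical E[x_i x_j x x^T] minus the constant part, cf. (6)
    C <- crossprod(Xst * (Xst[, i] * Xst[, j]), Xst) / n
    E <- matrix(0, p, p); E[i, j] <- 1
    Cs[, , k] <- C - (i == j) * diag(p) - E - t(E)
  }
  t(rjd(Cs)$V)   # joint approximate diagonalizer of (9), transposed
}
```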
III. TENSOR JOINT APPROXIMATE DIAGONALIZATION OF EIGEN-MATRICES

In formulating TJADE we assume that the data are generated by the tensor independent component model (3) and satisfy Assumptions 1, 2 and 3. Assuming E[X] = 0, we next go separately through the tensor analogies of the standardization and rotation steps of the original JADE.

A. Standardization step

We take the same approach to the standardization of X as in [1], that is, we use the m-mode covariance matrices Σ_1,...,Σ_r to standardize X simultaneously from all r modes. This gives us the standardized tensor

X_st := X ×_1 Σ_1^{−1/2} ··· ×_r Σ_r^{−1/2},

where, for the asymptotics, we assume that the standardization functionals Σ_m^{−1/2}, m = 1,...,r, are chosen to be symmetric, see e.g. [22]. Estimates Σ̂_1,...,Σ̂_r of the m-mode covariance matrices are obtained by applying (4) to the empirical distribution of X. The next step towards Z is guided by Theorem 5.3.1 in [1], which states that

X_st = τ · Z ×_1 U_1 ··· ×_r U_r,   (10)

for some orthogonal matrices U_1 ∈ R^{p_1×p_1},...,U_r ∈ R^{p_r×p_r} and for τ = (∏_{m=1}^{r} p_m)^{(r−1)/2} ||Ω_r ⊗ ··· ⊗ Ω_1||_F^{1−r}, where ||·||_F is the Frobenius norm.
B. Rotation step

We extend the cumulant matrices by noting that the operation ×_{−m} provides an m-mode analogy of the outer product of vectors. Writing the random quantity x_i x_j · xx^T in (5) with outer products either as e_i^T xx^T e_j · xx^T or as xx^T e_i e_j^T xx^T, where e_i denotes the ith standard basis vector, two straightforward tensor m-mode analogies of the matrix of fourth cumulants C^{ij}, i, j = 1,...,p_m, in (5) are given by

C^{ij}_{1,m}(X) = ρ_m^{−1} E[e_i^T (X ×_{−m} X) e_j · (X ×_{−m} X)]
 − ρ_m^{−1} E[e_i^T (X* ×_{−m} X*) e_j · (X ×_{−m} X)]
 − ρ_m^{−1} E[e_i^T (X* ×_{−m} X) e_j · (X* ×_{−m} X)]
 − ρ_m^{−1} E[e_i^T (X ×_{−m} X*) e_j · (X* ×_{−m} X)],   (11)

and

C^{ij}_{2,m}(X) = ρ_m^{−1} E[(X ×_{−m} X) E^{ij} (X ×_{−m} X)]
 − ρ_m^{−1} E[(X ×_{−m} X*) E^{ij} (X* ×_{−m} X)]
 − ρ_m^{−1} E[(X* ×_{−m} X) E^{ij} (X* ×_{−m} X)]
 − ρ_m^{−1} E[(X* ×_{−m} X) E^{ij} (X ×_{−m} X*)],   (12)

with m = 1,...,r, where X* is an independent copy of X. Theoretically, a third way to generalize the idea is obtained by considering xx^T e_j e_i^T xx^T. However, that would be redundant, as the resulting set of matrices for i, j = 1,...,p_m is the same as with (12) and the individual matrices can be obtained by just reversing i and j in (12). Naturally, for vector observations, r = 1, both (11) and (12) are equivalent.

Define next for the model (3) its kurtosis tensor κ ∈ R^{p_1 × ··· × p_r} as (κ)_{i_1 ... i_r} := E[z_{i_1 ... i_r}^4] − 3 and its m-mode average kurtosis vector as κ̄^{(m)} := (κ̄_1^{(m)},...,κ̄_{p_m}^{(m)}), where κ̄_k^{(m)} is the average of the excess kurtoses of the random variables in the kth m-mode face of the tensor Z, k = 1,...,p_m. The following theorem then shows that (11) and (12) serve in TJADE the same purpose as their vector counterpart does in JADE.

Theorem 1. If τ, U_1,...,U_r are as defined in (10), then, for c = 1, 2 and m = 1,...,r, the matrices of fourth cumulants C^{ij}_{c,m}, i, j = 1,...,p_m, satisfy

C^{ij}_{c,m}(X_st) = τ^4 · U_m (Σ_{k=1}^{p_m} u_{ik}^{(m)} u_{jk}^{(m)} κ̄_k^{(m)} E^{kk}) U_m^T.
According to Theorem 1, U_m^T simultaneously diagonalizes all matrices C^{ij}_{c,m}(X_st), i, j = 1,...,p_m, regardless of c, giving two straightforward ways of estimating the m-mode rotation U_m using (9) with C^{ij}(x_st) replaced by C^{ij}_{c,m}(X_st) for the chosen value of c. However, in estimating an individual matrix C^{ij}_{c,m}(X_st) in (11) or (12) we have to estimate four matrices in total, the last two of which are costly to estimate because of the independent copies X*. Using the method of the proof of Theorem 1 one can show that, analogously to the vector-valued case,

C^{ij}_{1,m}(X_st) = B^{ij}_{1,m} − Ξ_m (δ_ij ρ_m I + E^{ij} + E^{ji}) Ξ_m^T,

where B^{ij}_{1,m} := ρ_m^{−1} E[e_i^T (X_st ×_{−m} X_st) e_j · (X_st ×_{−m} X_st)] and Ξ_m := ρ_m^{−1} E[X_st ×_{−m} X_st] = τ^2 I, which provides a natural estimator for τ^2. Similarly,

C^{ij}_{2,m}(X_st) = B^{ij}_{2,m} − Ξ_m (δ_ij I + ρ_m E^{ij} + E^{ji}) Ξ_m^T,

where B^{ij}_{2,m} := ρ_m^{−1} E[(X_st ×_{−m} X_st) E^{ij} (X_st ×_{−m} X_st)] and Ξ_m is as above. Natural estimates for the previous matrices are provided by

Ĉ^{ij}_{1,m} := B̂^{ij}_{1,m} − Ξ̂_m (δ_ij ρ_m I + E^{ij} + E^{ji}) Ξ̂_m^T,   (13)

and

Ĉ^{ij}_{2,m} := B̂^{ij}_{2,m} − Ξ̂_m (δ_ij I + ρ_m E^{ij} + E^{ji}) Ξ̂_m^T,   (14)
where i, j = 1,...,p_m, and the estimates B̂^{ij}_{1,m}, B̂^{ij}_{2,m} and Ξ̂_m are obtained by applying the definitions of B^{ij}_{1,m}, B^{ij}_{2,m} and Ξ_m to the empirical distribution of X, including an empirical standardization by Σ̂_1^{−1/2},...,Σ̂_r^{−1/2}. Choosing then either of the sets, c = 1, 2, the rotation matrix U_m^T, m = 1,...,r, is found by simultaneous (approximate) diagonalization as

U_m^T = argmax_{U : U^T U = I} Σ_{i=1}^{p_m} Σ_{j=1}^{p_m} ||diag(U C^{ij}_{c,m}(X_st) U^T)||^2.   (15)
The corresponding estimates Û_m^T, m = 1,...,r, are obtained by replacing in (15) the matrices C^{ij}_{c,m}(X_st) with their estimates Ĉ^{ij}_{c,m}. Combining the standardization and the rotation, the final TJADE algorithm for a sample X_i ∈ R^{p_1 × ··· × p_r}, i = 1,...,n, consists of the following steps (a schematic implementation of the matrix case is sketched at the end of this section).

1) Center the X_i and estimate Σ̂_1,...,Σ̂_r.
2) Standardize: X_i ← X_i ×_1 Σ̂_1^{−1/2} ··· ×_r Σ̂_r^{−1/2}.
3) Choose c and estimate the r rotations Û_1^T,...,Û_r^T by diagonalizing, for each m = 1,...,r, simultaneously the sets Ĉ^{ij}_{c,m}, i, j = 1,...,p_m.
4) Rotate: X_i ← X_i ×_1 Û_1^T ··· ×_r Û_r^T.

Using Lemma 5.1.1 from [1], the final result can be written as the product X_i ×_1 Φ̂_1 ··· ×_r Φ̂_r, where Φ̂_m := Û_m^T Σ̂_m^{−1/2}, m = 1,...,r, is the m-mode TJADE estimate.

Remark 1. Technically, there is no reason why we could not use a different c for estimating different rotations U_m. However, the asymptotic properties of the different approaches are shown in the next section to be equivalent, and thus the choice of c is irrelevant for large enough samples.

For a vector-valued x ∈ R^p and a full-rank matrix A ∈ R^{p×p}, (Ax)_st = U x_st for some orthogonal U [22]. Unfortunately, the analogous relation in the tensor setting,

(X ×_1 A_1 ··· ×_r A_r)_st = X_st ×_1 U_1 ··· ×_r U_r,   (16)

for some orthogonal U_1,...,U_r, holds only for orthogonal A_1,...,A_r. This lack of m-affine equivariance of Σ_m(X), m = 1,...,r, is discussed in [1] along with a conjecture that in the general tensor case, r > 1, no standardization functional exists which would lead to the property (16). In practice this means that outside the model (3) a change (other than a rotation or a reflection) of the coordinate system leads to different estimated components. However, the TJADE estimator is still Fisher consistent by Theorem 1.
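The following R sketch (our own schematic implementation, not the authors' reference code; the function names are hypothetical) carries out steps 1)-4) for matrix-valued data, r = 2, with c = 1, reusing rjd from the R package JADE for the joint diagonalization in (15).

```r
library(JADE)

# Inverse symmetric square root of a symmetric positive definite matrix
sym_inv_sqrt <- function(S) {
  e <- eigen(S, symmetric = TRUE)
  e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)
}

# Rotation of one mode: estimate the matrices (13) from the slices
# Y[, , i] and jointly diagonalize them; rho is the rho_m of that mode
mode_rotation <- function(Y, rho) {
  p <- dim(Y)[1]; n <- dim(Y)[3]
  M <- array(apply(Y, 3, tcrossprod), c(p, p, n))    # slices Y_i Y_i^T
  Xi <- apply(M, c(1, 2), mean) / rho                # hat Xi_m, approx tau^2 I
  Cs <- array(0, c(p, p, p * (p + 1) / 2))           # C^{ij} = C^{ji}: i <= j
  k <- 0
  for (i in 1:p) for (j in i:p) {
    k <- k + 1
    B <- apply(M * rep(M[i, j, ], each = p^2), c(1, 2), mean) / rho  # hat B^{ij}
    E <- matrix(0, p, p); E[i, j] <- 1
    Cs[, , k] <- B - Xi %*% ((i == j) * rho * diag(p) + E + t(E)) %*% t(Xi)
  }
  t(rjd(Cs)$V)                                       # hat U_m^T of (15)
}

# TJADE (c = 1) for a sample of p1 x p2 matrices in a p1 x p2 x n array
tjade2 <- function(X) {
  p1 <- dim(X)[1]; p2 <- dim(X)[2]; n <- dim(X)[3]
  X <- X - c(apply(X, c(1, 2), mean))                # step 1: center
  S1 <- apply(array(apply(X, 3, tcrossprod), c(p1, p1, n)), c(1, 2), mean) / p2
  S2 <- apply(array(apply(X, 3, crossprod),  c(p2, p2, n)), c(1, 2), mean) / p1
  L <- sym_inv_sqrt(S1); R <- sym_inv_sqrt(S2)
  Xst <- array(apply(X, 3, function(x) L %*% x %*% R), c(p1, p2, n))  # step 2
  U1 <- mode_rotation(Xst, rho = p2)                 # step 3, mode 1
  U2 <- mode_rotation(aperm(Xst, c(2, 1, 3)), rho = p1)
  Z  <- array(apply(Xst, 3, function(x) U1 %*% x %*% t(U2)), c(p1, p2, n))
  list(Phi1 = U1 %*% L, Phi2 = U2 %*% R, Z = Z)      # step 4 and the estimates
}
```

Up to the sign, order and scale indeterminacies discussed after model (3), Phi1 and Phi2 then estimate the inverses of the mixing matrices Ω_1 and Ω_2.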
IV. ASYMPTOTIC PROPERTIES

The asymptotic properties of JADE were considered in [23], [19], [24] and are in [19], [24] based on the fact that the JADE functional is affine equivariant, allowing those authors to consider only the case of no mixing, Ω = I. In the following we consider the analogous case of Ω_1 = I,...,Ω_r = I for TJADE. However, because of the lack of full affine equivariance, the results generalize only to orthogonal mixing from all r modes.

For a tensor X ∈ R^{p_1 × ··· × p_r} define its m-flattening X_(m) ∈ R^{p_m × ρ_m} as the horizontal stacking of all m-mode vectors of the tensor into a matrix in a predefined order, see [25] for a rigorous definition. If the stacking order is assumed to be cyclical in the dimensions in the sense of [25], we have for X* := X ×_1 A_1 ··· ×_r A_r the identity

X*_(m) = A_m X_(m) (A_{m+1} ⊗ ··· ⊗ A_r ⊗ A_1 ⊗ ··· ⊗ A_{m−1})^T.   (17)

The reason why m-flattening is particularly useful for us is that it allows us to write the m-mode product of a tensor with itself as an ordinary matrix product, namely X ×_{−m} X = X_(m) X_(m)^T, regardless of the stacking order. This, combined with the fact that the matrices C^{ij}_{c,m}(X) depend on X only via the previous product, implies that it is sufficient to derive the asymptotics for the case r = 2 only. The results for tensors of order r > 2 are then obtained by applying the case r = 2 to each of the m-flattened matrices X_(1),...,X_(r). Similarly, even for the case r = 2 we only need to consider the 1-mode TJADE estimate Φ̂_1 (matrix multiplication from the left), as the results for Φ̂_2 follow by simply transposing X or, in the language of tensors, flattening X from the second mode. Interestingly, we also have no need to specify the used set of cumulant matrices c, as the two choices, c = 1 and c = 2, are shown to lead to asymptotically equivalent estimators.

We next provide the asymptotic expressions for the elements of the TJADE estimate Φ̂_1 =: Φ̂ in the case of a matrix-valued sample X_i ∈ R^{p_1 × p_2}, i = 1,...,n. The asymptotic properties of Φ̂ can be shown to depend on row means of various moments of Z, particularly on the elements of κ̄^{(1)} but also on the following:

β̄^{(1)} := (1/p_2) Σ_{l=1}^{p_2} (E[z_{1l}^4],...,E[z_{p_1 l}^4])^T,

ω̄^{(1)} := (1/p_2) Σ_{l=1}^{p_2} (Var[z_{1l}^3],...,Var[z_{p_1 l}^3])^T.

Define further the covariance of two rows of kurtoses as

ρ_{kk'} = (1/p_2) Σ_{l=1}^{p_2} (β_{kl} β_{k'l}) − β̄_k^{(1)} β̄_{k'}^{(1)},

where β_{kl} := E[z_{kl}^4]. To construct an asymptotic expression for Φ̂ in Theorem 2 we need the terms
ŝ_{kk'} := (1/p_2) Σ_{l=1}^{p_2} ((1/n) Σ_{i=1}^{n} z_{i,kl} z_{i,k'l}),

q̂_{kk'} := (1/p_2) Σ_{l=1}^{p_2} ((1/n) Σ_{i=1}^{n} (z_{i,kl}^3 − E[z_{kl}^3]) z_{i,k'l}),

r̂_{kk'} := (1/p_2) Σ_{l=1}^{p_2} Σ_{l'≠l} ((1/n) Σ_{i=1}^{n} z_{i,kl}^2 z_{i,kl'} z_{i,k'l'}),

the joint limiting normality of which is easy to show, assuming that the eighth moments of Z exist.

Theorem 2. Let Z_1,...,Z_n be a random sample from a distribution with finite eighth moments satisfying Assumptions 1, 2 and 4 (see below). Then there exists a sequence of TJADE estimates such that Φ̂ →_P I and

√n(φ̂_kk − 1) = −(1/2) √n(ŝ_kk − 1) + o_P(1),

√n φ̂_{kk'} = (√n ψ̂_{kk'} + √n ψ̂_{k'k} − d_{kk'} √n ŝ_{kk'}) / ((κ̄_k^{(1)})^2 + (κ̄_{k'}^{(1)})^2) + o_P(1),

where k ≠ k', ψ̂_{kk'} := κ̄_k^{(1)} (r̂_{kk'} + q̂_{kk'}) and d_{kk'} := (p_2 + 2)(κ̄_k^{(1)} − κ̄_{k'}^{(1)}) + (κ̄_k^{(1)})^2.
Using the expressions of Theorem 2, the asymptotic variances of the elements of Φ̂ can now be computed.

Corollary 1. Under the assumptions of Theorem 2 the limiting distribution of √n vec(Φ̂ − I) is multivariate normal with mean vector 0 and the following asymptotic variances:

ASV(φ̂_kk) = (β̄_k^{(1)} − 1) / (4 p_2),

ASV(φ̂_{kk'}) = (ζ_k^{(1)} + ζ_{k'}^{(1)} + (κ̄_{k'}^{(1)})^4 − 2 κ̄_k^{(1)} κ̄_{k'}^{(1)} ρ_{kk'}) / (p_2 ((κ̄_k^{(1)})^2 + (κ̄_{k'}^{(1)})^2)^2),   k ≠ k',

where ζ_k^{(1)} := (κ̄_k^{(1)})^2 [ω̄_k^{(1)} − (β̄_k^{(1)})^2] + (κ̄_k^{(1)})^2 (κ̄_k^{(1)} − 2)(p_2 − 1).

It is easily seen that the expressions in Corollary 1 revert to the forms of Corollary 4 in [19] when r = 1, that is, when we observe just a vector x. In this case κ̄^{(1)} contains just the element-wise kurtoses of the elements of z. Of the popular ICA methods FastICA, FOBI and JADE, it is well known that only for FOBI does the asymptotic behavior of φ̂_{kk'} depend on components other than z_k and z_{k'}. The analogous result holds also for TFOBI and TJADE in the sense that in TFOBI the asymptotic behavior of φ̂^{(m)}_{kk'} depends on the whole tensor Z [1], while in TJADE it depends only on the kth and k'th m-mode faces of Z.

The denominators in Theorem 2 imply that for the existence of the limiting distributions we need the following assumption.

Assumption 4. For each m = 1,...,r, at most one of the components of κ̄^{(m)} is zero.

Assumption 4 for TJADE is much less restrictive than the assumption needed for TFOBI (for each m = 1,...,r the components of κ̄^{(m)} must be distinct [1]) and the one needed for vector JADE (at most one element of κ may be zero [19]).
More specifically, in TJADE, and in tensor independent component analysis in general, several individual elements of Z are allowed to be Gaussian, as long as Assumption 3 is not violated. Conveniently located, a majority of the elements of Z can thus be Gaussian. The analytical comparison of TJADE and TFOBI via the asymptotic variances involves in the general case rather complicated expressions, and thus we resort to simulations for their comparison in the next section.

V. SIMULATIONS AND EXAMPLES

In the following, all computations were done in R 3.1.2 [26], especially using the R packages JADE [27], Rcpp [28], [29] and ggplot2 [30]. For the approximate joint diagonalization, an algorithm based on Jacobi rotations was used, see e.g. [21]. Testing the algorithms in various settings showed that both c = 1 and c = 2 yield almost identical results with respect to the MDI values (see below), but the former is computationally more efficient, and thus the TJADE solution in the simulations is computed with the choice c = 1.

A. Efficiency comparisons

We compared the separation performance of TJADE with its nearest competitor, TFOBI, and also with regular FOBI and JADE as applied to vectorized tensor data, called here VFOBI and VJADE. Note that VFOBI and VJADE do not use the prior information on the data structure and are therefore expected to perform worse than TFOBI and TJADE. The simulation setting was the same as in [1]: we simulated n independent 3 × 4 matrix observations with individual elements coming from a diverse array of distributions. The excess kurtoses of the distributions used were −1.2, −0.6, 0, 1, 2, 3, 4, 5, 6, 8, 10 and 15, and the exact distributions are given in the Appendix. We generated 2000 repetitions for each of the sample sizes n = 1000, 2000, 4000, 8000, 16000, 32000, and for each sample the same data were mixed using three different distributions for the elements of the 1-mode and 2-mode mixing matrices, A and B. In the first case the mixing matrices were random orthogonal matrices of sizes 3 × 3 and 4 × 4 distributed uniformly with respect to the Haar measure. In the second and third cases the elements of both matrices were generated independently from the N(0, 1) and Uniform(−1, 1) distributions, respectively.

The mixed data were then subjected to each of the four methods, which produced the four unmixing matrix estimates Φ̂_VF, (Φ̂_{2,TF} ⊗ Φ̂_{1,TF}), Φ̂_VJ and (Φ̂_{2,TJ} ⊗ Φ̂_{1,TJ}). To allow comparison, we took the Kronecker product of the 2-mode and 1-mode unmixing matrices of TFOBI and TJADE, meaning that all four of the previous matrices estimate the inverse of the same matrix (B ⊗ A), up to scaling, sign-change and permutation of its columns. The actual comparison was done by first computing the minimum distance index (MDI) [31] of the estimates,

D(Φ̂Ω) = (1/√(p − 1)) inf_{C ∈ C} ||C Φ̂Ω − I||_F,

where Φ̂ ∈ R^{p×p} is the estimated unmixing matrix, Ω ∈ R^{p×p} is the true mixing matrix and C is the set of all p × p matrices having a single non-zero element in each row and column. The MDI thus measures how far away Φ̂Ω is from the set C. The index varies from 0 to 1, with 0 indicating that the separation worked perfectly.
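The minimum distance index is implemented as the function MD in the R package JADE [27]; the following toy check (our own example) illustrates its use.

```r
library(JADE)
set.seed(1)
Omega <- matrix(rnorm(16), 4)                          # a true 4 x 4 mixing matrix
Phi <- solve(Omega) + matrix(rnorm(16, sd = 0.01), 4)  # a noisy unmixing estimate
MD(Phi, Omega)                                         # close to 0 for a good estimate
```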
In our simulations we further transformed the MDI values as n(p − 1)MDI², a quantity which in vector-valued independent component analysis converges in distribution to a random variable with finite mean and variance [31]. The mean transformed MDI values for the different sample sizes, methods and mixing matrices are shown in Figure 1.

Fig. 1. Plot of sample size versus the transformation n(p − 1)MDI² under combinations of the four different methods and three different distributions for the mixing matrices.

The lines for both FOBI and JADE are identical for all mixings, since both methods are affine equivariant. For TFOBI and TJADE the separation is best under orthogonal mixing, the results for normal and uniform mixing being somewhat worse. But the main implication of the plot is that none of the other methods can really compete with TJADE in matrix independent component analysis. Interestingly, also regular JADE combined with vectorization is better than TFOBI.

B. Assumption comparisons

As our second simulation study we compared the four methods of the previous simulation via their assumptions. For this we constructed three simulation settings of 3 × 3 × 2 tensors with independent elements having either Gaussian (N), Laplace (L), exponential (E) or continuous uniform (U) distributions, standardized to have zero means and unit variances. The distributions of the tensors are shown in the following by the two 3 × 3 × 1 faces of each setting:

             U U U    N L E
Setting 1 :  U L L    L L E
             U L E    E E E

             U U U    N L L
Setting 2 :  U L L    L L L
             U L L    L L L

             E E N    N N N
Setting 3 :  N N N    E E N
             N N N    N N N

It is easy to see that none of the above settings satisfies the assumptions of VFOBI, as all of them have at least two identical components. Only Setting 1 satisfies the assumption of TFOBI on distinct kurtosis means in all modes, and Settings 1 and 2 satisfy the assumption of maximally one component having zero excess kurtosis required by VJADE. All three settings satisfy Assumption 4 on maximally one zero kurtosis mean in each mode required by TJADE.

We simulated 2000 repetitions of all three settings for different sample sizes using identity mixing, and the resulting transformed MDI values of the four methods are depicted in Figure 2. The above reasoning about the violation of the assumptions is clearly visible in the plots: the mean transformed MDI values of the different methods break down one by one as the setting changes from 1 to 2 to 3, leaving TJADE as the only method able to handle all three settings. Interestingly, VJADE failed to converge 4601 times out of the 36000 total repetitions across all settings, the majority of the failures occurring in the third setting.

Fig. 2. Means of transformed MDI values for different combinations of setting, sample size and method. Moving from left to right, all other methods but TJADE break down one by one.
The plot for Setting 1 further indicates that there exist cases where TFOBI beats VJADE, proving that, though very efficient, the JADE methodology itself is not the only factor behind the superior performance of TJADE; the tensor structure also plays an important role.

C. Real data example

Extreme kurtosis can be shown to be associated with multimodal distributions, and thus independent component analysis is commonly used as a preprocessing step in classification to obtain directions of interest. In this spirit we consider the semeion¹ data set, available in the UCI Machine Learning Repository [32], as a classification problem. The data consist of 1593 binary 16 × 16 pixel images of hand-written digits. For this example we chose only the images representing the digits 0, 1 and 7, with respective group sizes of 161, 162 and 158. The objective is to find a few components separating the three digits.

Subjecting the data to TJADE gives the results depicted in Figure 3. The left-hand side plot shows the scatter plot of the two resulting components with the lowest kurtoses, using the individual digit images as plot markers. Clearly the two found directions are sufficient to separate all three groups of digits. The same conclusion can be drawn from the corresponding density estimates and rug plots on the right-hand side of Figure 3. As a next step, some low-dimensional classification algorithm could be applied to the extracted components to create a classification rule.

Fig. 3. Results of applying TJADE to the semeion data. The plot on the left-hand side shows the scatter plot of the two components having the lowest kurtoses found by TJADE, with the individual images as markers. The three digits clearly form three groups in the plane. The density plots along with the rugs on the right-hand side imply the same. The lower rug corresponds to the component with the lowest kurtosis and the coloring of the groups in the rugs is the same as in the scatter plot.

VI. DISCUSSION

In this paper we proposed TJADE, an extension of the classic JADE suited for tensor-valued observations. Based on the same idea of diagonalizing multiple cumulant matrices as JADE, TJADE was shown to be very effective in solving the independent component problem. In the course of the paper we first reviewed the theory and the algorithm behind JADE and then formulated TJADE analogously, giving two different, although asymptotically equivalent, ways of estimating the needed rotations. The asymptotic behaviors of the elements of the TJADE estimates under orthogonal mixing were then provided, allowing theoretical comparison with other methods. Finally, simulation studies comparing TJADE to TFOBI and to the naïve approaches combining vectorization with either FOBI or JADE showed that TJADE is superior to all the previous competitors in tensor independent component analysis.

As the investigation of ICA methods for non-vector-valued objects is still at an early stage, much further research is needed. Below we outline some ideas planned to follow this work. As the number of matrices to jointly diagonalize in estimating the m-mode rotation in TJADE grows proportionally to the square of the corresponding dimension p_m, an extension like k-JADE [33] is worth considering also for TJADE. As a competing alternative, a tensor version of the FastICA algorithm [34] will also be investigated. This opens many possibilities, allowing choosing both the non-linearity function g and the norm used in the maximization problem, see [35].

¹ Semeion Research Center of Sciences of Communication, via Sersale 117, 00128 Rome, Italy; Tattile Via Gaetano Donizetti, 1-3-5, 25030 Mairano (Brescia), Italy.
APPENDIX

The distributions used in the first simulation of Section V are, starting from the upper left corner of the matrix and moving down and right, Uniform(−√3, √3), Triangular(−√6, √6, 0), N(0, 1), t_10, Gamma(3, 3), Laplace(0, 1/√2), χ²_3, Gamma(1.2, 1.2), Exp(1), χ²_{1.5}, χ²_{1.2} and InverseGaussian(1, 1). The distributions were further shifted and scaled to have zero means and unit variances.
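For concreteness, the following R samplers (our own sketch) generate standardized draws from the twelve distributions above; rinvgauss is taken from the CRAN package statmod, which is an assumption on our part, as base R has no inverse Gaussian sampler.

```r
library(statmod)  # assumed source of rinvgauss()
src <- list(
  function(n) runif(n, -sqrt(3), sqrt(3)),            # Uniform, already unit var
  function(n) sqrt(6) * (runif(n) - runif(n)),        # Triangular(-sqrt6, sqrt6, 0)
  function(n) rnorm(n),                               # N(0, 1)
  function(n) rt(n, 10) / sqrt(10 / 8),               # t_10, Var = df / (df - 2)
  function(n) (rgamma(n, 3, 3) - 1) * sqrt(3),        # Gamma(3, 3)
  function(n) (rexp(n) - rexp(n)) / sqrt(2),          # Laplace(0, 1/sqrt2)
  function(n) (rchisq(n, 3) - 3) / sqrt(6),           # chi^2_3
  function(n) (rgamma(n, 1.2, 1.2) - 1) * sqrt(1.2),  # Gamma(1.2, 1.2)
  function(n) rexp(n) - 1,                            # Exp(1)
  function(n) (rchisq(n, 1.5) - 1.5) / sqrt(3),       # chi^2_1.5
  function(n) (rchisq(n, 1.2) - 1.2) / sqrt(2.4),     # chi^2_1.2
  function(n) rinvgauss(n, mean = 1, shape = 1) - 1   # InverseGaussian(1, 1)
)
```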
The proof of Theorem 1. Consider first the case c = 1 and the four terms in (11) separately, fixing the choice of m.
Denoting the first term of (11) by B^{ij}_{1,m}(X), according to Lemma 5.4.1 in [1] we have

B^{ij}_{1,m}(X_st) = (τ^4 / ρ_m) U_m E[(u_i^{(m)})^T Z_(m) Z_(m)^T u_j^{(m)} · Z_(m) Z_(m)^T] U_m^T,

where Z_(m) is the flattened matrix defined in Section IV and (u_i^{(m)})^T is the ith row of U_m. Using the standard properties of the expected value and of independent random variables, the (k, k') element of the inner expectation can be shown to equal u_{ik}^{(m)} u_{jk'}^{(m)} + u_{jk}^{(m)} u_{ik'}^{(m)} for k ≠ k' and δ_ij ρ_m + u_{ik}^{(m)} u_{jk}^{(m)} (κ̄_k^{(m)} + 2) for k = k'. Using these to construct a matrix form for the expectation, we have

B^{ij}_{1,m}(X_st) = τ^4 U_m (Σ_{k=1}^{p_m} u_{ik}^{(m)} u_{jk}^{(m)} κ̄_k^{(m)} E^{kk}) U_m^T + τ^4 δ_ij ρ_m I + τ^4 E^{ij} + τ^4 E^{ji}.

The second, third and fourth terms in (11) then serve to remove the extra constant terms above. That they indeed cancel them one by one can easily be shown by examining them in the above manner using the independence of X and X*. This concludes the proof for c = 1, and the corresponding result for c = 2 can be proven in precisely the same manner.
The proof of Theorem 2. The consistency of the TJADE estimator is proven similarly to the consistency of the TFOBI estimator in the proof of Theorem 5.2.1 in [1]. In the following we assume that r = 2 and that we are interested in the asymptotic behavior of the 1-mode unmixing matrix. As discussed in Section IV, for the general case of an arbitrary r and the m-mode unmixing matrix, it suffices to m-flatten the tensor and replace in the following Σ̂_1^{−1/2} by Σ̂_m^{−1/2}, Σ̂_2^{−1/2} by Σ̂_{m+1}^{−1/2} ⊗ ··· ⊗ Σ̂_r^{−1/2} ⊗ Σ̂_1^{−1/2} ⊗ ··· ⊗ Σ̂_{m−1}^{−1/2} and p_2 by ρ_m, and to use the corresponding row mean quantities. For the asymptotic expressions of the diagonal elements of √n(Φ̂ − I) it suffices to use the same arguments as in the proof of Theorem 5.2.1 in [1], and for the off-diagonal elements we aim to use Lemma 2 from [19].

But first, define the symmetric standardization functionals L̂ = (l̂_{kk'}) := Σ̂_1^{−1/2} and R̂ = (r̂_{ll'}) := Σ̂_2^{−1/2}, giving the standardized identity-mixed observations as X_{st,i} = L̂ Z̃_i R̂^T, where Z̃_i = Z_i − Z̄. We then have √n(l̂_{kk'} − δ_{kk'}) = −(1/2) √n(ŝ_{kk'} − δ_{kk'}) + o_P(1), see [1], and as simple moment-based estimators we have both √n(L̂ − I) = O_P(1) and √n(R̂ − I) = O_P(1), regardless of whether we really have r = 2 or use flattened tensors of higher order.

Assume then first that c = 1. The matrices Ĉ^{kk'}_{1,1}, k, k' = 1,...,p, in (13) to be simultaneously diagonalized satisfy Ĉ^{kk'} := Ĉ^{kk'}_{1,1} →_P C^{kk'}(Z_i) = δ_{kk'} κ̄_k^{(1)} E^{kk}. In the view of Lemma 2 in [19] this means that the only matrices C^{rs}_{1,1}(Z_i), r, s = 1,...,p, having non-zero kth or k'th diagonal elements are C^{kk}_{1,1}(Z_i) and C^{k'k'}_{1,1}(Z_i), respectively, yielding the following form for the (k, k'), k ≠ k', element of the rotation Û := Û_1^T estimated by (15):

√n û_{kk'} = (κ̄_k^{(1)} √n Ĉ^{kk}_{kk'} − κ̄_{k'}^{(1)} √n Ĉ^{k'k'}_{kk'}) / ((κ̄_k^{(1)})^2 + (κ̄_{k'}^{(1)})^2) + o_P(1),

where Ĉ^{kk}_{rs} is the (r, s) element of Ĉ^{kk}. The above expression, together with the (k, k'), k ≠ k', element of the left standardization matrix L̂, gives an asymptotic expression for the off-diagonal elements of the estimated left TJADE matrix, see [1]:

√n φ̂_{kk'} = √n û_{kk'} + √n l̂_{kk'} + o_P(1),   (18)

reducing the problem of finding the asymptotics of TJADE to the task of finding the asymptotic behaviors of √n Ĉ^{kk}_{kk'} and √n Ĉ^{k'k'}_{kk'}. Dropping the subscripts for clarity, note that Ĉ^{aa} = B̂^{aa} − Ξ̂(p_2 I + 2E^{aa})Ξ̂^T, and starting from B̂^{aa} write it out as

B̂^{aa} = (1/(p_2 n)) Σ_{i=1}^{n} (L̂_a^T Z̃_i R̂* Z̃_i^T L̂_a) · L̂ Z̃_i R̂* Z̃_i^T L̂^T,

where L̂_a^T is the ath row of L̂ and R̂* := R̂^T R̂. An arbitrary off-diagonal element of √n(B̂^{aa} − B^{aa}(Z_i)) then has, after the matrix multiplication, the form

√n B̂^{aa}_{kk'} = (1/p_2) Σ_{d,e,f,g,s,t,u,v} √n r̂*_{ef} r̂*_{tu} l̂_{ad} l̂_{ag} l̂_{ks} l̂_{k'v} Ĥ_{de,gf,st,vu},   (19)

where Ĥ_{de,gf,st,vu} = (1/n) Σ_{i=1}^{n} z̃_{i,de} z̃_{i,gf} z̃_{i,st} z̃_{i,vu} →_P E[z_{i,de} z_{i,gf} z_{i,st} z_{i,vu}]. Next we expand the multiplicands r̂*_{··} and l̂_{··} in (19) one by one, such as l̂_{ab} = (l̂_{ab} − δ_{ab}) + δ_{ab}, the first term of which is O_P(1) when combined with √n, allowing the use of Slutsky's theorem on the whole multiple sum, and the second term of which produces an expression like (19) only with one summation index less.

Starting from the left, this process then produces the terms o_P(1); o_P(1); δ_{ak} √n l̂_{kk'} + δ_{ak'} √n l̂_{k'k} + o_P(1); δ_{ak} √n l̂_{kk'} + δ_{ak'} √n l̂_{k'k} + o_P(1); δ_{ak'}(κ̄_{k'}^{(1)} + p_2 + 2) √n l̂_{kk'} + (1 − δ_{ak'}) p_2 √n l̂_{kk'} + o_P(1); and δ_{ak}(κ̄_k^{(1)} + p_2 + 2) √n l̂_{k'k} + (1 − δ_{ak}) p_2 √n l̂_{k'k} + o_P(1), finally leaving us with the expression

(1/p_2) Σ_{e,t} (1/√n) Σ_{i=1}^{n} z̃_{i,ae}^2 z̃_{i,kt} z̃_{i,k't} + o_P(1).   (20)

Substituting now either a = k or a = k', expanding z̃_{i,ab} = z_{i,ab} − z̄_{ab} and using the quantities defined in Section IV, the expression in (20) takes the forms √n r̂_{kk'} + √n q̂_{kk'} + o_P(1) and √n r̂_{k'k} + √n q̂_{k'k} + o_P(1), respectively. Using the above, e.g. √n B̂^{kk}_{kk'} takes the form

(p_2 + 2) √n l̂_{kk'} + (κ̄_k^{(1)} + p_2 + 2) √n l̂_{k'k} + √n r̂_{kk'} + √n q̂_{kk'} + o_P(1).

For the asymptotic behavior of the remaining term Ξ̂(p_2 I + 2E^{aa})Ξ̂^T one can first use techniques similar to the above to show for Ξ̂ = (ξ̂_{kk'}) that √n(ξ̂_{kk'} − δ_{kk'}) = o_P(1) for an arbitrary off-diagonal element, k ≠ k'. Consequently, √n(Ξ̂(p_2 I + 2E^{aa})Ξ̂^T − p_2 I − 2E^{aa}) is also o_P(1), implying that this term contributes nothing to the asymptotic variances of the estimator. Thus √n Ĉ^{aa}_{kk'} = √n B̂^{aa}_{kk'} + o_P(1), and the result of Theorem 2 is obtained by plugging everything into (18) and using the fact that the standardization functionals are symmetric. The asymptotic variances of Corollary 1 are then straightforward to obtain, e.g. using the table of covariances in the proof of Theorem 5.2.1 in [1].

Although the starting expressions for c = 1 and c = 2 are different, the final expressions for both √n Ĉ^{kk}_{kk'} and √n Ĉ^{k'k'}_{kk'} actually match exactly. The corresponding proof for c = 2 is then obtained in exactly the same manner, expanding the terms suitably and using Slutsky's theorem, and is thus omitted here.
ACKNOWLEDGMENTS

The work of Virta, Nordhausen and Oja was supported by the Academy of Finland Grant 268703. The work of Li was supported by the National Science Foundation Grant DMS-1407537.

REFERENCES
[1] J. Virta, B. Li, K. Nordhausen, and H. Oja, "Independent component analysis for tensor-valued data," arXiv preprint arXiv:1602.00879, 2016.
[2] K. Werner, M. Jansson, and P. Stoica, "On estimation of covariance matrices with Kronecker product structure," IEEE Transactions on Signal Processing, vol. 56, no. 2, pp. 478–491, 2008.
[3] M. S. Srivastava, T. von Rosen, and D. von Rosen, "Models with a Kronecker product covariance structure: Estimation and testing," Mathematical Methods of Statistics, vol. 17, no. 4, pp. 357–370, 2008.
[4] A. Wiesel, "Geodesic convexity and covariance estimation," IEEE Transactions on Signal Processing, vol. 60, no. 12, pp. 6182–6189, 2012.
[5] Y. Sun, P. Babu, and D. P. Palomar, "Robust estimation of structured covariance matrix for heavy-tailed elliptical distributions," arXiv preprint arXiv:1506.05215, 2015.
[6] M. Ohlson, M. R. Ahmad, and D. von Rosen, "The multilinear normal distribution: Introduction and some basic properties," Journal of Multivariate Analysis, vol. 113, pp. 37–47, 2013.
[7] A. M. Manceur and P. Dutilleul, "Maximum likelihood estimation for the tensor normal distribution: Algorithm, minimum sample size, and empirical bias and dispersion," Journal of Computational and Applied Mathematics, vol. 239, pp. 37–49, 2013.
[8] T. G. Kolda and B. W. Bader, "Tensor decompositions and applications," SIAM Review, vol. 51, no. 3, pp. 455–500, 2009.
[9] H. Lu, K. N. Plataniotis, and A. N. Venetsanopoulos, "A survey of multilinear subspace learning for tensor data," Pattern Recognition, vol. 44, no. 7, pp. 1540–1551, 2011.
[10] M. A. O. Vasilescu and D. Terzopoulos, "Multilinear independent components analysis," in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1. IEEE, 2005, pp. 547–553.
[11] L. Zhang, Q. Gao, and L. Zhang, "Directional independent component analysis with tensor representation," in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008, pp. 1–7.
[12] B. Li, M. K. Kim, and N. Altman, "On dimension folding of matrix- or array-valued statistical objects," The Annals of Statistics, pp. 1094–1121, 2010.
[13] R. M. Pfeiffer, L. Forzani, and E. Bura, "Sufficient dimension reduction for longitudinally measured predictors," Statistics in Medicine, vol. 31, no. 22, pp. 2414–2427, 2012.
[14] Y. Xue and X. Yin, "Sufficient dimension folding for regression mean function," Journal of Computational and Graphical Statistics, vol. 23, no. 4, pp. 1028–1043, 2014.
[15] S. Ding and R. D. Cook, "Tensor sliced inverse regression," Journal of Multivariate Analysis, vol. 133, pp. 216–231, 2015.
[16] ——, "Dimension folding PCA and PFC for matrix-valued predictors," Statistica Sinica, vol. 24, pp. 463–492, 2014.
[17] K. Greenewald and A. Hero, "Robust Kronecker product PCA for spatio-temporal covariance estimation," IEEE Transactions on Signal Processing, vol. 63, no. 23, pp. 6368–6378, 2015.
[18] J.-F. Cardoso, "Source separation using higher order moments," in International Conference on Acoustics, Speech, and Signal Processing, 1989. ICASSP-89. IEEE, 1989, pp. 2109–2112.
[19] J. Miettinen, S. Taskinen, K. Nordhausen, and H. Oja, "Fourth moments and independent component analysis," Statistical Science, vol. 30, no. 3, pp. 372–390, 2015.
[20] J.-F. Cardoso and A. Souloumiac, "Blind beamforming for non-Gaussian signals," in IEE Proceedings F (Radar and Signal Processing), vol. 140, no. 6. IET, 1993, pp. 362–370.
[21] A. Belouchrani, K. Abed-Meraim, J.-F. Cardoso, and E. Moulines, "A blind source separation technique using second-order statistics," IEEE Transactions on Signal Processing, vol. 45, no. 2, pp. 434–444, 1997.
[22] P. Ilmonen, H. Oja, and R. Serfling, "On invariant coordinate system (ICS) functionals," International Statistical Review, vol. 80, no. 1, pp. 93–110, 2012.
[23] S. Bonhomme and J.-M. Robin, "Consistent noisy independent component analysis," Journal of Econometrics, vol. 149, no. 1, pp. 12–25, 2009.
[24] J. Virta, K. Nordhausen, and H. Oja, "Joint use of third and fourth cumulants in independent component analysis," arXiv preprint arXiv:1505.02613, 2015.
[25] L. De Lathauwer, B. De Moor, and J. Vandewalle, "A multilinear singular value decomposition," SIAM Journal on Matrix Analysis and Applications, vol. 21, no. 4, pp. 1253–1278, 2000.
[26] R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2014.
[27] J. Miettinen, K. Nordhausen, and S. Taskinen, "Blind source separation based on joint diagonalization in R: The packages JADE and BSSasymp," conditionally accepted for publication in the Journal of Statistical Software, 2015.
[28] D. Eddelbuettel, R. François, J. Allaire, J. Chambers, D. Bates, and K. Ushey, "Rcpp: Seamless R and C++ integration," Journal of Statistical Software, vol. 40, no. 8, pp. 1–18, 2011.
[29] D. Eddelbuettel, Seamless R and C++ Integration with Rcpp. Springer, 2013.
[30] H. Wickham, ggplot2: Elegant Graphics for Data Analysis. Springer New York, 2009. [Online]. Available: http://had.co.nz/ggplot2/book
[31] P. Ilmonen, K. Nordhausen, H. Oja, and E. Ollila, "A new performance index for ICA: Properties, computation and asymptotic analysis," in Latent Variable Analysis and Signal Separation. Springer, 2010, pp. 229–236.
[32] M. Lichman, "UCI machine learning repository," 2013. [Online]. Available: http://archive.ics.uci.edu/ml
[33] J. Miettinen, K. Nordhausen, H. Oja, and S. Taskinen, "Fast equivariant JADE," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2013, May 2013, pp. 6153–6157.
[34] A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis. New York, USA: John Wiley & Sons, 2001.
[35] J. Miettinen, K. Nordhausen, H. Oja, S. Taskinen, and J. Virta, "The squared symmetric FastICA estimator," arXiv preprint arXiv:1512.05534, 2015.