Condition numbers for the tensor rank decomposition

1 downloads 0 Views 578KB Size Report
Jun 16, 2017 - asymptotically by the product of the condition number and the norm of the ... Keywords: tensor rank decomposition, CP, condition number, ...
Condition numbers for the tensor rank decomposition Nick Vannieuwenhoven1 KU Leuven, Department of Computer Science, Leuven, Belgium.

Abstract The tensor rank decomposition problem consists of recovering the parameters of the model from an identifiable low-rank tensor. These parameters are analyzed and interpreted in many applications. As tensors are often perturbed by measurement errors in practice, one must investigate to what extent the unique parameters change in order to preserve the validity of the analysis. The magnitude of this change can be bounded asymptotically by the product of the condition number and the norm of the perturbation to the tensor. This paper introduces a condition number that admits a closed expression as the inverse of a particular singular value of Terracini’s matrix, which represents the tangent space to the set of tensors of fixed rank. A practical algorithm for computing this condition number is presented. The latter’s elementary properties such as scaling and orthogonal invariance are established. Rank-1 tensors are always well-conditioned. The class of weak 3-orthogonal tensors, which includes orthogonally decomposable tensors, contains both well-conditioned and ill-conditioned problems. Numerical experiments confirm that the condition number yields a good estimate of the magnitude of the change of the parameters when the tensor is perturbed. They also suggest that the condition number can be inversely related to the distance to ill-posed tensor rank decomposition problems, where the ill-posedness arises either from the nonclosedness of the set of tensors of fixed rank or from the existence of infinitely many decompositions. Keywords: tensor rank decomposition, CP, condition number, stability analysis, Terracini’s matrix 2010 MSC: 65F35, 15A69, 15A72, 65H04, 14Q15, 14N05, 58A05, 58A07

1. Introduction A tensor is an element of T = Fn1 ⊗ · · · ⊗ Fnd , where F = C or R, nk ∈ N, and ⊗ denotes the tensor product. A rank-1 tensor in T is defined as a1 ⊗ · · · ⊗ ad with ak ∈ Fnk \ {0}. Every tensor is expressible as a linear combination of rank-1 tensors: A=

r X

a1i ⊗ a2i ⊗ · · · ⊗ adi ,

where aki ∈ Fnk .

(1)

i=1

If the number of terms r is minimal, then this decomposition due to Hitchcock [29] is called a tensor rank decomposition. The integer r is then called the rank of the tensor. The representative p = (a1 , . . . , ad ) of a rank-1 tensor a1 ⊗ · · · ⊗ ad is called essentially unique or identifiable because the vectors ak are unique up to scaling: if q = (b1 , . . . , bd ) is such that a1 ⊗ · · · ⊗ ad = b1 ⊗ · · · ⊗ bd , then bk = λk ak for k = 1, . . . , d, and furthermore λ1 · · · λd = 1. A rank-r tensor with d ≥ 3 often admits an r-identifiable decomposition [16, 23, 34]: the set of rank-1 tensors appearing in a decomposition is uniquely determined. For instance, if d = 3 the well-known criterion due to Kruskal [34] states that (1) is the essentially unique decomposition if r ≤ 21 (k1 + k2 + k3 − 2), where ki is the Kruskal Email address: [email protected] (Nick Vannieuwenhoven) author was supported by a Postdoctoral Fellowship of the Research Foundation—Flanders (FWO).

1 The

Preprint submitted to Elsevier

June 16, 2017

rank of Ai = {aij }rj=1 ; that is, the integer ki such that every subset of Ai of cardinality ki forms a linearly independent set. Identifiability of a tensor rank decomposition is essential in several applications, such as in the parameter identification problem in latent variable models [3, 4]. Models whose parameters can be inferred with this decomposition and symmetrical variants were recently surveyed in a unified tensor-based framework [4]; they include exchangeable single topic models, hidden Markov models, and Gaussian mixture models. The blind source separation problem, or independent component analysis, in signal processing is another example where the uniqueness of the symmetric tensor rank decomposition of a cumulant tensor is key in identifying the statistically independent signals [18]. In all of the above applications, a mathematical theory guarantees that the tensor A has a rank-r decomposition; however, in practice this tensor is often estimated from imperfect measurements. This entails that the true but unknown tensor A is approximated by the estimated tensor A00 whose rank is in general not equal to r. Therefore, it is common practice to approximate A00 by a nearby rank-r tensor A0 . The latter’s rank-r decomposition then serves as a proxy for A’s decomposition. This is a standard approach in data-analytic applications, yet one primordial question has thus far remained unanswered: What is the relationship between the essentially unique decompositions of two rank-r tensors A and A0 when kA − A0 k ≤  for small perturbations  > 0? The goal of this paper is characterizing that relationship to first order via a study of the condition number of the associated computational problem of computing a tensor rank decomposition. 1.1. Condition numbers The structured absolute condition number of a computational problem g is a standard concept in numerical analysis [13, 28, 48, 60]. It measures the maximum change at the output of a function g : D ⊂ Fm → Fn when infinitesimally perturbing the input x ∈ D to another x + ∆x ∈ D: κα,β (x) := lim

max

→0 k∆xk≤, x+∆x∈D

kg(x + ∆x) − g(x)kα , k∆xkβ

(2)

where k · kα and k · kβ are norms on Fn and Fm respectively, and x ∈ D ⊂ Fm . The condition number yields an asymptotically sharp a posteriori bound on the absolute forward error: kg(x + ∆x) − g(x)kα . κα,β (x) · k∆xkβ ,

(3)

where x, x + ∆x ∈ D, and where k∆xkβ should be sufficiently close to zero. That is, the absolute forward error is asymptotically bounded by the product of the absolute backward error multiplied with the condition number, which is a well-known rule of thumb in numerical analysis. Unfortunately, the computational problem of computing a tensor rank decomposition admits no known closed expression in general. Therefore, its condition number will be investigated via the study of its much simpler inverse problem, which asks to compute the tensor given the parameters of its rank-r decomposition (1). This problem can be described explicitly as a “tensor computation” function f . In the literature, the parameters are succinctly represented by the factor matrices Ai = [aij ]rj=1 ; however, dealing with their vectorizations will be more convenient in this paper. Therefore, we define the bijections vec : Fn1 × · · · × Fnd → FΣ+d  1 ai   pi = (a1i , . . . , adi ) 7→  ...  ,

vecr : (Fn1 × · · · × Fnd )×r → Fr(Σ+d)   vec(p1 )   (p1 , . . . , pr ) 7→  ...  ,

and

adi

where Σ =

Pd

k=1 (nk − 1).

Let Π = M

f :F

Π

→F ,

Qd

k=1

vec(pr ) nk and M = r ·

Pd

k=1

nk . The tensor computation function is then

p = vecr(p1 , p2 , . . . , pr ) 7→

r X

a1i ⊗ a2i ⊗ · · · ⊗ adi ,

(4)

i=1

which takes the vectorized factor matrices (VFMs) p as input and sends them to the tensor they represent. 2

1.2. The decomposition problem versus the approximation problem This paper is principally concerned with the condition number of the tensor rank decomposition problem:2 Given a rank-r tensor X ∈ FΠ , find x ∈ FM so that X = f (x).

(TDP)

The input to this problem is always a rank-r tensor. Therefore, we study the structured condition number, which accounts only for structured perturbations that take one rank-r tensor to another rank-r tensor. Mathematically, we consider the condition number in (2) where D is the set of tensors of rank equal to r, and g : D → FM maps a rank-r tensor X to VFMs x so that X = f (x). One could complementarily investigate the condition number of the tensor rank-r approximation problem: Given a tensor X ∈ FΠ , find x ∈ FM so that kX − f (x)k2 = min kX − f (y)k2 . y∈FM

(TAP)

The input now consists of any element of FΠ . The corresponding condition number would be as in (2) with D = FΠ and g : FΠ → FM maps a tensor X to the parameters of its best rank-r approximation (assuming it exists). Such a condition number measures the sensitivity of the best rank-r approximation of X to arbitrary perturbations to X. As this paper focuses on the first condition number, it is important to understand when it is appropriate. Consider the situation where an unknown theoretical tensor X admits a unique rank-r decomposition, i.e., X = f (x), but where X is not known exactly due to the occurrence of measurement, model, representation, or computation errors. An example of this is presented in the next subsection. In this case, only an e ≈ X is known. In general, X e does not admit an exact rank-r decomposition. However, it approximation X b can be well approximated by one, say X = f (b x). Even though we need to solve the TAP in this step, if we b is the are interested in understanding the true VFMs x of X, then the condition number of the TDP at X b and the only quantity of interest because, by definition, it bounds the error between the computed VFMs x e (or X) b offers no information of true but unknown VFMs x. The condition number of the TAP with input X this kind, as it looks at unstructured perturbations while in this application we know that we seek a nearby tensor with an exact rank-r decomposition. Note that at a rank-r tensor X, the condition number of the TAP at X is always bounded from below by the condition number of the TDP at X by definition. 1.3. Applications The condition number is expected to find application in data analysis for estimating the forward error of the TDP, as is explained next. Recall that decomposition (1) was employed for data analysis in fluorescence spectroscopy by Appellof and Davidson [5]. In a diluted solution, fluorophores will absorb a certain amount of light at the ith wavelength and re-emit a fraction of the absorbed energy as light at the jth wavelength [53, Section 2.3]. Mathematically, xi,j = χλi µj , where xi,j is the intensity of the light emitted at the jth wavelength when the fluorophore is excited by light at the ith wavelength, λi models the fraction of light absorbed at the ith wavelength, µj models the fraction of light emitted at the jth wavelength, and χ is a chemical constant proportional to the concentration of the fluorophore in the solution. IfPr fluorophores r occur jointly in a solution and they are chemically inert, then their signal is additive: xi,j = `=1 χ` λi,` µj,` . If several of such mixtures with varying concentrations of fluorophores are simultaneously analyzed, the Pr model for the intensity of the emitted light is a tensor rank decomposition: xi,j,k = `=1 χk,` λi,` µj,` . As stated in [53, Section 2.3], “the uniqueness property of [this model] guarantees that, if [it] is adequate, [...] χ` ] the estimated loading vectors for the `th component are readily interpretable as concentration profiles [χ µ λ and emission and excitation spectra ([µ ` ] and [λ ` ] respectively) of the `th fluorophore.” This enables an identification of the fluorophores present in a chemical mixture of unknown composition. While the theoretical tensor X = f (x) admits an exact rank decomposition with rank equal to the number of fluorophores, e can be obtained from physical measurements. 
The reason is that such in practice only an approximation X 2 Ignore for now that this problem is ill-posed due to the non-uniqueness of the representation of rank-1 tensors. In Section 2, this will be addressed in detail.

3

measurements are always subject to errors because of the finite resolution of the equipment. In addition, the supposed model is only approximately valid in practice, as the assumptions of a sufficiently diluted solution e that is available and chemical inertness may not be satisfied exactly. This means that the data tensor X e F ≤µ for data analysis must be regarded as a perturbation of the theoretical rank-r tensor X with kX − Xk e and where µ is known from the application domain. The data tensor X is then approximated by a rank-r b = f (b b − Xk e F , we know that the distance between X and the approximation tensor X x). Letting ν := kX b computed from the sampled data X e is at most µ + ν. Armed with the condition number κ of the TDP X b = f (b bk . κ · (ν + µ), which bounds the at X x), we can compute the asymptotically sharp bound kx − x b and the true but unknown VFMs x. If κ · (ν + µ) is difference between the computed approximate VFMs x small, then one can safely match the computed rank-1 terms with chemical compounds, as the true rank-1 terms in the decomposition of the true tensor X will have been well approximated. 1.4. Contributions The main contribution of this paper is the introduction of a condition number for the TDP. It is defined for every rank decomposition that satisfies a mild condition called robust r-identifiability. This condition is satisfied by generic3 rank-r decompositions in tensor spaces that are generically r-identifiable; see Section 4.1. The prime result, Theorem 10, can be stated informally as follows. Theorem 1. Let N 0 = min{r(Σ + d), Π}, N = r(Σ + 1) < N 0 ≤ Π, and let A = f (x) be a robustly r-identifiable tensor in Fn1 ×n2 ×···×nd . If J = [∂fi /∂xj ]i,j is the Π × r(Σ + d) Jacobian matrix of f evaluated −1 can be at x and ς1 ≥ ς2 ≥ · · · ≥ ςN ≥ ςN +1 = · · · = ςN 0 = 0 are its singular values, then κ(x) := ςN interpreted as the absolute condition number of the tensor rank decomposition problem at x. This Jacobian matrix is required in every step of gradient-based optimization methods for computing approximate tensor rank decompositions; see, e.g., [2, 27, 39–42, 46, 54]. In these methods, the condition number can thus be obtained with little extra effort by computing the singular values of the Jacobian matrix in the final iteration. Nevertheless, the condition number is a property of a computational problem, so it does not depend on the method for solving it. In particular, the condition number of rank decompositions obtained via ALS-type methods or via direct decomposition algorithms [6, 19, 36] is also governed by aforementioned Jacobian matrix, even though this matrix is nowhere employed in these methods. 1.5. Prior work The Cram´er–Rao bounds that were investigated in [32, 37, 51, 52] measure the stability of a tensor rank decomposition in a statistical framework, wherein additive Gaussian noise is assumed to corrupt the factor matrices. The quantity of interest is the squared angular error between the true parameters and the estimated parameters [52]. The Cram´er–Rao lower bound (CRLB) for estimating the parameters p of the tensor rank decomposition of A = f (p) is then defined in [52] as the inverse of the Fisher information matrix 1 T 2 σ 2 J J, where J is as in Theorem 1 and σ is the variance of the Gaussian noise. However, this matrix is not invertible, so [52] suggest to alter the TDP by removing some of the parameters. Hence, the CRLB depends on the particular elimination of the variables that is chosen. 1.6. 
Outline The rest of this paper is structured as follows. In the next section, the classic theory for conditioning of implicit inverse functions is stated. From geometrical insights it is established that this theory does not apply unreservedly. Three strategies are identified for overcoming this problem, two of which are developed in succeeding sections. Strategy I relies on the explicit elimination of variables and is explored in Section 3. The main contribution of this paper consists of developing Strategy II, which applies the classic definition of a condition number in (2) to the TDP, albeit with respect to a suitable premetric; Section 4 presents this 3 A property that can be admitted by the elements of a set S is called generic if all elements of S \ N admit the property, where N ⊂ S has measure zero in S in some topology.

4

development which results in a condition number of the TDP at specific VFMs p. Based on these results, the condition number of norm-balanced VFMs is proposed in Section 5 as the condition number of the TDP at a rank-r tensor A, and its elementary properties are stated. Numerical experiments illustrate the behavior of the proposed condition number in Section 6. The paper is concluded in Section 7 by a discussion of the main conclusions. 1.7. Notation and conventions Varieties are typeset in a calligraphic upper-case font (S, V), tensors in an upper-case fraktur font (A), matrices in upper-case letters (A, U, V ), tuples of vectors in a lower-case fraktur font (p), vectors in boldface lower-case letters (a, x), and scalars and points on varieties in lower-case letters (a, b, λ, p, q). The symbol m m×n F denotes either the real field R or the complex field C. Let Fm , AT 0 := F \ {0}. For a matrix A ∈ F H † is its transpose, A is its√conjugate transpose, and A is its Moore–Penrose pseudoinverse. The Euclidean norm of v ∈ Fm is kvk = vH v, which is induced from the Hermitian inner product ha, bi = bH a. A block diagonal matrix with diagonal blocks A1 , . . . , Ad is denoted by diag(A1 , . . . , Ad ). By span(A) the column span of the matrix A is meant. The ith largest singular value of a matrix A ∈ Fm×n is denoted by ςi (A). The prototypical tensor in this paper is the rank-r tensor A ∈ T = Fn1 ⊗ · · · ⊗ Fnd . Hence, the scalar d is used exclusively for the order of A, r denotes its rank, the scalars n1 , n2 , . . ., nd are reserved for A’s dimensions, and we also define Σ=

d X

(nk − 1),

M = r(Σ + d),

N = r(Σ + 1),

and

Π=

d Y

nk ,

k=1

k=1

which are respectively the dimension of the variety of rank-1 tensors in T minus one, the dimension of the domain of f , the expected dimension of the smallest variety containing the tensors of rank r in T, and the dimension of T. 2. The classic approach to conditioning for implicit inverse functions Given the parameters of a rank-r decomposition, constructing the corresponding tensor amounts to evaluating the smooth “tensor computation” function f in (4). Unfortunately, no explicit description is known of f −1 which sends a rank-r tensor to the parameters of its rank-r decomposition. In certain cases, the related function f † which sends a tensor to the parameters of its best rank-r approximation can be known up to first order by invoking the Inverse Function Theorem [33, 47]. This suffices for characterizing the condition number in the 2-norm, as is shown next. Let g : Rm → Rn be a smooth function with m ≤ n, and let Jg (x) denote its Jacobian matrix evaluated at x. Then, the local Taylor series expansion of g about x is g(x + ∆x) = g(x) + Jg (x)∆x + O(k∆xk2 ). From the definition of the structured condition number in (2) it follows immediately that if D is an open neighborhood of x and α = β = 2, then the absolute condition number of g at x is simply the spectral norm of the Jacobian of g at x, because κ(x) = lim

max

→0 k∆xk≤ x+∆x∈D

kg(x) + Jg (x)∆x + O(k∆xk2 ) − g(x)k = kJg (x)k. k∆xk

The following variant of the Inverse Function Theorem then results in a characterization of the condition number of the approximation problem associated with g. I was unable to locate this version in the literature, but it is probably known to the experts. For convenience, an extension of Spivak’s proof in [47] is included. Theorem 2. Let g : Rm → Rn with m ≤ n be a smooth (C ∞ ) function. Assume that the Jacobian matrix Jg (x0 ) ∈ Rn×m has rank m at x0 . Then, the set-valued map s : y 7→ arg minm ky − g(x)k2 x∈R

5

(5)

has a unique smooth single-valued localization gx† 0 : Rn → Rm in an open neighborhood of y0 = g(x0 ) for which gx† 0 (g(x0 )) = x0 and whose Jacobian is † Jgx† (y0 ) = Jg (x0 ) , 0



T

where A = (A A)

−1

T

A is the Moore-Penrose pseudoinverse of A ∈ Rm×n .

Proof. First, we construct an open set where gx† 0 is defined about y0 . Since g is differentiable in a neighborhood of x0 , we can expand g for sufficiently small ∆ as follows: g(x0 + ∆) = g(x0 ) + J∆ + O(k∆kkr(∆)k), where J = Jg (x0 ) ∈ Rn×m is the Jacobian matrix of g at x0 , and lim∆→0 kr(∆)k → 0. For every sufficiently small τ > 0, there exists a δτ so that supk∆k≤δτ kr(∆)k ≤ τ . Consider then the open ball U = Bδτ of radius δτ centered at x0 . Let x0 + ∆ ∈ ∂Bδτ be a point on the boundary. Then it follows that kg(x0 + ∆) − g(x0 )k ≥ kJ∆k − Cδτ τ ≥ δτ · (ςm (J) − Cτ ), where C is a constant, and ςm (J) > 0 is the mth singular value of J, which is nonzero by assumption on 1 ςm (J). Then, the rank of J. Since τ was arbitrary, we can assume that τ ≤ 2C 1 δτ ςm (J) = 2 > 0. 2 With this definition of , let B be the open ball of radius  in Rn centered at y0 . Next, we show how s(y) can be characterized locally as a solution of a nonlinear system of equations. Fix a y ∈ B ⊂ Rn . Then, for every x ∈ ∂Bδ ⊂ Rm we now have the inequality kg(x0 + ∆) − g(x0 )k ≥

ky − g(x)k ≥ kg(x0 ) − g(x)k − ky − g(x0 )k > 2 −  > ky − g(x0 )k, where the first inequality is due to the triangle inequality. This implies that for all y ∈ B the minimizers of s(y) are all attained in the interior of U . This entails that the gradient of the smooth objective function hy (x) = 21 ky − g(x)k2 vanishes at each minimizer xi ∈ s(y) [38]. The stationary points of hy (x) can be obtained as the zero locus of the smooth map φ : B × U → Rm , (y, x) 7→ (Jg (x))T (y − g(x)), where the Jacobian matrix Jg (x) ∈ Rn×m is of rank m in a neighborhood of x0 . Hence, we can state that s(y) ⊂ S(y) := {x ∈ U | φ(y, x) = 0} n

for all y ∈ B ⊂ R . By assumption, φ(y0 , x0 ) = 0. The implicit function theorem [24, Theorem 1B.1] states that if the partial derivative of φ with respect to the second argument is of full rank at (y0 , x0 ), then there exists a smooth function gx† 0 that corresponds with a single-valued localization around x0 of the set-valued solution map S(y) in an open neighborhood of y0 ∈ Rn , i.e., gx† 0 (y0 ) = x0 . In addition, the Jacobian of gx† 0 is given by −1 Jgx† (y) = − ∇x φ(y, gx† 0 (y)) ∇y φ(y, gx† 0 (y)), 0

where ∇x φ denotes the partial derivative of φ with respect to x and similarly for ∇y φ. By the chain rule,  ∇x φ(y0 , x0 ) = ∇x Jg (x)T (x0 ) · (y0 − g(x0 )) + Jg (x0 )T (−Jg (x0 )) = −(Jg (x0 ))T Jg (x0 ). As Jg ∈ Rn×m is of rank m in a neighborhood of x0 by semicontinuity of matrix rank, it follows that the partial derivative of φ to x is also of full rank at (y0 , x0 ) because it is the Gram matrix of Jg . So, the implicit function theorem applies, and we find that there is a local inverse function gx† 0 defined in a neighborhood of y0 with gx† 0 (y0 ) = x0 and whose Jacobian in y0 is −1 † Jgx† (y0 ) = (Jg (x0 ))T Jg (x0 ) (Jg (x0 ))T = Jg (x0 ) , 0

as it is easy to verify that ∇y φ(y0 , x0 ) = Jg (x0 )T In . 6

The following result is an immediate consequence of the above discussions. Corollary 3. If g : Rm → Rn with m ≤ n is a smooth function with a full rank Jacobian matrix Jg (x) ∈ −1 Rn×m at x, then the condition number of gx† at g(x) is κ = k(Jg (x))† k = ςm (Jg (x)) . Applying this corollary to the smooth tensor computation function f would thus characterize the condition number of the TAP for real4 tensors. Unfortunately, it does not apply unreservedly because the Jacobian matrix of f is not of full rank due to the scaling indeterminacies. This is naturally understood from the (semi-)algebraic geometry of the tensor rank decomposition. 2.1. Geometry of the tensor rank decomposition I: Dimension The relevant basic facts and definitions from (semi-)algebraic geometry are recalled next; the reader is referred to Landsberg [35] for greater detail. Let Fn0 := Fn \ {0}. The Segre variety over F = C or R is the variety of rank-1 tensors. It is denoted by SF := Seg(Fn0 1 × Fn0 2 × · · · × Fn0 d ) ⊂ Fn1 ⊗ Fn2 ⊗ · · · ⊗ Fnd ∼ = FΠ , and its elements can be parametrized explicitly via the Segre map: Seg : Fn1 × Fn2 × · · · × Fnd → Fn1 ⊗ Fn2 ⊗ · · · ⊗ Fnd ∼ = FΠ (a1 , a2 , . . . , ad ) 7→ a1 ⊗ a2 ⊗ · · · ⊗ ad , Qd where ⊗ is the tensor product and Π = k=1 nk . For notational brevity, I consider the Segre map that embeds into FΠ so that the tensor product can be realized in coordinates as the Kronecker product. The 2factor Segre variety embedded in Fn1 ⊗ Fn2 ∼ = Fn1 ×n2 is the smooth variety of rank-1 matrices. The specific 1 2 d tuple pi = (ai , ai , . . . , ai ) in the representation of a point pi = Seg(pi ) ∈ SF is called a representative of pi . Π . Then, let Let r ≥ 2 be strictly subgeneric: r < Σ+1 σr0 (SF ) = {p1 + p2 + · · · + pr | pi ∈ SF } = {A ∈ Fn1 ⊗ · · · ⊗ Fnd | rankF (A) ≤ r} denote the set of tensors of F-rank at most r. Its closure in the Euclidean topology is σr (SF ) := σr0 (SF ). By definition, σr0 (SF ) is a dense, constructible subset of σr (SF ). It is known that σr (SF ) is an irreducible algebraic variety 5 if F = C (see, e.g., [35, Corollary 5.1.1.5]) and a semi-algebraic variety if F = R (see, e.g., [20, Theorem 6.2]). An algebraic variety is the common zero set of a system of polynomial equations over C (see, e.g., [26]) and a semi-algebraic set is the solution set of a system of polynomial equations and inequalities over R; see [10]. At most points varieties and semi-algebraic sets locally resemble a linear space. One way to describe this local linear approximation to a variety or semi-algebraic set V ⊂ FΠ at a point p ∈ V is the tangent space; see respectively [25, pp. 175–176] and [10, Section 3.3]. For a smooth variety V ⊂ FΠ , this vector space can be defined extrinsically as the span of all tangent vectors of analytic curves in V passing through p:  d Tp V = span { dt p(0) | p(t) ⊂ V, t ∈ (−1, 1), p(0) = p} . For general varieties the tangent space is best defined algebraically; see, e.g., [25, 26]. The dimension of a variety V ⊂ FΠ can be defined geometrically as the number n such that dim Tp V = n for all p in an open neighborhood N of p0 ∈ V. For a semi-algebraic set V the dimension is defined as the dimension of the Zariski-closure of V [10, Proposition 2.8.2]. The dimension of the Segre variety SF = Seg(Fn0 1 ×· · ·×Fn0 d ) ⊂ FΠ is well-known [26, 35]: dim SF = Σ + 1. The dimension of σr (SF ) is known only conjecturally in general. As σr0 (SF ) is Euclidean-dense in σr (SF ), an upper bound is obtained by noting that σr (SF ) is the projection of   (p1 , . . . 
, pr ), p | p = p1 + . . . + pr , p1 , . . . , pr ∈ SF } ⊂ (SF × · · · × SF ) × FΠ 4I

was unable to generalize Theorem 2 to the complex case. variety can be defined in projective space as well and is then called the r-secant variety of the Segre variety SC ⊂ PΠ−1 .

5 This

7

onto the second factor, where the overline indicates the closure in the Euclidean topology. Therefore, dim σr (SF ) ≤ min{Π, r(Σ + 1)}, by [10, Proposition 2.8.6] for F = R and [26, Theorem 11.12] for F = C. One expects that equality holds. The problem of determining the dimension of σr (SF ) has seen much progress recently, leading up to the conjecturally complete picture in [1]. A point p ∈ V on an irreducible complex variety V where there exists an open neighborhood N of p such that dim Tq V = dim V for all q ∈ N is called a smooth point of V. The set of smooth points of V forms an n-dimensional submanifold of CΠ ; it is Zariski-dense in V. For consistency with the real case, I will call a smooth point of a C-variety a regular point. For semi-algebraic sets V ⊂ RΠ , defining the smooth locus is more subtle, because V can consist of several semi-algebraically connected components of different dimensions. Every semi-algebraic set V admits a Nash stratification, meaning that it can beSdecomposed as a finite union of disjoint semi-algebraically connected Nash manifolds [10, Section 9.1]: V = i Mi , where Mi is a smooth R-manifold of dimension di = dim Mi and Mi S ∩ Mj = ∅ if i 6= j. Let D = {i | dim Mi = dim V} be the set of all Mi of maximal dimension. I will call i∈D Mi the set of regular points of the semi-algebraic set V; it forms a (potentially reducible)6 R-manifold. All regular points are smooth but in general the converse is false. The regular locus is neither Zariski-dense nor Euclidean-dense, in general.7 2.2. Why the standard approach fails We can now understand why the Jacobian of f in (4) is not of full rank. Let pi = (a1i , . . . , adi ) be a representative of the rank-1 tensor pi = a1i ⊗ · · · ⊗ adi ∈ SF , and consider the decomposition A = f (p) =

r X

a1i ⊗ a2i ⊗ · · · ⊗ adi ,

i=1

where p = vecr(p1 , p2 , . . . , pr ). Then, f ’s Jacobian matrix Tp := Jf (p) is given explicitly by     Tp = T1 · · · Tr ∈ FΠ×M with Ti = In1 ⊗ a2i ⊗ · · · ⊗ adi · · · a1i ⊗ · · · ⊗ aid−1 ⊗ Ind ,

(6)

where M = r(Σ + d); the tensor products should be interpreted as Kronecker products in this expression. It is well known that the column span of Tp is contained in (one component of) the tangent space of σr (SF ) [1, 35, 58]. Hence, the rank of Tp is bounded from above by the dimension of σr (SF ), which is no more than r(Σ + 1). Terracini’s Lemma [50] provides an even stronger statement, namely that rank(Tp ) = dim σr (SF ) provided that p ∈ FM is not contained in some unspecified Zariski-closed subset. For this reason, I refer to Tp as Terracini’s matrix associated with the VFMs p. A part of the kernel of Terracini’s matrix can be described explicitly. A basis of the kernel of Ti is  1   1   1  ai  ai ai    2  −ai   0   0             3  0  −ai   0        (7a) Ki =  0  ,  0  , . . . ,  ..  = {k2i , k3i , . . . , kdi },     .   .  ..      .        0  .    . d  0

−ai

0

which contains exactly d − 1 linearly independent basis vectors. One then verifies that K = {k2i ⊗ ei , k3i ⊗ ei , . . . , kdi ⊗ ei }ri=1 ,

(7b)

where ei is the ith standard basis vector of Fr , is a linearly independent set contained in the kernel of Tp . 6 In this paper, global connectedness of σ (S ) is not of import as the definition of a condition number is inherently local. r F A neighborhood of a regular point of a variety or semi-algebraic set can always be chosen to be irreducible. 7 The regular locus suffices for our purpose since the condition number in Theorem 10 will be ∞ for non-regular points.

8

Remark 4. A reviewer asked when the rank of Terracini’s matrix can be strictly less than r(Σ + 1). Characterizing such points is an interesting open problem. A partial answer was provided in [17, Lemma 37], which entails that if the VFMs p representing the tensor A = f (p) are contained in a positive-dimensional family of alternative decompositions of A, then Terracini’s matrix Tp is of rank strictly smaller than r(Σ + 1). As Terracini’s matrix is of rank at most N = r(Σ + 1) < r(Σ + d) = M , the classic theory of conditioning for implicit inverse functions does not apply. This paper nevertheless identifies three candidate strategies for deriving condition numbers in the setting of tensor rank decompositions. The first strategy yields a set of condition numbers for the TAP. The other two strategies result in condition numbers of the TDP. Strategy I. Terracini’s matrix is not of full rank because the domain of f is an over-parameterization for representing tensors of rank r. One could eliminate a subset S of the parameters (with |S| = r(d − 1)) via an invertible smooth projection function πS : RM → RN and consider instead an alternative tensor computation function fS = f ◦ πS−1 . Then, for real tensors the standard theory, in particular Corollary 3, applies, resulting in a characterization of a condition number of the TAP fS† . The main disadvantage of this approach is that these representations are usually not global, i.e., some rank-r tensors in the image of f are not in the image of fS . Another disadvantage is that the elimination of parameters can be performed in many, ultimately arbitrary ways, resulting in a multitude of condition numbers all of which are equally valid. This approach is nevertheless explored in Section 3, mainly because its results are required in the proof of Theorem 1. Strategy II. It is reasonable to conjecture that Corollary 3 should still be valid for functions of constant rank, −1 namely the condition number of f −1 is expected to be κ = k(Jf (x))† k = ςN (Jf (x)) . A complication in generalizing the proof of Theorem 2 consists of ensuring that the boundary ∂Bδ is mapped away from y0 ; otherwise the condition number is unbounded. It turns out that this can be accomplished by relaxing the requirement that distances between the parameters of two rank-r decompositions are measured by a metric. This approach results in a condition number of the TDP and is developed in detail in Section 4. Strategy III. We could abandon the classic framework entirely and instead investigate the question of stability in the geometric framework of Blum, Cucker, Shub, and Smale [8], and B¨ urgisser and Cucker [13] by modeling the TDP as a map between manifolds. This presents its own challenges as the set of tensors of rank at most r is not naturally a manifold, but rather a singular semi-algebraic set if F = R or a singular projective algebraic variety if F = C. It is nevertheless possible to extend this geometric framework to join sets, of which the set of tensors of rank at most r is an instance. The geometric condition number of the TDP obtained in this manner was recently introduced by Breiding and the author in [12]. 3. Strategy I: Eliminating variables A common strategy for eliminating the scaling indeterminacies in the representation of a rank-1 tensor consists of explicitly eliminating some variables via a suitable normalization. 
Because SF is a manifold of dimension Σ + 1, this idea essentially consists of covering SF by a collection of smooth coordinate charts Σ+1 b b {Uj ⊂ SF }j such that Uj = φ−1 open. This covering is not unique, and in general j (Uj ) with Uj ⊂ F a rank-1 tensor A ∈ SF can even be represented on several charts, i.e., A = φ−1 i (xi ) for multiple i. The gist of this strategy consists of applying the standard theory of conditioning to the function f ◦ φ−1 j . By construction, its Jacobian is expected to be of full rank so that Theorem 2 and Corollary 3 apply in the real case, yielding a condition number for the corresponding TAP. Rank-1 tensors. A covering of SF is easily constructed by noting that a rank-1 tensor can be represented as p = (x1 , . . . , xd ) ∈ Fn0 1 × · · · × Fn0 d

9

with kxk kαk = ck , k = 2, . . . , d.

where k · kαk is a norm and ck > 0 are arbitrary constants. The constraints kxk kαk = ck enable the elimination of one variable. For example, an evident choice is taking αk = 2 for all k, so that a dense, open subset U1,...,1 of SF is the image of the smooth invertible function n1 Π φ−1 1,...,1 : F0 × Bc2 (0) × · · · × Bcd (0) → SF ⊂ F  p 2  p cd − kwd k2 c22 − kw2 k2 1 d 1 ⊗ ··· ⊗ , (w , . . . , w ) 7→ w ⊗ wd w2

where Bck (0) is the open ball of radius ck > 0 in Fnk −1 centered at 0. (U1,...,1 , φ1,...,1 ) is a coordinate chart (k) (k) k , and then φ1,...,1 corresponds to the elimination of the variables x1 , k = 2, . . . , d. of SF . Let xk = [xi ]ni=1 In an analogous manner, one can define the functions φ−1 i2 ,...,id with 1 ≤ ik ≤ nk for k = 2, . . . , d. Then, it is ,...,nd easy to check that {{Fn0 1 × Bc2 (0) × · · · × Bcd (0), φi2 ,...,id }ni22,...,i } is a smooth atlas for SF . For every d =1,...,1 † (i2 , . . . , id ) ∈ [1, n2 ] × · · · × [1, nd ], Corollary 3 then yields the condition number of (f ◦ φ−1 i2 ,...,id )y , i.e., the −1 2 rank-1 approximation problem minx∈RΣ+1 kA − (f ◦ φ−1 i2 ,...,id )(x)k associated with the function f ◦ φi2 ,...,id −1 at the real, rank-1 tensor A = f (φi2 ,...,id (y)). The general case. The foregoing strategy can be generalized to tensors of rank equal to r if we restrict ourselves to the regular points N of σr0 (SF ); this ensures that Terracini’s matrix is always of maximal rank r(Σ + 1) for all of the VFMs representing a tensor A ∈ N . Then, by definition, N is an F-manifold so that we can apply the same approach as in the case of rank-1 tensors. Since Strategy I is not the focus of this paper, only the atlas of N that is required in the proof of the main theorem, namely Theorem 10, is presented here. It is essentially based on normalization in the ∞-norm. The construction proceeds as follows. Choose C ∈ (F0 )(d−1)×r , and let C be a set of r(d − 1) integers (k)

C := {jk,i | k = 2, . . . , d, i = 1, . . . , r} with jk,i ∈ {j | aj,i 6= 0} a fixed choice.

(8)

Let x = vecr(x1 , . . . , xr ) ∈ FM be variables, where xi = (x1i , . . . , xdi ) ∈ Fn0 1 × · · · × Fn0 d . We will represent tensors X = f (x) in an open subset of N using the expected number of parameters N by choosing VFMs that lie in the subspace given by the r(d − 1) equations  bC,C := x(k) = ck,i | jk,i ∈ C ⊂ FM , U jk,i ,i (k) bC,C ) ∩ N be any choice. Then, there exists an open where x`,i denotes the `th element of xki . Let Z ∈ f (U bC,C ). This follows from X = f (x) = f (ηC,C (x)) neighborhood NZ ⊂ N of Z such that NZ is contained in f (U with normalization function ηC,C : x 7→ x˙ = vecr(˙x1 , . . . , x˙ r ), where

x˙ i =

x1i ·

d Y

ck,i

k=2

xjk,i ,i

(k)

!−1 , x2i ·

c2,i (2)

xj2,i ,i

. . . , xdi ·

cd,i (d)

xjd,i ,i

! ∈ Fn0 1 × · · · × Fn0 d ;

bC,C in FM , and the image ηC,C is contained this normalization is well-defined in an open neighborhood of U bC,C . The definition of ηC,C is such that it simply chooses an alternative representative for each of the in U rank-1 tensors x1i ⊗ · · · ⊗ xdi . Define for k = 2, 3, . . . , d and i = 1, 2, . . . , r, the functions  πjk,i : Fnk → Fnk −1 , z 7→ z1 · · · zjk,i −1 zjk,i +1 · · ·  πj−1 : Fnk −1 → Fnk , z 7→ z1 · · · zjk,i −1 ck,i zjk,i k,i ,C

znk ···

where jk,i ∈ C. Define also −1 πC,C : FN → FM

πC : FM → FN 10

T

,

znk −1

T

,

 z1i πj2,i (z2i )   7→   ..   .

 1 zi z2i     ..  . zdi

i=1,...,r

πjd,i (zdi )

 z1i −1 πj2,i ,C (z2i )  

 1 zi z2i     ..  .



, and

zdi

i=1,...,r



i=1,...,r

7→   ..   . −1 d πjd,i ,C (zi )

.

i=1,...,r

Then, we define the smooth “local tensor computation” function associated with C and C as −1 fC,C : FN → FΠ , z 7→ (f ◦ πC,C )(z),

(9)

where it is recalled that N = r(Σ + 1). For example, in the simplest case E is the (d − 1) × r matrix of ones and C = {jk,i = 1 | k = 2, . . . , d, i = 1, . . . , r}, so the corresponding local tensor computation function is fC,E : (Fn0 1 × Fn2 −1 × · · · × Fnd −1 )×r → FΠ     r X 1 1 ((wi1 , . . . , wid ))ri=1 7→ wi1 ⊗ ⊗ · · · ⊗ . wi2 wid i=1

The “local” tensor rank approximation problem with input A ∈ FΠ associated with the foregoing model fC,C is minx∈FN kA − fC,C (x)k2 . As fC,C is analytic in N variables, it is continuously differentiable. Hence, the Jacobian matrix J p ∈ FΠ×N of fC,C at p = πC (p) with p := ηC,C (p0 ) is defined, and is given by −1 J p := Jf ◦π−1 (p) = Jf (πC,C (p))Jπ−1 (p) = Jf (p)IbC = Tp IbC , C,C

C,C

(10)

where the second equality is by the chain rule and IbC ∈ RM ×N is an identity matrix from which certain columns have been removed. Hence, the Jacobian J p is formed by a subset of the columns of f ’s Jacobian, i.e., Terracini’s matrix Tp . Expressing p = vecr(p1 , . . . , pr ) with pi = (a1i , . . . , adi ), we see that it follows −1 from the definition of πC,C that J p lacks the r(d − 1) columns a1i ⊗ · · · ⊗ ak−1 ⊗ ejk,i ⊗ ak+1 ⊗ · · · ⊗ adi , i i

jk,i ∈ C,

relative to Tp , where ei is the ith basis vector. One verifies that the columns of J p and Tp still span the same space. As the latter has dimension N by the assumption that NZ ⊂ N , it follows that J p ∈ FΠ×N has linearly independent columns, provided that p ∈ FN is not contained in some unspecified Zariski-closed subset. This leads to the next corollary in the real case. Corollary 5. Let A = fC,C (p) be a real, rank-r tensor. If J p ∈ RΠ×N has full rank, then the condition −1 † number of the TAP (fC,C )†p : A 7→ arg minx∈RN kA − fC,C (x)k2 is κC,C (A) = kJ p k = ςN (J p ) . The condition number of the TAP associated with the model fC,C , obtained via the first strategy of elimination of variables, is always worse than the corresponding condition number of the TDP in Theorem 1, which is obtained via the second strategy of relaxing the distance measure. This is proved next. Proposition 6. Let C ∈ (R0 )(d−1)×r be arbitrary, and let C be any set as in (8). Then, the condition number −1 κC,C (p) of the TAP (fC,C )†p at A = fC,C (p) is larger than the condition number κ(p) with p = πC,C (p) of the TDP as in Theorem 1: −1 −1 κC,C (p) := ςN (J p ) ≥ ςN (Tp ) =: κ(p) Proof. The inequality follows immediately from the min-max characterization of singular values, namely H ςN (J p ) = ςN (J p ) = ςN (IbCH TpH ) ≤ ς1 (IbCH )ςN (TpH ) = ςN (Tp ), where all definitions are as in (10). Aside from the very arbitrary nature of the variable elimination, I believe that this proposition provides a good incentive for considering Strategy II, which is developed in the next section. 11

4. Strategy II: Relaxing the distance measure The second strategy is derived from the definition of the structured condition number in (2). There is one major complication, namely the TDP is not naturally modeled as a function. Indeed, let f −1 (A) := {x ∈ FM | f (x) = A} be the fibre of f at A. The domain of f −1 is σr0 (SF ). The fibre f −1 (A) is an affine variety consisting of representatives of the tensor A; it is well known that dim f −1 (A) ≥ r(d − 1). The next subsection explains how this complication can be overcome via robust r-identifiability and a quotient. 4.1. Geometry of the tensor rank decomposition II: Robust identifiability The representatives x = vecr(p1 , . . . , pr ) ∈ FM and y = vecr(q1 , . . . , qr ) ∈ FM of two rank decompositions are essentially equal, denoted x ∼ y, if the multisets of rank-1 tensors {Seg(p1 ), . . . , Seg(pr )} and {Seg(q1 ), . . . , Seg(qr )} are equal including multiplicities. Proposition 7. ∼ is an equivalence relationship. Proof. See Appendix A. Let [p] := {q ∈ FM | q ∼ p} denote the equivalence class of p ∈ FM , and let q : FM → FM /∼ be the quotient map. By the proposition, the composition (q ◦ f −1 )(A) := {[p] | f (p) = A} is well defined: it sends a rank-r tensor to its set of equivalence classes of representatives. Let the set of r-identifiable tensors be I := {A | |(q ◦ f −1 )(A)| = 1} ⊂ Fn1 ⊗ · · · ⊗ Fnd ∼ = FΠ , where |S| denotes the cardinality of the set S which for infinite sets we define as |S| = ∞. It follows by definition that g = (q ◦ f −1 |I ) : I → (FM /∼) is a function. If we additionally obtain some differentiability guarantees for g, then we could hope to apply the standard definition of a condition number in (2). Therefore, the notion of robust r-identifiability is introduced next.8 Definition 1. A rank-r tensor A ∈ σr0 (SF ) is robustly r-identifiable if it is a regular point of σr (SF ) and there exists an open neighborhood N of A such that every B ∈ N ⊂ σr0 (SF ) is r-identifiable. A robustly r-identifiable tensor A = f (p) thus has an open neighborhood N ⊂ σr0 (SF ) such that for every B ∈ N it holds that B is r-identifiable and a regular point of σr (SF ). Hence, N is a smooth F-manifold. It follows that the column span of Terracini’s matrix Tp coincides with the tangent space to σr (SF ). Robust r-identifiability is expected to impose only a mild condition on rank-r tensors. A Segre variety SF is called generically r-identifiable if there exists a Zariski-closed set Z ⊂ σr (SF ) such that every A ∈ σr (SF )\Z is r-identifiable. Then, the following basic result applies. Lemma 8. Let SF be a generically r-identifiable Segre variety, and let Z be the Zariski-closed set that is the union of the locus of singular points of σr (SF ) and the locus where r-identifiability fails. Then, every element of σr0 (SF ) \ Z is robustly r-identifiable. In the literature, many Segre varieties SF are proven to be generically r-identifiable [9, 15, 23, 49]. The conjecture from [16, 17, 43] paints a nearly complete picture. Conjecture 1 ([16, 17, 43]). Let d ≥ 3, let F = C or R, and let SF = Seg(Fn0 1 × · · · × Fn0 d ) ⊂ FΠ be a Π Segre variety with n1 ≥ · · · ≥ nd ≥ 2. Then, SF is generically r-identifiable if r < Σ+1 , unless (n1 , . . . , nd ) is one of the exceptional cases, namely (4, 4, 3); (4, 4, 4); (6, 6, 3); (2, 2, 2, 2, 2); (n, n, 2, 2) with n ∈ N; or Qd Pd n1 > k=2 nk − k=2 (nk − 1). In the exceptional cases there are multiple complex rank-r decompositions. 
The computer-assisted proof of [16] combined with the observation of [17, 43] shows that this conjecture Π is true for all n1 · · · nd ≤ 15000. Note that if r > Σ+1 , then SF is never generically r-identifiable. 8A

similar notion that bears the same name but imposes no smoothness condition was recently examined in [7].

12

4.2. Relaxing the metric We can now apply the standard definition of the condition number in (2) to the function −1 fR := (q ◦ f −1 |R ) : R → (FM /∼), f (x) 7→ [x]

where R is the subset of robustly r-identifiable tensors in Fn1 ⊗ · · · ⊗ Fnd ∼ = FΠ and q the quotient map from the previous subsection. This function takes a robustly r-identifiable tensor A = f (x) to the affine variety of essentially equal representatives [x] = {y | x ∼ y} ⊂ FM where x is any vector satisfying f (x) = A. We still need to choose how to measure errors between equivalence classes of representatives of rank-r decompositions in FM /∼ as well as between tensors in FΠ . The natural choice for the latter is the standard 2-norm on FΠ . An obvious choice for measuring the error between two equivalence classes of representatives [x], [y] ∈ FM /∼ would be the minimum distance between these varieties (considered as subvarieties of FM ): −1 µ([x], [y]) := inf x0 ∈[x],y0 ∈[y] kx0 − y0 k. A condition number of fR with respect to µ at the tensor A would measure the worst possible fraction  −1 −1 µ fR (A), fR (A + ∆A) k∆Ak for infinitesimal ∆A with A + ∆A ∈ R. That is, it looks at the distance between the best representative p ∈ [x] of A = f (x) and q ∈ [y] of A+∆A = f (y) relative to k∆Ak. Unfortunately, µ does not separate equivalence classes as µ([x], [y]) can be 0 even if [x] 6= [y].9 The basic example is x = vec(a1 , a2 , . . . , ad ) and y = vec(b1 , b2 , a3 , . . . , ad ). Let x = vec(a1 , a2 , −2 a3 , . . . , ad ) ∈ [x] and y = vec(b1 , b2 , −2 a3 , . . . , ad ) ∈ [y], and then 0 ≤ µ([x], [y]) ≤ kx − y k = O() for all  > 0, hence µ([x], [y]) = 0. So if a1 6= b1 and a2 6= b2 then [x] 6= [y] even though their distance is zero. The above problem motivates the following alternative way to measure errors between representatives of rank-r tensors. Let A = f (p) and B = f (q) be two robustly r-identifiable tensors. We can measure the distance between a given representative p ∈ FM of A and the equivalence class of representatives of the −1 rank-r decomposition of B, i.e., [q] = fR (B), as the least distance between p and the variety [q]: dist(p, [q]) = 0inf kp − q0 k. q ∈[q]

(11)

−1 The condition number of fR with respect to dist at the representative p of A = f (p) measures the worst possible fraction  −1 dist p, fR (A + ∆A) k∆Ak

for infinitesimal ∆A with A + ∆A ∈ R. That is, it looks at the distance between the given representative p ∈ FM of A = f (p) and the best representative q ∈ [y] of A + ∆A = f (y). In the case where we are −1 agnostic about the perturbed decomposition fR (A + ∆A), there is no reason to prefer some VFMs over others. In this case it is reasonable to measure the error with respect to the best possible representative −1 q ∈ fR (A + ∆A) of the perturbed tensor, which is exactly what dist accomplishes. The function dist does correctly separate equivalence classes of representatives, as is shown next. Proposition 9. Let x ∈ FM and [y] ∈ FM /∼. Then, the infimum is attained: dist(x, [y]) = min kx − y0 k. 0 y ∈[y]

Moreover, the distance dist(x, [y]) = dist(y, [x]) = 0 if and only if x ∼ y. Proof. See Appendix A. In conclusion, dist : FM × (FM /∼) satisfies the properties dist(x, [y]) ≥ 0, dist(x, [x]) = 0, and dist(x, [y]) = 0 iff x ∼ y. These conditions suffice for proving the main theorem of the next subsection. 9µ

is a symmetric premetric satisfying µ([x], [y]) ≥ 0, µ([x], [x]) = 0, and µ([x], [y]) = µ([y], [x]).

13

4.3. A condition number The main result of this paper states that the inverse of the N th singular value of Terracini’s matrix is −1 the absolute condition number of the TDP fR with respect to the distance measure dist in (11). Theorem 10. Let pi = (a1i , . . . , adi ) be a representative of pi = a1i ⊗ · · · ⊗ adi ∈ SF , i = 1, . . . , r, and let p = vecr(p1 , p2 , . . . , pr ) be the given VFMs. Assume that A = f (p) =

r X

a1i ⊗ · · · ⊗ adi

∈ σr0 (SF ) ⊂ FΠ

i=1

is robustly r-identifiable, and let ςN with N = r(Σ + 1) denote the N th largest singular value of Terracini’s −1 matrix Tp . Then, the absolute condition number of the tensor rank decomposition problem fR at p satisfies κA (p) := lim

max

→0 k∆Ak≤, A+∆A∈R

−1 dist(p, fR (A + ∆A)) −1 = ςN , k∆Ak

where R ⊂ σr (SF ) is the locus of robustly r-identifiable tensors. If A 6∈ R, then κA is undefined. −1 If κA is finite, then the relative condition number at p is κ(p) := ςN kAkkpk−1 . Proof. The proof is presented in Appendix C. b = f (b Let A = f (p) and A p) be robustly r-identifiable tensors. As usual, we have the following asymptotically sharp inequalities for estimating the absolute, respectively relative, forward error between VFMs: −1 b p) · kA − Ak dist(b p, fR (A)) . κA (b

and

−1 b dist(b p, fR (A)) kA − Ak p) . κ(b , b kb pk kAk

b is sufficiently small. provided that kA − Ak 5. The condition number of the TDP The absolute and relative condition numbers considered in the previous subsection are defined at the representative p ∈ FM rather than at the tensor A ∈ FΠ . These condition numbers depend on the particular −1 p ∈ fR (A) that was chosen. If there is no choice of VFMs that is suggested by the application at hand, then where should the condition number from Theorem 10 be evaluated? Consider the rank-1 tensor pi = a1i ⊗ · · · ⊗ adi ∈ SF . All of the representatives of pi can be parameterized as pαi := vec(αi,1 u1i , . . . , αi,d udi ) where kuki k = 1, αi,1 · · · αi,d = 1, and αi,k > 0. Terracini’s matrix is   Tpαi = (αi,2 · · · αi,d )In1 ⊗ u2i ⊗ · · · ⊗ udi · · · (αi,1 · · · αi,d−1 )u1i ⊗ · · · ⊗ uid−1 ⊗ Ind   = In1 ⊗ u2i ⊗ · · · ⊗ udi · · · u1i ⊗ · · · ⊗ ud−1 ⊗ Ind · diag((αi,2 · · · αi,d )In1 , . . . , (αi,1 · · · αi,d−1 )Ind ) i =: Tbpαi · Dαi . From the min-max characterization of singular values it follows that ςΣ+1 (Tpαi ) ≤ ς1 (Tbpαi ) · ςΣ+1 (Dαi ); this is √ essentially [30, Ex. 18 in Section 7.3]. From Corollary 18 below, we have ς1 (Tbpαi ) = d. If we assume that Q all nk ≥ d, then it follows from the structure of Dαi that ςΣ+1 (Dαi ) = min1≤k≤d 1≤j6=k≤d αi,j . Hence, Y 1 −1 κA (pαi ) ≥ √ max αi,j . d 1≤k≤d 1≤j6=k≤d One verifies that this lower bound is minimized for αi,1 = · · · = αi,d = kpi k1/d . On the other hand, it is easy to choose representatives pαi of pi that are ill-conditioned, for example by choosing w  1 and setting αi,1 = 2w , αi,2 = · · · = αi,d = 2−w/(d−1) , the condition number becomes κA (pαi ) ≥ d−1/2 2w . 14

pαT

−1 Continuing observations, we can also bound the condition number of fR at the VFMs  1 T from ther above  T · · · (pα ) , which represent the tensor p = p1 + · · · + pr . Terracini’s matrix is = (pα )

 Tpα = Tpα1

···

 h Tpαr = Tbpα1

i Tbpαr · diag(Dα1 , . . . , Dαr ).

···

Assume that Tpα is of the maximal rank N . By the min-max characterization of singular values   kTpα1 v1 k + · · · + kTpαr vr k , kTpα vk ≤ min ςN (Tpα ) = min v⊥ker(Tpα ), kvk=1

v⊥ker(Tpα ), kvk=1, vT =[ v1T ··· vrT

]

where the inequality is due to the triangle inequality, and ker(A) denotes the kernel of A ∈ Fm×n . The assumption on the rank of Tpα causes v ⊥ ker(Tpα ) to entail vi ⊥ ker(Tpαi ) for all i = 1, . . . , r. Hence, choosing v1 = · · · = vi−1 = vi+1 = · · · = vd = 0, kvi k = 1, and vi ⊥ ker(Tpαi ), the right hand side of the above inequality is bounded by kTpαi vi k. Since this bound is valid for every i = 1, . . . , d, we find ςN (Tpα ) ≤ min

min

1≤i≤r vi ⊥ker(Tpi ),

kTpαi vj k = min ςΣ+1 (Tpαi ). 1≤i≤r

α

kvi k=1 −1 The condition number of fR at the VFMs pα thus satisfies

Y 1 −1 αi,j . κA (pα ) ≥ √ · max max d 1≤i≤r 1≤k≤d 1≤j6=k≤d As before, taking αi,1 = · · · = αi,d = kpi k1/d is a minimizer of the expression on the right hand side. −1 The above evidence strongly suggests to evaluate the condition number of fR at norm-balanced VFMs. Definition 2. The VFMs p = vecr(p1 , . . . , pr ) with pi = (a1i , . . . , adi ) are called norm-balanced if for every i = 1, . . . , r we have ka1i k = · · · = kadi k. −1 We can then define the condition number of the TDP fR at the tensor A as follows.

∼ FΠ be a robustly r-identifiable Definition 3 (Condition number of the TDP). Let A ∈ Fn1 ⊗ · · · ⊗ Fnd = −1 −1 rank decomposition. Let p ∈ fR (A) be norm-balanced. Then, the condition number of the TDP fR at A −1 is defined to be the condition number of fR at a norm-balanced representative of A: κA (A) := κA (p)

and

κ(A) := κ(p).

−1 Remark 11. The fibre fR contains several norm-balanced VFMs. Indeed, if p = vecr(p1 , . . . , pr ) ∈ −1 −1 fR (A), then for every permutation π of {1, . . . , r} we have pπ := vecr(pπ1 , . . . , pπr ) ∈ fR (A) as well. It is nevertheless easy to verify that the condition numbers of norm-balanced VFMs of A are all equal.

The condition number κ(A) is computed straightforwardly if the VFMs p ∈ FM representing the rank-r decomposition of A are given, requiring only the N th largest singular value of Terracini’s matrix corresponding to the norm-balanced VFMs p0 ∈ [p]. Remark 12. A reviewer observed that choosing norm-balanced VFMs in Strategy II, as in Definition 3, can also be interpreted as a specific approach for variable elimination that could be employed in Strategy I. The reviewer asked if there is any obvious relationship between these two approaches. Note that in this approach one would still have to choose which of the variables are eliminated, resulting in a set of condition numbers as before. It seems unlikely that all of these condition numbers would coincide, as this is also not the case for the approach based on normalization in the ∞-norm that was described in Section 3. The precise relationship is left as an open problem. 15

5.1. Elementary properties We start by proving three desirable properties of the relative condition number κ that are also exhibited by the condition number of a matrix, namely its continuity (under mild conditions), and its invariance under scaling and orthogonal change of bases. Proposition 13. The absolute and relative condition numbers κA (A) and κ(A) are continuous in a neighborhood of a robustly r-identifiable tensor A. Proof. This follows immediately from the continuity of Terracini’s matrix Tp in the parameters p, which is a consequence from the assumption of regularity in the definition of robust r-identifiability. Proposition 14. The relative condition number κ is scale-invariant: κ(A) = κ(βA) for all β ∈ F0 . −1 Proof. Let p ∈ fR (A) be norm-balanced. For F = C, we can write βA = αd A = f (αp). Terracini’s matrix corresponding to the norm-balanced VFMs q = αp is Tq = αd−1 Tp . Hence, (ςN (Tq ))−1 = (|α|d−1 ςN (Tp ))−1 . Since kBk/kqk = kαd Ak/kαpk = |α|d−1 kAk/kpk, we get

κ(B) = κA (B)

kBk kBk kAk = (ςN (Tq ))−1 = (|α|d−1 ςN (Tp ))−1 |α|d−1 = κ(A). kqk kqk kpk

In the real case, whenever d is even and β < 0, one should exploit βA = −αd A = f (αp0 ), where p = vecr(p01 , . . . , p0r ) with p0i = (−a1i , a2i , . . . , adi ). Then, Terracini’s matrix corresponding to q = αp0 is Tq = αd−1 Tp S, where S is a diagonal matrix whose diagonal entries are ±1. Since S is an orthogonal matrix, ςN (Tp ) = ςN (Tp S). 0

Proposition 15. The absolute and relative condition numbers κA and κ are orthogonally invariant. Proof. Considering A ∈ FΠ , the claim is that κ(A) = κ((Q1 ⊗ · · · ⊗ Qd )A), where Qi ∈ O(ni ) are orthogonal matrices in Fni ×ni with respect to the Hermitian inner product. As Q = Q1 ⊗ · · · ⊗ Qd is orthogonal, kAk = kQAk. Since QA = (Q1 ⊗ · · · ⊗ Qd )

r X

a1i ⊗ · · · ⊗ adi =

i=1

r X

Q1 a1i ⊗ · · · ⊗ Qd adi ,

i=1

it follows that QA = f (diag(Q1 , . . . , Qd , . . . , Q1 , . . . , Qd )p) = f (U p). As U is orthogonal, kpk = kU pk.  Terracini’s matrix corresponding to the norm-balanced VFMs p0 = U p is Tp0 = T10 · · · Tr0 , where   Ti0 = I ⊗ Q2 a2i ⊗ · · · ⊗ Qd adi · · · Q1 a1i ⊗ · · · ⊗ Qd−1 aid−1 ⊗ I . Multiplying Ti0 on the right by D = diag(Q1 , Q2 , . . . , Qd ) results in Ti0 D = QTi , and hence QTp = Tp0 diag(D, . . . , D). Since Q and diag(D, . . . , D) are orthogonal, one finds ςN (Tp ) = ςN (QTp ) = ςN (Tp0 diag(D, . . . , D)) = ςN (Tp0 ), concluding the proof. The relative condition number of a robustly r-identifiable tensor can be bounded from below. Proposition 16. The relative condition number κ(A) of a robustly r-identifiable tensor A ∈ Fn1 ⊗ · · · ⊗ Fnd is bounded from below by d−1 . −1 Proof. Let p ∈ fR (A) be norm-balanced. Then,

κ(A) = ςN (Tp )−1

kd−1 Tp pk kTp pk kAk = ςN (Tp )−1 = d−1 ςN (Tp )−1 . kpk kpk kpk

Let K be as in (7). Then by assumption on the rank of Tp , KK † = K(K H K)−1 K H is a projector onto the kernel of Tp . Let pTi = [ (a1i )T ··· (adi )T ], and then it follows that KiH pi = 0. Hence, K H p = 0 = KK † p. So p is contained in span(K)⊥ . Therefore, kTp pk/kpk ≥ ςN (Tp ) and the result follows. It is unknown if this lower bound is sharp. In the next subsection, it is shown that this bound √ is not sharp for rank-1 tensors and certain weak 3-orthogonal tensors, both of which have condition number d−1 . 16

5.2. Weak 3-orthogonal rank decompositions The relative condition number of a rank-1 tensor depends only on the order d of the tensor and decreases as d increases. Proposition 17. A rank-1 tensor A = αa1 ⊗ · · · ⊗ ad ∈ Fn1 ⊗ · · · ⊗ Fnd with kak k = 1 and α ∈√R+ strictly positive has absolute condition number κA (A) = α1/d−1 and relative condition number κ(A) = d−1 . Proof. Let p0 = vec(α1/d a1 , α1/d a2 , . . . , α1/d ad ) and p = vec(a1 , a2 , . . . , ad ). Note that they are both normbalanced. One verifies that Tp0 = α1−1/d Tp . Hence, finding the singular values of Tp suffices. The largest singular value of Tp is ς1 (Tp ) = max kTp vk = max kv1 ⊗ a2 ⊗ · · · ⊗ ad + · · · + a1 ⊗ · · · ⊗ ad−1 ⊗ vd k kvk=1

kvk=1

 ≤ max kv1 ⊗ a2 ⊗ · · · ⊗ ad k + · · · + ka1 ⊗ · · · ⊗ ad−1 ⊗ vd k kvk=1   max kv1 k + · · · + kvd k = max kv1 k + · · · + kvd k = 1 2 d 2 kvk=1 kv k +···+kv k =1 √ √ = 2 max2 (|ν1 | + · · · + |νd |) = max kνν k1 ≤ dkνν k2 = d, ν k=1 kν

ν1 +···+νd =1

√ where vT =√[ (v1 )T ··· (vd )T ] and ν =√[ ν1 ··· νd ]. If we take v1 = d−1 vec(a1 , · · · , ad ), then it follows that kTp vk = kd d−1 a1 ⊗ · · √ · ⊗ ad k = d. That√is, v1 is a right singular vector corresponding to the largest √ singular value ς1 (Tp ) = d. Since Tp v1 = da1 ⊗ · · · ⊗ ad = du1 , it follows that u1 is a left singular vector corresponding to ς1 (Tp ). Deflating the largest singular tuple from Tp , we find √   Tbp = Tp − du1 v1H = (I − a1 (a1 )H ) ⊗ a2 ⊗ · · · ⊗ ad · · · a1 ⊗ · · · ⊗ ad−1 ⊗ (I − ad (ad )H ) . The singular values of the above matrix are the square roots of the eigenvalues of its Gram matrix, which has a particularly pleasing structure as it is block diagonal,   TbpH Tbp = diag (I − a1 (a1 )H )2 , (I − a2 (a2 )H )2 , . . . (I − ad (ad )H )2 , which is a consequence of the fact that I − ak (ak )H is a projector onto the orthogonal complement of ak , so that applying it to the latter yields 0. Since I − ak (ak )H is a projector, it is idempotent, and so the eigenvalues of TbpH Tbp are the union of the eigenvalues of the I − ak (ak )H for k = 1, . . . , d. The only possible eigenvalues of a projector are 1 and 0; so eigenvalue 1 has multiplicity equal to the rank of the projector, b which is readily verified to be nk − 1. Hence, the √ singular values of Tp are 1 with multiplicity Σ and 0 with multiplicity d. The singular values of Tp are d, 1 with multiplicity Σ, and 0 with multiplicity d − 1. We can conclude that the N = (Σ + 1)th singular value of Tp0 is α1−1/d , so that the absolute condition number is κA = α1/d−1 . Finally, κ = κA

√ √ kαa1 ⊗ · · · ⊗ ad k = α1/d−1 · α · ( dα1/d )−1 = d−1 , 0 kp k

concluding the proof. The following result characterizes the singular values of Terracini’s matrix corresponding to a normbalanced rank-1 tensor; it is a consequence of the foregoing proof. Corollary 18. Let p = (a1 , a2 , . . . , ad ) with ka1 k = · · · = kad k = α1/d > 0, and let p = vec(p). Then, the singular values of Terracini’s matrix are √ ς(Tp ) = { dα1−1/d , α1−1/d , . . . , α1−1/d , 0, . . . , 0}. | {z } | {z } Σ

17

d−1

Based on the foregoing characterization of the singular values of Terracini’s matrix associated with one rank-1 tensor, we can determine explicitly the condition number of the class of weak k-orthogonal tensors with k ≥ 3. Both “strong” and “weak” 2-orthogonality were considered in [56]; hence the additional qualifier. A tensor rank decomposition is said to be weak k-orthogonal if for every pair of rank-1 tensors, there is orthogonality in k factors, but the factors in which orthogonality occurs need not be the same for different pairs. The next result shows that tensors with a weak 3-orthogonal decomposition admit a closed expression for their condition number. Proposition 19. Let αi ∈ R+ be sorted as α1 ≥ α2 ≥ · · · ≥ αr > 0, and let A=

r X

αi vi1 ⊗ · · · ⊗ vid

with kvik k = 1

i=1

be a robustly r-identifiable weak 3-orthogonal tensor: ∀i < j : ∃1 ≤ k1 < k2 < k3 ≤ d : hvik1 , vjk1 i = hvik2 , vjk2 i = hvik3 , vjk3 i = 0, where h·, ·i is the Hermitian inner product. Then, κA (A) = αr−1+1/d

and

v ,v u r u r X u uX 2/d κ(A) = α−1+1/d t α2 t dα . r

i

i=1

i

i=1

Proof. Let p = vecr(p1 , . . . , pr ) with pi = (a1i , . . . , adi ) be norm-balanced. By weak 3-orthogonality, we have for every 1 ≤ i < j ≤ r and every 1 ≤ k1 , k2 ≤ d that (vi1 ⊗ · · · ⊗ vik1 −1 ⊗ I ⊗ vik1 +1 ⊗ · · · ⊗ vid )H (vj1 ⊗ · · · ⊗ vjk2 −1 ⊗ I ⊗ vjk2 +1 ⊗ · · · ⊗ vjd ) = 0. Hence, TiH Tj = 0 for all i 6= j, so that TpH Tp = diag(T1H T1 , T2H T2 , . . . , TrH Tr ), where Ti is as in (6). The singular values of Tp thus coincide with the union of the set of singular values of each of the Ti ’s. From Corollary 18 it follows that ς(Tp ) =

r √ [ 1−1/d 1−1/d 1−1/d , αi , . . . , αi , 0, . . . , 0}; { dαi i=1 1−1/d

by our assumption on the order of the αi . The squared the N th singular Pr value is thus ςN (Tp ) = αr norm kAk2 = i=1 αi2 because of the orthogonality of the vi1 ⊗ · · · ⊗ vid ’s. The norm of p is clear. √ Corollary 20. If all coefficients αi in Proposition 19 are equal, then the relative condition number is d−1 . Remark 21. The class of weak 3-orthogonal tensors includes rank-1 tensors and orthogonally decomposable tensors (odeco) of order d ≥ 3, which have been studied extensively in [11, 14, 31, 44, 56, 61]. 6. Numerical experiments In this section, the condition number is computed numerically for some tensors. The implementation of this algorithm in Matlab/Octave that was used, is provided at https://arxiv.org/abs/1604.00052. All of the experiments were performed using Matlab R2015a on a computer system consisting of an Intel Core i7-5600U CPU, clocked at 2.6GHz, and 8GB of main memory. Tensorlab v3 [59] was employed for computing approximate tensor rank decompositions. Additional experiments can be found in [55].

18

6.1. The main theorem For illustrating Theorem 10, the quantity on the left hand side of b k/kpk kp − p dist(p, [b p])/kpk ≥ kf (p) − f (b p)k/kf (p)k kf (p) − f (b p)k/kf (p)k

(12)

b are both norm-balanced VFMs. We will be investigated as a proxy for the right hand side, where p and p b such that kf (p) − f (b can estimate the condition number κ by generating a large number of vectors p p)k ≤ εkf (p)k, for some small value of ε, then computing the left-hand side of (12), and finally taking the maximum value over all samples. For example, a (positive) rank-2 decomposition in R3×3×2 was randomly chosen with norm-balanced representatives             5.1518 1.9032  8.8821 7.5082  7.2302 6.9447  and p1 = 10−1 3.6941 , 1.6653 , . p1 = 10−1 4.9802 , 5.4218 , 4.9879 6.7487 5.0806 6.6436 1.1117 5.8845 Let p = vecr(p1 , p2 ) be the VFMs, and let A = f (p) be the corresponding tensor. The relative condition number is κ(A) ≈ 18.410787. Let pi = (a1i , a2i , a3i ). Then, the factor matrices Fk = [ ak1 ak2 ] corresponding to p were perturbed randomly as follows: Fbk = Fk + 10−4 · kFk kF · Xk where Xk has elements sampled from the standard uniform distribution N (0, 1). Starting from these factor matrices, the cpd_nls function b denote the norm-balanced in Tensorlab computed an approximate tensor rank decomposition of A. Let p VFMs obtained by the cpd_nls function. This computation is repeated 1 million times; for every instance where kA−f (b p)k ≤ 10−14 ·kAk, the value on the left-hand side of (12) is computed. This resulted in 278, 527 valid samples. Taking their maximum value yielded approximately 10.7102 as estimate for the condition number. This is actually a poor approximation to κ. In fact, the mean estimate of the condition number over all valid trials was only 1.63618, suggesting that the average small perturbation to A only causes the parameters p to change only about twice as much as the tensor. Section Appendix C.3 proves that the right singular vector v12 corresponding to the N = 12th singular value of Tp is the worst direction of perturbation. If we consider a small perturbation in this direction, e.g., b = p + 10−8 v12 , then we get a substantially more accurate estimate, namely 18.410764, which is a relative p difference of about 1 · 10−6 with respect to the true condition number. 6.2. An ill-conditioned example Consider the following sequence of tensors As = f (ps ) = a11 ⊗ a21 ⊗ (x + 2−s a31 ) + a12 ⊗ a22 ⊗ (x + 2−s a32 ),

(13)

where aki ∈ Rnk are random vectors sampled from a normal distribution, x ∈ Rn3 is any non-zero vector, and ps are norm-balanced VFMs. Every tensor As on this sequence is robustly 2-identifiable by Kruskal’s theorem with probability 1. However, lims→∞ As = (a11 ⊗ a21 + a12 ⊗ a22 ) ⊗ x, which is not 2-identifiable. In fact, A∞ has infinitely many decompositions of length 2. Recall from Lemma 37 of [17] that Terracini’s matrix Tp is of rank strictly less than N , where p = vecr(p1 , p2 ) with pi = (a1i , a2i , x). By continuity in a neighborhood of A∞ this entails that κ(As ) → ∞ as s → ∞, which we now confirm numerically. b s be The decomposition of As can be computed by a direct decomposition algorithm [19, 22, 36]. Let p the norm-balanced VFMs that were computed by such a numerical algorithm when As is provided as input. b s 6= ps in general, so that the algorithm actually computed the decomposition of Because of roundoff errors p b s = f (b b s k/kAs k, which measures how close the a nearby tensor A ps ). The relative backward error is kAs − A rank decomposition problem that was solved by the numerical algorithm was to the true rank decomposition problem. The condition number allows us to asymptotically bound the relative forward error between the computed (norm-balanced) and the true VFMs: −1 b dist(b ps , fR (As )) b s ) · kAs − As k . . κ(A b sk kb ps k kA

19

1 10

Backward error Forward error Forward error upper bound

−2

10−4 10−6 10−8 10−10 10−12 10−14 10−16 0

5

10

15

20

25

30

35

40

45

s Figure 1: The relative backward error, relative forward error, and relative condition number multiplied with the relative backward error of a particular instance of the sequence in (13).

We will investigate four quantities, namely the relative forward error, the condition number, the relative backward error, and the asymptotic upper bound on the right hand side in the last inequality. As a concrete problem instance, consider rank-2 tensors in R13×11×7 whose factor matrices F(s) were generated as follows: A = randn(13,2); B = randn(11,2); C = randn(7,2); x = randn(7,1); F = @(s) {A, B, x*[1 1] + 2^(-s) * C};

The cpd_gevd function in Tensorlab was used for computing a direct decomposition of As for the values s = 1, 2, . . . , 45. For simplicity, kb ps − ps k is used as a proxy for the forward error dist(b ps , [ps ]). The backward error, the proxy of the forward error, and estimated forward error obtained by multiplying the condition number with the backward error are all plotted in function of s in Fig. 1. The figure shows that the forward error grows as s increases, which is driven by the increase of the condition number from roughly O(1) to O(1013 ), confirming the implication of [17, Lemma 37] for condition numbers. Fig. 1 illustrates the perils of investigating only the relative backward error. One sees that the backward error of cpd_gevd is of the order of the machine precision. The forward error, however, is several orders of magnitude larger. In this example, there are no telltale signs that the stability of the computed solution as measured by the forward error is doubtful for s = 1, . . . , 44. The cpd_gevd algorithm behaves exactly the same for s = 1 and s = 44; there is no increase of the computation time or backward error, and the computed decomposition is in most regards normal except for the near rank deficiency in the third factor matrix. Nevertheless, the condition number correctly reveals the instability of the solution. It is only for s ≥ 45 that Tensorlab’s implementation of the ST-HOSVD algorithm [57] detects that the multilinear rank of As is very close to (2, 2, 1), at which point the software prudently refuses to employ the generalized eigenvalue decomposition algorithm, hinting at an ill-conditioned TDP. 6.3. An ill-posed example Recall that for both F = C and R the set of tensors of F-rank bounded by r is in general not closed [20]. This entails that there exist tensors of rank strictly greater than r that can nevertheless be approximated arbitrarily well by tensors of rank r. Such tensors cause problems when trying to approximate them by a rank-r tensor [20]. According to Demmel [21], a problem that is close to an ill-posed problem is often ill-conditioned. We investigate numerically whether this property is admitted by the proposed condition number by considering the example of [20]: f (ps ) = 2s/5 (a1 + 2−s/5 b1 ) ⊗ (a2 + 2−s/5 b2 ) ⊗ (a3 + 2−s/5 b3 ) − 2s/5 a1 ⊗ a2 ⊗ a3 , 20

(14)

Backward error Forward error Forward error upper bound

1 10−2 10−4 10−6 10−8 10−10 10−12 10−14 10−16 10

20

30

40

50

60

70

80

90

100

s Figure 2: The relative backward error, relative forward error, and relative condition number multiplied with the relative backward error of a particular instance of the sequence in (14).

where ak , bk ∈ Fnk are linearly independent vectors and ps are norm-balanced VFMs. Every tensor on this sequence is robustly 2-identifiable by Kruskal’s criterion, and the limit of this sequence of rank-2 tensors is the rank-3 tensor lims→∞ As = b1 ⊗ a2 ⊗ a3 + a1 ⊗ b2 ⊗ a3 + a1 ⊗ a2 ⊗ b3 . As in the previous example, the tensor rank decomposition of As = f (ps ) can be computed by a direct algorithm. As a particular instance, let a1 , b1 ∈ R5 , a2 , b2 ∈ R4 and a3 , b3 ∈ R3 be vectors whose entries were sampled from a standard normal distribution. For all s = 5, 6, . . . , 100, the rank-2 decomposition was b s denote the norm-balanced VFMs obtained from applying this numerical computed with cpd_gevd. Let p b s = f (b algorithm to As , and let A ps ). The relative backward error, the proxy of the relative forward error, b s multiplied with the relative backward error were recorded. These and the relative condition number at A quantities are plotted in Fig. 2. From the figure it can be deduced that the condition number increases from O(1) to O(1010 ) as s increases from 5 to 90, suggesting that the condition number indeed deteriorates as one moves closer to the ill-posed tensor rank decomposition problem. Notice that the upper bound on the forward error seems to stagnate around s = 85. This is a numerical issue in the computation of the condition number. It can be verified that the condition number keeps increasing as s increases when employing variable precision arithmetic in Matlab. Fortunately, the occurrence of these numerical difficulties in computing the condition number can be detected by verifying that the N th singular value of Terracini’s matrix Tps is larger than a small constant multiple of the largest singular value ς1 (Tps ) multiplied with the machine precision (mach ≈ 2.2 · 10−16 in standard double precision in Matlab). The behavior of the condition number is similar in other known examples of sequences tending to an ill-posed tensor rank decomposition problem; see Section 9.5 of [55]. 7. Conclusions Two strategies for defining condition numbers of respectively the TAP and TDP based on the classic framework of conditioning were investigated. A third strategy for the TDP based on the geometric framework of [8, 13] is investigated in a follow-up article by Breiding and the author [12]. Strategy I is based on the modification of the model by explicit elimination of variables, whereas Strategy II relaxes the metric for measuring distances between representatives so that the classic framework applies to the usual over-parameterized model. Terracini’s matrix lies at the heart of the conditioning of robustly ridentifiable tensors for both strategies I and II. It was shown that the condition number of the TDP derived via Strategy II is always less, i.e., better, than the condition number of the TAP derived via Strategy I. In addition, the elimination of variables in Strategy I is an arbitrary choice. For these reasons, only 21

Strategy II was developed in substantial detail. The corresponding condition number equals the inverse of the N th largest singular value of Terracini’s matrix. Provided that the input tensor is robustly r-identifiable, the (relative) condition number multiplied with the (relative) backward error yields an asymptotically sharp upper bound on the (relative) forward error. The analysis of Terracini’s matrix shows that rank-1 tensors are always well-conditioned; their condition number admits a closed expression. The condition number of weak 3-orthogonal tensors was completely described; they are well-conditioned only if the norms of the individual rank-1 terms are approximately of the same order of magnitude. Finally, the numerical experiments provide some preliminary evidence suggesting that tensors in a sequence of robustly r-identifiable tensors whose limit is a tensor of rank strictly larger (see Section 6.3) are ill-conditioned near the limit. The study of conditioning for the tensor rank decompositions is far from complete. A natural question concerns the condition number of more structured decompositions such as nonnegative and (partially) symmetric decompositions. The condition number derived in Theorem 10 provides a natural upper bound for the condition number of such decompositions, but it is unclear to what extent imposing constraints can improve conditioning. The last part of the proof of the main theorem shows that the worst perturbation to the VFMs p is given by the right singular vector corresponding to the N th largest singular value of Terracini’s matrix Tp ; this essentially enables an identification of the components of the VFMs that are most sensitive to perturbations of the tensor. I believe that this is a useful property in data analysis applications. Acknowledgements This paper benefited immensely from the discussions I had with B. Jeuris, K. Meerbergen, G. Ottaviani, and G. Tomasi. I am particularly indebted to J. Nicaise for inquiring about an intrinsic definition of the condition number. I thank P. Breiding and L.-H. Lim for detailed comments they provided on an earlier version of this manuscript. An anonymous reviewer is heartily thanked for his or her detailed comments and questions that greatly improved this lengthy paper. Appendix A. Proof of the identifiability propositions Proof of Proposition 7. Let   D = diag (α2 · · · αd )−1 In1 , α2 In2 , . . . , αd Ind α2 , . . . , αd ∈ F0 ,  B = diag(D1 , D2 , . . . , Dr ) | D1 , D2 , . . . , Dr ∈ D ,  P = P ⊗ IΣ+d | P is an r × r permutation matrix , and  T = P B | P ∈ P and B ∈ B .

(A.1a) (A.1b) (A.1c) (A.1d)

It is straightforward to verify that both B and P are multiplicative groups. For every P ⊗ IΣ+d ∈ P and diagonal matrices Di ∈ FΣ+d×Σ+d we have (P ⊗ IΣ+d ) · diag(D1 , D2 , . . . , Dr ) · (P ⊗ IΣ+d )T = diag(Dπ1 , Dπ2 , . . . , Dπr ),

(A.2)

where π is the permutation represented by P . Hence, (A.2) defines a map ψ : P → Aut(B) that takes x ∈ P to ψx ∈ Aut(B) defined by ψx : y 7→ xyx−1 . It follows (see, e.g., [45, Chapter 7]) that T is the semidirect product T = P o B, which is thus a multiplicative group. Writing p ∼ q if and only if p = T q for some T ∈ T , it follows that ∼ is an equivalence relationship and that f is a morphism for ∼. Proof of Proposition 9. Write x = vecr(p1 , . . . , pr ) with pi = (a1i , . . . , adi ), and y = vecr(q1 , . . . , qr ) with qi = (b1i , . . . , bdi ). Let T be as in the proof of Proposition 7. Note that h(T ) = kx − T yk is a coercive function in the parameters defining T ∈ T , so that for a fixed input the minimizer of the optimization problem is always attained. Indeed, since I ∈ T , it follows that the optimal (P ⊗ IΣ+d )B ∈ T satisfies kx − (P ⊗ IΣ+d )Byk2 =

r X

k vec(pi ) − Dπi vec(qπi )k2

i=1

22

=

r X d X

kaki − θk,πi bkπi k2 +

r X

i=1 k=2

ka1i − (θ2,πi · · · θd,πi )−1 b1πi k2 ≤ kx − yk2 < ∞

i=1

−1 −1 where B = diag(D1 , . . . , Dr ), Di = diag(θ2,i · · · θd,i In1 , θ2,i In2 , . . . , θd,i Ind ) and π is the permutation reprer×r sented by the permutation matrix P ∈ F . Since we have a sum of positive reals, it follows that all |θk,i | are uniformly bounded from above by some positive constant C, and, hence there is also a uniform lower bound 0 < c. So we can optimize θk,i over the bounded intervals [−C, −c] ∪ [c, C] leading to the conclusion that the infimum is attained. Now the last claim about dist(x, [y]) = 0 is trivial.

Appendix B. The Iterated Scaling Lemma For proving the main result, we need the following technical lemma. Let D be as in (A.1). I will abuse notation, writing q ∈ Dp when vec(q) ∈ D vec(p) is meant, where q, p ∈ Fn1 × · · · × Fnd . Lemma 22 (Iterated Scaling). Let SF be a Segre variety. For i = 1, 2, . . . , r, let pi = (a1i , . . . , adi ), ∇i = (n1i , . . . , ndi ), and qi = pi + ∇i = (a1i + n1i , . . . , adi + ndi ), where Seg(pi ), Seg(qi ) ∈ SF . Assume that the perturbation ∇ = vecr(∇1 , ∇2 , . . . , ∇r ) is of sufficiently small norm: ∇k ≤ k∇

 1 −1 1 λ = d+4 · min min ka1i k−1 kaki k2 . 2 2 (d − 1)3/2 1≤i≤r 1≤k≤d

(B.1)

If Terracini’s matrix Tp associated with the VFMs p = vecr(p1 , . . . , pr ) has rank N = r(Σ + 1) and if ∇ is contained in its kernel, i.e., Tp∇ = 0, then there is a representative p˙ i ∈ Dpi such that pi +∇i = qi = p˙ i +∆i ∆i k ≤ 2λk∇ ∇i k2 ≤ k∇ ∇i k; herein, ∇i = vec(∇i ) and Ki as in (7). In with ∆i = vec(∆i ) ∈ span(Ki )⊥ and k∆ other words, letting K be as in (7), there exists a factorization ∆k ≤ 2λk∇ ∇k2 ≤ k∇ ∇k. p + ∇ = p˙ + ∆ with p˙ ∼ p, ∆ ∈ span(K)⊥ , and k∆ Furthermore, letting Tp˙ denote Terracini’s matrix in p˙ = vecr(p˙ 1 , . . . , p˙ r ), then ˙ Tp˙ = Tp D˙ = Tp (I + E), ∇k tends to zero; specifically, there is a constant where the diagonal matrix D˙ tends to the identity as k∇ ˙ 2 = kD˙ − Ik2 ≤ Ck∇ ∇k. C > 0 such that kEk The rest of this appendix is devoted to the proof. We show the existence of a linearly convergent sequence (k) of representatives pi ∈ Dpi , k = 1, 2, . . ., with the following properties (k)

qi − pi

(k)

= ∆i

(k)

+ ∇i

(k)

with ∆ i

(k)

(k)

= vec(∆i ) ∈ span(Ki )⊥ and ∇ i

(k)

= vec(∇i ) ∈ span(Ki ).

(H0a)

The sequence will be constructed in such a way that ∇(k) lim k∇ i k → 0.

(H0b)

k→∞ (k)

(k)

(k)

Under the assumptions of the lemma both pi = vec(pi ) and ∆ i will be of uniformly bounded norm, (k) (k) so that a convergent subsequence exists for which both limk→∞ ∆ i → ∆ i = vec(∆i ) and limk→∞ pi → p˙ i = vec(p˙ i ) are well defined. It remains to show that a sequence satisfying (H0a) and (H0b) exists.

23

Appendix B.1. A recurrence relation (1) (1) (1) (1) (1) Let pi = pi and ∆i = 0, and define ∇i = ∇i . Since ∇(1) = vecr(∇1 , . . . , ∇r ) = ∇ is contained in P (1) (1) r the kernel, we have Tp∇ (1) = i=1 Ti∇ i = 0, where Tp and Ti are as in (6). Suppose that Ti∇ i 6= 0, so (1) that ∇ i would not be contained in the span of Ki . Then Tp ’s rank would be strictly less than the expected value r(Σ + 1), which is a contradiction. Hence, the base case k = 1 of (H0a) is true. Assume now that the statement is true for all l = 1, 2, . . . , k, and then we show that it holds for k + 1 as well. Since Ki is a basis of the kernel of Ti , we can express   (k) (k) (k) ∇ i = k2i · · · kdi vi = Ki vi (B.2) (k)

for some vi

(k)

∈ Fd−1 . By (H0a), pi

∈ Dpi , so that we can write it explicitly as (k)

pi

(k)

(k)

(k)

= (γ1,i a1i , γ2,i a2i , . . . , γd,i adi );

(B.3)

(1)

this is true with γj,i = 1 for the base case, and will follow shortly for the induction step as well. Consider (k)

the perturbed representative pi of Ti as in (B.2), we find that (k)

zi

(k)

= pi

(k)

+ ∇i

(k)

(k)

+ ∇i . Writing ∇i

with respect to the particular basis Ki of the kernel

 (k) (k) (k) (k) (k) (k) (k) = (γ1,i + v2,i + · · · + vd,i )a1i , (γ2,i − v2,i )a2i , . . . , (γd,i − vd,i )adi ,

(k)

(k)

(k)

(k)

where vj,i denotes the (j − 1)th element of vi . Consequently, Seg(zi ) and Seg(pi ) are multiples of each other. The following representative (k+1)

d Y

:=

pi

 (k) (k) (k) (k) (k) (k) (γj,i − vj,i )−1 a1i , (γ2,i − v2,i )a2i , . . . , (γd,i − vd,i )adi ∈ Dpi

(B.4)

j=2

will induce the required sequence. Define (k) zi



(k+1) pi

=



(k) γ1,i

+

d X

(k) vj,i



j=2 (k+1)

where ∇ i

(k+1)

= vec(∇i

d Y

(k) (γj,i



(k) vj,i )−1

(k+1)

(k)



a1i , 0, . . . , 0

(k+1)

=: ∇i

b (k+1) , +∆ i

(B.5)

j=2

b ) ∈ span(Ki ) and vec(∆ i zi



(k)

= pi

(k)

+ ∇i

) ∈ span(Ki )⊥ . Then,

(k+1)

= pi

(k+1)

+ ∇i

b (k+1) . +∆ i

Since this must be inductively true for l = 1, 2, . . . , k, we find that (1)

qi = zi

(k+1)

= pi

(k+1)

+ ∇i

+

k+1 X

b (j) = p(k+1) + ∇(k+1) + ∆(k+1) . ∆ i i i i

(B.6)

j=2

This proves (H0a). Appendix B.2. Convergence (k)

It remains to show (H0b), i.e., the sequence of ∇ i ’s converges to zero. We demonstrate that k−1 ∇(k) ∇ i kk k∇ k∇ i k≤λ

for some real constant λ > 0;

(H1)

for the base case k = 1 this is true by definition. We can assume the truth of the above statement for l = 1, 2, . . . , k. From (B.5) it immediately follows that ∇(k+1) k∇ k i



(k) kzi



(k+1) pi k

=

ka1i k

d d Y (k) X (k) (k) (k) −1 · γ1,i + vj,i − (γj,i − vj,i ) , j=2

24

j=2

(k)

(k)

where zi = vec(zi ). Our efforts will be focused on showing that the right hand side is bounded by ∇i kk+1 . The boundedness of k∆ ∆(k+1) λk k∇ k would then follow from the observation that i ∆(k+1) k∆ k≤ i

k+1 X

k X

b (j) k ≤ k∆ i

j=2

(j)

(j+1)

kzi − pi

k X

k≤

j=1

∇i kj+1 ≤ λ−1 λj k∇

j=1

∞ X

∇i k)j+1 = λ (λk∇

j=1

∇i k2 k∇ , ∇i k 1 − λk∇

(j)

b b (j) ), and provided that λk∇ ∇i k < 1 but this will be satisfied by our assumptions. The where ∆ = vec(∆ i i (k) boundedness of pi is then an immediate consequence of (B.6) and the uniform boundedness of qi = vec(qi ), (k+1) ∇(k+1) ∇i k ≤ 21 , it follows that k∆ ∆(k+1) ∇i k, and ∆i . In fact, since we will assume (I3), i.e., λk∇ k ≤ 2 · 12 k∇ i i ∆ which already proves the bound on k∆ i k. (k) (k+1) For bounding kzi − pi k from above, we proceed as follows. During our derivations, we will assume some additional convenient constraints on certain quantities; they will be considered more carefully in the (k) next subsection. Consider the coefficient of a1i in (B.5). First, a bound on the vj,i ’s is obtained as follows. Since Ki ∈ F(Σ+d)×(d−1) is a basis for the kernel of Ti it has linearly independent columns. Therefore, (k) (k) (k) the pseudoinverse is Ki† = (KiH Ki )−1 KiH , and as ∇ i ∈ span(Ki ) one has ∇ i = Ki Ki†∇ i , because (k) (k) Ki Ki† is a projector onto the column span of Ki ; so vi = Ki†∇ i . If λi (A) denotes the ith largest − 1 eigenvalue of a Hermitian matrix A ∈ Fm×m , then it is well-known that kKi† k2 = λd−1 (KiH Ki ) 2 . One can verify that KiH Ki = diag(ka2i k2 , ka3i k2 , . . . , kadi k2 ) + ka1i k2 11T , where 1 is a vector of length d − 1 containing only ones. From [60, Section 2.41] it follows that the eigenvalues of KiH Ki satisfy λj (KiH Ki ) ≥ λj diag(ka2i k2 , ka3i k2 , . . . , kadi k2 ) , j = 1, 2, . . . , d − 1, so in particular kKi† k2 ≤



−1  −1 min kaki k ≤ min kaki k =: χi .

2≤k≤d

(B.7)

1≤k≤d

(k)

(k)

We can now bound the (j − 1)th element vj,i , j = 2, 3, . . . , d, of vi

as follows:

∇i k ≤ χi k∇ ∇i k ≤ χi λk−1 k∇ ∇i kk , |vj,i | ≤ kvi k ≤ kKi† k2 k∇ (k)

(k)

(k)

(k)

(B.8)

where the last step is due to the induction hypothesis (H1). (0) For convenience, we define j,i = 0 and (k)

j,i :=

k X

(`)

(k)

vj,i , so that |j,i | ≤

`=1

k X

(`)

|vj,i | ≤ χi λ−1

`=1

∞ X ∇i k)` = (λk∇ `=1

∇i k χi k∇ =: Ci0 ; ∇i k 1 − λk∇

(B.9)

∇i k < 1, which will be satisfied by our assumptions. We additionally the penultimate equality holds for λk∇ assume the following bound Ci0 ≤

1 . 2(d − 1)

(I0)

Then, for j = 2, 3, . . . , d we find from (B.3) and (B.4) that (k+1)

γj,i

(k)

(k)

(1)

:= γj,i − vj,i = · · · = γj,i −

k X

(`)

(k)

vj,i = 1 − j,i ,

j = 2, 3, . . . , d,

(B.10a)

`=1 (`)

where we used that this formula also applies by induction for γj,i with ` = 1, 2, . . . , k. For j = 1 we have (k+1)

γ1,i

=

d Y j=2

(k)

(k) −1

γj,i − vj,i

=

d Y

(k−1)

1 − j,i

j=2

(k) −1

− vj,i

=

d Y j=2

25

(k) −1

1 − j,i

.

(B.10b)

With the above observations, the coefficient of a1i in (B.5) can be written as (k)

γ1,i +

d X

(k)

(k+1)

vj,i − γ1,i

d Y

=

j=2

(k−1) −1

1 − j,i

d X

+

j=2

(k)

vj,i −

j=2

d Y

(k) −1

(k−1)

1 − j,i

− vj,i

.

(B.11)

j=2

For proceeding, we will need some convoluted series expansions; however, the key idea revolves around expanding (1 − x)−1 with x ≈ 0 by a Maclaurin series. Recall that n Y

n X ∞ Y

(1 − xj )−1 =

xκj =

j=1 κ=0

j=1

∞ X

n Y

X

`

xj j ,

(B.12)

κ=0 k``k1 =κ j=1

Pn where ` = [ `1 `2 ··· `n ] ∈ Nn and k``k1 = j=1 |`j | is the 1-norm of ` ; in the expression it is assumed that |xj | < 1 so that all expansions are absolutely convergent. The last term in (B.11) can be rewritten as follows: Z :=

d Y

(k−1)

1 − j,i

(k) −1

− vj,i

=

j=2

d Y

(k−1) −1

1 − `,i

=

(k−1)

(k−1) −1 `,i

1−

=

1 − j,i

d  Y

(k−1) −1 `,i

1−

1−

vj,i

−1

(k−1)

1 − j,i

(k)

∞ d  X X Y

vj,i

 `j

(k−1)

κ=0 k``k1 =κ j=2

`=2

(k)

− vj,i

(k)

j=2

`=2 d Y

1 − j,i

j=2

`=2 d Y

(k−1)

d Y

1 − j,i

,

where the last step was by (B.12), which requires that (k)

(k−1)

|vj,i | < |1 − j,i

|;

(I1)

this hypothesis will be investigated later. Let us write Z = S0 + S1 + S∞ , where S0 =

d Y

1−

(k−1) −1 j,i

=

(k) γ1,i ,

S1 =

j=2

S∞ =

d Y

d Y

1−

(k−1) −1 `,i

·

1−

(k−1) −1 `,i

·

∞ X

vj,i

 `j

(k−1)

κ=2 k``k1 =κ j=2

`=2

(k)

d  X Y

1 − j,i

vj,i

(k−1)

j=2

`=2

(k)

d X

1 − j,i

, and

.

We will need to apply (B.12) again to S1 for obtaining the desired result. S1 =

d X

(k−1)

 (k) vj,i 1 +

j=2

=

d X

(k)

vj,i 1 +

=

(k) vj,i

+

1

d X

j=2

j=2

| {z }

|

S10

d  Y (k−1) −1 · 1 − `,i (k−1)

1 − j,i



j=2 d X

j,i

`=2

(k−1)  j,i (k−1) − j,i

(k) vj,i

∞ X

∞ d   X X Y (k−1) `l · 1+ l,i κ=1 k``k1 =κ l=2

X

d Y

(k−1) `l l,i



κ=1 k``k1 =κ l=2

+

d X j=2

{z

}

S100

|

(k−1)

(k) vj,i

j,i 1−

(k−1) j,i

d Y

·

(k−1) −1

1 − `,i

.

`=2

{z

S1000

}

Hence, (B.11) equals S0 + S10 − Z = S0 + S10 − (S0 + S10 + S100 + S1000 + S∞ ) = −(S100 + S1000 + S∞ ). From (B.9), (k−1) (k−1) (k−1) 1 − j,i ≥ 1 − |j,i | ≥ 1 − Ci0 , and as |j,i | ≤ 12 by (I0), we get that (k−1) −1

Ci := (1 − Ci0 )−1 ≥ 1 − j,i 26

(k−1) −1 = 1 − j,i ;

this bound depends neither on j nor k. Consequently, (k) d d  X Y  `j X Y (k) `j vj,i κ v ≤ Ciκ kv(k) ⊗ · · · ⊗ v(k) k1 ≤ Ciκ (d − 1)κ/2 kv(k) kκ , ≤ C i j,i i i i (k−1) k``k1 =κ j=2 k``k1 =κ j=2 1 − j,i (B.13) where the second step is because every ` with k``k1 = κ can be identified with (1, . . . , 1, 2, . . . , 2, . . . , d − 1, . . . , d − 1) | {z } | {z } | {z } `1

`2

`d−1

(k)

(k)

whose length is κ. From this identification, it follows that the symmetric part of vi ⊗ · · · ⊗ vi P Qd (k) `j precisely all summands of k``k1 =κ j=2 vj,i . As a result, S∞ can be bounded as follows |S∞ | ≤ Cid−1

∞ X κ=2

contains

√ ∇i k2k C d+1 (d − 1)χ2i λ2k−2 k∇ (k) κ √ , Ci d − 1kvi k ≤ i ∇i kk 1 − Ci d − 1χi λk−1 k∇

(k)

where (B.8) was used to bound kvi k in the second inequality; herein, we assumed √ 1 ∇i kk ≤ , Ci d − 1χi λk−1 k∇ 2

(I2)

so that the series is convergent. Analogously to the derivation of (B.13), one finds d X Y κ (k−1) `j (k−1) (k−1) (k−1) κ j,i ⊗ · · · ⊗ i k1 ≤ (d − 1)κ/2 ki k ≤ (d − 1)Ci0 , ≤ ki

(B.14)

k``k1 =κ j=2 (k)

where  i

=



(k−1)

2,i

|S100 | ≤

(k−1)

··· d,i

 . The last step was due to (B.9). A bound for S100 is

∞ d X (k) X κ v (d − 1)Ci0 = j,i κ=1

j=2

d (d − 1)Ci0 X (k) (d − 1)Ci0 (k) vj,i = kv k1 ; 0 1 − (d − 1)Ci j=2 1 − (d − 1)Ci0 i

the Maclaurin series is convergent because of (I0). A bound for S1000 is |S1000 | ≤

d X j=2

(k)

|vj,i |Ci0 Ci

d Y

|1 − `,i,k−1 |−1 ≤ Ci0 Cid

d X

(k)

(k)

|vj,i | = Ci0 Cid kvi k1 .

j=2

`=2

It follows from the foregoing two bounds that |S100 | + |S1000 | ≤

 (d − 1)C 0    √ d−1 (k) i 0 d 0 d k−1 ∇i kk + C C kv k ≤ C d − 1 + C k∇ 1 i i i i χi λ i 1 − (d − 1)Ci0 1 − (d − 1)Ci0   2√ d−1 d χi d − 1 k−1 ∇i kk+1 , ≤ + C λ k∇ i ∇i k 1 − (d − 1)Ci0 1 − λk∇

√ (k) (k) d−1 d where we used kvi k1 ≤ d − 1kvi k, (B.8), and (B.9). Let ζ = 1−(d−1)C 0 + Ci . Combining the above i with the bound for S∞ , we get √  √ ∇i kk−1  k−1 ζ Cid+1 d − 1λk−1 k∇ 2 √ ∇i kk+1 . |(B.11)| ≤ χi d − 1 + λ k∇ ∇i k 1 − Ci d − 1χi λk−1 k∇ 1 − λk∇ ∇ i kk 27

∇i kk+1 . So, it suffices showing that Hence, it suffices proving that the right hand side is less than ka1i k−1 λk k∇ √  √ ∇i kk−1  ζ Cid+1 d − 1λk−1 k∇ 1 2 √ ≤ λ. (B.15) kai kχi d − 1 + ∇i k 1 − Ci d − 1χi λk−1 k∇ 1 − λk∇ ∇i kk Let us assume additionally that ∇i k ≤ λk∇

1 . 2

(I3)

Exploiting (I0), (I2), and (I3), we find that (B.15) is implied by the inequality √ √  ka1i kχ2i d − 1 4(d − 1) + 2d+1 + 2d−k+3 d − 1 ≤ λ.

(I4)

Note that λ > 0 is a free parameter, so it can simply be chosen so as to satisfy the above inequality. For  instance, let us choose λ = 2d+3 (d − 1)3/2 · max1≤i≤r χ2i ka1i k . If (I0), (I1), (I2), (I3) and (I4) are true, then this proves (H1) and (H0b). Appendix B.3. Eliminating assumptions We show that (B.1) implies (I0), (I1), (I2) and (I3). Assumption (I0) can be eliminated as follows: (I0) ⇔

  ∇i k 1 1 k∇ ∇i k ≤ ≤ χ−1 ⇐ k∇ χ−1 ∧ (I3), i i ∇i k 1 − λk∇ 2(d − 1) 4(d − 1) | {z } (I5)

where ∧ denotes conjunction. The last statement also implies (I1), because ∇i kk < 1 − (I1) ⇐ χi λk−1 k∇

 ∇i k  ∇i k χi k∇ χi k∇ ∇i k < 1 − ⇐ χi k∇ ∧ (I3) ∇i k ∇i k 1 − λk∇ 1 − λk∇     1 ∇i k 1 + ⇐ χi k∇ < 1 ∧ (I3) ∇i k 1 − λk∇  1 −1 ∇i k ≤ 3 χi ∧ (I3). ⇐ k∇

The elimination of (I2) proceeds as follows:   1 1 −1 √ ∇ i kk ≤ √ ∇ χ−1 ⇐ k∇ k ≤ χ ∧ (I0) ∧ (I3) ⇐ (I5) (I2) ⇔ λk−1 k∇ i i i 2 d − 1Ci 4 d−1 ∇i k ≤ 21 λ−1 is the strongest bound, because we have Hence, (I3), or, equivalently, k∇ ∇i k ≤ k∇

 1 −1 1 1 1 1 λ = d+2 √ · min ka1j k−1 χ−2 < χ−1 < χ−1 , j 2 4(d − 1) i 3 i 2 d − 1 4(d − 1) 1≤j≤r

where we exploited ka1j k−1 χ−1 = ka1j k−1 min1≤k≤d kakj k ≤ ka1j kka1j k−1 = 1. Thus, (I3) implies (I0), (I1), j (I2), (I3), (I4), (I5), (H0a), (H0b), and (H1) for all i = 1, 2, . . . , r simultaneously, hereby concluding the main proof. Appendix B.4. Alternative formulation It is clear that taking   ∆ T = ∆ T1 · · · ∆ Tr

 and p˙ T = vec(p˙ 1 )T

···

vec(p˙ r )T



∆k ≤ k∇ ∇k. By assumption, the columns of K form a basis of the kernel of Tp . Hence, satisfies p˙ ∼ p and k∆ the Moore–Penrose pseudoinverse of K is K † = (K H K)−1 K H . As KK † is a projector onto the column Pr span of K, the claim ∆ ∈ span(K)⊥ is equivalent with KK †∆ = 0. Then, K H ∆ = i=1 KiH ∆ i = 0. 28

Appendix B.5. Consequences for Terracini’s matrix (k)

Consider the definition of the representative pi (k)

lim pi

k→∞

in (B.3). The limit for k → ∞ can be written as

d   Y (k) −1 (k)  (k)  = a1i · lim 1 − j,i , a2i · lim 1 − 2,i , . . . , adi · lim 1 − d,i k→∞

k→∞

j=2

k→∞

= (γ1,i a1i , γ2,i a2i , . . . , γd,i adi ) = p˙ i

(B.16)

because of (B.10). Note that the series (∞)

γj,i = 1 − j,i = 1 −

∞ X

(k)

vj,i > 0,

j = 2, 3, . . . , d,

k=1

are absolutely P∞convergentk because the terms in the infinite sequences can be bounded as in (B.8), and the ∇k) is absolutely convergent by (I3). Since sum χi λ−1 k=1 (λk∇ γ1,i = (γ2,i γ3,i · · · γd,i )−1 , the above γj,i yield the explicit expressions for the coefficients of p˙ i . Notice that the above furnishes (k) an alternative proof that the sequence of pi ’s converges. It follows immediately from the definition of Terracini’s matrix in (6) that we can write ! d d Y Y  T˙i = Ti · diag In · γj,i , . . . , In · γj,i = Ti · diag γ −1 In , γ −1 In , . . . , γ −1 In = Ti D˙ i , 1

1,i

d

j=1 j6=1

1

2,i

2

d,i

d

j=1 j6=d

so that Terracini’s matrix Tp˙ in the points p˙ i is given by   ˙ Tp˙ = T˙1 · · · T˙r = Tp · diag(D˙ 1 , D˙ 2 , . . . , D˙ r ) = Tp D. −1 ∇k. For j = 2, 3, . . . , d, we have So it suffices demonstrating that |γj,i − 1| ≤ Ck∇ (∞)

−1 |γj,i − 1| = |(1 − j,i )−1 − 1| ≤

∞ X

(∞)

|j,i |κ ≤

κ=1

∞ X

∇i k)κ ≤ 4k∇ ∇k max χi , (2χi k∇

(B.17)

1≤i≤r

κ=1

∇i k ≤ where in the second equality we used (B.9) and (I3), and where in the last step we used 2χi k∇ is true because of (I3). The case of j = 1 is due to

1 2

which

d ∞ d d d−1 X X Y X Y X Y (∞)  (∞) `j (∞) −1 |γ1,i − 1| = −1 + 1 − j,i = −1 + 1 + (−1)κ j,i |j,i |`j ≤ κ=1 k``k1 =κ, ` ≤1

j=2



∞ X κ=1

(d − 1)Ci0



=

j=2

(d − 1)Ci0 ∇k max χi ; ≤ 4(d − 1)k∇ 1≤i≤r 1 − (d − 1)Ci0

κ=1 k``k1 =κ j=2

(B.18)

herein, the inequality ` ≤ 1 is meant componentwise, the second inequality is by (B.14), and the last step used (B.9), (I0) and (I3). Letting C = 4(d − 1) max1≤i≤r χi then concludes the proof. Appendix C. Proof of the main theorem Let N be a connected neighborhood of A where all tensors are robustly r-identifiable. Let A0 ∈ N be arbitrary. Then, there exist several ∆p0 such that A0 = f (p + ∆p0 ). The first step consists of enforcing a 29

−1 unique choice. By r-identifiability, fR (A0 ) = [p+∆p0 ], so that the set of rank-1 tensors in the decomposition 0 of A is unique. These rank-1 tensors can be ordered uniquely with respect to the lexicographic total order ≤ on FΠ .10 Let A0 = f (x), where x = vecr(q1 , . . . , qr ) ∈ FM with qi = (x1i , x2i , . . . , xdi ) is chosen so that the corresponding rank-1 tensors Seg(qi ) are sorted. The following set is uniquely defined regardless of x:   n o (k) C = jk,i = min arg max|x`,i | k = 2, 3, . . . , d, i = 1, 2, . . . , r , (C.1) 1≤`≤nk

(k)

(k)

where x`,i is the `th element of xki ∈ Fnk . Note that |xjk,i ,i | = kxki k∞ > 0 and that |C| = r(d − 1). By r-identifiability there is just one choice of VFMs p + ∆p = vecr(q01 , . . . , q0r ) with (k)

q0i = (b1i , . . . , bdi ), Seg(q01 ) < · · · < Seg(q0r ), and ∀jk,i ∈ C : bjk,i ,i = 1

(C.2)

such that A0 = f (p + ∆p). The order of the rank-1 tensors is strict, for otherwise the rank of A0 would be strictly less than r. Given a fixed p and A0 , ∆p is uniquely determined by the foregoing procedure. −1 Next, we write dist(p, fR (A0 )) as a norm. Let A0 = A + ∆A = f (p + ∆p). By r-identifiability, its −1 0 inverse is fR (A ) = [p + ∆p]. Let T be as in (A.1). Choose any Sb ∈ Z = arg minkp − S(p + ∆p)k;

(C.3)

S∈T

note that Sb depends on A0 , but for brevity such dependencies are not made explicit in the notation. We can b + ∆p) = p + ∆p, g so that for every A0 ∈ N the following holds: write S(p g A0 = f (p + ∆p) = f (p + ∆p). g = dist(p, f −1 (A + ∆A)), so we can just analyze this norm. It is important to note that By definition, k∆pk R g depends on the choice of S, b the norm of this vector is independent of this choice. while ∆p g depend on ∆A, which is arbitrary, In the remainder of the proof, most considered quantities, such as ∆p, and the choice of Sb (which, in turn, depends on ∆A). To be precise, one could indicate these dependencies g b in the notation by writing, e.g., ∆p(∆A, S(∆A)), but this will be avoided wherever no confusion can arise. Appendix C.1. Continuity For convenience, let 0 G = {∆A ∈ FΠ 0 | k∆Ak ≤  and A + ∆A ∈ σr (SF )}.

We prove that g lim max k∆p(∆A)k →0

→0 ∆A∈G

(C.4)

by showing that N is locally diffeomorphic to FN and choosing a particularly suitable chart (U, φ) such that the Euclidean distance on the chart provides an upper bound on the foregoing expression. (k) (k) Let C be defined as in (C.1) but replacing x`,i by a`,i , where the latter is the `th element of aki . Let (k) (d−1)×r bC,C , fC,C , πC , π −1 , and ηC,C be defined as in C = [ajk,i ,i ]d,r , where the jk,i ∈ C. Let U C,C k=2,i=1 ∈ F Section 3, substituting the general C and C for the aforementioned choices. Then by definition A lies in the image of fC,C ; in fact the construction is such that A = fC,C (p) where p = πC (ηC,C (p)) = πC (p). I claim that fC,C is a local diffeomorphism. By the definition of robust r-identifiability, N is an open neighborhood of a smooth F-manifold of dimension N . Hence, there exists a chart (U, φ) where U ⊂ N is an open 10 For

F = C ' R2 , one should consider the lexicographic order after the identification CΠ = R2Π .

30

neighborhood of A and a local diffeomorphism φ between U and an open neighborhood V = φ(U ) ⊂ FN . Restrict N to the open neighborhood U ⊂ N . Let A = φ−1 (q) ∈ N . Since φ is a local diffeomorphism, φ◦φ−1 = IdFN . Hence, the Jacobian matrix at q ∈ V of the composition φ◦φ−1 , namely Jφ (φ−1 (q))Jφ−1 (q), is the N × N identity matrix. Since Jφ−1 (q) ∈ FΠ×N and Jφ (φ−1 (q)) ∈ FN ×Π with Π ≥ N , it follows that both matrices are of maximal rank N . By definition, Jφ−1 (q) is contained in the tangent space TA σr (SF ). Let J p ∈ FΠ×N denote the Jacobian matrix of fC,C at p. By the assumption on robust r-identifiability  and (10), span(J p ) = span(Tp ) = TA σr (SF ). Since dim span Jφ−1 (q) = N = dim span(J p ) it follows that Jφ−1 (q)Z = J p for an invertible matrix Z ∈ FN ×N . Hence, the Jacobian matrix of the composition g : φ ◦ fC,C at p is given by Jφ (fC,C (p))J p = Jφ (φ−1 (q))Jφ−1 (q)Z = Z, where the first equality is because A = fC,C (p) = φ−1 (q). Now, the smooth function g : FN → FN has nonsingular Jacobian matrix Z, hence by the Inverse Function Theorem [24, Theorem 1A.1] it has a smooth inverse function g −1 . Thus, g is a −1 local diffeomorphism between a neighborhood of p ∈ FN and a neighborhood of q ∈ FN . Letting now fC,C be the composition g −1 ◦ φ, which is a composition of local diffeomorphisms, it follows that fC,C is a local diffeomorphism between an open neighborhood of p ∈ FN and an open neighborhood of A ∈ N . Further restrict N to the neighborhood where this local diffeomorphism is defined. For an arbitrary choice of A0 = f (p + ∆p) ∈ N there exists an S ∈ T such that  A0 = A + ∆A = f (p + ∆p) = fC,C (πC (ηC,C (p + ∆p))) =: fC,C πC S(p + ∆p) . −1 is a local diffeomorphism, it follows that ∆A → 0 implies ∆p → 0. Let ∆p := πC (S(p + ∆p)) − p. As fC,C −1 bC,C , so that their values at all positions Both p = π (p) and S(p + ∆p) = π −1 (p + ∆p) live in U C,C

C,C

−1 jk,i ∈ C agree. As πA (x) is analytic, ∆p → 0 implies that (S(p + ∆p) − p) → 0 as well. By definition, g k∆pk = minS∈T kp − S(p + ∆p)k ≤ kp − S(p + ∆p)k, (C.4) follows.

Appendix C.2. Sandwiching It remains to demonstrate that the next equalities hold: lim max

→0 ∆A∈G

g −1 k∆p(∆A)k kx(∆A)k = lim max = ςN (Tp ) , →0 ∆A∈G kTp x(∆A)k k∆Ak

g where x(∆A) ∈ FM . The key consists of exploiting the Iterated Scaling Lemma for transforming any ∆p into a new vector x(∆A) that is orthogonal to Tp ’s kernel. Since Terracini’s matrix Tp has a kernel K = span(K) of dimension exactly equal to r(d − 1), we can g as ∆p g = ∆ + ∇ , where ∆ ∈ K⊥ and ∇ ∈ K. Note that h∇ g and ∇, ∆ i = 0, so that k∇ ∇k ≤ k∆pk factorize ∆p g From Lemma 22 it follows that for small k∆Ak > 0, and, hence, small k∆pk g > 0, we have ∆k ≤ k∆pk. k∆ p + ∇ = p˙ + ∆ 0

∆0 k ≤ k∇ ∇k. with p˙ ∼ p, ∆ 0 ∈ K⊥ , and k∆

˙ = ∆ 0 + ∆ ∈ K⊥ . As particular consequences, we have that f (p) = f (p) ˙ ). g = f (p˙ + ∆ ˙ and f (p + ∆p) Let ∆ Observe that f is an analytic multivariate polynomial that is homogeneous of degree d in M variables, so that it has a finite, convergent Taylor series expansion at every point. Since Tp˙ = Tp + Tp E˙ by Lemma 22, the expansion about the VFMs p˙ is ˙ ) = f (p) ˙ + O(k∆ ˙ k2 ) = f (p) + Tp∆ ˙ + Tp E˙ ∆ ˙ + O(k∆ ˙ k2 ). g = f (p˙ + ∆ ˙ + Tp˙ ∆ f (p + ∆p) The following bound is then obtained: ˙ k ≤ kTp E˙ ∆ ˙ k2 + O(k∆ ˙ k2 ) = O(k∆ ˙ kk∆pk), g − f (p)k − kTp∆ g kf (p + ∆p) ˙ k2 ≤ kTp k2 kEk ˙ k and kEk ˙ 2 k∆ ˙ 2 ≤ k∇ g It follows from the ∇k ≤ k∆pk. where the last step invoked kTp E˙ ∆ ˙ ˙ k = 0 yields g Iterated Scaling Lemma that if f (p + ∆p) 6= f (p) then k∆ k = 6 0. Indeed, the contrapositive k∆ ˙ k > 0, so that g = f (p) ˙ = f (p). Thus, if k∆Ak > 0 then k∆ f (p + ∆p) ˙k k∆Ak kTp∆ = + ε(∆A) with ˙k ˙k k∆ k∆ 31

lim |ε(∆A)| → 0;

∆A→0

(C.5)

g the limit is zero because of (C.4). It remains to relate this expression to k∆pk/k∆Ak. ˙ ˙ =: g in terms of ∆ is obtained by observing that S(p b + ∆p) = p + ∆p g = p˙ + ∆ An upper bound on k∆pk ˙ Sp + ∆ for some invertible S ∈ T . It follows that ˙ k ≤ kS −1 k2 k∆ ˙ k, g = min kD(p + ∆p) − pk ≤ kS −1 S(p b + ∆p) − pk = kS −1∆ k∆pk D∈T

where the first inequality arises because T is a multiplicative group. As S −1 ∈ T so S ∈ T , we can write S = P D where P ∈ P and D ∈ B, so that kS −1 k2 = kD−1 k2 . It follows from the Iterated Scaling Lemma that the diagonal entries of D are the γj,i ’s in (B.16). Then, it follows immediately from (B.17) and (B.18) g · maxi χi =: 1 + δ, that the largest diagonal entry in absolute value of D−1 is smaller than 1 + 4(d − 1)k∆pk g in where χi is as in (B.7). It follows from Appendix C.1 that δ → 0 as  → 0. A lower bound on k∆pk 0 ˙ ˙ ∆ k + k∆ ∆k so that terms of ∆ is obtained as follows. From the triangle inequality, k∆ k ≤ k∆ ˙ k2 ≤ k∆ ∆0 k2 + 2k∆ ∆0 kk∆ ∆k + k∆ ∆k2 k∆ ∆0 k(k∆ ∆0 k + 2k∆ ∆k) + k∆ ∆k2 ≤ k∇ ∇k2 (4λ2 k∇ ∇k2 + 4λk∆ ∆k) + k∆ ∆ k2 , = k∆ g where the last step is due to Lemma 22. Note that λ is a constant. Provided that k∆Ak and, hence, k∆pk, are sufficiently small, i.e., so that the inequality g 2 + 4λk∆pk g ≤1 ∇k2 + 4λk∆ ∆k ≤ 4λ2 k∆pk 4λ2 k∇ is satisfied, then the foregoing implies the lower bound ˙ k2 ≤ k∇ g 2, ∇k2 + k∆ ∆k2 = k∇ ∇k2 + 2 · Reh∇ ∇, ∆ i + k∆ ∆k2 = k∆pk k∆ where the first equality is because of the orthogonality in the Hermitian inner product of ∇ and ∆ . Thus, by combining the lower and upper bound and dividing by k∆Ak, we can conclude that for every non-zero ∆A of sufficiently small norm the following relations hold true ˙ (∆A, S(∆A))k ˙ (∆A, S(∆A))k b g b k∆p(∆A)k k∆ k∆ ≤ ≤ (1 + δ(∆A)) , k∆Ak k∆Ak k∆Ak

(C.6)

b provided that  ≥ k∆Ak is sufficiently small and where the dependence on ∆A and S(∆A) was explicitly indicated. Let ∆A0 ∈ arg max ∆A∈G

max b S(∆A)∈Z

˙ (∆A, S(∆A))k ˙ (∆A0 , S)k b g b k∆ k∆p(∆A)k k∆ , ∆A00 ∈ arg max , Sb0 ∈ arg max , 0 k∆Ak k∆Ak k∆A k ∆A∈G b S∈Z

where Z is as in (C.3). It follows that 0 00 00 ˙ (∆A0 , Sb0 )k ˙ g g k∆ k∆p(∆A )k k∆p(∆A )k 00 k∆ (∆A , •)k ≤ ≤ ≤ (1 + δ(∆A )) k∆A0 k k∆A0 k k∆A00 k k∆A00 k ˙ (∆A0 , Sb0 )k k∆ ≤ · max (1 + δ(∆A)), ∆A∈G k∆A0 k

(C.7)

where the first step is by the first inequality in (C.6), the second inequality is by optimality of ∆A00 , the third inequality is because of the second inequality in (C.6) with the “•” indicating that the inequality is valid for all S(∆A00 ) ∈ Z (so in particular the maximum), and the last step is because of the optimality of ∆A0 . For continuing, the explicit notation indicating the dependence on ∆A and Sb is discarded again. Let n be sufficiently large, and then consider the sequences cn =

min

min

∆An ∈G1/n S(∆A b n )∈Z

k∆An k k∆A0n k = and sn = min ˙ nk ∆An ∈G1/n ˙0k k∆ k∆ n 32

min b S(∆A n )∈Z

˙ nk kTp∆ , ˙ nk k∆

˙ 0 := ∆ ˙ (∆A0 , Sb0 (∆A0 )). Then it follows from the first part of (C.5) that where ∆ n n n cn ≥ sn +

min

min

∆An ∈G1/n S(∆A b n )∈Z

ε(∆An )

and sn ≥ cn +

min

min

∆An ∈G1/n S(∆A b n )∈Z

(−ε(∆An )).

From the above and the second part of (C.5), it follows there exist ϑn ≥ 0 so that we can bound −1 sn − ϑn ≤ cn ≤ sn + ϑn ; hence,(sn − ϑn )−1 ≥ c−1 , n ≥ (sn + ϑn )

(C.8)

for sufficiently large n so that sn > ϑn . Note that ϑn → 0 as n → ∞ because of (C.5), (C.4) and Lemma ˙ n ∈ K⊥ , it follows immediately from the Courant–Fisher min-max characterization of the least 22. Since ∆ singular value [30, Theorem 4.2.11] that 0 < ςN (Tp ) ≤

˙ nk kTp∆ ≤ ς1 (Tp ), ˙ k∆ n k

where N = r(Σ + 1), ςi (Tp ) denotes the ith singular value of Tp , and the first inequality is due to the assumption on the rank of Tp . I claim that the sequence of sn converges: lim sn = lim min

n→∞

min

→0 ∆A∈G S(∆A)∈Z b

˙ (∆A, S(∆A))k b kTp∆ = ςN (Tp ). ˙ (∆A, S(∆A)k b k∆

(C.9)

This would conclude the proof because then all of the following limits are well-defined: −1 lim (sn − ϑn )−1 = ςN (Tp )

n→∞

and

−1 lim (sn + ϑn )−1 = ςN (Tp ) ,

n→∞

sandwiching c−1 n in (C.8), so that by considering (C.7) we find lim c−1 n ≤ lim

n→∞

max

n→∞ ∆An ∈G1/n

gn k k∆p ≤ lim c−1 · lim k∆An k n→∞ n n→∞

max

∆An ∈G1/n

(1 + δn ),

and as δn → 0, it follows that κA = lim max

→0 ∆A∈G

−1 dist(p, fR (A + ∆A)) = lim n→∞ k∆Ak

max

∆An ∈G1/n

gn k k∆p = (ςN (Tp ))−1 , k∆An k

which would conclude the proof. Appendix C.3. The limit For proving (C.9), we consider the specific perturbation ∆p0n = n−1 w, where w is the right singular vector of Tp corresponding to the singular value ςN (Tp ). The perturbed tensor is An = f (p+∆p0n ) = f (p+n−1 w) = A + ∆An . If n0 is sufficiently large, then An ⊂ N for all n > n0 . Clearly, k∆An k → 0 as n → ∞. From a Taylor series expansion, it is also clear that for large n, ∆An 6= 0, as the norms of the higher-order terms −1 (A + ∆An ) will be dominated by n−1 = kn−1 wk. Let ∆pn be the specific normalization of p + ∆pn ∈ fR b that satisfies the properties imposed in (C.2). Let Sn ∈ arg minS∈T kp − S(p + ∆pn )k be arbitrary. Then, gn = Sbn (p + ∆pn ). By definition of ∆pn there exists some Tn ∈ T such that p + ∆pn = Tn (p + ∆p0 ). p + ∆p n gn = Sbn Tn (p + n−1 w), so that Hence, p + ∆p gn k = min kp − S(p + ∆pn )k = min kp − STn (p + n−1 w)k k∆p S∈T

S∈T

= min kp − S(p + n−1 w)k ≤ kp − I(p + n−1 w)k = n−1 ; S∈T

the third equality is because T is a group. 33

(C.10)

Let n be sufficiently large. Then, Sbn Tn = I + En where En is a diagonal matrix whose diagonal entries are bounded by O(n−1 ). Indeed, the proof of Proposition 9 shows that (C.10) can only be true if for all k = 2, . . . , d and all i = 1, . . . , r simultaneously we have that kaki − θk,πi (n)(akπi + n−1 wπki )k ≤ n−1 ,

θk,πi (n) ∈ F0 ,

(C.11)

where w = vecr(w1 , . . . , wr ), wi = (wi1 , . . . , wid ), and π is a permutation of {1, . . . , r}. Since the above equation is valid for all all n and as aki are constant vectors, it follows that there exist subsequences such that all θk,πi (n) converge: θk,πi (n) → θbk,πi . In the limit of this subsequence, the inequality can only be satisfied if aki = θbk,πi akπi for all k = 2, 3, . . . , d and i = 1, 2, . . . , r. Suppose that π is not the identity, then there exists a j 6= πj for which it holds that −1 2 −1 a1j ⊗ · · · ⊗ adj + a1πj ⊗ · · · ⊗ adπj = a1j ⊗ · · · ⊗ adj + a1πj ⊗ (θb2,π a ) ⊗ · · · ⊗ (θbd,π ad ) j j j j  −1 −1 = a1j + θb2,π · · · θbd,π a1 ⊗ a2j ⊗ · · · ⊗ adj , j j πj

showing that A really has a decomposition of length at most r − 1, which is a contradiction. Thus, on every convergent subsequence, for large n, π is the identity. Hence, for large n, (C.11) simplifies to k(1 − θk,i (n))aki − n−1 θk,i (n)wik k2 ≤ n−2 ,

(C.12)

which should hold for all i = 1, 2, . . . , r and k = 2, 3, . . . , d. It is shown next that this inequality may be satisfied only if 1 − 3n−1 kaki k−1 ≤ θk,i (n) ≤ 1 + 3n−1 kaki k−1 .

(C.13)

In the remainder of the proof, the dependence of the θk,i ’s on n will no longer be indicated. Note that (C.12) 2 is a quadratic equation in θk,i with positive coefficient for θk,i . So it suffices proving that inequality (C.12) is satisfied for θk,i = 1 while it is no longer satisfied for θk,i = 1±3n−1 kaki k−1 to conclude that any θk,i satisfying (C.12) must be contained in the interval (C.13). For θk,i = 1 one has k(1−1)aki −n−1 wik k = n−1 kwik k ≤ n−1 , so (C.12) is satisfied. Pugging θk,i = 1 ± 3n−1 kaki k−1 into (C.12), we get from the triangle inequality k(1 − θk,i )aki − n−1 θk,i wik k ≥ 3n−1 kaki k−1 kaki k − n−1 |1 ± 3n−1 kaki k−1 | · kwik k ≥ n−1 (3 − |1 ± 3n−1 kaki k−1 |) > n−1 , where the last inequality is valid for n > 3kaki k−1 . Hence, |θk,i − 1| ≤ 3n−1 kaki k−1 for k = 2, . . . , d. For θ1,i = (θ2,i · · · θd,i )−1 we get d ∞ d Y X X Y −1 k −1 `k −1 k −1 −1 (1 ± 3n kai k ) ≤ −1 + 1 + |θ1,i − 1| = −1 + 3n kai k κ=1 k``k1 =κ k=2

k=2



∞ X

(3n−1 χi )κ k1 ⊗ · · · ⊗ 1k1 ≤

κ=1

∞ X

(3(d − 1)n−1 χi )κ ≤ 6(d − 1)n−1 χi ,

κ=1

where χi is as in (B.7); in the third step the same argument surrounding the derivation of (B.13) was employed and in the last step we assumed n ≥ 6(d − 1)χi . Let µk,i be defined as θk,i =: 1 − µk,i , θ1,i =

−1 θ2,i

−1 · · · θd,i

=: 1 − µ1,i ,

k = 2, 3, . . . , d, i = 1, 2, . . . , r

(C.14a)

i = 1, 2, . . . , r.

(C.14b)

By the foregoing derivation |µk,i | = O(n−1 ). We can then write En explicitly as En = diag(En,1 , . . . , En,r )

with En,i = diag(−µ1,i In1 , −µ2,i In2 , . . . , −µd,i Ind ). 34

We now consider the following factorization gn = Sbn (p + ∆pn ) = Sbn Tn (p + ∆p0n ) p + ∆p = (I + En )(p +

(C.15)

∆p0n )

∇n,1 + ∇ n,2 ) + = p + (∇

∆n,1 + ∆ n,2 + ∆ p0n ) (∆

where En p = ∇n,1 + ∆n,1 and En ∆p0n = ∇n,2 + ∆n,2 with ∇n,i ∈ K and ∆n,i ∈ K⊥ . One sees immediately ∇n,1 k ≤ kEn k2 kpk = O(n−1 ), k∇ ∇n,2 k ≤ kEn k2 k∆p0n k = O(n−2 ), and k∆ ∆n,2 k ≤ kEn k2 k∆p0n k = that k∇ −2 −2 ∆ O(n ). The key difficulty lies in proving that k∆ n,1 k = O(n ). By definition, ∆ n,1 = (I − KK † )En p = (I − K diag(K1H K1 , . . . , KrH Kr )−1 K H )En p so it suffices to prove that k(I − Ki Ki† )qi k = O(n−2 ), where qi = qi By definition and some Maclaurin series, we know that

(1)

−µ1,i = −1 +

d Y

−1

(1 − µk,i )

=

k=2

d X

µk,i +

= En,i vec(pi ) with µk,i as in (C.14).

d X X Y

k µ`k,i

=:

κ=2 k`k1 =κ k=2

k=2

d X

µk,i + Mi ;

(C.16)

k=2 (2)

(1)

it can be shown in the usual way that |Mi | = O(n−2 ). Let αk,i = kaki k. Then, we find qi = KiH qi . The (2) (2) (3) (2) 2 2 entries of qi are qk−1,i = −µ1,i α1,i + µk,i αk,i , k = 2, . . . , d. For the next step qi = (KiH Ki )−1 qi , we recall from the Sherman–Morrison formula that 2 α1,i hi hTi , P d −2 2 1 + α1,i j=2 αj,i Pd = −µ1,i α bi + k=2 µk,i , so that the

−2 −2 2 2 2 (KiH Ki )−1 = (diag(α2,i , . . . , αd,i ) + α1,i 11T )−1 = diag(α2,i , . . . , αd,i )−

where hTi = [ α−2 2,i (3)

elements of qi

··· α−2 d,i

2 ]. Let α bi = α1,i

Pd

(2)

j=2

−2 αj,i . Then, hTi qi

are given by (3) qk−1,i

=

−2 2 −µ1,i α1,i αk,i

+ µk,i −

−2 2 = −µ1,i α1,i αk,i + µk,i −

= µk,i + = µk,i +

−2 2 α1,i αk,i (−µ1,i α bi + −2 2 α1,i αk,i (1

Pd

k=2

µk,i )

1+α bi Pd −2 2 +α bi ) k=2 µk,i + α1,i αk,i α b i Mi 1+α bi

−2 −2 2 2 α1,i αk,i Mi − α1,i αk,i α bi (1 + α bi )−1 Mi −2 2 Mi α1,i αk,i (1 − α bi (1 + α bi )−1 ) =: µk,i +

Mi,k ,

where in the second and third step (C.16) was used. Now qi = qi − Ki qi = (I − Ki Ki† )qi where      Pd Pd −µ1,i − k=2 (µk,i + Mi,k ) a1i (Mi − k=2 Mi,k )a1i     (−µ2,i + µ2,i + Mi,2 )a2i Mi,2 a2i     (4) qi =  = , . . .. ..     (4)

(−µd,i + µd,i + Mi,d )adi

(1)

(3)

Mi,d adi

∆n,1 k = O(n−2 ) follows. from which k∆ Continuing from (C.15), we can apply the Iterated Scaling Lemma to p + ∇ n , where ∇ n = ∇ n,1 + ∇ n,2 ∇n k = O(n−1 ), thus obtaining and k∇ gn = p˙ n + ∆ 0n + ∆ n,1 + ∆ n,2 + ∆p0n = p˙ n + n−1 w + n−2 xn , p + ∆p where ∆ 0n ∈ K⊥ is of norm O(n−2 ) for sufficiently large n, and xn is a vector whose norm can be uniformly ˙ n = n−1 w + n−2 xn and Tp w = ςN w. From bounded by a constant C. By definition, ∆ n−1 kTp (w + n−1 xn )k ςN + n−1 kTp k2 kxn k ςN − n−1 kTp k2 kxn k ≤ ≤ , 1 + n−1 kxn k n−1 kw + n−1 xn k 1 − n−1 kxn k ˙ (∆An , •)k/k∆ ˙ (∆An , •))k, where “•” indicates any S(∆A b it follows that kTp∆ n ) ∈ Z, tends to ςN (Tp ) as n → ∞, concluding the proof. 35

References

[1] Abo, H., Ottaviani, G., Peterson, C., 2009. Induction for secant varieties of Segre varieties. Trans. Amer. Math. Soc. 361, 767–792.
[2] Acar, E., Dunlavy, D., Kolda, T., Mørup, M., 2011. Scalable tensor factorizations for incomplete data. Chemometr. Intell. Lab. 106 (1), 41–56.
[3] Allman, E. S., Matias, C., Rhodes, J. A., 2009. Identifiability of parameters in latent structure models with many observed variables. Ann. Stat. 37 (6A), 3099–3132.
[4] Anandkumar, A., Ge, R., Hsu, D., Kakade, S. M., Telgarsky, M., 2014. Tensor decompositions for learning latent variable models. J. Mach. Learn. Res. 15, 2773–2832.
[5] Appellof, C., Davidson, E., 1981. Strategies for analyzing data from video fluorometric monitoring of liquid chromatographic effluents. Anal. Chem. 53 (13), 2053–2056.
[6] Bernardi, A., Brachat, J., Comon, P., Mourrain, B., 2013. General tensor decomposition, moment matrices and applications. J. Symbolic Comput. 52, 51–71.
[7] Bhaskara, A., Charikar, M., Vijayaraghavan, A., 2014. Uniqueness of tensor decompositions with applications to polynomial identifiability. JMLR: Workshop and Conference Proceedings 35, 1–37.
[8] Blum, L., Cucker, F., Shub, M., Smale, S., 1998. Complexity and Real Computation. Springer-Verlag, New York.
[9] Bocci, C., Chiantini, L., Ottaviani, G., 2014. Refined methods for the identifiability of tensors. Ann. Mat. Pura Appl. (4) 193 (6), 1691–1702.
[10] Bochnak, J., Coste, M., Roy, M., 1998. Real Algebraic Geometry. Springer-Verlag.
[11] Boralevi, A., Draisma, J., Horobeț, E., Robeva, E., 2017. Orthogonal and unitary tensor decomposition from an algebraic perspective. Isr. J. Math. To appear.
[12] Breiding, P., Vannieuwenhoven, N., 2016. The condition number of join decompositions. arXiv:1611.08117.
[13] Bürgisser, P., Cucker, F., 2013. Condition: The Geometry of Numerical Algorithms. Vol. 349 of Grundlehren der mathematischen Wissenschaften. Springer-Verlag.
[14] Chen, J., Saad, Y., 2009. On the tensor SVD and the optimal low rank orthogonal approximation of tensors. SIAM J. Matrix Anal. Appl. 30 (4), 1709–1734.
[15] Chiantini, L., Ottaviani, G., 2012. On generic identifiability of 3-tensors of small rank. SIAM J. Matrix Anal. Appl. 33 (3), 1018–1037.
[16] Chiantini, L., Ottaviani, G., Vannieuwenhoven, N., 2014. An algorithm for generic and low-rank specific identifiability of complex tensors. SIAM J. Matrix Anal. Appl. 35 (4), 1265–1287.
[17] Chiantini, L., Ottaviani, G., Vannieuwenhoven, N., 2017. Effective criteria for specific identifiability of tensors and forms. SIAM J. Matrix Anal. Appl. Accepted.
[18] Comon, P., 1994. Independent component analysis, a new concept? Signal Proc. 36 (3), 287–314.
[19] De Lathauwer, L., 2006. A link between the canonical decomposition in multilinear algebra and simultaneous matrix diagonalization. SIAM J. Matrix Anal. Appl. 28 (3), 642–666.
[20] de Silva, V., Lim, L.-H., 2008. Tensor rank and the ill-posedness of the best low-rank approximation problem. SIAM J. Matrix Anal. Appl. 30 (3), 1084–1127.
[21] Demmel, J. W., 1987. The geometry of ill-conditioning. J. Complexity 3 (2), 201–229.
[22] Domanov, I., De Lathauwer, L., 2014. Canonical polyadic decomposition of third-order tensors: reduction to generalized eigenvalue decomposition. SIAM J. Matrix Anal. Appl. 35 (2), 636–660.
[23] Domanov, I., De Lathauwer, L., 2015. Generic uniqueness conditions for the canonical polyadic decomposition and INDSCAL. SIAM J. Matrix Anal. Appl. 36 (4), 1567–1589.
[24] Dontchev, A. L., Rockafellar, R. T., 2009. Implicit Functions and Solution Mappings: A View from Variational Analysis. Springer Monographs in Mathematics. Springer.
[25] Griffiths, P., Harris, J., 1978. Principles of Algebraic Geometry. John Wiley & Sons, Inc.
[26] Harris, J., 1992. Algebraic Geometry, A First Course. Vol. 133 of Graduate Texts in Mathematics. Springer-Verlag.
[27] Hayashi, C., Hayashi, F., 1982. A new algorithm to solve PARAFAC-model. Behaviormetrika 11, 49–60.
[28] Higham, N., 1996. Accuracy and Stability of Numerical Algorithms. SIAM, Philadelphia, PA.
[29] Hitchcock, F., 1927. The expression of a tensor or a polyadic as a sum of products. J. Math. Phys. 6 (1), 164–189.
[30] Horn, R., Johnson, C., 1986. Matrix Analysis. Cambridge University Press, New York, NY, USA.
[31] Kolda, T. G., 2001. Orthogonal tensor decompositions. SIAM J. Matrix Anal. Appl. 23 (1), 243–255.
[32] Koldovský, Z., Tichavský, P., Phan, A.-H., 2011. Stability analysis and fast damped Gauss–Newton algorithm for INDSCAL tensor decomposition. In: Proc. IEEE Workshop Statist. Signal Process. pp. 581–584.
[33] Krantz, S. G., Parks, H. R., 2002. The Implicit Function Theorem: History, Theory, and Applications. Birkhäuser Boston, New York.
[34] Kruskal, J. B., 1977. Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra Appl. 18 (2), 95–138.
[35] Landsberg, J., 2012. Tensors: Geometry and Applications. Vol. 128 of Graduate Studies in Mathematics. American Mathematical Society, Providence, Rhode Island.
[36] Leurgans, S. E., Ross, R. T., Abel, R. B., 1993. A decomposition for three-way arrays. SIAM J. Matrix Anal. Appl. 14 (4), 1064–1083.
[37] Liu, X., Sidiropoulos, N. D., 2001. Cramér–Rao lower bounds for low-rank decomposition of multidimensional arrays. IEEE Trans. Signal Process. 49 (9), 2074–2086.


[38] Nocedal, J., Wright, S. J., 2006. Numerical Optimization, 2nd Edition. Springer Series in Operations Research and Financial Engineering. Springer.
[39] Oseledets, I., Savost'yanov, D., 2006. Minimization methods for approximating tensors and their comparison. Comput. Math. Math. Phys. 46 (10), 1641–1650.
[40] Paatero, P., 1997. A weighted non-negative least squares algorithm for three-way PARAFAC factor analysis. Chemometr. Intell. Lab. 38 (2), 223–242.
[41] Phan, A.-H., Tichavský, P., Cichocki, A., 2013. Fast alternating LS algorithms for high order CANDECOMP/PARAFAC tensor factorizations. IEEE Trans. Signal Process. 61 (19), 4834–4846.
[42] Phan, A.-H., Tichavský, P., Cichocki, A., 2013. Low complexity damped Gauss–Newton algorithms for CANDECOMP/PARAFAC. SIAM J. Matrix Anal. Appl. 34 (1), 126–147.
[43] Qi, Y., Comon, P., Lim, L.-H., 2016. Semialgebraic geometry of nonnegative tensor rank. SIAM J. Matrix Anal. Appl. 37 (4), 1556–1580.
[44] Robeva, E., 2016. Orthogonal decomposition of symmetric tensors. SIAM J. Matrix Anal. Appl. 37 (1), 86–102.
[45] Rotman, J. J., 1994. An Introduction to the Theory of Groups, 4th Edition. Vol. 148 of Graduate Texts in Mathematics. Springer-Verlag, New York, USA.
[46] Sorber, L., Van Barel, M., De Lathauwer, L., 2013. Optimization-based algorithms for tensor decompositions: canonical polyadic decomposition, decomposition in rank-(L_r, L_r, 1) terms, and a new generalization. SIAM J. Optim. 23 (2), 695–720.
[47] Spivak, M., 1965. Calculus on Manifolds: A Modern Approach to Classical Theorems of Advanced Calculus. Addison-Wesley.
[48] Stewart, G. W., Sun, J.-G., 1990. Matrix Perturbation Theory. Vol. 33 of Computer Science and Scientific Computing. Academic Press.
[49] Strassen, V., 1983. Rank and optimal computation of generic tensors. Linear Algebra Appl. 52–53, 645–685.
[50] Terracini, A., 1911. Sulla Vk per cui la varietà degli Sh h+1-secanti ha dimensione minore dell'ordinario. Rend. Circ. Mat. Palermo (2) 31, 392–396.
[51] Tichavský, P., Koldovský, Z., 2011. Weight adjusted tensor method for blind separation of underdetermined mixtures of nonstationary sources. IEEE Trans. Signal Process. 59 (3), 1037–1047.
[52] Tichavský, P., Phan, A.-H., Koldovský, Z., 2013. Cramér–Rao-induced bounds for CANDECOMP/PARAFAC tensor decomposition. IEEE Trans. Signal Process. 61 (8), 1986–1997.
[53] Tomasi, G., 2006. Practical and computational aspects in chemometric data analysis. Ph.D. thesis, The Royal Veterinary and Agricultural University, Frederiksberg, Denmark.
[54] Tomasi, G., Bro, R., 2005. PARAFAC and missing values. Chemometr. Intell. Lab. 75, 163–180.
[55] Vannieuwenhoven, N., 2016. A condition number for the tensor rank decomposition. arXiv:1604.00052, 45 pp.
[56] Vannieuwenhoven, N., Nicaise, J., Vandebril, R., Meerbergen, K., 2014. On generic nonexistence of the Schmidt–Eckart–Young decomposition for complex tensors. SIAM J. Matrix Anal. Appl. 35 (3), 886–903.
[57] Vannieuwenhoven, N., Vandebril, R., Meerbergen, K., 2012. A new truncation strategy for the higher-order singular value decomposition. SIAM J. Sci. Comput. 34 (2), A1027–A1052.
[58] Vannieuwenhoven, N., Vandebril, R., Meerbergen, K., 2015. A randomized algorithm for testing nonsingularity of structured matrices with an application to asserting nondefectivity of Segre varieties. IMA J. Numer. Anal. 35 (1), 289–324.
[59] Vervliet, N., Debals, O., Sorber, L., Van Barel, M., De Lathauwer, L., 2014. Tensorlab v3.0. http://www.tensorlab.net (last accessed 18 January 2017).
[60] Wilkinson, J. H., 1965. The Algebraic Eigenvalue Problem. Oxford University Press, Amen House, London E.C. 4, United Kingdom.
[61] Zhang, T., Golub, G. H., 2001. Rank-one approximation to high order tensors. SIAM J. Matrix Anal. Appl. 23 (2), 534–550.
