Cross: Efficient Low-rank Tensor Completion

Anru Zhang∗

arXiv:1611.01129v1 [stat.ME] 3 Nov 2016

University of Wisconsin-Madison

November 4, 2016

∗ E-mail address: [email protected]

Abstract

The completion of tensors, or high-order arrays, has attracted significant attention in recent research. Current literature on tensor completion primarily focuses on recovery from a set of uniformly randomly measured entries, and the required number of measurements to achieve recovery is not guaranteed to be optimal. In addition, the implementations of some previous methods are NP-hard. In this article, we propose a framework for low-rank tensor completion via a novel tensor measurement scheme we name Cross. The proposed procedure is efficient and easy to implement. In particular, we show that a third order tensor of Tucker rank-$(r_1, r_2, r_3)$ in $p_1$-by-$p_2$-by-$p_3$ dimensional space can be recovered from as few as $r_1r_2r_3 + r_1(p_1-r_1) + r_2(p_2-r_2) + r_3(p_3-r_3)$ noiseless measurements, which matches the sample complexity lower bound. In the case of noisy measurements, we also develop a theoretical upper bound and the matching minimax lower bound for the recovery error of the proposed procedure over certain classes of low-rank tensors. The results can be further extended to fourth or higher-order tensors. Simulation studies show that the method performs well under a variety of settings. Finally, the procedure is illustrated through a real dataset in neuroimaging.

1 Introduction

Tensors, or high-order arrays, commonly arise in a wide range of applications, including neuroimaging (Zhou et al., 2013; Li et al., 2013; Guhaniyogi et al., 2015; Li and Zhang, 2016; Sun



and Li, 2016), recommender systems (Karatzoglou et al., 2010; Rendle and Schmidt-Thieme, 2010; Sun et al., 2015), hyperspectral image compression (Li and Li, 2010), multi-energy computed tomography (Semerci et al., 2014; Li et al., 2014) and computer vision (Liu et al., 2013). With the development of neuroscience and information technologies, the scales of tensor data are increasing rapidly, making the storage and computation of tensors more and more costly and difficult. For example, simultaneously analyzing 100 MRI (magnetic resonance imaging) images with dimensions 256-by-256-by-256 would require $100 \times 256^3 = 1{,}677{,}721{,}600$ measurements and about 6GB of storage space. Tensor completion, whose central goal is to recover low-rank tensors based on limited numbers of measurable entries, can be used for compression and decompression of high-dimensional low-rank tensors. Such problems have been central and well studied for order-2 tensors (i.e. matrices) in the fields of high-dimensional statistics and machine learning for the last decade. A non-exhaustive list of references includes Keshavan et al. (2009); Candès and Tao (2010); Koltchinskii et al. (2011); Rohde et al. (2011); Negahban and Wainwright (2011); Agarwal et al. (2012); Cai and Zhang (2015); Cai et al. (2016); Cai and Zhou (2016). There are efficient procedures for matrix completion with strong theoretical guarantees. Particularly, for a $p_1$-by-$p_2$ matrix of rank $r$, whenever roughly $O(r(p_1+p_2)\mathrm{polylog}(p_1+p_2))$ uniformly randomly selected entries are observed, one can achieve good recovery with high probability using convex algorithms such as matrix nuclear norm minimization (Candès and Tao, 2010; Recht, 2011) and max-norm minimization (Srebro and Shraibman, 2005; Cai and Zhou, 2016). For matrix completion, the required number of measurements nearly matches the degrees of freedom, $O((p_1+p_2)r)$, of $p_1$-by-$p_2$ matrices of rank $r$. Although significant progress has been made for matrix completion, similar problems for order-3 or higher tensors are far more difficult. There has been some recent literature, including Gandy et al. (2011); Kressner et al. (2014); Yuan and Zhang (2014); Mu et al. (2014); Bhojanapalli and Sanghavi (2015); Shah et al. (2015); Barak and Moitra (2016); Yuan and Zhang (2016), that studied tensor completion based on similar formulations. To be specific, let $\mathbf{X} \in \mathbb{R}^{p_1\times p_2\times p_3}$ be an order-3 low-rank tensor, and let $\Omega$ be a subset of $[1:p_1]\times[1:p_2]\times[1:p_3]$. The goal of tensor completion is to recover $\mathbf{X}$ based on the observable entries indexed by $\Omega$. Most of the previous literature focuses on the setting where the indices of the observable entries are uniformly randomly selected.

For example, Gandy et al. (2011); Liu et al. (2013) proposed matricization nuclear norm minimization, which requires $O(rp^2\mathrm{polylog}(p))$ observations to recover order-3 tensors of dimension $p$-by-$p$-by-$p$ and Tucker rank $(r,r,r)$. Later, Jain and Oh (2014); Bhojanapalli and Sanghavi (2015) considered an alternating minimization method for completion of low-rank tensors with CP decomposition and orthogonal factors. Yuan and Zhang (2014, 2016) proposed the tensor nuclear norm minimization algorithm for tensor completion with noiseless observations and further proved that their proposed method has guaranteed performance for $p$-by-$p$-by-$p$ tensors of Tucker rank $(r,r,r)$ with high probability when $|\Omega| \geq O((r^{1/2}p^{3/2}+r^2p)\mathrm{polylog}(p))$. However, it is unclear whether the required number of measurements in this literature could be further improved. In addition, most of these proposed procedures, such as tensor nuclear norm minimization, are proved to be computationally NP-hard, making them very difficult to apply in real problems. In this paper, we propose a novel tensor measurement scheme and a corresponding efficient low-rank tensor completion algorithm. We name our method the Cross tensor measurement scheme because the measurement set is in the shape of a high-dimensional cross contained in the tensor. We show that one can recover an unknown, Tucker rank-$(r_1,r_2,r_3)$, $p_1$-by-$p_2$-by-$p_3$ tensor $\mathbf{X}$ with $|\Omega| = r_1r_2r_3 + r_1(p_1-r_1) + r_2(p_2-r_2) + r_3(p_3-r_3)$ noiseless Cross tensor measurements. This outperforms the previous methods in the literature and matches the degrees of freedom of all rank-$(r_1,r_2,r_3)$ tensors of dimensions $p_1$-by-$p_2$-by-$p_3$. To the best of our knowledge, we are among the first to achieve this optimal rate. We also develop the corresponding recovery method for the more general case where measurements are taken with noise. The central idea is to transform the observable matricizations by singular value decomposition and to perform an adaptive trimming scheme to denoise each block. To illustrate the properties of the proposed procedure, both theoretical analyses and simulation studies are provided. We derive upper and lower bound results to show that the proposed recovery procedure can accommodate different levels of noise and achieves the optimal rate of convergence for a large class of low-rank tensors. Although the exact low-rank assumption is used in the theoretical analysis, some simulation settings show that such an assumption is not really necessary in practice, as long as the singular values of each matricization of the original

tensor decay sufficiently fast. It is worth emphasizing that because the proposed algorithms only involve basic matrix operations such as matrix multiplication and singular value decomposition, the method is tuning-free in many general situations and can be implemented efficiently to handle large-scale problems. In fact, our simulation study shows that the recovery of a 500-by-500-by-500 tensor can be done stably within, on average, 10 seconds. We also apply the proposed procedure to a 3-d MRI imaging dataset that comes from a study on attention-deficit/hyperactivity disorder (ADHD). We show that with a limited number of Cross tensor measurements and the corresponding tensor completion algorithm, one can estimate the underlying low-rank structure of 3-d images nearly as well as if one observed all entries of the image. This work also relates to some previous results other than tensor completion in the literature. Mahoney et al. (2008) considered the tensor CUR decomposition, which aims to represent a tensor as the product of a sub-tensor and two matrices. However, simply applying their work cannot lead to optimal results in tensor completion, since treating tensors as matrix slices would lose useful structure of tensors. Krishnamurthy and Singh (2013) proposed a sequential tensor completion algorithm under adaptive sampling. Their result requires $O(pr^{2.5}\log(r))$ entries for $p$-by-$p$-by-$p$ order-3 tensors under the more restrictive CP rank-$r$ condition, which is much larger than that of our method. Rauhut et al. (2016) considered a tensor recovery setting where each observation is a general linear projection of the original tensor. However, their theoretical analysis heavily relies on a conjecture that is difficult to check. The rest of the paper is organized as follows. After an introduction to the notation and preliminaries in Section 2.1, we present the Cross tensor measurement scheme in Section 2.2. Based on the proposed measurement scheme, the tensor completion algorithms for the noiseless and noisy cases are introduced in Sections 2.3 and 2.4, respectively. We further analyze the theoretical performance of the proposed algorithms in Section 3. The numerical performance of the algorithms is investigated in a variety of simulation studies in Section 4. We then apply the proposed procedure to a real dataset of brain MRI imaging in Section 5. The proofs of the main results are collected in Section 6 and the supplementary materials. Although the presentation of this paper focuses on order-3 tensors, our methods can be easily extended to


fourth or higher order tensors.

2 Cross Tensor Measurements & Completion: Methodology

2.1 Basic Notations and Preliminaries

We start with basic notation and results that will be used throughout the paper. Upper case letters, e.g., $X, Y, Z$, are generally used to represent matrices. For $X \in \mathbb{R}^{p_1\times p_2}$, the singular value decomposition can be written as $X = U\Sigma V^\top$. Suppose $\mathrm{diag}(\Sigma) = (\sigma_1(X),\ldots,\sigma_{\min\{p_1,p_2\}}(X))$; then $\sigma_1(X)\geq\sigma_2(X)\geq\cdots\geq\sigma_{\min\{p_1,p_2\}}(X)\geq 0$ are the singular values of $X$. In particular, we write $\sigma_{\min}(X) = \sigma_{\min\{p_1,p_2\}}(X)$ and $\sigma_{\max}(X) = \sigma_1(X)$ for the smallest and largest singular values of $X$. Additionally, the matrix spectral norm and Frobenius norm are denoted as
\[
\|X\| = \max_{u\in\mathbb{R}^{p_2}}\frac{\|Xu\|_2}{\|u\|_2} \quad\text{and}\quad \|X\|_F = \sqrt{\sum_{i=1}^{p_1}\sum_{j=1}^{p_2}X_{ij}^2} = \sqrt{\sum_{i=1}^{\min\{p_1,p_2\}}\sigma_i^2(X)},
\]
respectively. We denote $P_X\in\mathbb{R}^{p_1\times p_1}$ as the projection operator onto the column space of $X$; specifically, $P_X = X(X^\top X)^\dagger X^\top = XX^\dagger$. Here $(\cdot)^\dagger$ is the Moore–Penrose pseudo-inverse. Let $\mathbb{O}_{p,r}$ be the set of all $p$-by-$r$ matrices with orthonormal columns, i.e., $\mathbb{O}_{p,r} = \{V\in\mathbb{R}^{p\times r}: V^\top V = I_r\}$, where $I_r$ represents the identity matrix of dimension $r$.

We use bold upper case letters, e.g., $\mathbf{X}, \mathbf{Y}, \mathbf{Z}$, to denote tensors. For $\mathbf{X}\in\mathbb{R}^{p_1\times p_2\times p_3}$ and $E_t\in\mathbb{R}^{m_t\times p_t}$, $t=1,2,3$, the mode product (tensor-matrix product) is defined as
\[
\mathbf{X}\times_1 E_1\in\mathbb{R}^{m_1\times p_2\times p_3}, \qquad (\mathbf{X}\times_1 E_1)_{ijk} = \sum_{s=1}^{p_1}E_{1,is}\mathbf{X}_{sjk}, \qquad i\in[1:m_1],\ j\in[1:p_2],\ k\in[1:p_3].
\]
The mode-2 product $\mathbf{X}\times_2 E_2$ and mode-3 product $\mathbf{X}\times_3 E_3$ are defined similarly. Interestingly, the products along different modes satisfy the commutative law, e.g., $\mathbf{X}\times_t E_t\times_s E_s = \mathbf{X}\times_s E_s\times_t E_t$ if $s\neq t$. The matricization (also called unfolding or flattening in the literature), $\mathcal{M}_t(\mathbf{X})$, maps a tensor $\mathbf{X}\in\mathbb{R}^{p_1\times p_2\times p_3}$ into a matrix $\mathcal{M}_t(\mathbf{X})\in\mathbb{R}^{p_t\times\prod_{s\neq t}p_s}$, so that for any $i\in\{1,\ldots,p_1\}$, $j\in\{1,\ldots,p_2\}$, $k\in\{1,\ldots,p_3\}$,
\[
\mathbf{X}_{ijk} = (\mathcal{M}_1(\mathbf{X}))_{[i,\,(j+p_2(k-1))]} = (\mathcal{M}_2(\mathbf{X}))_{[j,\,(k+p_3(i-1))]} = (\mathcal{M}_3(\mathbf{X}))_{[k,\,(i+p_1(j-1))]}.
\]
The tensor Hilbert–Schmidt norm and tensor spectral norm, which are defined as
\[
\|\mathbf{X}\|_{\mathrm{HS}} = \sqrt{\sum_{i=1}^{p_1}\sum_{j=1}^{p_2}\sum_{k=1}^{p_3}\mathbf{X}_{ijk}^2}, \qquad \|\mathbf{X}\|_{\mathrm{op}} = \max_{u\in\mathbb{R}^{p_1},\,v\in\mathbb{R}^{p_2},\,w\in\mathbb{R}^{p_3}}\frac{\mathbf{X}\times_1 u\times_2 v\times_3 w}{\|u\|_2\|v\|_2\|w\|_2},

will be intensively used in this paper. It is also noteworthy that the general calculation of the tensor operator norm is NP-hard (Hillar and Lim, 2013). Unlike matrices, there is no universal definition of rank for third or higher order tensors. Standing out from various definitions, the Tucker rank (Tucker, 1966) has been widely utilized in the literature, and its definition is closely associated with the following Tucker decomposition: for $\mathbf{X}\in\mathbb{R}^{p_1\times p_2\times p_3}$,
\[
\mathbf{X} = \mathbf{S}\times_1 U_1\times_2 U_2\times_3 U_3, \quad\text{or equivalently}\quad \mathbf{X}_{ijk} = \sum_{i'j'k'}\mathbf{S}_{i'j'k'}U_{1,ii'}U_{2,jj'}U_{3,kk'}. \tag{1}
\]

Here $\mathbf{S}\in\mathbb{R}^{r_1\times r_2\times r_3}$ is referred to as the core tensor and $U_k\in\mathbb{O}_{p_k,r_k}$. The smallest triplet $(r_1,r_2,r_3)$ in such a decomposition is defined as the Tucker rank of $\mathbf{X}$, which we denote as $\mathrm{rank}(\mathbf{X}) = (r_1,r_2,r_3)$. The Tucker rank can be calculated easily via the rank of each matricization: $r_t = \mathrm{rank}(\mathcal{M}_t(\mathbf{X}))$. It is also easy to prove that the triplet $(r_1,r_2,r_3)$ satisfies $r_t\leq p_t$ and $\max\{r_1,r_2,r_3\}^2\leq r_1r_2r_3$. For a more detailed survey of tensor decompositions, readers are referred to Kolda and Bader (2009).

We also use the following symbols to represent sub-arrays. For any index sets $\Omega_1$, $\Omega_2$, etc., we use $X_{[\Omega_1,\Omega_2]}$ to represent the sub-matrix of $X$ with row indices $\Omega_1$ and column indices $\Omega_2$. Sub-tensors are denoted similarly: $\mathbf{X}_{[\Omega_1,\Omega_2,\Omega_3]}$ represents the sub-tensor with mode-$t$ indices in $\Omega_t$ for $t=1,2,3$. For better presentation, we use brackets to represent index sets. Particularly, for any integers $a\leq b$, let $[a:b] = \{a,\ldots,b\}$ and let "$:$" alone represent the whole index set. Thus, $U_{[:,1:r]}$ represents the first $r$ columns of $U$, and $\mathbf{X}_{[\Omega_1,\Omega_2,:]}$ represents the sub-tensor of $\mathbf{X}$ with mode-1 indices $\Omega_1$, mode-2 indices $\Omega_2$ and all mode-3 indices.

Now we establish the lower bound for the minimum number of measurements for Tucker low-rank tensor completion by counting the degrees of freedom.

Proposition 1 (Degrees of freedom of rank-$(r_1,r_2,r_3)$ tensors in $\mathbb{R}^{p_1\times p_2\times p_3}$) Assume that $r_1\leq p_1$, $r_2\leq p_2$, $r_3\leq p_3$ and $\max\{r_1,r_2,r_3\}^2\leq r_1r_2r_3$. Then the degrees of freedom of all rank-$(r_1,r_2,r_3)$ tensors in $\mathbb{R}^{p_1\times p_2\times p_3}$ is $r_1r_2r_3 + (p_1-r_1)r_1 + (p_2-r_2)r_2 + (p_3-r_3)r_3$.

Remark 1 Beyond order-3 tensors, we can similarly show that the degrees of freedom of rank-$(r_1,\ldots,r_d)$ order-$d$ tensors in $\mathbb{R}^{p_1\times\cdots\times p_d}$ is $\prod_{t=1}^d r_t + \sum_{t=1}^d r_t(p_t-r_t)$.


Proposition 1 provides a lower bound and a benchmark for the number of measurements needed to guarantee low-rank tensor completion, i.e., $r_1r_2r_3 + \sum_{t=1}^3 r_t(p_t-r_t)$. Since the previous methods are not guaranteed to achieve this lower bound, we focus on developing the first measurement scheme that both works efficiently and reaches this benchmark.
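To make the tensor algebra above concrete, the following is a minimal NumPy sketch (our own illustration, not part of the paper) of the mode-$t$ matricization and mode product. The matricization's column ordering follows NumPy's reshape convention rather than the exact index map stated above, which is harmless as long as one convention is used consistently; later sketches in this paper reuse these hypothetical helper functions.

```python
import numpy as np

def matricize(X, mode):
    """Mode-t matricization M_t(X): mode `mode` becomes the rows, the other modes are flattened into columns."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def mode_product(X, E, mode):
    """Mode-t product X x_t E for a 3-way array X and a matrix E with E.shape[1] == X.shape[mode]."""
    out = E @ matricize(X, mode)                    # (rows of E) x (product of the other dimensions)
    new_shape = [E.shape[0]] + [X.shape[s] for s in range(X.ndim) if s != mode]
    return np.moveaxis(out.reshape(new_shape), 0, mode)

# quick sanity check of the commutative law X x_1 E_1 x_2 E_2 = X x_2 E_2 x_1 E_1
X = np.random.randn(4, 5, 6)
E1, E2 = np.random.randn(3, 4), np.random.randn(2, 5)
assert np.allclose(mode_product(mode_product(X, E1, 0), E2, 1),
                   mode_product(mode_product(X, E2, 1), E1, 0))
```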

2.2 Cross Tensor Measurements

In this section, we propose a novel Cross tensor measurement scheme. Suppose the target unknown tensor $\mathbf{X}$ is $p_1$-by-$p_2$-by-$p_3$. We let
\[
\Omega_1\subseteq[1:p_1],\quad \Omega_2\subseteq[1:p_2],\quad \Omega_3\subseteq[1:p_3], \quad\text{where } |\Omega_t| = m_t,\ t=1,2,3;
\]
\[
\Xi_1\subseteq\Omega_2\times\Omega_3,\quad \Xi_2\subseteq\Omega_3\times\Omega_1,\quad \Xi_3\subseteq\Omega_1\times\Omega_2, \quad\text{where } |\Xi_t| = g_t,\ t=1,2,3. \tag{2}
\]

Then we measure the entries of $\mathbf{X}$ using the following index set
\[
\Omega = (\Omega_1\times\Omega_2\times\Omega_3)\cup([1:p_1]\times\Xi_1)\cup([1:p_2]\times\Xi_2)\cup([1:p_3]\times\Xi_3)\subseteq[1:p_1,1:p_2,1:p_3], \tag{3}
\]
where $\Omega_1\times\Omega_2\times\Omega_3 = \{(i,j,k): i\in\Omega_1, j\in\Omega_2, k\in\Omega_3\}$ are referred to as body measurements, and
\[
\begin{aligned}
[1:p_1]\times\Xi_1 &= \{(i,j,k): i\in[1:p_1],\ (j,k)\in\Xi_1\},\\
[1:p_2]\times\Xi_2 &= \{(i,j,k): j\in[1:p_2],\ (k,i)\in\Xi_2\},\\
[1:p_3]\times\Xi_3 &= \{(i,j,k): k\in[1:p_3],\ (i,j)\in\Xi_3\}
\end{aligned} \tag{4}
\]
are referred to as arm measurements. Meanwhile, the intersections between the body and arm measurements, which we refer to as joint measurements, also play important roles in our analysis:
\[
\begin{aligned}
\Omega_1\times\Xi_1 &= (\Omega_1\times\Omega_2\times\Omega_3)\cap([1:p_1]\times\Xi_1) = \{(i,j,k): i\in\Omega_1,\ (j,k)\in\Xi_1\},\\
\Omega_2\times\Xi_2 &= (\Omega_1\times\Omega_2\times\Omega_3)\cap([1:p_2]\times\Xi_2) = \{(i,j,k): j\in\Omega_2,\ (k,i)\in\Xi_2\},\\
\Omega_3\times\Xi_3 &= (\Omega_1\times\Omega_2\times\Omega_3)\cap([1:p_3]\times\Xi_3) = \{(i,j,k): k\in\Omega_3,\ (i,j)\in\Xi_3\}.
\end{aligned} \tag{5}
\]
A pictorial illustration of the body, arm and joint measurements is provided in Figure 1. Since the measurements are generally cross-shaped, we refer to $\Omega$ as the Cross tensor measurement scheme. It is easy to see that the total number of measurements in the proposed scheme is $m_1m_2m_3 + g_1(p_1-m_1) + g_2(p_2-m_2) + g_3(p_3-m_3)$, and the sampling ratio is
\[
\frac{\#\text{Observable samples}}{\#\text{All parameters}} = \frac{m_1m_2m_3 + g_1(p_1-m_1) + g_2(p_2-m_2) + g_3(p_3-m_3)}{p_1p_2p_3}. \tag{6}
\]

Based on these measurements, we focus on the following model:
\[
\mathbf{Y}_\Omega = \mathbf{X}_\Omega + \mathbf{Z}_\Omega, \quad\text{i.e.}\quad \mathbf{Y}_{ijk} = \mathbf{X}_{ijk} + \mathbf{Z}_{ijk}, \quad (i,j,k)\in\Omega, \tag{7}
\]

where X, Y and Z correspond to the original tensor, observed values and unknown noise term, respectively.
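As an illustration of the scheme, here is a small NumPy sketch (our own, not from the paper) that draws $\Omega_t$ and $\Xi_t$ uniformly at random, as in the random-sampling setting used later in Theorem 4, and evaluates the sampling ratio (6). Zero-based indices are used, and the function names are ours.

```python
import numpy as np

def cross_measurement_sets(p, m, g, rng=None):
    """Draw Omega_t (size m_t) from {0,...,p_t-1} and Xi_t (size g_t) from Omega_{t+1} x Omega_{t+2}."""
    rng = np.random.default_rng() if rng is None else rng
    Omega = [rng.choice(p[t], size=m[t], replace=False) for t in range(3)]
    Xi = []
    for t in range(3):
        s1, s2 = (t + 1) % 3, (t + 2) % 3                  # the two modes other than t, ordered as in (2)
        pairs = [(a, b) for a in Omega[s1] for b in Omega[s2]]
        chosen = rng.choice(len(pairs), size=g[t], replace=False)
        Xi.append([pairs[i] for i in chosen])
    return Omega, Xi

def sampling_ratio(p, m, g):
    """Sampling ratio (6): (m1*m2*m3 + sum_t g_t*(p_t - m_t)) / (p1*p2*p3)."""
    return (m[0] * m[1] * m[2] + sum(g[t] * (p[t] - m[t]) for t in range(3))) / (p[0] * p[1] * p[2])
```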

2.3 Recovery Algorithm – Noiseless Case

When $\mathbf{X}$ is exactly low-rank and the observations are noiseless, i.e. $\mathbf{Y}_{ijk} = \mathbf{X}_{ijk}$, we can recover $\mathbf{X}$ with the following algorithm. We first construct the arm, body and joint matricizations based on (4) and (5):
\[
\mathbf{Y}_{\Xi_t} = \mathcal{M}_t(\mathbf{Y}_{[1:p_t]\times\Xi_t})\in\mathbb{R}^{p_t\times g_t}, \qquad \text{(arm matricizations)} \tag{8}
\]
\[
\mathbf{Y}_{t,\Omega} = \mathcal{M}_t(\mathbf{Y}_{[\Omega_1,\Omega_2,\Omega_3]})\in\mathbb{R}^{m_t\times\prod_{s\neq t}m_s}, \qquad \text{(body matricizations)} \tag{9}
\]
\[
\mathbf{Y}_{\Omega_t\times\Xi_t} = \mathcal{M}_t(\mathbf{Y}_{\Omega_t\times\Xi_t})\in\mathbb{R}^{m_t\times g_t}. \qquad \text{(joint matricizations)} \tag{10}
\]
In the noiseless setting, we propose the following formula to complete $\mathbf{X}$:
\[
\hat{\mathbf{X}} = \mathbf{Y}_{[\Omega_1,\Omega_2,\Omega_3]}\times_1 R_1\times_2 R_2\times_3 R_3, \tag{11}
\]
\[
R_1 = \mathbf{Y}_{\Xi_1}\mathbf{Y}_{\Omega_1\times\Xi_1}^\dagger\in\mathbb{R}^{p_1\times m_1},\quad R_2 = \mathbf{Y}_{\Xi_2}\mathbf{Y}_{\Omega_2\times\Xi_2}^\dagger\in\mathbb{R}^{p_2\times m_2},\quad R_3 = \mathbf{Y}_{\Xi_3}\mathbf{Y}_{\Omega_3\times\Xi_3}^\dagger\in\mathbb{R}^{p_3\times m_3}. \tag{12}
\]

The procedure is summarized in Algorithm 1. The theoretical guarantee for this proposed algorithm is provided in Theorem 1.

Algorithm 1 Cross: Efficient Algorithm for Tensor Completion with Noiseless Observations
1: Input: noiseless observations $\mathbf{Y}_{ijk}$, $(i,j,k)\in\Omega$ from (3).
2: Construct $\mathbf{Y}_{\Xi_1}$, $\mathbf{Y}_{\Xi_2}$, $\mathbf{Y}_{\Xi_3}$, $\mathbf{Y}_{\Omega_1\times\Xi_1}$, $\mathbf{Y}_{\Omega_2\times\Xi_2}$, $\mathbf{Y}_{\Omega_3\times\Xi_3}$ as in (8) and (10).
3: Calculate $R_1 = \mathbf{Y}_{\Xi_1}\mathbf{Y}_{\Omega_1\times\Xi_1}^\dagger\in\mathbb{R}^{p_1\times m_1}$, $R_2 = \mathbf{Y}_{\Xi_2}\mathbf{Y}_{\Omega_2\times\Xi_2}^\dagger\in\mathbb{R}^{p_2\times m_2}$, $R_3 = \mathbf{Y}_{\Xi_3}\mathbf{Y}_{\Omega_3\times\Xi_3}^\dagger\in\mathbb{R}^{p_3\times m_3}$.
4: Calculate the final estimator $\hat{\mathbf{X}} = \mathbf{Y}_{[\Omega_1,\Omega_2,\Omega_3]}\times_1 R_1\times_2 R_2\times_3 R_3$.
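The following NumPy sketch implements Algorithm 1 under the stated noiseless assumptions. It is our own illustration of formulas (11)–(12), not the authors' code, and it reuses the hypothetical helpers matricize/mode_product and cross_measurement_sets from the earlier sketches.

```python
import numpy as np

def arm_and_joint(Y, Omega, Xi, t):
    """Arm matricization Y_{Xi_t} (p_t x g_t) and joint matricization Y_{Omega_t x Xi_t} (m_t x g_t).

    Each column is the mode-t fiber of Y indexed by one pair in Xi_t, so only entries in the
    Cross measurement set are used."""
    s1, s2 = (t + 1) % 3, (t + 2) % 3
    fibers = []
    for (a, b) in Xi[t]:
        idx = [slice(None)] * 3
        idx[s1], idx[s2] = a, b
        fibers.append(Y[tuple(idx)])                 # mode-t fiber of length p_t
    arm = np.stack(fibers, axis=1)                   # p_t x g_t
    joint = arm[Omega[t], :]                         # m_t x g_t
    return arm, joint

def cross_complete_noiseless(Y, Omega, Xi):
    """Algorithm 1: X_hat = Y_[Omega1,Omega2,Omega3] x_1 R_1 x_2 R_2 x_3 R_3, R_t = Y_{Xi_t} Y_{Omega_t x Xi_t}^+."""
    X_hat = Y[np.ix_(Omega[0], Omega[1], Omega[2])]  # body measurements, m1 x m2 x m3
    for t in range(3):
        arm, joint = arm_and_joint(Y, Omega, Xi, t)
        R = arm @ np.linalg.pinv(joint)              # p_t x m_t
        X_hat = mode_product(X_hat, R, t)
    return X_hat
```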

[Figure 1: Illustrative example of the Cross tensor measurement scheme. Panels: (a) illustration of $\Omega_t$, $\Xi_t$; (b) all measurements; (c) body measurements $\mathbf{Y}_{[\Omega_1,\Omega_2,\Omega_3]}$; (d)–(f) arm measurements $\mathbf{Y}_{\Xi_1}$, $\mathbf{Y}_{\Xi_2}$, $\mathbf{Y}_{\Xi_3}$; (g)–(i) joint measurements $\mathbf{Y}_{\Omega_1\times\Xi_1}$, $\mathbf{Y}_{\Omega_2\times\Xi_2}$, $\mathbf{Y}_{\Omega_3\times\Xi_3}$. For better illustration we assume $\Omega_t = [1:m_t]$, $p_1=p_2=p_3=10$, $m_1=m_2=m_3=g_1=g_2=g_3=4$.]

Theorem 1 (Exact recovery in the noiseless setting) Suppose $\mathbf{X}\in\mathbb{R}^{p_1\times p_2\times p_3}$, $\mathrm{rank}(\mathbf{X}) = (r_1,r_2,r_3)$. Assume all Cross tensor measurements are noiseless, i.e. $\mathbf{Y}_\Omega = \mathbf{X}_\Omega$. If $r_t\leq\min\{m_t,g_t\}$ and $\mathrm{rank}(\mathbf{Y}_{\Omega_t\times\Xi_t}) = r_t$ for $t=1,2,3$, then
\[
\mathbf{X} = \mathbf{Y}_{[\Omega_1,\Omega_2,\Omega_3]}\times_1 R_1\times_2 R_2\times_3 R_3, \quad\text{where}\quad R_t = \mathbf{Y}_{\Xi_t}\mathbf{Y}_{\Omega_t\times\Xi_t}^\dagger,\ t=1,2,3. \tag{13}
\]
Moreover, if there are $\tilde M_t\in\mathbb{R}^{m_t\times r_t}$, $\tilde N_t\in\mathbb{R}^{g_t\times r_t}$ such that $\tilde M_t^\top\mathbf{X}_{\Omega_t\times\Xi_t}\tilde N_t\in\mathbb{R}^{r_t\times r_t}$ is non-singular for $t=1,2,3$, then we further have
\[
\mathbf{X} = \mathbf{Y}_{[\Omega_1,\Omega_2,\Omega_3]}\times_1\tilde R_1\times_2\tilde R_2\times_3\tilde R_3, \quad\text{where}\quad \tilde R_t = \mathbf{Y}_{\Xi_t}\tilde N_t\big(\tilde M_t^\top\mathbf{Y}_{\Omega_t\times\Xi_t}\tilde N_t\big)^{-1}\tilde M_t^\top,\ t=1,2,3. \tag{14}
\]

Theorem 1 shows that, in the noiseless setting, as long as $r_t\leq\min\{m_t,g_t\}$ and the joint matricizations $\mathbf{Y}_{\Omega_t\times\Xi_t}$ have the same rank as $\mathcal{M}_t(\mathbf{X})$, exact recovery by Algorithm 1 is guaranteed. Therefore, when we set $m_t = g_t = r_t$, the minimum required number of measurements for the proposed Cross tensor measurement scheme is $r_1r_2r_3 + r_1(p_1-r_1) + r_2(p_2-r_2) + r_3(p_3-r_3)$, which exactly matches the lower bound established in Proposition 1 and outperforms the previous methods in the literature. On the other hand, Algorithm 1 relies heavily on the noiseless assumption. In fact, calculating $\mathbf{Y}_{\Omega_t\times\Xi_t}^\dagger = (\mathbf{X}_{\Omega_t\times\Xi_t}+\mathbf{Z}_{\Omega_t\times\Xi_t})^\dagger$ is unstable even with low levels of noise, which ruins the performance of Algorithm 1. Since we rarely have noiseless observations in practice, we focus on the setting with non-zero noise for the rest of the paper.

2.4 Recovery Algorithm – Noisy Case

In this section we propose the following procedure for recovery in the noisy setting. The proposed algorithm is divided into four steps, and an illustrative example is provided in Figure 2 for the reader's convenience.

• (Step 1: Construction of matricizations) As in Algorithm 1, construct the arm, body and joint matricizations as in (8)–(10) (see Figure 2(a)):
\[
\text{Arm: } \mathbf{Y}_{\Xi_1}, \mathbf{Y}_{\Xi_2}, \mathbf{Y}_{\Xi_3}; \qquad \text{Body: } \mathbf{Y}_{1,\Omega}, \mathbf{Y}_{2,\Omega}, \mathbf{Y}_{3,\Omega}; \qquad \text{Joint: } \mathbf{Y}_{\Omega_1\times\Xi_1}, \mathbf{Y}_{\Omega_2\times\Xi_2}, \mathbf{Y}_{\Omega_3\times\Xi_3}.
\]

• (Step 2: Rotation) For $t=1,2,3$, we calculate the singular value decompositions of $\mathbf{Y}_{\Xi_t}$ and $\mathbf{Y}_{t,\Omega}$, then store
\[
V_t^{(A)}\in\mathbb{O}_{g_t} \text{ as the right singular vectors of } \mathbf{Y}_{\Xi_t}, \qquad U_t^{(B)}\in\mathbb{O}_{m_t} \text{ as the left singular vectors of } \mathbf{Y}_{t,\Omega}. \tag{15}
\]
Here the superscripts "(A)" and "(B)" represent arm and body, respectively. We then rotate the arm and joint matricizations based on these SVDs (see Figure 2(b), (c)):
\[
A_t = \mathbf{Y}_{\Xi_t}\cdot V_t^{(A)}\in\mathbb{R}^{p_t\times g_t}, \qquad J_t = (U_t^{(B)})^\top\cdot\mathbf{Y}_{\Omega_t\times\Xi_t}\cdot V_t^{(A)}\in\mathbb{R}^{m_t\times g_t}. \tag{16}
\]
As we can see from Figure 2(c), the magnitudes of $A_t$'s columns and of $J_t$'s columns and rows decrease from front to back. Therefore, the important factors of $\mathbf{Y}_{\Xi_t}$ and $\mathbf{Y}_{\Omega_t\times\Xi_t}$ are moved to the front rows and columns in this step.

• (Step 3: Adaptive trimming) Since $A_t$ and $J_t$ are contaminated with noise, in this step we denoise them by trimming the lower-ranking columns of $A_t$ and both the lower-ranking columns and rows of $J_t$. To decide the number of rows and columns to trim, it is helpful to have an estimate of $(r_1,r_2,r_3)$, say $(\hat r_1,\hat r_2,\hat r_3)$. We will show later in the theoretical analysis that a good choice of $\hat r_t$ should satisfy
\[
(J_t)_{[1:\hat r_t,1:\hat r_t]} \text{ is non-singular and } \big\|(A_t)_{[:,1:\hat r_t]}(J_t)_{[1:\hat r_t,1:\hat r_t]}^{-1}\big\|\leq\lambda_t, \qquad t=1,2,3. \tag{17}
\]
Here $\lambda_t$ is a tuning parameter, and in a variety of situations we can simply choose $\lambda_t = 3\sqrt{p_t/m_t}$, for reasons we discuss later. Our final estimator of $r_t$ is the largest $\hat r_t$ that satisfies Condition (17), which can be found by checking (17) for all possible values of $\hat r_t$. It is worth mentioning that this step shares similar ideas with structured matrix completion in Cai et al. (2016). (See Figure 2(d) and (e).)

• (Step 4: Assembling) Finally, given $\hat r_1,\hat r_2,\hat r_3$ obtained from Step 3, we calculate
\[
\bar R_t = (A_t)_{[:,1:\hat r_t]}(J_t)_{[1:\hat r_t,1:\hat r_t]}^{-1}\big(U^{(B)}_{t,[:,1:\hat r_t]}\big)^\top\in\mathbb{R}^{p_t\times m_t}, \qquad t=1,2,3, \tag{18}
\]
and recover the original low-rank tensor $\mathbf{X}$ by
\[
\hat{\mathbf{X}} = \mathbf{Y}_{[\Omega_1,\Omega_2,\Omega_3]}\times_1\bar R_1\times_2\bar R_2\times_3\bar R_3. \tag{19}
\]

The procedure is summarized as Algorithm 2. It is worth mentioning that both Algorithms 1 and 2 can be easily extended to fourth and higher order tensors.

[Figure 2: Illustration of the proposed procedure in the noisy setting. Panels: (a) Step 1, all matricizations $\mathbf{Y}_{t,\Omega}$, $\mathbf{Y}_{\Xi_t}$ and $\mathbf{Y}_{\Omega_t\times\Xi_t}$; (b) heatmap illustration of $\mathbf{Y}_{\Xi_t}$, $\mathbf{Y}_{\Omega_t\times\Xi_t}$ (darker blocks mean larger absolute values); (c) Step 2, $A_t$ and $J_t$ after rotation; (d) Step 3, intermediate stage of trimming; (e) Step 3, eventually located at $\hat r_t = 4$.]

Algorithm 2 Noisy Tensor Completion with Cross Measurements
1: Input: entries $\mathbf{Y}_{ijk}$, $(i,j,k)\in\Omega$ from (3); tuning parameters $\lambda_1,\lambda_2,\lambda_3$.
2: Construct the arm, body and joint matricizations as in (8)–(10): $\mathbf{Y}_{\Xi_t}\in\mathbb{R}^{p_t\times g_t}$, $\mathbf{Y}_{t,\Omega}\in\mathbb{R}^{m_t\times\prod_{s\neq t}m_s}$, $\mathbf{Y}_{\Omega_t\times\Xi_t}\in\mathbb{R}^{m_t\times g_t}$, $t=1,2,3$.
3: Calculate, via SVDs, $U_t^{(B)}\in\mathbb{O}_{m_t}$ as the left singular vectors of $\mathbf{Y}_{t,\Omega}$ and $V_t^{(A)}\in\mathbb{O}_{g_t}$ as the right singular vectors of $\mathbf{Y}_{\Xi_t}$.
4: Rotate the arm and joint matricizations as $A_t = \mathbf{Y}_{\Xi_t}\cdot V_t^{(A)}\in\mathbb{R}^{p_t\times g_t}$, $J_t = (U_t^{(B)})^\top\cdot\mathbf{Y}_{\Omega_t\times\Xi_t}\cdot V_t^{(A)}\in\mathbb{R}^{m_t\times g_t}$.
5: for $t = 1, 2, 3$ do
6:   for $s = \min\{g_t,m_t\}:-1:1$ do
7:     if $J_{t,[1:s,1:s]}$ is non-singular and $\|A_{t,[:,1:s]}J_{t,[1:s,1:s]}^{-1}\|\leq\lambda_t$ then
8:       $\hat r_t = s$; break from the loop;
9:     end if
10:   end for
11:   If $\hat r_t$ is still unassigned, set $\hat r_t = 0$.
12: end for
13: Calculate $\bar R_t = A_{t,[:,1:\hat r_t]}J_{t,[1:\hat r_t,1:\hat r_t]}^{-1}\big(U^{(B)}_{t,[:,1:\hat r_t]}\big)^\top\in\mathbb{R}^{p_t\times m_t}$, $t=1,2,3$.
14: Compute the final estimator $\hat{\mathbf{X}} = \mathbf{Y}_{[\Omega_1,\Omega_2,\Omega_3]}\times_1\bar R_1\times_2\bar R_2\times_3\bar R_3$.
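The following NumPy sketch mirrors Algorithm 2: SVD-based rotation, adaptive trimming with the default tuning $\lambda_t = 3\sqrt{p_t/m_t}$, and assembling. It is our own illustration under the indexing conventions of the earlier sketches (arm_and_joint, matricize, mode_product are the hypothetical helpers defined above), not the authors' implementation.

```python
import numpy as np

def cross_complete_noisy(Y, Omega, Xi, lam=None):
    """Algorithm 2 sketch for a 3-way array Y observed on a Cross measurement set."""
    p = Y.shape
    m = [len(Omega[t]) for t in range(3)]
    if lam is None:                                      # default tuning lambda_t = 3*sqrt(p_t/m_t)
        lam = [3.0 * np.sqrt(p[t] / m[t]) for t in range(3)]
    body = Y[np.ix_(Omega[0], Omega[1], Omega[2])]       # m1 x m2 x m3 body measurements
    X_hat = body
    for t in range(3):
        arm, joint = arm_and_joint(Y, Omega, Xi, t)      # p_t x g_t and m_t x g_t
        # Step 2: rotations. U_B: left singular vectors of the body matricization Y_{t,Omega};
        # V_A: right singular vectors of the arm matricization Y_{Xi_t}.
        U_B, _, _ = np.linalg.svd(matricize(body, t), full_matrices=True)
        _, _, VhA = np.linalg.svd(arm, full_matrices=True)
        A = arm @ VhA.T                                  # p_t x g_t
        J = U_B.T @ joint @ VhA.T                        # m_t x g_t
        # Step 3: adaptive trimming -- largest s with J[:s,:s] invertible and ||A[:, :s] J[:s,:s]^{-1}|| <= lambda_t.
        r_hat = 0
        for s in range(min(J.shape), 0, -1):
            Js = J[:s, :s]
            if np.linalg.matrix_rank(Js) < s:
                continue
            if np.linalg.norm(A[:, :s] @ np.linalg.inv(Js), 2) <= lam[t]:
                r_hat = s
                break
        if r_hat == 0:
            R = np.zeros((p[t], m[t]))                   # degenerate case: estimate zero along this mode
        else:
            # Step 4: assembling, R_bar_t = A[:, :r] J[:r, :r]^{-1} (U_B[:, :r])^T  (p_t x m_t).
            R = A[:, :r_hat] @ np.linalg.inv(J[:r_hat, :r_hat]) @ U_B[:, :r_hat].T
        X_hat = mode_product(X_hat, R, t)
    return X_hat
```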

3 Theoretical Analysis

In this section, we investigate the theoretical performance of the procedure proposed in the last section. Recall that our goal is to recover $\mathbf{X}$ from $\mathbf{Y}_\Omega$ based on (3). One can further define the arm, joint and body matricizations for $\mathbf{X}$ and $\mathbf{Z}$, i.e. $X_{\Xi_t}$, $X_{\Omega_t\times\Xi_t}$, $X_{t,\Omega}$, $Z_{\Xi_t}$, $Z_{\Omega_t\times\Xi_t}$ and $Z_{t,\Omega}$ for $t=1,2,3$, in the same fashion as $\mathbf{Y}_{\Xi_t}$, $\mathbf{Y}_{\Omega_t\times\Xi_t}$ and $\mathbf{Y}_{t,\Omega}$ in (8), (9) and (10). We first present the following theoretical guarantee for low-rank tensor completion based on noisy observations via Algorithm 2.

Theorem 2 Suppose $\mathbf{X}\in\mathbb{R}^{p_1\times p_2\times p_3}$, $\mathrm{rank}(\mathbf{X}) = (r_1,r_2,r_3)$. Assume we observe $\mathbf{Y}_\Omega$ based on the Cross tensor measurement scheme (3), where $\mathbf{X}_\Omega$ satisfies $\mathrm{rank}(X_{\Omega_t\times\Xi_t}) = r_t$ and
\[
\sigma_{r_t}(X_{\Omega_t\times\Xi_t}) > 5\|Z_{\Omega_t\times\Xi_t}\|, \quad \sigma_{r_t}(X_{\Xi_t}) > 5\|Z_{\Xi_t}\|, \quad \sigma_{r_t}(X_{t,\Omega}) > 5\|Z_{t,\Omega}\|. \tag{20}
\]
We further define
\[
\xi_t = \big\|X_{\Omega_t\times\Xi_t}^\dagger X_{t,\Omega}\big\|, \qquad t=1,2,3. \tag{21}
\]
Applying Algorithm 2 with $\lambda_1,\lambda_2,\lambda_3$ satisfying
\[
\lambda_t \geq 2\big\|X_{\Xi_t}X_{\Omega_t\times\Xi_t}^\dagger\big\|, \qquad t=1,2,3, \tag{22}
\]
we have the following upper bounds for some uniform constant $C$:
\[
\big\|\hat{\mathbf{X}}-\mathbf{X}\big\|_{\mathrm{HS}} \leq C\lambda_1\lambda_2\lambda_3\|Z_{[\Omega_1,\Omega_2,\Omega_3]}\|_{\mathrm{HS}} + C\lambda_1\lambda_2\lambda_3\sum_{t=1}^3\Big(\xi_t\|Z_{\Omega_t\times\Xi_t}\|_F + \frac{\xi_t}{\lambda_t}\|Z_{\Xi_t}\|_F\Big),
\]
\[
\big\|\hat{\mathbf{X}}-\mathbf{X}\big\|_{\mathrm{op}} \leq C\lambda_1\lambda_2\lambda_3\|Z_{[\Omega_1,\Omega_2,\Omega_3]}\|_{\mathrm{op}} + C\lambda_1\lambda_2\lambda_3\sum_{t=1}^3\Big(\xi_t\|Z_{\Omega_t\times\Xi_t}\| + \frac{\xi_t}{\lambda_t}\|Z_{\Xi_t}\|\Big).
\]

It is helpful to explain the meaning of the conditions used in Theorem 2. The singular value gap condition (20) is assumed in order to guarantee that the signal dominates the noise in the observed blocks. The quantities $\lambda_t$ and $\xi_t$ are important factors in our analysis, representing "arm-joint" and "joint-body" ratios, respectively. These factors roughly indicate how much information is contained in the body and arm measurements and how much impact the noise terms have on the upper bound, all of which implicitly indicates the difficulty of the problem. Based on $\lambda_t$, $\xi_t$, we consider the following class of tuples formed by rank-$(r_1,r_2,r_3)$ tensors, the perturbation $\mathbf{Z}$, and the indices of observations,
\[
\mathcal{F} = \mathcal{F}_{\{\lambda_t\},\{\xi_t\}} = \left\{(\mathbf{X},\mathbf{Z},\Omega_t,\Xi_t):
\begin{array}{l}
\mathbf{X}\in\mathbb{R}^{p_1\times p_2\times p_3},\ \mathrm{rank}(\mathbf{X}) = (r_1,r_2,r_3);\\[2pt]
\sigma_{r_t}(X_{\Omega_t\times\Xi_t})\geq 5\|Z_{\Omega_t\times\Xi_t}\|,\ \sigma_{r_t}(X_{\Xi_t})\geq 5\|Z_{\Xi_t}\|,\ \sigma_{r_t}(X_{t,\Omega})\geq 5\|Z_{t,\Omega}\|;\\[2pt]
\big\|X_{\Xi_t}X_{\Omega_t\times\Xi_t}^\dagger\big\|\leq\lambda_t,\ \big\|X_{\Omega_t\times\Xi_t}^\dagger X_{t,\Omega}\big\|\leq\xi_t
\end{array}\right\} \tag{23}
\]

and provide the following lower bound result over $\mathcal{F}_{\{\lambda_t\},\{\xi_t\}}$.

Theorem 3 (Lower bound) Suppose the positive integers $r_t, p_t$ satisfy $6\max\{r_1,r_2,r_3\}^2\leq r_1r_2r_3$ and $p_t\geq 2r_t$. Suppose the body, arm and joint measurement errors are bounded as
\[
\|Z_{[\Omega_1,\Omega_2,\Omega_3]}\|_{\mathrm{HS}}\leq C^{(B)}, \quad \|Z_{\Xi_t}\|_F\leq C_t^{(A)}, \quad \|Z_{\Omega_t\times\Xi_t}\|_F\leq C_t^{(J)}. \tag{24}
\]
If $C_t^{(J)}\leq\min\{C_t^{(A)}, C^{(B)}\}$, $\xi_t\geq 28$ and $\lambda_t>1$, then there exists a uniform constant $c>0$ such that
\[
\inf_{\hat{\mathbf{X}}}\ \sup_{\substack{(\mathbf{X},\mathbf{Z},\Omega_t,\Xi_t)\in\mathcal{F}\\ \mathbf{Z}\text{ satisfies }(24)}} \big\|\hat{\mathbf{X}}-\mathbf{X}\big\|_{\mathrm{HS}} \geq c\lambda_1\lambda_2\lambda_3 C^{(B)} + c\lambda_1\lambda_2\lambda_3\sum_{t=1}^3\Big(\xi_t C_t^{(J)} + \frac{\xi_t}{\lambda_t}C_t^{(A)}\Big). \tag{25}
\]
Similarly, suppose $C^{(B)}$, $C_t^{(A)}$, $C_t^{(J)}$ are upper bounds for the body, arm and joint measurement errors in the tensor and matrix operator norms, respectively, i.e.
\[
\|Z_{[\Omega_1,\Omega_2,\Omega_3]}\|_{\mathrm{op}}\leq C^{(B)}, \quad \|Z_{\Xi_t}\|\leq C_t^{(A)}, \quad \|Z_{\Omega_t\times\Xi_t}\|\leq C_t^{(J)}. \tag{26}
\]
If $C_t^{(J)}\leq\min\{C_t^{(A)}, C^{(B)}\}$, $\xi_t\geq 28$ and $\lambda_t>1$, then
\[
\inf_{\hat{\mathbf{X}}}\ \sup_{\substack{(\mathbf{X},\mathbf{Z},\Omega_t,\Xi_t)\in\mathcal{F}\\ \mathbf{Z}\text{ satisfies }(26)}} \big\|\hat{\mathbf{X}}-\mathbf{X}\big\|_{\mathrm{op}} \geq c\lambda_1\lambda_2\lambda_3 C^{(B)} + c\lambda_1\lambda_2\lambda_3\sum_{t=1}^3\Big(\xi_t C_t^{(J)} + \frac{\xi_t}{\lambda_t}C_t^{(A)}\Big). \tag{27}
\]

Remark 2 Theorems 2 and 3 together yield the optimal rate of recovery over $\mathcal{F}$ in both Hilbert–Schmidt and operator norms:
\[
\inf_{\hat{\mathbf{X}}}\ \sup_{\substack{(\mathbf{X},\mathbf{Z},\Omega_t,\Xi_t)\in\mathcal{F}\\ \mathbf{Z}\text{ satisfies }(24)}} \big\|\hat{\mathbf{X}}-\mathbf{X}\big\|_{\mathrm{HS}} \asymp \lambda_1\lambda_2\lambda_3\Bigg(C^{(B)} + \sum_{t=1}^3\Big(\xi_t C_t^{(J)} + \frac{\xi_t}{\lambda_t}C_t^{(A)}\Big)\Bigg),
\]
\[
\inf_{\hat{\mathbf{X}}}\ \sup_{\substack{(\mathbf{X},\mathbf{Z},\Omega_t,\Xi_t)\in\mathcal{F}\\ \mathbf{Z}\text{ satisfies }(26)}} \big\|\hat{\mathbf{X}}-\mathbf{X}\big\|_{\mathrm{op}} \asymp \lambda_1\lambda_2\lambda_3\Bigg(C^{(B)} + \sum_{t=1}^3\Big(\xi_t C_t^{(J)} + \frac{\xi_t}{\lambda_t}C_t^{(A)}\Big)\Bigg).
\]

As we can see from the theoretical analyses, the choice of $\lambda_t$ is crucial to the recovery performance of Algorithm 2. Theorem 2 provides a guideline for such a choice depending on the unknown parameter $\|X_{\Xi_t}X_{\Omega_t\times\Xi_t}^\dagger\|$, which is hard to obtain in practice. However, we can choose $\lambda_t = 3\sqrt{p_t/m_t}$ in a variety of settings. Specifically, in the analysis below we show that under the random sampling scheme, where $\Omega_1,\Omega_2,\Omega_3,\Xi_1,\Xi_2,\Xi_3$ are uniformly randomly selected from $[1:p_1]$, $[1:p_2]$, $[1:p_3]$, $\Omega_2\times\Omega_3$, $\Omega_3\times\Omega_1$, $\Omega_1\times\Omega_2$, Algorithm 2 with $\lambda_t = 3\sqrt{p_t/m_t}$ has guaranteed performance. The choice $\lambda_t = 3\sqrt{p_t/m_t}$ is further examined in the simulation studies later.

Theorem 4 Suppose $\mathbf{X}$ has Tucker decomposition $\mathbf{X} = \mathbf{S}\times_1 U_1\times_2 U_2\times_3 U_3$, where $\mathbf{S}\in\mathbb{R}^{r_1\times r_2\times r_3}$, $U_1\in\mathbb{O}_{p_1,r_1}$, $U_2\in\mathbb{O}_{p_2,r_2}$, $U_3\in\mathbb{O}_{p_3,r_3}$, and $U_1,U_2,U_3$, $\mathcal{M}_1(\mathbf{S}\times_2 U_2\times_3 U_3)$, $\mathcal{M}_2(\mathbf{S}\times_1 U_1\times_3 U_3)$, $\mathcal{M}_3(\mathbf{S}\times_1 U_1\times_2 U_2)$ all satisfy the matrix incoherence conditions
\[
\max_{1\leq j\leq p_t}\frac{p_t}{r_t}\big\|P_{U_t}e_j^{(p_t)}\big\|_2^2\leq\rho, \qquad \max_{1\leq j\leq\prod_{s\neq t}p_s}\frac{\prod_{s\neq t}p_s}{r_t}\big\|P_{\mathcal{M}_t(\mathbf{S}\times_{t+1}U_{t+1}\times_{t+2}U_{t+2})^\top}\,e_j^{(\prod_{s\neq t}p_s)}\big\|_2^2\leq\rho, \tag{28}
\]
where $e_j^{(p)}$ is the $j$-th canonical basis vector in $\mathbb{R}^p$. Suppose we are given random Cross tensor measurements, i.e., $\Omega_t$ and $\Xi_t$ are uniformly randomly chosen sets of $m_t$ and $g_t$ values from $\{1,\ldots,p_t\}$ and $\prod_{s\neq t}\Omega_s$, respectively. If, for $t=1,2,3$,
\[
\sigma_{\min}(\mathcal{M}_t(\mathbf{S}))\geq\max\left\{10\sqrt{\frac{p_1p_2p_3}{m_tg_t}}\|Z_{\Omega_t\times\Xi_t}\|,\ 10\sqrt{\frac{p_1p_2p_3}{m_1m_2m_3}}\|Z_{t,\Omega}\|,\ 19\sqrt{\frac{p_1p_2p_3}{p_tg_t}}\|Z_{\Xi_t}\|\right\}, \tag{29}
\]
then Algorithm 2 with $\lambda_t = 3\sqrt{p_t/m_t}$ yields
\[
\big\|\hat{\mathbf{X}}-\mathbf{X}\big\|_{\mathrm{HS}} \leq C\sqrt{\frac{p_1p_2p_3}{m_1m_2m_3}}\|Z_{[\Omega_1,\Omega_2,\Omega_3]}\|_{\mathrm{HS}} + C\sqrt{p_1p_2p_3}\sum_{t=1}^3\left(\frac{\|Z_{\Omega_t\times\Xi_t}\|_F}{g_t\sqrt{m_t}}+\frac{\|Z_{\Xi_t}\|_F}{g_t\sqrt{p_t}}\right),
\]
\[
\big\|\hat{\mathbf{X}}-\mathbf{X}\big\|_{\mathrm{op}} \leq C\sqrt{\frac{p_1p_2p_3}{m_1m_2m_3}}\|Z_{[\Omega_1,\Omega_2,\Omega_3]}\|_{\mathrm{op}} + C\sqrt{p_1p_2p_3}\sum_{t=1}^3\left(\frac{\|Z_{\Omega_t\times\Xi_t}\|}{g_t\sqrt{m_t}}+\frac{\|Z_{\Xi_t}\|}{g_t\sqrt{p_t}}\right),
\]
with probability at least $1-2\sum_{t=1}^3 r_t\{\exp(-m_t/(16r_t\rho))+\exp(-g_t/(64r_t\rho))\}$.

Remark 3 The incoherence conditions (28) are widely used in the matrix and tensor completion literature (see, e.g., Candès and Tao (2010); Recht (2011); Yuan and Zhang (2016)). These conditions essentially require that every entry of $\mathbf{X}$ contains a similar level of information about the whole tensor, so that the observable entries carry enough information to recover the original tensor.

4 Simulation Study

In this section, we investigate the numerical performance of the proposed procedure in a variety of settings. We repeat each setting 1000 times and record the average relative loss in Hilbert–Schmidt norm, i.e., $\|\hat{\mathbf{X}}-\mathbf{X}\|_{\mathrm{HS}}/\|\mathbf{X}\|_{\mathrm{HS}}$.

We first focus on the setting with i.i.d. Gaussian noise. To be specific, we randomly generate $\mathbf{X} = \mathbf{S}\times_1 E_1\times_2 E_2\times_3 E_3$, where $\mathbf{S}\in\mathbb{R}^{r_1\times r_2\times r_3}$, $E_1\in\mathbb{R}^{p_1\times r_1}$, $E_2\in\mathbb{R}^{p_2\times r_2}$, $E_3\in\mathbb{R}^{p_3\times r_3}$ all have i.i.d. standard Gaussian entries. One can verify that $\mathbf{X}$ is a rank-$(r_1,r_2,r_3)$ tensor with probability 1 whenever $r_1,r_2,r_3$ satisfy $\max\{r_1,r_2,r_3\}^2\leq r_1r_2r_3$. We then generate the Cross tensor measurement set $\Omega$ as in (3), with $\Omega_t$ consisting of $m_t$ uniformly randomly selected values from $[1:p_t]$ and $\Xi_t$ consisting of $g_t$ uniformly randomly selected values from $\prod_{s\neq t}\Omega_s$, and contaminate

$\mathbf{X}_\Omega$ with i.i.d. Gaussian noise: $\mathbf{Y}_\Omega = \mathbf{X}_\Omega + \mathbf{Z}_\Omega$, where $\mathbf{Z}_{ijk}\stackrel{\mathrm{iid}}{\sim}N(0,\sigma^2)$. Under this configuration, we study the influence of different factors, including $\lambda_t$, $\sigma$, $m_t$, $g_t$ and $p_t$, on the numerical performance; see also the sketch below.

Under the Gaussian noise setting, we first compare different choices of the tuning parameter $\lambda_t$. To be specific, we set $p_1=p_2=p_3=50$, $m_1=m_2=m_3=g_1=g_2=g_3=10$, $r_1=r_2=r_3=3$, and let $\sigma$ range from 1 to 0.01 and $\lambda_t$ range from $1.5\sqrt{p_t/m_t}$ to $4\sqrt{p_t/m_t}$. The average relative Hilbert–Schmidt norm loss of $\hat{\mathbf{X}}$ from Algorithm 2 with $\lambda_t\in[1.5\sqrt{p_t/m_t}, 4\sqrt{p_t/m_t}]$ is reported in Figure 3(a). It can be seen that the average relative loss decays as the noise level decreases. Comparing the different choices of $\lambda_t$, $\lambda_t = 3\sqrt{p_t/m_t}$ works best across the different values of $\sigma$, which matches our suggestion from the theoretical analysis. Thus, we set $\lambda_t = 3\sqrt{p_t/m_t}$ for all the other simulation settings and the real data analysis.

We also examine the effects of $m_t = |\Omega_t|$ and $g_t = |\Xi_t|$ on the numerical performance of Algorithm 2. We set $p_1=p_2=p_3=50$, $r_1=r_2=r_3=3$, $\sigma=0.3$, $\lambda_t = 3\sqrt{p_t/m_t}$, let $g_t$, $m_t$ vary from 6 to 30, and plot the average relative Hilbert–Schmidt norm loss in Figure 3(b). It can be seen that as $g_t$, $m_t$ grow, namely when more entries are observable, better recovery performance is achieved.

To further study the impact of high dimensionality on the proposed procedure, we consider the setting where the dimension of $\mathbf{X}$ grows. Here, $r_1=r_2=r_3=3$, $\sigma=0.3$, $m_1=m_2=m_3=g_1=g_2=g_3\in\{10,15,20,25\}$ and $p_1,p_2,p_3$ grow from 100 to 500.
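For concreteness, the following sketch (our own illustration, not the authors' code) wires together the hypothetical helpers from the earlier sketches — mode_product, cross_measurement_sets and cross_complete_noisy — to run one replicate of the Gaussian-noise experiment described above and compute the relative Hilbert–Schmidt loss.

```python
import numpy as np

def simulate_gaussian_replicate(p=(50, 50, 50), r=(3, 3, 3), m=(10, 10, 10), g=(10, 10, 10),
                                sigma=0.3, seed=0):
    """One replicate: random rank-(r1,r2,r3) tensor, random Cross measurements, Gaussian noise, recovery."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal(r)                                         # core tensor S
    for t in range(3):
        X = mode_product(X, rng.standard_normal((p[t], r[t])), t)      # X = S x_1 E_1 x_2 E_2 x_3 E_3
    Omega, Xi = cross_measurement_sets(p, m, g, rng=rng)
    Y = X + sigma * rng.standard_normal(p)                             # only the entries indexed by Omega are used
    X_hat = cross_complete_noisy(Y, Omega, Xi)                         # Algorithm 2, default lambda_t = 3*sqrt(p_t/m_t)
    return np.linalg.norm(X_hat - X) / np.linalg.norm(X)               # relative Hilbert-Schmidt loss
```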


The average relative loss in Hilbert–Schmidt norm and the average running time are provided in Figure 3(c) and (d), respectively. In particular, the recovery of a 500-by-500-by-500 tensor involves 125,000,000 variables, but the proposed procedure provides stable recovery within 10 seconds on average on a PC with a 3.1 GHz CPU, which demonstrates the efficiency of the proposed algorithm.

[Figure 3: Numerical performance under Gaussian noise settings. Panels: (a) varying noise level $\sigma$ from 1 to 0.01 and tuning level $\lambda_t = c\sqrt{p_t/m_t}$, $c\in\{1.5,2,2.5,3,3.5,4\}$; (b) varying $g_t$ and $m_t$ from 6 to 30; (c) average relative loss when $p_t$, $m_t$, $g_t$ vary; (d) average time cost when $p_t$, $m_t$, $g_t$ vary.]

Next we move on to the setting where observations take discrete random values. High-dimensional count data commonly appear in a wide range of applications, including fluorescence microscopy, network flow, and microbiome studies (see, e.g., Nowak and Kolaczyk (2000); Jiang et al. (2015); Cao et al. (2016); Cao and Xie (2016)), where Poisson and multinomial distributions are often used to model the counts. In this simulation study, we generate $\mathbf{S}\in\mathbb{R}^{r_1\times r_2\times r_3}$ and $E_t\in$

$\mathbb{R}^{p_t\times r_t}$ with entries given by the absolute values of i.i.d. standard normal random variables, and calculate
\[
\mathbf{X} = \frac{\mathbf{S}\times_1 E_1\times_2 E_2\times_3 E_3}{\sum_{i=1}^{p_1}\sum_{j=1}^{p_2}\sum_{k=1}^{p_3}(\mathbf{S}\times_1 E_1\times_2 E_2\times_3 E_3)_{ijk}}.
\]
$\Omega_t$, $\Xi_t$ are generated similarly as before, $p_1=p_2=p_3=50$, $r_1=r_2=r_3=3$, $m_1=m_2=m_3=g_1=g_2=g_3\in\{10,15,20,25\}$, and $\mathbf{Y} = (\mathbf{Y}_{ijk})$ is Poisson or multinomial distributed:
\[
\mathbf{Y}_{ijk}\sim\mathrm{Poisson}(H\mathbf{X}_{ijk}), \quad\text{or}\quad \mathbf{Y}\sim\mathrm{Multinomial}(N;\mathbf{X}), \qquad (i,j,k)\in[1:p_1,1:p_2,1:p_3].
\]
Here $H$ is a known intensity parameter for the Poisson observations and $N$ is the total count parameter for the multinomial observations. As shown in Figure 4, the proposed Algorithm 2 performs stably under both types of noise structure.

[Figure 4: Average relative loss in Hilbert–Schmidt norm based on Poisson and multinomial observations; $p_1=p_2=p_3=50$, $r_1=r_2=r_3=3$. Panels: (a) Poisson model with varying $m_t$, $g_t$ and intensity $H$; (b) multinomial model with varying $m_t$, $g_t$ and total count $N$.]

Although $\mathbf{X}$ is assumed to be exactly low-rank in all theoretical studies, this is not necessary in practice. In fact, our simulation study shows that Algorithm 2 performs well when $\mathbf{X}$ is only approximately low-rank. Specifically, we fix $p_1=p_2=p_3=50$, generate $\mathbf{W}\in\mathbb{R}^{p_1\times p_2\times p_3}$ with i.i.d. standard normal entries, set $U_1\in\mathbb{O}_{p_1}$, $U_2\in\mathbb{O}_{p_2}$, $U_3\in\mathbb{O}_{p_3}$ as uniform random orthogonal matrices, and $E_t = \mathrm{diag}(1, 1, 1^{-\alpha}, \cdots, (p_t-2)^{-\alpha})$. $\mathbf{X}$ is then constructed as
\[
\mathbf{X} = \mathbf{W}\times_1(E_1U_1)\times_2(E_2U_2)\times_3(E_3U_3).
\]
Here, $\alpha$ measures the decay rate of the singular values of each matricization of $\mathbf{X}$, and $\mathbf{X}$ becomes exactly rank-$(3,3,3)$ when $\alpha=\infty$. We consider different decay rates $\alpha$, noise levels $\sigma$, and

observation set sizes $m_t$ and $g_t$. The corresponding average relative Hilbert–Schmidt norm loss is reported in Figure 5. It can be seen that although $\mathbf{X}$ is not exactly low-rank, as long as the singular values of each matricization of $\mathbf{X}$ decay sufficiently fast, a desirable completion of $\mathbf{X}$ can still be achieved; in other words, the exact low-rank assumption used in the theoretical analysis is not necessary in practice, which again demonstrates the robustness of the proposed procedure.

[Figure 5: Average relative loss for approximately low-rank tensors. Panels: (a) fixed $m_t = g_t = 10$, varying singular value decay rate $\alpha$ and noise level $\sigma$; (b) fixed $\sigma = 0.3$, varying decay rate $\alpha$ and $m_t$, $g_t$.]
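The approximately low-rank construction above can be sketched as follows (again our own illustration, reusing the hypothetical mode_product helper); the random orthogonal factors are generated via QR of Gaussian matrices, which is approximately uniform on the orthogonal group.

```python
import numpy as np

def approx_low_rank_tensor(p=(50, 50, 50), alpha=2.0, seed=0):
    """X = W x_1 (E_1 U_1) x_2 (E_2 U_2) x_3 (E_3 U_3) with E_t = diag(1, 1, 1^-a, ..., (p_t-2)^-a)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal(p)                                        # W with i.i.d. N(0,1) entries
    for t in range(3):
        U, _ = np.linalg.qr(rng.standard_normal((p[t], p[t])))        # random orthogonal U_t
        d = np.concatenate(([1.0, 1.0], np.arange(1, p[t] - 1, dtype=float) ** (-alpha)))
        X = mode_product(X, d[:, None] * U, t)                        # apply E_t U_t, E_t = diag(d)
    return X
```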

5 Real Data Illustration

In this section, we apply the proposed Cross tensor measurement scheme to a real dataset on attention-deficit/hyperactivity disorder (ADHD), available from the ADHD-200 Sample Initiative (http://fcon_1000.projects.nitrc.org/indi/adhd200/). ADHD is a common disease that affects at least 5-7% of school-age children and may accompany patients throughout their lives, with direct costs of at least $36 billion per year in the United States. Despite being the most common mental disorder in children and adolescents, the cause of ADHD is largely unclear. To investigate the disease, the ADHD-200 study covered 285 subjects diagnosed with ADHD and 491 control subjects. After data cleaning, the dataset contains 776 tensors of dimension 121-by-145-by-121:

20

Yi , i = 1, . . . , 776. The storage space for these data through naive format is 121 × 145 × 121 × 776 × 4B ≈ 6.137 GB, which makes it difficult and costly for sampling, storage and computation. Therefore, we hope to reduce the sampling size for ADHD brain imaging data via the proposed Cross tensor measurement scheme. Figure 6 shows the singular values of each matricization of a randomly selected Yi . We can see that Yi is approximately Tucker low-rank.

Figure 6: Singular value decompositions for each matricization of Y

Similar to the previous simulation settings, we uniformly randomly select $\Omega_t\subseteq[1:p_t]$ and $\Xi_t\subseteq\prod_{s\neq t}\Omega_s$ such that $|\Omega_t| = m_t$, $|\Xi_t| = g_t$. Particularly, we choose $m_t = \mathrm{round}(\rho\cdot p_t)$ and $g_t = \mathrm{round}(m_1m_2m_3/p_t)$, where $\rho$ varies from 0.1 to 0.5 and $\mathrm{round}(\cdot)$ rounds its input to the nearest integer. After observing the partial entries of each tensor, we apply Algorithm 2 with $\lambda_t = 3\sqrt{p_t/m_t}$ to obtain $\hat{\mathbf{X}}$. Different from some previous studies (e.g. Zhou et al. (2013)), our algorithm is adaptive and tuning-free, so we do not need to subjectively specify the rank of the target tensors beforehand.

Suppose $\mathrm{rank}(\hat{\mathbf{X}}) = (\hat r_1,\hat r_2,\hat r_3)$ and $\hat U_1\in\mathbb{O}_{p_1,\hat r_1}$, $\hat U_2\in\mathbb{O}_{p_2,\hat r_2}$, $\hat U_3\in\mathbb{O}_{p_3,\hat r_3}$ are the left singular vectors of $\mathcal{M}_1(\hat{\mathbf{X}})$, $\mathcal{M}_2(\hat{\mathbf{X}})$, $\mathcal{M}_3(\hat{\mathbf{X}})$, respectively. We are interested in investigating the performance of $\hat{\mathbf{X}}$, but the absence of the true rank of the underlying tensor $\mathbf{X}$ makes it difficult to directly compare $\hat{\mathbf{X}}$ and $\mathbf{X}$. Instead, we compare $\hat{\mathbf{X}}$ with $\tilde{\mathbf{X}}$, where $\tilde{\mathbf{X}}$ is the rank-$(\hat r_1,\hat r_2,\hat r_3)$ tensor obtained through the higher-order singular value decomposition (HOSVD) (see e.g. Kolda and Bader (2009)) based on all observations in $\mathbf{Y}$:
\[
\tilde{\mathbf{X}} = \mathbf{Y}\times_1 P_{\tilde U_1}\times_2 P_{\tilde U_2}\times_3 P_{\tilde U_3}. \tag{30}
\]

Here $\tilde U_1\in\mathbb{O}_{p_1,\hat r_1}$, $\tilde U_2\in\mathbb{O}_{p_2,\hat r_2}$, $\tilde U_3\in\mathbb{O}_{p_3,\hat r_3}$ are the first $\hat r_1$, $\hat r_2$ and $\hat r_3$ left singular vectors of $\mathcal{M}_1(\mathbf{Y})$, $\mathcal{M}_2(\mathbf{Y})$ and $\mathcal{M}_3(\mathbf{Y})$, respectively. Particularly, we compare $\|\hat{\mathbf{X}}-\mathbf{Y}\|_{\mathrm{HS}}$ and $\|\tilde{\mathbf{X}}-\mathbf{Y}\|_{\mathrm{HS}}$, i.e. the rank-$(\hat r_1,\hat r_2,\hat r_3)$ approximation error based on a limited number of Cross tensor measurements and the approximation error based on all measurements. We also compare $\hat U_1,\hat U_2,\hat U_3$ and $\tilde U_1,\tilde U_2,\tilde U_3$ by calculating $\frac{1}{\sqrt{\hat r_t}}\|\hat U_t^\top\tilde U_t\|_F$. The study is performed on 10 randomly selected images and repeated 100 times for each of them.

Table 1: Comparison between $\hat{\mathbf{X}}$ and $\tilde{\mathbf{X}}$ for the ADHD brain imaging data. The columns report the sampling ratio (6), the error ratio $\|\hat{\mathbf{X}}-\mathbf{Y}\|_{\mathrm{HS}}/\|\tilde{\mathbf{X}}-\mathbf{Y}\|_{\mathrm{HS}}$, and the subspace alignments $\frac{1}{\sqrt{\hat r_t}}\|\hat U_t^\top\tilde U_t\|_F$ for $t=1,2,3$.

  m_t                  Sampling ratio    HS-error ratio    t=1       t=2       t=3
  round(.1 p_t)        0.0035            1.5086            0.8291    0.8212    0.8318
  round(.2 p_t)        0.0267            1.2063            0.9352    0.9110    0.9155
  round(.3 p_t)        0.0832            1.0918            0.9650    0.9571    0.9634
  round(.4 p_t)        0.1766            1.0506            0.9769    0.9745    0.9828
  round(.5 p_t)        0.3066            1.0312            0.9832    0.9840    0.9905

We can immediately see from the results in Table 1 that, on average, $\|\hat{\mathbf{X}}-\mathbf{Y}\|_{\mathrm{HS}}$, i.e. the rank-$(\hat r_1,\hat r_2,\hat r_3)$ approximation error with a limited number of Cross tensor measurements, gets very close to $\|\tilde{\mathbf{X}}-\mathbf{Y}\|_{\mathrm{HS}}$, i.e. the approximation error based on the whole tensor $\mathbf{Y}$. Besides, $\frac{1}{\sqrt{\hat r_t}}\|\hat U_t^\top\tilde U_t\|_F$ is close to 1, which means the singular vectors calculated from a limited number of Cross tensor measurements are not far from those calculated from the whole tensor. Therefore, with the proposed Cross tensor measurement scheme and a small fraction of observable entries, we can approximate the leading principal components of the original tensor nearly as well as if we had observed all voxels. This illustrates the power of the proposed algorithm.

22

6

Proofs

We collect the proofs of the main results (Theorems 1 and 2) in this section. The proofs for other results are postponed to the supplementary materials.

6.1

Proof of Theorem 1.

First, we shall note that Y = X in the exact low-rank and noiseless setting. For each t = 1, 2, 3, according to definitions, XΞt is a collection of columns of Mt (X) and XΩt ×Ξt is a collection of rows of XΞt . Then we have rank(XΩt ×Ξt ) ≤ rank(XΞt ) ≤ rank(Mt (Y)). Given the assumptions, we have rank(XΩt ×Ξt ) = rank(Mt (X)) = rt . Thus, rank(XΩt ×Ξt ) = rank(XΞt ) = rank(Mt (X)) = rt . Then Mt (X) and XΞt share the same column subspace and there exists a matrix Wt ∈ Rgt ×

Q

s6=t

ps

such that Mt (X) = XΞt · Wt . On the other hand, since XΞt and XΩt ×Ξt have the same row subspace, we have XΞt = XΞt · (XΩt ×Ξt )† · XΩt ×Ξt = Rt · XΩt ×Ξt .

(31)

Additionally, XΞt and XΩt ×Ξt can be factorized as XΞt = P1 Q, XΩt ×Ξt = P2 Q, where P1 ∈ Rpt ×rt , P2 ∈ Rmt ×rt , Q ∈ Rrt ×gt and rank(P1 ) = rank(P2 ) = rank(Q) = rt . In this case, for any ˜ ∈ Rmt ×rt , N ˜ ∈ Rgt ×rt with rank(M ˜ > XΩt ×Ξt N ˜ ) = rt , we have matrices M  −1  −1 ˜ t · XΩt ×Ξt =XΞt N ˜ M ˜ > XΩt ×Ξt N ˜ ˜ > · XΩt ×Ξt = P1 QN ˜ M ˜ > P 2 QN ˜ ˜ > P2 Q R M M ˜ (QN ˜ )−1 (M ˜ > P2 )−1 M ˜ > P2 Q = P1 Q = XΞt . P1 QN Right multiplying Wt to (31) and (32), we obtain Mt (X) = XΞt · Wt =Rt · XΩt ×Ξt · Wt = Rt · (Mt (Y))[Ωt ,:] , ˜ t · XΩt ×Ξt · Wt = R ˜ t · (Mt (Y)) =R [Ωt ,:] . By folding Mt (X) back to the tensors and Lemma 3, we obtain X =X[Ω1 ,:,:] ×1 R1 = X[:,Ω2 ,:] ×2 R2 = X[:,:,Ω3 ] ×3 R3 ˜ 2 = X[:,:,Ω ] ×3 R ˜3. ˜ 1 = X[:,Ω ,:] ×2 R =X[Ω1 ,:,:] ×1 R 2 3

23

(32)

By the equation above, ˜ 1 = X[:,Ω ,Ω ] , X[Ω1 ,Ω2 ,Ω3 ] ×1 R1 = X[Ω1 ,Ω2 ,Ω3 ] ×1 R 2 3 ˜ 2 = X[:,:,Ω ] , X[:,Ω2 ,Ω3 ] ×2 R2 = X[:,Ω2 ,Ω3 ] ×2 R 3

˜ 3 = X. X[:,:,Ω3 ] ×3 R3 = X[:,:,Ω3 ] ×3 R

Therefore, ˜ 1 ×2 R ˜ 2 ×3 R ˜ 3 = X, X[Ω1 ,Ω2 ,Ω3 ] ×1 R1 ×2 R2 ×3 R3 = X[Ω1 ,Ω2 ,Ω3 ] ×1 R which concludes the proof of Theorem 1.

6.2



Proof of Theorem 2.

Based on the proof for Theorem 1, we know ˜ 1 ×2 R ˜ 2 ×3 R ˜3, X = X[Ω1 ,Ω2 ,Ω3 ] ×1 R where −1  ˜ > ∈ Rpt ×mt , ˜ ˜ > XΩt ×Ξt N ˜ M ˜ t = XΞt N M R

(33)

˜ ∈ Rmt ×rt , N ˜ ∈ Rgt ×rt satisfying M ˜ > XΩt ×Ξt N ˜ is non-singular. The proof for Theorem for any M 2 is relatively long. For better presentation, we divide the proof into steps. Before going into detailed discussions, we list all notations with the definitions and possible simple explanations in Table 2 in the supplementary materials. 1. Denote ˆt , Nt ∈ Ogt ,rt as the first rt right singular vectors of YΞt and XΞt , respectively; N

(34)

ˆ t , Mt ∈ Omt ,rt as the first rt left singular vectors of Yt,Ω and Xt,Ω , respectively. M ˆt and M ˆ t are It is easy to see that alternate characterization for N ˆt = V (A) , N [:,1:rt ]

ˆ t = U (B) . M [:,1:rt ]

(35)

ˆ t , Mt ; N ˆt , Nt are close by providing the upper Denote τ = 1/5. In this step, we prove that M bounds on singular subspace perturbations, ˆ t , Mt )k ≤ τ 2 /(1 − 2τ ), k sin Θ(M 24

ˆt , Nt )k ≤ τ 2 /(1 − 2τ ); k sin Θ(N

(36)

as well as the inequality which ensures that Jt is bounded away from being singular,  ˆt ) ≥ 1 − τ 4 /(1 − 2τ )2 σmin (XΩt ×Ξt ), ˆ t> XΩt ×Ξt N σmin (M  ˆt ) ≥ 1 − τ 4 /(1 − 2τ )2 − τ σmin (XΩt ×Ξt ). ˆ t> YΩt ×Ξt N σmin (M

(37)

Actually, based on Assumption (20) , we have σrt (YΞt Nt ) ≥ σrt (XΞt Nt ) − kZΞt Nt k ≥ σrt (XΞt ) − τ σrt (XΞt ) = (1 − τ )σrt (AΞt ); σrt +1 (XΞt ) ≤ σr+1 (XΞt ) + kZΞt k ≤ τ σrt (AΞt );



PXΞt Nt XΞt (Nt )⊥ ≤ kXΞt (Nt )⊥ k = kAΞt (Nt )⊥ + ZΞt (Nt )⊥ k ≤ kZΞt k ≤ τ σrt (AΞt ). Here (Nt )⊥ is the orthogonal complement matrix, i.e. [Nt (Nt )⊥ ] ∈ Ogt . By setting A = XΞt , W = Nt in the scenario of the unilateral perturbation bound (Proposition 1 in Cai and Zhang (2016)), we then obtain



σrt +1 (XΞt ) P Z (N )

t Ξ ⊥ (X N ) t Ξt t τ ·τ

ˆt , Nt ) ≤ ≤ = τ 2 /(1 − 2τ ).

sin Θ(N 2 2 2 2 (1 − τ ) − τ σrt (XΞt Nt ) − σrt +1 (XΞt ) Here k sin Θ(·, ·)k is a commonly used distance between orthogonal subspaces. Similarly, based on the assumption that τ σrt (At,Ω ) ≥ kZt,Ω k, we can derive



ˆ

sin Θ(Mt , Mt ) ≤ τ 2 /(1 − 2τ ), which proves (36). Based on the property for sin Θ distance (Lemma 1 in Cai and Zhang (2016)), q p ˆ > Nt ) = 1 − k sin Θ(N ˆt , Nt )k2 ≥ 1 − τ 4 (1 − 2τ )2 ; σmin (N t q p ˆ t , Mt )k2 ≥ 1 − τ 4 /(1 − 2τ )2 . ˆ t> Mt ) = 1 − k sin Θ(M σmin (M Also, since Mt , Nt coincide with the left and right singular subspaces of XΩt ×Ξt respectively, we have ˆ > XΩt ×Ξt N ˆt ) = σmin (M ˆ > PMt XΩt ×Ξt PNt N ˆt ) = σmin (M ˆ > Mt M > XΩt ×Ξt Nt N > N ˆt ) σmin (M t t t t t ˆ > Mt )σmin (M > XΩt ×Ξt Nt )σmin (N > N ˆt ) ≥ (1 − τ 4 /(1 − τ )2 )σmin (XΩt ×Ξt ) ≥σmin (M t t t ˆ > YΩt ×Ξt N ˆt ) ≥ σmin (M ˆ > XΩt ×Ξt N ˆt ) − kZΩt ×Ξt k ≥ (1 − τ 4 /(1 − 2τ )2 − τ )σmin (XΩt ×Ξt ), σmin (M t t which has finished the proof for (37). 25

2. In this step, we prove that under the given setting, rˆt ≥ rt for t = 1, 2, 3. We only need to show that for each t = 1, 2, 3, the stopping criterion holds when s = rt , i.e.



−1

(At )[:,1:rt ] (Jt )[1:rt ,1:rt ] ≤ λt .

(38)

ˆ t, N ˆt can be related as According to the definitions in (16), At , Jt and M (A)

(35)

ˆt = (XΞt + ZΞt )N ˆt , (At )[:,1:rt ] = YΞt V[:,1:rt ] = YΞt N (B) (A) (35) ˆ > ˆ ˆ> ˆ (Jt )[1:rt ,1:rt ] = (U[:,1:rt ] )> YΩt ×Ξt V[:,1:rt ] = M t YΩ×Ξt Nt = Mt (XΩ×Ξt + ZΩ×Ξt )Nt .

ˆt is non-singular ˆ t> YΩt ×Ξt N Therefore, in order to show (38), we only need to prove that M and

 −1

ˆt · M ˆ > (XΩt ×Ξt + ZΩt ×Ξt )N ˆt

(XΞt + ZΞt )N

≤ λt . t

(39)

Recall that Ut ∈ Opt ,rt is the singular subspace for Mt (X), so there exists another matrix Q ∈ Rrt ×gt such that XΞt , a set of columns of Mt (X), can be written as XΞt = Ut · Q,

then

XΩt ×Ξt = Ut,Ω · Q.

(40)

Here Ut,Ω = (Ut )[Ωt ,:] is a collection of rows from Ut , and † , XΞt XΩ† t ×Ξt = Ut Q (Ut,Ω Q)† = Ut Ut,Ω

then



† −1 (Ut,Ω ).

XΞt XΩt ×Ξt = σmin

For convenience, we denote ¯t = λ

(22) 1

1

= XΞt XΩ† t ×Ξt ≤ λt . σmin (Ut,Ω ) 2

(41)

In this case,



ˆ t (M ˆ t> XΩt ×Ξt N ˆt )−1 ˆt (M ˆ t> Ut,Ω QN ˆt )−1

XΞt N

= Ut QN



 −1 −1 

1 ¯ ˆ > Ut,Ω ˆ > Ut,Ω



= M = U M t t t



σmin (Ut,Ω ) = λt .

(42)

Furthermore, σrt (XΩ×Ξt ) = σmin (Ut,Ω ·Q) ≥ σmin (Ut,Ω )·σmin (Q) = σmin (Ut,Ω )·σmin (Ut Q) = σmin (Ut,Ω )·σrt (XΞt ). (43)

26

Therefore,

−1 

ˆ t> (XΩt ×Ξt + ZΩt ×Ξt ) N ˆt ˆt · M

(XΞt + ZΞt )N





 −1   −1

> > ˆ ˆ ˆ ˆ ˆ ˆ

M (X + Z ) N M (X + Z ) N ≤ N · + N · X Z t t Ωt ×Ξt Ωt ×Ξt Ωt ×Ξt Ωt ×Ξt t t

Ξt t

Ξt t

−1  −1  

Lemma 5 > > > ˆ ˆ ˆ t ˆ ˆ ˆ ˆ ˆ M (X + Z ) N M N M Z N M X ≤ N · X t t t t Ω ×Ξ Ω ×Ξ Ω ×Ξ Ω ×Ξ Ξ t t t t t t t t t t

t

−1 

kZΞt k ˆt ˆ t> XΩt ×Ξt N ˆt · M

+ + N X Ξ t

σ (M

ˆ >Y ˆ) N min

(37)(42)

¯t + λ ¯t ≤ λ

(1 −

τ 4 /(1

t

Ωt ×Ξt

t

kZΩt ×Ξt k kZΞt k + 2 4 − 2τ ) − τ ) σmin (XΩt ×Ξt ) (1 − τ /(1 − 2τ )2 − τ ) σmin (XΩt ×Ξt )

¯tτ τ σrt (XΞt ) λ ¯t + + ≤λ 4 2 4 1 − τ /(1 − 2τ ) − τ (1 − τ /(1 − 2τ )2 − τ ) σmin (XΩt ×Ξt )   (41) (43) 2τ ¯ ≤ λt 1 + ≤ λt , 1 − τ 4 /(1 − 2τ )2 − τ

(20)

which has proved our claim that rˆt ≥ rt for t = 1, 2, 3. ˆ under the scenario that rˆt ≥ 3. In this step, we provide an important decomposition of X −1 −1 rt . One major difficulty is measuring the difference between Jt,[1:ˆ rt ,1:ˆ rt ] and Jt,[1:rt ,1:rt ] . For

convenience, we further introduce the following notations, Jrˆt ∈ Rrˆt ׈rt , (X)

Jrˆt

(B)>

:= Ut,[:,1:ˆrt ] XΩt ×Ξt Vt,[:,1:ˆrt ] ,

(B)>

(A)

Jrˆt := Jt,[1:ˆrt ,1:ˆrt ] = Ut,[:,1:ˆrt ] YΩt ×Ξt Vt,[:,1:ˆrt ] , (Z)

(B)>

Jrˆt := Ut,[:,1:ˆrt ] ZΩt ×Ξt Vt,[:,1:ˆrt ] ,

(44) (X)

then Jrˆt = Jrˆt

(Z)

+ Jrˆt . (45)

Arˆt ∈ Rpt ׈rt , (X)

Arˆt

(A)

:= XΞt Vt,[:,1:ˆrt ] ,

(A)

Arˆt := At,[:,1:ˆrt ] = YΞt Vt,[:,1:ˆrt ] ,

(Z)

(A)

Arˆt := ZΞt Vt,[:,1:ˆrt ] ,

Let the singular value decompositions of Jrˆt be  h i Λt1  Jrˆt = Kt Λt L> t = Kt1 Kt2 ·

(46) (X)

then Arˆt = Jrˆt

   L>  ·  t1  , Λt2 L> t2

(Z)

+ Jrˆt .

t = 1, 2, 3.

(47)

(48)

Here Kt , Lt ∈ Orˆt , Kt1 , Lt1 ∈ Orˆt ,rt , Kt2 , Lt2 ∈ Orˆt ,rt are the singular vectors, Λt ∈ Rrˆt ׈rt , Λt1 ∈ Rrt ×rt and Λt2 ∈ R(ˆrt −rt )×(ˆrt −rt ) are the singular values. Based on these SVDs, we can

27

¯ t (defined in (18)) as correspondingly decompose R  −1 (B) > > ¯ t =At,[:,1:ˆr ] J −1 U (B) R = A K Λ L + K Λ L Ut,[:,1:ˆrt ] t1 t1 t2 t2 t,[:,1:ˆ r ] t1 t2 t t rˆt t,[:,1:ˆ rt ]   (B) −1 > > =At,[:,1:ˆrt ] Lt1 Λ−1 K + L Λ K t2 t2 t1 t2 Ut,[:,ˆ t1 rt ] (B)

(B)

−1 > > =At,[:,1:ˆrt ] Lt1 Λ−1 rt ] Lt2 Λt2 Kt2 Ut,[:,1:ˆ t1 Kt1 Ut,[:,1:ˆ rt ] + At,[:,1:ˆ rt ]  −1 (B) (A) > > Kt1 Ut,[:,1:ˆrt ] =YΞt Vt,[:,1:ˆrt ] Lt1 Kt1 Jrˆt Lt1  −1 (A) > (B) > Kt2 Ut,[:,1:ˆrt ] . + YΞt Vt,[:,1:ˆrt ] Lt2 Kt2 Jrˆt Lt2

(49)

¯ t while the Here, the first term above associated with Kt1 , Lt1 ... stands for the major part in R second associated with Kt2 , Kt2 stands for the minor part. Based on this decomposition, we introduce the following notations: for any t = 1, 2, 3 (indicating Mode-1, 2 or 3) and s = 1, 2 (indicating the major or minor parts), (X)

(X)

(A)

(Z)

Ats := Arˆt Lts = XΞt Vt,[:,1:ˆrt ] Lts ,

(Z)

(A)

Ats := Arˆs Lts = ZΞt Vt,[:,1:ˆrt ] Lts

(A)

(X)

(Z)

Ats := Arˆt Lts = YΞt Vt,[:,1:ˆrt ] Lts = Ats + Ats . (X)

Jts

(X)

(B)>

(51)

(A)

> > > := Kts Jrˆt Lts = Kts Jrˆt Lts = Kts Ut,[:,1:ˆrt ] XΩt ×Ξt Vt,[:,1:ˆrt ] Lts ,

† > > > Jts := Kts Jrˆt Lts = Kts Arˆt Lts = Kts Jrˆt Lts = Kts Ut,[:,1:ˆrt ] ZΩt ×Ξt Vt,[:,1:ˆrt ] Lts , (Z)

(A)

(X)

Jts := Jts

(Z)

(B)>

(B)>

(50)

(A)

(A)

> + Jts = Kts Ut,[:,1:ˆrt ] YΩt ×Ξt Vt,[:,1:ˆrt ] Lts .

(52) (53)

Also, for s1 , s2 , s3 ∈ {1, 2}3 , we define the projected body measurements       (B) (B) (B) > > > B(X) × K U × K U (54) 2 3 s1 s2 s3 = X[Ω1 ,Ω2 ,Ω3 ] ×1 K1s1 U1,[1:ˆ 2s 3s 2 2,[1:ˆ 3 3,[1:ˆ r1 ] r2 ] r3 ]       (B) (B) (B) > > > B(Z) × K × K (55) U U 2 3 s1 s2 s3 = Z[Ω1 ,Ω2 ,Ω3 ] ×1 K1s1 U1,[1:ˆ 2s 3s 2 2,[1:ˆ 3 3,[1:ˆ r1 ] r2 ] r3 ]       (B) (B) (B) > > > (Z) Bs1 s2 s3 = Y[Ω1 ,Ω2 ,Ω3 ] ×1 K1s U r1 ] ×2 K2s U r2 ] ×2 K3s U r3 ] = B(X) s1 s2 s3 +Bs1 s2 s3 . 1 1,[1:ˆ 2 2,[1:ˆ 3 3,[1:ˆ (56) ˆ Combining (49)-(56), we can write down the following decomposition for X. ˆ = Y[Ω ,Ω ,Ω ] ×1 R ¯ 1 ×2 R ¯ 2 ×3 R ¯3 X 1 2 3   2 X    −1 −1 −1 −1 −1 −1 = Bs1 s2 s3  ×1 A11 J11 + A12 J12 ×2 A21 J21 + A32 J32 + A22 J22 ×3 A31 J31 s1 ,s2 ,s3 =1

=

2 X

Bs1 s2 s3 ×1 A1s1 (J1s1 )−1 ×2 A2s2 (J2s2 )−1 ×3 A3s3 (J3s3 )−1 .

s1 ,s2 ,s3 =1

(57) 28

This form is very helpful in our analysis later. 4. In this step, we derive a few formulas for the terms in (57). To be specific, we shall prove the following results. • Lower bound for singular value of the joint major part: for t = 1, 2, 3,   τ2 (X) σmin Jt1 ≥ (1−τ12 )(1−τ 4 /(1−2τ )2 )σmin (XΩt ×Ξt ), τ1 = . (1 − τ 4 /(1 − 2τ )2 − τ )2 − τ 2 (58) ˆ > XΩt ×Ξt N ˆt = (J (X) )[1:r ,1:r ] , i.e., M ˆ > XΩt ×Ξt N ˆt is a In fact, according to definitions, M t t t t rˆt (X)

sub-matrix of Jrˆt , we have   (X) ˆ > XΩt ×Ξt N ˆ (37) = (1 − τ 4 /(1 − 2τ )2 )σmin (XΩt ×Ξt ). σrt (Jrˆt ) ≥ σrt M Let ¯ t ∈ Orˆ ,r be the left and right singular vectors for rank-rt matrix J (X) . ¯ t ∈ Orˆ ,r , L K t t t t rˆt (59) We summarize some facts here: (X)

– Jrˆt = Jrˆt

(Z)

+ Jrˆt ;

¯ t, L ¯ t are the left and right singular vectors of rank-rt matrix J (X) ; – K rˆt – Kt1 , Lt1 are the first rt left and right singular vectors of Jrˆt . Weyl (1912)

– σrt +1 (Jrˆt ) ¯ t) – σrt (Jrˆt L

≤ Weyl (1912)



(X)

(Z)

σrt +1 (Jrˆt ) + kJrˆt k ≤ τ σmin (XΩt ×Ξt ). (X) ¯ (Z) (X) (Z) σrt (Jrˆt L t ) − kJrˆt k = σrt (Jrˆt ) − kJrˆt k

≥ (1 − τ 4 /(1 − 2τ )2 − τ )σmin (XΩt ×Ξt ). Then by the unilateral perturbation bound result (Proposition 1 and Lemma 1 in Cai and Zhang (2016)) and the facts above, (Z)

σr +1 (Jrˆt ) · kP(Jrˆt L¯ t ) Jrˆt k (60) τ2 ¯ t , Lt1 ) ≤ t

sin Θ(L ≤ := τ1 . ¯ t ) − σ 2 (Jrˆ ) (1 − τ 4 /(1 − 2τ )2 − τ )2 − τ 2 σr2t (Jrˆt L t rt +1 q ¯ σmin (L> L ) ≥ 1 − τ12 . t1 t p >K ¯ t ) ≥ 1 − τ 2 . Therefore, Similarly, σmin (Kt1 1     (59)     (X) (X) > (X) > > ¯ ¯ > (X) ¯ ¯ > σmin Jt1 =σmin Kt1 Jrˆt Lt1 = σmin Kt1 PK¯ t Jrˆt PL¯ t Lt1 = σmin Kt1 Kt Kt Jrˆt Lt Lt Lt1   (X) > ¯ 2 ¯ t> J (X) L ¯ t σmin (L> ¯ ≥σmin (Kt1 Kt )σmin K t1 Lt ) ≥ (1 − τ1 )σrt (Jrˆt ) rˆt ≥(1 − τ12 )(1 − τ 4 /(1 − 2τ )2 )σmin (XΩt ×Ξt ). 29

which has finished the proof for (58). • Upper bound for all terms related to perturbation “Z:” for example, (Z)

kJt1 k ≤ kZΩt ×Ξt k ≤ τ σmin (XΩt ×Ξt ),

(60)

(Z)



(Z)

Bs1 s2 s3

HS

kAt1 k ≤ kZΞt k ≤ τ σmin (XΞt ).



≤ kZ[Ω1 ,Ω2 ,Ω3 ] kop . ≤ kZ[Ω1 ,Ω2 ,Ω3 ] kHS , B(Z) s1 s2 s3

(61)

op

Since all these terms related to perturbation “(Z)” are essentially projections of ZΞt , ZΩt ×Ξt , Zt,Ω or Z[Ω1 ,Ω2 ,Ω3 ] , they can be derived easily. • Upper bounds in spectral norm for “arm · joint−1 ” and “joint−1 · body”:

(X) (X) −1 ¯ A (J )

t1

≤ λt ; t1 if (s1 , s2 , s3 ) ∈ {1, 2}3 , t ∈ {1, 2, 3},

−1 ¯t +

At1 (Jt1 ) ≤ λ

(62)



(X) −1

)

(Jtst ) Mt (B(X) s1 s2 s3 ≤ ξt ;

¯tτ 2λ ; (1 − τ12 )(1 − τ 4 /(1 − 2τ )2 ) − τ



−1 ¯t +

At2 (Jt2 ) ≤ λt + λ

(64)

¯tτ 2λ . (1 − τ12 )(1 − τ 4 /(1 − 2τ )2 ) − τ (X)

(63)

(X)

Recall (40) that XΞt = Ut Q, XΩt ×Ξt = Ut,Ω Q, the definition for At1 , Jt1

(65) and the fact

(X)

that σmin (Jt1 ) > 0, we have

  −1

(X) (X) −1 (A) (A)> > (B)>

At1 (Jt1 ) = XΞt Vt,[1:ˆrt ] Lt1 Kt1 Ut,[:,1:ˆrt ] XΩt ×Ξt Vt,[:,1:ˆrt ] Lt1

 −1

(A) (A)> > (B)>

= U QV L K U U QV L t t1 t1 t,Ω t1

t,[1:ˆ rt ] t,[:,1:ˆ rt ] t,[:,1:ˆ rt ]

   −1 −1

(A) (A)> (B)> >

= Kt1 Ut,[:,1:ˆrt ] Ut,Ω

Ut QVt,[1:ˆrt ] Lt1 QVt,[:,1:ˆrt ] Lt1

1

= σmin



(B)> Kt> Ut,[:,1:ˆrt ] Ut,Ω

≤

1 ¯t, =λ σmin (Ut,Ω )

which has proved (62). The proof for (63) is similar. Since XΩt ×Ξt is a collection of columns of Xt,Ω and rank(XΩt ×Ξt ) = rank(Xt,Ω ) = rt , these two matrices share the same column subspace. In this case,   XΩt ×Ξt XΩ† t ×Ξt Xt,Ω = PXΩt ×Ξt Xt,Ω = Xt,Ω .

30

Given the assumption (22) that kXΩ† t ×Ξt Xt,Ω k ≤ ξt , the rest of the proof for (63) essentially follows from the proof for (62). The proof for (64) is relatively more complicated. Note that (Z)

kJt1 k ≤ kZΩt ×Ξt k ≤ τ σmin (XΩt ×Ξt ), (Z)

kAt1 k ≤ kZΞt k ≤ τ σrt (XΞt ) ≤ τ σrt (Q) = τ σmin (Q) ≤

τ σmin (Ut,Ω Q) ¯ t σmin (XΩ ×Ξ ), ≤ τλ t t σmin (Ut,Ω ) (66)

we can calculate that



 

 (X)

(Z) (X) (Z) −1 −1

Jt1 + Jt1

At1 (Jt1 ) = At1 + At1



    

(Z)  (X)

(X) (X) −1 (X) (X) (Z) −1 (X) −1 (Z) −1



Jt1 + Jt1 − Jt1 ≤ At1 (Jt1 ) + At1

+ At1 Jt1 + Jt1





     

(X) (X) −1 (Z) (X) Lemma 5 (X) (X) −1 (Z) −1 (Z) −1 (X) (Z)

+ = At1 (Jt1 ) + A J J J + J A σ J + J

t1 t1 t1 t1 t1 t1 t1 min

t1

¯t + λ ¯t =λ

(66)(58)

(Z)

(Z)

kJt1 k

(62)

¯t + ≤ λ

(X)

(Z)

kAt1 k

+

σmin (Jt1 ) − kJt1 k ¯tτ 2λ

(X)

(Z)

σmin (Jt1 ) − kJt1 k

(1 − τ12 )(1 − τ 4 /(1 − 2τ )2 ) − τ

.

This has finished the proof for (64). Next, we move on to (65). Recall the definitions of Arˆt and Jrˆt , and the fact that Kt1 , Kt2 ; > K = L> L = I, K > K = L> L = I, K > K = Lt1 , Lt2 are orthogonal, we have Kt1 t1 t1 t1 t2 t2 t2 t2 t1 t2

L> t1 Lt2 = 0. Then,   −1 > > > > > > = A L L + A L L K K J L L + K K J L L Arˆt Jrˆ−1 t1 t1 rˆt t1 t1 t1 t1 rˆt t2 t2 rˆt t1 t1 rˆt t2 t2 t   −1    −1 > −1 > > = At1 L> Kt1 Jt1 L> = At1 L> Lt1 Jt1 Kt1 + Lt2 Jt2 Kt2 t1 + At2 Lt2 t1 + Kt2 Jt2 Lt2 t1 + At2 Lt2 −1 > −1 > =At1 Jt1 Kt1 + A21 Jt2 Kt2 .



≤ λt . Thus, On the other hand, Arˆt Jrˆ−1 t





−1 −1 >

+ A J K

At2 (Jt2 ) ≤ Arˆt Jrˆ−1

t1 t1 t1 t



¯t +

+ At1 J −1 ≤ λt + λ ≤ Arˆt Jrˆ−1 t1 t which has finished the proof for (65).

31

¯tτ 2λ , (1 − τ12 )(1 − τ 4 /(1 − 2τ )2 ) − τ

• Upper bounds in Frobenius and operator norm for “arm·(joint)−1 ·body:” for the major part:

 −1

(X) ¯

At1 (Jt1 )−1 Mt (B (X) ) − A(X) J (X) Mt (B111 ) 111 t1 t1

≤ C λt ξt kZΩt ×Ξt kF + Cξt kZΞt kF . F

(67)

 −1

(X) (X) (X) (X) −1 ¯

At1 (Jt1 ) Mt (B ) − A Jt1 Mt (B111 ) 111 t1

≤ C λt ξt kZΩt ×Ξt k + Cξt kZΞt k. (68) Actually, we can show that

 −1

(X)

At1 (Jt1 )−1 Mt (B (X) ) − A(X) J (X) Mt (B111 ) 111 t1 t1

F

    −1

(X) (Z) (X) (Z) (X) (X) (X) −1 (X)

= (At1 + At1 ) Jt1 + Jt1 Mt (B111 ) − At1 Jt1 Mt (B111 )

F



  −1  −1  −1



(X) (X) (Z) (X) (X) (Z) (X) (Z) (X)

≤ Jt1 + Jt1 − Jt1 Mt (B111 ) Mt (B111 )

+ At1 Jt1 + Jt1

At1 F F

   −1  −1

Lemma 5 (X)  (X) −1 (Z) (Z) (X) (Z) (Z) (X) (X) ≤ Jt1 − Jt1 Jt1 + Jt1 Jt1 Jt1 Mt (B111 )

At1 Jt1 F

     −1 (Z) −1

(Z) (X) (Z) (X) (X) + Jt1 + I Jt1 Mt (B111 )

At1 − Jt1 + Jt1 F



 −1   (62)(63)

(Z)



(Z) (Z) (X) (Z) (Z) (X) (Z) −1 (Z) ¯

≤ C λt ξt Jt1 − Jt1 Jt1 + Jt1 Jt1 + Cξt At1 · − Jt1 + Jt1 Jt1 + I

F

F

(60)(58)

(Z)

(Z)

¯ t ξt kJ kF + Cξt kA kF ≤ Cλ t1 t1 ¯ t ξt kZΩ ×Ξ kF + Cξt kZΞ kF . ≤C λ t t t

which has proved (67). The proof for (68) essentially follows from the proof for (67) when we replace the Frobenius norms with the spectral norm.

• Upper bounds for the minor "body" part: if $(s_1,s_2,s_3) \in \{1,2\}^3$ and $s_t = 2$, then
\[
\|B_{s_1s_2s_3}\|_{\rm HS} \le 2\xi_t\|Z_{\Omega_t\times\Xi_t}\|_F + \bigl\|Z_{[\Omega_1,\Omega_2,\Omega_3]}\bigr\|_{\rm HS},\qquad
\|B_{s_1s_2s_3}\|_{\rm op} \le 2\xi_t\|Z_{\Omega_t\times\Xi_t}\| + \bigl\|Z_{[\Omega_1,\Omega_2,\Omega_3]}\bigr\|_{\rm op}.
\tag{69}
\]
Instead of considering the "body" part above directly, we detour and discuss the "joint" part first. It is noteworthy that $J_{t2}$ corresponds to the $(r_t+1),\ldots,\hat r_t$-th principal components of $J_{\hat r_t} = J^{(X)}_{\hat r_t} + J^{(Z)}_{\hat r_t}$. As $\mathrm{rank}\bigl(J^{(X)}_{\hat r_t}\bigr) = r_t$, by Lemma 1 in Cai et al. (2016) (which provides inequalities of singular values of a low-rank perturbed matrix),
\begin{align*}
&\sigma_i(J_{t2}) = \sigma_{r_t+i}\bigl(J_{\hat r_t}\bigr) \le \sigma_i\bigl(J^{(Z)}_{\hat r_t}\bigr), \qquad \forall\, 1 \le i \le \hat r_t - r_t \\
\Rightarrow\ &\|J_{t2}\|_F = \sqrt{\sum_{i=1}^{\hat r_t - r_t}\sigma_i^2(J_{t2})} \le \sqrt{\sum_{i=1}^{\hat r_t - r_t}\sigma_i^2\bigl(J^{(Z)}_{\hat r_t}\bigr)} = \bigl\|J^{(Z)}_{\hat r_t}\bigr\|_F \le \|Z_{\Omega_t\times\Xi_t}\|_F \\
\Rightarrow\ &\bigl\|J^{(X)}_{t2}\bigr\|_F \le \|J_{t2}\|_F + \bigl\|J^{(Z)}_{t2}\bigr\|_F \le 2\|Z_{\Omega_t\times\Xi_t}\|_F,
\end{align*}
so that
\[
\bigl\|\mathcal{M}_t\bigl(B^{(X)}_{s_1s_2s_3}\bigr)\bigr\|_F
\le \bigl\|J^{(X)}_{t2}\bigr\|_F \cdot \Bigl\|\bigl(J^{(X)}_{t2}\bigr)^{-1}\mathcal{M}_t\bigl(B^{(X)}_{s_1s_2s_3}\bigr)\Bigr\|
\ \overset{(63)}{\le}\ 2\xi_t\|Z_{\Omega_t\times\Xi_t}\|_F,
\]
\[
\|B_{s_1s_2s_3}\|_{\rm HS} \le \bigl\|B^{(X)}_{s_1s_2s_3}\bigr\|_{\rm HS} + \bigl\|B^{(Z)}_{s_1s_2s_3}\bigr\|_{\rm HS}
\le \bigl\|\mathcal{M}_t\bigl(B^{(X)}_{s_1s_2s_3}\bigr)\bigr\|_F + \bigl\|Z_{[\Omega_1,\Omega_2,\Omega_3]}\bigr\|_{\rm HS}
\le 2\xi_t\|Z_{\Omega_t\times\Xi_t}\|_F + \bigl\|Z_{[\Omega_1,\Omega_2,\Omega_3]}\bigr\|_{\rm HS}.
\]
We can similarly derive that $\|B_{s_1s_2s_3}\|_{\rm op} \le 2\xi_t\|Z_{\Omega_t\times\Xi_t}\| + \|Z_{[\Omega_1,\Omega_2,\Omega_3]}\|_{\rm op}$, which has proved (69).
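The interlacing step above is an instance of Weyl's inequality for singular values: if $\mathrm{rank}(M) = r$, then $\sigma_{r+i}(M+E) \le \sigma_{r+1}(M) + \sigma_i(E) = \sigma_i(E)$. A minimal numerical check of this generic fact (sizes and rank below are arbitrary choices for illustration, not quantities from the proof):

    import numpy as np

    rng = np.random.default_rng(1)

    m, n, r = 8, 6, 2                                                # arbitrary sizes; rank-r signal
    M = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))    # rank(M) = r
    E = 0.5 * rng.standard_normal((m, n))                            # perturbation
    s_sum = np.linalg.svd(M + E, compute_uv=False)
    s_E = np.linalg.svd(E, compute_uv=False)
    # Weyl: sigma_{r+i}(M + E) <= sigma_{r+1}(M) + sigma_i(E) = sigma_i(E), since rank(M) = r.
    print(np.all(s_sum[r:] <= s_E[:s_sum.size - r] + 1e-10))         # expected: True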

• Equality for the original tensor $X$:
\[
X = B^{(X)}_{111} \times_1 A^{(X)}_{11}\bigl(J^{(X)}_{11}\bigr)^{-1} \times_2 A^{(X)}_{21}\bigl(J^{(X)}_{21}\bigr)^{-1} \times_3 A^{(X)}_{31}\bigl(J^{(X)}_{31}\bigr)^{-1}.
\tag{70}
\]
In fact, according to the definitions (50)--(56),
\begin{align*}
&B^{(X)}_{111} \times_1 A^{(X)}_{11}\bigl(J^{(X)}_{11}\bigr)^{-1} \times_2 A^{(X)}_{21}\bigl(J^{(X)}_{21}\bigr)^{-1} \times_3 A^{(X)}_{31}\bigl(J^{(X)}_{31}\bigr)^{-1} \\
&= X_{[\Omega_1,\Omega_2,\Omega_3]}
\times_1 X_{\Xi_1}V^{(A)}_{1,[:,1:\hat r_1]}L_{11}\Bigl(K_{11}^\top U^{(B)\top}_{1,[:,1:\hat r_1]}X_{\Omega_1\times\Xi_1}V^{(A)}_{1,[:,1:\hat r_1]}L_{11}\Bigr)^{-1}K_{11}^\top U^{(B)\top}_{1,[:,1:\hat r_1]} \\
&\qquad\ \times_2 X_{\Xi_2}V^{(A)}_{2,[:,1:\hat r_2]}L_{21}\Bigl(K_{21}^\top U^{(B)\top}_{2,[:,1:\hat r_2]}X_{\Omega_2\times\Xi_2}V^{(A)}_{2,[:,1:\hat r_2]}L_{21}\Bigr)^{-1}K_{21}^\top U^{(B)\top}_{2,[:,1:\hat r_2]} \\
&\qquad\ \times_3 X_{\Xi_3}V^{(A)}_{3,[:,1:\hat r_3]}L_{31}\Bigl(K_{31}^\top U^{(B)\top}_{3,[:,1:\hat r_3]}X_{\Omega_3\times\Xi_3}V^{(A)}_{3,[:,1:\hat r_3]}L_{31}\Bigr)^{-1}K_{31}^\top U^{(B)\top}_{3,[:,1:\hat r_3]}.
\end{align*}
Based on (58), $J^{(X)}_{t1} = K_{t1}^\top U^{(B)\top}_{t,[:,1:\hat r_t]}X_{\Omega_t\times\Xi_t}V^{(A)}_{t,[:,1:\hat r_t]}L_{t1}$ is non-singular; then by Theorem 1, we can show the term above equals $X$, which has proved (70).
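The noiseless recovery identity (70) can be checked numerically in a simplified setting: the sketch below takes the arms over the full index sets $\Omega_{t+1}\times\Omega_{t+2}$ and uses pseudo-inverses directly, omitting the subsampled arm columns $\Xi_t$, the rotations $K_{t1}, L_{t1}, U^{(B)}, V^{(A)}$, and the rank selection that the actual Cross procedure involves. All sizes, ranks, and index sets below are arbitrary illustrative choices.

    import numpy as np

    rng = np.random.default_rng(2)

    def mode_prod(T, M, mode):
        """Mode product T x_mode M, computed through the mode-`mode` matricization."""
        T = np.moveaxis(T, mode, 0)
        shp = T.shape
        out = M @ T.reshape(shp[0], -1)
        return np.moveaxis(out.reshape((M.shape[0],) + shp[1:]), 0, mode)

    # A tensor of Tucker rank (r1, r2, r3) in p1-by-p2-by-p3 dimensional space.
    p, r = (10, 11, 12), (2, 3, 2)
    S = rng.standard_normal(r)                                         # core tensor
    U = [rng.standard_normal((p[t], r[t])) for t in range(3)]          # loading matrices
    X = S
    for t in range(3):
        X = mode_prod(X, U[t], t)

    # "Body", "arm", and "joint" measurements (arms over the full Omega index sets;
    # a simplification of the Cross measurement scheme).
    Om = [rng.choice(p[t], size=r[t] + 2, replace=False) for t in range(3)]
    body = X[np.ix_(Om[0], Om[1], Om[2])]
    def arm(t):
        idx = [Om[s] if s != t else np.arange(p[t]) for s in range(3)]
        return np.moveaxis(X[np.ix_(*idx)], t, 0).reshape(p[t], -1)
    A = [arm(t) for t in range(3)]                 # mode-t arm matricizations
    J = [A[t][Om[t], :] for t in range(3)]         # joints: arm rows on Omega_t

    # Recovery in the spirit of (70): X = body x_1 A1 J1^+ x_2 A2 J2^+ x_3 A3 J3^+.
    X_hat = body
    for t in range(3):
        X_hat = mode_prod(X_hat, A[t] @ np.linalg.pinv(J[t]), t)
    print(np.allclose(X_hat, X))                   # expected: True (noiseless case)

In this noiseless setting the formula reproduces $X$ exactly as long as each joint has full rank $r_t$; with noisy measurements the analogous formula produces an estimate whose error is what this step of the proof controls.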

5. Now we are ready to analyze the estimation error of $\hat X$ based on all the preparations in the previous steps. Based on the decompositions of $\hat X$ in (57) and of $X$ in (70), one has

\begin{align*}
\bigl\|\hat X - X\bigr\|_{\rm HS}
&\le \Bigl\| B_{111}\times_1 A_{11}J_{11}^{-1}\times_2 A_{21}J_{21}^{-1}\times_3 A_{31}J_{31}^{-1}
- B^{(X)}_{111}\times_1 A^{(X)}_{11}\bigl(J^{(X)}_{11}\bigr)^{-1}\times_2 A^{(X)}_{21}\bigl(J^{(X)}_{21}\bigr)^{-1}\times_3 A^{(X)}_{31}\bigl(J^{(X)}_{31}\bigr)^{-1} \Bigr\|_{\rm HS} \\
&\quad + \sum_{\substack{s_1,s_2,s_3\in\{1,2\}\\ (s_1,s_2,s_3)\ne(1,1,1)}}
\Bigl\| B_{s_1s_2s_3}\times_1 A_{1s_1}J_{1s_1}^{-1}\times_2 A_{2s_2}J_{2s_2}^{-1}\times_3 A_{3s_3}J_{3s_3}^{-1} \Bigr\|_{\rm HS} \\
&\le \Bigl\| \bigl(B_{111}-B^{(X)}_{111}\bigr)\times_1 A_{11}J_{11}^{-1}\times_2 A_{21}J_{21}^{-1}\times_3 A_{31}J_{31}^{-1} \Bigr\|_{\rm HS} \\
&\quad + \Bigl\| B^{(X)}_{111}\times_1 A_{11}J_{11}^{-1}\times_2 A_{21}J_{21}^{-1}\times_3\Bigl(A_{31}J_{31}^{-1}-A^{(X)}_{31}\bigl(J^{(X)}_{31}\bigr)^{-1}\Bigr) \Bigr\|_{\rm HS} \\
&\quad + \Bigl\| B^{(X)}_{111}\times_1 A_{11}J_{11}^{-1}\times_2\Bigl(A_{21}J_{21}^{-1}-A^{(X)}_{21}\bigl(J^{(X)}_{21}\bigr)^{-1}\Bigr)\times_3 A^{(X)}_{31}\bigl(J^{(X)}_{31}\bigr)^{-1} \Bigr\|_{\rm HS} \\
&\quad + \Bigl\| B^{(X)}_{111}\times_1\Bigl(A_{11}J_{11}^{-1}-A^{(X)}_{11}\bigl(J^{(X)}_{11}\bigr)^{-1}\Bigr)\times_2 A^{(X)}_{21}\bigl(J^{(X)}_{21}\bigr)^{-1}\times_3 A^{(X)}_{31}\bigl(J^{(X)}_{31}\bigr)^{-1} \Bigr\|_{\rm HS} \tag{71} \\
&\quad + \sum_{\substack{s_1,s_2,s_3\in\{1,2\}\\ (s_1,s_2,s_3)\ne(1,1,1)}}
\Bigl\| B_{s_1s_2s_3}\times_1 A_{1s_1}J_{1s_1}^{-1}\times_2 A_{2s_2}J_{2s_2}^{-1}\times_3 A_{3s_3}J_{3s_3}^{-1} \Bigr\|_{\rm HS}.
\end{align*}
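Each of the terms above is controlled through the elementary bound $\|T\times_1 M_1\times_2 M_2\times_3 M_3\|_{\rm HS} \le \|M_1\|\,\|M_2\|\,\|M_3\|\,\|T\|_{\rm HS}$ (and its variant in which one factor is measured through a matricization of $T$). A quick numerical check of this inequality, with arbitrary sizes chosen purely for illustration:

    import numpy as np

    rng = np.random.default_rng(3)

    def mode_prod(T, M, mode):
        """Mode product T x_mode M, computed through the mode-`mode` matricization."""
        T = np.moveaxis(T, mode, 0)
        shp = T.shape
        out = M @ T.reshape(shp[0], -1)
        return np.moveaxis(out.reshape((M.shape[0],) + shp[1:]), 0, mode)

    T = rng.standard_normal((4, 5, 6))
    Ms = [rng.standard_normal((7, 4)),
          rng.standard_normal((3, 5)),
          rng.standard_normal((6, 6))]
    Y = T
    for t, M in enumerate(Ms):
        Y = mode_prod(Y, M, t)
    lhs = np.linalg.norm(Y)                                           # HS norm of the product
    rhs = np.prod([np.linalg.norm(M, 2) for M in Ms]) * np.linalg.norm(T)
    print(lhs <= rhs + 1e-10)                                         # expected: True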

For each term separately above, we have
\begin{align*}
&\Bigl\|\bigl(B_{111}-B^{(X)}_{111}\bigr)\times_1 A_{11}J_{11}^{-1}\times_2 A_{21}J_{21}^{-1}\times_3 A_{31}J_{31}^{-1}\Bigr\|_{\rm HS} \\
&\le \bigl\|A_{11}J_{11}^{-1}\bigr\|\cdot\bigl\|A_{21}J_{21}^{-1}\bigr\|\cdot\bigl\|A_{31}J_{31}^{-1}\bigr\|\cdot\bigl\|B^{(Z)}_{111}\bigr\|_{\rm HS}
\ \overset{(61)(62)}{\le}\ C\lambda_1\lambda_2\lambda_3\bigl\|Z_{[\Omega_1,\Omega_2,\Omega_3]}\bigr\|_{\rm HS}, \\[4pt]
&\Bigl\|B^{(X)}_{111}\times_1 A_{11}J_{11}^{-1}\times_2 A_{21}J_{21}^{-1}\times_3\Bigl(A_{31}J_{31}^{-1}-A^{(X)}_{31}\bigl(J^{(X)}_{31}\bigr)^{-1}\Bigr)\Bigr\|_{\rm HS} \\
&\le \bigl\|A_{11}J_{11}^{-1}\bigr\|\cdot\bigl\|A_{21}J_{21}^{-1}\bigr\|\cdot\Bigl\|\Bigl(A_{31}J_{31}^{-1}-A^{(X)}_{31}\bigl(J^{(X)}_{31}\bigr)^{-1}\Bigr)\mathcal{M}_3\bigl(B^{(X)}_{111}\bigr)\Bigr\|_F \\
&\ \overset{(67)(62)}{\le}\ C\lambda_1\lambda_2\lambda_3\xi_3\|Z_{\Omega_3\times\Xi_3}\|_F + C\lambda_1\lambda_2\xi_3\|Z_{\Xi_3}\|_F, \\[4pt]
&\Bigl\|B^{(X)}_{111}\times_1 A_{11}J_{11}^{-1}\times_2\Bigl(A_{21}J_{21}^{-1}-A^{(X)}_{21}\bigl(J^{(X)}_{21}\bigr)^{-1}\Bigr)\times_3 A^{(X)}_{31}\bigl(J^{(X)}_{31}\bigr)^{-1}\Bigr\|_{\rm HS} \\
&\le \bigl\|A_{11}J_{11}^{-1}\bigr\|\cdot\Bigl\|A^{(X)}_{31}\bigl(J^{(X)}_{31}\bigr)^{-1}\Bigr\|\cdot\Bigl\|\Bigl(A_{21}J_{21}^{-1}-A^{(X)}_{21}\bigl(J^{(X)}_{21}\bigr)^{-1}\Bigr)\mathcal{M}_2\bigl(B^{(X)}_{111}\bigr)\Bigr\|_F \\
&\ \overset{(64)(67)(62)}{\le}\ C\lambda_1\lambda_2\lambda_3\xi_2\|Z_{\Omega_2\times\Xi_2}\|_F + C\lambda_1\lambda_3\xi_2\|Z_{\Xi_2}\|_F, \\[4pt]
&\Bigl\|B^{(X)}_{111}\times_1\Bigl(A_{11}J_{11}^{-1}-A^{(X)}_{11}\bigl(J^{(X)}_{11}\bigr)^{-1}\Bigr)\times_2 A^{(X)}_{21}\bigl(J^{(X)}_{21}\bigr)^{-1}\times_3 A^{(X)}_{31}\bigl(J^{(X)}_{31}\bigr)^{-1}\Bigr\|_{\rm HS} \\
&\le \Bigl\|A^{(X)}_{21}\bigl(J^{(X)}_{21}\bigr)^{-1}\Bigr\|\cdot\Bigl\|A^{(X)}_{31}\bigl(J^{(X)}_{31}\bigr)^{-1}\Bigr\|\cdot\Bigl\|\Bigl(A_{11}J_{11}^{-1}-A^{(X)}_{11}\bigl(J^{(X)}_{11}\bigr)^{-1}\Bigr)\mathcal{M}_1\bigl(B^{(X)}_{111}\bigr)\Bigr\|_F \\
&\ \overset{(64)(67)(62)}{\le}\ C\lambda_1\lambda_2\lambda_3\xi_1\|Z_{\Omega_1\times\Xi_1}\|_F + C\lambda_2\lambda_3\xi_1\|Z_{\Xi_1}\|_F.
\end{align*}

Last but not least, for any $(s_1,s_2,s_3)\in\{1,2\}^3$ such that $(s_1,s_2,s_3)\ne(1,1,1)$, let us specify that $s_t = 2$ for some $1\le t\le 3$. Then
\begin{align*}
&\Bigl\|B_{s_1s_2s_3}\times_1 A_{1s_1}J_{1s_1}^{-1}\times_2 A_{2s_2}J_{2s_2}^{-1}\times_3 A_{3s_3}J_{3s_3}^{-1}\Bigr\|_{\rm HS}
\le \bigl\|A_{1s_1}J_{1s_1}^{-1}\bigr\|\cdot\bigl\|A_{2s_2}J_{2s_2}^{-1}\bigr\|\cdot\bigl\|A_{3s_3}J_{3s_3}^{-1}\bigr\|\cdot\bigl\|B_{s_1s_2s_3}\bigr\|_{\rm HS} \\
&\ \overset{(62)(65)(69)}{\le}\ C\lambda_1\lambda_2\lambda_3\Bigl(\xi_t\|Z_{\Omega_t\times\Xi_t}\|_F + \bigl\|Z_{[\Omega_1,\Omega_2,\Omega_3]}\bigr\|_{\rm HS}\Bigr).
\end{align*}

Combining all terms above, we have proved the following inequality,
\[
\bigl\|\hat X - X\bigr\|_{\rm HS}
\le C\lambda_1\lambda_2\lambda_3\bigl\|Z_{[\Omega_1,\Omega_2,\Omega_3]}\bigr\|_{\rm HS}
+ C\lambda_1\lambda_2\lambda_3\sum_{t=1}^{3}\xi_t\|Z_{\Omega_t\times\Xi_t}\|_{\rm HS}
+ C\lambda_1\lambda_2\lambda_3\sum_{t=1}^{3}\frac{\xi_t}{\lambda_t}\|Z_{\Xi_t}\|_{\rm HS}.
\]
By a similar argument, we can show the upper bound for $\|\hat X - X\|_{\rm op}$,
\[
\bigl\|\hat X - X\bigr\|_{\rm op}
\le C\lambda_1\lambda_2\lambda_3\bigl\|Z_{[\Omega_1,\Omega_2,\Omega_3]}\bigr\|_{\rm op}
+ C\lambda_1\lambda_2\lambda_3\sum_{t=1}^{3}\xi_t\|Z_{\Omega_t\times\Xi_t}\|
+ C\lambda_1\lambda_2\lambda_3\sum_{t=1}^{3}\frac{\xi_t}{\lambda_t}\|Z_{\Xi_t}\|.
\]

Therefore, we have finished the proof for Theorem 2. 

References

Agarwal, A., Negahban, S., and Wainwright, M. J. (2012). Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. The Annals of Statistics, pages 1171–1197.

Barak, B. and Moitra, A. (2016). Noisy tensor completion via the sum-of-squares hierarchy. In 29th Annual Conference on Learning Theory, pages 417–445.

Bhojanapalli, S. and Sanghavi, S. (2015). A new sampling technique for tensors. arXiv preprint arXiv:1502.05023.

Cai, T., Cai, T. T., and Zhang, A. (2016). Structured matrix completion with applications to genomic data integration. Journal of the American Statistical Association, 111(514):621–633.

Cai, T. T. and Zhang, A. (2015). ROP: Matrix recovery via rank-one projections. The Annals of Statistics, 43(1):102–138.

Cai, T. T. and Zhang, A. (2016). Rate-optimal perturbation bounds for singular subspaces with applications to high-dimensional statistics. arXiv preprint arXiv:1605.00353.

Cai, T. T. and Zhou, W.-X. (2016). Matrix completion via max-norm constrained optimization. Electronic Journal of Statistics, 10(1):1493–1525.

Candès, E. J. and Tao, T. (2010). The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080.

Cao, Y. and Xie, Y. (2016). Poisson matrix recovery and completion. IEEE Transactions on Signal Processing, 64(6):1609–1620.

Cao, Y., Zhang, A., and Li, H. (2016). Count zeros replacement in compositional data via multinomial matrix completion. Preprint.

Gandy, S., Recht, B., and Yamada, I. (2011). Tensor completion and low-n-rank tensor recovery via convex optimization. Inverse Problems, 27(2):025010.

Gross, D. and Nesme, V. (2010). Note on sampling without replacing from a finite collection of matrices. arXiv preprint arXiv:1001.2738.

Guhaniyogi, R., Qamar, S., and Dunson, D. B. (2015). Bayesian tensor regression. arXiv preprint arXiv:1509.06490.

Hillar, C. J. and Lim, L.-H. (2013). Most tensor problems are NP-hard. Journal of the ACM, 60(6):45.

Jain, P. and Oh, S. (2014). Provable tensor factorization with missing data. In Advances in Neural Information Processing Systems, pages 1431–1439.

Jiang, X., Raskutti, G., and Willett, R. (2015). Minimax optimal rates for Poisson inverse problems with physical constraints. IEEE Transactions on Information Theory, 61(8):4458–4474.

Karatzoglou, A., Amatriain, X., Baltrunas, L., and Oliver, N. (2010). Multiverse recommendation: n-dimensional tensor factorization for context-aware collaborative filtering. In Proceedings of the Fourth ACM Conference on Recommender Systems, pages 79–86. ACM.

Keshavan, R. H., Oh, S., and Montanari, A. (2009). Matrix completion from a few entries. In 2009 IEEE International Symposium on Information Theory, pages 324–328. IEEE.

Kolda, T. G. and Bader, B. W. (2009). Tensor decompositions and applications. SIAM Review, 51(3):455–500.

Koltchinskii, V., Lounici, K., and Tsybakov, A. B. (2011). Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics, pages 2302–2329.

Kressner, D., Steinlechner, M., and Vandereycken, B. (2014). Low-rank tensor completion by Riemannian optimization. BIT Numerical Mathematics, 54(2):447–468.

Krishnamurthy, A. and Singh, A. (2013). Low-rank matrix and tensor completion via adaptive sampling. In Advances in Neural Information Processing Systems, pages 836–844.

Li, L., Chen, Z., Wang, G., Chu, J., and Gao, H. (2014). A tensor prism algorithm for multi-energy CT reconstruction and comparative studies. Journal of X-ray Science and Technology, 22(2):147–163.

Li, L. and Zhang, X. (2016). Parsimonious tensor response regression. Journal of the American Statistical Association, (just-accepted).

Li, N. and Li, B. (2010). Tensor completion for on-board compression of hyperspectral images. In 2010 IEEE International Conference on Image Processing, pages 517–520. IEEE.

Li, X., Zhou, H., and Li, L. (2013). Tucker tensor regression and neuroimaging analysis. arXiv preprint arXiv:1304.5637.

Liu, J., Musialski, P., Wonka, P., and Ye, J. (2013). Tensor completion for estimating missing values in visual data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):208–220.

Mahoney, M. W., Maggioni, M., and Drineas, P. (2008). Tensor-CUR decompositions for tensor-based data. SIAM Journal on Matrix Analysis and Applications, 30(3):957–987.

Mu, C., Huang, B., Wright, J., and Goldfarb, D. (2014). Square deal: Lower bounds and improved relaxations for tensor recovery. In ICML, pages 73–81.

Negahban, S. and Wainwright, M. J. (2011). Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics, pages 1069–1097.

Nowak, R. D. and Kolaczyk, E. D. (2000). A statistical multiscale framework for Poisson inverse problems. IEEE Transactions on Information Theory, 46(5):1811–1825.

Rauhut, H., Schneider, R., and Stojanac, Z. (2016). Low rank tensor recovery via iterative hard thresholding. arXiv preprint arXiv:1602.05217.

Recht, B. (2011). A simpler approach to matrix completion. Journal of Machine Learning Research, 12(Dec):3413–3430.

Rendle, S. and Schmidt-Thieme, L. (2010). Pairwise interaction tensor factorization for personalized tag recommendation. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, pages 81–90. ACM.

Rohde, A. and Tsybakov, A. B. (2011). Estimation of high-dimensional low-rank matrices. The Annals of Statistics, 39(2):887–930.

Semerci, O., Hao, N., Kilmer, M. E., and Miller, E. L. (2014). Tensor-based formulation and nuclear norm regularization for multienergy computed tomography. IEEE Transactions on Image Processing, 23(4):1678–1693.

Shah, P., Rao, N., and Tang, G. (2015). Optimal low-rank tensor recovery from separable measurements: Four contractions suffice. arXiv preprint arXiv:1505.04085.

Srebro, N. and Shraibman, A. (2005). Rank, trace-norm and max-norm. In International Conference on Computational Learning Theory, pages 545–560. Springer.

Sun, W. W. and Li, L. (2016). Sparse low-rank tensor response regression. arXiv preprint arXiv:1609.04523.

Sun, W. W., Lu, J., Liu, H., and Cheng, G. (2015). Provable sparse tensor decomposition. Journal of the Royal Statistical Society: Series B.

Tucker, L. R. (1966). Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311.

Vershynin, R. (2010). Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027.

Weyl, H. (1912). Das asymptotische Verteilungsgesetz der Eigenwerte linearer partieller Differentialgleichungen (mit einer Anwendung auf die Theorie der Hohlraumstrahlung). Mathematische Annalen, 71(4):441–479.

Yuan, M. and Zhang, C.-H. (2014). On tensor completion via nuclear norm minimization. Foundations of Computational Mathematics, pages 1–38.

Yuan, M. and Zhang, C.-H. (2016). Incoherent tensor norms and their applications in higher order tensor completion. arXiv preprint arXiv:1606.03504.

Zhang, A., Brown, L. D., and Cai, T. T. (2016). Semi-supervised inference: General theory and estimation of means. arXiv preprint arXiv:1606.07268.

Zhou, H., Li, L., and Zhu, H. (2013). Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association, 108(502):540–552.

