Multivariate Regression with Gross Errors on Manifold-valued Data
Xiaowei Zhang, Member, IEEE, Xudong Shi, Yu Sun, Member, IEEE, and Li Cheng, Senior Member, IEEE
Abstract: We consider the topic of multivariate regression on manifold-valued output, that is, for a multivariate observation, its output response lies on a manifold. Moreover, we propose a new regression model to deal with the presence of grossly corrupted manifold-valued responses, a bottleneck issue commonly encountered in practical scenarios. Our model first takes a correction step on the grossly corrupted responses via geodesic curves on the manifold, and then performs multivariate linear regression on the corrected data. This results in a nonconvex and nonsmooth optimization problem on manifolds. To this end, we propose a dedicated approach named PALMR, by utilizing and extending the proximal alternating linearized minimization techniques. Theoretically, we investigate its convergence property, where it is shown to converge to a critical point under mild conditions. Empirically, we test our model on both synthetic and real diffusion tensor imaging data, and show that our model outperforms other multivariate regression models when manifold-valued responses contain gross errors, and is effective in identifying gross errors.

Index Terms: Manifold-valued data, multivariate linear regression, gross error, nonsmooth optimization on manifolds, diffusion tensor imaging.
1 INTRODUCTION
This paper focuses on multivariate regression on manifolds [1], [2], [3], [4], where, given a multivariate observation x ∈ R^d, the output response y lies on a Riemannian manifold M. This line of work has many applications. For example, research evidence in diffusion tensor imaging (DTI) (e.g. [5]) indicates that the shape and orientation of diffusion tensors are profoundly affected by age, gender and handedness (i.e., left- or right-handed). In particular, we consider noisy manifold-valued output scenarios where data are subject to sporadic contamination by gross errors of large or even unbounded magnitude. Such grossly corrupted data are often encountered in practice due to unreliable data collection or data with missing values: for example, errors in DTI data can be introduced by Echo-Planar Imaging (EPI) distortion [6] or inter-subject registration [7], where practical measurement errors such as Rician noise or other sensor noise have a significant impact on the shape and orientation of tensors [8], [9]. Although the problem of learning from data with possible gross errors in Euclidean spaces has gained increasing interest [10], [11], [12], [13], [14], [15], to the best of our knowledge there exists no prior work dealing with manifold-valued responses with gross errors.

Our main idea can be summarized as follows. For each manifold-valued response y ∈ M, we explicitly model its possible gross error (in y). This gives rise to a corrected manifold-valued datum y^c obtained by removing the identified gross error component from y, which is realized via geodesic curves on M. Note that y^c could be the same as y, corresponding to no gross error in y. The corrected manifold-valued data can then be utilized as the responses in multivariate geodesic regression, which boils down to a known problem [4]. More details are illustrated in Fig. 1 and are fully described in Section 3. Unfortunately, the induced optimization problem becomes rather challenging as it contains nonconvex and nonsmooth functions on manifolds. Inspired by the recent development of proximal alternating linearized minimization methods in Euclidean spaces, in this paper we propose to generalize this technique to Riemannian manifolds, and we name the resulting method PALMR.

The main contributions of this paper are three-fold. First, we propose to address a novel problem of multivariate regression on manifolds where the manifold-valued responses are subject to possible contamination by gross errors. Second, a new algorithm named PALMR is proposed to tackle the induced nonconvex and nonsmooth optimization on manifolds, for which we also provide a convergence analysis. The algorithm and analysis are applicable to a class of nonconvex and nonsmooth optimization problems on manifolds. Empirically, our algorithm has been evaluated on both synthetic and real DTI data, where results suggest that the algorithm is effective in identifying gross errors and recovering corrupted data, and that it produces better predictive results than regression models that do not consider gross errors.
• Xiaowei Zhang is with the Bioinformatics Institute, A*STAR, Singapore. E-mail: [email protected]
• Xudong Shi is with the School of Computing, National University of Singapore, Singapore. E-mail: [email protected]
• Yu Sun is with the Singapore Institute for Neurotechnology, National University of Singapore, Singapore 117456. E-mail: [email protected]
• Li Cheng is with the Bioinformatics Institute, A*STAR, Singapore, and the School of Computing, National University of Singapore, Singapore. E-mail: [email protected]
Fig. 1. An illustration of the proposed approach for multivariate regression on grossly corrupted manifold-valued data, which contains two main ingredients: the first is to obtain the corrected response y_i^c by removing its possible gross error, as illustrated by the directed curves on the manifold; the second involves the manifold-valued regression process using {x_i, y_i^c}. Here x_1^1 and x_1^2 are the components of input x_1 along tangent vectors v_1 and v_2 at point p ∈ M; similarly for x_2^1 and x_2^2. See Section 3.1 for details. Best viewed in color.
Third, our approach makes connections to two established research areas, namely learning from grossly corrupted data and multivariate regression on manifolds: when we restrict ourselves to the special case of Euclidean space, our approach reduces to the robust regression considered in, e.g., [13], [14]; on the other hand, when there is no gross error, the problem boils down to that of multivariate regression on manifolds as considered in [4], and the method of [4] can be regarded as a special case of our approach. Our code is also made publicly available¹.

1.1 Related work
Manifold-valued data arise from a wide range of application domains including neural imaging [16], shape modelling [17], robotics [18], and graphics [19]. One prominent example is DTI [20], where data lie in the Riemannian manifold of 3 × 3 symmetric positive definite (SPD) matrices. In this work, we use S(n) and S++(n) to denote the set of n × n symmetric matrices and n × n SPD matrices, respectively. Other examples include high angular resolution diffusion imaging, where data can be modelled as the square root of orientation distribution functions lying on the unit sphere [4], [21], as well as group-valued data such as SO(3) and SE(3) in shape analysis [17] and robotics [18]. It is well known that in such scenarios it is in general much better to conduct statistical analysis directly on the manifold (i.e., the curved space) instead of in the ambient Euclidean space (i.e., the flat space), which we also verify empirically. Unsurprisingly, there exists a wealth of prior work studying statistics on manifolds [2], [17], [20], [22]. This is to be distinguished from the well-known topic of manifold learning [23], where the data are assumed to be sampled from a certain manifold embedded in a usually much higher dimensional Euclidean space, and one is supposed to extract intrinsic geometric properties of the manifold from observations. Here, instead, the manifold is usually known a priori, and the task is to engage appropriate statistical models in the analysis of the manifold-valued data.

In the area of regression on manifolds, Fletcher [2] proposes geodesic regression, which generalizes univariate linear regression on flat spaces to manifolds by regressing a manifold-valued response against a real-valued independent variable with a geodesic curve. [21] adapts the idea of geodesic regression to regressing sphere-valued data against a real scalar. [3] investigates parametric polynomial regression on Riemannian manifolds, while [1] studies regression on the group of diffeomorphisms for detecting longitudinal anatomical shape changes. Banerjee et al. [24] propose a nonlinear kernel-based regression method for manifold-valued data. Hong et al. [25] propose a shooting spline-based regression technique specifically designed for the Grassmannian manifold. The closest work might be [4], which extends the idea of geodesic regression [2] to multivariate regression on manifolds and applies it to analyze diffusion weighted imaging data. In the area of learning with grossly corrupted data, various methods [10], [13], [14], [15], [26] have been proposed for linear regression with gross errors in Euclidean space, among which the robust lasso of [13] and the robust multi-task regression of [14] can be considered special cases of our approach when restricted to Euclidean spaces.
2 BACKGROUND
We first briefly review some concepts of Riemannian manifolds in subsection 2.1, and then nonsmooth analysis and the Kurdyka–Łojasiewicz property on Riemannian manifolds in subsections 2.2 and 2.3, respectively, which are necessary for the derivation of our algorithm and the proof of convergence. We then review some models for multivariate linear regression with gross errors in Euclidean space, whose ideas are utilized to design our new model.

1. Our implementation is available at the project website http://web.bii.a-star.edu.sg/∼zhangxw/palmr-SPD/.
2.1 Riemannian manifolds
Let (M, g) denote a smooth manifold M endowed with a Riemannian metric g. Denote by T_pM the tangent space at a point p and by TM := ∪_{p∈M} T_pM the tangent bundle. The notation (p, v) ∈ M × TM refers to p being a point of M and v a tangent vector at p. ⟨u, v⟩_p := g_p(u, v) is the inner product between two vectors u and v in T_pM, with g_p being the metric at p. The induced norm is ‖u‖_p := ⟨u, u⟩_p^{1/2}. Let γ : [a, b] → M be a piecewise smooth curve such that γ(a) = p and γ(b) = q, with curve length $\int_a^b \|\gamma'(t)\|_{\gamma(t)}\,dt$, where γ'(t) denotes the derivative. The Riemannian distance d_M(p, q) between p and q is defined as the infimum of the length over all piecewise smooth curves joining these two points. Let ∇ be the Levi-Civita connection associated with (M, g). A curve γ is called a geodesic if ∇_{γ'}γ' = 0. A Riemannian manifold is complete if its geodesics γ(t) are defined for any value of t ∈ R. The parallel transport along γ from γ(a) to γ(b) is a mapping P_{γ(a)γ(b)} : T_pM → T_qM defined by P_{γ(a)γ(b)}(v) = V(b), where V is the unique vector field satisfying ∇_{γ'}V = 0 and V(a) = v. The exponential map at a point p is a mapping Exp_p : T_pM → M defined as Exp_p(v) = γ(1), where γ : [0, 1] → M is the geodesic such that γ(0) = p and γ'(0) = v. The inverse of the exponential map (if it exists) is denoted by Exp_p^{-1}. To simplify notation, we also use ⟨·,·⟩, ‖·‖, d(·,·), and Exp(p, v) to denote the inner product, norm, Riemannian distance, and exponential map, respectively, when there is no confusion.

We focus on Hadamard manifolds, i.e., complete and simply connected finite-dimensional Riemannian manifolds with nonpositive sectional curvature. The class of Hadamard manifolds possesses many nice properties: for example, any two points in M can be joined by a unique geodesic. In this case, the exponential map is a global diffeomorphism and d(p, q) = ‖Exp_p^{-1} q‖_p. One example of a Hadamard manifold is the manifold of symmetric positive definite matrices. Motivated readers can consult [27], [28] for further details on manifolds and differential geometry.

2.2 Nonsmooth analysis on Riemannian manifolds
Given an extended real-valued function σ : M → R ∪ {+∞}, we define its domain by dom σ := {p ∈ M : σ(p) < +∞} and its epigraph by epi σ := {(p, β) ∈ M × R : σ(p) ≤ β}. σ is said to be proper if dom σ ≠ ∅ and σ(p) > −∞ for all p ∈ dom σ. σ is lower semicontinuous if epi σ is closed.

Definition 1 ( [29]). Let σ be a proper and lower semicontinuous (PLS) function. Then
• the Fréchet subdifferential of σ at any p ∈ dom σ, denoted as $\hat\partial\sigma(p)$, is defined as the set of all v ∈ T_pM which satisfy
$$\liminf_{q\to p,\ q\neq p}\ \frac{\sigma(q) - \sigma(p) - \langle v, \gamma'(0)\rangle}{d(p, q)} \ge 0$$
for the geodesic γ joining γ(0) = p and γ(1) = q. When p ∉ dom σ, we set $\hat\partial\sigma(p) = \emptyset$.
• the (limiting) subdifferential of σ at any p ∈ M, denoted as ∂σ(p), is defined as
$$\partial\sigma(p) = \{v \in T_pM : \exists (p^k, \sigma(p^k)) \to (p, \sigma(p)),\ \exists v^k \in \hat\partial\sigma(p^k)\ \text{s.t.}\ P_{\gamma^k(0)\gamma^k(1)}(v^k) \to v\},$$
where γ^k is the geodesic joining p^k and p.
• p ∈ M is a critical point of σ if 0 ∈ ∂σ(p). We denote the set of critical points of σ by crit σ, that is, crit σ = {x ∈ M : 0 ∈ ∂σ(x)}.
If p is a local minimizer of σ, then by Fermat's rule 0 ∈ ∂σ(p). If σ is differentiable, then its gradient, denoted as grad σ, is the vector field satisfying ⟨grad σ(p), v⟩_p = v(σ) for all v ∈ T_pM and p ∈ M, where v(σ) denotes the directional derivative of σ in the direction v. In this case ∂σ(p) = {grad σ(p)}. Moreover, we have the following definition of Lipschitz gradients for smooth functions on manifolds:

Definition 2 ( [30]). Let σ : M → R be a continuously differentiable function and L > 0. σ is said to have L-Lipschitz gradient if, for any p, q ∈ M and any geodesic segment γ : [0, r] → M joining p and q,
$$\|\partial\sigma(\gamma(t)) - P_{\gamma(0)\gamma(t)}\,\partial\sigma(p)\|_{\gamma(t)} \le L\,l(t), \quad \forall t \in [0, r],$$
where γ(0) = p, P_{γ(0)γ(t)} is the parallel transport along γ from p to γ(t), and l(t) denotes the length of the segment between p and γ(t). In addition, if M is a Hadamard manifold, then the last inequality becomes
$$\|\partial\sigma(\gamma(t)) - P_{\gamma(0)\gamma(t)}\,\partial\sigma(p)\|_{\gamma(t)} \le L\,d(p, \gamma(t)).$$

2.3 Kurdyka–Łojasiewicz (K-L) property on Riemannian manifolds
The Kurdyka–Łojasiewicz (K-L) property plays a crucial role in nonsmooth analysis [31], [32]. In this subsection we extend the K-L property from Euclidean spaces to Riemannian manifolds. To do this, we need to introduce some basic notations. If A is a subset of M, then the distance between x ∈ M and A is defined by dist(x, A) := inf{d(x, y) : y ∈ A}
when A is nonempty, and dist(x, A) = +∞ for all x ∈ M when A is empty. For a fixed x ∈ M, the open ball neighborhood of x with radius η is defined as B(x, η) := {y ∈ M : d(x, y) < η}.

Definition 3. Given real scalars α and β, we define [α ≤ σ ≤ β] := {x ∈ M : α ≤ σ(x) ≤ β}. We define [α < σ < β] similarly. Now, we define the K-L property.

Definition 4 ( [32]). Let σ : M → R ∪ {+∞} be a PLS function. The function σ is said to have the K-L property at x̄ ∈ dom σ if there exist η ∈ (0, ∞], a neighborhood U of x̄ and a continuous concave function φ : [0, η) → R_+ such that
(i) φ(0) = 0, φ is continuously differentiable on (0, η) and φ'(s) > 0 for all s ∈ (0, η);
(ii) the following K-L inequality holds:
$$\phi'(\sigma(x) - \sigma(\bar x))\,\mathrm{dist}(0, \partial\sigma(x)) \ge 1, \quad \forall x \in U \cap [\sigma(\bar x) < \sigma < \sigma(\bar x) + \eta].$$
We call σ a K-L function if it has the K-L property at each point of dom σ. K-L functions are ubiquitous in applications, for example, semi-algebraic, subanalytic and log-exp functions [31], [32].

2.4 Multivariate linear regression with gross errors
Given a matrix of N observations X ∈ R^{N×d} and the corresponding m-dimensional response matrix Y ∈ R^{N×m}, one of the central problems in linear regression is to accurately estimate the regression matrix V* ∈ R^{d×m} from

$$Y = XV^* + Z, \qquad (1)$$

with Z ∈ R^{N×m} being the stochastic noise. In most existing work on linear regression, Z is assumed to be composed of entries following a normal distribution with zero mean. However, when the response Y is subject to possible gross error, the estimated regression matrix deviates significantly from the true value and becomes unreliable. To deal with this problem, several recent works [12], [13], [14] suggest considering the model

$$Y = XV^* + G^* + Z, \qquad (2)$$

where G* ∈ R^{N×m} is used to explicitly characterize the gross error component. As in practice only a subset of responses are corrupted by gross errors, G* is a sparse matrix whose nonzero entries are unknown and whose magnitudes can be arbitrarily large. Moreover, this model can also be applied to the case where some entries of Y are missing. A commonly used paradigm for estimating (V*, G*) is to solve the convex optimization problem

$$\min_{V,G}\ \frac{1}{2}\|Y - XV - G\|_F^2 + \lambda R_v(V) + \rho R_g(G), \qquad (3)$$

where λ > 0 and ρ > 0 are tuning parameters, and R_v and R_g are regularization terms on V and G, respectively. Some frequently used regularizers include the ℓ1 norm ‖·‖_1, which is the sum of the absolute values of all entries, and the ℓ1,2 norm, which is the sum of the ℓ2 norms of the rows of a matrix. For example, in [14] the authors propose to use R_v(V) = ‖V‖_{1,2} and R_g(G) = ‖G‖_1.
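To make the estimation paradigm in (3) concrete, the following is a minimal Python sketch of an alternating proximal-gradient solver for the instance R_v(V) = ‖V‖_{1,2} and R_g(G) = ‖G‖_1. The function names, the step-size choices, and the toy data are our own illustrative assumptions and not the solvers used in [12], [13], [14].

```python
import numpy as np

def prox_l1(M, t):
    # entrywise soft-thresholding: prox of t * ||M||_1
    return np.sign(M) * np.maximum(np.abs(M) - t, 0.0)

def prox_l12(M, t):
    # row-wise shrinkage: prox of t * ||M||_{1,2} (sum of l2 norms of the rows)
    norms = np.maximum(np.linalg.norm(M, axis=1, keepdims=True), 1e-12)
    return M * np.maximum(1.0 - t / norms, 0.0)

def robust_linear_regression(X, Y, lam, rho, n_iter=500):
    """Alternating proximal-gradient sketch for
       min_{V,G} 0.5*||Y - X V - G||_F^2 + lam*||V||_{1,2} + rho*||G||_1."""
    V = np.zeros((X.shape[1], Y.shape[1]))
    G = np.zeros_like(Y)
    Lv = np.linalg.norm(X, 2) ** 2 + 1e-12   # Lipschitz constant of the V-block gradient
    Lg = 1.0                                  # Lipschitz constant of the G-block gradient
    for _ in range(n_iter):
        R = X @ V + G - Y                     # residual with current iterates
        V = prox_l12(V - (X.T @ R) / Lv, lam / Lv)
        R = X @ V + G - Y                     # refresh residual after the V-update
        G = prox_l1(G - R / Lg, rho / Lg)
    return V, G

# toy usage: a few responses corrupted by large sparse errors
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
Y = X @ rng.normal(size=(2, 3)) + 0.01 * rng.normal(size=(50, 3))
Y[:5] += 10.0                                 # gross errors on 5 samples
V_hat, G_hat = robust_linear_regression(X, Y, lam=0.1, rho=0.5)
```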
3 OUR APPROACH
Consider a set of training examples {(x_i, y_i)}_{i=1}^N, where y_i lies on a Riemannian manifold M and x_i ∈ R^d is the associated independent variable. We propose a novel extension of the modeling approach of (2) from Euclidean spaces to the more general curved spaces, as follows.
3.1 From Euclidean spaces to manifolds
Model (2) can be reformulated as Y − G* = XV* + Z. Denote Y^c := Y − G*, which can be interpreted as the corrected response after removing the gross error. Now model (2) can be reformulated as standard linear regression (1) with response Y^c. With this in mind, we proceed to extend the aforementioned idea to regression on manifolds. For each manifold-valued response y_i, denote by y_i^c its corrected version. Different from the Euclidean setting, where Y^c can be obtained from Y simply by a translation, we need to ensure that y_i^c remains on the manifold. This is accomplished by the exponential map y_i^c = Exp_{y_i}(g_i) with the gross error g_i ∈ T_{y_i}M for each of the training examples, i ∈ {1, ..., N}. Note that when M is a Euclidean space, the exponential map reduces to addition, as Exp_{y_i}(g_i) = y_i + g_i. In other words, translation in the affine space is a special case of the exponential map in the more general curved space. As illustrated in Fig. 1, we first obtain the corrected manifold-valued response y_i^c = Exp_{y_i}(g_i). Then the relationship between x_i and y_i^c can be modeled as

$$\mathrm{Exp}_{y_i}(g_i) = \mathrm{Exp}\Big(\mathrm{Exp}\Big(p, \sum_{j=1}^{d} x_i^j v_j\Big),\ z_i\Big), \qquad (4)$$
where p ∈ M and {v_j}_{j=1}^d ⊂ T_pM is a set of tangent vectors at p, x_i^j is the j-th component of x_i, and z_i is a tangent vector at Exp(p, Σ_{j=1}^d x_i^j v_j). Our model can be viewed as a generalization of the linear regression model (1) from flat spaces to manifolds, where p denotes the intercept, in analogy to the origin 0 of the flat space in (1), and the exponential map corresponds to the addition operator in (1). To measure the training loss, we use

$$E(p, \{v_j\}, \{g_i\}) := \frac{1}{2}\sum_i d^2\Big(\mathrm{Exp}_{y_i}(g_i),\ \mathrm{Exp}_p\Big(\sum_j x_i^j v_j\Big)\Big)$$

to denote the sum of squared Riemannian distances between the corrected data y_i^c = Exp_{y_i}(g_i) and the predictions Exp_p(Σ_j x_i^j v_j), and let R_v and R_g denote two regularization terms controlling the magnitudes of {v_j} and {g_i}, respectively. The problem considered in this paper can now be formulated as the following optimization problem
$$(\tilde p, \{\tilde v_j\}, \{\tilde g_i\}) = \operatorname*{argmin}_{(p, \{v_j\}) \in M\times TM,\ g_i \in T_{y_i}M}\ E(p, \{v_j\}, \{g_i\}) + \lambda R_v(\{v_j\}) + \rho R_g(\{g_i\}), \qquad (5)$$
where λ ≥ 0 and ρ ≥ 0 are regularization parameters. Without loss of generality, the following regularization terms are specifically considered: $R_v(\{v_j\}) := \sum_{j=1}^{d}\|v_j\|_p$ and $R_g(\{g_i\}) := \sum_{i=1}^{N}\|g_i\|_{y_i}$, with ‖·‖_p being the norm of tangent vectors at p ∈ M, and similarly for ‖·‖_{y_i}. The optimization problem now becomes
$$\min_{(p, \{v_j\}) \in M\times TM,\ g_i \in T_{y_i}M}\ E(p, \{v_j\}, \{g_i\}) + \lambda\sum_{j=1}^{d}\|v_j\|_p + \rho\sum_{i=1}^{N}\|g_i\|_{y_i}. \qquad (6)$$

3.2 Connections to existing works
We would like to point out that model (6) includes as special cases a number of related research works on gross errors or on manifold-valued regression. In this subsection, we provide three such examples.

Example 1. When M = R^m, we can establish a connection between model (6) and the robust multi-task regression studied in [14]. Specifically, instead of optimizing (6) over p ∈ R^m we select p = 0, resulting in

$$\min_{v_j, g_i \in \mathbb{R}^m}\ \frac{1}{2}\sum_{i=1}^{N}\Big\|y_i - \sum_{j=1}^{d} x_i^j v_j + g_i\Big\|^2 + \lambda\sum_{j=1}^{d}\|v_j\| + \rho\sum_{i=1}^{N}\|g_i\|,$$

which can be rewritten as

$$\min_{V,G}\ \frac{1}{2}\|Y - XV - G\|_F^2 + \lambda\|V\|_{1,2} + \rho\|G\|_{1,2}, \qquad (7)$$

where ‖·‖ becomes the usual Euclidean norm, Y = [y_1, ..., y_N]^T ∈ R^{N×m}, X = [x_1, ..., x_N]^T ∈ R^{N×d}, V = [v_1, ..., v_d]^T ∈ R^{d×m} and G = [g_1, ..., g_N]^T ∈ R^{N×m}. The resulting model (7) is exactly the one considered in [14], except that the regularization term ‖G‖_1 in [14] is replaced by ‖G‖_{1,2} here.

Example 2. If M = R^m, we can show by Fermat's rule that the optimal solution p̃ is given by $\tilde p = \frac{1}{N}\sum_{i=1}^{N}\big(y_i + g_i - \sum_j x_i^j v_j\big)$. By substituting p̃ into problem (6) and assuming {(x_i, y_i)} has empirical mean 0, that is, $\sum_{i=1}^N x_i = 0$ and $\sum_{i=1}^N y_i = 0$, the optimization problem (6) reduces to

$$\min_{v_j, g_i \in \mathbb{R}^m}\ \frac{1}{2}\sum_{i=1}^{N}\Big\|y_i - \sum_{j=1}^{d} x_i^j v_j + g_i - \frac{1}{N}\sum_{i'=1}^{N} g_{i'}\Big\|^2 + \lambda\sum_{j=1}^{d}\|v_j\| + \rho\sum_{i=1}^{N}\|g_i\|,$$

which can be reformulated as

$$\min_{V,G}\ \frac{1}{2}\|Y - XV - \bar G\|_F^2 + \lambda\|V\|_{1,2} + \rho\|G\|_{1,2} \quad \text{s.t.} \quad \bar G = \Big(I - \frac{1}{N}\mathbf{1}_N\mathbf{1}_N^\top\Big)G, \qquad (8)$$

where 1_N ∈ R^N is a column vector with all entries equal to 1. The difference between Example 1 and Example 2 is that the former is obtained by selecting p = 0, while the latter optimizes over p, exactly following model (6). The resulting models are quite similar, except that model (8) needs to center the variable G.

Example 3. If we let λ = 0 and ρ = +∞, then optimization problem (6) reduces to

$$\min_{(p, \{v_j\}) \in M\times TM}\ \frac{1}{2}\sum_{i=1}^{N} d^2\Big(y_i,\ \mathrm{Exp}_p\Big(\sum_{j=1}^{d} x_i^j v_j\Big)\Big),$$

which recovers exactly the model considered in [4]. In this regard, the MGLM model in [4] is a special case of our model.
Algorithm 1 (PALMR): PALM on Riemannian manifolds
Input: μ_1 > 1 and μ_2 > 1.
Output: the sequence {(x^k, y^k)}_{k∈N}.
1: Initialization: (x^0, y^0) and k = 0.
2: while stopping criterion not satisfied do
3:    Set c_k = μ_1 L_1(y^k) and compute x^{k+1} as in (10).
4:    Set d_k = μ_2 L_2(x^{k+1}) and compute y^{k+1} as in (11).
5: end while

3.3 PALM for optimization on Hadamard manifolds

In this subsection, we propose a new algorithm to solve optimization problem (6). Existing techniques [33], [34], [35] for manifold-based optimization focus on smooth problems and are therefore not applicable to the nonconvex and nonsmooth manifold-based problem (6). On the other hand, as we will explain in detail in subsection 3.4, we observe that (6) admits the form

$$\min_{x\in M_1,\ y\in M_2}\ \Psi(x, y) := f(x) + g(y) + h(x, y), \qquad (9)$$
where M_1 and M_2 are Hadamard manifolds, f : M_1 → R ∪ {+∞} and g : M_2 → R ∪ {+∞} are proper and lower semicontinuous functions, and h : M_1 × M_2 → R is a smooth function. Inspired by the Proximal Alternating Linearized Minimization (PALM) algorithm in [36], in what follows we propose PALMR, an inexact proximal alternating minimization algorithm for problem (9). We alternately solve the following two proximally linearized subproblems

$$x^{k+1} \in \operatorname*{argmin}_{x\in M_1}\ f(x) + \big\langle \mathrm{Exp}_{x^k}^{-1}x,\ \partial_x h(x^k, y^k)\big\rangle + \frac{c_k}{2}d_{M_1}^2(x^k, x), \qquad (10)$$

$$y^{k+1} \in \operatorname*{argmin}_{y\in M_2}\ g(y) + \big\langle \mathrm{Exp}_{y^k}^{-1}y,\ \partial_y h(x^{k+1}, y^k)\big\rangle + \frac{d_k}{2}d_{M_2}^2(y^k, y), \qquad (11)$$

where c_k = μ_1 L_1(y^k) and d_k = μ_2 L_2(x^{k+1}) with μ_1 > 1, μ_2 > 1, and L_1(y^k), L_2(x^{k+1}) being the Lipschitz constants of ∂_x h and ∂_y h, respectively, as explained in Assumption 1. In particular, by exploiting the fact that M_1 is a Hadamard manifold, (10) can be reformulated with a change of variable v = Exp_{x^k}^{-1}x as

$$v^k \in \operatorname*{argmin}_{v\in T_{x^k}M_1}\ (f\circ \mathrm{Exp}_{x^k})(v) + \frac{c_k}{2}\Big\|v + \frac{1}{c_k}\partial_x h(x^k, y^k)\Big\|^2,$$

which is an optimization problem in the linear space T_{x^k}M_1, and then x^{k+1} = Exp_{x^k}(v^k). Since f is PLS with inf_{x∈M} f(x) > −∞ and Exp_{x^k} is smooth, the composite function f ∘ Exp_{x^k} is PLS with inf_{v∈T_{x^k}M}(f ∘ Exp_{x^k})(v) > −∞, which, together with Theorem 1.25 of [37], implies that v^k is well defined. Moreover, the above optimization problem defining v^k is a proximity operator [38], denoted as

$$v^k = \mathrm{prox}_{c_k}^{f\circ\mathrm{Exp}_{x^k}}\Big(-\frac{1}{c_k}\partial_x h(x^k, y^k)\Big).$$

Similar claims apply to problem (11), implying the well-definedness of x^{k+1} and y^{k+1}. Solving (10) and (11) alternately yields the algorithm PALMR outlined in Algorithm 1. We can also prove the convergence of PALMR. For this purpose, we need the following assumptions.
Assumption 1. The function Ψ(x, y) satisfies the following conditions:
(i) Ψ(x, y) has the Kurdyka–Łojasiewicz (K-L) property on Hadamard manifolds.
(ii) inf f > −∞, inf g > −∞ and inf Ψ > −∞.
(iii) For any fixed y, the function x → h(x, y) has L_1(y)-Lipschitz gradient. Likewise, for any fixed x, the function y → h(x, y) has L_2(x)-Lipschitz gradient. Moreover, there exist real scalars λ_i^-, λ_i^+ > 0 for i = 1, 2, such that
$$\inf_{k\in\mathbb{N}}\{L_1(y^k)\} \ge \lambda_1^-, \quad \inf_{k\in\mathbb{N}}\{L_2(x^k)\} \ge \lambda_2^-, \quad \sup_{k\in\mathbb{N}}\{L_1(y^k)\} \le \lambda_1^+, \quad \sup_{k\in\mathbb{N}}\{L_2(x^k)\} \le \lambda_2^+.$$
(iv) ∂h is Lipschitz continuous on bounded subsets of M_1 × M_2. More specifically, for a bounded subset A_1 × A_2 ⊆ M_1 × M_2, there exists a constant L > 0 such that for all (x_i, y_i) ∈ A_1 × A_2, i = 1, 2, we have
$$\|\partial_x h(x_1, y_1) - \partial_x h(x_1, y_2)\| \le L\,d_{M_2}(y_1, y_2), \qquad \|\partial_y h(x_1, y_1) - \partial_y h(x_2, y_1)\| \le L\,d_{M_1}(x_1, x_2).$$
Under Assumption 1 we have the following theorem.

Theorem 5. Suppose Assumption 1 holds. Let {(x^k, y^k)}_{k∈N} be a sequence generated by PALMR. Then either the sequence {d_{M_1×M_2}((x^0, y^0), (x^k, y^k))} is unbounded, or the following assertions hold:
1) The sequence {(x^k, y^k)}_{k∈N} has finite length, i.e., $\sum_k d_{M_1}(x^{k+1}, x^k) < \infty$ and $\sum_k d_{M_2}(y^{k+1}, y^k) < \infty$.
2) The sequence {(x^k, y^k)}_{k∈N} converges to a critical point (x*, y*) of Ψ.
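As a structural illustration of Algorithm 1, below is a minimal Python sketch of the alternating updates (10)-(11) written against an abstract manifold interface. All callables (`exp1`, `exp2`, `grad_x_h`, `grad_y_h`, `prox_f`, `prox_g`, `L1`, `L2`) are placeholders to be supplied by the user and are not part of our released implementation.

```python
def palmr(x0, y0, exp1, exp2, grad_x_h, grad_y_h,
          prox_f, prox_g, L1, L2, mu1=1.1, mu2=1.1, n_iter=100):
    """Sketch of Algorithm 1 (PALM on Riemannian manifolds).

    exp1/exp2         : exponential maps on M1 and M2
    grad_x_h/grad_y_h : partial gradients of the smooth coupling term h
    prox_f/prox_g     : tangent-space proximity operators of f o Exp_x and g o Exp_y,
                        called as prox(w, step, base_point)
    L1/L2             : callables returning the Lipschitz constants L1(y), L2(x)
    Tangent vectors are assumed to support scalar arithmetic (e.g. numpy arrays).
    """
    x, y = x0, y0
    for _ in range(n_iter):
        ck = mu1 * L1(y)                                 # step control for the x-block
        # subproblem (10) after the change of variable: prox step in T_x M1
        v = prox_f(-grad_x_h(x, y) / ck, 1.0 / ck, x)
        x = exp1(x, v)                                   # map the tangent update back to M1
        dk = mu2 * L2(x)                                 # uses the refreshed x, as in (11)
        w = prox_g(-grad_y_h(x, y) / dk, 1.0 / dk, y)    # subproblem (11) in T_y M2
        y = exp2(y, w)
    return x, y
```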
In what follows, we specifically investigate the dedicated realization of PALMR to solve the optimization problem (6). To simplify the notation, the resulting algorithm is also referred to as PALMR when there is no confusion.
Algorithm 2 PALMR for multivariate regression with gross errors on manifolds
Input: {(x_i, y_i)}, λ ≥ 0, ρ ≥ 0, μ_1 > 1, μ_2 > 1, and k = 0.
Output: p̃, {ṽ_j}, and {g̃_i}.
1: Initialize p, {v_j}, and {g_i}.
2: while stopping criterion not satisfied do
3:    p^{k+1} = Exp_{p^k}(−(1/c_k) ∂_p E^k).
4:    v̂_j^k = prox^{R_v}_{λ/d_k}(s_j^k).
5:    v_j^{k+1} = P_{p^k p^{k+1}}(v̂_j^k).
6:    g_i^{k+1} = prox^{R_g}_{ρ/e_k}(t_i^k).
7: end while
8: return p̃ ← p^{k+1}, ṽ_j ← v_j^{k+1} and g̃_i ← g_i^{k+1}.
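Steps 4 and 6 of Algorithm 2 are norm-shrinkage proximity operators; their closed forms are derived in subsection 3.4 below. A minimal sketch, using the Euclidean norm of a symmetric matrix as a stand-in for the tangent-space norms ‖·‖_{p^k} and ‖·‖_{y_i}:

```python
import numpy as np

def shrink(s, threshold, norm=np.linalg.norm):
    # (1 - threshold/||s||)_+ * s : proximity operator of threshold*||.|| at s
    n = norm(s)
    return np.zeros_like(s) if n <= threshold else (1.0 - threshold / n) * s

# e.g. step 4 with a symmetric 3x3 tangent vector s_j^k and threshold lambda/d_k
s_jk = np.array([[2.0, 0.5, 0.0], [0.5, 1.0, 0.2], [0.0, 0.2, 0.8]])
v_hat_jk = shrink(s_jk, 0.3)
```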
3.4 Applying PALMR to problem (6)

Optimization problem (6) in our context can be reformulated as

$$\min_{(p, \{v_j\}) \in M_1,\ \{g_i\} \in M_2}\ \underbrace{E(p, \{v_j\}, \{g_i\})}_{h(p, \{v_j\}, \{g_i\})} + \underbrace{\lambda\sum_{j=1}^{d}\|v_j\|_p}_{f(p, \{v_j\})} + \underbrace{\rho\sum_{i=1}^{N}\|g_i\|_{y_i}}_{g(\{g_i\})},$$

which is of the form (9) with M_1 = M × TM and M_2 = T_{y_1}M × ··· × T_{y_N}M. To apply PALMR to solve problem (6), we need to evaluate the gradients of E(p, {v_j}, {g_i}). To simplify the notation, we further denote the prediction ŷ_i := Exp_p(Σ_j x_i^j v_j), as well as the derivatives of the exponential map with respect to p and v as d_p Exp_p(v) and d_v Exp_p(v), respectively. Now, the partial gradient of E with respect to p amounts to

$$\partial_p E = -\sum_i \Big(d_p\mathrm{Exp}_p\Big(\sum_j x_i^j v_j\Big)\Big)^{\dagger}\,\mathrm{Exp}_{\hat y_i}^{-1} y_i^c \ \in T_pM, \qquad (12)$$

where (·)^† is the adjoint derivative of the exponential map [2] defined by

$$\big\langle \mu,\ d_p\mathrm{Exp}_p(v)\,w\big\rangle_{\mathrm{Exp}_p(v)} = \big\langle \big(d_p\mathrm{Exp}_p(v)\big)^{\dagger}\mu,\ w\big\rangle_p$$

with μ ∈ T_{Exp_p(v)}M and w ∈ T_pM. The adjoint derivative operator maps Exp_{ŷ_i}^{-1}(y_i^c) from the tangent space of ŷ_i to the tangent space of p. Thus ∂_p E ∈ T_pM. Similarly, the partial gradients of E with respect to v_j and g_i are given by

$$\partial_{v_j} E = -\sum_i x_i^j \Big(d_v\mathrm{Exp}_p\Big(\sum_{j'} x_i^{j'} v_{j'}\Big)\Big)^{\dagger}\,\mathrm{Exp}_{\hat y_i}^{-1} y_i^c \ \in T_pM, \qquad (13)$$

and

$$\partial_{g_i} E = -\big(d_v\mathrm{Exp}_{y_i}(g_i)\big)^{\dagger}\,\mathrm{Exp}_{y_i^c}^{-1}\hat y_i \ \in T_{y_i}M, \qquad (14)$$
respectively. The PALMR algorithm for problem (6) proceeds as follows. To update (p, {v_j}), we let ∂_p E^k := ∂_p E(p^k, {v_j^k}, {g_i^k}) and ∂_{v_j} E^k := ∂_{v_j} E(p^k, {v_j^k}, {g_i^k}) and solve

$$(p^{k+1}, \{v_j^{k+1}\}) = \operatorname*{argmin}_{p, \{v_j\}}\ \big\langle \mathrm{Exp}_{p^k}^{-1}p,\ \partial_p E^k\big\rangle + \frac{c_k}{2}d^2(p, p^k) + \sum_{j=1}^{d}\Big(\big\langle P_{pp^k}(v_j) - v_j^k,\ \partial_{v_j}E^k\big\rangle + \lambda\|v_j\|_p + \frac{c_k}{2}\big\|P_{pp^k}(v_j) - v_j^k\big\|^2\Big),$$

where P_{pp^k} is the parallel transport from p to p^k along the unique geodesic between them. The above subproblem is solved by alternating minimization over p and v_j. Specifically, to update p, we solve

$$p^{k+1} = \operatorname*{argmin}_{p\in M}\ \big\langle \mathrm{Exp}_{p^k}^{-1}p,\ \partial_p E^k\big\rangle + \frac{c_k}{2}d^2(p, p^k),$$

which, by a change of variable u = Exp_{p^k}^{-1}p, is equivalent to solving

$$u^k = \operatorname*{argmin}_{u\in T_{p^k}M}\ \big\langle u,\ \partial_p E^k\big\rangle + \frac{c_k}{2}\|u\|_{p^k}^2 = -\frac{1}{c_k}\partial_p E^k,$$

and p^{k+1} = Exp_{p^k}(u^k).
To update {v_j}, we first obtain v̂_j^k by

$$\hat v_j^k = \operatorname*{argmin}_{v_j\in T_{p^k}M}\ \big\langle v_j - v_j^k,\ \partial_{v_j}E^k\big\rangle + \frac{c_k}{2}\big\|v_j - v_j^k\big\|_{p^k}^2 + \lambda\|v_j\|_{p^k} = \operatorname*{argmin}_{v_j\in T_{p^k}M}\ \frac{1}{2}\big\|v_j - s_j^k\big\|_{p^k}^2 + \frac{\lambda}{c_k}\|v_j\|_{p^k},$$

where $s_j^k = v_j^k - \frac{1}{c_k}\partial_{v_j}E^k$. Notice that the above optimization problem has the closed-form solution

$$\hat v_j^k = \mathrm{prox}_{\lambda/c_k}^{\|\cdot\|_{p^k}}(s_j^k) := \Big(1 - \frac{\lambda}{c_k\|s_j^k\|_{p^k}}\Big)_+ s_j^k,$$

where (α)_+ = α if α > 0 and 0 otherwise. Since {v̂_j^k} lie in the tangent space at p^k, we need to parallel transport them to T_{p^{k+1}}M by v_j^{k+1} = P_{p^k p^{k+1}}(v̂_j^k) along the unique geodesic between p^k and p^{k+1}. Similarly, we update {g_i} by

$$g_i^{k+1} = \operatorname*{argmin}_{g_i\in T_{y_i}M}\ \frac{1}{2}\big\|g_i - t_i^k\big\|_{y_i}^2 + \frac{\rho}{e_k}\|g_i\|_{y_i} = \Big(1 - \frac{\rho}{e_k\|t_i^k\|_{y_i}}\Big)_+ t_i^k,$$

where $t_i^k = g_i^k - \frac{1}{e_k}\partial_{g_i}E(p^{k+1}, \{v_j^{k+1}\}, \{g_i^k\})$.

Now, we are ready to present our algorithm for multivariate regression with grossly corrupted manifold-valued data, as shown in Algorithm 2. Notice that when letting λ = 0 and ρ = +∞, Algorithm 2 alternately updates the values of p and v_j via three steps: (1) $p^{k+1} = \mathrm{Exp}_{p^k}\big(-\frac{1}{c_k}\partial_p E^k\big)$, (2) $\hat v_j^k = v_j^k - \frac{1}{c_k}\partial_{v_j}E^k$, (3) $v_j^{k+1} = P_{p^k p^{k+1}}(\hat v_j^k)$, which recovers the gradient descent method proposed in [4]. During each iteration of Algorithm 2, we need to evaluate the partial derivatives ∂_p E, ∂_{v_j}E and ∂_{g_i}E, which can be intractable to compute due to the adjoint derivatives of the exponential map. As a remedy, we adopt the variational technique of [4], [39] for computing derivatives, which essentially replaces the adjoint derivative operators by parallel transports:

$$\partial_p E \approx -\sum_i P_{\hat y_i p}\big(\mathrm{Exp}_{\hat y_i}^{-1} y_i^c\big), \qquad (15)$$

$$\partial_{v_j} E \approx -\sum_i x_i^j\, P_{\hat y_i p}\big(\mathrm{Exp}_{\hat y_i}^{-1} y_i^c\big), \qquad (16)$$

$$\partial_{g_i} E \approx -P_{y_i^c y_i}\big(\mathrm{Exp}_{y_i^c}^{-1}\hat y_i\big). \qquad (17)$$

One advantage of this approximation is that for some special manifolds, including the manifold of SPD matrices S++(n), parallel transports have analytical expressions and can be computed directly. For general manifolds without analytical expressions for parallel transports, approximation approaches such as Schild's ladder [40] can be used.
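To make this concrete on the SPD manifold used in our experiments, below is a minimal, simplified Python sketch of one sweep of Algorithm 2 using the approximate gradients (15)-(17) and the closed-form SPD operations of Appendix C. The fixed step size (in place of the Lipschitz-based c_k, d_k, e_k), the Euclidean stand-in for the tangent norms, and all function names are our own illustrative assumptions rather than the released implementation.

```python
import numpy as np

def _sym_fun(a, fun):
    # apply a scalar function to the eigenvalues of a symmetric matrix
    w, v = np.linalg.eigh(a)
    return (v * fun(w)) @ v.T

def sqrtm(a):   return _sym_fun(a, np.sqrt)
def expm(a):    return _sym_fun(a, np.exp)
def logm(a):    return _sym_fun(a, np.log)

def exp_p(p, v):
    # SPD exponential map, eq. (39)
    s, si = sqrtm(p), np.linalg.inv(sqrtm(p))
    return s @ expm(si @ v @ si) @ s

def log_p(p, q):
    # SPD inverse exponential map, eq. (40)
    s, si = sqrtm(p), np.linalg.inv(sqrtm(p))
    return s @ logm(si @ q @ si) @ s

def pt(p, q, v):
    # parallel transport of v from p to q, eq. (42)
    s, si = sqrtm(p), np.linalg.inv(sqrtm(p))
    u = expm(si @ log_p(p, q) @ si / 2.0)
    e = s @ u @ si
    return e @ v @ e.T

def shrink(s, t):
    n = np.linalg.norm(s)                 # Euclidean stand-in for the tangent norm
    return np.zeros_like(s) if n <= t else (1.0 - t / n) * s

def palmr_spd_step(p, V, G, X, Y, lam, rho, step=0.1):
    """One simplified sweep of Algorithm 2 with the approximate gradients (15)-(17)."""
    N, d = X.shape
    Yc = [exp_p(Y[i], G[i]) for i in range(N)]                      # corrected responses
    Yhat = [exp_p(p, sum(X[i, j] * V[j] for j in range(d))) for i in range(N)]
    # eqs. (15)-(16): transport the log-map residuals back to T_p M
    resid = [pt(Yhat[i], p, log_p(Yhat[i], Yc[i])) for i in range(N)]
    grad_p = -sum(resid)
    grad_V = [-sum(X[i, j] * resid[i] for i in range(N)) for j in range(d)]
    p_new = exp_p(p, -step * grad_p)                                # step 3
    V = [pt(p, p_new, shrink(V[j] - step * grad_V[j], step * lam))  # steps 4-5
         for j in range(d)]
    p = p_new
    # eq. (17): gradient w.r.t. each gross-error term, then shrink (step 6)
    Yhat = [exp_p(p, sum(X[i, j] * V[j] for j in range(d))) for i in range(N)]
    for i in range(N):
        grad_gi = -pt(Yc[i], Y[i], log_p(Yc[i], Yhat[i]))
        G[i] = shrink(G[i] - step * grad_gi, step * rho)
    return p, V, G
```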
4 EXPERIMENTS
In this section, we empirically evaluate the performance of the proposed approach (i.e., PALMR) on synthetic and real DTI data sets, which lie on the S++(3) manifold of SPD matrices. Throughout all experiments, we fix λ = 0.1 and choose the optimal ρ from the set {0.05, 0.1, ..., 0.95, 1} by a validation process using a validation data set consisting of the same number of data points as the testing data. As our algorithm is iterative by nature, in practice it stops if either of two stopping criteria is met: (1) the difference between consecutive objective function values is below 1e-5, or (2) the maximum number of iterations (100) is reached.
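The model-selection protocol just described can be written as a small grid-search loop; `fit` and `score` below are hypothetical callables (e.g., a PALMR training routine and a validation-error metric) to be supplied by the user, not functions from our released code.

```python
import numpy as np

def select_rho(fit, score, train_data, val_data, lam=0.1,
               rhos=np.arange(0.05, 1.0001, 0.05), tol=1e-5, max_iter=100):
    """Pick rho on a validation set, mirroring the protocol above.

    fit(data, lam, rho, tol, max_iter) -> model   (stops when the objective
        decrease falls below tol or after max_iter iterations)
    score(model, data) -> validation error        (e.g. mean squared geodesic error)
    """
    best_rho, best_err = None, np.inf
    for rho in rhos:
        model = fit(train_data, lam, rho, tol, max_iter)
        err = score(model, val_data)
        if err < best_err:
            best_rho, best_err = rho, err
    return best_rho
```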
4.1 Synthetic DTI data
Synthetic DTI data sets are constructed with known ground truths and gross errors as follows. First, we randomly generate p ∈ S++(3), d symmetric matrices {v_j}_{j=1}^d ⊆ S(3) and {x_i}_{i=1}^N ⊆ R^d, where the entries of x_i are sampled from the standard normal distribution N(0, 1). Then the ground-truth DTI data are obtained as y_i^t := Exp_p(Σ_{j=1}^d x_i^j v_j). This is followed by DTI data with stochastic noise, y_i^s := Exp_{y_i^t}(z_i), where z_i is a random matrix in S(3) with entries sampled from N(0, 1) and satisfying ‖z_i‖_{y_i^t} ≤ 0.1. Meanwhile, the gross errors are generated by a two-step process: (a) randomly select an index subset I_g from {1, 2, ..., N} such that |I_g| = β·N with 0 ≤ β ≤ 1; (b) for i ∈ I_g, the grossly corrupted response is obtained as y_i = Exp_{y_i^s}(g_i), where g_i is a random matrix in S(3) satisfying ‖g_i‖_{y_i^s} = σ_g. The rest of the training data remain unchanged, i.e., y_i = y_i^s for i ∉ I_g. Thus, among all N manifold-valued data, the percentage of grossly corrupted data is β. With the same p and {v_j}, we also generate N_t pairs of testing data {(x_i^test, y_i^test)} and validation data.

We first conduct experiments on a data set with d = 2, N = 50, β = 40% and σ_g = 5, and compare with the multivariate general linear model (MGLM) of [4], which does not consider gross errors. Training samples are displayed in Fig. 2, where we also show the predictions of PALMR and MGLM on the training data. Visual results of PALMR and MGLM on 20 testing data and the training data correction by PALMR are presented in Fig. 3(a) and Fig. 3(b), respectively.
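The construction above can be sketched as follows; it reuses the SPD exponential map of Appendix C (re-declared here as a helper) and uses the Euclidean matrix norm as a stand-in for the tangent-space norms, so it is an illustrative approximation of our generator rather than the exact script used for the experiments.

```python
import numpy as np

def random_sym(n=3, rng=None):
    a = (rng or np.random.default_rng()).normal(size=(n, n))
    return (a + a.T) / 2.0

def exp_p(p, v):
    # SPD exponential map Exp_p(v) = p^{1/2} exp(p^{-1/2} v p^{-1/2}) p^{1/2}
    w, q = np.linalg.eigh(p)
    s = (q * np.sqrt(w)) @ q.T
    si = (q * (1.0 / np.sqrt(w))) @ q.T
    wm, qm = np.linalg.eigh(si @ v @ si)
    return s @ ((qm * np.exp(wm)) @ qm.T) @ s

def make_synthetic(N=50, d=2, beta=0.2, sigma_g=1.0, noise=0.1, rng=None):
    rng = rng or np.random.default_rng(0)
    p = exp_p(np.eye(3), random_sym(3, rng))           # random SPD intercept
    V = [random_sym(3, rng) for _ in range(d)]         # tangent "slopes"
    X = rng.normal(size=(N, d))
    corrupted = rng.choice(N, int(beta * N), replace=False)
    Y = []
    for i in range(N):
        y_t = exp_p(p, sum(X[i, j] * V[j] for j in range(d)))   # ground truth
        z = random_sym(3, rng)
        z *= min(1.0, noise / np.linalg.norm(z))                 # small noise, ||z|| <= 0.1
        y_s = exp_p(y_t, z)
        if i in corrupted:                                       # gross error of norm sigma_g
            g = random_sym(3, rng)
            y_s = exp_p(y_s, sigma_g * g / np.linalg.norm(g))
        Y.append(y_s)
    return X, Y, corrupted
```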
Fig. 2. Visualization of the synthesized training samples and the predictions of PALMR and MGLM. The two row vectors on the top give the values of X generating the data, red boxes identify the samples with gross error. The rows indexed by PALMR and MGLM display the predictions of corresponding method on the training data. All objects are viewed directly from overhead. Best viewed in color.
(a) Predictions of PALMR and MGLM on the testing data
(b) Training data correction by our model Fig. 3. Visual results of PALMR and MGLM. (a) Predictions for 20 testing data. (b) From top to bottom: training samples corrupted by gross error (i.e. samples marked by red boxes in Fig. 2), correction results of PALMR, and the true data without gross error. Best viewed in color.
Collectively, the results suggest that PALMR is indeed capable of correctly identifying the gross errors during training, which enables the delivery of a better-behaved model. Fig. 3(b) shows that PALMR can effectively recover the original data (i.e., the true data without gross error). It also produces improved regression results on the testing data, as displayed in Fig. 3(a).

Next we quantitatively evaluate the effect of varying the internal parameters of PALMR, which include the number of independent variables d, the number of training data N, the magnitude of gross error σ_g, and the percentage of grossly corrupted training data β. To see the effect of one specific parameter, synthetic DTI data are generated by varying this parameter while keeping the rest at their default values. The following default values are used: d = 2, N = 50, β = 20%, and σ_g = 1.
(a) Number of independent variables d (b) Training size N (c) Magnitude of gross error σg (d) Ratio of corrupted data β Fig. 4. Box plots showing the effect of four different parameters: the number of tangent basis d, the number of training data N , the magnitude of gross error σg , and the percentage of grossly corrupted training data β , corresponding to four columns accordingly. Plots in each row show results using the same metric. Since MGLM does not consider gross error, the last two rows only show results of PALMR. See text for details. Best viewed in color.
To evaluate the performance of PALMR, we consider the following mean squared geodesic error (MSGE) metrics:

$$MSGE_{train} := \frac{1}{N}\sum_i d^2(y_i, \hat y_i), \qquad MSGE_{test} := \frac{1}{N_t}\sum_i d^2(y_i^{test}, \hat y_i^{test}), \qquad MSGE_p := d^2(p, \tilde p),$$

$$MSGE_V := \frac{1}{d}\sum_j \big\|v_j - P_{\tilde p p}(\tilde v_j)\big\|_p^2, \qquad MSGE_G := \frac{1}{N}\sum_i \big\|\mathrm{Exp}_{y_i}^{-1}(y_i^s) - \tilde g_i\big\|_{y_i}^2,$$

where p̃, ṽ_j and g̃_i are the outputs of Algorithm 2. The data correction error is measured as $\frac{1}{N}\sum_i d(y_i^s, y_i^c)^2$. In addition, we say that a gross error g_i is correctly identified if g_i and g̃_i are either both zero or both nonzero, and compute the rate Rate_G := (number of correctly identified gross errors)/N.
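Given geodesic-distance, log-map and parallel-transport primitives for the manifold at hand (the hypothetical `dist`, `log_p` and `pt` callables below, with the Euclidean matrix norm standing in for the tangent-space norms), these metrics can be computed directly; a minimal sketch:

```python
import numpy as np

def evaluate(dist, log_p, pt, p, p_hat, V, V_hat, G_hat,
             Y_train, Y_hat, Y_test, Y_test_hat, Y_clean):
    """Compute the MSGE metrics and Rate_G of subsection 4.1.
    dist/log_p/pt are manifold primitives supplied by the caller;
    Y_clean holds the noise-only data y_i^s, Y_hat/Y_test_hat the predictions."""
    N, Nt, d = len(Y_train), len(Y_test), len(V)
    msge_train = np.mean([dist(Y_train[i], Y_hat[i]) ** 2 for i in range(N)])
    msge_test = np.mean([dist(Y_test[i], Y_test_hat[i]) ** 2 for i in range(Nt)])
    msge_p = dist(p, p_hat) ** 2
    # transport the estimated tangent vectors from p_hat to p before comparing
    msge_V = np.mean([np.linalg.norm(V[j] - pt(p_hat, p, V_hat[j])) ** 2
                      for j in range(d)])
    # the "true" correction tangent at y_i is Log_{y_i}(y_i^s)
    true_G = [log_p(Y_train[i], Y_clean[i]) for i in range(N)]
    msge_G = np.mean([np.linalg.norm(true_G[i] - G_hat[i]) ** 2 for i in range(N)])
    # gross error counted as identified if true and estimated are both (non)zero,
    # up to a small numerical tolerance (our own choice)
    rate_G = np.mean([(np.linalg.norm(true_G[i]) > 1e-8) ==
                      (np.linalg.norm(G_hat[i]) > 1e-8) for i in range(N)])
    return msge_train, msge_test, msge_p, msge_V, msge_G, rate_G
```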
Results averaged over 10 repetitions are presented in Fig. 4, where each column corresponds to the effect of one parameter and each row corresponds to the results using one metric. From Fig. 4, we make four observations. (1) PALMR has lower MSGE for all values of d, and our correction performs well on the training data, cf. column Fig. 4(a). (2) PALMR has a large advantage over MGLM for all values of the training size N and the magnitude of gross error σ_g, cf. columns Fig. 4(b)-(c). (3) PALMR can handle training data with up to 80% of the samples grossly corrupted and delivers better results than MGLM; on the other hand, the performance becomes slightly worse if more than 80% of the training data are corrupted, cf. column Fig. 4(d). (4) PALMR can reliably identify most of the gross errors, although it may not always correctly recover the true value of the error. This is evidenced in the last row of Fig. 4, where the MSGE on G increases as σ_g or β increases, and our correction error starts to stand out (i.e., becomes larger than the prediction errors of both PALMR and MGLM) when more than β = 30% of the training samples are grossly corrupted. We consider this acceptable, as in most practical situations only a few, rather than the majority, of the training examples would be contaminated by gross errors.
4.2 Real DTI data
In this section, we apply PALMR to examine the effect of age and gender on human brain white matter. We experiment with the C-MIND database² released by Cincinnati Children's Hospital Medical Center (CCHMC), whose purpose is to investigate brain development in children from infants and toddlers (0 ∼ 3 years) through adolescence (18 years). We use the imaging data of participants who were scanned at CCHMC at year one and whose ages were between 8 and 18 years (2947 to 6885 days), consisting of 27 females and 31 males. The DTI data of each subject are first manually inspected and corrected for subject movements and eddy current distortions using FSL's eddy tool [41], then passed to FSL's brain extraction tool to delete non-brain tissue³. After the pre-processing, we use FSL's DTIFIT tool to reconstruct DTI tensors. Finally, all DTIs are registered to a population-specific template constructed using DTI-TK⁴. We investigate six exemplar slices that have been identified as typical slices by domain experts and have also been used by many existing works such as [4], [21]. In particular, we are interested in the white matter region. At each voxel within the white matter region, the following multivariate regression model
$$y = \mathrm{Exp}_p(v_1 \times \text{age} + v_2 \times \text{gender}) \qquad (18)$$
is adopted to describe the relation between the DTI data y and the variables 'age' and 'gender'. In DTI studies, another frequently used measure of a tensor is fractional anisotropy (FA) [42], [43], defined as

$$FA = \sqrt{\frac{(\lambda_1 - \lambda_2)^2 + (\lambda_2 - \lambda_3)^2 + (\lambda_1 - \lambda_3)^2}{2(\lambda_1^2 + \lambda_2^2 + \lambda_3^2)}},$$

where λ_1, λ_2 and λ_3 are the eigenvalues of the tensor. FA is an important measurement of diffusion asymmetry within a voxel and reflects fiber density, axonal diameter, and myelination in white matter. In our experiments, we compare three models: two geodesic regression models, MGLM and PALMR, and the FA regression model, which uses the FA value in place of the tensor y in (18). To compare geodesic regression with FA regression, since the responses of geodesic regression are tensors, we first compute the FA values of the tensors and use the relative FA error on testing data, defined as the mean relative error between the FA values of the predicted tensors and the true tensors, as an evaluation metric. Besides the relative FA error metric, we also adopt the mean squared geodesic error (MSGE) on testing data, as in subsection 4.1, to compare the performance of MGLM and PALMR.

4.2.1 Model significance

To examine the significance of the statistical model (18) considered in our approach, the following hypothesis test is performed. The null hypothesis is H_0 : v_1 = 0, which means that, under this hypothesis, age has no effect on the DTI data. We randomly permute the values of age⁵ among all samples while fixing the DTI data, then apply model (18) to the permuted data and compute the mean squared geodesic error $MSGE_{perm} = \frac{1}{N}\sum_i \mathrm{dist}(y_i, \hat y_i^p)^2$, where $\hat y_i^p$ is the prediction of PALMR on the permuted data. Repeating the permutation T = 1000 times, we get a sequence of errors $\{MSGE_{perm}^i\}_{i=1}^T$ and calculate a p-value at each voxel as the fraction of permutations i for which $MSGE_{perm}^i$ does not exceed the error of the unpermuted model.
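For reference, a minimal sketch of the FA computation and of the permutation test just described. The p-value convention (fraction of permutations whose error does not exceed the unpermuted error) and the `fit_and_msge` callable are our own assumptions.

```python
import numpy as np

def fractional_anisotropy(tensor):
    """FA of a 3x3 diffusion tensor from its eigenvalues."""
    l1, l2, l3 = np.linalg.eigvalsh(tensor)
    num = (l1 - l2) ** 2 + (l2 - l3) ** 2 + (l1 - l3) ** 2
    den = 2.0 * (l1 ** 2 + l2 ** 2 + l3 ** 2)
    return np.sqrt(num / den) if den > 0 else 0.0

def permutation_pvalue(fit_and_msge, age, gender, tensors, T=1000, seed=0):
    """Permute 'age' T times and compare the resulting MSGEs with the MSGE
    of the unpermuted model (smaller error means a better fit).
    fit_and_msge(age, gender, tensors) -> mean squared geodesic error."""
    rng = np.random.default_rng(seed)
    observed = fit_and_msge(age, gender, tensors)
    perm_errors = np.array([fit_and_msge(rng.permutation(age), gender, tensors)
                            for _ in range(T)])
    return np.mean(perm_errors <= observed)
```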
−∞. For any fixed α ∈ R and y ∈ dom σ, let

$$y^+ \in \operatorname*{argmin}_{x\in M}\ \sigma(x) + \big\langle \mathrm{Exp}_y^{-1}x,\ \partial h(y)\big\rangle + \frac{\alpha}{2}d^2(y, x);$$

we have

$$\sigma(y^+) + h(y^+) \le \sigma(y) + h(y) - \frac{\alpha - L}{2}d^2(y^+, y).$$

Proof. Since y^+ is a minimizer, letting x = y in the objective function, we have

$$\sigma(y^+) + \big\langle \mathrm{Exp}_y^{-1}y^+,\ \partial h(y)\big\rangle + \frac{\alpha}{2}d^2(y, y^+) \le \sigma(y).$$

Moreover, since h has L-Lipschitz gradient, it follows from Lemma 7 that

$$h(y^+) \le h(y) + \big\langle \mathrm{Exp}_y^{-1}y^+,\ \partial h(y)\big\rangle + \frac{L}{2}d^2(y, y^+).$$

Adding the above two inequalities yields the required inequality.
Next, we show that if x̄ ∈ dom σ is not a critical point of σ (i.e., x̄ ∉ crit σ), then the K-L inequality holds at x̄.

Proposition 9 ( [32]). Let σ : M → R ∪ {+∞} be a PLS function and x̄ ∈ dom ∂σ such that 0 ∉ ∂σ(x̄). Then the K-L inequality holds at x̄.

Proof. For any δ > 0, take φ(t) = t/δ, U = B(x̄, δ/2) and η = δ/2; then for each x ∈ dom σ we have

$$\phi'(\sigma(x) - \sigma(\bar x))\,\mathrm{dist}(0, \partial\sigma(x)) = \mathrm{dist}(0, \partial\sigma(x))/\delta. \qquad (19)$$

Notice that x ∈ U ∩ [σ(x̄) < σ < σ(x̄) + η] implies

$$d(x, \bar x) + |\sigma(x) - \sigma(\bar x)| < \delta. \qquad (20)$$

We further claim that for each x satisfying (20), it holds that

$$\mathrm{dist}(0, \partial\sigma(x)) \ge \delta. \qquad (21)$$

Otherwise, there exist sequences {(x^k, u^k) : u^k ∈ ∂σ(x^k)} and {δ_k} ⊂ R_{++} such that δ_k → 0 as k → ∞ and

$$d(x^k, \bar x) + |\sigma(x^k) - \sigma(\bar x)| < \delta_k, \qquad \|u^k\| < \delta_k,$$

which implies

$$x^k \to \bar x, \quad \sigma(x^k) \to \sigma(\bar x), \quad u^k \in \partial\sigma(x^k), \quad \text{and} \quad \|u^k\| \to 0.$$

Thus, 0 ∈ ∂σ(x̄), which is a contradiction with the assumption. Therefore, combining (19), (20) and (21), we get

$$\phi'(\sigma(x) - \sigma(\bar x))\,\mathrm{dist}(0, \partial\sigma(x)) \ge 1.$$

So, the K-L inequality holds at x̄.

A.2 Proof of Theorem 5
Following the idea of [36], [45], we outline the proof in three steps: (1) sufficient decrease property; (2) a subdifferential lower bound for the iterates gap; (3) using the K-L property. We first show the sufficient decrease of objective function value under Assumption 1. For simplicity of notations, we use Ψk := Ψ(xk , y k ), k ≥ 0 in the sequel. Lemma 10. Suppose Assumption 1 holds. Let {(xk , y k )}k∈N be the sequence generated by PALMR. The following assertions hold. (i)
(ii)
The sequence {Ψk }k∈N is nonincreasing and satisfies τ 2 (d (xk+1 , xk ) + d2M2 (y k+1 , y k )) ≤ Ψk − Ψk+1 , 2 M1 − where τ := min{(µ1 − 1)λ− 1 , (µ2 − 1)λ2 } > 0. We have ∞ X d2M1 (xk+1 , xk ) + d2M2 (y k+1 , y k ) < ∞,
(22)
(23)
k=0
which implies
lim dM1 (xk+1 , xk ) = lim dM2 (y k+1 , y k ) = 0.
k→∞
k→∞
Proof. (i) Applying Lemma 8 to subproblems (10) and (11), we get
ck − L1 (y k ) 2 dM1 (xk+1 , xk ), 2 dk − L2 (xk+1 ) 2 g(y k+1 ) + h(xk+1 , y k+1 ) ≤g(y k ) + h(xk+1 , y k ) − dM2 (y k+1 , y k ), 2 f (xk+1 ) + h(xk+1 , y k ) ≤f (xk ) + h(xk , y k ) −
(24)
18
Adding the above two inequalities, we obtain for k ≥ 0 that
1 (µ1 − 1)L1 (y k ) 2 (µ2 − 1)L2 (xk+1 ) 2 Ψk − Ψk+1 ≥ dM1 (xk+1 , xk ) + dM2 (y k+1 , y k ) 2 2
2 (µ1 − 1)λ− (µ2 − 1)λ− 2 2 1 2 dM1 (xk+1 , xk ) + dM2 (y k+1 , y k ) ≥ 2 2 τ ≥ (d2M1 (xk+1 , xk ) + d2M2 (y k+1 , y k )), (25) 2 where in 1 we used ck = µ1 L1 (y k ), dk = µ2 L2 (xk+1 ) and µ1 , µ2 > 1, and in 2 we used the assumption inf{L1 (y k ) : k ∈ − − k N} ≥ λ1 and inf{L2 (x ) : k ∈ N} ≥ λ2 . (ii) Inequality (22) shows that {Ψk }k∈N is nonincreasing sequence, and we know from Assumption 1(ii) that inf x,y Ψ(x, y) > −∞, it follows that {Ψk }k∈N converges. Denote the limit by Ψ∗ , we have lim Ψk = Ψ∗ .
(26)
k→∞
Let K be a positive integer and sum (25) from k = 0 to k = K , we get
Ψ0 − ΨK+1 ≥
K τ X 2 d (xk+1 , xk ) + d2M2 (y k+1 , y k ). 2 k=0 M1
Taking K → ∞ in the above inequality leads to ∞ X
d2M1 (xk+1 , xk ) + d2M2 (y k+1 , y k ) ≤
k=0
2(Ψ0 − Ψ∗ ) < +∞, τ
which completes the proof. Regarding the subdifferential of Ψ defined in (9) of the main text, we have the following result whose derivation is similar to that in [31]. Proposition 11 ( [32]). Let Ψ be as in (9), then for all (x, y) ∈ dom Ψ = dom f + dom g , we have
∂Ψ(x, y) = {∂f (x) + ∂x h(x, y)} × {∂g(y) + ∂y h(x, y)} = ∂x Ψ(x, y) × ∂y Ψ(x, y). Next, we derive a subdifferential lower bound for the gap between two consecutive iterates, and study some properties of the set of limit point of iterates under the assumption of boundedness. Lemma 12. Suppose Assumption 1 holds. Let {(xk , y k )}k∈N be the sequence generated by PALMR which is assumed to be bounded. For each k ≥ 0, define
Ak+1 =ck Exp−1 xk + ∂x h(xk+1 , y k+1 ) − Pγxk (0)γxk (1) ∂x h(xk , y k ) x xk+1 Ak+1 =dk Exp−1 y k + ∂y h(xk+1 , y k+1 ) − Pγyk (0)γyk (1) ∂y , h(xk+1 , y k ). y y k+1 where γxk : [0, 1] → M1 is the geodesic joining xk and xk+1 such that γxk (0) = xk and γxk (1) = xk+1 , and γyk is similarly defined. Then
(Ak+1 , Ak+1 ) ∈ ∂Ψ(xk+1 , y k+1 ) ⊆ Txk+1 M1 × Tyk+1 M2 . x y
(27)
Moreover, we have k+1 kAk+1 kxk+1 ≤ (1 + µ1 )λ+ , xk ) + LdM2 (y k+1 , y k ), x 1 dM1 (x
k+1 kAk+1 kyk+1 ≤ (1 + µ2 )λ+ , y k ). y 2 dM2 (y
Proof. Since xk+1 is a minimizer of subproblem (10), by the Fermat’s rule, we have D E c k −1 k k 0 ∈∂ f (x) + Expx + d2M1 (xk , x) k x, ∂x h(x , y ) 2 x=xk+1 k =∂f (xk+1 ) + Pγxk (0)γxk (1) ∂x h(xk , y k ) − ck Exp−1 x , xk+1
(28)
(29)
0 where we used Proposition 6, Proposition 11 and the fact that ∂x d2M (x0 , x) = −2Exp−1 x x for Hadamard manifold M [32] to get the equality. Similarly, we also have
0 ∈∂g(y k+1 ) + Pγyk (0)γyk (1) ∂y h(xk+1 , y k ) − ck Exp−1 yk . y k+1 Combining (29) and (30) , we get
Axk+1 ∈ ∂f (xk+1 ) + ∂x h(xk+1 , y k+1 ), which, together with Proposition 11, results in the assertion in (27).
Ak+1 ∈ ∂g(y k+1 ) + ∂y h(xk+1 , y k+1 ), y
(30)
Now, we estimate the norms of Ak+1 and Ak+1 . x y
kAk+1 kxk+1 ≤ck d(xk+1 , xk ) + k∂x h(xk+1 , y k+1 ) − Pγxk (0)γxk (1) ∂x h(xk , y k )kxk+1 x
1 ≤ ck d(xk+1 , xk ) + k∂x h(xk+1 , y k+1 ) − Pγxk (0)γxk (1) ∂x h(xk , y k+1 )kxk+1 + k∂x h(xk , y k+1 ) − ∂x h(xk , y k )kxk
2 ≤ (ck + L1 (y k+1 ))dM1 (xk+1 , xk ) + LdM2 (y k+1 , y k ) k+1 ≤(1 + µ1 )λ+ , xk ) + LdM2 (y k+1 , y k ), 1 dM1 (x
where the first two inequalities are obtained from triangular inequality and in 1 we used the fact kPγxk (0)γxk (1) ∂x h(xk , y k+1 ) − k k k k+1 k k Pγxk (0)γxk (1) ∂x h(x , y )kxk+1 = k∂x h(x , y ) − ∂x h(x , y )kxk , and in 2 we used the Lipschitz continuity in Assumption 1 (iii)-(iv) since we assume that {(xk , y k )}k∈N is bounded. Similarly, from the Lipschitz continuity of ∂y h, we get
kAk+1 kyk+1 ≤dk dM2 (y k+1 , y k ) + k∂y h(xk+1 , y k+1 ) − Pγyk (0)γyk (1) ∂y h(xk+1 , y k )kyk+1 y ≤dk dM2 (y k+1 , y k ) + L2 (xk+1 )dM2 (y k+1 , y k ) k+1 ≤(1 + µ2 )λ+ , y k ). 2 dM2 (y
Thus, we obtain the assertions in (28). It follows from Lemma 10 (ii) and Lemma 12 that
lim kAkx kxk = lim kAky kyk = 0.
k→∞
(31)
k→∞
In addition, for k ≥ 1 we have dist(0, ∂Ψ(xk , y k )) ≤k(Akx , Aky )k(xk ,yk ) ≤ β0 (kAkx k2xk + kAky k2yk )1/2 2 2 k k−1 2 2 k k−1 1/2 ≤β0 [2((1 + µ1 )λ+ ) + (2L2 + ((1 + µ2 )λ+ )] 1 ) dM1 (x , x 2 ) )dM2 (y , y
≤β(d2M1 (xk , xk−1 ) + d2M2 (y k , y k−1 ))1/2 → 0, as k → ∞, q √ + + 2 6 2 2(1 + µ1 )λ1 , (2L + ((1 + µ2 )λ2 ) ) . where β0 is a universal constant and β := β0 max
(32)
Under the assumption that {(xk , y k )}k∈N is bounded, there exists at least one limit point. We denote the set of all limit points as Γ(x0 , y 0 ), that is
Γ(x0 , y 0 ) := {(x∗ , y ∗ ) ∈ M1 × M2 : ∃ subsequence (xkj , y kj ) → (x∗ , y ∗ ) as j → ∞}. Some properties of Γ(x0 , y 0 ) are presented in Lemma 13. Lemma 13. Suppose Assumption 1 holds. Let {(xk , y k )}k∈N be the sequence generated by PALMR which is assumed to be bounded. Then the following assertions hold. (i)
Γ(x0 , y 0 ) ⊆ crit Ψ is a nonempty, compact and connected set and lim dist((xk , y k ), Γ(x0 , y 0 )) = 0.
(33)
k→∞
(ii)
Ψ is finite and constant on Γ(x0 , y 0 ), which equals to Ψ∗ .
Proof. (i) Since the sequence {(xk , y k )}k∈N is bounded, there exists at least one limit point, so Γ(x0 , y 0 ) is a nonempty bounded closed set, which, together with the Hopf-Rinow’s Theorem [27], implies that it is compact. It is easy to show lim dist((xk , y k ), Γ(x0 , y 0 )) = k→∞
0 by the definition of limit point. The connectedness can be proved by contradiction. Suppose Γ(x0 , y 0 ) is not connected, which means there exist two nonempty and closed disjoint subsets C0 and C1 of Γ(x0 , y 0 ) such that C0 ∪ C1 = Γ(x0 , y 0 ). According to the smooth Urysohn Lemma [46], there exists a smooth function ζ : M1 × M2 → [0, 1] such that C0 = ζ −1 (0) and C1 = ζ −1 (1). Setting U0 := [ζ < 1/4] and U1 := [ζ < 3/4], we obtain two open neighborhoods of C0 and C1 , respectively. Since lim dist((xk , y k ), Γ(x0 , y 0 )) = 0, there exists some integer k0 such that (xk , y k ) ∈ U0 ∪ U1 for all k ≥ k0 . Let δk = ζ(xk , y k ), k→∞ the the sequence {δk }k∈N satisfies (1) (2) (3) (4)
δk ∈ / [1/4, 3/4] for all k ≥ k0 , there exists infinitely many k such that δk < 1/4, there exists infinitely many k such that δk > 3/4, ζ is uniformly continuous on compact sets which, together with the boundedness of {(xk , y k )}, dM1 (xk+1 , xk ) → 0 and dM2 (y k+1 , y k ) → 0, implies that |δ k+1 − δk | → 0 as k → ∞,
6. For any (u, v) ∈ Tx M1 × Ty M2 , the function (u, v) → (kuk2x + kvk2y )−1/2 defines a norm on the linear space Tx M1 × Ty M2 . On the other hand, k(u, v)k(x,y) also defines a norm. Due to the equivalence of norms, there exists a universal constant β0 such that k(u, v)k(x,y) ≤ β0 (kuk2x + kvk2y )−1/2
simultaneously. However, there exists no sequence satisfying all the above four requirements. Therefore, Γ(x0 , y 0 ) is connected. Now, we show that every point (x∗ , y ∗ ) in Γ(x0 , y 0 ) is a critical point of Ψ. Suppose (xkj , y kj ) → (x∗ , y ∗ ) as j → ∞, since both f and g are PLS functions, we have
lim inf f (xkj ) ≥ f (x∗ ) and j→∞
lim inf g(y kj ) ≥ g(y ∗ ).
(34)
j→∞
From subproblem (10), we have D E c k −1 f (xkj ) + Exp−1 xkj , ∂x h(xkj −1 , y kj −1 ) + j d2M1 (xkj , xkj −1 ) xkj −1 D E c2 k −1 −1 ∗ ∗ kj −1 kj −1 ≤ f (x ) + Expxkj −1 x , ∂x h(x ,y ) + j d2M1 (x∗ , xkj −1 ), 2 which yields D E D E c k −1 ∗ kj −1 kj −1 kj kj −1 kj −1 f (xkj ) ≤f (x∗ ) + Exp−1 , y ) − Exp−1 , ∂ h(x , y ) + j d2M1 (x∗ , xkj −1 ) x kj −1 x , ∂x h(x kj −1 x x x 2 ckj −1 2 kj −1 kj −1 kj −1 ∗ ∗ kj −1 kj ))k∂x h(x ,y )kxkj −1 + ≤f (x ) + (dM1 (x , x ) + dM1 (x , x dM1 (x∗ , xkj −1 ), 2 where we used the Cauchy-Schwarz inequality in the last inequality. From Lemma 10 (ii), we know lim dM1 (xkj , xkj −1 ) = 0, j→∞
which implies lim xkj −1 = x∗ . Note that ckj −1 is bounded and ∂x h is continuous and thus bounded. Taking lim sup on both sides j→∞
j→∞
of the above inequality yields lim sup f (xkj ) ≤ f (x∗ ). By a similar argument we obtain lim sup g(y kj ) ≤ g(y ∗ ). Thus, in view of j→∞
j→∞
(34), we have
lim f (xkj ) = f (x∗ ) and
j→∞
lim g(y kj ) = g(y ∗ ),
j→∞
which leads to
lim Ψkj → Ψ(x∗ , y ∗ ).
j→∞ k
(35)
k
In addition, from Lemma 12 we know that (Axj , Ayj ) ∈ ∂Ψ(xkj , y kj ) which, together with equation (31) and the closedness of ∂Ψ, implies that 0 ∈ ∂Ψ(x∗ , y ∗ ). From the definition of critical point, we get (x∗ , y ∗ ) is a critical point of Ψ. Hence Γ(x0 , y 0 ) ⊆ crit Ψ. (ii) From equation (35) and the fact that {Ψk }k∈N converges to Ψ∗ , we get Ψ(x∗ , y ∗ ) = Ψ∗ for any (x∗ , y ∗ ) ∈ Γ(x0 , y 0 ). So assertion (ii) holds. Next, we prove that the sequence generated by PALMR converges to a critical point of the objective function in (9). For this purpose, we need the K-L property of Ψ. With this property, we have the following result, which is adapted from Lemma 6 of [36]. Lemma 14. Let Γ ⊆ M be a compact set and σ : M → R ∪ {+∞} be a PLS function such that it is constant on Γ and has K-L property at each point of Γ. Then, there exist > 0, η > 0 and a continuous concave φ satisfying Definition 4 (i), such that for all ¯ < σ(p) < σ(p) ¯ + η], we have p¯ ∈ Γ and all p ∈ {p ∈ M : dist(p, Γ) < } ∩ [σ(p)
¯ dist(0, ∂σ(p)) ≥ 1. φ0 (σ(p) − σ(p)) Equipped with all these results, we are ready to prove Theorem 5. Proof. Suppose the sequence {dM1 ×M2 ((x0 , y 0 ), (xk , y k ))}k∈N is bounded, or equivalently, the sequence {(xk , y k )}k∈N is bounded. Then Lemma 13 (i) shows that the limit set Γ(x0 , y 0 ) is a nonempty compact set and equalities (26) and (33) hold. If there exists integer k0 ≥ 0 such that Ψk0 = Ψ∗ , then the decreasing property (22) in Lemma 10 shows that {(xk , y k )}k≥k0 is stationary and the claimed results are trivial to prove. In the rest of the proof, we assume Ψk > Ψ∗ for all k ≥ 0. (i) Lemma 13 (ii) shows that Ψ is constant on Γ(x0 , y 0 ) and equals to Ψ∗ . Applying Lemma 14 with Γ = Γ(x0 , y 0 ) and σ = Ψ, there exist > 0, η > 0 and a continuous concave φ satisfying Definition 4 (i), such that for all (x, y) in the intersection
{(x, y) ∈ M1 × M2 : dist((x, y), Γ(x0 , y 0 )) < } ∩ [Ψ∗ < Ψ(x, y) < Ψ∗ + η], we have
φ0 (Ψ(x, y) − Ψ∗ )dist(0, ∂Ψ(x, y)) ≥ 1. For the above and η , equalities (26) and (33) imply that there exists integer k0 such that dist((xk , y k ), Γ(x0 , y 0 )) <
and
Ψ∗ < Ψk < Ψ∗ + η, ∀k ≥ k0 ,
which further implies
φ0 (Ψk − Ψ∗ )dist(0, ∂Ψ(xk , y k )) ≥ 1
(36)
holds for k ≥ k0 . Substituting inequality (32) into the above inequality, we get
φ0 (Ψk − Ψ∗ ) ≥ β −1 (d2M1 (xk , xk−1 ) + d2M2 (y k , y k−1 ))−1/2 . s
∗
t
(37)
∗
Define ∆s,t := φ(Ψ − Ψ ) − φ(Ψ − Ψ ), then the concavity of φ yields that for k ≥ k0
∆k,k+1 ≥φ0 (Ψk − Ψ∗ )(Ψk − Ψk+1 ) τ ≥ (d2M1 (xk , xk−1 ) + d2M2 (y k , y k−1 ))−1/2 (d2M1 (xk+1 , xk ) + d2M2 (y k+1 , y k )), 2β or equivalently, r
(d2M1 (xk+1 , xk )
√ where we used the fact that to k = K , we get K X k=k0
+
d2M2 (y k+1 , y k ))1/2
2β ∆k,k+1 (d2M1 (xk , xk−1 ) + d2M2 (y k , y k−1 ))1/2 τ β 1 ≤ ∆k,k+1 + (d2M1 (xk , xk−1 ) + d2M2 (y k , y k−1 ))1/2 , τ 2 ≤
ab ≤ (a + b)/2 for all a, b ≥ 0 in the second inequality. Summing up the above inequality from k = k0
K β 1 X 2 (d2M1 (xk+1 , xk ) + d2M2 (y k+1 , y k ))−1/2 ≤ ∆k0 ,K+1 + (d (xk , xk−1 ) + d2M2 (y k , y k−1 ))1/2 τ 2 k=k M1 0
K−1 β 1 X ≤ φ(Ψk0 − Ψ∗ ) + (d2 (xk+1 , xk ) + d2M2 (y k+1 , y k ))1/2 , τ 2 k=k −1 M1 0
where the first inequality follows from the fact that ∆s,r + ∆r,t = ∆s,t and the second inequality follows from φ ≥ 0. A simple manipulation of the above inequality yields K X
(d2M1 (xk+1 , xk ) + d2M2 (y k+1 , y k ))−1/2 ≤
k=k0
Taking K → ∞ leads to
∞ X
2β φ(Ψk0 − Ψ∗ ) + (d2M1 (xk0 , xk0 −1 ) + d2M2 (y k0 , y k0 −1 ))1/2 . τ
(d2M1 (xk+1 , xk ) + d2M2 (y k+1 , y k ))−1/2 < ∞,
k=k0
which implies that
∞ X
dM1 (xk+1 , xk ) < ∞
and
k=0
∞ X
dM2 (y k+1 , y k ) < +∞.
k=0 k
(ii) From assertion (i) we know that both sequences {x }k∈N and {y k }k∈N are Cauchy sequences on Hadamard manifolds. Thus, by the Hopf-Rinow’s Theorem we infer that both sequences converge. Denote the limit points by x∗ and y ∗ , then we conclude from Lemma 13 that (x∗ , y ∗ ) is a critical point of Ψ.
A PPENDIX B PARTIAL D ERIVATIVES OF E Denote p = Expp (v) the neighboring point of p along tangent vector direction v , where ∈ R and v is an arbitrary tangent vector ˆj denote the parallel transport of vj from p to p . By taking derivative of E p , {ˆ of p. Similarly let v vj }, {gi } at = 0 we have X j 1X ∂ E|=0 = ∂ d2 yic , Expp xi vˆj 2 i j =0 + * X X j −1 c yi , ∂ Expp = −Exp xi vˆj ˆj Expp xji v i j =0 * + X X j c = −Exp−1 xi vˆj |=0 ˆi yi , ∂ Expp y i
j
ˆi y
+
*
=
X
=
X
i
i
c −Exp−1 ˆi yi , dp Expp y
X
xji vj v
j
yˆi * + X j † −1 c , − dp Expp x i vj Expyˆi yi , v j
p
22 0 −2Exp−1 p p
where in the second equality we used the fact that ∂p d2M (p0 , p) = for Hadamard manifold M and in the last equality we used the definition of adjoint derivative. As a result, we have X j † X c xi vj Exp−1 ∂p E = − dp Expp ˆi yi . y j
i
For any fixed j0 satisfying 1 ≤ j0 ≤ d and arbitrary v from Tp M, let vj0 = vj0 +v and take derivative of E p, vj0 , {vj }j6=j0 , {gi } at = 0, we have * + X j X j0 −1 c xi vj + xi vj0 ∂ E|=0 = −Expyˆi yi , ∂ Expp i
=
c − Expy−1 ˆi yi , dv Expp
X
E xji vj (xji 0 v)
j
i
*
=
=0
j6=j0
XD
−xji 0
X
dv Expp
X
xji vj
†
yi
+ c Exp−1 ˆi yi , y
v
j
i
∂vj E = −
i
xji
dv Expp
. p
Thus, X
ˆi y
X
0 xji vj 0
†
c Exp−1 ˆi yi y
j0
for j = 1, · · · , d. By a similar procedure, we get
∂gi E = − dv Expyi (gi )
†
ˆi Exp−1 yic y
for i = 1, · · · , N .
APPENDIX C: MANIFOLD OF SYMMETRIC POSITIVE DEFINITE MATRICES

It is well known that the set of n × n SPD matrices S++(n) is a Riemannian symmetric space with nonpositive sectional curvature [47]. Hence, S++(n) is a Hadamard manifold. For any p ∈ S++(n), the tangent space at p, which we denote as T_pM, is the space of n × n symmetric matrices S(n). The inner product of two tangent vectors u, v ∈ T_pM is given by

$$\langle u, v\rangle_p = \mathrm{Tr}(p^{-1/2} u p^{-1} v p^{-1/2}), \qquad (38)$$

where Tr(·) is the trace of a matrix. The exponential map and the inverse exponential map are given by

$$\mathrm{Exp}_p(v) = p^{1/2}\exp(p^{-1/2} v p^{-1/2})\,p^{1/2} \qquad (39)$$

and

$$\mathrm{Exp}_p^{-1} q = p^{1/2}\log(p^{-1/2} q p^{-1/2})\,p^{1/2}, \qquad (40)$$

respectively, where exp(·) and log(·) denote the matrix exponential and logarithm [48], respectively. The geodesic distance between any two points p, q ∈ S++(n) is given by

$$d(p, q) = \big[\mathrm{Tr}\big(\log^2(p^{-1/2} q p^{-1/2})\big)\big]^{1/2}. \qquad (41)$$

Let v ∈ T_pM; the parallel transport of v along the geodesic from p to q is given by

$$P_{pq}(v) = p^{1/2} u\, p^{-1/2} v\, p^{-1/2} u\, p^{1/2}, \qquad (42)$$

where $u = \exp\big(p^{-1/2}\cdot\mathrm{Exp}_p^{-1}q\cdot p^{-1/2}/2\big)$.
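The formulas (38)-(42) translate directly into code; below is a minimal Python sketch (the eigendecomposition-based helpers and function names are our own choices).

```python
import numpy as np

def _apply(a, fun):
    # apply a scalar function to the eigenvalues of a symmetric matrix a
    w, q = np.linalg.eigh(a)
    return (q * fun(w)) @ q.T

def inner(p, u, v):
    # eq. (38): <u, v>_p = Tr(p^{-1/2} u p^{-1} v p^{-1/2}) = Tr(p^{-1} u p^{-1} v)
    pinv = np.linalg.inv(p)
    return np.trace(pinv @ u @ pinv @ v)

def exp_map(p, v):
    # eq. (39)
    s, si = _apply(p, np.sqrt), _apply(p, lambda w: 1.0 / np.sqrt(w))
    return s @ _apply(si @ v @ si, np.exp) @ s

def log_map(p, q):
    # eq. (40)
    s, si = _apply(p, np.sqrt), _apply(p, lambda w: 1.0 / np.sqrt(w))
    return s @ _apply(si @ q @ si, np.log) @ s

def dist(p, q):
    # eq. (41)
    si = _apply(p, lambda w: 1.0 / np.sqrt(w))
    l = _apply(si @ q @ si, np.log)
    return np.sqrt(np.trace(l @ l))

def transport(p, q, v):
    # eq. (42): P_pq(v) = E v E^T with E = p^{1/2} u p^{-1/2}, u = exp(log(p^{-1/2} q p^{-1/2}) / 2)
    s, si = _apply(p, np.sqrt), _apply(p, lambda w: 1.0 / np.sqrt(w))
    u = _apply(si @ q @ si, np.sqrt)    # equals exp(log(.)/2)
    e = s @ u @ si
    return e @ v @ e.T
```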
APPENDIX D: ADDITIONAL EXPERIMENTS

In Subsection 4.2 of the main paper, we conducted experiments on real DTI data with three different settings: no gross error, 20% manual gross error, and 20% registration error. In Fig. 10, we show the tensor deviation between the training data without gross error and the training data with 20% registration error. On each voxel, we compute a deviation vector consisting of 58 (the number of patients in the training data) entries, where each entry records the geodesic distance between the tensor without gross error and the tensor with 20% registration error of the same patient. The maximum, minimum, and median values of the deviation vector are shown in three heat maps, respectively. From Fig. 10, we observe that the magnitude of registration errors varies considerably among different patients and different voxels. In particular, on the boundary region of the mask the registration error can be very high, as shown in the first column of Fig. 10. However, such high registration error on the boundary happens to only a small fraction of patients, since in the same regions
the median deviation is usually small, as shown in the last column of Fig. 10. In addition, the last two columns of Fig. 10 demonstrate that registration errors appear in a structured manner: registration errors with moderate median deviation appear in the interior of the white matter, while only a tiny fraction of errors with extreme magnitude appear on the boundary. Based on this observation, in our experiments we focus on those voxels whose minimum registration errors are larger than a certain threshold, that is, the interior regions of the white matter.
In the main paper, we presented the distribution of prediction errors on each slice for all comparison methods. In Fig. 11, we show the comparison results of MGLM and PALMR on each voxel of all six slices, separately. On each voxel, we compare the median prediction error (measured in both the relative FA error and MSGE) of MGLM and PALMR on the testing data. If MGLM achieves a smaller error than PALMR, we assign -1 to the voxel; if MGLM achieves a larger error, we assign 1; if the voxel is outside the mask or the error difference between MGLM and PALMR is less than 1e-3, we assign 0. Since experiments on the training data with 20% registration error were not conducted on the whole slice, we do not show the comparison results for that setting. From Fig. 11 we have two observations: (1) When the training data contain no gross error, MGLM and PALMR have similar prediction errors with negligible difference on many voxels, while on the remaining voxels MGLM is better on about half and PALMR is better on the other half; this can be seen in the first row of each subfigure. (2) When 20% of the training data contain gross errors, PALMR outperforms MGLM on most voxels (i.e., a large portion of white region); this can be observed from the second row of each subfigure.
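As an illustration of this voxel-wise coding (ours, not from the paper), the following Python sketch builds the comparison map from two hypothetical arrays of per-voxel median prediction errors and a binary white matter mask; the array names and the 1e-3 tolerance mirror the description above.

```python
import numpy as np

def comparison_map(err_mglm, err_palmr, mask, tol=1e-3):
    """Encode the per-voxel comparison of MGLM vs. PALMR median prediction errors.

    err_mglm, err_palmr : arrays of median prediction errors, one value per voxel
    mask                : boolean array, True for voxels inside the white matter mask
    Returns an array with -1 where MGLM is better, 1 where PALMR is better,
    and 0 where the voxel is outside the mask or the difference is below tol.
    """
    out = np.zeros_like(err_mglm, dtype=int)
    diff = err_mglm - err_palmr                  # positive: PALMR has smaller error
    significant = mask & (np.abs(diff) >= tol)
    out[significant & (diff < 0)] = -1           # MGLM achieves smaller error
    out[significant & (diff > 0)] = 1            # PALMR achieves smaller error
    return out
```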
D.1 Region of Interest Analysis
To further test the predictive power and robustness of our model, we conducted a region of interest (ROI) analysis on the real DTI data. First, we used all 58 instances as training data and focused on white matter tensors in slice z = 32. Then, we selected two ROIs [49]: the genu of the corpus callosum (ROI 1) and the posterior limb of the internal capsule (ROI 2), as shown in Fig. 12. Both ROIs are believed to be affected by age or gender, and each ROI contains 8 × 8 voxels. After that, for each voxel in each ROI we trained our model and MGLM using all 58 training instances, either without gross errors or with β = 20% manual gross errors of magnitude σg = 5, and applied the trained models to predict DTI tensors for different ages. Visualization results of MGLM and our model in both ROIs for different ages (age = 10, 50, and 90) are shown in Fig. 13 and Fig. 14, respectively. In subfigure (a) of each figure, we use a blue-dotted box to highlight those voxels where the DTI tensors show large visual variation in both size and orientation with respect to age.
To investigate the effect of gender, we apply our model to predict tensors for both male and female subjects on the ROIs. For the sake of comparison, in each voxel we apply our model to predict two tensors, one for male and the other for female, using the same age, and then compute the geodesic distance between these two tensors (a sketch of this computation is given at the end of this subsection). Visualization results are shown in Fig. 15.
In Fig. 13 and Fig. 14, we observe that both MGLM and PALMR, when trained on data without gross errors, can identify a few voxels (highlighted with a blue box) where the shape and orientation of tensors change significantly as age increases. Moreover, the identified voxels are contained in the region of high white matter intensity shown in Fig. 12. This observation is consistent with findings in neuroscience [50] that with increasing age, FA values increase in the internal capsule and the corpus callosum, and that the posterior limb of the internal capsule and the corpus callosum show the most significant overlaps between white matter intensity and age-related FA changes. We also observe that PALMR gives predictions similar to MGLM when both are trained on data without gross error, but outperforms MGLM when gross errors are present in the training data: with 20% manual gross errors, PALMR still predicts tensors showing meaningful diffusion trends, while the predictions of MGLM are nearly random. In Fig. 15, we observe that in some regions (e.g., within ROI 1 and ROI 2) the diffusion tensors of male and female subjects differ considerably in shape and orientation. Moreover, within each ROI the voxels significantly affected by gender are contained in the region of high FA values. These observations imply that gender plays an important role in shaping the diffusion trends in the brain and is worth further study.
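The gender-effect map of Fig. 15 can be sketched as follows (ours, not part of the paper); `predict_tensor` is a hypothetical stand-in for the per-voxel trained PALMR predictor, and the geodesic distance is the SPD distance of Eq. (41) in Appendix C.

```python
import numpy as np
from scipy.linalg import logm

def spd_dist(p, q):
    # geodesic distance on SPD matrices, Eq. (41)
    w, u = np.linalg.eigh(p)
    phi = u @ np.diag(w ** -0.5) @ u.T           # p^{-1/2}
    l = logm(phi @ q @ phi).real
    return np.sqrt(np.trace(l @ l))

def gender_effect_map(predict_tensor, voxels, age):
    """For each voxel, distance between the tensors predicted for male and female.

    predict_tensor(voxel, age, gender) is assumed (hypothetically) to return a
    3x3 SPD matrix from the model trained on that voxel; gender is 'male' or 'female'.
    """
    return {v: spd_dist(predict_tensor(v, age, 'male'),
                        predict_tensor(v, age, 'female'))
            for v in voxels}
```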
Fig. 10. Tensor deviation between training data without gross error and with 20% registration error, shown for six slices: (a) z = 32, (b) x = 55, (c) y = 64, (d) z = 24, (e) x = 64, (f) y = 45. On each voxel, we compute a deviation vector of length 58 (the number of training data) whose maximum, minimum, and median values are shown in three heat maps, respectively.
Fig. 11. Performance comparison of MGLM and PALMR on six slices: (a) z = 32, (b) x = 55, (c) y = 64, (d) z = 24, (e) x = 64, (f) y = 45. On each voxel, we compare the median prediction error (measured in both the relative FA error and MSGE) of MGLM and PALMR on the testing data. If MGLM achieves a smaller error than PALMR, we assign -1 to the voxel; if MGLM achieves a larger error, we assign 1; if the voxel is outside the mask or the error difference between MGLM and PALMR is less than 1e-3, we assign 0. In each subfigure, there are four subplots corresponding to two experimental settings under two error metrics.
Fig. 12. Slice z = 32 and two ROIs. For each voxel in the ROIs, the underlying intensity indicates the FA value.
Fig. 13. Visualization of the aging effect on ROI 1. Columns correspond to age = 10, 50, and 90 (male). The first two rows show results from MGLM and PALMR trained on data without gross error; the last two rows show results from MGLM and PALMR trained on data with 20% manual gross errors. Only tensors in the white matter mask are illustrated. Better viewed in color.
Fig. 14. Visualization of the aging effect on ROI 2. Columns correspond to age = 10, 50, and 90 (male). The first two rows show results from MGLM and PALMR trained on data without gross error; the last two rows show results from MGLM and PALMR trained on data with 20% manual gross errors. Only tensors in the white matter mask are illustrated. Better viewed in color.
Fig. 15. Visualization of the gender effect on different ROIs using tensors predicted by PALMR: (a) ROI 1, age = 10; (b) ROI 1, age = 50; (c) ROI 1, age = 90; (d) ROI 2, age = 10; (e) ROI 2, age = 50; (f) ROI 2, age = 90. The value in each voxel denotes the geodesic distance between the tensors of male and female obtained by our model using training samples without gross error. Better viewed in color.