Mathematical details of the GENEFOLD algorithm
K. Van Deun
Catholic University of Leuven, Belgium
1 The Loss Function
Consider n genes i = 1, ..., n whose expression was measured in m conditions j = 1, ..., m. In the unfolding model, both are represented as points in a low-dimensional space such that, in the nonmetric case, the order of the distances from a gene to the different conditions reflects the order of the expression levels, and, in the metric case, the differences between the distances reflect the differences between the expression levels. Such a configuration can be obtained by minimizing the following loss function,

$$K(\mathbf{X}, \mathbf{Y}, \boldsymbol{\Gamma}) = n^{-1} \sum_{i=1}^{n} K_i(\mathbf{x}_i, \mathbf{Y}, \boldsymbol{\gamma}_i) = n^{-1} \sum_{i=1}^{n} \left[ 1 - r^2(\boldsymbol{\gamma}_i, \mathbf{d}_i) \right], \qquad (1)$$
with $\mathbf{d}_i$ the vector of distances for gene i and $\boldsymbol{\gamma}_i$ the transformed expression levels (with an inverse relation between the $\gamma_{ij}$'s and the expression levels: higher expression levels correspond to lower $\gamma_{ij}$'s). Here, we will use Euclidean distances, such that the distance between gene i and condition j is given by $d_{ij} = [(\mathbf{x}_i - \mathbf{y}_j)'(\mathbf{x}_i - \mathbf{y}_j)]^{1/2}$, where $\mathbf{x}_i$ and $\mathbf{y}_j$ are the coordinate vectors for gene i and condition j. As shown by Kruskal and Carroll (1969) and by Van Deun, Groenen, Heiser, Busing, and Delbeke (2005), minimizing (1) or, equivalently, maximizing the squared correlation between the transformed data and the distances is equivalent to minimizing Stress-2. Van Deun et al. (2005) showed that under the condition $\boldsymbol{\gamma}_i' \mathbf{J} \boldsymbol{\gamma}_i = 1$, loss function (1) is equivalent to

$$K(\mathbf{X}, \mathbf{Y}, \boldsymbol{\Gamma}, \mathbf{a}) = n^{-1} \sum_{i=1}^{n} K_i(\mathbf{x}_i, \mathbf{Y}, \boldsymbol{\gamma}_i, a_i) = n^{-1} \sum_{i=1}^{n} \| \boldsymbol{\gamma}_i - a_i \mathbf{d}_i \|_{\mathbf{J}}^2, \qquad (2)$$
where $a_i$ is a scale parameter and $\mathbf{J} = \mathbf{I} - m^{-1}\mathbf{1}\mathbf{1}'$ a centering operator. The $a_i$ have to be restricted to be nonnegative in order to fit positive correlations between the distances and the transformed data, as we want the distance from a gene to a condition to increase when the gene has a lower expression in that condition. An important aspect of the loss is that it is invariant to centering of Y because the distances $\mathbf{d}_i$ do not change. We will minimize (2) by alternating least squares and iterative majorization, using the algorithm presented here and named GENEFOLD:

1. Choose initial X^(0), Y^(0), and a^(0).
2. l := 0.
3. l := l + 1.
4. Compute the update of the gene coordinates X given Y^(l-1), Γ^(l-1), and a^(l-1).
5. Compute the update of the condition coordinates Y given X^(l), Γ^(l-1), and a^(l-1).
6. Compute the update of the transformed data Γ given X^(l), Y^(l), and a^(l-1).
7. Compute the update of the weights a based on X^(l), Y^(l), and Γ^(l).
8. If L^(l-1) − L^(l) < ε or l = l_max, stop. Else return to step 3.

Suitable updates are easily found for step 7, and also for step 6 in the metric unfolding case. The more difficult parts are steps 4 and 5 and, in the nonmetric unfolding case, also step 6. The updates of the coordinates (steps 4 and 5) will be based on majorization procedures. Below we develop each update; a code sketch of the overall scheme follows first.
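To make the alternating scheme concrete, the following Python/NumPy skeleton mirrors steps 1-8. It is a minimal sketch, not a reference implementation: the `updates` callables stand for the closed-form updates derived in the sections below and are hypothetical placeholders; only the loss (2) is spelled out.

```python
import numpy as np

def loss_value(X, Y, Gamma, a):
    """Loss (2): n^{-1} sum_i || gamma_i - a_i d_i ||_J^2."""
    D = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)  # (n, m) distances d_ij
    R = Gamma - a[:, None] * D
    R -= R.mean(axis=1, keepdims=True)                         # apply J (row-centering)
    return (R ** 2).sum(axis=1).mean()

def genefold(delta, updates, p=2, eps=1e-8, l_max=500, seed=0):
    """Skeleton of the GENEFOLD alternating scheme (steps 1-8).

    delta   : (n, m) matrix of expression levels.
    updates : dict of callables 'X', 'Y', 'Gamma', 'a' implementing the
              closed-form updates derived below (placeholders here).
    """
    rng = np.random.default_rng(seed)
    n, m = delta.shape
    X = 0.1 * rng.standard_normal((n, p))            # step 1: initial configuration
    Y = 0.1 * rng.standard_normal((m, p))
    Y -= Y.mean(axis=0)                              # the loss is invariant to centering Y
    a = np.ones(n)
    Gamma = updates['Gamma'](X, Y, None, a, delta)   # metric case: fixed transform (15)

    loss_prev = np.inf
    for l in range(1, l_max + 1):                    # steps 2-3
        X = updates['X'](X, Y, Gamma, a)             # step 4
        Y = updates['Y'](X, Y, Gamma, a)             # step 5
        Gamma = updates['Gamma'](X, Y, Gamma, a, delta)  # step 6
        a = updates['a'](X, Y, Gamma)                # step 7
        loss = loss_value(X, Y, Gamma, a)
        if loss_prev - loss < eps:                   # step 8: stop on a small decrease
            break
        loss_prev = loss
    return X, Y, Gamma, a
```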
1.1 Updating the coordinates
In this section, we will derive expressions that update the coordinates X and Y. As the loss function is complicated, we will replace it by majorizing functions that are easy to minimize (see De Leeuw, 1994, Heiser, 1995, and Lange, Hunter, & Yang, 2000 for a general introduction to iterative majorization). These functions are never lower than the original function and touch the original function at a supporting point. By taking the previous estimate as the supporting point in the current iteration, it is guaranteed that the loss function is nonincreasing with each new update. In practice, the values of the loss function converge to a local minimum. When the original function has a bounded Hessian, it can be majorized by a quadratic function. Kiers (2002) introduced one that can be easily updated both in constrained and unconstrained cases. This function is of the following form:

$$g(\mathbf{Z}) = k + \mathrm{tr}\,\mathbf{A}\mathbf{Z} + \sum_q \mathrm{tr}\,\mathbf{B}_q \mathbf{Z} \mathbf{C}_q \mathbf{Z}', \qquad (3)$$
where k includes all terms that are constant in Z, with Z denoting the matrix we want to update (in our case, $\mathbf{x}_i$ or Y). So, we will look for a function that majorizes (2) and that is of the form (3).
As explained by Kiers (2002), when the $\mathbf{C}_q$ are positive semi-definite, the minimization of (3) is itself based on the majorizing function

$$g(\mathbf{Z}) \le k + \mathrm{tr}\,\mathbf{F}\mathbf{Z} + \sum_q \lambda_q\, \mathrm{tr}\,\mathbf{C}_q \mathbf{Z}'\mathbf{Z}, \qquad (4)$$

with $\lambda_q$ the largest eigenvalue of $2^{-1}\mathbf{B}_q + 2^{-1}\mathbf{B}_q'$ and with $\mathbf{F} = \mathbf{A} + \sum_q \left[ \mathbf{C}_q \mathbf{Z}_0' (\mathbf{B}_q + \mathbf{B}_q') - 2\lambda_q \mathbf{C}_q \mathbf{Z}_0' \right]$. Without constraints, the optimal Z is given by

$$\mathbf{Z}^+ = -\frac{1}{2} \mathbf{F}' \left( \sum_q \lambda_q \mathbf{C}_q \right)^{-1}. \qquad (5)$$

The function that we want to minimize is given by (2), that is,

$$L(\mathbf{X}, \mathbf{Y}) = n^{-1} \sum_{i=1}^{n} \left( 1 + a_i^2\, \mathbf{d}_i' \mathbf{J} \mathbf{d}_i - 2 a_i\, \mathbf{d}_i' \mathbf{J} \boldsymbol{\gamma}_i \right). \qquad (6)$$
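The generic recipe (3)-(5) translates directly into code. The sketch below is a hedged illustration, assuming the A, B_q, and C_q matrices have already been built and that the C_q are symmetric positive semi-definite (scalar B_q can be passed as 1×1 arrays); it assembles F at the support point Z_0 and returns the unconstrained minimizer (5).

```python
import numpy as np

def kiers_update(A, Bs, Cs, Z0):
    """One quadratic-majorization update for g(Z) = k + tr(A Z) + sum_q tr(B_q Z C_q Z').

    A      : (c, r) linear-term matrix, for Z of shape (r, c);
    Bs, Cs : lists of the (r, r) B_q and (c, c) C_q matrices;
    Z0     : (r, c) support point (the previous estimate).
    Returns the unconstrained minimizer (5) of the majorizer (4).
    """
    F = A.astype(float).copy()
    lam_C = np.zeros(Cs[0].shape, dtype=float)
    for B, C in zip(Bs, Cs):
        Bsym = 0.5 * (B + B.T)
        lam = np.linalg.eigvalsh(Bsym).max()       # largest eigenvalue of (B + B')/2
        F += C @ Z0.T @ (B + B.T) - 2.0 * lam * (C @ Z0.T)
        lam_C += lam * C
    # Z+ = -1/2 F' (sum_q lam_q C_q)^{-1}, using symmetry of the C_q
    return -0.5 * np.linalg.solve(lam_C, F).T
```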
To minimize this function, we will construct a quadratic majorizing function of the basic form (3). As majorization is closed under summation, we will majorize each of the terms that is not a linear or quadratic function of the coordinates. These are the two terms involving distances, namely $a_i^2 \mathbf{d}_i' \mathbf{J} \mathbf{d}_i$ and $-2 a_i \mathbf{d}_i' \mathbf{J} \boldsymbol{\gamma}_i$. Note that we have to update two sets of coordinates, X and Y, where the genes enter the loss function through a sum over all n genes. This allows us to update the coordinates $\mathbf{x}_i$ for each of the genes separately (the sum is minimized by minimizing each of its components). Next, we will show how each of the terms can be expressed in the standard form (3), both for the condition coordinate matrix Y and for the gene coordinate vector $\mathbf{x}_i$. For the sake of clarity, we will use a notation that follows closely the notation in (3), with numbered and lettered subscripts: we will use a subscript x or a subscript y when the term is expressed as a function of $\mathbf{x}_i$ or Y, respectively. The numbers count the different terms that are in the same form; for example, $k_{2x}$ is the second term that is constant in $\mathbf{x}_i$. How the terms $a_i^2 \mathbf{d}_i' \mathbf{J} \mathbf{d}_i$ and $-2 a_i \mathbf{d}_i' \mathbf{J} \boldsymbol{\gamma}_i$ can be majorized by functions that are quadratic in the coordinates is explained in the appendix.
1.2 Updating X
We replace the second and third terms of equation (6) by their majorizing functions, consisting of the sum of (21), (25), (29), and (32) developed in the appendix. The following majorizing function is then obtained for the $\mathbf{x}_i$:

$$g(\mathbf{x}_i) = k_x + \mathrm{tr}\,\mathbf{A}_x \mathbf{x}_i + \sum_{q=1}^{2} \mathrm{tr}\,\mathbf{B}_{qx} \mathbf{x}_i \mathbf{C}_{qx} \mathbf{x}_i', \qquad (7)$$

with $k_x = k_{1x} + k_{2x} + k_{3x} + k_{4x}$ and $\mathbf{A}_x = -2(\mathbf{A}_{1x} + \mathbf{A}_{2x} + \mathbf{A}_{3x} + \mathbf{A}_{4x})$. Note that because of the centered Y, $\mathbf{A}_{1x} = \mathbf{0}$. This majorizing function can in turn be majorized as done by Kiers (2002) (see equation (4)),

$$g(\mathbf{x}_i) \le k_x + \mathrm{tr}\,\mathbf{F}_x \mathbf{x}_i + \sum_{q=1}^{2} \lambda_{qx}\, \mathrm{tr}\,\mathbf{C}_{qx} \mathbf{x}_i' \mathbf{x}_i, \qquad (8)$$

with $\lambda_{qx}$ the largest eigenvalue of $2^{-1}\mathbf{B}_{qx} + 2^{-1}\mathbf{B}_{qx}'$ and with $\mathbf{F}_x = \mathbf{A}_x + \sum_q \left[ \mathbf{C}_{qx} \mathbf{x}_{i0}' (\mathbf{B}_{qx} + \mathbf{B}_{qx}') - 2\lambda_{qx} \mathbf{C}_{qx} \mathbf{x}_{i0}' \right]$. As $\mathbf{B}_{1x}$ and $\mathbf{B}_{2x}$ are scalars, $\lambda_{1x}$ and $\lambda_{2x}$ equal these. This also implies that $\mathbf{C}_{qx} \mathbf{x}_{i0}' (\mathbf{B}_{qx} + \mathbf{B}_{qx}') = 2\lambda_{qx} \mathbf{C}_{qx} \mathbf{x}_{i0}'$, so that $\mathbf{F}_x = \mathbf{A}_x$. When there are no constraints, the optimal $\mathbf{x}_i$ is given by

$$\mathbf{x}_i^+ = -\frac{1}{2} \mathbf{F}_x' \left( \sum_q \lambda_{qx} \mathbf{C}_{qx} \right)^{-1}, \qquad (9)$$

where $\sum_q \lambda_{qx} \mathbf{C}_{qx}$ is a scalar whose inverse is easily obtained. A relaxed update $\mathbf{x}_i^{++}$ can be used to accelerate convergence (see De Leeuw & Heiser, 1980 and Heiser, 1995),

$$\mathbf{x}_i^{++} = 2\mathbf{x}_i^+ - \mathbf{x}_i. \qquad (10)$$

The relaxed update still guarantees a nonincreasing sequence of loss values. Because the formulas connected to the update of $\mathbf{x}_i$ do not involve another gene i', the order in which the genes are updated does not influence the outcome of the unfolding analysis.
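In code, a single gene point is therefore updated in closed form. A minimal sketch under the simplifications above (F_x = A_x, scalar C_qx); the inputs are assumed to have been assembled from the appendix formulas, and the names are illustrative.

```python
import numpy as np

def update_x_i(A_x, lam_C_1, lam_C_2, x_i):
    """Unconstrained update (9) followed by the relaxed step (10).

    A_x : (1, p) linear-term matrix of (7); here F_x = A_x.
    lam_C_1, lam_C_2 : the scalars lambda_qx * C_qx of (8).
    """
    x_plus = -0.5 * A_x.ravel() / (lam_C_1 + lam_C_2)  # equation (9)
    return 2.0 * x_plus - x_i                          # relaxed update (10)
```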
1.3 Updating Y
The majorizing function for the Y coordinates is given by replacing the second and third terms of equation (6) by the sum of (22), (26), (30), and (33) of the appendix, yielding

$$g(\mathbf{Y}) = k_y + \mathrm{tr}\,\mathbf{A}_y \mathbf{Y} + \sum_{q=1}^{2} \mathrm{tr}\,\mathbf{B}_{qy} \mathbf{Y} \mathbf{C}_{qy} \mathbf{Y}', \qquad (11)$$

with $k_y = k_{1y} + k_{2y} + k_{3y} + k_{4y}$ and $\mathbf{A}_y = -2(\mathbf{A}_{1y} + \mathbf{A}_{2y} - \mathbf{A}_{3y} + \mathbf{A}_{4y})$. The update is then given by

$$\mathbf{Y}^+ = -\frac{1}{2} \mathbf{F}_y' \left( \sum_q \lambda_{qy} \mathbf{C}_{qy} \right)^{-1}, \qquad (12)$$

with $\mathbf{F}_y = \mathbf{A}_y + \sum_q \left( \mathbf{C}_{qy} \mathbf{Y}_0' (\mathbf{B}_{qy} + \mathbf{B}_{qy}') - 2\lambda_{qy} \mathbf{C}_{qy} \mathbf{Y}_0' \right)$ and $\lambda_{qy}$ the largest eigenvalue of $2^{-1}\mathbf{B}_{qy} + 2^{-1}\mathbf{B}_{qy}'$. Given the special form of the $\mathbf{B}_{qy}$ matrices, the eigenvalues can be set equal to, respectively, $\sum_i c_{1i}$ and the maximal value on the diagonal of $\sum_i c_{3i} \mathbf{D}_{i0}^{-1}$, and it follows that $\mathbf{F}_y = \mathbf{A}_y$. The inverse of $\sum_q \lambda_{qy} \mathbf{C}_{qy}$ can be easily obtained as it is a diagonal matrix. Here, too, the number of iterations can be approximately halved using the relaxed update:

$$\mathbf{Y}^{++} = 2\mathbf{Y}^+ - \mathbf{Y}. \qquad (13)$$
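A matching sketch for the condition coordinates, again under stated assumptions: A_y has been assembled from the appendix formulas (so F_y = A_y), and C_1y = C_2y = I, so the matrix to invert reduces to (λ_1y + λ_2y)I.

```python
import numpy as np

def update_Y(A_y, lam_1y, lam_2y, Y):
    """Update (12) of the condition coordinates with the relaxed step (13).

    A_y : (p, m) linear-term matrix of (11); here F_y = A_y.
    lam_1y, lam_2y : the eigenvalues of the B_qy matrices.
    """
    Y_plus = -0.5 * A_y.T / (lam_1y + lam_2y)  # equation (12)
    return 2.0 * Y_plus - Y                    # relaxed update (13)
```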
1.4 Updating Γ
The loss for gene i can be written as

$$L_i(\boldsymbol{\gamma}_i) = \| \boldsymbol{\gamma}_i - a_i \mathbf{d}_i \|_{\mathbf{J}}^2, \qquad (14)$$

with $\boldsymbol{\gamma}_i$ a function of $\boldsymbol{\delta}_i$; for example, in the case of metric unfolding, $\gamma_{ij} = b_{0i} + b_{1i}\delta_{ij}$ under the constraint $b_{1i} \le 0$. This function can be minimized by (monotone) regression of the expression profiles $\boldsymbol{\delta}_i$ on $a_i \mathbf{d}_i$, under the nonnegativity constraint $\gamma_{ij} \ge 0$ and the length constraint $\boldsymbol{\gamma}_i' \mathbf{J} \boldsymbol{\gamma}_i = 1$. Let $\boldsymbol{\delta}_i$ be the expression profile for gene i; then fixed $\boldsymbol{\gamma}_i$ of the following form are obtained for data with an interval level of measurement (see also Van Deun et al., 2005):

$$\boldsymbol{\gamma}_i = k - (\boldsymbol{\delta}_i' \mathbf{J} \boldsymbol{\delta}_i)^{-1/2}\, \boldsymbol{\delta}_i, \qquad (15)$$
with k an arbitrary constant chosen such that all $\gamma_{ij} \ge 0$. Equation (15) clearly shows that the update is independent of the coordinate values; therefore, the $\boldsymbol{\gamma}_i$'s can be fixed right away in the metric unfolding case. In the nonmetric case, as shown by Gifi (1990), the length restriction can be imposed after the monotone regression by dividing the updated disparities by $(\boldsymbol{\gamma}_i' \mathbf{J} \boldsymbol{\gamma}_i)^{1/2}$. In case all $\gamma_{ij}$ are equal, Groenen, Os, and Meulman (2000) show how the optimal disparities are given by $\boldsymbol{\gamma}_i = \mathbf{S}\mathbf{b}_{opt}$ with $\mathbf{S} = \mathbf{J}\mathbf{L}\mathbf{M}$, where M is a matrix with ones on and below the diagonal and zeros above it, and L is a matrix composed of rows with zeros and a one in the r-th column of row j, where r is the rank value of $d_{ij}$. Let $\mathbf{G} = \mathbf{S}'\mathbf{S}$; then $\mathbf{b}_{opt}$ is a vector of all zeros except one entry, which equals $g_{ii}^{-1}$ (see also Van Deun et al., 2005). The nonzero value is chosen such that $\mathbf{d}'\mathbf{d} + 1 - 2\mathbf{b}'\mathbf{S}'\mathbf{d}$ is minimal. An additive constant can be added without loss of generality, and this property can be used to make all disparities nonnegative (since $\mathbf{S} = \mathbf{J}\mathbf{L}\mathbf{M}$ is centered). Nonnegative $\gamma_{ij}$'s are important because the majorizing functions for X and Y depend on the assumption of nonnegative $\gamma_{ij}$'s.
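In the metric case the fixed transform (15) takes only a few lines of code. A minimal sketch assuming NumPy and a nonconstant expression profile; the constant k is chosen as the smallest value that makes all γ_ij nonnegative.

```python
import numpy as np

def gamma_metric(delta_i):
    """Fixed metric transform (15): gamma_i = k - (delta_i' J delta_i)^(-1/2) delta_i.

    Assumes delta_i is not constant. Higher expression levels map to
    lower gamma_ij, and the result satisfies gamma_i' J gamma_i = 1.
    """
    centered = delta_i - delta_i.mean()           # J delta_i
    g = -delta_i / np.sqrt(centered @ centered)   # slope -(delta_i' J delta_i)^(-1/2)
    return g - g.min()                            # k chosen so that all gamma_ij >= 0
```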
1.5 Updating a
The optimal value for $a_i$ can be found by setting the partial derivative of (6) with respect to $a_i$ equal to zero:

$$\frac{\partial L}{\partial a_i} = n^{-1} \left( 2 a_i\, \mathbf{d}_i' \mathbf{J} \mathbf{d}_i - 2\, \mathbf{d}_i' \mathbf{J} \boldsymbol{\gamma}_i \right) = 0. \qquad (16)$$

Under the constraint that $a_i \ge 0$, the following update is obtained:

$$a_i = \begin{cases} \mathbf{d}_i' \mathbf{J} \boldsymbol{\gamma}_i\, (\mathbf{d}_i' \mathbf{J} \mathbf{d}_i)^{-1} & \text{if } \mathbf{d}_i' \mathbf{J} \boldsymbol{\gamma}_i \ge 0, \\ 0 & \text{otherwise}. \end{cases} \qquad (17)$$
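Per gene, the weight update (17) is essentially a one-liner once the distances are in hand. A minimal sketch; J-centering is applied by subtracting means, which is valid because J is symmetric and idempotent.

```python
import numpy as np

def update_a_i(d_i, gamma_i):
    """Nonnegative scale update (17) for one gene."""
    d_c = d_i - d_i.mean()          # J d_i
    g_c = gamma_i - gamma_i.mean()  # J gamma_i
    a = (d_c @ g_c) / (d_c @ d_c)   # d_i' J gamma_i (d_i' J d_i)^(-1)
    return max(a, 0.0)              # clip negative solutions to zero
```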
2 Adding constraints
To avoid degenerate solutions of the unfolding model, we will build in some constraints. The degeneracy shows itself in a lack of variation of the distances from a gene point to the condition points. Imposing a variance constraint in order to obtain a minimal amount of variance seems a straightforward solution but is insufficient, given the scale dependency of the variance. In other words, it makes no sense to fix the variance at a certain value if the distances can still become very large; some sort of standardization is needed. We propose to work in a space centered at Y and with $\|\mathbf{x}_i\| \le 1$, that is, restricting the $\mathbf{x}_i$ to the unit disk. Within this space, we pull the variance of the distances $\mathbf{d}_i$ towards values that are at least $v_m$, with $v_m$ the variance of the distances from a point on the unit disk to m other points on the unit disk. $v_m$ is determined by simulation, as we could not find a closed-form expression for it. It will become clear that this minimal value is enforced only for points that fit perfectly.

As explained by Kiers (2002), the equality constraint $\|\mathbf{x}_i\|^2 = 1$ can be imposed by adapting step 4 of the algorithm as follows. Equation (8) can be rewritten as a linear function under the constraint $\mathbf{x}_i'\mathbf{x}_i = 1$, namely $k_x + \mathrm{tr}\,\mathbf{F}_x\mathbf{x}_i + \sum_{q=1}^{2} \lambda_{qx}\, \mathrm{tr}\,\mathbf{C}_{qx}$. This function can be minimized by maximizing $\mathrm{tr}(-\mathbf{F}_x\mathbf{x}_i)$, so that the update is given by $\mathbf{x}_i^+ = \mathbf{V}_x\mathbf{U}_x'$, where $\mathbf{V}_x$ and $\mathbf{U}_x$ come from the singular value decomposition $-\mathbf{F}_x = \mathbf{U}_x\boldsymbol{\Sigma}_x\mathbf{V}_x'$. To impose the inequality constraint $\|\mathbf{x}_i\|^2 \le 1$, first calculate the unconstrained update; if the resulting $\mathbf{x}_i$ has a length greater than 1, compute the constrained update. Under constraints, the relaxed update is no longer of use, as convergence can no longer be guaranteed.

The variance constraint is imposed implicitly through an upper bound on the $a_i$'s: the optimal $a_i$ can be rewritten as $r(\boldsymbol{\gamma}_i, \mathbf{d}_i)(\mathbf{d}_i'\mathbf{J}\mathbf{d}_i)^{-0.5}$ and thus depends on the correlation and on the variance of the distances, so that an upper bound u on the $a_i$ restricts the spread of the distances. Within the unit reference space, setting $u = (m v_m)^{-0.5}$ will, for a maximal correlation $r(\boldsymbol{\gamma}_i, \mathbf{d}_i) = 1$, pull the variance of the distances towards a value $v_m$ or higher (with $v_m$ the variance of the distances on a unit disk with uniformly distributed points). The constrained update for $a_i$ then becomes

$$a_i = \begin{cases} 0 & \text{if } \mathbf{d}_i' \mathbf{J} \boldsymbol{\gamma}_i\, (\mathbf{d}_i' \mathbf{J} \mathbf{d}_i)^{-1} < 0, \\ \mathbf{d}_i' \mathbf{J} \boldsymbol{\gamma}_i\, (\mathbf{d}_i' \mathbf{J} \mathbf{d}_i)^{-1} & \text{if } 0 \le \mathbf{d}_i' \mathbf{J} \boldsymbol{\gamma}_i\, (\mathbf{d}_i' \mathbf{J} \mathbf{d}_i)^{-1} \le u, \\ u & \text{if } \mathbf{d}_i' \mathbf{J} \boldsymbol{\gamma}_i\, (\mathbf{d}_i' \mathbf{J} \mathbf{d}_i)^{-1} > u. \end{cases} \qquad (18)$$
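Both constraints are cheap to enforce in code. The sketch below assumes the same building blocks as before: for a single coordinate vector, the SVD recipe for the length-1 update reduces to normalizing the unconstrained update (a positive multiple of −F_x'), and the capped weight update implements (18) with u = (m v_m)^{-0.5}.

```python
import numpy as np

def update_x_i_disk(A_x, lam_C_sum, x_i_prev):
    """Gene-point update with the unit-disk constraint ||x_i|| <= 1.

    The unconstrained update (9) is computed first; if it leaves the
    unit disk, it is replaced by the length-1 solution, which for a
    single vector amounts to normalizing -F_x' (here F_x = A_x).
    No relaxed step is used: under constraints, convergence of the
    relaxed update is no longer guaranteed.
    """
    x_plus = -0.5 * A_x.ravel() / lam_C_sum
    norm = np.linalg.norm(x_plus)
    return x_plus / norm if norm > 1.0 else x_plus

def update_a_i_capped(d_i, gamma_i, u):
    """Scale update (18) with the upper bound u = (m * v_m)**(-0.5)."""
    d_c = d_i - d_i.mean()
    g_c = gamma_i - gamma_i.mean()
    a = (d_c @ g_c) / (d_c @ d_c)
    return float(np.clip(a, 0.0, u))  # enforce 0 <= a_i <= u
```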
References

De Leeuw, J. (1994). Block relaxation algorithms in statistics. In H. H. Bock, W. Lenski, & M. M. Richter (Eds.), Information systems and data analysis (pp. 308-325). Berlin: Springer-Verlag.

De Leeuw, J., & Heiser, W. J. (1980). Multidimensional scaling with restrictions on the configuration. In P. R. Krishnaiah (Ed.), Multivariate analysis (Vol. 5, pp. 501-522). Amsterdam, The Netherlands: North-Holland Publishing Company.

Gifi, A. (1990). Nonlinear multivariate analysis. Chichester, England: Wiley.

Groenen, P. J. F. (2002). Iterative majorization algorithms for minimizing loss functions in classification. (Working paper presented at the 8th conference of the IFCS, Krakow, Poland)

Groenen, P. J. F., Os, B.-J., & Meulman, J. J. (2000). Optimal scaling by alternating length-constrained nonnegative least squares with application to distance-based analysis. Psychometrika, 65, 511-524.

Heiser, W. J. (1995). Convergent computation by iterative majorization: Theory and applications in multidimensional data analysis. In W. J. Krzanowski (Ed.), Recent advances in descriptive multivariate analysis (pp. 157-189). Oxford: Oxford University Press.

Kiers, H. A. L. (2002). Setting up alternating least squares and iterative majorization algorithms for solving various matrix optimization problems. Computational Statistics and Data Analysis, 41, 157-170.

Kruskal, J. B., & Carroll, J. D. (1969). Geometrical models and badness-of-fit functions. In P. R. Krishnaiah (Ed.), Multivariate analysis (Vol. 2, pp. 639-671). New York: Academic Press.

Lange, K., Hunter, D. R., & Yang, I. (2000). Optimization transfer using surrogate objective functions. Journal of Computational and Graphical Statistics, 9, 1-20.

Van Deun, K., Groenen, P. J. F., Heiser, W. J., Busing, F. M. T. A., & Delbeke, L. (2005). Interpreting degenerate solutions in unfolding by use of the vector model and the compensatory distance model. Psychometrika, 70, 23-47.
A Majorizing the distance terms

A.1 Majorizing $a_i^2 \mathbf{d}_i' \mathbf{J} \mathbf{d}_i$
Here, we show how the first term of (2) involving distances can be majorized by a function of the coordinates that is of the form (3). A good reference for finding majorizing inequalities is Groenen (2002). The centering operator J equals $\mathbf{I} - m^{-1}\mathbf{1}\mathbf{1}'$, so that

$$a_i^2\, \mathbf{d}_i' \mathbf{J} \mathbf{d}_i = a_i^2\, \mathbf{d}_i' \mathbf{d}_i - \frac{a_i^2}{m}\, \mathbf{d}_i' \mathbf{1}\mathbf{1}' \mathbf{d}_i. \qquad (19)$$
The first term on the right-hand side of (19) can be rewritten as

$$a_i^2\, \mathbf{d}_i' \mathbf{d}_i = a_i^2\, \mathrm{tr}\,(\mathbf{1}\mathbf{x}_i' - \mathbf{Y})'(\mathbf{1}\mathbf{x}_i' - \mathbf{Y}) = c_{1i}\, \mathrm{tr}\,\mathbf{x}_i \mathbf{1}'\mathbf{1} \mathbf{x}_i' - 2 c_{1i}\, \mathrm{tr}\,\mathbf{x}_i \mathbf{1}' \mathbf{Y} + c_{1i}\, \mathrm{tr}\,\mathbf{Y}'\mathbf{Y}, \qquad (20)$$

with $c_{1i} = a_i^2$. Rewriting (20) as a function of $\mathbf{x}_i$ gives

$$a_i^2\, \mathbf{d}_i' \mathbf{d}_i = k_{1x} + \mathrm{tr}\,\mathbf{B}_{1x} \mathbf{x}_i \mathbf{C}_{1x} \mathbf{x}_i' - 2\, \mathrm{tr}\,\mathbf{A}_{1x} \mathbf{x}_i, \qquad (21)$$

with $k_{1x} = c_{1i}\, \mathrm{tr}\,\mathbf{Y}'\mathbf{Y}$, $\mathbf{B}_{1x} = c_{1i}$, $\mathbf{C}_{1x} = \mathbf{1}'\mathbf{1}$, and $\mathbf{A}_{1x} = c_{1i} \mathbf{1}'\mathbf{Y}$. Note that for centered Y, $\mathbf{A}_{1x} = \mathbf{0}$. As a function of Y we get

$$\sum_i a_i^2\, \mathbf{d}_i' \mathbf{d}_i = k_{1y} + \mathrm{tr}\,\mathbf{B}_{1y} \mathbf{Y} \mathbf{C}_{1y} \mathbf{Y}' - 2\, \mathrm{tr}\,\mathbf{A}_{1y} \mathbf{Y}, \qquad (22)$$

with $k_{1y} = \sum_i c_{1i}\, \mathrm{tr}\,\mathbf{x}_i \mathbf{1}'\mathbf{1}\mathbf{x}_i'$, $\mathbf{B}_{1y} = \sum_i c_{1i}$, $\mathbf{C}_{1y} = \mathbf{I}$, and $\mathbf{A}_{1y} = \sum_i c_{1i} \mathbf{x}_i \mathbf{1}'$.

Consider the second term on the right-hand side of (19): $-\mathbf{d}_i' \mathbf{1}\mathbf{1}' \mathbf{d}_i$ is minus the squared sum of the $d_{ij}$'s, which can be majorized using the inequality $-cs^2 \le cs_0^2 - 2css_0$, with c a positive constant. Taking $c = m^{-1} a_i^2$ and $s = \mathbf{d}_i'\mathbf{1}$, we obtain

$$-\frac{a_i^2}{m}\, \mathbf{d}_i' \mathbf{1}\mathbf{1}' \mathbf{d}_i \le \frac{a_i^2}{m}\, \mathbf{d}_{i0}' \mathbf{1}\mathbf{1}' \mathbf{d}_{i0} - \frac{2 a_i^2}{m} (\mathbf{1}'\mathbf{d}_{i0})\, \mathbf{d}_i' \mathbf{1}. \qquad (23)$$
The last term of (23) still contains distances that depend on $\mathbf{x}_i$ and Y. The Cauchy-Schwarz inequality can be used to obtain a majorizing function of the coordinates, as $-d = -\|\mathbf{x} - \mathbf{y}\| \le -d_0^{-1} (\mathbf{x} - \mathbf{y})'(\mathbf{x}_0 - \mathbf{y}_0)$. Set $c_{2i} = m^{-1} a_i^2 (\mathbf{1}'\mathbf{d}_{i0})$, so that

$$-2 c_{2i}\, \mathbf{d}_i' \mathbf{1} \le -2 c_{2i}\, \mathrm{tr}\,(\mathbf{1}\mathbf{x}_i' - \mathbf{Y})' \mathbf{D}_{i0}^{-1} (\mathbf{1}\mathbf{x}_{i0}' - \mathbf{Y}_0) = -2 c_{2i}\, \mathrm{tr}\,\mathbf{x}_i \mathbf{1}' \mathbf{D}_{i0}^{-1} (\mathbf{1}\mathbf{x}_{i0}' - \mathbf{Y}_0) + 2 c_{2i}\, \mathrm{tr}\,\mathbf{Y}' \mathbf{D}_{i0}^{-1} (\mathbf{1}\mathbf{x}_{i0}' - \mathbf{Y}_0), \qquad (24)$$

with $\mathbf{D}_{i0}^{-1}$ the inverse of the diagonal matrix of the $d_{i0}$'s. When $d_{i0} = 0$, we set the corresponding majorizing term equal to zero. As a function of the $\mathbf{x}_i$ we obtain

$$-\frac{a_i^2}{m}\, \mathbf{d}_i' \mathbf{1}\mathbf{1}' \mathbf{d}_i \le k_{2x} - 2\, \mathrm{tr}\,\mathbf{A}_{2x} \mathbf{x}_i, \qquad (25)$$

with $k_{2x} = c_{2i} \left[ \mathbf{1}'\mathbf{d}_{i0} + 2\, \mathrm{tr}\,\mathbf{Y}' \mathbf{D}_{i0}^{-1} (\mathbf{1}\mathbf{x}_{i0}' - \mathbf{Y}_0) \right]$ and $\mathbf{A}_{2x} = c_{2i} \mathbf{1}' \mathbf{D}_{i0}^{-1} (\mathbf{1}\mathbf{x}_{i0}' - \mathbf{Y}_0)$.
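The Cauchy-Schwarz bound $-d \le -d_0^{-1}(\mathbf{x} - \mathbf{y})'(\mathbf{x}_0 - \mathbf{y}_0)$ is easy to check numerically. The snippet below draws random points, verifies that the majorizer never falls below $-\|\mathbf{x} - \mathbf{y}\|$, and confirms that it touches the function at the support point.

```python
import numpy as np

rng = np.random.default_rng(0)
x0, y0 = rng.standard_normal(2), rng.standard_normal(2)  # support point
d0 = np.linalg.norm(x0 - y0)
for _ in range(1000):
    x, y = rng.standard_normal(2), rng.standard_normal(2)
    majorizer = -(x - y) @ (x0 - y0) / d0
    assert -np.linalg.norm(x - y) <= majorizer + 1e-12   # never below the bound
# equality at the support point itself:
assert np.isclose(-d0, -(x0 - y0) @ (x0 - y0) / d0)
```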
As a function of Y,

$$-\sum_i \frac{a_i^2}{m}\, \mathbf{d}_i' \mathbf{1}\mathbf{1}' \mathbf{d}_i \le k_{2y} + 2\, \mathrm{tr}\,\mathbf{A}_{2y} \mathbf{Y}, \qquad (26)$$

with $k_{2y} = \sum_i c_{2i} \left[ \mathbf{1}'\mathbf{d}_{i0} - 2\, \mathrm{tr}\,\mathbf{x}_i \mathbf{1}' \mathbf{D}_{i0}^{-1} (\mathbf{1}\mathbf{x}_{i0}' - \mathbf{Y}_0) \right]$ and $\mathbf{A}_{2y} = \sum_i c_{2i} \mathbf{D}_{i0}^{-1} (\mathbf{1}\mathbf{x}_{i0}' - \mathbf{Y}_0)$.

A.2 Majorizing $-2 a_i \mathbf{d}_i' \mathbf{J} \boldsymbol{\gamma}_i$
First we replace J by its full expression,

$$-2 a_i\, \mathbf{d}_i' \mathbf{J} \boldsymbol{\gamma}_i = -2 a_i\, \mathbf{d}_i' \boldsymbol{\gamma}_i + \frac{2 a_i}{m}\, \mathbf{d}_i' \mathbf{1}\mathbf{1}' \boldsymbol{\gamma}_i. \qquad (27)$$
Both terms on the right-hand side can be majorized. For the first term, which is again minus a linear function of the distances provided that the $\gamma_{ij}$ are nonnegative, we get by the Cauchy-Schwarz inequality

$$-2 a_i\, \mathbf{d}_i' \boldsymbol{\gamma}_i \le -2 a_i\, \mathrm{tr}\,(\mathbf{1}\mathbf{x}_i' - \mathbf{Y})' \mathbf{D}_{i0}^{-1} \boldsymbol{\Gamma}_i (\mathbf{1}\mathbf{x}_{i0}' - \mathbf{Y}_0) = -2 a_i\, \mathrm{tr}\,\mathbf{x}_i \mathbf{1}' \mathbf{D}_{i0}^{-1} \boldsymbol{\Gamma}_i (\mathbf{1}\mathbf{x}_{i0}' - \mathbf{Y}_0) + 2 a_i\, \mathrm{tr}\,\mathbf{Y}' \mathbf{D}_{i0}^{-1} \boldsymbol{\Gamma}_i (\mathbf{1}\mathbf{x}_{i0}' - \mathbf{Y}_0), \qquad (28)$$

where $\boldsymbol{\Gamma}_i$ is the diagonal matrix of $\boldsymbol{\gamma}_i$ and all $\gamma_{ij} \ge 0$. As a function of the $\mathbf{x}_i$ this is

$$-2 a_i\, \mathbf{d}_i' \boldsymbol{\gamma}_i \le k_{3x} - 2\, \mathrm{tr}\,\mathbf{A}_{3x} \mathbf{x}_i, \qquad (29)$$
with $k_{3x} = 2 a_i\, \mathrm{tr}\,\mathbf{Y}' \mathbf{D}_{i0}^{-1} \boldsymbol{\Gamma}_i (\mathbf{1}\mathbf{x}_{i0}' - \mathbf{Y}_0)$ and $\mathbf{A}_{3x} = a_i \mathbf{1}' \mathbf{D}_{i0}^{-1} \boldsymbol{\Gamma}_i (\mathbf{1}\mathbf{x}_{i0}' - \mathbf{Y}_0)$, while as a function of Y

$$-2 \sum_i a_i\, \mathbf{d}_i' \boldsymbol{\gamma}_i \le k_{3y} + 2\, \mathrm{tr}\,\mathbf{A}_{3y} \mathbf{Y}. \qquad (30)$$

Here, $k_{3y} = -2 \sum_i a_i\, \mathrm{tr}\,\mathbf{x}_i \mathbf{1}' \mathbf{D}_{i0}^{-1} \boldsymbol{\Gamma}_i (\mathbf{1}\mathbf{x}_{i0}' - \mathbf{Y}_0)$ and $\mathbf{A}_{3y} = \sum_i a_i \mathbf{D}_{i0}^{-1} \boldsymbol{\Gamma}_i (\mathbf{1}\mathbf{x}_{i0}' - \mathbf{Y}_0)$.

The second term, which is a positive sum of distances, can be majorized by the inequality $cs \le c(2s_0)^{-1} s^2 + c 2^{-1} |s_0|$, with c a positive constant. Taking $c_{3i} = m^{-1} a_i (\mathbf{1}'\boldsymbol{\gamma}_i)$,

$$c_{3i}\, \mathbf{d}_i' \mathbf{1} \le c_{3i}\, \mathrm{tr}\,(\mathbf{1}\mathbf{x}_i' - \mathbf{Y})' \mathbf{D}_{i0}^{-1} (\mathbf{1}\mathbf{x}_i' - \mathbf{Y}) + c_{3i}\, \mathbf{1}'\mathbf{d}_{i0} = c_{3i}\, \mathrm{tr}\,\mathbf{x}_i \mathbf{1}' \mathbf{D}_{i0}^{-1} \mathbf{1}\mathbf{x}_i' + c_{3i}\, \mathrm{tr}\,\mathbf{Y}' \mathbf{D}_{i0}^{-1} \mathbf{Y} - 2 c_{3i}\, \mathrm{tr}\,\mathbf{x}_i \mathbf{1}' \mathbf{D}_{i0}^{-1} \mathbf{Y} + c_{3i}\, \mathbf{1}'\mathbf{d}_{i0}. \qquad (31)$$
We rewrite the majorizing function (31) as a function of the $\mathbf{x}_i$,

$$c_{3i}\, \mathbf{d}_i' \mathbf{1} \le k_{4x} + \mathrm{tr}\,\mathbf{B}_{2x} \mathbf{x}_i \mathbf{C}_{2x} \mathbf{x}_i' - 2\, \mathrm{tr}\,\mathbf{A}_{4x} \mathbf{x}_i, \qquad (32)$$

with $k_{4x} = c_{3i} (\mathrm{tr}\,\mathbf{Y}' \mathbf{D}_{i0}^{-1} \mathbf{Y} + \mathbf{1}'\mathbf{d}_{i0})$, $\mathbf{B}_{2x} = c_{3i}$, $\mathbf{C}_{2x} = \mathbf{1}' \mathbf{D}_{i0}^{-1} \mathbf{1}$, and $\mathbf{A}_{4x} = c_{3i} \mathbf{1}' \mathbf{D}_{i0}^{-1} \mathbf{Y}$. As a function of Y,

$$\sum_i c_{3i}\, \mathbf{d}_i' \mathbf{1} \le k_{4y} + \mathrm{tr}\,\mathbf{B}_{2y} \mathbf{Y} \mathbf{C}_{2y} \mathbf{Y}' - 2\, \mathrm{tr}\,\mathbf{A}_{4y} \mathbf{Y}, \qquad (33)$$

with $k_{4y} = \sum_i c_{3i} (\mathrm{tr}\,\mathbf{x}_i \mathbf{1}' \mathbf{D}_{i0}^{-1} \mathbf{1} \mathbf{x}_i' + \mathbf{1}'\mathbf{d}_{i0})$, $\mathbf{B}_{2y} = \sum_i c_{3i} \mathbf{D}_{i0}^{-1}$, $\mathbf{C}_{2y} = \mathbf{I}$, and $\mathbf{A}_{4y} = \sum_i c_{3i} \mathbf{x}_i \mathbf{1}' \mathbf{D}_{i0}^{-1}$.