Data Visualization by Multidimensional Scaling: A Deterministic Annealing Approach

Hansjörg Klock & Joachim M. Buhmann
Rheinische Friedrich-Wilhelms-Universität, Institut für Informatik III
Römerstraße 164, D-53117 Bonn, Germany
{joerg, [email protected]

October 28, 1998
Abstract
Multidimensional scaling addresses the problem of how proximity data can be faithfully visualized as points in a low-dimensional Euclidean space. The quality of a data embedding is measured by a stress function which compares proximity values with Euclidean distances of the respective points. The corresponding minimization problem is non-convex and sensitive to local minima. We present a novel deterministic annealing algorithm for the frequently used objective SSTRESS and for Sammon mapping, derived in the framework of maximum entropy estimation. Experimental results demonstrate the superiority of our optimization technique compared to conventional gradient descent methods. Keywords: Multidimensional scaling, visualization, proximity data, Sammon mapping, maximum entropy, deterministic annealing, optimization.
1 Introduction

Visualizing experimental data is a fundamental pattern recognition problem for exploratory data analysis in empirical sciences. Quite often the objects under investigation are represented by proximity data, e.g. by pairwise dissimilarity values instead of feature vectors. Such data occur in psychology, linguistics, genetics and other experimental sciences. Multidimensional scaling (MDS) is known as a collection of visualization techniques for proximity data which yield a set of representative data points in a suitable embedding space. These points are selected in such a way that their mutual distances match the respective proximity values as faithfully as possible. In the more familiar case of data represented by feature vectors, MDS can be used as a visualization tool: it establishes a mapping of these points to an informative low-dimensional plane or manifold on the basis of pairwise Euclidean distances in the original feature space. Due to the relational nature of the data representation, the visualization poses a difficult optimization problem. Section 2 provides a brief introduction to the multidimensional scaling concept. For a more detailed treatment of the subject the reader is referred to the monographs of Borg and Groenen(1) and Cox and Cox(2). Kruskal has formulated the search for a set of representative data points as a continuous optimization problem(3). Deterministic algorithms, the most frequently used candidates to solve such a problem, often converge quickly but display a tendency to get trapped in local minima. Stochastic techniques like simulated annealing treat the embedding coordinates as random variables and circumvent local minima at the expense of computation time. The merits of both techniques, speed and the capability to avoid local minima, are combined by the deterministic annealing approach. This design principle for optimization algorithms is reviewed in Section 3. Sections 4 and 5 present the new algorithms for Kruskal's stress minimization and for Sammon mapping. The applicability of the novel techniques to realistic problems is demonstrated in Section 6.
2 Multidimensional Scaling

Multidimensional scaling refers to a class of algorithms for exploratory data analysis which visualize proximity relations of objects by distances between points in a low-dimensional Euclidean space. Proximity values are represented in the following as dissimilarity values. The reader is referred to Hartigan(4) for a detailed discussion of proximity structures. Mathematically, the dissimilarity of object i to object j is defined as a real number δ_ij. Throughout this paper we
assume symmetric dissimilarity values, i.e., δ_ij = δ_ji. The MDS algorithm determines a spatial representation of the objects, i.e., each object i is represented by coordinates x_i ∈ ℝ^M in an M-dimensional space. We will use X = {x_1, ..., x_N} to denote the entire embedding configuration. The distance between two points x_i and x_j of X is usually measured by the Euclidean distance d_ij ≡ d(x_i, x_j) = ‖x_i − x_j‖. Quite often, the raw dissimilarity data are not suitable for Euclidean embedding and an additional processing step is required. To model such data transformations we assume a monotonic non-linear transformation D(δ_ij) of dissimilarities into disparities. Ideally, after an iterative refinement of D(·) and X, the transformation D(·) should project the dissimilarities δ_ij to disparities that closely match the distances d_ij of the embedded points, i.e., d_ij = D(δ_ij). As discussed by Klock et al.(5), a transformation of the dissimilarities δ_ij can be necessary to compensate a dimensionality mismatch between dissimilarities in the (hypothetical) data space and Euclidean distances in the embedding space.
2.1 Objective Functions
Let us assume that the transformed dissimilarities D_ij = D(δ_ij) match metric distances in an embedding space sufficiently well. Under this condition it makes sense to formulate MDS as an optimization problem with the cost function
$$\mathcal{H}(\{x_i\}) = \sum_{i=1}^{N}\sum_{k=1}^{N} w_{ik}\,(d_{ik} - D_{ik})^2. \qquad (1)$$
The factors w_ik are introduced to weight the disparities individually. This is useful in order to gauge the scale of the stress function, i.e., to normalize the absolute values of the disparities D_ij. Depending on the data analysis task at hand, it might be appropriate to use a local, a global or an intermediate normalization,
$$w^{(l)}_{ik} = \frac{1}{N(N-1)\,D_{ik}^2}, \qquad w^{(g)}_{ik} = \frac{1}{\sum_{l,m=1}^{N} D_{lm}^2}, \qquad w^{(m)}_{ik} = \frac{1}{D_{ik}\sum_{l,m=1}^{N} D_{lm}}. \qquad (2)$$
The different choices correspond to a minimization of relative, absolute or intermediate error(6). The weighting w_ik might also be used to discount disparities with a high degree of experimental uncertainty. For the sake of simplicity, w_ii = 0 for all i is assumed in the sequel. A common choice for (1) is
$$\mathcal{H}^{MDS}(\{x_i\}) = \sum_{i=1}^{N}\sum_{k=1}^{N} w_{ik}\left(\|x_i - x_k\|^2 - D_{ik}\right)^2, \qquad (3)$$
as adopted by the ALSCAL algorithm (Alternating Least Squares Scaling)(7). The squared Euclidean distances are used for computational simplicity. Note that one expects D_ik = δ_ik² if the δ_ik correspond sufficiently well to metric distances, i.e., the squaring of dissimilarities is subsumed into the choice of the function D. Eq. (3) is known as SSTRESS in the literature(2, 7). A more natural choice seems to be
$$\mathcal{H}^{S}(\{x_i\}) = \sum_{i=1}^{N}\sum_{k=1}^{N} w_{ik}\left(\|x_i - x_k\| - D_{ik}\right)^2, \qquad (4)$$
which is referred to as Sammon mapping(8). Sammon used the intermediate normalization from (2) to obtain a non-linear dimension reduction scheme, i.e., the dissimilarities D_ik are computed from a set of vectors {ξ_i ∈ ℝ^n : 1 ≤ i ≤ N} in the n-dimensional input space. From the viewpoint of an embedding problem, i.e., finding an optimal X, there is no need to distinguish between MDS and dimension reduction via pairwise distances. Although many different choices of the distance function are possible, e.g. based on other metrics, we restrict the discussion in this paper to the minimization of (3) and (4). MDS methods which minimize an objective function of this type are commonly referred to as least squares scaling (LSS) and belong to the class of metric multidimensional scaling algorithms. The term metric characterizes the type of transformation D(·) used to preprocess the dissimilarities and does not refer to a property of the embedding space(2). Fig. 1 gives an idea of how MDS might be used in practice. Starting with the dissimilarity matrix (a) of 226 protein sequences from the globin family (dark grey levels correspond to small dissimilarities), embeddings are derived by minimizing (3) with global (b), intermediate (c) or local (d) weighting. The embeddings clearly reveal the cluster structure of the data, with different accuracy in the representation of inter- and intra-cluster dissimilarities. Often it is not possible to construct an explicit functional form D(·) such that the mapped dissimilarities D_ij of an empirical data set match metric distances sufficiently well. In such a situation the space of possible transformations D(·) has to be enlarged and should only be restricted by the monotonicity constraint δ_ij < δ_lk ⇒ D(δ_ij) ≤ D(δ_lk). Order-preserving but otherwise unconstrained transformations of the dissimilarities define the class of non-metric MDS algorithms invented by Shepard(9) and Kruskal(10). In Kruskal's approach not the transformation D(·) but the disparity matrix is modified. The objective function differs slightly from (4).
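For concreteness, the two stress functions (3) and (4) and the intermediate weighting from (2) can be written down in a few lines of code. The following numpy sketch is only an illustration of the definitions; the function and variable names are ours and do not appear in the original formulation.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distances d_ik between all rows of the N x M configuration X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def intermediate_weights(D):
    """Intermediate normalization w_ik^(m) of Eq. (2); the diagonal stays zero."""
    w = np.zeros_like(D)
    off = ~np.eye(D.shape[0], dtype=bool)
    w[off] = 1.0 / (D[off] * D[off].sum())
    return w

def sstress(X, D, w):
    """SSTRESS, Eq. (3): squared distances compared with disparities."""
    d2 = pairwise_distances(X) ** 2
    return (w * (d2 - D) ** 2).sum()

def sammon_stress(X, D, w):
    """Sammon stress, Eq. (4): plain Euclidean distances compared with disparities."""
    d = pairwise_distances(X)
    return (w * (d - D) ** 2).sum()
```

The SSTRESS objective is a quartic polynomial in the coordinates of a single point when all other points are held fixed; the Sammon stress is not, owing to the square root in the distance, which is exactly the difficulty addressed in Section 5.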
2.2 Alternatives to Gradient Descent Approaches
Other algorithms discussed in the literature do not rely on explicit gradient descent. One of these methods, aimed at minimizing a stress function of the Sammon type (4), is known by the acronym SMACOF (Scaling by MAjorizing A COmplicated Function)(11-13). It is based on an iterative majorization algorithm that introduces ideas from convex analysis. Instead of minimizing the stress function H(X) directly, a majorizing function G(X; Y) is derived with
$$\mathcal{G}(X; Y) \geq \mathcal{H}(X), \qquad \forall\, X, Y \in \Omega, \qquad (5)$$
where Ω denotes the space of all configurations. Equality holds for Y = X. During the iterative update, Y is the configuration from the previous step. The iterative majorization gives rise to a non-increasing sequence of stress values with linear convergence to a local minimum of the stress function(12). The approach can be extended to account for arbitrary Minkowski distances(14). The algorithms discussed up to this point are local minimizers sharing the problem of frequently getting stuck in a local minimum of the complicated energy landscape. Only a few global minimization strategies have been developed for MDS, the most prominent algorithm perhaps being the tunneling method(15). This deterministic scheme allows the algorithm to escape local minima by "tunneling" to new configurations X with the same stress, possibly providing a starting point for further stress reduction. A second group of papers deals with the application of stochastic optimization techniques to MDS. Among these approaches there is an application of simulated annealing(16), sharing with
our approach the concept of maximum entropy inference (see below). The hybrid MDS algorithm of Mathar and Zilinskas combines genetic algorithms with local optimization procedures(17) .
2.3 Deficits of Multidimensional Scaling and Sammon Mapping
An often discussed deficit of the classical multidimensional scaling techniques such as Sammon mapping is their inherent batch character(18, 19). A run of the program will only yield an embedding of the corresponding data without direct generalization capabilities. To project new data, the program has to be restarted on the pooled data, because a projection of additional data will modify the embedding of the old data as well. Another, perhaps more urgent deficit is the amount of proximity values that characterize large data sets. For non-linear dimension reduction, the standard technique clusters the data beforehand and visualizes the resulting cluster prototypes. This coarse-graining of a large data set by clustering, already proposed by Sammon in his original paper(8), is unsatisfactory and often unacceptable. The need to overcome this drawback has recently initiated a number of developments(18-21). These approaches share the common idea of using the Sammon stress function as a relative supervisor to train a nonlinear mapping. Such mappings can be implemented as a radial basis function or a backpropagation network. If y = f(x; W) is the output vector, produced by a function f which depends on a set of weights W, the stress becomes
$$\mathcal{H}(\{x_i\}) = \sum_{i=1}^{N}\sum_{k=1}^{N} w_{ik}\left(\|f(x_i; W) - f(x_k; W)\| - D_{ik}\right)^2. \qquad (6)$$
Differentiating with respect to W yields a set of equations that can be used to adjust the weights W. Besides a batch approach, updating can be performed online by presenting pairs of patterns. Although the ideas discussed in this paper apply to these approaches as well, we will subsequently restrict our attention to the derivation of a novel framework for optimization of the classical stress functions.
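As a rough illustration of how (6) can serve as a training criterion, the sketch below evaluates the stress of a parametric map and a gradient with respect to its weights. A plain linear map f(x; W) = Wx and a finite-difference gradient are used purely for brevity; the cited approaches(18-21) employ radial basis function or backpropagation networks with analytic derivatives. All names are ours.

```python
import numpy as np

def parametric_sammon_stress(W, Xi, D, w):
    """Stress (6) for a simple parametric map f(x; W) = W x.
    Xi is the N x n matrix of input vectors, W an M x n weight matrix."""
    Y = Xi @ W.T                                  # map every input vector
    diff = Y[:, None, :] - Y[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    return (w * (d - D) ** 2).sum()

def stress_gradient(W, Xi, D, w, eps=1e-6):
    """Finite-difference gradient of (6) with respect to the weights W."""
    g = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += eps
        Wm[idx] -= eps
        g[idx] = (parametric_sammon_stress(Wp, Xi, D, w)
                  - parametric_sammon_stress(Wm, Xi, D, w)) / (2.0 * eps)
    return g
```

A batch gradient step W ← W − λ·stress_gradient(W, Xi, D, w) then adjusts the map; online variants present pairs of patterns instead.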
3 The Maximum Entropy Approach to Optimization

3.1 Stochastic Optimization and Deterministic Annealing
Since the introduction of simulated annealing in a seminal paper of Kirkpatrick et al.(22), stochastic maximum entropy approaches to optimization problems have found widespread acceptance in the pattern recognition and computer vision community as an alternative to gradient descent and other deterministic techniques. In application to MDS, the search for optimal solutions is a simulated random walk through the space Ω of possible embeddings X ∈ Ω. If implemented by a Markov chain Monte Carlo method such as the Metropolis algorithm, this process converges to an equilibrium probability distribution known as the Gibbs distribution with density
$$P^G(X) = \exp\left(-\frac{1}{T}\left(\mathcal{H}(X) - \mathcal{F}\right)\right), \qquad \mathcal{F} = -T\log\int_\Omega \exp\left(-\frac{1}{T}\,\mathcal{H}(X)\right) dX. \qquad (7)$$
If we denote by P the space of probability densities over Ω, then the Gibbs density P^G minimizes an objective function over P called the generalized free energy,
$$\mathcal{F}_P = \langle\mathcal{H}\rangle_P - T\,S(P) = \int_\Omega P(X)\,\mathcal{H}(X)\, dX + T\int_\Omega P(X)\log P(X)\, dX. \qquad (8)$$
⟨H⟩_P and S(P) denote the expected energy and the entropy of the system with state space Ω and probability density P. The computational temperature T serves as a Lagrange multiplier to control the expected energy ⟨H⟩. Obviously, entropy maximization with fixed expected costs minimizes F_P(23).

Simulated and Deterministic Annealing: Eq. (8) motivates the idea of slowly reducing the temperature during an optimization process. Analogous to an experimental annealing of solids, solutions of an optimization problem are heated and cooled in simulations. To prevent the system from falling into poor local minima, one starts at a high temperature T, where the free energy landscape (8) is dominated by the entropy term and appears smoothed out. A decrease of the temperature then gradually reveals the structure of the original cost function defined by H. In simulated annealing, the interesting expectation values of the system parameters, e.g., the expected embedding coordinates in MDS, are estimated by sampling the Gibbs distribution P^G using a Monte Carlo method. For a logarithmically slow decrease of the temperature, convergence to a global optimum has been proven(24), but in practice simulated annealing is well known for being slow compared to deterministic approaches. It is the aim of deterministic annealing to cure this drawback by exactly or approximately calculating the relevant expectation values w.r.t. the Gibbs distribution. Since the convincing success of an early application to data clustering(25, 26), deterministic annealing has been applied to other combinatorial or continuous optimization problems such as pairwise data clustering(27), graph matching(28) and multidimensional scaling(5, 29). The interested reader is referred to(27) for a more detailed discussion.
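The smoothing effect of the temperature in (7) and (8) is easy to visualize on a toy problem. The snippet below (our own illustration, not taken from the paper) discretizes a one-dimensional double-well cost function and evaluates the Gibbs density and the generalized free energy at a few temperatures: at high T the density is almost uniform, at low T it concentrates on the deeper minimum.

```python
import numpy as np

# Discretized one-dimensional state space and a double-well cost with one deeper minimum.
x = np.linspace(-2.0, 2.0, 401)
H = (x ** 2 - 1.0) ** 2 + 0.3 * x

for T in (5.0, 0.5, 0.05):
    logp = -H / T
    p = np.exp(logp - logp.max())                     # unnormalized Gibbs weights, Eq. (7)
    p /= p.sum()
    F = (p * H).sum() + T * (p * np.log(p + 1e-300)).sum()   # <H> - T S, Eq. (8)
    print(f"T={T:5.2f}   <x> = {(p * x).sum():+.3f}   F = {F:.3f}")
```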
3.2 Mean Field Approximation
The computation of the free energy F (7) and, consequently, of all other interesting expectation values is computationally intractable due to the high-dimensional integrals ∫_Ω f(X) dX. We therefore resort to an approximation technique called mean field approximation. The Gibbs distribution P^G(X) is approximated by a factorized distribution with density
$$P^0(X\mid\Theta) = \prod_{i=1}^{N} q_i(x_i\mid\theta_i). \qquad (9)$$
Each of the factors q_i(x_i|θ_i), parameterized by a vector of mean field parameters {θ_i | 1 ≤ i ≤ N}, serves as a model of the marginal distribution of the coordinates x_i of site i. This approximation neglects correlations between optimization variables and only takes their averaged effect into account. To determine the optimal parameters θ_i, we have to minimize the Kullback-Leibler (KL) divergence between the factorial density P^0(X|Θ) and the Gibbs density P^G(X),
$$\mathcal{D}\!\left(P^0(X\mid\Theta)\,\big\|\,P^G(X)\right) = \int_\Omega P^0(X\mid\Theta)\,\log\frac{P^0(X\mid\Theta)}{P^G(X)}\; dX. \qquad (10)$$
EM-Algorithm: The introduction of the mean field parameters θ_i suggests an alternating algorithm to estimate the expectation values of the embedding coordinates. Iteratively, the parameters θ_i are optimized given a vector of statistics Φ_i that contains all relevant information about the other sites (M-like step). This step is followed by a recomputation of the statistics Φ_k, k ≠ i, on the basis of the new parameters θ_i (E-like step). The resulting alternation algorithm can be viewed as a generalized expectation-maximization algorithm(30).
4 Derivation of the Mean Field Approximations
Utilizing the symmetry D_ik = D_ki and neglecting constant terms, an expansion of H^MDS yields the expected costs
$$\langle\mathcal{H}^{MDS}\rangle = \sum_{i,k=1}^{N} w_{ik}\Big[\, 2\langle\|x_i\|^4\rangle - 8\langle\|x_i\|^2 x_i\rangle^T\langle x_k\rangle + 2\langle\|x_i\|^2\rangle\langle\|x_k\|^2\rangle + 4\,\mathrm{Tr}\big[\langle x_i x_i^T\rangle\langle x_k x_k^T\rangle\big] - 4 D_{ik}\big(\langle\|x_i\|^2\rangle - \langle x_i\rangle^T\langle x_k\rangle\big)\Big]. \qquad (11)$$
Tr[A] denotes the trace of the matrix A. Expectation values in (11) are taken w.r.t. the factorized distribution P^0 (9), i.e.,
$$\langle g\rangle = \prod_{i=1}^{N}\int_{-\infty}^{\infty} dx_i\; g(x_i)\, q_i(x_i\mid\theta_i). \qquad (12)$$
4.1 A Statistics for any Mean Field Approximation
Before discussing computationally tractable model densities q_i(x_i|θ_i), the statistics Φ = (Φ_1, ..., Φ_N) have to be calculated for an unconstrained mean field approximation. Using (11) we determine the Kullback-Leibler divergence between P^0 and the Gibbs density P^G,
$$\mathcal{D}\!\left(P^0\,\big\|\,P^G\right) = \sum_{i=1}^{N}\langle\log q_i(x_i\mid\theta_i)\rangle + \frac{1}{T}\left(\langle\mathcal{H}^{MDS}\rangle - \mathcal{F}\right). \qquad (13)$$
The correct free energy F of the system does not depend on the mean field parameters and can be neglected in the minimization problem. Variation with respect to the parameters θ_ip of P^0 leads to a system of transcendental equations
$$0 = T\,\frac{\partial}{\partial\theta_{ip}}\,\mathcal{D}\!\left(P^0\big\|P^G\right) = \phi^0_i\,\frac{\partial}{\partial\theta_{ip}}\langle\|x_i\|^4\rangle + \hat h_i^T\,\frac{\partial}{\partial\theta_{ip}}\langle\|x_i\|^2 x_i\rangle + \frac{\partial}{\partial\theta_{ip}}\mathrm{Tr}\big[H_i\langle x_i x_i^T\rangle\big] + h_i^T\,\frac{\partial}{\partial\theta_{ip}}\langle x_i\rangle + T\,\frac{\partial}{\partial\theta_{ip}}\langle\log q_i\rangle, \qquad 1 \le i \le N. \qquad (14)$$
All terms independent of the parameters θ_ip are collected in a vector of statistics Φ_i = (φ^0_i, h_i, H_i, ĥ_i) with
$$\phi^0_i = 2\sum_{k=1}^{N} w_{ik}, \qquad \hat h_i = -8\sum_{k=1}^{N} w_{ik}\,\langle x_k\rangle, \qquad (15)$$
$$h_i = -8\sum_{k=1}^{N} w_{ik}\left(D_{ik}\langle x_k\rangle - \langle\|x_k\|^2 x_k\rangle\right), \qquad (16)$$
$$H_i = \sum_{k=1}^{N} w_{ik}\left(8\,\langle x_k x_k^T\rangle + 4\left(\langle\|x_k\|^2\rangle - D_{ik}\right)\mathbf{I}\right). \qquad (17)$$
I denotes the unit matrix. The reader should note that the derivation up to this point does not depend on the choice of the model density (9): Φ_i is a statistics suitable for computing any mean field approximation to the Gibbs density P^G with cost function (3). We propose the following algorithm to compute the statistics Φ = (Φ_1, ..., Φ_N) and the parameter estimates Θ = (θ_1, ..., θ_N) in an iterative fashion (shown in the box below, Fig. 2): the algorithm decreases the temperature parameter exponentially while an estimate of the statistics Φ (E-like step) is alternated with an optimization of the parameters Θ (M-like step). This can be carried out in parallel (with potential convergence problems caused by oscillations) or sequentially with an immediate update of the statistics Φ. The sequential variant of the generalized EM-algorithm with a random site visitation schedule and immediate update is known to exhibit satisfactory convergence properties(31). It converges to a local minimum of the KL divergence since the parameters θ_i are uniquely determined by the statistics Φ_i, which do not explicitly depend on θ_i. Fig. 3a-d displays the expected coordinates for the Iris data set at four different temperatures. At T ≈ 0 the algorithm reaches, with H^S = 0.00413, the best local minimum obtained in all experiments (presumably the global minimum, although there is no proof; note that, in anticipation of Sec. 5, the Sammon stress function was used here). Of course, if the dissimilarities are Euclidean distances of points in ℝ², the algorithm finds a perfect reconstruction at T = 0.
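To make the site-wise structure of the iteration concrete, the following sketch shows a zero-temperature caricature of the procedure for the SSTRESS objective: sites are visited in random order and the partial cost of each site is minimized directly while all other sites are kept fixed. It deliberately replaces the mean field statistics and the annealing loop by a crude gradient step; all names and the step-size choice are ours.

```python
import numpy as np

def partial_sstress(xi, i, X, D, w):
    """SSTRESS contribution of site i with all other sites held fixed."""
    mask = np.arange(len(X)) != i
    d2 = ((xi - X[mask]) ** 2).sum(axis=1)
    return (w[i, mask] * (d2 - D[i, mask]) ** 2).sum()

def site_update(i, X, D, w, steps=20, lr=1e-2):
    """Crude gradient minimization of the partial cost of site i."""
    mask = np.arange(len(X)) != i
    xi = X[i].copy()
    for _ in range(steps):
        d2 = ((xi - X[mask]) ** 2).sum(axis=1)
        coeff = 4.0 * w[i, mask] * (d2 - D[i, mask])
        xi -= lr * (coeff[:, None] * (xi - X[mask])).sum(axis=0)
    return xi

def coordinate_descent(X, D, w, sweeps=50, seed=0):
    """Sequential site visitation with immediate updates (zero-temperature limit)."""
    rng = np.random.default_rng(seed)
    X = X.copy()
    for _ in range(sweeps):
        for i in rng.permutation(len(X)):
            X[i] = site_update(i, X, D, w)
    return X
```

The deterministic annealing algorithm of Fig. 2 wraps this site-wise loop into an outer temperature schedule and replaces the point estimate of each x_i by the moments of a model density q_i, as derived in the next subsection.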
4.2 Model Densities
The derivation of the MDS algorithm is completed by inserting the derivatives of the expectation values ⟨x_i⟩, ⟨x_i x_i^T⟩, ⟨x_i‖x_i‖²⟩, ⟨‖x_i‖⁴⟩ and the entropy S = −⟨log q_i(x_i|θ_i)⟩ into the stationary equation (14). Depending on the chosen model q_i(x_i|θ_i), these values can be computed analytically or they have to be estimated by Monte Carlo integration. The rest of this section is devoted to a detailed discussion of some variants in the choice of q_i(x_i|θ_i).

Exact Model: The Ansatz θ_i = Φ_i for the factorizing density
$$q_i^0(x_i\mid\Phi_i) = \frac{1}{Z_i^0}\,\exp\left(-\frac{1}{T}\, f_i(x_i)\right), \qquad (18)$$
$$Z_i^0 = \int_{-\infty}^{\infty} dx_i\,\exp\left(-\frac{1}{T}\, f_i(x_i)\right), \qquad (19)$$
$$f_i(x_i) = \phi^0_i\|x_i\|^4 + \|x_i\|^2\, x_i^T\hat h_i + \mathrm{Tr}\big[x_i x_i^T H_i\big] + x_i^T h_i, \qquad (20)$$
can be used in principle, since the factorial density is directly parameterized by the statistics Φ_i. From (18) the mean field approximation F_0 of the free energy F is given by
$$\mathcal{F}_0 = -T\sum_{i=1}^{N}\log Z_i^0 = -T\sum_{i=1}^{N}\log\int_{-\infty}^{\infty} dx_i\,\exp\left(-\frac{1}{T}\, f_i(x_i)\right). \qquad (21)$$
The Ansatz (18) exactly estimates the marginals of the Gibbs density (7) with the stress function H^MDS and is therefore called the exact model in this paper. The moments of x_i depend on the mean field parameters θ_i = Φ_i. They are related to the free energy F_0 by the so-called self-consistency equations, i.e., the derivatives of F_0 with respect to the elements h_i, H_i, ĥ_i and φ^0_i of the field vector,
$$\frac{\partial\mathcal{F}_0}{\partial h_i} = \langle x_i\rangle, \qquad \frac{\partial\mathcal{F}_0}{\partial H_i} = \langle x_i x_i^T\rangle, \qquad \frac{\partial\mathcal{F}_0}{\partial \hat h_i} = \langle\|x_i\|^2 x_i\rangle, \qquad \frac{\partial\mathcal{F}_0}{\partial \phi^0_i} = \langle\|x_i\|^4\rangle. \qquad (22)$$
Unfortunately, the integral (21) cannot be evaluated analytically. A Taylor-series expansion of the argument f_i(x_i) of the exponential at the minima x_ip with ∇f_i|_{x_ip} = 0 yields satisfactory results for low temperatures. At intermediate temperatures, however, the Gibbs distribution can be strongly skewed. The skew might introduce severe approximation errors if the number of modes is not estimated correctly, as indicated by numerical instabilities found in our simulations.

Dirac Model: To derive tractable approximations for the statistics we consider the Dirac delta distribution
$$q_i(x_i\mid\theta_i) = \delta(x_i - \mu_i), \qquad (23)$$
centered at the location μ_i. This model can be considered as the zero-temperature limit T → 0 of the density (18), with moments
$$\langle x_i\rangle = \mu_i, \qquad \langle x_i x_i^T\rangle = \mu_i\mu_i^T, \qquad \langle\|x_i\|^2 x_i\rangle = \|\mu_i\|^2\mu_i, \qquad \langle\|x_i\|^4\rangle = \|\mu_i\|^4. \qquad (24)$$
Inserting the derivatives with respect to μ_i into the stationary equations (14) yields the gradient of an M-dimensional potential,
$$I_i(q_i) = \phi^0_i\,\|\mu_i\|^4 + \hat h_i^T\mu_i\,\|\mu_i\|^2 + \mathrm{Tr}\big[\mu_i\mu_i^T H_i\big] + h_i^T\mu_i. \qquad (25)$$
I_i quantifies the partial costs of assigning site i the model q_i given the statistics Φ_i. It is a fourth-degree vector polynomial that can be minimized by gradient descent methods, e.g. conjugate gradient(32), or by the technique described in the appendix, which explicitly computes all minima.

Gaussian Models: The Dirac model for q_i(x_i|θ_i) is independent of the temperature and, consequently, does not exploit the smoothing effects of deterministic annealing at finite T. A refined model based on a multivariate Gaussian with expectation value μ_i and covariance Σ_i correctly captures finite-T effects and thereby preserves the benefits of deterministic annealing,
$$q_i(x_i\mid\mu_i,\Sigma_i) = \frac{1}{Z_i}\,\exp\left(-\frac{1}{2}\,\mathrm{Tr}\big[\Sigma_i^{-1}(x_i-\mu_i)(x_i-\mu_i)^T\big]\right), \qquad Z_i = |\Sigma_i|^{\frac{1}{2}}\,(2\pi)^{\frac{M}{2}}. \qquad (26)$$
Here |Σ_i| denotes the determinant. In practice, however, the full multivariate Gaussian model can be restricted to a radial basis function model with a diagonal covariance matrix Σ_i = σ_i² I. The moments of this isotropic model q_i(x_i|μ_i, σ_i) are given by
$$\langle x_i\rangle = \mu_i, \qquad (27)$$
$$\langle x_i x_i^T\rangle = \sigma_i^2\,\mathbf{I} + \mu_i\mu_i^T, \qquad (28)$$
$$\langle\|x_i\|^2 x_i\rangle = K\sigma_i^2\,\mu_i + \|\mu_i\|^2\mu_i, \qquad (29)$$
$$\langle\|x_i\|^4\rangle = 2M\sigma_i^4 + 4\|\mu_i\|^2\sigma_i^2 + \left(M\sigma_i^2 + \|\mu_i\|^2\right)^2, \qquad (30)$$
$$-\langle\log q_i\rangle = \frac{M}{2}\left(1 + \log\sigma_i^2 + \log 2\pi\right), \qquad (31)$$
with K = M + 2. Inserting these moments into the stationary equations (14) yields
$$T\,\frac{\partial\mathcal{D}}{\partial\mu_i} = \|\mu_i\|^2\left(4\phi^0_i\mu_i + \hat h_i\right) + 2\mu_i\mu_i^T\hat h_i + \left(2H_i + 4K\sigma_i^2\phi^0_i\,\mathbf{I}\right)\mu_i + h_i + K\sigma_i^2\hat h_i, \qquad (32)$$
$$T\,\frac{\partial\mathcal{D}}{\partial\sigma_i} = 4\phi^0_i K M\sigma_i^3 + \left(4K\phi^0_i\|\mu_i\|^2 + 2K\mu_i^T\hat h_i + 2\,\mathrm{Tr}[H_i]\right)\sigma_i - \frac{MT}{\sigma_i}. \qquad (33)$$
As for the Dirac model, the stationary equations (32, 33) can be identified with the gradient of the partial costs
$$I_i(q_i) = \phi^0_i\,\langle\|x_i\|^4\rangle + \hat h_i^T\langle x_i\|x_i\|^2\rangle + \mathrm{Tr}\big[\langle x_i x_i^T\rangle H_i\big] + h_i^T\langle x_i\rangle - MT\log\sigma_i \qquad (34)$$
w.r.t. the mean field parameters μ_i and σ_i. Note that for a fixed value σ_i², (32) defines the gradient of a quartic vector potential in μ_i as in the Dirac case. On the other hand, given a fixed value of μ_i, (33) amounts to a quadratic equation in σ_i² with the unique solution
$$\sigma_i^2 = -\frac{p}{2} + \sqrt{\frac{p^2}{4} + qT}, \qquad \text{where} \qquad (35)$$
$$p = \frac{4K\phi^0_i\|\mu_i\|^2 + 2K\mu_i^T\hat h_i + 2\,\mathrm{Tr}[H_i]}{4\phi^0_i K M}, \qquad q = \frac{1}{4\phi^0_i K}, \qquad (36)$$
since σ_i² > 0, q > 0 and therefore −p/2 < √(p²/4 + qT) for all p. Eq. (35) makes immediately clear how the temperature T acts as a "fuzzifier" of the model distribution by introducing a finite variance σ_i² at all temperatures T > 0. In the MDS experiments performed, the system of equations (32, 33) has been solved in an iterative fashion, alternating the computation of μ_i given σ_i² and of σ_i² given μ_i.
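The closed-form variance update (35)-(36) is the part of the alternation that costs essentially nothing; a direct transcription might look as follows (a sketch under the isotropic Gaussian model, with argument names chosen by us):

```python
import numpy as np

def variance_update(mu_i, phi0_i, hhat_i, H_i, T, M):
    """Solve the quadratic stationary condition (33) for sigma_i^2, Eqs. (35)-(36).

    mu_i   : current mean of site i (length-M numpy array)
    phi0_i : scalar statistic phi^0_i
    hhat_i : field vector hat-h_i (length M)
    H_i    : field matrix H_i (M x M)
    """
    K = M + 2
    p = (4.0 * K * phi0_i * (mu_i @ mu_i)
         + 2.0 * K * (mu_i @ hhat_i)
         + 2.0 * np.trace(H_i)) / (4.0 * phi0_i * K * M)
    q = 1.0 / (4.0 * phi0_i * K)
    return -0.5 * p + np.sqrt(0.25 * p ** 2 + q * T)   # positive root, sigma_i^2 > 0
```

With σ_i² fixed, (32) is again a quartic vector potential in μ_i and can be minimized with the root-enumeration technique of the appendix; alternating the two steps until convergence completes the M-like step for the Gaussian model.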
5 Approximations for Sammon Mapping

In contrast to the SSTRESS cost function (3), for which an optimization algorithm based on local fixed-point iterations exists, the usual way to minimize the costs of Sammon mapping is by gradient descent. These approaches are known to be computationally expensive and prone to local minima. Moreover, convergence critically depends on the chosen step size(33, 34). A method to compute an optimal step size with proven convergence can be found in the literature(35), but at the expense of a large computational cost.
As outlined in the previous section, the interactions in the cost function H^MDS can be completely eliminated by introducing a vector of statistics Φ_i. These statistics capture all relevant information to find an optimal embedding of one selected point while keeping the other sites fixed. This strategy is not directly applicable in the case of Sammon mapping for the following reason. Expanding the stress function (4),
$$\mathcal{H}^S = \sum_{i,k} w_{ik}\left(\|x_i - x_k\|^2 - 2D_{ik}\|x_i - x_k\| + D_{ik}^2\right), \qquad (37)$$
and differentiating with respect to x_i,
$$\frac{\partial\mathcal{H}^S}{\partial x_i} = 4\sum_{k=1}^{N} w_{ik}\,(x_i - x_k)\left(1 - \frac{D_{ik}}{\|x_i - x_k\|}\right), \qquad (38)$$
reveals that the Euclidean distance ‖x_i − x_k‖ introduces a strong coupling of the sites by its occurrence in the denominator of the stationary equations. Furthermore, ∂H^S/∂x_i is plagued by discontinuities at x_i = x_k. The Hessian matrix (w.r.t. a single point) is given by
$$\frac{\partial^2\mathcal{H}^S}{\partial x_i\,\partial x_i^T} = 4\sum_{k=1}^{N} w_{ik}\left[\mathbf{I}\left(1 - \frac{D_{ik}}{\|x_i - x_k\|}\right) + D_{ik}\,\frac{(x_i - x_k)(x_i - x_k)^T}{\|x_i - x_k\|^3}\right]. \qquad (39)$$
Interestingly, for the one-dimensional case it is constant except for the points x_i = x_k, where the gradient is not defined. To derive simple stationary equations for the moments as in the case of H^MDS, the denominator of the fraction in (38) cannot simply be approximated by a constant
$$\zeta_{ik} = \|x_i - x_k\|, \qquad (40)$$
computed e.g. from the expectations of the previous step. The reason is as follows: the corresponding cost function
$$\hat{\mathcal{H}}^S = 2\sum_{i,k} w_{ik}\left(1 - \frac{D_{ik}}{\zeta_{ik}}\right)\|x_i - x_k\|^2 \qquad (41)$$
defines a paraboloid with respect to the coordinates of a selected point x_i with a Hessian matrix given by
$$\frac{\partial^2\hat{\mathcal{H}}^S}{\partial x_i\,\partial x_i^T} = 2\,\mathbf{I}\sum_{k=1}^{N} w_{ik}\left(1 - \frac{D_{ik}}{\zeta_{ik}}\right). \qquad (42)$$
The (constant) Hessian is not strictly positive definite: as soon as D_ik > ζ_ik holds for enough sites k, the right-hand side of (42) becomes negative definite at the (single) extremum of Ĥ^S, i.e., the paraboloid flips its orientation. The stationary equations then describe a local maximum, and a naive fixed-point iteration based on (41) would perform an undesired gradient ascent.
5.1 Algebraic Transformation
This section describes a new approach to Sammon mapping based on a fixed-point preserving algebraic transformation of objective functions(36). Originally, this transformation has been used to linearize a quadratic term in the cost function(36, 37). In the context of Sammon mapping the approach preserves the quartic nature of the cost function while discarding the inconvenient square root term in (37). The key idea is to express a convex term F(x) in a cost function by
$$F(x) = \max_y\left[yx - F^*(y)\right]. \qquad (43)$$
F^* denotes the conjugate of F, derived by the Legendre-Fenchel transformation(38, 39)
$$F^*(y) = \max_x\left[yx - F(x)\right] \qquad (44)$$
of F(x). The conjugate F^*(y) is also a convex function in the dual variable y. Geometrically, (43) defines a representation of F by its tangent space. Applying this transformation to the cost function (37), we eliminate the convex second term of each summand by the transformation
$$-2\sqrt{X_{ik}} \;\longrightarrow\; \max_{\beta_{ik}}\left[-\frac{X_{ik}}{\beta_{ik}} - \beta_{ik}\right], \qquad X_{ik} := \|x_i - x_k\|^2, \quad 1 \le i, k \le N, \qquad (45)$$
introducing an auxiliary variable β_ik. Additional straightforward algebra yields the expression
$$\beta_{ik}^{\,opt} = \arg\max_{\beta_{ik}}\left(-\frac{X_{ik}}{\beta_{ik}} - \beta_{ik}\right) = \sqrt{X_{ik}}. \qquad (46)$$
A second transformation is applied to the cost function (37) in order to enforce the existence of at least one local minimum. For this purpose, the first term of (37) has to be rewritten as
$$2 X_{ik} \;\longrightarrow\; \frac{X_{ik}^2}{\tilde\gamma_{ik}} + \tilde\gamma_{ik}. \qquad (47)$$
Optimal values of the auxiliary parameters γ̃_ik satisfy the condition
$$\arg\min_{\tilde\gamma_{ik}}\left(\frac{X_{ik}^2}{\tilde\gamma_{ik}} + \tilde\gamma_{ik}\right) = X_{ik}. \qquad (48)$$
The reader should note that the γ̃_ik have to assume a minimum, since the right-hand side of (47) is convex in γ̃_ik. In summary, the transformations (45, 47) turn the minimization of H^S into the computation of a saddle point, i.e., a local maximum w.r.t. the auxiliary parameters {β_ik} and a minimum w.r.t. the parameters {γ̃_ik} as well as the coordinates X:
$$\tilde{\mathcal{H}}^S = \sum_{i,k=1}^{N} \frac{w_{ik}}{2}\left[\frac{\|x_i - x_k\|^4}{\tilde\gamma_{ik}} + \tilde\gamma_{ik} - 2D_{ik}\left(\frac{\|x_i - x_k\|^2}{\beta_{ik}} + \beta_{ik}\right) + 2D_{ik}^2\right]. \qquad (49)$$
Inserting (46, 48) into (49) shows that the minima of the original and the saddle points of the transformed objective function can be identified. The gradient of the transformed cost function H̃^S,
$$\frac{\partial\tilde{\mathcal{H}}^S}{\partial x_i} = 4\sum_{k=1}^{N} w_{ik}\left(\frac{\|x_i - x_k\|^2}{\tilde\gamma_{ik}} - \frac{D_{ik}}{\beta_{ik}}\right)(x_i - x_k), \qquad (50)$$
equals the gradient (38) of H^S at β = β^opt. But (49) has distinct computational advantages: in contrast to (41), which might only define one local maximum, (49) guarantees the existence of at least one local minimum. Consequently, the set of stationary equations can be solved to yield a currently optimal x_i keeping the other points and the auxiliary parameters fixed, and an iterative optimization scheme analogous to the case of H^MDS can be defined. We will denote by Ψ = {β_ik, γ̃_ik} the complete set of auxiliary variables and by Ψ^opt their current optimal values as defined by ‖x_i − x_k‖. For fixed Ψ, the transformed stress (49) of Sammon mapping reveals an appealing similarity with H^MDS. Neglecting the constant terms, the only difference turns out to be the additional weighting by 1/β_ik and 1/γ̃_ik, respectively. Note that in the zero-temperature limit (no stochastic variables) β²_ik and γ̃_ik can be identified, as is done for the rest of this subsection. For the derivation of the deterministic annealing algorithm, we have to distinguish strictly between both sets of variables.

To understand the effects of the transformation, we analyze the cost of embedding in ℝ two points with mutual dissimilarity D_01 = 1. The first point is kept fixed at x_0 = 0. The graph in Fig. 4 displays the costs of embedding the second point at locations x_1. The bold line depicts the Sammon costs. Note the characteristic peak at x_1 = 0 due to the discontinuity of the derivative of ‖x_1 − x_0‖ at x_1 = x_0. The discontinuity is smoothed out by the approximation, shown for different values of β_01 (thin lines). We note the following important properties, which hold for the M-dimensional case as well: the approximation is exact for the optimal value of β_01, and the approximated cost function smoothly approaches the Sammon costs at ‖x_1 − x_0‖ = β_01. For large values of ‖x_1 − x_0‖, the approximation yields an upper bound on the local Sammon costs for all values of the parameter β_01. This upper bound does not hold for small values β_01 < D_01/2 if ‖x_1 − x_0‖ < min{D_01/2, β_01}, but the resulting error is always bounded.

We suspect that the discontinuities of the derivative are related to the computational problems of Sammon mapping. Consider the gradient (38) of H^S with respect to a single point x_i. Each term of the sum introduces a hypersphere around x_k defined by the denominator ‖x_i − x_k‖. Passing the surface of one of these hyperspheres, one encounters a discontinuity of the gradient. If the absolute value of the gradient is not too large, the discontinuity can reverse the direction of the gradient, implying the existence of another local minimum. Consequently the number of local minima is related to the number N of points to be embedded. This contrasts with the fixed number of 2M+1 extrema encountered when embedding a single point with the SSTRESS function (see Appendix), where M is the embedding dimension.
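A two-line numerical check illustrates the exactness of the transformation at the optimal auxiliary values (46) and (48). The sketch below compares the original per-pair Sammon cost with the corresponding term of (49), with the weight w_ik omitted; the names are ours and the example values are arbitrary.

```python
import numpy as np

def sammon_pair_cost(x, D):
    """Original per-pair Sammon cost (||x|| plays the role of ||x_i - x_k||)."""
    return (np.linalg.norm(x) - D) ** 2

def transformed_pair_cost(x, D, beta, gamma):
    """Per-pair term of the transformed stress (49) for given auxiliary values
    (weight w_ik dropped)."""
    X = np.dot(x, x)                   # squared distance
    return 0.5 * (X ** 2 / gamma + gamma) - D * (X / beta + beta) + D ** 2

x, D = np.array([1.3, -0.4]), 1.0
X = np.dot(x, x)
# At beta = sqrt(X) and gamma = X the two costs coincide, Eqs. (46) and (48) ...
print(transformed_pair_cost(x, D, beta=np.sqrt(X), gamma=X), sammon_pair_cost(x, D))
# ... while the transformed cost remains smooth in x even at x = 0.
```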
5.2 Deterministic Annealing for Sammon Mapping
To develop a deterministic annealing algorithm, the embedding coordinates are considered to be random variables. The similarity of H^MDS and H̃^S strongly motivates a derivation analogous to Section 4. It has to be emphasized at this point that the free energy of the transformed stress function,
$$\tilde{\mathcal{F}}^S = \langle\tilde{\mathcal{H}}^S\rangle - T\,S, \qquad (51)$$
does not provide an upper bound for the true free energy F^S defined by H^S. But since the saddle points of F̃^S and F^S coincide in the zero-temperature limit, the minimization of an upper bound on F̃^S will still solve our optimization problem. The auxiliary variables are now determined by a variation of the Kullback-Leibler divergence (13), entering the latter via the transformed expected costs
$$\langle\tilde{\mathcal{H}}^S\rangle = \sum_{i,k=1}^{N}\frac{w_{ik}}{2}\left[\frac{\langle\|x_i - x_k\|^4\rangle}{\tilde\gamma_{ik}} + \tilde\gamma_{ik} - 2D_{ik}\left(\frac{\langle\|x_i - x_k\|^2\rangle}{\beta_{ik}} + \beta_{ik}\right)\right] \qquad (52)$$
(constant terms have been neglected).
At finite temperatures, β²_ik and γ̃_ik will assume different values:
$$\tilde\gamma_{ik} = \sqrt{\langle\|x_i - x_k\|^4\rangle}, \qquad \beta_{ik} = \sqrt{\langle\|x_i - x_k\|^2\rangle}. \qquad (53)$$
Introducing effective weights w̃_ik and dissimilarities D̃_ik defined by
$$\tilde w_{ik} = \frac{w_{ik}}{2\tilde\gamma_{ik}}, \qquad \tilde D_{ik} = D_{ik}\,\frac{\tilde\gamma_{ik}}{\beta_{ik}}, \qquad (54)$$
we can immediately identify H̃^S with H^MDS up to constant terms as far as the minimization
w.r.t. X is concerned. Applying the techniques treated in the previous paragraphs, we iterate the EM-loop of H^MDS and the adaptation of the auxiliary parameters Ψ. This suggests the following deterministic annealing algorithm (shown in the algorithm box below, Fig. 5) to compute an embedding based on the Sammon mapping cost function. As for H^MDS, the algorithm decreases the computational temperature T exponentially while iterating, in an asynchronous update scheme, the estimation of the statistics (E-like step) and the optimization of the mean field parameters (M-like step). Again the iteration proceeds until the gain in KL divergence falls below a threshold ε. The update of the conjugate variables β and γ̃ can be performed before re-entering the EM-like loop after convergence. In our practical implementation the update is performed in conjunction with the exponential decrease of T, but also with an immediate update directly before the E-like step. We did not experience any instabilities.
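For a given set of Gaussian site models the auxiliary update and the resulting effective MDS problem can be computed in closed form. The sketch below follows the reconstruction of (53)-(54) given above and evaluates the required fourth-order moment for two independent isotropic Gaussians; the function name and the variance bookkeeping are ours.

```python
import numpy as np

def effective_mds_problem(mu, sigma2, D, w):
    """Effective weights and disparities, Eqs. (53)-(54), for the annealed Sammon map.

    mu     : N x M matrix of site means
    sigma2 : length-N vector of isotropic site variances
    """
    M = mu.shape[1]
    diff = mu[:, None, :] - mu[None, :, :]
    m2 = (diff ** 2).sum(axis=-1)                   # ||mu_i - mu_k||^2
    s2 = sigma2[:, None] + sigma2[None, :]          # variance of x_i - x_k
    EX = m2 + M * s2                                # <||x_i - x_k||^2>
    EX2 = EX ** 2 + 2.0 * s2 * (2.0 * m2 + M * s2)  # <||x_i - x_k||^4>
    gamma = np.sqrt(EX2)                            # Eq. (53)
    beta = np.sqrt(EX)                              # Eq. (53)
    np.fill_diagonal(gamma, 1.0)                    # avoid division by zero on the diagonal
    np.fill_diagonal(beta, 1.0)
    w_eff = w / (2.0 * gamma)                       # Eq. (54)
    D_eff = D * gamma / beta                        # Eq. (54)
    return w_eff, D_eff
```

One sweep of Algorithm II then amounts to computing (w_eff, D_eff), running the E-like/M-like loop of the SSTRESS algorithm on this effective problem, and lowering the temperature.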
6 Simulation Results

In the following, we discuss a number of data sets and show typical embeddings derived by Sammon mapping and SSTRESS-based metric MDS. For three of the data sets, one with very
low stress (Iris), one with intermediate stress (Virus) and another with very high stress (Random), we performed a large number of runs (1000 each) with random starting conditions in order to derive reliable statistics. The experiments consistently demonstrate the success of the deterministic annealing approach.
6.1 Iris Data
A well-known data set widely used in the pattern recognition literature is the Iris data(40). It contains three classes of 50 instances each, where each class refers to a type of iris plant. The data is four-dimensional and consists of measurements of sepal and petal length and width. One class is linearly separable from the other two, but the latter are not separable from each other. Two feature vectors are identical and were removed before we computed the dissimilarity matrix. We applied Sammon mapping to this data in order to derive a two-dimensional representation. The resulting embedding is shown in Fig. 3d. We compared the results of the deterministic annealing algorithm with the classical gradient descent approach. For the latter we used the program sammon supplied with the widely used LVQ-PAK(41) software package. We ran both algorithms 1000 times. Each run of the sammon program lasted 1000 iterations; there was hardly any improvement from increasing that number. Each run of the deterministic annealing algorithm performed 80 cycles through the coordinate set. Fig. 6 depicts the histogram of final stress values. While the gradient descent optimization produces solutions with a broad distribution in quality, deterministic annealing reached the best two bins in nearly 90% of the runs. Further improvement is expected for a refined (although more time-consuming) annealing schedule.
Algorithm    Mean [10^-3]   Std-Dev [10^-3]   Max [10^-3]   Min [10^-3]
Gradient     5.281          0.841             9.999         4.129
Zero         5.084          0.777             8.904         4.130
Annealing    4.255          0.289             5.208         4.129
Table 1: Statistics of an experiment of 1000 runs of Sammon mapping on the Iris data set with (1) gradient descent (LVQ-PAK), (2) zero-temperature and (3) deterministic annealing algorithm.
6.2 Virus Data
A second experiment was performed on a data set described in(34). The data consists of 60 vectors with 18 entries describing biochemical features of the viruses under investigation. The Tobamovirus subset exhibits a large number of poor local minima when analyzed with the original Sammon mapping algorithm(34). A comparison of our (zero-temperature) results with the solutions produced by the program sammon can be summarized as follows: the zero-temperature version of the algorithm avoided very poor local minima but produced a broad distribution; the results for sammon were marginally worse. Deterministic annealing, however, found the best solution in almost all cases (see Tab. 2). These experiments support the view that deterministic annealing eliminates the need for a good starting solution. Fig. 7 shows the histograms of the corresponding runs.
Algorithm    Mean      Std-Dev    Min       Max
Gradient     0.04558   0.003266   0.04156   0.05827
Zero         0.04407   0.002611   0.04156   0.05184
Annealing    0.04157   0          0.04157   0.04157
Table 2: Statistics of an experiment of 1000 runs of Sammon mapping on the Virus data set with (1) gradient descent (LVQ-PAK), (2) zero-temperature and (3) deterministic annealing algorithm.
6.3 Embedding of Random Dissimilarities
Random dissimilarities pose a particularly difficult problem for embedding since many Euclidean constraints are violated. We have performed this experiment in order to demonstrate the power of the deterministic annealing technique in situations where the energy landscape becomes very rugged. Dissimilarities in the data set have been randomly drawn from a bimodal Gaussian mixture with μ_1 = 1.0 and μ_2 = 2.0, both mixture components with standard deviation σ = 0.1. It turns out that the probability of reaching the global optimum from a random starting solution shrinks significantly compared to the Virus data set. Histograms of deterministic annealing solutions and zero-temperature solutions are shown in Fig. 8. 95 percent of the deterministic annealing solutions can be found in the top 10 percent range of the gradient descent solutions. This experiment was performed with the SSTRESS objective function (3).

Algorithm    Mean     Std-Dev   Max      Min
Zero         0.4652   0.00204   0.4714   0.4584
Annealing    0.4609   0.00094   0.4653   0.4577
Table 3: Statistics of an experiment of 1000 runs of the SSTRESS-based multidimensional scaling on the bi-modal random data set. (1) zero-temperature and (2) deterministic annealing algorithm.
6.4 Other experiments
Protein Data: Another real-world data set which we used for testing consists of 226 protein
sequences. The dissimilarity values between pairs of sequences have been determined by a sequence alignment program based on biochemical and structural information. The sequences belong to different globin families abbreviated by the displayed capital letters. The final stress of about SSTRESS = 10% is considerably higher than for the Iris and Virus data sets. Fig. 1 displays both a grey-level visualization of the dissimilarity matrix (dark values denote high similarity), sorted according to a clustering solution, and the discovered embedding, which is in good agreement with the similarity values of the data. Note the significant differences between the three embeddings. The results are consistent with the biochemical classification.

Embedding of a Face Recognition Database: Another visualization experiment was motivated by the development of a face recognition system. A database of 492 persons was used to
FERET database, P.J.Phillips, U.S. Army Research Lab
obtain a dissimilarity matrix by a variant of a facial graph matching algorithm based on Gabor-transform features(42). The images were restricted to the central region of the face and did not include significant background information. Additional preprocessing of the dissimilarity values was required to remove artifacts resulting from a dimensional mismatch(5) in the dissimilarity distribution. There is no prominent cluster structure, and the stress was comparatively high (around 10%), which indicates that the dimension M of the embedding space is too low. Despite these shortcomings, regions with images containing distinctive features such as a pair of glasses, a smile or an opened mouth showing the teeth can be separated well even in this low-dimensional representation, as can be seen in Fig. 11. These distinctive regions are also supported by the results of a pairwise clustering of the data(27). The experiment is intended to demonstrate how MDS can serve as a tool for exploratory data analysis in data mining applications. It allows the user to get an impression of which properties are selected by his (dis-)similarity measure when analyzing relational data. Together with data clustering tools this procedure might reveal additional facets of the data set under investigation.
6.5 A Note on Performance
CPU Time: Annealing techniques have the reputation of being notoriously slow. The algorithms described in this paper support a different picture: Tab. 4 presents the average CPU time needed to compute embeddings for three of the data sets discussed above. For the Iris data set, for example, and depending on the convergence parameter ε as well as the annealing schedule (η = 0.8 within an exponential schedule), the total execution time of 203 seconds of CPU time on a 300 MHz Linux Pentium-II is indeed comparable to the CPU time of sammon (149 seconds / 5000 iterations). A rich potential for optimization resides in the expensive update of the site statistics Φ_i (E-like step), which is of order O(N²). Omitting the update of those sites which do not change during the M-like step can help to reduce execution time. A systematic evaluation of such potentials by selection strategies is part of current research.

Algorithm            Iris [sec]   Globin [sec]   Virus [sec]
Gradient sammon      149          --             9.7
Zero Sammon          200          340            21
Annealing Sammon     203          353            24

Table 4: Average CPU time elapsed on a standard 300 MHz Linux Pentium-II while computing three of the described embeddings with (1) gradient descent (LVQ-PAK) on Sammon mapping (5000 iterations), (2) zero-temperature as well as (3) deterministic annealing on Sammon mapping.

Speed-Accuracy Tradeoff: Despite the potential for further optimization of the CPU time requirements, it is worth considering an application of our current DA implementation even if computation time is limited. Fig. 9 shows the average Sammon stress obtained with gradient descent, zero temperature and annealing on the Iris data set. Neither zero temperature nor annealing produces acceptable results in less than 20 seconds. But as soon as this time is available, the results of both zero-temperature and annealed optimization are better on average than those of sammon. Note that the variance has also been reduced significantly, i.e., solutions with higher than average stress are less likely to occur (cf. Tab. 5).
Parameter Sensitivity: To evaluate the robustness properties of the annealing algorithm, we
performed a number of experiments with suboptimal annealing and convergence parameters in order to enforce fast convergence. Tab. 5 lists the average stress obtained in 100 runs for the respective values of the start temperature T_0, the annealing parameter η and the convergence threshold ε. The starting temperature of course has a certain effect on the quality of the final solution. Apparently the annealing parameter η has a strong effect on the CPU time as well as on the quality of the final result. In addition, the convergence parameter ε exhibits a considerable influence on the solution quality. Interestingly, the effect of ε on the CPU time is not as prominent as that of η. We suspect that waiting for true convergence before cooling is particularly important at higher temperatures, when structural parameters like the orientation of complete clusters are determined (see next section). Furthermore, true convergence at higher temperatures seems to lead to faster convergence at lower temperatures afterwards.
"
0.2 0.4 0.6 0.8
"
0.2 0.4 0.6 0.8
"
0.2 0.4 0.6 0.8
T0 =10 1.0 0.1 0.01 0.001 Stress CPU Stress CPU Stress CPU Stress CPU 12.7 [1.80] 20 8.63 [2.67] 23 5.56 [1.23] 27 4.58 [0.52] 36 6.30 [1.21] 35 5.05 [0.68] 37 4.56 [0.27] 41 4.32 [0.23] 58 4.65 [0.65] 59 4.38 [0.15] 61 4.27 [0.16] 67 4.18 [0.01] 96 4.17 [0.01] 126 4.16 [0.01] 128 4.17 [0.01] 140 4.15 [0.01] 203 T0 =0.1 1.0 0.1 0.01 0.001 Stress CPU Stress CPU Stress CPU Stress CPU 21.3 [10.1] 14 7.91 [1.75] 19 5.12 [0.67] 22 4.91 [0.52] 29 8.14 [1.21] 24 6.46 [1.30] 29 4.98 [0.93] 32 4.66 [0.51] 42 5.56 [0.87] 40 5.76 [1.17] 45 5.19 [1.09] 48 4.65 [0.63] 59 4.71 [0.55] 86 5.40 [0.63] 91 5.05 [0.61] 91 4.55 [0.50] 110 T0 =0.001 1.0 0.1 0.01 0.001 Stress CPU Stress CPU Stress CPU Stress CPU 137 [62.5] 8 79.1 [63.0] 10 33.2 [27.4] 16 27.2 [21.7] 19 25.4 [20.2] 14 14.4 [13.6] 17 6.98 [1.65] 24 7.19 [1.62] 31 9.67 [1.91] 22 7.80 [1.40] 25 5.97 [1.35] 33 5.57 [1.27] 39 5.94 [1.05] 44 5.52 [0.98] 47 5.08 [0.76] 53 5.01 [0.72] 62
Table 5: Average stress and standard deviation after 100 runs of the deterministic annealing algorithm on the Iris data set subject to a variation of the annealing parameters T0, and ". For each combination of the parameters, the respective column contains the average nal stress(in 10?3 ), the corresponding standard deviation (in 10?3) and the average CPU time (in seconds) needed to compute an embedding.
6.6 Structural Differences of Local Minima Configurations
Are there significant structural differences between the embedding with the smallest stress and those configurations which appear to be good local minima in terms of the stress? To answer this question, we performed a Procrustes analysis(2) on pairs of embeddings (translation, rotation and scaling) to optimally match the configurations in a linear sense. As a typical example consider Fig. 10. For the Iris data set it displays the difference between the best solution found in all experiments, with stress H = 4.13 × 10^-3 (presumably the optimal solution), and a local minimum configuration with stress H = 5.05 × 10^-3. Corresponding points are connected by lines. We find a complete reflection of the first cluster (the separate one) between the two solutions. The two other clusters do not differ significantly. Nevertheless, three points of the intermediate cluster are embedded in a completely different neighborhood in the suboptimal configuration. A data analyst who exclusively relies on this visualization might therefore draw unwarranted conclusions in this situation. The large distance between the different embeddings of the three points in the intermediate cluster is presumably a consequence of the reflection of the first cluster. In order to find a better minimum, one would have to undo the reflection, i.e., rearrange the first cluster completely. Clearly this will not happen if the configuration is already in a stable state: the embedding process gets stuck in this local minimum. Deterministic annealing helps to avoid such situations, since the reflectional symmetry of the clusters is broken at the beginning of the experiment at high temperatures, when the points inside each cluster are still at an entropy-dominated position. If, e.g., the second and third cluster have just determined their symmetry, the symmetry of the first cluster can be adjusted with little effort. At low temperatures, however, such global reconfigurations are unlikely if the stress of the embedding is small compared to the transition temperature.
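A Procrustes comparison of two embeddings, as used for Fig. 10, can be performed with standard tools. The following sketch (our illustration; the paper does not specify its implementation) aligns two configurations and returns the per-point displacements:

```python
import numpy as np
from scipy.spatial import procrustes

def embedding_difference(X_best, X_local):
    """Align two N x 2 embeddings by translation, rotation and scaling and
    return the per-point displacement after alignment, plus the overall disparity."""
    A, B, disparity = procrustes(X_best, X_local)
    return np.linalg.norm(A - B, axis=1), disparity
```

Large displacements concentrated on a few points, as for the three points of the intermediate Iris cluster, indicate a structural rather than a merely metric difference between the two local minima.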
7 Conclusion

A novel algorithm for least-squares multidimensional scaling based on the widely used SSTRESS objective has been developed in the framework of maximum entropy inference(5). The well-known optimization principle of deterministic annealing has been generalized to continuous optimization problems. An algebraic transformation enables us to adapt the approach to Sammon mapping; thus it covers the two most widely used MDS criterion functions. A large number of MDS experiments support our claim that annealing techniques display superior robustness properties compared to conventional gradient descent methods, both on synthetic and on real-world data sets. The computational complexity of the new algorithms is comparable to standard techniques. As the algorithms minimize an upper bound on the free energy defined by the respective cost function, convergence is guaranteed independently of any critical parameters such as the step size in the gradient descent approach. Our current research focuses on techniques to alleviate the computational burden posed by a large number N of objects, e.g. N ≈ 10,000-50,000 for realistic biochemical or document databases. Active sampling strategies will enable the estimation of the statistics on the basis of a sparsely sampled dissimilarity matrix, and an integration of clustering and embedding allows the approximation of site-site by site-cluster interactions. Another line of research extends the deterministic annealing principle to alternative cost functions for MDS, e.g. other choices for the metric of the embedding space.
Acknowledgement: M. Vingron provided us with the protein database. H. K. would like to
thank T. Hofmann for valuable discussions. This work has been supported in part by the Federal Ministry of Education and Research.
References

1. I. Borg and P. Groenen. Modern Multidimensional Scaling. Springer Series in Statistics. Springer, 1997.
2. T. F. Cox and M. A. A. Cox. Multidimensional Scaling. Number 59 in Monographs on Statistics and Applied Probability. Chapman & Hall, London, 1994.
3. J. B. Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1):1-27, March 1964.
4. J. A. Hartigan. Representations of similarity matrices by trees. J. Am. Statist. Ass., 62:1140-1158, 1967.
5. H. Klock and J. M. Buhmann. Multidimensional scaling by deterministic annealing. In M. Pelillo and E. R. Hancock, editors, Proceedings EMMCVPR'97, volume 1223 of Lecture Notes in Computer Science, pages 245-260. Springer Verlag, 1997.
6. R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973.
7. Y. Takane and F. W. Young. Nonmetric individual differences multidimensional scaling: An alternating least squares method with optimal scaling features. Psychometrika, 42(1):7-67, March 1977. ALSCAL.
8. J. W. Sammon. A nonlinear mapping for data structure analysis. IEEE Transactions on Computers, C-18(5):401-409, May 1969.
9. R. N. Shepard. The analysis of proximities: Multidimensional scaling with an unknown distance function I. Psychometrika, 27:125-140, 1962.
10. J. B. Kruskal. Nonmetric multidimensional scaling: a numerical method. Psychometrika, 29(2):115-129, June 1964.
11. J. De Leeuw. Applications of convex analysis to multidimensional scaling. In J. R. Barra, F. Brodeau, G. Romier, and B. van Cutsem, editors, Recent Developments in Statistics, pages 133-145. Amsterdam: North Holland, 1977.
12. J. De Leeuw. Convergence of the majorization method for multidimensional scaling. Journal of Classification, 5:163-180, 1988.
13. W. J. Heiser. A generalized majorization method for least squares multidimensional scaling of pseudodistances that may be negative. Psychometrika, 38:7-27, 1991.
14. P. J. F. Groenen, R. Mathar, and W. J. Heiser. The majorization approach to multidimensional scaling. Journal of Classification, 12(12):3-19, 1995.
15. P. J. F. Groenen. The majorization approach to multidimensional scaling: Some problems and extensions. PhD thesis, Leiden University, 1993.
16. R. W. Klein and R. C. Dubes. Experiments in projection and clustering by simulated annealing. Pattern Recognition, 22(2):213-220, 1989.
17. R. Mathar and A. Zilinskas. A class of test functions for global optimization. Journal of Global Optimization, 5:195-199, 1994.
18. J. Mao and A. K. Jain. Artificial neural networks for feature extraction and multivariate data projection. IEEE Transactions on Neural Networks, pages 296-317, 1995.
19. M. E. Tipping. Topographic Mappings and Feed-Forward Neural Networks. PhD thesis, University of Aston in Birmingham, 1996.
20. D. Lowe. Novel 'topographic' nonlinear feature extraction using radial basis functions for concentration coding in the 'artificial nose'. In 3rd IEE International Conference on Artificial Neural Networks. London: IEE, 1993.
21. A. R. Webb. Multidimensional scaling by iterative majorization using radial basis functions. Pattern Recognition, 28(5):753-759, 1995.
22. S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220:671-680, 1983.
23. E. T. Jaynes. Information theory and statistical mechanics. Physical Review, 106:620-630, 1957.
24. S. Geman and D. Geman. Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721-741, 1984.
25. K. Rose, E. Gurewitz, and G. Fox. Statistical mechanics and phase transitions in clustering. Physical Review Letters, 65(8):945-948, 1990.
26. J. M. Buhmann and H. Kühnel. Vector quantization with complexity costs. IEEE Transactions on Information Theory, 39(4):1133-1145, July 1993.
27. T. Hofmann and J. M. Buhmann. Pairwise data clustering by deterministic annealing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(1):1-14, January 1997.
28. S. Gold and A. Rangarajan. A graduated assignment algorithm for graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(4):377-388, 1996.
29. J. M. Buhmann and T. Hofmann. Central and pairwise data clustering by competitive neural networks. In Advances in Neural Information Processing Systems 6, pages 104-111. Morgan Kaufmann Publishers, 1994.
30. A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statist. Soc. Ser. B (methodological), 39:1-38, 1977.
31. R. M. Neal and G. E. Hinton. A new view of the EM algorithm that justifies incremental and other variants. In M. I. Jordan, editor, Learning in Graphical Models, NATO ASI Series D, pages 355-368. Kluwer Academic Publishers, Dordrecht, 1998.
32. W. Press, S. Teukolsky, W. Vetterling, and B. Flannery. Numerical Recipes in C. Cambridge University Press, 2nd edition, 1992.
33. A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, NJ 07632, 1988.
34. B. D. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, 1996.
35. H. Niemann and J. Weiss. A fast converging algorithm for nonlinear mapping of high-dimensional data to a plane. IEEE Transactions on Computers, C-28:142-147, 1979.
36. E. Mjolsness and C. Garrett. Algebraic transformations of objective functions. Neural Networks, 3:651-669, 1990.
37. A. Rangarajan and E. D. Mjolsness. A Lagrangian relaxation network for graph matching. IEEE Transactions on Neural Networks, 7(6):1365-1381, November 1996.
38. I. M. Elfadel. Convex potentials and their conjugates in analog mean-field optimization. Neural Computation, pages 1079-1104, 1995.
39. G. Strang. Introduction to Applied Mathematics. Wellesley-Cambridge, 1986.
40. R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, pages 179-188, 1936.
41. T. Kohonen, H. Hynninen, J. Kangas, H. Laaksonen, and K. Torkkola. LVQ-PAK: The learning vector quantization program package. Technical Report A30, Helsinki University of Technology, Laboratory of Computer and Information Science, FIN-02150 Espoo, Finland, 1996.
42. M. Lades, J. C. Vorbrüggen, J. M. Buhmann, J. Lange, Ch. von der Malsburg, R. P. Würtz, and W. Konen. Distortion invariant object recognition in the dynamic link architecture. IEEE Transactions on Computers, 42:300-311, 1993.
Appendix: Minimization of the Partial Costs
The minimization of the potentials (25) and (34), which define the partial cost of embedding x_i with a local density model q_i(x_i|θ_i), plays an important role for the convergence of the algorithm, since a straightforward minimization by Newton-Raphson or conjugate gradient is likely to find only a local minimum. We therefore present a technique to explicitly enumerate all extrema, which is feasible at least for moderate embedding dimensions. The derivation uses the fact that the cost function (1) is invariant with respect to translations and rotations of the configuration {x_i}. Replacing {x_i} by
$$\{\hat x_i \mid \hat x_i = R\,(x_i - y)\}, \qquad R \in SO(M),\quad y \in \mathbb{R}^M, \qquad (55)$$
the costs do not change: H({x_i}) = H({x̂_i}). Given the partial costs in the general form
$$f_i(\mu_i) = \|\mu_i\|^4 + \|\mu_i\|^2\,\mu_i^T\hat h_i + \mu_i^T H_i\,\mu_i + \mu_i^T h_i, \qquad (56)$$
a canonic choice for R and y can be derived that simplifies (56) significantly. We discuss the case of the Dirac model (25) here; to obtain an equation of the form (56) for the Gaussian model from (32), effective fields have to be defined which subsume the additional terms depending on the model variance σ_i². The first step is to calibrate y_i such that the coefficient of ĥ_i vanishes:
$$0 = \hat h_i = -8\sum_{k\neq i} w_{ik}\left(\langle x_k\rangle - y_i\right). \qquad (57)$$
This leads to the choice
$$y_i = \frac{1}{\sum_{k\neq i} w_{ik}}\,\sum_{k\neq i} w_{ik}\,\langle x_k\rangle. \qquad (58)$$
If one translates the configuration by y, the coordinate moments change as follows (omitting the index i):
$$\langle\hat x\rangle = \langle x\rangle + y, \qquad \langle\hat x\hat x^T\rangle = \langle x x^T\rangle + \langle x\rangle y^T + y\langle x\rangle^T + y y^T, \qquad (59)$$
$$\langle\|\hat x\|^2\hat x\rangle = \langle\|x\|^2 x\rangle + \langle\|x\|^2\rangle\, y + 2\langle x x^T\rangle\, y + 2\, y y^T\langle x\rangle + \|y\|^2\langle x\rangle + \|y\|^2\, y.$$
The variables of the translated system are marked with a hat. Consequently, translated statistics Φ̂ have to be computed according to (59). Rotating the coordinate system into the eigensystem of the symmetric matrix H_i by an orthogonal matrix V yields a diagonal matrix Λ,
$$\Lambda = \mathrm{Diag}(\lambda_1, \ldots, \lambda_M) = V^T H_i V, \qquad (60)$$
with λ_a, 1 ≤ a ≤ M, being the eigenvalues of H_i. After division by φ^0_i, translation and rotation, the potential has the form
$$f(\xi) = \|\xi\|^4 + \xi^T\Lambda\,\xi + \xi^T h, \qquad (61)$$
omitting the index i and the hat above the ξ for conciseness. To compute the extrema of this potential, set the components of the gradient to zero:
$$\partial_a f(\xi) = 4\,\xi_a\|\xi\|^2 + 2\lambda_a\xi_a + h_a = 0, \qquad 1 \le a \le M. \qquad (62)$$
If ρ ≠ −λ_a/2, the solution for ξ_a is
$$\xi_a = -\frac{h_a}{4\rho + 2\lambda_a}, \qquad \rho = \|\xi\|^2. \qquad (63)$$
For (63) to hold, ρ must fulfill the condition
$$\rho = \|\xi\|^2 = \sum_{a=1}^{M}\xi_a^2 = \sum_{a=1}^{M}\frac{h_a^2}{(4\rho + 2\lambda_a)^2}, \qquad (64)$$
which is equivalent to
$$\rho\prod_{b=1}^{M}(4\rho + 2\lambda_b)^2 - \sum_{a=1}^{M} h_a^2\prod_{b\neq a}(4\rho + 2\lambda_b)^2 = 0. \qquad (65)$$
This is a polynomial of degree 2M+1 in one variable. Its roots have to be evaluated numerically, e.g. by Laguerre's method(32). Applying the inverse rotation and the inverse translation to (63), the extrema of (56) can be determined directly from the roots ρ_q, 1 ≤ q ≤ 2M+1. If h = 0, the obvious solutions of (62) are ξ = 0 and
$$\rho = \|\xi\|^2 = -\frac{\lambda_a}{2}, \qquad 1 \le a \le M. \qquad (66)$$
By rearranging (62), the corresponding solutions for ξ are obtained,
$$\xi = \pm\sqrt{\rho}\; e_a, \qquad 1 \le a \le M, \qquad (67)$$
where e_a is the a-th unit vector. Again, applying the inverse rotation and the inverse translation yields the results in the original coordinate system.
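The root-enumeration step is easy to mechanize. The sketch below (our own transcription of the appendix; names and the zero fallback candidate are ours) builds the polynomial (65) with numpy, collects the candidate extrema from (63), and returns the one with the lowest partial cost (61). Both lmbda and h are expected to be numpy arrays.

```python
import numpy as np
from numpy.polynomial import Polynomial

def quartic_potential_minimum(lmbda, h):
    """Globally minimize f(xi) = ||xi||^4 + sum_a lambda_a xi_a^2 + h^T xi, Eq. (61),
    by enumerating the real roots of the degree-(2M+1) polynomial (65) in rho = ||xi||^2."""
    M = len(lmbda)
    factors = [Polynomial([2.0 * l, 4.0]) ** 2 for l in lmbda]   # (4 rho + 2 lambda_b)^2
    prod_all = Polynomial([1.0])
    for fac in factors:
        prod_all = prod_all * fac
    poly = Polynomial([0.0, 1.0]) * prod_all                     # rho * prod_b (4 rho + 2 lambda_b)^2
    for a in range(M):
        prod_other = Polynomial([1.0])
        for b in range(M):
            if b != a:
                prod_other = prod_other * factors[b]
        poly = poly - h[a] ** 2 * prod_other                     # subtract h_a^2 * prod_{b != a}
    def f(xi):
        return (xi @ xi) ** 2 + (lmbda * xi * xi).sum() + h @ xi
    candidates = [np.zeros(M)]                                   # fallback / h = 0 solution
    for r in poly.roots():
        if abs(r.imag) < 1e-8 and r.real > 0.0:
            denom = 4.0 * r.real + 2.0 * lmbda
            if np.all(np.abs(denom) > 1e-12):
                candidates.append(-h / denom)                    # Eq. (63)
    return min(candidates, key=f)
```

For M = 2 the polynomial has degree five, so the enumeration adds only a negligible cost per site update while removing the risk of a local search ending in the wrong basin.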
Figure 1: Similarity matrix (top-left) of 226 protein sequences of the globin family. Dark grey levels correspond to high similarity values. The other three panels show the embeddings derived by multidimensional scaling with (top-right) global, (bottom-left) intermediate and (bottom-right) local normalization of the stress function H^MDS.
Algorithm: MDS by Deterministic Annealing
INITIALIZATION: Initialize the parameters Θ of P^0(X|Θ) randomly. T ← T_max.
WHILE T > T_min DO
    DO
        FOR 1 ≤ i ≤ N in random order DO
            E-like step: Compute the statistics Φ_i from the expectations
                {⟨x_k⟩, ⟨x_k x_k^T⟩, ⟨‖x_k‖² x_k⟩ : 1 ≤ k ≤ N, k ≠ i}, taken w.r.t. P^0(X|Θ).
            M-like step: Minimize D(P^0(X|Θ) ‖ P^G(X)) = F(Θ, Φ) by a variation of θ_i.
    UNTIL change of KL divergence ΔD < ε
    T ← ηT, 0 < η < 1
END

Figure 2: Algorithm I.
Figure 3: Evolution of the embedding of the Iris data set at different temperature levels: (a) t = 5, T = 2.467, stress = 1; (b) t = 37, T = 0.241, stress = 0.0848; (c) t = 55, T = 0.193, stress = 0.0366; (d) t = 149, T = 1e-8, stress = 0.00413.
Figure 4: Visualization of the approximation of Sammon mapping for the simple case of embedding two points in one dimension (see text). The first point is kept fixed at x_0 = 0. The graph displays the cost of embedding the second point at locations x_1. Bold: Sammon costs. Thin: approximation for different values of the auxiliary parameter β_01 (curves for β_01 = 2, 1 and 0.3).
Algorithm: Sammon Mapping by Deterministic Annealing
INITIALIZATION: Initialize the parameters Θ of P^0(X|Θ) randomly. T ← T_max.
WHILE T > T_min DO
    Update the auxiliary parameters Ψ from the expectations
        {⟨‖x_i − x_k‖²⟩, ⟨‖x_i − x_k‖⁴⟩ : 1 ≤ i, k ≤ N, k ≠ i} w.r.t. P^0(X|Θ).
    DO
        FOR 1 ≤ i ≤ N in random order DO
            E-like step: Estimate the statistics Φ_i from the expectations
                {⟨x_k⟩, ⟨x_k x_k^T⟩, ⟨‖x_k‖² x_k⟩ : 1 ≤ k ≤ N, k ≠ i} w.r.t. P^0(X|Θ), and the parameters Ψ.
            M-like step: Minimize D(P^0(X|Θ) ‖ P^G(X)) = F(Θ, Φ) by a variation of θ_i.
    UNTIL change of KL divergence ΔD < ε
    T ← ηT, 0 < η < 1
END

Figure 5: Algorithm II.
Figure 6: Sammon mapping applied to the Iris data for non-linear dimension reduction. Histograms (number of solutions [%] versus stress value) of 1000 runs with deterministic annealing (gray) and gradient descent (white).
Figure 7: Virus data, Sammon mapping. Histograms (number of solutions [%] versus stress value) of 1000 runs with gradient descent (dark gray), the zero-temperature algorithm (light gray) and the deterministic annealing algorithm (white with stripes).
Figure 8: Bimodal random data: histograms (number of solutions [%] versus stress value) of the final SSTRESS of 1000 runs of the deterministic annealing algorithm (gray) versus 1000 runs of the zero-temperature version (white) with local weighting.
Figure 9: Improvement of the average solution (STRESS, in units of 10^-3) as a function of the available CPU time [sec]. Dashed: sammon. Dotted: zero temperature. Solid: annealing for low and high initial temperatures (upper curve: T_0 = 0.1, lower curve: T_0 = 10). In the latter case CPU time has been controlled by a variation of the annealing parameter η.
Figure 10: Difference vectors between the coordinates of two embeddings computed for the Iris data with Sammon mapping: the optimal configuration found (stress = 0.00413) and a local minimum (stress = 0.00505). The ends of each line denote the positions of the same object in the two embeddings.
Figure 11: Face recognition database: embedding of a distance matrix of 492 faces by Sammon mapping (300 randomly chosen faces are used for visualization).