Journal of Classification 22:119-138 (2005) DOI: 10.1007/s00357-005-0008-5

A Permutation-Translation Simulated Annealing Algorithm for L1 and L2 Unidimensional Scaling

Alex Murillo
University of Costa Rica, Costa Rica

J. Fernando Vera
University of Granada, Spain

Willem J. Heiser
Leiden University, The Netherlands

Abstract: Given a set of objects and a symmetric matrix of dissimilarities between them, Unidimensional Scaling is the problem of finding a representation by locating points on a continuum. Approximating dissimilarities by the absolute value of the difference between coordinates on a line constitutes a serious computational problem. This paper presents an algorithm that implements Simulated Annealing in a new way, via a strategy based on a weighted alternating process that uses permutations and point-wise translations to locate the optimal configuration. Explicit implementation details are given for least squares loss functions and for least absolute deviations. The weighted, alternating process is shown to outperform earlier implementations of Simulated Annealing and other optimization strategies for Unidimensional Scaling in run time efficiency, in solution quality, or in both.

Keywords: Unidimensional scaling; Simulated annealing; Permutation; Translation; Quadratic assignment.

The authors would like to thank the acting Editor, Michael Brusco, and three anonymous referees for valuable criticisms and suggestions that greatly improved the paper. We are also indebted to Michael Brusco for kindly making available his programs. (During completion of the manuscript, Willem Heiser was Fellow of the Netherlands Institute for Advanced Study in the Humanities and Social Sciences (NIAS) in Wassenaar, the Netherlands.) Authors' Addresses: Alex Murillo, Atlantic Branch, University of Costa Rica, Turrialba, Costa Rica, e-mail: [email protected]; J. Fernando Vera, Department of Statistics and O.R., Faculty of Sciences, University of Granada, 18071 Granada, Spain, e-mail: [email protected]; Willem J. Heiser, Department of Psychology, Faculty of Social and Behavioural Sciences, Leiden University, The Netherlands, e-mail: [email protected].


1. Introduction

Consider a set of n objects, denoted as $O = \{o_1, \ldots, o_n\}$, and let $\Delta = (\delta_{ij})_{n \times n}$ be a square matrix of dissimilarity data between the objects. Assuming without loss of generality that $\Delta$ is symmetric, with non-negative elements and a zero diagonal, Unidimensional Scaling addresses the particular problem of arranging these objects on a single dimension by means of a vector of n real numbers, denoted $X = (x_1, \ldots, x_n)' \in \mathbb{R}^n$. The distance between two objects is the absolute value of the difference between their coordinates, $d_{ij} = |x_i - x_j|$, $\forall i, j = 1, \ldots, n$, and is well known to produce a serious computational problem with regard to minimizing the least squares loss function ($L_2$-norm),

$$S_{L_2}(X) = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \bigl(\delta_{ij} - |x_i - x_j|\bigr)^2. \qquad (1)$$
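For concreteness, a minimal MatLab sketch (not the authors' code; the function name and interface are illustrative, and implicit expansion requires a recent MatLab release) that evaluates the raw loss (1) for a given configuration could read:

    % Minimal sketch of the raw least squares loss (1); Delta is the
    % symmetric dissimilarity matrix, x a column vector of coordinates.
    function s = uds_stress2(Delta, x)
        D = abs(x(:) - x(:)');          % d_ij = |x_i - x_j|
        R = Delta - D;                  % residuals delta_ij - d_ij
        s = sum(sum(triu(R.^2, 1)));    % sum over pairs i < j only
    end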

The problem of Unidimensional Scaling has been addressed by various authors, and different optimization procedures have been employed to resolve it (see Brusco, 2001, 2002, who also provides a good review of the literature). Many strategies utilize a reformulation of the least squares problem first noted by Defays (1978). Following the exposition of Hubert, Arabie and Meulman (2002), and assuming without loss of generality that the vector X is centred, the least squares loss function can be reparametrized such that (1) may be written as

$$S_{L_2}(\rho, X) = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \delta_{ij}^2 + n \left( \sum_{i=1}^{n} \bigl(x_i - t_i^{(\rho)}\bigr)^2 - \sum_{i=1}^{n} \bigl(t_i^{(\rho)}\bigr)^2 \right), \qquad (2)$$

where $\rho(X) \in \Psi(n)$ is a permutation of the elements of $X$ such that $x_1 \le x_2 \le \cdots \le x_n$, and

$$t_i^{(\rho)} = \bigl(u_i^{(\rho)} - v_i^{(\rho)}\bigr)/n,$$

$$u_i^{(\rho)} = \sum_{j=1}^{i-1} \delta_{\rho(i)\rho(j)}, \ \forall\, i \ge 2, \quad u_1^{(\rho)} = 0;$$

$$v_i^{(\rho)} = \sum_{j=i+1}^{n} \delta_{\rho(i)\rho(j)}, \ \forall\, i < n, \quad v_n^{(\rho)} = 0.$$

The problem of obtaining the minimum of (1) can now be approached by obtaining the permutation $\hat{\rho} \in \Psi(n)$ and the vector $\hat{X}(\hat{\rho}) \in \mathbb{R}^n$, as follows:

$$\sum_{i=1}^{n} \bigl(t_i^{(\hat{\rho})}\bigr)^2 = \max_{\rho \in \Psi(n)} \left\{ \sum_{i=1}^{n} \bigl(t_i^{(\rho)}\bigr)^2 \right\}, \qquad (3)$$

$$\hat{x}_i(\hat{\rho}) = t_i^{(\hat{\rho})}, \ \forall\, i = 1, \ldots, n. \qquad (4)$$
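As an illustration of (2)-(4), a minimal MatLab sketch (illustrative names, not taken from any of the programs discussed in this paper) that evaluates the criterion in (3) and the coordinates in (4) for a given permutation might look as follows:

    % Evaluate the Defays criterion (3) and the coordinates (4) for a
    % permutation rho (a vector containing 1..n in some order); Delta is
    % the symmetric dissimilarity matrix. Illustrative sketch only.
    function [crit, x] = defays_coords(Delta, rho)
        n = length(rho);
        D = Delta(rho, rho);          % dissimilarities in permuted order
        u = zeros(n, 1);              % u_i = sum_{j < i} delta_rho(i)rho(j)
        v = zeros(n, 1);              % v_i = sum_{j > i} delta_rho(i)rho(j)
        for i = 2:n
            u(i) = sum(D(i, 1:i-1));
        end
        for i = 1:n-1
            v(i) = sum(D(i, i+1:n));
        end
        t = (u - v) / n;              % t_i^(rho)
        crit = sum(t.^2);             % the quantity maximized in (3)
        x = t;                        % coordinates (4), in the order rho
    end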

In this reformulation, it can be seen that the optimization problem is a combinatorial one, as (3) belongs to the class of problems called NP-hard, which have been shown to be difficult to solve for sizes of n > 23. In general, the solution is found by means of a two-stage procedure. In the first stage, an appropriate permutation is sought by (3), while in the second stage the coordinates are estimated from that permutation, according to (4). Various authors, for example Brusco and Stahl (2000), have shown the solution to be sensitive to the initial permutation employed in the first phase. Hubert et al. (2002) have used MatLab to implement various strategies for Unidimensional Scaling. Apart from dynamic programming (uniscaldp.m), which guarantees a global optimum but is only suitable, from a practical point of view, for sizes of n < 23, the authors concluded that the solutions providing the best results for the greatest number of objects were, given some number of random starts, either the iterative quadratic assignment improvement heuristic, which applies alternating optimization to each part of the second term of (2) (uniscalqa.m), or gradient-based optimization with Pliner's (1996) smoothing strategy (plinorder.m).

Another strategy, which does not use the reformulation of Defays (1978), is adopted in the present paper. It consists of finding the best configuration that directly optimizes the loss function (1), without attempting to find the best permutation via (3). Therefore, it can also be applied to the loss function of least absolute deviations ($L_1$-norm), expressed as:

$$S_{L_1}(X) = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \bigl|\, \delta_{ij} - |x_i - x_j| \,\bigr|. \qquad (5)$$

Brusco (2001) addressed this problem by Simulated Annealing and obtained results that were acceptable for medium sized groups of objects, in comparison with earlier methods. In the present paper, we offer an algorithm that also uses Simulated Annealing, together with a strategy based on a weighted alternating process that employs permutations and translations. It finds the optimum configuration X̂, for the least squares loss function (1) or for least absolute deviations (5), and is shown to be a very efficient strategy for obtaining optimal solutions.


In order to evaluate the efficiency of the algorithm, it was implemented¹ in Fortran and MatLab, and tested on Rothkopf's (1957) confusion data, the data of Shepard, Kilpatric and Cunningham (1975), as well as 17 symmetric dissimilarity matrices randomly generated by the ransymat.m program of Hubert et al. (2002), with sizes of 10 ≤ n ≤ 200. The results obtained from the data analyzed with the programs of Brusco (2001) and Hubert et al. (2002) are presented in Section 5.

¹ The programs and data can be obtained in a zip file from http://www.ugr.es/local/jfvera/PerTSAUS.zip or by e-mailing the authors.

2. Simulated Annealing Background in Unidimensional Scaling

The stochastic method of Simulated Annealing (SA) originated in the work of Metropolis, Rosenbluth, Rosenbluth, Teller and Teller (1953) for the minimization of a function on a very large, finite set, although it can also be applied for optimization on a continuous set (Duflo, 1996). Kirkpatrick, Gelatt and Vecchi (1983), and independently Černý (1985), demonstrated its utility for finding global optima in combinatorial optimization problems. The method is based on the analogy of the thermodynamic principle of crystallization, in which the material is first heated to a liquid state (a state of high atomic energy) and subsequently the temperature is slowly reduced until the material has cooled (energy reduction) such that the resulting crystal is perfect (minimum energy). The cooling rate is a fundamental aspect of the procedure; if the material is cooled too rapidly, impurities remain (the minimum energy will not be attained) and thus the optimum will not be achieved. To simulate the evolution of a physical system, Metropolis et al. (1953) introduced an iterative procedure known as the Metropolis acceptance rule, proposing a modification of the current state of the system in the following terms:

• If the energy of the system, E, decreases, accept the modification.
• If the energy increases by ΔE, the modification may be accepted with a probability of exp(−ΔE/T), where T is the temperature.

In an iterative algorithm for combinatorial optimization, SA uses this principle by introducing a control parameter T ∈ R⁺ that corresponds to the role of the temperature in a physical system. How to determine the control parameter in practice has been discussed by various authors, including Geman and Geman (1984), Mitra, Romeo and Sangiovanni-Vincentelli (1986), Van Laarhoven and Aarts (1987) and Aarts and Korst (1989), among others. This parameter enables us to limit the number of states that are accessible from each given state, and the probability of accessing such a state. The optimum state is then achieved if T decreases slowly and is under good control. Aarts and Korst (1989), in the context of Markov chains, showed that the SA method asymptotically finds a global optimum. The proof of the existence of a stationary distribution associated with the Markov chain is done by modelling an ergodic, aperiodic Markov chain by means of a random path. Hajek (1988) established the conditions on T for the convergence of the SA algorithm to a global optimum on finite spaces. Winkler (1995) extended Hajek's proof, and Andrieu and Doucet (1998) reported a proof of the convergence of the SA algorithm in the set-up of hidden Markov models. In practice, the control parameter is reduced by levels, generating a succession of states, thus permitting the system to approach an equilibrium for each level. The algorithm stops for small values of T, such that hardly any new generations are accepted. Among all the states of the succession that are generated, the one that optimizes the objective function is chosen as the solution to the problem.

Following Defays' reformulation, De Soete, Hubert and Arabie (1988) were the first to use Simulated Annealing to find the best permutation in the sense of (3). The algorithm was compared with a locally optimal pairwise interchange (LOPI) strategy, and it was concluded that although SA is less costly from a computational standpoint, LOPI is somewhat better with respect to the optimum found. Although these authors concluded that the results of SA did not significantly improve upon those of LOPI, they suggested that the optimality of the solution given by SA largely depends on the parameters of the algorithm that is implemented, and thus that SA should not be dismissed. Brusco and Stahl (2000) used SA in the context of Unidimensional Scaling to obtain initial solutions that could later be used by combinatorial procedures. Brusco (2001) addressed the problem of approximating symmetric dissimilarity data with the City-Block metric, applying a generic Simulated Annealing algorithm. By randomly moving each point to a new position, the location of the objects within a continuum was optimized, for loss functions based either on least squares or on least absolute deviations. The algorithm was proposed for moderately sized data (20 ≤ n ≤ 100), and although the solutions obtained were considered locally acceptable, the author suggested they should merely be used as very good starting solutions for combinatorial algorithms.
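To make the Metropolis acceptance rule described above concrete, here is a minimal MatLab sketch (purely illustrative; the function name and interface are not taken from any of the programs discussed in this paper):

    % Metropolis acceptance rule: always accept an improvement; accept a
    % worsening of deltaE with probability exp(-deltaE/T). Illustrative.
    function accept = metropolis_accept(deltaE, T)
        if deltaE <= 0                        % energy decreases: accept
            accept = true;
        else                                  % energy increases by deltaE:
            accept = rand < exp(-deltaE / T); % accept with prob. exp(-deltaE/T)
        end
    end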

3. The PerTSAUS Optimization Strategy

In this paper, we propose a weighted randomly alternating SA algorithm. It comprises two stages, in each of which the loss function (1), or (5), is directly optimized with respect to X, and this algorithm is therefore different from those proposed by De Soete et al. (1988) and by Brusco (2001). Given a value of X, when the algorithm enters the stage we call permutations, the loss function is minimized with respect to the position of the values in the vector X by the pairwise interchange of the coordinates of two randomly selected objects. The second phase, which we call translations, is the mechanism by which coordinates can change from their previous values. It is the principal one, carrying the greater weight in the algorithm. It consists of an SA procedure that minimizes the loss function by the translation of a single randomly chosen coordinate of X to a new position within its previously defined environment. This procedure differs from that of Brusco (2001) in the following aspects:

(a) the partitioning of the coordinate space does not have to be defined beforehand by the researcher, but is as fine as the precision required for the solution;
(b) the initial temperature, T0, does not depend on a choice of the researcher but is determined objectively by random sampling, such that in the beginning a reasonable percentage of inferior solutions is accepted;
(c) the size of the neighbourhoods is not always equal to that of the whole domain, but is modified dynamically during the process; in the final iterations, it decreases such that the movements are restricted to a space that is ever closer to the optimum;
(d) a deterministic direction mechanism is used in each translation; this provides better results than the probabilistic direction mechanism proposed by Bearden (2001);
(e) in each phase, the truncating length of the Markov chain is not static, but dynamic, depending on each principal iteration, increasing the number of translations or permutations in each stage throughout the algorithm.

One of the most important issues in SA, as in all methods of combinatorial optimization, concerns the relative cost of computation. The current algorithm uses various algebraic simplifications to reduce the growth of the computational burden as a function of n, enabling the loss function to be evaluated from one iteration to another at linear cost. The algorithm always minimizes (1) or (5) with respect to X, and so if we denote the value of X accepted in iteration s as X̂^(s), the change in the loss function between two consecutive iterations can be calculated as follows:

$$\triangle(S_{L_p}) = S_{L_p}(\hat{X}^{(s)}) - S_{L_p}(\hat{X}^{(s+1)}) = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \left( \psi_{ij}^{p}(\hat{X}^{(s)}) - \psi_{ij}^{p}(\hat{X}^{(s+1)}) \right), \qquad (6)$$

in which

$$\psi_{ij}^{p}(\hat{X}^{(s)}) = \bigl|\, \delta_{ij} - |\hat{x}_j^{(s)} - \hat{x}_i^{(s)}| \,\bigr|^{p}, \quad \forall\, i, j = 1, \ldots, n \text{ and } p = 1, 2.$$


The double summation implies a complexity of order O(n²), and so two reduction strategies are used in each phase to calculate the change in (6). In each iteration of the permutation phase, X̂^(s+1) is obtained from X̂^(s) by the pairwise interchange of two randomly selected coordinates. Thus, it is only necessary to consider the effect caused by permuting the positions of two points. Hence, if we interchange the relative coordinates of the objects o_k and o_l, an efficient way of calculating the difference between the values associated with the loss function for L1, in linear time, is:

$$\triangle_P(S_{L_1}) = \sum_{\substack{i=1 \\ i \neq k,\, i \neq l}}^{n} \left( \varphi_{ik,k} + \varphi_{il,l} - \varphi_{ik,l} - \varphi_{il,k} \right), \qquad (7)$$

where

$$\varphi_{lm,k} = \bigl|\, \delta_{lm} - |\hat{x}_l^{(s)} - \hat{x}_k^{(s)}| \,\bigr|, \quad \forall\, l, m, k = 1, \ldots, n.$$

For the loss function in L2, its change can be expressed as:

$$\triangle_P(S_{L_2}) = 2 \sum_{\substack{i=1 \\ i \neq k,\, i \neq l}}^{n} (\delta_{ik} - \delta_{il}) \left( |\hat{x}_i^{(s)} - \hat{x}_l^{(s)}| - |\hat{x}_i^{(s)} - \hat{x}_k^{(s)}| \right). \qquad (8)$$
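As an illustration, a minimal MatLab sketch of the linear-time change (7) under a pairwise interchange could read as follows (illustrative names; x is assumed to be a column vector holding the current coordinates):

    % Change (7) in the L1 loss when the coordinates of objects k and l
    % are interchanged; a positive dp means the interchange improves the loss.
    function dp = delta_perm_L1(Delta, x, k, l)
        n = length(x);
        idx = setdiff(1:n, [k l]);                 % all i with i ~= k, i ~= l
        phi = @(a, b) abs(Delta(idx, a) - abs(x(idx) - x(b)));
        dp = sum(phi(k, k) + phi(l, l) - phi(k, l) - phi(l, k));
    end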

Similarly, in each iteration during the translation phase, we obtain X̂^(s+1) on displacing the coordinate related to a randomly chosen point x̂_l^(s) of X̂^(s) to a new position, x̂_l^(s+1), within a previously defined environment that we call ϑ(x̂_l^(s)) ⊂ R. Thus, the difference between the values of the loss function in L_p can be obtained as follows:

$$\triangle_T(S_{L_p}) = \sum_{\substack{i=1 \\ i \neq l}}^{n} \left( \bigl|\,\delta_{il} - |\hat{x}_i^{(s)} - \hat{x}_l^{(s)}|\,\bigr|^p - \bigl|\,\delta_{il} - |\hat{x}_i^{(s)} - \hat{x}_l^{(s+1)}|\,\bigr|^p \right). \qquad (9)$$
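A corresponding MatLab sketch for (9) (again illustrative; x is the column vector of current coordinates and xl_new the hypothetical candidate position for coordinate l) might be:

    % Change (9) in the Lp loss (p = 1 or 2) when coordinate l is moved
    % from x(l) to xl_new; positive dt means the translation improves the loss.
    function dt = delta_trans(Delta, x, l, xl_new, p)
        idx = [1:l-1, l+1:length(x)];                   % all i ~= l
        r_old = abs(Delta(idx, l) - abs(x(idx) - x(l)));
        r_new = abs(Delta(idx, l) - abs(x(idx) - xl_new));
        dt = sum(r_old.^p - r_new.^p);
    end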

This formulation is efficient for implementation in languages intended for matrix calculus, such as MatLab or Mathematica, which offer a matrix expression of △(S_L2). Hence, for the permutation phase:

$$\triangle_P(S_{L_2}) = 2\,(\delta_k - \delta_l)\bigl(d_{(l)} - d_{(k)}\bigr) + 4\,\delta_{lk}\, d_{lk}, \qquad (10)$$

and for the translation stage:

$$\triangle_T(S_{L_2}) = d^2_{l^{(s+1)} l^{(s)}} + 2\,\delta_l \bigl(d_{l^{(s+1)}} - d_{(l)}\bigr)^t + d_l d_l^t - d_{l^{(s+1)}} d_{l^{(s+1)}}^t, \qquad (11)$$

where δ_k and δ_l denote the vectors formed by the k-th and l-th rows of the matrix ∆; d_(k), d_l and d_(l) denote the vectors formed, respectively, by the k-th column and by the l-th row and column of the distance matrix D(X̂^(s)); and d_{l^(s+1)} denotes the vector

$$d_{l^{(s+1)}} = \bigl(d_{l^{(s+1)} 1^{(s)}}, \ldots, d_{l^{(s+1)} n^{(s)}}\bigr)^t.$$
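To illustrate the point, a minimal MatLab fragment (assuming Delta is the dissimilarity matrix, D is kept up to date as the current distance matrix D(X̂^(s)), and reading the products in (10) as row-times-column inner products) could be:

    % Vectorized evaluation of the permutation update (10); row-times-column
    % products yield scalars. Illustrative sketch, not the authors' code.
    dP = 2 * (Delta(k, :) - Delta(l, :)) * (D(:, l) - D(:, k)) ...
         + 4 * Delta(l, k) * D(l, k);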

The expressions (10) and (11) have linear complexity, which again makes them more efficient to implement than the direct use of (6). Since the algorithm proposed in this paper is based on the optimization of the loss function by Simulated Annealing, applying the two strategies described above, it has been named PerTSAUS (Permutation-Translation Simulated Annealing Unidimensional Scaling).

4. Implementation

Three phases can be distinguished in the proposed algorithm: initialization, optimization, and termination. Figure 1 shows the pseudo-code of the algorithm for a detailed description, in which the following notation is used:

V: Length of the neighbourhood for each coordinate.
η: Dynamic factor of V in each iteration.
LC: Truncating length of the Markov chain in each phase or temperature level, which increases by a number IC every m iterations.
IC: Increase in LC every m iterations.
m: Number of iterations during which LC remains constant.
γ: Cooling factor, which controls reductions in temperature.
T: System temperature in each iteration.
Tf: Final system temperature.
Itmax: Maximum number of iterations that the system will run.
Rmax: Maximum number of iterations during which a solution remains unchanged.
α: Weight of the permutation phase versus that of translations.
A: Neighbourhood directioning vector, which determines the next neighbourhood of each coordinate of X.
Ma: Number of elements of the initial sample that worsen the value of the loss function (in the calculation of the initial temperature).
Mamax: Maximum number of translations of the initial sample to obtain Ma.
χ: Probability that worse solutions in the initial sample will be accepted when the system starts (in calculating the initial temperature).

Read data: Read data matrix ∆ (n × n)
Initialize:
    V(i) ← 0.95 ∗ max ∆, for i = 1, . . . , n
    A(i) ← −1 or 1 randomly, for i = 1, . . . , n
    Generate X, with xi ∈ [0, max ∆] uniform distribution, for i = 1, . . . , n
Calculate: S ← S(X), using (5) or (1)
Calculate: T ← INITIAL TEMPERATURE (see Figure 2)
Initialize parameters: LC, IC, γ, Tf, Rmax, α, m, η
PermVsTrasl ← 1
Itmax ← ln(Tf / T) / ln(γ)
Iter ← 1
Repeat
    If Random(0, 1) < PermVsTrasl Then
        PermVsTrasl ← α
        For 1 to LC Do                          (Start permutation phase)
            Generate a transposition X̄ of X
            dp ← △P(S_Lp), using (7) or (8)
            If dp > 0 Then X ← X̄; S ← (S − dp)
        End(For)
    Else
        For 1 to LC Do                          (Start translation phase)
            Generate X̄ ∈ ϑ(X, A, V), see (12)
            dt ← △T(S_Lp), using (9)
            If dt > 0 Then
                X ← X̄
                S ← (S − dt)
                V(l) ← V(l) + η ∗ (max ∆ − V(l))
            Else
                Update A, so that A(l) ← −A(l)
                V(l) ← V(l) − η ∗ V(l)
                If exp(dt/T) > Random(0, 1) Then X ← X̄; S ← (S − dt)
            End(If)
        End(For)
        T ← (T ∗ γ)
    End(If)
    If MOD(Iter, m) = 0 Then LC ← (LC + IC)
    Update: ContRep ← REPETITION COUNTER
    Iter ← (Iter + 1)
Until (Iter = Itmax) or (ContRep = Rmax)

Figure 1. Pseudo-code of the PerTSAUS algorithm.


INITIAL TEMPERATURE:
Initialize: Ma, Mamax, χ
Average ← 0
cont ← 0
Test ← 1
Repeat
    Generate X̄ ∈ ϑ(X, A, V), randomly
    dt ← △T(S_Lp), using (9)
    If dt < 0 Then
        Average ← (Average + dt)
        cont ← (cont + 1)
    End(If)
    Test ← (Test + 1)
Until (Test = Mamax) or (cont = Ma)
T0 ← (Average / cont) / ln(χ)

Figure 2. Pseudo-code to calculate the initial temperature in the PerTSAUS algorithm.

4.1 Initialization Phase

The algorithm first calculates the initial length of the neighbourhood for each coordinate, V, and randomly starts the direction vector A and the vector of real coordinates X̂^(0) = (x̂_1^(0), . . . , x̂_n^(0)). The loss function is evaluated for the initial solution X̂^(0) generated, and the estimated difference functions of linear order are determined for later use. Next, the initial value of T is calculated, according to the scheme given in Figure 2. Initialization of T is done objectively, using simple random sampling, following a strategy similar to that of Kirkpatrick et al. (1983). At most Mamax translations are performed, with the objective of averaging Ma possible increases from solutions that worsen the objective function. We thus obtain a value for the initial temperature such that, in the first iterations of the process, 100χ% of the worse solutions are normally accepted (if d̄ denotes the average, negative, change over the worsening moves, setting exp(d̄/T0) = χ gives T0 = d̄/ln(χ), as in Figure 2). Finally, the researcher sets the initial value of LC, the increase of the truncating length of the Markov chain, IC, the percentage of permutations with respect to that of translations, α, and the other parameters described above.

4.2 Optimization Phase

This phase is the main part of the algorithm, in which the optimum is found. In the first iteration, the algorithm enters the permutation stage with probability one. During the remaining iterations, the choice between stages is random, with a probability α of entering the permutation stage and a probability 1 − α of entering the translation stage. Therefore, our strategy can be considered a weighted randomly alternating algorithm for Unidimensional Scaling.

The subphase corresponding to the permutation stage consists of an internal cycle of length LC. As any permutation may be obtained by the composition of a finite number of transpositions, in each iteration it is only necessary to implement the specific case of transposing two randomly chosen points. Although initially different values of T were used, the best results were obtained by using a constant value T = 0, as in a gradient descent technique. Hence, if X̂^(s+1) improves the value of the loss function, it is accepted and Ŝ_Lp^(s+1) is updated by the linear time formulas given in (7) or in (10).

The subphase corresponding to the translation stage is made up of a cycle that is also of length LC, in which the Metropolis acceptance rule is applied to the new value X̂^(s), chosen within the neighbourhood, using either the linear order expression given by (9), or that given by (11). The temperature of the system is reduced by the simple cooling procedure proposed by Kirkpatrick et al. (1983), consisting of calculating T^(s+1) = γ × T^(s) for each iteration.

One of the aspects that most strongly influences the quality of the solution obtained in SA is the way the neighbourhood is defined for each iteration. In the present study, the neighbourhood of X̂^(s) is based on two aspects: direction and length. The direction mechanism influences the construction of the neighbourhood in each iteration as follows: if S_Lp(X̂^(s+1)) > S_Lp(X̂^(s)), then the search direction will be inverted the next time an attempt is made to translate the l-th point. The length of the neighbourhood for each coordinate x̂_l^(s) represents the size of the translation domain and thus the amount of possible change in the loss function. If various consecutive translations of the l-th coordinate have decreased the loss function, the length of the neighbourhood is increased in order to accelerate the process; otherwise, the length is decreased, so that in the final iterations fewer improvements in the loss function occur and the translation domain tends to contract.

Since the neighbourhood for a solution X̂^(s) depends on the direction and length parameters, it is denoted by ϑ(X̂^(s), A^(s), V^(s)), defined as the set of all the representations X̂^(s+1) that could be obtained from X̂^(s) when a randomly chosen coordinate x̂_l^(s) is translated to another position. The update has the following form:

$$\hat{x}_l^{(s+1)} = \hat{x}_l^{(s)} + A^{(s)}(l) \ast V^{(s)}(l). \qquad (12)$$
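For illustration, one translation step of this kind could be sketched in MatLab as follows (illustrative; delta_trans is the hypothetical helper for (9) sketched in Section 3, maxDelta stands for max ∆, and eta, A, V and T follow the notation of Figure 1):

    % One translation move: deterministic direction via A(l), step size V(l),
    % Metropolis acceptance, and dynamic update of direction and length.
    xl_new = x(l) + A(l) * V(l);                 % the move (12)
    dt = delta_trans(Delta, x, l, xl_new, p);    % change (9) in the loss
    if dt > 0                                    % improvement: accept, widen
        x(l) = xl_new;
        V(l) = V(l) + eta * (maxDelta - V(l));
    else                                         % reverse direction, shrink
        A(l) = -A(l);
        V(l) = V(l) - eta * V(l);
        if exp(dt / T) > rand                    % may accept a worse move
            x(l) = xl_new;
        end
    end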

This procedure yields better results than taking x̂_l^(s+1) ∈ ϑ(x̂_l^(s)) randomly. Another important aspect proposed in this paper consists of using a dynamic truncating length of the Markov chain in each phase, which increases with each principal iteration. It has turned out experimentally that, at first, the function to be optimized decreases rapidly, and so the number of iterations may be lower than at the end of the algorithm, when the reduction of the objective function is slower.

4.3 Termination Phase

At the end of each stage, two criteria must be checked for the algorithm to finish. One is of a computational nature, evaluating whether the maximum number of iterations has been reached, determined by

$$It_{max} = \frac{\ln(T_f / T_0)}{\ln(\gamma)},$$

such that if the parameter Tf is very close to zero, the process eventually behaves like a gradient descent technique, guaranteeing that any SA method ends at a local optimum (Winkler, 1995). For instance, assuming an initial temperature T0 = 10 with Tf = 10⁻⁷ and γ = 0.95, this gives Itmax = ln(10⁻⁸)/ln(0.95) ≈ 359 temperature levels. The other criterion is that of convergence, when a solution remains unaltered during a number Rmax of iterations, previously established by the researcher. The optimum achieved represents the solution to the problem, X ∈ Rⁿ, which is normalized and centered for possible comparisons.

5. Experimental Test Results

The algorithm was developed in order to obtain, in that order of priority, the best solution in the least possible time. It was implemented in Fortran and in MatLab, in the programs pertsaus2 for the loss function (1) in L2 and pertsaus1 for the loss function (5) in L1. In this section, first the performance of the proposed algorithm is tested on several data sets, and second, these test results are compared with those obtained from other major Unidimensional Scaling strategies.

Various data sets were employed. These included the file number.dat, which contains the data from Shepard et al. (1975) analyzed by Hubert et al. (2002), and the file morsecode.dat, which contains the dissimilarities obtained by the transformation δij = 2 − (cij + cji), i, j = 1, . . . , 36, of the matrix of confusabilities, cij, of thirty-six Morse Code signals from Rothkopf (1957), analyzed by Brusco (2001). In addition, we used 17 files of data denoted as dn.dat, each of which contained a symmetric dissimilarity matrix with no missing entries, generated from a uniform distribution on the interval [0, 1] by the program ransymat.m, developed by Hubert et al. (2002), with sizes of 10 ≤ n ≤ 200, thus also including relatively large data sets.

The procedure used to analyze each matrix was the same in every case, working on a Pentium IV 2.66 GHz computer with 512 MB of RAM. Each program was run on Microsoft Windows XP, in groups of twenty independent replications associated with twenty random starts, taking the final solution for each group to be the smallest local optimum, and the duration to be the mean execution time. As an indicator of the algorithm's efficiency, we also utilized the attraction rates associated with the optimum, that is, the percentage of times that the smallest optimum value was found during the tests. The most efficient algorithm was taken to be, in this order, the one with the better optimum found, the lower time required for the search, and the higher attraction rate, taking into account that the latter is equal to zero if the smallest local optimum is not found. All problems have results that, in terms of the raw loss function, are significant to four decimal places.

It should be noted that although higher values of some of the algorithm's parameters, such as LC and IC, and lower values of others, such as the temperature, could offer theoretically better local optima, the application of SA in any context requires a suitable equilibrium between the reduction gained in the loss function and the time necessary to achieve it. With this in mind, different combinations of parameter values were tested; of these, the ones offering the best results in most of the data sets employed were, for Fortran: LC = 150n, IC = 5n, γ = 0.95, Tf = 10⁻⁷, Rmax = 10, α = 0.05, m = 20 and η = 0.1. The initial temperature value was calculated using the values Ma = 50n, Mamax = 1000 and χ = 0.95. Some of the parameter values used, such as LC, IC, γ, Tf, Rmax, η, Ma, Mamax and χ, have also been found efficient in other studies, for example in Trejos, Murillo and Piza (1998), Brusco (2001), Bearden (2001) and Piza, Trejos and Murillo (2002). To select the chosen combination, the values 1, 0.75, 0.5, 0.25, 0.10, 0.05 and 0 were tested for α, and the values 10, 20 and 50 for m.

Fortran is one of the most efficient programming languages in terms of calculation time, a fact which, together with its widespread availability and acceptance, makes it a highly suitable choice. Nevertheless, the power and current facilities of other languages, such as MatLab, mean that their use has increased and that it is advisable to offer a version of the proposed algorithm in MatLab too, despite its being more costly in terms of time, and thus less efficient. For the implementation in MatLab 6.5 to be efficient, it is necessary to sacrifice some optimality to save time, and so the parameter values used were LC = 10n, IC = n, Ma = 5n, Mamax = 500, with the others remaining unchanged.

Table 1 shows, for each analyzed data matrix, the results obtained with pertsaus2.for and pertsaus2.m (where stress values are normalized by the sum of squared dissimilarities), and the results obtained with pertsaus1.for (where stress values are normalized by the sum of the dissimilarities). For the MatLab 6.5 results, lower attraction rates were obtained for the large matrices.


Table 1. Normalized results for the Fortran and MatLab 6.5 implementations of the PerTSAUS algorithm.

File   | pertsaus2.for              | pertsaus1.for              | pertsaus2.m
       | min(SL2)   attr%  CPU time | min(SL1)   attr%  CPU time | min(SL2)   attr%  CPU time
numb   | 0.13031412  100    0.19    | 0.28916902  100    0.20    | 0.13031413  100    5.34
d10    | 0.18977976   85    0.18    | 0.35594355   50    0.20    | 0.18977977   75    5.63
d15    | 0.23126412   85    0.32    | 0.41058773   60    0.34    | 0.23126412   65    8.50
d20    | 0.26134528  100    0.48    | 0.45088357   95    0.52    | 0.26134528  100   12.05
d24    | 0.30431573   40    0.63    | 0.47375015   10    0.68    | 0.30431573   65   14.88
d30    | 0.32289220   30    0.89    | 0.48940610   70    0.97    | 0.32289221   60   19.64
morse  | 0.23049483   35    1.27    | 0.41169273   30    1.36    | 0.23049483   40   25.87
d40    | 0.36219557   20    1.42    | 0.53779513   30    1.53    | 0.36219557   20   27.44
d50    | 0.35852206    5    2.04    | 0.53544485    5    2.16    | 0.35853981    5   34.70
d60    | 0.37801783   30    2.79    | 0.55669778   10    3.00    | 0.37802115    5   41.06
d70    | 0.38170711    5    3.67    | 0.55825922   10    3.88    | 0.38170798   10   50.64
d80    | 0.38834989   75    4.56    | 0.56626206    5    4.89    | 0.38834990   20   57.72
d90    | 0.39815382    5    5.71    | 0.57232249    5    6.14    | 0.39832346    5   65.89
d100   | 0.40588705    5    6.85    | 0.57937214    5    7.55    | 0.40690775    5   76.84
d120   | 0.41368074   10    9.63    | 0.58793285    5   10.56    | 0.41368187    5   94.36
d140   | 0.42039767    5   12.84    | 0.59308483    5   14.05    | 0.42040234    5  117.98
d160   | 0.42771035    5   16.85    | 0.60133533    5   18.07    | 0.42771482    5  145.68
d180   | 0.43044710    5   20.94    | 0.60326004    5   22.76    | 0.43080298    5  163.01
d200   | 0.43674421    5   25.76    | 0.60801237    5   28.10    | 0.43674618    5  190.68

Although the range of values obtained indicates that the algorithm is computationally efficient, the MatLab run times become quite high for n ≥ 90.

5.1 PerTSAUS Test Results Compared to SA in Defays Reformulation and Brusco's Two-Step Procedure

To test the efficiency of the proposed algorithm with respect to a similar strategy by De Soete et al. (1988), we first implemented the Defays reformulation (3) for the L2 metric in Fortran, by means of a Simulated Annealing algorithm named DefaysSA.for, whose pseudo-code is shown in the short Appendix A to this paper, using the usual parameter values LC = 150n, IC = 5n, γ = 0.95, Tf = 10⁻⁷, Rmax = 10, m = 20, Ma = 50n, Mamax = 1000 and χ = 0.95 for the nineteen data matrices. Brusco (2001) implemented the Fortran programs sseuds2.for for the loss function L2 in (1), and laduds2.for for the loss function L1 in (5), to deal with the problem of Unidimensional Scaling in a SA framework. Although the solutions obtained were considered acceptable, the author suggested they should merely be used as very good starting solutions for combinatorial algorithms, implementing in Fortran an iterative quadratic assignment post-process called udsnnlsb.for in connection with sseuds2.for, and udsnnlad.for associated with laduds2.for. To evaluate the efficiency of the present algorithm, we have analyzed the nineteen data sets with Brusco's SA programs as well, applying the post-processing, and compared them with the earlier results. The parameter values for Brusco's SA programs were: PS = 400, γ = r = 0.95, LC = TL = 200n, Tf = Tmin = 0.00001.

The results in terms of the quality of the solutions can be described as follows. Normalized stress values obtained for DefaysSA.for were very close to the pertsaus2.for values (they were equal in 3 decimal places, while in 2 out of 19 cases the latter was better than the former in the fourth decimal place). Similarly, the results of sseuds2.for were very close to pertsaus2.for, as were the results of laduds2.for compared to pertsaus1.for, both in minimum normalized stress and in attraction rates (here, we find equality in four decimal places). Therefore, it appears that the current method is equivalent to the benchmark methods in the quality of the solutions found. Somewhat unexpectedly, the post-processing with quadratic assignment (udsnnlsb.for) did not give any improvement over the SA results in any of the cases tested (numb up to d90).

With respect to run-time efficiency, the situation was quite different. In Figure 3, the mean CPU times achieved for one run with the program DefaysSA.for and Brusco's two-step procedures are given, compared with the comparable PerTSAUS methods (CPU times exploded for Brusco's two-step procedures for n ≥ 90 in the L2 case and n ≥ 70 in the L1 case). Clearly, PerTSAUS outperforms the other methods in run time efficiency, especially when n is larger than 50. In addition, the other methods show accelerated growth, in contrast with the linear growth of PerTSAUS.

Figure 3. CPU-time (in seconds) for the least absolute residual methods (left panel) and the least squares methods (right panel).

5.2 PerTSAUS Test Results Compared to Two Other Optimization Strategies

Using the Defays reformulation, Hubert et al. (2002) implemented several optimization heuristics for Unidimensional Scaling in MatLab 6.5, among which the iterative quadratic assignment improvement heuristic called uniscalqa.m provided the best results for n ≥ 23. They also obtained quite good results with another implemented heuristic, based on Pliner's (1996) gradient smoothing technique, called plinorder.m. As was indicated earlier, for the implementation in MatLab to be efficient in terms of CPU time, the parameter values used for the PerTSAUS MatLab programs were LC = 10n, IC = n, γ = 0.95, Tf = 10⁻⁷, Rmax = 10, α = 0.05, m = 20, η = 0.1, and for the initial temperature, Ma = 5n, Mamax = 500, χ = 0.95, yielding solutions that are slightly worse than those of the Fortran implementation.

Table 2 and Table 3 give the results and the comparisons obtained for PerTSAUS2.m versus uniscalqa.m and for PerTSAUS2.m versus plinorder.m, respectively; thus, the PerTSAUS MatLab implementation is compared here with the Hubert et al. (2002) MatLab programs. In these tables, a positive number in a difference column indicates an improvement of the PerTSAUS result over the competing program (boldface in the original tables). For the distribution of normalized stress values, not only the minima but also the maxima found are reported. With respect to the quality of the local minima found, in the large majority of cases the range of normalized stress has smaller minima and maxima for PerTSAUS than for the others, although the differences are again not large (except for the smaller datasets with plinorder.m). With respect to run time efficiency, the results are different from those described in Section 5.1. For datasets up to d100, uniscalqa.m and plinorder.m clearly outperform PerTSAUS2.m. But for larger data sets, their run times accelerate rather severely (for plinorder.m this process starts later than for uniscalqa.m), compared to the steady increase of PerTSAUS2.m.

Table 2. uniscalqa.m results and comparison with PerTSAUS2.m.

File   | min(SL2)  △min(SL2) | max(SL2)  △max(SL2) | attr% | CPU time  CPU △(t)
numb   | 0.130314   0.000000 | 0.130314   0.000000 |  100  |    0.03     -5.31
d10    | 0.189780   0.000000 | 0.206319   0.000000 |   70  |    0.04     -5.59
d15    | 0.231264   0.000000 | 0.277329   0.045934 |   55  |    0.07     -8.43
d20    | 0.261345   0.000000 | 0.294503   0.033158 |   75  |    0.14    -11.92
d24    | 0.306959   0.002644 | 0.334007   0.020393 |    5  |    0.21    -14.67
d30    | 0.322892   0.000000 | 0.352526   0.020976 |    5  |    0.45    -19.18
morse  | 0.230495   0.000000 | 0.230495  -0.003470 |  100  |    0.66    -25.21
d40    | 0.362656   0.000460 | 0.381788   0.019118 |    5  |    1.16    -26.29
d50    | 0.358623   0.000084 | 0.375147   0.008036 |    5  |    2.53    -32.17
d60    | 0.378091   0.000070 | 0.395527   0.015420 |    5  |    6.01    -35.05
d70    | 0.381805   0.000097 | 0.400337   0.015070 |    5  |   11.46    -39.19
d80    | 0.388710   0.000360 | 0.402452   0.011417 |    5  |   20.48    -37.23
d90    | 0.399150   0.000826 | 0.407798   0.008147 |    5  |   36.75    -29.14
d100   | 0.405947  -0.000961 | 0.415990   0.006449 |    5  |   58.76    -18.07
d120   | 0.416490   0.002808 | 0.422077   0.005384 |    5  |  151.72     57.36
d140   | 0.421989   0.001587 | 0.427261   0.003644 |    5  |  338.37    220.39
d160   | 0.429494   0.001779 | 0.436707   0.007068 |    5  |  627.28    481.59
d180   | 0.431397   0.000594 | 0.436438   0.003823 |    5  | 1045.46    882.45
d200   | 0.438707   0.001961 | 0.442352   0.004441 |    5  | 1594.02   1403.34

Table 3. plinorder.m results and comparison with PerTSAUS2.m.

File   | min(SL2)  △min(SL2) | max(SL2)  △max(SL2) | attr% | CPU time  CPU △(t)
numb   | 0.130314   0.000000 | 0.130314   0.000000 |  100  |    0.08     -5.27
d10    | 0.189780   0.000000 | 0.243577   0.037258 |   30  |    0.08     -5.55
d15    | 0.231264   0.000000 | 0.294131   0.062736 |   45  |    0.15     -8.35
d20    | 0.261345   0.000000 | 0.304319   0.042973 |   75  |    0.26    -11.79
d24    | 0.304661   0.000346 | 0.320496   0.006882 |   30  |    0.42    -14.47
d30    | 0.322911   0.000019 | 0.346275   0.014725 |   20  |    0.70    -18.93
morse  | 0.230806   0.000311 | 0.239639   0.005674 |    5  |    1.06    -24.81
d40    | 0.363062   0.000867 | 0.369217   0.006547 |    5  |    1.53    -25.91
d50    | 0.359584   0.001044 | 0.369849   0.002738 |   20  |    2.93    -31.77
d60    | 0.378025   0.000004 | 0.387279   0.007172 |   30  |    5.25    -35.80
d70    | 0.381717   0.000009 | 0.386084   0.000817 |    5  |    8.59    -42.05
d80    | 0.389113   0.000763 | 0.393876   0.002841 |    5  |   14.54    -43.18
d90    | 0.398523   0.000199 | 0.399033  -0.000618 |   55  |   18.23    -47.67
d100   | 0.407104   0.000196 | 0.411931   0.002390 |   30  |   28.01    -48.83
d120   | 0.414156   0.000474 | 0.418510   0.001817 |    5  |   47.75    -46.61
d140   | 0.422011   0.001609 | 0.422482  -0.001135 |    5  |  104.65    -13.33
d160   | 0.430310   0.002595 | 0.430685   0.001046 |    5  |  138.41     -7.28
d180   | 0.430862   0.000059 | 0.433471   0.000856 |    5  |  195.42     32.41
d200   | 0.437256   0.000510 | 0.439536   0.001626 |    5  |  301.98    111.29

6. Conclusions

This paper presented a Simulated Annealing algorithm for Unidimensional Scaling using a strategy based on a weighted alternating process that uses random permutations and translations to locate the optimal configuration, both for least squares loss functions and for least absolute deviations. In principle, the present strategy can be readily extended to other loss functions, although

one would have to find suitable expressions of linear complexity for calculating changes in function values. It turned out that the availability of such expressions is critical for run time efficiency, and the ones used in the present paper, although valid for all Minkowski metrics, would no longer apply without alteration to loss functions outside this class. The extension of the method to Multidimensional Scaling is currently being developed by the authors. The test results showed that the proposed method is at least as good as, or better than, competing methods in terms of the quality of the minima found, although the differences are generally small. Contrary to the expectation in the literature, it turned out that SA might be suitable for use as a stand-alone method, because post-processing by iterative quadratic assignment methods did not improve the quality of the solutions found in the range of data sets tested here. In terms of run time efficiency, the current method was slower in small data sets compared to quadratic assignment (Hubert et al., 2002) for n ≤ 100, and compared to gradient smoothing (Pliner, 1996) for n ≤ 160, but it was considerably faster for data sets above these bounds. Compared to the SA method formulated by De Soete et al. (1988), and Brusco's (2001) two-stage method, run times of the proposed method were superior in all datasets, and much better for growing n. The latter effect is due not only to the simplicity of the coordinate updating operations but especially to the substantial reduction of complexity in the formulas for updating the loss function values during the process. This makes the present method more scalable for very large datasets.


Appendix A: Simulated Annealing Implementation of Defays' Method

Figure 4 shows the pseudo-code for the calculation of the initial value of the temperature, while Figure 5 shows the SA algorithm for Unidimensional Scaling in the Defays reformulation, called DefaysSA.for, using the same notation as in PerTSAUS.

INITIAL TEMPERATURE:
Initialize: Ma, Mamax, χ
Average ← 0
cont ← 0
Test ← 1
Repeat
    Generate a transposition ρ̄ of ρ
    dp ← (F(ρ̄) − F)
    If dp < 0 Then
        Average ← (Average + dp)
        cont ← (cont + 1)
    End(If)
    Test ← (Test + 1)
Until (Test = Mamax) or (cont = Ma)
T ← (Average / cont) / ln(χ)

Figure 4. Pseudo-code for the initial temperature in DefaysSA.for.


Read data: Read data matrix ∆ (n × n)
Initialize: ρ, a random permutation of {1, . . . , n}
Calculate: F ← F(ρ), using (3)
T ← INITIAL TEMPERATURE
Initialize parameters: LC, IC, γ, Tf, Rmax, m
Itmax ← ln(Tf / T) / ln(γ)
Iter ← 1
Repeat
    For 1 To LC Do
        Generate a transposition ρ̄ of ρ
        dp ← (F(ρ̄) − F)
        If dp > 0 or exp(dp/T) > Random(0, 1) Then
            ρ ← ρ̄
            F ← (F + dp)
        End(If)
    End(For)
    T ← (T ∗ γ)
    If MOD(Iter, m) = 0 Then LC ← (LC + IC)
    Update: ContRep ← REPETITION COUNTER
    Iter ← (Iter + 1)
Until (Iter = Itmax) or (ContRep = Rmax)
Calculate: X, using (4)

Figure 5. Pseudo-code of the DefaysSA.for algorithm.

References

AARTS, E. and KORST, J. (1989), Simulated Annealing and Boltzmann Machines: A Stochastic Approach to Combinatorial Optimization and Neural Computing, Chichester, UK; New York: Wiley.
ANDRIEU, C. and DOUCET, A. (1998), Simulated Annealing for Bayesian Estimation of Hidden Markov Models, TR 317, Department of Engineering, University of Cambridge.
BEARDEN, J.N. (2001), SASCAL: A Simulated Annealing Algorithm for Multidimensional Scaling, Technical Report, Department of Psychology, University of Maryland at College Park, College Park, MD 20742, USA.
BRUSCO, M.J. (2001), "A Simulated Annealing Heuristic for Unidimensional and Multidimensional (City-Block) Scaling of Symmetric Proximity Matrices," Journal of Classification, 18, 3–33.
BRUSCO, M.J. (2002), "Integer Programming Methods for Seriation and Unidimensional Scaling of Proximity Matrices: A Review and Some Extensions," Journal of Classification, 19, 45–67.
BRUSCO, M.J. and STAHL, S. (2000), "Using Quadratic Assignment Methods to Generate Initial Permutations for Least-Squares Unidimensional Scaling of Symmetric Proximity Matrices," Journal of Classification, 17, 197–223.
ČERNÝ, V. (1985), "Thermodynamical Approach to the Traveling Salesman Problem: An Efficient Simulation Algorithm," Journal of Optimization Theory and Applications, 45, 41–51.


DE SOETE, G., HUBERT, L.J., and ARABIE, P. (1988), "The Comparative Performance of Simulated Annealing on Two Problems of Combinatorial Data Analysis," in Data Analysis and Informatics (Vol. 5), Ed. E. Diday, Amsterdam: North Holland, 489–496.
DEFAYS, D. (1978), "A Short Note on a Method of Seriation," British Journal of Mathematical and Statistical Psychology, 31, 49–53.
DUFLO, M. (1996), Random Iterative Models, Applications of Mathematics, Vol. 34, Berlin; New York: Springer-Verlag.
GEMAN, S. and GEMAN, D. (1984), "Stochastic Relaxation, Gibbs Distributions and the Bayesian Restoration of Images," IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741.
HAJEK, B. (1988), "Cooling Schedules for Optimal Annealing," Mathematics of Operations Research, 13, 311–329.
HUBERT, L.J., ARABIE, P., and MEULMAN, J. (2002), "Linear Unidimensional Scaling in the L2-Norm: Basic Optimization Methods Using MATLAB," Journal of Classification, 19, 303–328.
KIRKPATRICK, S., GELATT, D., and VECCHI, M.P. (1983), "Optimization by Simulated Annealing," Science, 220, 671–680.
METROPOLIS, N.A., ROSENBLUTH, M., ROSENBLUTH, A., TELLER, A., and TELLER, E. (1953), "Equation of State Calculations by Fast Computing Machines," Journal of Chemical Physics, 21, 1087–1092.
MITRA, D., ROMEO, F., and SANGIOVANNI-VINCENTELLI, A. (1986), "Convergence and Finite-Time Behaviour of Simulated Annealing," Advances in Applied Probability, 18, 747–771.
PIZA, E., TREJOS, J., and MURILLO, A. (2002), "Combinatorial Optimization Heuristics in Partitioning with Non-Euclidean Distances," Investigación Operacional, 23 (1), 3–14.
PLINER, V. (1996), "Metric Unidimensional Scaling and Global Optimization," Journal of Classification, 13, 3–18.
ROTHKOPF, E.A. (1957), "A Measure of Stimulus Similarity and Errors in Some Paired Associate Learning," Journal of Experimental Psychology, 53, 94–101.
SHEPARD, R.N., KILPATRIC, D.W., and CUNNINGHAM, J.P. (1975), "The Internal Representation of Numbers," Cognitive Psychology, 7, 82–138.
TREJOS, J., MURILLO, A., and PIZA, E. (1998), "Global Stochastic Optimization for Partitioning," in Advances in Data Science and Classification, Eds. A. Rizzi, M. Vichi, and H.-H. Bock, Heidelberg: Springer, 185–190.
VAN LAARHOVEN, P.J. and AARTS, E.H.L. (1987), Simulated Annealing: Theory and Applications, Dordrecht, The Netherlands: D. Reidel Publishing Company.
WINKLER, G. (1995), Image Analysis, Random Fields and Dynamic Monte Carlo Methods, New York: Springer-Verlag.