The other uses of Multidimensional Scaling in Statistics

C. M. Cuadras and J. Fortiana

Departament d'Estadística, Universitat de Barcelona

Work supported in part by CGYCIT grant PB93-0784. Authors' address: Departament d'Estadística, Universitat de Barcelona, Diagonal 645, 08028 Barcelona, Spain. E-mail: C. M. Cuadras, [email protected]; J. Fortiana, [email protected]. Date: July 10, 1995.

Abstract: Multidimensional Scaling is a useful tool to represent a finite set in an appropriate graphical display. But MDS can do much more in statistics, classification and data analysis. It is shown in this contribution that MDS and related methods based on distances provide techniques and solutions to a wide field of topics: distance-based regression with mixed variables and nonlinear regression; MDS interpretation of ridge regression; MDS representation of Hoeffding's maximum correlations; representing parametric estimable functions and comparing models in MANOVA; examining principal dimensions of a tree scaled in a Euclidean configuration; MDS representation of a statistical model; distance-based approach in discrimination and classification; MDS representation of a random variable. All these applications show that MDS opens the way to new applications of this method and helps to give us a better understanding of the structure of data.

1 Introduction

The earliest books on multivariate analysis, e.g., Kendall (1957), Anderson (1958), Seal (1964), Morrison (1967), Dempster (1969) and Kshirsagar (1972), did not contain chapters or sections on multidimensional scaling (MDS). Later, due to the impact of papers on MDS by R. N. Shepard, J. C. Gower, J. B. Kruskal, J. C. Lingoes and others, as well as the books (Shepard, Romney, and Nerlove 1972; Romney, Shepard, and Nerlove 1972), interest in this method increased. Thus MDS is present in books on multivariate analysis such as Bertier and Bouroche (1975), Cuadras (1981b), Greenacre (1984), Mardia et al. (1979), Seber (1984), and is the main subject of, e.g., Kruskal and Wish (1978) and Davison (1983). Generally, the applications of MDS contained in these books illustrate the representation of objects in a low-dimensional Euclidean space and the interpretation of the dimensions. But in Davison (1983) we find an interesting and different application. Davison gives an MDS alternative to Thurstone's model of preferences. If $s_1, \ldots, s_n$ are $n$ stimuli to be scaled on a preference dimension, the scale values $x_1, \ldots, x_n$ are obtained from the model

$p_{ij} = \Phi(x_i - x_j),$

where $p_{ij} = \Pr[s_i \text{ is chosen over } s_j]$ is estimated on a sample of subjects and $\Phi(\cdot)$ is the $N(0,1)$ cdf. Davison proposed the following distance between $s_i$ and $s_j$:

$\delta_{ij} = |p_{ij} - 0.5|,$


which is a monotonically increasing function of $|x_i - x_j|$. The first MDS dimension provides the scale of preferences, giving quite a similar solution but obtained in a simpler way. Recently, Martín and Galindo (1994) also used MDS to build up a utility function, reflecting a scale of preferences on the set of possible consequences in a selective process in an environment of risk.

Throughout this paper, we work with $n$ objects labelled $E_1, E_2, \ldots, E_n$ (representing observations, individuals, populations, estimable functions, etc.) and an $n \times n$ distance matrix $\Delta = (\delta_{ij})$, obtained in an appropriate way, e.g., by taking measurements of $p$ observable variables on the experimental units. $X$ will denote the $n \times m$ matrix of Euclidean coordinates obtained via metric or non-metric MDS. Hence the Euclidean distance between points whose coordinates are rows of $X$ reproduces (exactly or approximately) the matrix $\Delta$. If $X$ is a principal coordinate solution, then $B = T \Lambda T'$, where $B$ is the inner product matrix related to $\Delta$, $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_m)$ contains the positive eigenvalues of $B$ in decreasing order, and $X = T \Lambda^{1/2}$.
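As a concrete illustration of this notation (not part of the original paper), the following Python/NumPy sketch computes a principal coordinate solution from a distance matrix; the function name, the tolerance and the toy data are our own choices.

import numpy as np

def principal_coordinates(delta):
    """Classical (metric) MDS: principal coordinates of a distance matrix.

    delta : (n, n) symmetric matrix of distances delta_ij.
    Returns (X, lam), where X = T Lambda^{1/2} is built from the positive
    eigenvalues of the inner-product matrix B = -1/2 H Delta^(2) H.
    """
    n = delta.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n            # centring matrix
    B = -0.5 * H @ (delta ** 2) @ H                # inner-product matrix
    lam, T = np.linalg.eigh(B)
    idx = np.argsort(lam)[::-1]                    # decreasing eigenvalues
    lam, T = lam[idx], T[:, idx]
    pos = lam > 1e-10                              # keep the positive part
    return T[:, pos] * np.sqrt(lam[pos]), lam[pos]

# toy check: distances among 4 points on a line are reproduced exactly
pts = np.array([0.0, 1.0, 3.0, 6.0])
delta = np.abs(pts[:, None] - pts[None, :])
X, lam = principal_coordinates(delta)
recon = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
print(np.allclose(recon, delta))                   # True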

2 The distance-based regression model

2.1 Metric MDS and regression

Regression analysis is possibly the most widely used, but also abused, statistical technique. A clear example of the latter is the linear regression of a continuous response variable $y$ on a set $w$ of $p$ mixed (continuous, binary, nominal, ordinal) explanatory variables: even if theoretical difficulties are ignored, grossly divergent results can be obtained as a consequence of different ways of coding the categorical predictors. Arbitrariness can also cause trouble in the following regression framework: the predictors are continuous, a linear model fits poorly, and contextual knowledge of the data (e.g., a physical model) does not suggest a clear candidate $f$ for building a nonlinear model $y = f(w; \boldsymbol{\theta}) + e$.

The distance-based (DB) model for prediction is an MDS alternative in regression proposed and studied by Cuadras (1989) and Cuadras and Arenas (1990). Given $n$ experimental units and a distance matrix $\Delta$, computed on the basis of the (mixed) explanatory variables, the matrix of principal coordinates $X$ is obtained and $y$ is regressed on $X$, i.e., the DB model is

$\mathbf{y} = \beta_0 \mathbf{1} + X_{(k)} \boldsymbol{\beta}_{(k)} + \mathbf{e}_{(k)},$   (1)

where $\mathbf{y}$ is the $n \times 1$ vector of observations on $y$, $\beta_0$ and $\boldsymbol{\beta}_{(k)}$ are parameters, etc., and $X_{(k)}$ is a matrix containing a suitable set of $k$ columns extracted from $X$. The main features of model (1), for $X_{(k)} = X$, are given below. Similar formulas hold when $X_{(k)}$ is a subset of columns of $X$.

1. The coefficient of determination is given by

$R^2 = \dfrac{\mathbf{y}' X \Lambda^{-1} X' \mathbf{y}}{\sum_{i=1}^{n} (y_i - \bar{y})^2}.$   (2)

2. The prediction of $y$ for a new observation $E_{n+1}$ is given by

$\hat{y}_{n+1} = \bar{y} + \tfrac{1}{2} (\mathbf{b} - \mathbf{d})' B^{-} \mathbf{y},$   (3)

where $\mathbf{d} = (\delta_1^2, \ldots, \delta_n^2)'$ is the vector of squared distances from $E_{n+1}$ to each of $E_1, E_2, \ldots, E_n$, $\mathbf{b} = (b_{11}, \ldots, b_{nn})'$ contains the diagonal elements of $B$ and $B^{-}$ is the Moore-Penrose g-inverse of $B$.

3. Model (1) reduces to the classic regression model when the $x$ variables are of the continuous type and the Euclidean distance is used. The same is true when the explanatory variables are categorical, each state is coded as a binary variable and the distance based on the matching coefficient is chosen.

4. DB is quite general, as we can use any suitable distance between observations. The real advantage appears with mixed variables. In this case, the distance function associated with Gower's similarity coefficient (Gower 1971a) is a good choice, and has the additional benefit of being able to handle missing values.

5. The DB model with the distance

$\delta_{ij}^2 = |w_{i1} - w_{j1}| + \cdots + |w_{ip} - w_{jp}|$

can replace a nonlinear regression model

$y_i = f(w_{i1}, \ldots, w_{ip}; \boldsymbol{\theta}) + e_i, \quad i = 1, \ldots, n.$

For $p = 1$, the DB model is equivalent to an ordinary regression on Chebyshev polynomials of the first kind. The advantage of this model is manifested when $p > 1$ and $f$ is unknown (Cuadras and Fortiana 1993).

6. If $\delta_{ij}(k)$ is the Euclidean distance between $E_i$ and $E_j$ using the selected coordinates $X_{(k)}$, and $\lambda_1, \ldots, \lambda_k$ are the related eigenvalues, then we have the important equality of metric MDS:

$\sum_{i,j=1}^{n} \delta_{ij}^2(k) = 2 n \sum_{l=1}^{k} \lambda_l.$   (4)

The squared distance between $E_i$ and $E_j$, considering the vector $\hat{\mathbf{y}}$ of fitted values according to (1), is $(\hat{y}_i - \hat{y}_j)^2$. Then it is satisfied that

$\sum_{i,j=1}^{n} (\hat{y}_i - \hat{y}_j)^2 = 2 n \sum_{l=1}^{k} \hat{\beta}_l^2 \lambda_l,$   (5)

where $\hat{\beta}_l$ are the OLS estimates of the regression parameters. This formula can be understood as the DB version of (4).
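A minimal sketch of how model (1) and the prediction formula (3) can be computed, assuming the notation above; the helper names, the use of pinv for the g-inverse $B^-$ and the truncation to the $k$ largest eigenvalues are our choices, not the authors' software.

import numpy as np

def db_regression_fit(delta, y, k):
    """Distance-based regression: fit model (1) on the first k principal coordinates."""
    n = len(y)
    H = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * H @ (delta ** 2) @ H                # inner-product matrix
    lam, T = np.linalg.eigh(B)
    idx = np.argsort(lam)[::-1][:k]                # k largest eigenvalues (assumed positive)
    lam, T = lam[idx], T[:, idx]
    Xk = T * np.sqrt(lam)                          # principal coordinates X_(k)
    beta0 = y.mean()                               # columns of Xk are centred
    beta = (Xk.T @ y) / lam                        # OLS, since Xk' Xk = diag(lam)
    return {"B": B, "ybar": beta0, "yhat": beta0 + Xk @ beta}

def db_predict_new(fit, delta_new, y):
    """Prediction (3) for a new unit: ybar + 1/2 (b - d)' B^- y."""
    B = fit["B"]
    b = np.diag(B)                                 # diagonal elements b_ii of B
    d = delta_new ** 2                             # squared distances to E_1, ..., E_n
    return fit["ybar"] + 0.5 * (b - d) @ np.linalg.pinv(B) @ y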

Example 1 To illustrate the method, as well as its comparison with a nonlinear regression model, we consider a data set taken from (Draper and Smith 1981, p. 284). The model is

$y = (\alpha_0 + \alpha_1 z + \alpha_2 z^2) + (\beta_0 + \beta_1 z + \beta_2 z^2)\, w_1 + (\gamma_0 + \gamma_1 z + \gamma_2 z^2)\, w_2 + \epsilon,$

where $z = \log(w_3 + 1)$. Table 1 shows the fitted values using nonlinear regression ($f$ known) and DB regression ($f$ unknown).


Table 1: Comparison of predictions obtained with a nonlinear model and the DB model.

 w1    w2    w3       y      ŷ (Nonlin.)  ŷ (DB, k=3)  ŷ (DB, k=10)
 47.1  33.9     7.5  11.97   11.8972      12.614       12.032
 72.9  33.9   750.0   8.63    8.6716       8.7416       8.6155
 47.1   8.1   750.0   8.80    8.8416       8.7416       8.8706
 60.0  21.0    75.0  10.73   10.6758      10.655       10.678
 60.0  21.0    75.0  10.69   10.6758      10.655       10.678
 72.9   8.1     7.5  13.12   13.0472      12.614       12.922
 47.1   8.1     7.5  12.58   12.6422      12.614       12.604
 72.9  33.9     7.5  12.24   12.3022      12.614       12.349
 60.0  21.0    75.0  10.64   10.6758      10.655       10.678
 72.9   8.1   750.0   9.09    9.0716       8.7416       9.1881
 47.1  33.9   750.0   8.46    8.4416       8.7416       8.2980
 60.0  21.0    75.0  10.65   10.6758      10.655       10.678
 60.0  21.0  3000.0   7.60    7.5757       7.6038       7.5884
 60.0  21.0     3.0  13.06   13.0815      12.445       13.060
 39.0  21.0    75.0  10.51   10.5608      10.733       10.510
 60.0   0.0    75.0  11.22   11.1658      10.733       11.220
 60.0  21.0    75.0  10.67   10.6758      10.655       10.678
 60.0  42.0    75.0  10.24   10.1858      10.733       10.240
 81.0  21.0    75.0  10.74   10.7908      10.733       10.740
 60.0  21.0    75.0  10.69   10.6758      10.655       10.678
 Residual S.S.                0.03989      1.9451       0.10209

2.2 Non-metric MDS and ridge regression

As formulated in the previous section, the DB model requires that $\Delta$ verify the Euclidean property, i.e., that the associated inner product matrix $B$ be positive semi-definite. However, a non-Euclidean $\Delta$ can occur in practice, a notorious example being that of processing a data set with missing values using Gower's similarity coefficient. To deal with this problem, Cuadras et al. (1994) propose the following transformation:

$\Delta^{(2)}_a = \Delta^{(2)} - 2a\,(J - I),$   (6)

where $\Delta^{(2)} = (\delta_{ij}^2)$, $J$ is the matrix of ones, $I$ is the identity matrix and $a$ is a suitable real constant. Transformation (6) is well known in the context of MDS (Lingoes 1971; Mardia 1978). Each eigenvalue $\lambda_i$ of $B$ changes to $\lambda_i + a$, the eigenvectors remaining the same. Hence, if $a \ge |\lambda_m|$, where $\lambda_m$ is the smallest (possibly negative) eigenvalue of $B$, we obtain a positive semi-definite matrix $B(a) = B + a H$, where $H$ is the centring matrix. The prediction (3) of $y$ on $E_{n+1}$ can now be expressed as

$\hat{y}_{n+1}(a) = \bar{y} + \tfrac{1}{2} (\mathbf{b} - \mathbf{d})'\, T\, (\Lambda + a I)^{-1}\, T'\, \mathbf{y},$   (7)

where $B = T \Lambda T'$. This prediction depends on $a$ and has the following advantages. Firstly, we avoid a non-Euclidean matrix. Secondly, if $B$ has very small eigenvalues, (3) can yield erroneous predictions, since it depends on the inverse of the eigenvalues; this problem is easily overcome if we add $a$ to each eigenvalue. Finally, we can recognize in (7) a procedure similar to the ridge regression approach, a useful technique to improve the accuracy of the estimation of the parameters (Hoerl and Kennard 1970a, 1970b). Hence, Lingoes' transformation (6) yielding (7) is related to ridge regression (Euclidean distance, classic regression) and, generally, can be interpreted under this perspective when applied to DB with any other distance.

It is necessary to remark that (6) is not recommended for obtaining a Euclidean distance matrix when the aim is to represent objects along axes, as there are better non-metric MDS techniques. The purpose of (6)-(7) is not to represent objects but to predict a response variable.
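A hedged sketch of the transformation (6)-(7): the constant $a$ is added to the non-trivial eigenvalues of $B$ before inverting, in the spirit of ridge regression. The function name and the tolerance used to drop the trivial null direction (the constant vector) are our assumptions.

import numpy as np

def db_predict_ridge(delta, y, delta_new, a):
    """Prediction (7): Lingoes-type constant a added to each eigenvalue of B.

    delta may be non-Euclidean, so B can have negative eigenvalues;
    choosing a >= |lambda_min| makes them non-negative.
    """
    n = len(y)
    H = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * H @ (delta ** 2) @ H
    lam, T = np.linalg.eigh(B)
    keep = np.abs(lam) > 1e-10          # drop the trivial null direction (constant vector)
    lam, T = lam[keep], T[:, keep]
    b = np.diag(B)
    d = delta_new ** 2
    # y_hat(a) = ybar + 1/2 (b - d)' T (Lambda + a I)^{-1} T' y, cf. (7)
    return y.mean() + 0.5 * (b - d) @ T @ ((T.T @ y) / (lam + a))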

3 MDS of a maximum correlation matrix

Let $X$ and $Y$ be second-order random variables with cumulative distribution functions $F$ and $G$, respectively. Hoeffding (1940) showed that if the joint cumulative distribution function of $(X, Y)$ is

$H^{+}(x, y) = \min\{F(x), G(y)\},$

called the upper Fréchet bound of $F$ and $G$, then the following functional relation holds (a.s.):

$F(X) = G(Y),$   (8)

and the correlation coefficient $\rho(X, Y)$ attains the maximum value, $\rho^{+}(F, G)$, of the set of correlation coefficients of bivariate distributions having $F$ and $G$ as marginals. This maximum correlation can perform as a measure of agreement between distribution functions, up to an affine transformation in the $(X, Y)$ space, since it equals 1 iff $F^{*} = G^{*}$, where $F^{*}$ and $G^{*}$ are the distribution functions of the variables standardized to mean 0 and variance 1. The maximum correlation between standardized variables can be computed by means of

$\rho^{+}(F, G) = \int_{0}^{1} F^{-1}(p)\, G^{-1}(p)\, dp,$   (9)

as easily follows from (8). With the exception of some special cases (Moran 1967; Fujita 1979), a closed formula for $\rho^{+}$ is quite difficult to derive. Some calculations, via simulation, were given by Shih and Huang (1992). For a lower approximation to $\rho^{+}$, when $F$ and $G$ are absolutely continuous, see (Cuadras 1992a). In general, a numerical evaluation of the improper integral (9) will be needed.

A set $\{F_1, \ldots, F_n\}$ of distribution functions has an associated $n \times n$ matrix $R^{+}$ of maximum correlations. The positive semidefiniteness property ensures that the result of performing a metric scaling on $R^{+}$ is a Euclidean representation of the given set. As an illustration, Table 2 contains the matrix $R^{+}$ for the distributions U = Uniform, E = Exponential, N = Normal, G = Gamma (with shape parameter $p = 8$), L = a log-normal distribution (the logarithm of which is standard normal), as well as the empirical distribution I of the variable "sepal length" of Iris setosa in Fisher's Iris data. The values of omitted parameters are irrelevant to the results. Suppose now that we wish to ascertain the distribution of I and we consider U, E, N, G, L as the candidates (Cuadras and Fortiana 1994). Performing a metric scaling on $R^{+}$, we obtain Figure 1, revealing that possibly the sample I comes from a Normal distribution. Of course, we reach the same conclusion noting that $\rho^{+}(N, I)$ is the largest correlation, but this graph is a helpful tool to understand the proximity relations between I and the other distributions.
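Since (9) usually has to be evaluated numerically, the following sketch (using SciPy quantile functions and quad; the function name is ours) computes $\rho^{+}$ between two standardized distributions. For the Uniform and the Exponential the exact value is $\sqrt{3}/2 \approx 0.8660$, the entry reported for (U, E) in Table 2.

import numpy as np
from scipy import stats
from scipy.integrate import quad

def rho_plus(ppf_f, mean_f, std_f, ppf_g, mean_g, std_g):
    """Maximum (Hoeffding) correlation, eq. (9): integral over (0,1) of the
    product of the standardized quantile functions."""
    integrand = lambda p: ((ppf_f(p) - mean_f) / std_f) * ((ppf_g(p) - mean_g) / std_g)
    value, _ = quad(integrand, 0.0, 1.0, limit=200)
    return value

# Uniform(0,1) versus Exponential(1): exact value sqrt(3)/2 ~ 0.8660
u, e = stats.uniform(), stats.expon()
print(rho_plus(u.ppf, u.mean(), u.std(), e.ppf, e.mean(), e.std()))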

4 MDS in MANOVA

Let

$Y = X \Gamma + E$   (10)

Table 2: Matrix R+ of maximum correlations.

      U       E       N       G       L       I
 U    1
 E    0.8660  1
 N    0.9772  0.9032  1
 G    0.9472  0.9772  0.9730  1
 L    0.6877  0.8928  0.7628  0.8716  1
 I    0.9738  0.8985  0.9871  0.9660  0.7452  1

Figure 1: Principal Coordinate representation of the distributions Uniform (U), Exponential (E), Normal (N), Gamma (G), Log-Normal (L) and the empirical distribution (I).

be the multivariate linear model (see e.g. Mardia et al. 1979, p. 157), where $Y$ ($n \times p$) and $X$ ($n \times q$) are known matrices, $\Gamma$ ($q \times p$) and $E$ ($n \times p$) are unknown matrices containing regression parameters and random errors, respectively. We suppose that the rows of $E$ are i.i.d. as $N_p(0, \Sigma)$. MDS or Principal Coordinate Analysis can be performed on linear model (10) in two ways:

4.1 MDS on parametric estimable functions

Suppose that (10) is the model for a multivariate one-way classification. Then the rows of $Y$ can be regarded as random samples from the populations $\Pi_j = N_p(\boldsymbol{\mu}_j, \Sigma)$, $j = 1, \ldots, g$. That is, if $n = n_1 + \cdots + n_g$, we have $g$ samples, each consisting of $n_j$ observations from $\Pi_j$. A technique of great interest is Canonical Variate Analysis, which represents the estimated means $\hat{\boldsymbol{\mu}}_1, \ldots, \hat{\boldsymbol{\mu}}_g$ by assigning canonical coordinates. The populations, described by the means, are projected on the (orthogonal) canonical variates, giving a useful representation in a low-dimensional space (Rao 1952; Seal 1964; Mardia et al. 1979; Seber 1984). Although the formulation of Canonical Variate Analysis is made in terms of the eigenvectors and eigenvalues of $B$ (between-samples SSP matrix) with respect to $W$ (within-samples SSP matrix), similar results are reached if an MDS is performed on the Mahalanobis distance matrix between the means (Gower 1966).

A generalization is as follows. Let

$\boldsymbol{\psi}_i = \mathbf{p}_i'\, \Gamma, \quad i = 1, \ldots, k,$   (11)

be a set of parametric estimable functions. Therefore, each row $\mathbf{p}_i'$ is a linear combination of the rows of $X$. We define the squared distance between $\boldsymbol{\psi}_i$ and $\boldsymbol{\psi}_j$ as

$\delta_{ij}^2 = (\boldsymbol{\psi}_i - \boldsymbol{\psi}_j)'\, \Sigma^{-1}\, (\boldsymbol{\psi}_i - \boldsymbol{\psi}_j).$   (12)

If $\hat{\Gamma}$ is the OLS estimation of $\Gamma$, this distance can be estimated replacing $\boldsymbol{\psi}_i$ (and similarly $\boldsymbol{\psi}_j$) by the Gauss-Markov estimator $\hat{\boldsymbol{\psi}}_i = \mathbf{p}_i'\, \hat{\Gamma}$, and $\Sigma$ by $\hat{\Sigma} = Y'\, (I - X (X'X)^{-} X')\, Y / (n - r)$, where $r = \mathrm{rank}(X)$ (see Rao 1973). Performing an MDS on the distance matrix $\Delta$, we have a geometric representation of (11). This extension was proposed by Cuadras (1974) and a list of applications can be found in Cuadras (1981b). See also Coll et al. (1980) and Ahrens and Lantner (1974).
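As a sketch of the MDS route just described (Gower 1966), the following code computes the pooled within-groups estimate of $\Sigma$, the squared Mahalanobis distances between the estimated means (cf. (12) for the one-way case), and their principal coordinates. The function name and the pooling convention are our assumptions, not code from the paper.

import numpy as np

def canonical_coords_via_mds(samples):
    """MDS on the Mahalanobis distance matrix between group means.

    samples : list of (n_j, p) arrays, one per group (n_j >= 2 assumed).
    Returns the principal coordinates of the g estimated means.
    """
    g = len(samples)
    n = sum(len(s) for s in samples)
    means = np.array([s.mean(axis=0) for s in samples])
    # pooled within-groups covariance: W / (n - g)
    W = sum((len(s) - 1) * np.cov(s, rowvar=False) for s in samples)
    S_inv = np.linalg.inv(W / (n - g))
    # squared Mahalanobis distances between the estimated means
    diff = means[:, None, :] - means[None, :, :]
    d2 = np.einsum('ijk,kl,ijl->ij', diff, S_inv, diff)
    # principal coordinates of the distance matrix
    H = np.eye(g) - np.ones((g, g)) / g
    B = -0.5 * H @ d2 @ H
    lam, T = np.linalg.eigh(B)
    idx = np.argsort(lam)[::-1]
    lam, T = lam[idx], T[:, idx]
    pos = lam > 1e-10
    return T[:, pos] * np.sqrt(lam[pos])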

Example 2 Figure 2 is the canonical representation of 4 treatments (A = control, B = Urea, C = Calcium and Potassium nitrates, D = Ammonium and Sulphates) given to 48 apple trees divided into four blocks. The 6 observable variables were the contents of several minerals. The data were taken from Ratkowski and Martin (1974). MANOVA revealed a significant difference between treatments and Figure 2 is a graphic way of showing this difference.

Example 3 Figures 3 and 4 show classic canonical representations of 8 populations of Coleoptera of the genus Timarcha, males and females respectively, obtained by taking measurements on five biometric variables (adapted from Petitpierre and Cuadras 1977; Cuadras 1981a). Population 6 is closer to 4 and 5 for males but closer to 3 for females. Both relations seem contradictory from a taxonomic point of view. Using a crossed classification sex × population (with interaction) and performing a representation of the main effects related to the population, we obtain Figure 5, in which population 6 appears as separated from 3, 4 and 5, independently of sex.

Figure 2: Canonical representation of four treatments in an experimental design (treatments and blocks) with 6 observable variables. Circles represent 95% confidence regions for the parametric functions $\psi_i = \mu + \alpha_i$.

Figure 3: Canonical variate analysis of 8 populations of males of the genus Timarcha (Coleoptera), from Petitpierre and Cuadras (1977). Circles represent 95% confidence regions for the means of the populations.

Figure 4: Canonical variate analysis of 8 populations of females of the genus Timarcha (Coleoptera), from Petitpierre and Cuadras (1977). Circles represent 95% confidence regions for the means of the populations.

Figure 5: Canonical variate analysis of 8 populations of the genus Timarcha (Coleoptera), from Petitpierre and Cuadras (1977), males + females, removing the influence of the sex effect. Circles represent 95% confidence regions for the parametric functions $\psi_i = \mu + \alpha_i$, $i = 1, \ldots, 8$.

4.2 Comparing models

Suppose now that we have several multivariate linear models

$Y_i = X_i\, \Gamma_i + E_i, \quad i = 1, \ldots, k.$   (13)

Assuming $X_i = X$, $i = 1, \ldots, k$, and a common $\Sigma$, a squared distance between two models is given by

$\delta_{ij}^2 = \mathrm{tr}\,[(\Gamma_i - \Gamma_j)'\, X' X\, (\Gamma_i - \Gamma_j)\, \Sigma^{-1}],$   (14)

where, in the applications, the parameters must be estimated from the data. Note that $\mathrm{vec}(Y_i)$ is $N(\mathrm{vec}(X_i \Gamma_i), \Sigma \otimes I_n)$, hence (14) is essentially a Mahalanobis distance. Aspects of this distance (inference, case of different $X$'s, heteroscedasticity, etc.) were studied by Ríos and Cuadras (1986). Ríos et al. (1992) presented further aspects, emphasizing connections with the Rao distance.

The Euclidean distance matrix $\Delta$ allows us to carry out both a classification and an ordination of the models (13). Ríos et al. (1992) showed a medical application consisting of longitudinal observations of glucose/insulin from 171 children, discussing and justifying the need for a linear model for each observation, as it has a parsimonious effect on the data, providing some advantages with respect to a more classic approach. The authors found 7 groups, which can be identified according to standard diabetes terminology.
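The distance (14) is straightforward to compute once the parameter matrices are estimated. The following minimal sketch (function names and argument conventions are ours) assumes a common $X$ and $\Sigma$; the resulting distance matrix over several fitted models can then be fed to a principal-coordinates computation such as the one sketched in the Introduction, or to a clustering algorithm.

import numpy as np

def model_distance2(Gamma_i, Gamma_j, X, Sigma):
    """Squared distance (14) between two multivariate linear models sharing X and Sigma:
    delta_ij^2 = tr[(Gamma_i - Gamma_j)' X'X (Gamma_i - Gamma_j) Sigma^{-1}]."""
    D = Gamma_i - Gamma_j
    return float(np.trace(D.T @ X.T @ X @ D @ np.linalg.inv(Sigma)))

def model_distance_matrix(Gammas, X, Sigma):
    """Distance matrix over k fitted models, suitable for metric MDS or clustering."""
    k = len(Gammas)
    delta = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            delta[i, j] = delta[j, i] = np.sqrt(model_distance2(Gammas[i], Gammas[j], X, Sigma))
    return delta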

5 MDS on a tree

Let us start this section by recalling that if $\Delta = (\delta_{ij})$ is an ultrametric distance matrix defined on a finite set $\mathcal{E} = \{E_1, \ldots, E_n\}$, with $n \ge 3$, i.e.,

$\delta_{ij} \le \max\{\delta_{ik}, \delta_{jk}\}, \quad 1 \le i, j, k \le n,$

we can construct a Hierarchical Clustering Scheme $(\mathcal{C}, \alpha)$, where $\mathcal{C}$ is a class of nested clusters and $\alpha$ is a level function directly related to $\Delta$. Then $\mathcal{E}$ can be represented by an ultrametric tree.

Hierarchical clustering and MDS have been widely used in combination for studying proximity data. A recent example, a comparative study of 13 European languages (Oliva et al. 1993), is depicted in Figures 6 and 7. Several authors have studied the inter-relations between both models (Kruskal 1977; Ohsumi and Nakamura 1981; Pruzansky, Tversky, and Carroll 1982; Bock 1987; Critchley 1988). Gower (1971b) conjectured that $\Delta$ is a Euclidean distance matrix. This conjecture was proved by Holman (1972), showing that if $\Delta$ is definite, then $\mathcal{E}$ can be embedded in $\mathbb{R}^{n-1}$. Consequently, a low-dimensional representation of a tree seems to be in conflict with the full dimension $n - 1$ needed according to Holman. In contrast, experience with real data shows that such a conflict does not exist: two-dimensional Euclidean representations are often good enough to display $\mathcal{E}$, even when $n$ is large. Under this motivation two explanations were given. The first one (Critchley and Heiser 1988) showed that a tree can be scaled, in a special way, using only one dimension. The second one (Cuadras and Oller 1987) explored the Principal Coordinates solution $X$ obtained from $\Delta$ and showed that the first coordinates essentially represent maximal clusters $\mathcal{E}_1, \ldots, \mathcal{E}_m$, while each $\mathcal{E}_i$ is represented by a system of independent coordinates.

Figure 6: Principal Coordinate representation of 13 European languages. Abbreviations: Ba = Basque, Ct = Catalan, Da = Danish, Du = Dutch, En = English, Fr = French, Ga = Galician, Ge = German, Hu = Hungarian, It = Italian, No = Norwegian, Pl = Polish, Sp = Spanish.

Figure 7: Ultrametric tree constructed from data of 13 European languages (UPGMA method).

5.1 One-dimensional representation

Critchley and Heiser (1988) introduce an order in $\mathcal{E}$ such that $\Delta$ satisfies the Robinson property

$i \le j \le k \;\Rightarrow\; \max\{\delta_{ij}, \delta_{jk}\} \le \delta_{ik}.$

Then the elements of $\Delta$ increase when moving away from the diagonal. This property (initially used in seriation) essentially means that the objects can be ordered "chronologically". The one-dimensional scale

$0 = x_1 < x_2 < \cdots < x_n$   (15)

is obtained from an $(n-1)$-vector $\mathbf{t}$ with entries $t_j = x_{j+1} - x_j$. The matrix of one-dimensional distances $D = (d_{ij})$ is given by

$d_{ij} = d_{ji} = \sum_{k=i}^{j-1} t_k, \quad \text{for } i < j,$

and there is a monotonic relation between $D$ and $\Delta$. Hence $\mathcal{E}$ can be represented in dimension one. The solution, however, is not unique and the authors give suggestions for obtaining a good approximation $D$ to $\Delta$.

5.2 Full-dimensional representation

Cuadras and Oller (1987) also reorder $\mathcal{E}$ in such a way that

$\mathcal{E} = \mathcal{E}_1 + \cdots + \mathcal{E}_r + \mathcal{E}_{r+1} + \cdots + \mathcal{E}_m,$   (16)

where $r$ is such that each $\mathcal{E}_i$, $1 \le i \le r$, is a maximal cluster with $n_i > 1$ equidistant objects and, if $r < m$, each $\mathcal{E}_j$ for $j > r$ contains an isolated object. Let $h_i$ be the common distance between any pair of objects in $\mathcal{E}_i$, $i = 1, \ldots, r$. A metric MDS on $\Delta$ shows that

$\lambda_i = \tfrac{1}{2} h_i^2 \le \lambda, \quad i = 1, \ldots, r,$

where $\lambda$ is the largest eigenvalue of $B$ and the inequality is strict in nondegenerate configurations. Additionally, $\min\{\lambda_1, \ldots, \lambda_r\} = \min\{\tfrac{1}{2}\delta_{ij}^2,\; i \ne j\} > 0$ is the minimum eigenvalue of $B$. The metric MDS solution $X$ can be partitioned as

$X = (X_0 \mid X_1 \mid \cdots \mid X_r),$

where:

a) Each $X_i$, $1 \le i \le r$, describes $\mathcal{E}_i$ and spans an $(n_i - 1)$-dimensional eigenspace of $B$ with eigenvalue $\lambda_i$.

b) $X_0$ contains the remaining columns of $X$ and describes the clusters $\mathcal{E}_1, \ldots, \mathcal{E}_m$. The Euclidean distance $d_0$ obtained by using the coordinates $X_0$ does not separate objects inside $\mathcal{E}_i$ if $i \le r$, but the isolated objects, if any, are well represented.

c) The squared distance $d_0^2$ satisfies the additive inequality (the four-points condition). Thus $\mathcal{E}_1, \ldots, \mathcal{E}_m$ can be represented as an additive tree.

Bock (1987) relates a distance matrix $\Delta$ to a given classification, clustering or partition of $\mathcal{E}$. Given a classification $\mathcal{C} = (\mathcal{E}_1, \ldots, \mathcal{E}_m)$, Bock finds a simultaneous Euclidean representation of the $n$ objects of $\mathcal{E}$ and the $m$ clusters of $\mathcal{C}$, described by the matrices of coordinates $X^{*}$ and $Y^{*}$, respectively, such that the related scalar products fit $B$ as well as possible. The solution is $X^{*} = Q^{-1/2} X$ and $Y^{*} = M' X^{*}$, where $Q$ and $M$ are defined below.
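The relation $\lambda_i = \tfrac{1}{2} h_i^2$ stated above can be checked numerically on a toy ultrametric matrix; the matrix below is our own, chosen only for illustration.

import numpy as np

# A toy ultrametric matrix: cluster {1,2} at distance 1, cluster {3,4} at
# distance 2, and distance 4 between the two clusters.
delta = np.array([[0., 1., 4., 4.],
                  [1., 0., 4., 4.],
                  [4., 4., 0., 2.],
                  [4., 4., 2., 0.]])
n = delta.shape[0]
H = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * H @ (delta ** 2) @ H
print(np.sort(np.linalg.eigvalsh(B))[::-1])
# approximately [14.75, 2.0, 0.5, 0.0]: the within-cluster eigenvalues are
# 0.5*h_i^2 (here 0.5*2^2 = 2.0 and 0.5*1^2 = 0.5), plus a larger eigenvalue
# separating the two maximal clusters and the trivial zero of the centring.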

Example 4 Critchley and Heiser (1988) illustrated their approach with an ultrametric matrix $\Delta$ of order $n = 10$, related to the ultrametric tree represented in Figure 8. Here $m = 6$, $r = 4$ and $\mathcal{E}$ has two isolated objects. The partition (16) of $\mathcal{E}$ is

$\mathcal{E} = \{1,2\} + \{3,4\} + \{6,7\} + \{9,10\} + \{5\} + \{8\},$   (17)

and it is found that

$\lambda_1 = \tfrac{1}{2} < \lambda_2 = 2 < \lambda_3 = \tfrac{9}{2} < \lambda_4 = 18 < \lambda = 134.$

The link between the Critchley-Heiser and Cuadras-Oller results may be established by noting that $X_0$ already separates the clusters in (17), the first principal dimension, related to $\lambda$, being enough to separate these clusters. Figure 9 (a) gives the scale (15) for $\mathcal{E}$ and 9 (b) is the Euclidean representation using MDS. Further, applying Bock's results to the partition (17) and using standard weights, we have $Q = I + M M'$, where $M = J N^{-1}$, $N = \mathrm{diag}(2, 2, 1, 2, 1, 2)$ contains the sizes of the clusters, and $J = (c_{ij})$ is the incidence matrix (an $n \times m$ matrix with entries $c_{ij} = 0$ if $E_i \notin C_j$ and $c_{ij} = 1$ if $E_i \in C_j$). Figure 9 (c) gives the representation using the first column of $Y^{*}$.

6 MDS on a statistical model

Let $\{p(x)\}$ be a class of probability density functions (pdfs), possibly described as a parametric statistical model $S = \{p(x; \theta)\}_{\theta \in \Theta}$. In an early paper, Rao (1945) understood $S$ as having a Riemannian manifold structure, the metric tensor being given in suitable coordinates by the Fisher information matrix, allowing the definition of a geodesic distance between pdfs, called the Rao distance. This distance was computed for several parametric families by Atkinson and Mitchell (1981), Mitchell and Krzanowski (1985), Burbea (1986), Oller and Cuadras (1985) and others.

The geodesic distance between two pdfs $p(x; \theta_1)$ and $p(x; \theta_2)$ of the model provides a way of representing a finite set of pdfs via MDS. The simplest multivariate example arises when we have normal populations with common covariance matrices: the distance between means reduces to the Mahalanobis distance and then we have Canonical Variate Analysis (see Section 4.1). But this approach is so rich that it also provides an appropriate framework for the study of other statistical aspects of the model (large-sample properties of estimators, curvature, ancillarity, intrinsic bias, sufficiency). See Amari (1987) and the recent papers by Barndorff-Nielsen and Jupp (1993) and Oller (1993).

Note that a distance is constructed from a statistical model, i.e., the pdfs are known up to a parameter. Is it possible to construct a pdf from a distance? To a certain extent, the answer is affirmative, as we show in the next section.

Figure 8: The ultrametric tree example used by Critchley and Heiser.

Figure 9: One-dimensional representation of a hierarchical tree. (a) Following Critchley and Heiser, (b) using MDS, (c) following Bock.

7 The distance-based approach in discrimination

7.1 A proximity function

Let $\mathbf{x}$ be a random variable defined on a probability space $(\Omega, \mathcal{A}, P)$, with values on some $E \subset \mathbb{R}^p$, and let $f$ be the probability density function of $\mathbf{x}$ with respect to a given measure $\mu$. Assume that $\delta(\cdot, \cdot)$ is a square integrable symmetric function on $E$ (usually it will be a dissimilarity function), verifying that $\delta(x, y) = \delta(y, x) \ge 0$, $\forall x, y \in E$, and that

$V_{\delta}(\mathbf{x}) = \frac{1}{2} \int_{E \times E} \delta^2(x, y)\, f(x)\, f(y)\, d\mu(x)\, d\mu(y)$   (18)

is finite. This quantity is a measure of the dispersion of $\mathbf{x}$ with respect to $\delta(\cdot, \cdot)$ and was called the geometric variability of $\mathbf{x}$ in (Cuadras and Fortiana 1995). When $\delta(\cdot, \cdot)$ is the Euclidean distance, $V_{\delta}(\mathbf{x}) = \mathrm{var}(\mathbf{x})$. The following function of $\omega_0 \in \Omega$,

$\phi_{\delta, \mathbf{x}}(\omega_0) = \int_{E} \delta^2(x_0, x)\, f(x)\, d\mu(x) - V_{\delta}(\mathbf{x}),$   (19)

where $x_0 = \mathbf{x}(\omega_0)$, can be understood as a proximity function of $x_0$ to $E(\mathbf{x})$. This interpretation can be justified by the first of the properties stated below (Cuadras 1989; Cuadras 1991; Cuadras 1992b).

1) If there is a representation of $\delta$, i.e., a function $\psi: \mathbb{R}^p \rightarrow L$, where $L$ is a Euclidean (or Hilbert) space, such that $\delta^2(x, y) = \|\psi(x) - \psi(y)\|^2$, then

$\phi_{\delta, \mathbf{x}}(\omega_0) = \|\psi(x_0) - E(\psi(\mathbf{x}))\|^2,$   (20)

assuming the existence of the moments $E(\psi(\mathbf{x}))$ and $E(\|\psi(\mathbf{x})\|^2)$.

2) If $\mathbf{z} = (\mathbf{x}, \mathbf{y})$, where the components $\mathbf{x}$ and $\mathbf{y}$ are independent, and we take the squared distance $\delta_z^2 = \delta_x^2 + \delta_y^2$, then $\phi_{\delta_z, \mathbf{z}}(\omega_0) = \phi_{\delta_x, \mathbf{x}}(\omega_0) + \phi_{\delta_y, \mathbf{y}}(\omega_0)$.

3) If $x_1, \ldots, x_n$ are iid observations of $\mathbf{x}$, an estimation of (19) is given by

$\hat{\phi}(\omega_0) = \frac{1}{n} \sum_{i=1}^{n} \delta^2(x_0, x_i) - \frac{1}{2 n^2} \sum_{i, j=1}^{n} \delta^2(x_i, x_j).$   (21)

In this case, the representation space $L$ is either $\mathbb{R}^s$ or $\mathbb{R}^s \times i\,\mathbb{R}^t$, the map $\psi$ is obtained by ordinary metric scaling, and the expected value in (20) reduces to the centroid $\bar{\psi}(x) = \frac{1}{n} \sum_{i=1}^{n} \psi(x_i)$.

4) When the pdf of $\mathbf{x}$ belongs to a regular statistical model $S = \{p(x; \theta)\}_{\theta \in \Theta}$, a distance between two observations $x$ and $y$ may be defined as

$\delta^2(x, y) = (D_x - D_y)'\, F^{-1}\, (D_x - D_y),$   (22)

where $D_x = \frac{\partial}{\partial \theta} \log p(x; \theta)$ and $F$ is the Fisher information matrix (Cuadras 1989; Oller 1989; Miñarro and Oller 1992).

Substituting (22) in (20) gives

$\phi(\omega_0) = D_{x_0}'\, F^{-1}\, D_{x_0},$   (23)

a Euclidean distance in $\mathbb{R}^m$ ($m$ = number of parameters in the model = dimension of $\Theta$). (23) reduces to the Mahalanobis distance between $x_0$ and $\boldsymbol{\mu}$ when $\mathbf{x} \sim N(\boldsymbol{\mu}, \Sigma)$.

5) It is possible to extend (22) to the case of a compound statistical model for $\mathbf{z} = (\mathbf{x}, \mathbf{y})$, when $\mathbf{x}$ and $\mathbf{y}$ are not independent but only the marginals are known (Cuadras 1992a). $\phi(\omega_0)$ can still be obtained.

However, the main interest of (21) is that it can be constructed when the pdf of $\mathbf{x}$ is unknown or difficult to specify, for example, when one or more of the following conditions is found:

- $\mathbf{x}$ is a mixture of continuous and discrete variables;
- the data are incomplete;
- observations are encoded as character strings and/or the number of variables may change from one observation to another;
- the number of variables is greater than the number of observations.

In general, (21) can be implemented for ill-conditioned statistical problems, where building a probabilistic model would rely upon arbitrary or unverified hypotheses, while often a distance function $\delta(\cdot, \cdot)$ between individuals appears naturally.

7.2 The distance-based discriminant rule

Suppose that we have two populations $\Pi_1$, $\Pi_2$ (a generalization to $g \ge 2$ populations is easy). On the basis of a random vector $\mathbf{x}$ and a distance function $\delta(\cdot, \cdot)$, we use (19) to construct $\phi_1$, $\phi_2$, related to $\Pi_1$, $\Pi_2$, respectively. The distance-based rule of discrimination is (Cuadras 1989; Cuadras, Fortiana, and Oliva 1994):

Allocate $\omega_0$ to $\Pi_i$ if $\phi_i(\omega_0) = \min\{\phi_1(\omega_0), \phi_2(\omega_0)\}$.   (24)

Some of its properties/advantages are:

- When $\Pi_i = N(\boldsymbol{\mu}_i, \Sigma)$, $i = 1, 2$, and $\delta$ is the Mahalanobis distance, (24) coincides with the classic linear discriminant. The classic quadratic discriminant for $\Sigma_1 \ne \Sigma_2$ is also obtained with a small modification of $\phi$.

- From (20), it turns out that (24) can be understood as a Matusita rule for discrimination, i.e., $\omega_0$ is allocated to the nearest population. (Note, however, that this fact follows from the underlying theory: only distances between observations are explicitly needed for computations.)

- If observable variables are of mixed type, possibly with missing values, we can use the distance based on Gower's similarity coefficient (Gower 1971a) to construct the DB discriminant functions (Krzanowski 1993; Ciampi, Hendricks, and Lou 1993).

- The DB discriminant is especially useful when the information on individuals consists of (variable length) strings, e.g., words or DNA sequences.
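A hedged sketch of the empirical proximity function (21) and the allocation rule (24); the squared Euclidean distance, the simulated data and the names are illustrative assumptions, not the authors' implementation.

import numpy as np

def proximity(x0, sample, dist2):
    """Empirical proximity function (21) of a point x0 to a training sample."""
    n = len(sample)
    to_sample = np.array([dist2(x0, xi) for xi in sample])
    within = np.array([[dist2(xi, xj) for xj in sample] for xi in sample])
    return to_sample.mean() - within.sum() / (2.0 * n ** 2)

def db_classify(x0, samples, dist2):
    """Rule (24): allocate x0 to the population with minimal proximity (0-based index)."""
    return int(np.argmin([proximity(x0, s, dist2) for s in samples]))

# illustrative use with a squared Euclidean distance between numeric vectors
sq_euclid = lambda a, b: float(np.sum((np.asarray(a) - np.asarray(b)) ** 2))
rng = np.random.default_rng(0)
pop1 = rng.normal(0.0, 1.0, size=(30, 2))
pop2 = rng.normal(3.0, 1.0, size=(30, 2))
print(db_classify([2.8, 3.1], [pop1, pop2], sq_euclid))   # 1, i.e. the second population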

Table 3: Classification with five categorical variables. Comparison between the distance-based (DB) approach, the linear discriminant (LDF), the quadratic discriminant (QDF) and the log-linear model (LL). The numbers of misclassifications were obtained by the leaving-one-out procedure.

                                 Π1     Π2    Total
 Sample size                     136     43     179
 Number of misclassifications
   LDF                            39     15      54
   QDF                            36      2      38
   DB                              7      1       8
   LL                          fails  fails       -

Example 5 Table 3, from (Cuadras 1992a), reports an application to a problem in Linguistics (Roses 1990): to decide whether a diphthong whose first vowel is an atonic i, appearing immediately after a consonant in a Catalan word, should be pronounced as monosyllabic ($\Pi_1$) or bisyllabic ($\Pi_2$). A random sample of 136 and 43 words, respectively, whose pronunciation is known, was selected from a dictionary, and each word was encoded according to five categorical variables. The linear (LDF) and quadratic (QDF) discriminant functions were used for comparison purposes. The log-linear model could seem to be quite appropriate (Krzanowski 1988, p. 352), but one of the variables (the consonant preceding the i) has many states and LL gives null likelihood in both groups for many words. In contrast, DB presents no problems and classifies the words quite correctly.

8 MDS solution for a random variable

Any finite set $U$, together with a Euclidean distance matrix $\Delta = (\delta_{ij})_{i,j \in U}$, can be represented as a set of points in some $\mathbb{R}^m$. This property is usually applied as a method of dimensional reduction of the data. The (approximate) low dimensionality of the representation space is not, however, an intrinsic requirement. Other interesting applications arise if emphasis is given instead to its structure as a Euclidean or, more generally, a Hilbert space. This section is devoted to one such application.

Let us consider the set

$U = \{0, 1, 2, \ldots, n\},$   (25)

consisting of $n + 1$ equidistant points on the real line $\mathbb{R}$, with the distance

$\delta_{ij} = \sqrt{|i - j|}.$   (26)

The rows of the $(n+1) \times n$ matrix

$U = \begin{pmatrix} 0 & 0 & 0 & \cdots & 0 \\ 1 & 0 & 0 & \cdots & 0 \\ 1 & 1 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & 1 & \cdots & 1 \end{pmatrix}$   (27)

provide a configuration of $n + 1$ points in $\mathbb{R}^n$ representing $U$. Another configuration can be obtained via MDS. An explicit MDS solution was found by Cuadras and Fortiana (1993), with applications in non-linear regression.

Our aim is now to represent a continuous real random variable $X$ following a similar approach. Let us start with a uniform $(0, 1)$ random variable $U$. In order to represent $U$ with respect to the distance

$\delta(x, y) = \sqrt{|x - y|}, \quad x, y \in (0, 1),$   (28)

we observe that the continuous version of (27) is the stochastic process

$\mathcal{U} = \{U_t,\; 0 \le t \le 1\},$

where $U_t$ is the indicator of the interval $(t, 1) \subset (0, 1)$, following a Bernoulli distribution $B(1, t)$. $(U, \delta)$ is well described by $\mathcal{U}$ since

$U = \int_{0}^{1} U_t \, dt,$

and the squared Euclidean distance between two trajectories $x_t$ and $y_t$ of $\mathcal{U}$ is given by

$\int_{0}^{1} (x_t - y_t)^2 \, dt = |x - y|.$

Hence $\mathcal{U}$ can be interpreted as a continuous Euclidean configuration representing $(U, \delta)$. As in the finite case, we are interested in finding an MDS solution, i.e., as the distance is Euclidean, a Principal Coordinate solution. The duality between Principal Coordinates and Principal Components suggests finding the principal components of the stochastic process $\mathcal{U}$. The covariance function of $\mathcal{U}$ is

$K(s, t) = \mathrm{Cov}(U_s, U_t) = \min(s, t) - s\,t, \quad 0 \le s, t \le 1.$   (29)

$K$ is a well-known kernel and the following decomposition holds:

$K(s, t) = \sum_{j=1}^{\infty} \lambda_j \, f_j(s) \, f_j(t),$

where $\lambda_j = 1/(j\pi)^2$ and $f_j(t) = \sqrt{2} \sin(j \pi t)$. The principal components of $\mathcal{U}$ are the random variables

$Z_j = \int_{0}^{1} U_t \, f_j(t) \, dt = \sqrt{\lambda_j}\,\sqrt{2}\,\left[1 - \cos(j \pi U)\right].$

Standardizing $Z_j$ to mean 0 we have

$C_j = \sqrt{\lambda_j}\,\sqrt{2}\,\cos(j \pi U).$

The Continuous Metric Scaling solution for $(U, \delta)$ can be expressed by noting that each value $x$ of $U$ is represented by the (countable) sequence

$(C_1(x), C_2(x), \ldots, C_j(x), \ldots),$   (30)

which satisfies the equality

$\sum_{j=1}^{\infty} (C_j(x) - C_j(y))^2 = |x - y|.$

The analogy with the finite MDS case is also supported by the fact that $C_1, C_2, \ldots$ are centered uncorrelated random variables, whose sum of variances equals

$\sum_{j=1}^{\infty} \mathrm{Var}(C_j) = \mathrm{tr}(K) = \frac{1}{6}.$

Note that the Principal Coordinates sequence (30) is associated with both $U$ and $\delta$. For instance, if the ordinary Euclidean distance $\delta(x, y) = |x - y|$ had been chosen, instead of its square root (28), then the result is a degenerate sequence with only one term, which equals $U$. This MDS solution can be used in prediction and goodness-of-fit assessment (Cuadras and Fortiana 1993), following an approach analogous to the distance-based regression (Section 2).

Example 6 We take the $n = 50$ sample of the variable "sepal length" of Iris setosa from the famous Fisher data set, as shown in the first column of Table 1.2.2 of Mardia et al. (1979). Suppose we want to ascertain whether the underlying distribution is Normal. If it is, the probability integral transformation $y = \Phi(x)$ on the standardized sample, where $\Phi$ is the cumulative distribution function of $N(0, 1)$, provides a sample from a $(0, 1)$ uniform distribution $U$. If $F_n$ is the empirical distribution of the transformed sample and $F_U$ is the cumulative distribution function of $U$, then $\rho^{+}(F_n, F_U)$ is a general measure of agreement between the sample and the normality hypothesis. Additional indications are obtained from the correlation coefficients between $F_n$ and $C_1, C_2, \ldots$, with respect to $H^{+}(F_n, F_U)$. Results are shown in the second column of Table 4. For comparison, the theoretical values of $\rho^{+}(C_j(U), F_U)$, $j = 1, \ldots, 4$, and the estimated values, assuming a uniform underlying distribution, are given in the first and third columns, respectively.

Table 4: Correlations between Iris data and two possible underlying distributions.

           Theoretical   Estimated (Normal)   Estimated (Uniform)
 ρ+            1              0.9915               0.9738
 j = 1         0.9927         0.9792               0.9435
 j = 2         0              0.0019              -0.0102
 j = 3         0.1103         0.1713               0.2847
 j = 4         0             -0.0277              -0.0324

Finally, this representation can be generalized to any continuous second-order random variable $\mathbf{x}$ (Cuadras and Fortiana 1995). This extension opens the way to obtaining new insights and applications of the classical MDS methodology.
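The identities involving the sequence (30) can be verified numerically by truncating the series; the truncation point and the test values in the following sketch are arbitrary choices of ours.

import numpy as np

# Principal component functions of the uniform process, sequence (30):
# C_j(x) = sqrt(lambda_j) * sqrt(2) * cos(j*pi*x), with lambda_j = 1/(j*pi)^2.
def C(j, x):
    return np.sqrt(2.0) * np.cos(j * np.pi * x) / (j * np.pi)

J = np.arange(1, 20001)                    # truncation of the countable sequence
x, y = 0.3, 0.8
print(np.sum((C(J, x) - C(J, y)) ** 2))    # partial sums approach |x - y| = 0.5
print(np.sum(1.0 / (J * np.pi) ** 2))      # sum of variances approaches tr(K) = 1/6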

References

Ahrens, H. and J. Lantner (1974). Mehrdimensionale Varianzanalyse. Akademie-Verlag, Berlin.
Amari, S.-I. (1987). Differential geometrical theory in Statistics. In Differential Geometry in Statistical Inference, Number 10 in IMS Monograph Series, pp. 19-94. IMS, Hayward, CA.
Anderson, T. W. (1958). An Introduction to Multivariate Statistical Analysis. John Wiley & Sons, New York.
Andrews, D. F. and A. M. Herzberg (Eds.) (1985). Data, A Collection of Problems from Many Fields for the Student and Research Worker, Springer Series in Statistics. Springer-Verlag, Berlin.
Atkinson, C. and A. F. S. Mitchell (1981). Rao's distance measure. Sankhya, The Indian Journal of Statistics, Series A 43, 345-365.
Barndorff-Nielsen, O. E. and P. E. Jupp (1993). Statistical inference and differential geometry: some recent developments. See Cuadras and Rao (1993), pp. 385-396.
Bertier, P. and J. M. Bouroche (1975). Analyse des Données Multidimensionnelles. Presses Univ. de France, Paris.
Bock, H. H. (1987). On the interface between Cluster Analysis, Principal Component Analysis, and Multidimensional Scaling. In H. Bozdogan and A. K. Gupta (Eds.), Multivariate Statistical Modeling and Data Analysis, pp. 17-33. D. Reidel Pub. Co.
Burbea, J. (1986). Informative geometry of probability spaces. Expositiones Mathematicae 4, 347-378.
Ciampi, A., L. Hendricks, and Z. Lou (1993). Discriminant analysis for mixed variables: Integrating trees and regression models. See Cuadras and Rao (1993), pp. 3-22.
Coll, M. D., C. M. Cuadras, and J. Egozcue (1980). Distribution of human chromosomes on the metaphase plate. Symmetrical arrangement in human male cells. Genetical Research 36, 219-234.

Critchley, F. (1988). The Euclidean structure of a dendrogram, the variance of a node and the question: "how many clusters really are there?". In H. H. Bock (Ed.), Classification and Related Methods of Data Analysis, pp. 75-84. Elsevier Science Publishers B. V. (North-Holland).
Critchley, F. and W. Heiser (1988). Hierarchical trees can be perfectly scaled in one dimension. Journal of Classification 5, 5-20.
Cuadras, C. M. (1974). Análisis discriminante de funciones paramétricas estimables. Trabajos de Estadística e Investigación Operativa 25 (3), 3-31.
Cuadras, C. M. (1981a). Análisis y representación multidimensional de la variabilidad. In Inter. Symp. Concpt. Meth. Paleo., Barcelona, pp. 287-297.
Cuadras, C. M. (1981b). Métodos de Análisis Multivariante. EUNIBAR, Barcelona.
Cuadras, C. M. (1989). Distance analysis in discrimination and classification using both continuous and categorical variables. See Dodge (1989), pp. 459-473.
Cuadras, C. M. (1991). A distance based approach to discriminant analysis and its properties. Math. Preprint Series 90, Univ. de Barcelona.
Cuadras, C. M. (1992a). Probability distributions with given multivariate marginals and given dependence structure. Journal of Multivariate Analysis 42, 51-66.
Cuadras, C. M. (1992b). Some examples of distance based discrimination. Listy Biometryczne - Biometrical Letters 29, 3-20.
Cuadras, C. M. and C. Arenas (1990). A distance based regression model for prediction with mixed data. Communications in Statistics A. Theory and Methods 19, 2261-2279.
Cuadras, C. M., C. Arenas, and J. Fortiana (1994). Some computational aspects of a distance-based model for prediction. (Submitted).
Cuadras, C. M. and J. Fortiana (1993). Continuous metric scaling and prediction. See Cuadras and Rao (1993), pp. 47-66.
Cuadras, C. M. and J. Fortiana (1994). Ascertaining the underlying distribution of a data set. In R. Gutiérrez and M. J. Valderrama (Eds.), Selected Topics on Stochastic Modelling, pp. 223-230. World Scientific, Singapore.
Cuadras, C. M. and J. Fortiana (1995). A continuous metric scaling solution for a random variable. Journal of Multivariate Analysis 52, 1-14.
Cuadras, C. M., J. Fortiana, and F. Oliva (1994, December). The proximity of an individual to a population with applications in discriminant analysis. Math. Preprint Series 162, Universitat de Barcelona.
Cuadras, C. M. and J. M. Oller (1987). Eigenanalysis and metric multidimensional scaling on hierarchical structures. Qüestiió 11, 37-57.
Cuadras, C. M. and C. R. Rao (Eds.) (1993). Multivariate Analysis, Future Directions 2. Elsevier, Amsterdam.
Davison, M. L. (1983). Multidimensional Scaling. John Wiley & Sons, New York.
Dempster, A. P. (1969). Elements of Continuous Multivariate Analysis. Addison-Wesley, Reading, Mass.
Dodge, Y. (Ed.) (1989). Statistical Data Analysis and Inference. North-Holland Publishing Co., Amsterdam.
Draper, N. R. and H. Smith (1981). Applied Regression Analysis (second edition). John Wiley & Sons, New York.

Fujita, K. (1979). The range of correlation coefficients obtainable from m × n correlation tables with given marginal distributions. Sci. Rep. Fac. Educ. Gifu Univ. (Nat. Sci.) 7, 394-406.
Gower, J. C. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53, 325-338.
Gower, J. C. (1971a). A general coefficient of similarity and some of its properties. Biometrics 27, 857-874.
Gower, J. C. (1971b). Statistical methods of comparing different multivariate analyses of the same data. In F. R. Hodson, D. G. Kendall, and P. Tautu (Eds.), Mathematics in the Archaeological and Historical Sciences, pp. 138-149. Edinburgh University Press.
Greenacre, M. J. (1984). Theory and Applications of Correspondence Analysis. Academic Press, London.
Hoeffding, W. (1940). Maßstabinvariante Korrelationstheorie. Schriften des Math. Inst. Angew. Math. Univ. Berlin 5, 181-233.
Hoerl, A. E. and R. W. Kennard (1970a). Ridge regression: Applications to non-orthogonal problems. Technometrics 12, 69-82. (Correction, ibid., p. 723).
Hoerl, A. E. and R. W. Kennard (1970b). Ridge regression: biased estimation for non-orthogonal problems. Technometrics 12, 55-67.
Holman, E. W. (1972). The relation between hierarchical and Euclidean models for psychological distances. Psychometrika 37, 417-423.
Kendall, M. G. (1957). A Course in Multivariate Analysis. Griffin, London.
Kruskal, J. B. (1977). The relationship between multidimensional scaling and clustering. In J. Van Ryzin (Ed.), Classification and Clustering, pp. 17-44. Academic Press, New York.
Kruskal, J. B. and M. Wish (1978). Multidimensional Scaling. Number 11 in Sage University Paper series on Quantitative Applications in the Social Sciences. Beverly Hills and London: Sage Publications.
Krzanowski, W. J. (1988). Principles of Multivariate Analysis: a User's Perspective. Clarendon Press, Oxford.
Krzanowski, W. J. (1993). The location model for mixtures of categorical and continuous variables. Journal of Classification 10, 25-49.
Kshirsagar, A. M. (1972). Multivariate Analysis. Marcel Dekker, Inc., New York.
Lingoes, J. C. (1971). Some boundary conditions for a monotone analysis of symmetric matrices. Psychometrika 36, 195-203.
Mardia, K. V. (1978). Some properties of classical multidimensional scaling. Communications in Statistics A. Theory and Methods 7, 1233-1241.
Mardia, K. V., J. T. Kent, and J. M. Bibby (1979). Multivariate Analysis. Academic Press, London.
Martín, A. M. and P. Galindo (1994). Evaluation of utilities by means of non-metric multidimensional scaling. (Submitted to Operations Research).
Miñarro, A. and J. M. Oller (1992). Some remarks on the individuals-score distance and its applications to statistical inference. Qüestiió 16, 43-57.
Mitchell, A. and W. J. Krzanowski (1985). The Mahalanobis distance and elliptic distributions. Biometrika 72, 464-467.
Moran, P. A. P. (1967). Testing for correlation between non-negative variates. Biometrika 54, 385-394.

Morrison, D. F. (1967). Multivariate Statistical Methods. McGraw-Hill, New York.
Ohsumi, N. and T. Nakamura (1981). Some properties of the monotone hierarchical dendrogram in numerical classification (in Japanese). Proc. Institute of Statistical Mathematics (Tokyo) 28, 117-133.
Oliva, F., C. Bolancé, and L. Díaz (1993). Aplicació de l'anàlisi multivariant a un estudi sobre les llengües europees. Qüestiió 17, 139-161.
Oller, J. M. (1989). Some geometrical aspects of data analysis and statistics. See Dodge (1989), pp. 41-58.
Oller, J. M. (1993). On an intrinsic analysis of statistical estimation. See Cuadras and Rao (1993), pp. 421-437.
Oller, J. M. and C. M. Cuadras (1985). Rao's distance for negative multinomial distributions. Sankhya, The Indian Journal of Statistics, Series A 47, 75-83.
Petitpierre, E. and C. M. Cuadras (1977). The canonical analysis applied to the taxonomy and evolution of the genus Timarcha Latr. (Coleoptera, Chrysomelidae). Mediterranea 1, 13-28.
Pruzansky, S., A. Tversky, and J. D. Carroll (1982). Spatial versus tree representations of proximity data. Psychometrika 47 (1), 3-24.
Rao, C. R. (1945). Information and the accuracy attainable in the estimation of statistical parameters. Bulletin of the Calcutta Math. Soc. 37, 81-91.
Rao, C. R. (1952). Advanced Statistical Methods in Biometric Research. John Wiley & Sons, New York. (2nd ed. published by Hafner, 1971).
Rao, C. R. (1973). Linear Statistical Inference and its Applications, 2nd Ed. John Wiley & Sons, New York.
Ratkowski, D. A. and D. Martin (1974). The use of multivariate analysis in identifying relationships among disorder and mineral element content in apples. Australian J. Agric. Res. 25, 783-790. (Reprinted in (Andrews and Herzberg 1985, pp. 355-356)).
Ríos, M. and C. M. Cuadras (1986). Distancia entre modelos lineales normales. Qüestiió 10, 83-92.
Ríos, M., A. Villarroya, and J. M. Oller (1992). Rao distance between multivariate linear normal models and their application to the classification of response curves. Computational Statistics and Data Analysis 13, 431-445.
Romney, A. K., R. N. Shepard, and S. Nerlove (Eds.) (1972). Multidimensional Scaling: Theory and Applications in the Behavioral Sciences. Vol. 2, Applications. Seminar Press, New York.
Roses, F. (1990). Estudi de la "i" àtona en posició post-consonàntica. Univ. de Barcelona. Unpublished manuscript.
Seal, H. L. (1964). Multivariate Statistical Analysis for Biologists. Methuen and Co., Ltd., London.
Seber, G. A. F. (1984). Multivariate Observations. John Wiley & Sons, New York.
Shepard, R. N., A. K. Romney, and S. Nerlove (Eds.) (1972). Multidimensional Scaling: Theory and Applications in the Behavioral Sciences. Vol. I. Theory. Seminar Press, New York.
Shih, W. J. and W.-M. Huang (1992). Evaluating correlations with proper bounds. Biometrics 48, 1207-1213.
