Dimensionality reduction - A Primer
DRAFT VERSION

Leonid Peshkin
Department of Computer Science, Brown University
Providence, RI 02912, USA
November 1995
Contents

1 Introduction
  1.1 The Problem
  1.2 Literature Survey
2 Principal Components Analysis
  2.1 Method
  2.2 Algorithm
3 Projection Pursuit
  3.1 Method
  3.2 Experiments
A Statistical Notation and Definitions
1 Introduction

1.1 The Problem
When working with high-dimensional vectors, the curse of dimensionality becomes the main factor affecting the analysis. The curse of dimensionality arises from the inherent sparsity of high-dimensional spaces. In recent years this has led researchers to construct methods that attempt to avoid the problem. In cases where the important structure in the data actually lies in a much lower-dimensional subspace, it is reasonable to try to reduce the dimensionality before attempting classification. This approach can succeed if the dimensionality reduction (feature extraction) method loses as little information as possible in the transformation from the high-dimensional space to the low-dimensional one. In this paper we discuss two techniques for overcoming the curse of dimensionality for the purpose of data analysis: Principal Component Analysis and Projection Pursuit.
1.2 Literature Survey

Statistical data analysis has been an area of high research activity for a long time. The most fundamental works in this field are [Harman, 1976] and [Duda and Hart, 1986]. Principal component analysis was introduced in [Pearson, 1901] and later in [Hotelling, 1933]; the state of the art is well described in [Jolliffe, 1986]. An application of principal component analysis to face recognition is analyzed in [Peshkin, 1994]. [Huber, 1985] is a comprehensive overview of the projection pursuit approach. The basic paper introducing projection pursuit is [Friedman and Tukey, 1974]; more recent works in this area are [Friedman, 1987], [Hall, 1989], and [Jones and Sibson, 1987].
2 Principal Components Analysis
2.1 Method
The objective of principal component analysis is to represent a variable in terms of several underlying factors. The simplest theoretical model for describing one variable in terms of several others is a linear one. A typical use of this comes in the reduction of a large body of data to a more manageable set. The measurements that vary the most among the individual objects are of the greatest interest. Therefore, if a small number of linear combinations of the original variables can be found which account for most of the variance, then a considerable economy is gained.
We assume that the data are in standardized form. As described later in Section 3, for multidimensional data the majority of projections are approximately normal. As noted in [Harman, 1976], when the point representation of a set of variables is used, the loci of uniform frequency density are essentially concentric, similar, and similarly situated ellipsoids; that is, the data points can be enclosed by equal-density contours, and these contours form (hyper)ellipsoids. If the variables are uncorrelated, the ellipsoids become spheres; if they are perfectly correlated, the ellipsoids degenerate into a line. Pearson [Pearson, 1901] and, later, Hotelling [Hotelling, 1933] realized that the axes of the hyperellipsoid can be found from the eigenvectors of the correlation matrix [1]. The axes of these ellipsoids correspond to the principal components. Thus, the method of component analysis involves a rotation of the coordinate axes to a new frame of reference in the total variable space, an orthogonal transformation in which each of the n original variables is describable in terms of the n new principal components. An important feature of the new components is that they account for a maximum amount of the variance of the variables. More specifically, the first principal component is the linear combination of the original variables that contributes maximally to their total variance; the second principal component, uncorrelated with the first, contributes a maximum to the residual variance; and so on until the total variance is analyzed. The sum of the variances of all n principal components is equal to the sum of the variances of the original variables. Since the method depends on the total variance of the original variables, it is most suitable when all the variables are measured in the same units. Otherwise, by a change of units or other linear transformations of the variables, the ellipsoids could be squeezed or stretched so that their axes (the principal components) would have no special meaning. Hence the variables are expressed in standard form, i.e., the unit of measurement for each variable is selected so that its sample variance is one.
[1] The equation of the hyperellipsoid is w'R^{-1}w = c, where R is the correlation or covariance matrix of the data, w is a vector of coordinates for a point on the ellipsoid, and c is a constant. The major axis of the ellipsoid is determined by the points on the ellipsoid farthest from the centroid. To locate these points, one need only find the point on the ellipsoid whose distance from the origin is a maximum. The squared distance from a point w on the ellipsoid to the origin is w'w, so we have to maximize w'w subject to w'R^{-1}w = c. Using simple calculus to solve this maximization problem leads to the result Rw = λw, with w'w = λc. Hence w must be an eigenvector of R corresponding to the largest eigenvalue λ. The eigenvector w may be scaled to unit length; it defines the direction of the major axis of the ellipsoid. The length of the major semi-axis, that is, the distance from the origin to the point on the ellipsoid farthest from the origin, is (λc)^{1/2}. By a similar line of reasoning, it can be shown that the second eigenvector corresponds to the largest minor axis of the ellipsoid, and so on.
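To make the footnote above concrete, here is a minimal numerical check (not part of the original text; the correlation matrix R and the constant c below are arbitrary illustrative choices): the points w_i = (λ_i c)^{1/2} u_i built from the eigenvectors of R lie on the ellipsoid w'R^{-1}w = c, and the one built from the largest eigenvalue is farthest from the origin, i.e. it spans the major axis.

# Numerical check of the footnote: eigenvectors of R, suitably scaled, lie on the
# ellipsoid w' R^{-1} w = c, and the largest eigenvalue gives the farthest point.
import numpy as np

R = np.array([[1.0, 0.8, 0.3],
              [0.8, 1.0, 0.4],
              [0.3, 0.4, 1.0]])   # an example correlation matrix
c = 1.0                           # the constant defining the equal-density contour

eigvals, eigvecs = np.linalg.eigh(R)   # ascending eigenvalues, orthonormal eigenvectors
R_inv = np.linalg.inv(R)

for lam, u in zip(eigvals, eigvecs.T):
    w = np.sqrt(lam * c) * u           # scaled eigenvector
    on_ellipsoid = w @ R_inv @ w       # should equal c
    print(f"lambda = {lam:.3f}   w'R^-1 w = {on_ellipsoid:.3f}   |w| = {np.linalg.norm(w):.3f}")
# The largest |w| (distance from the origin) equals sqrt(lambda_max * c): the major semi-axis.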
The analysis is then made on the correlation matrix, with the total variance equal to n. For such a matrix, all n eigenvalues are real and non-negative. An empirical method for the reduction of a large body of data so that a maximum of variance is extracted was first proposed by Karl Pearson [Pearson, 1901] and fully developed, as the method of principal components or component analysis, by Harold Hotelling [Hotelling, 1933]. The model for component analysis is simply:
    z_j = a_{j1} F_1 + a_{j2} F_2 + \dots + a_{jn} F_n,   (j = 1, 2, ..., n),

where each of the n observed variables is described linearly in terms of n new uncorrelated components F_1, F_2, ..., F_n. An important property of this method is that each component makes a maximal contribution to the sum of the variances of the n variables. For practical problems only a few components are retained, especially if they account for a large percentage of the total variance. However, all the components are required to reproduce the correlations among the variables [2]. The interpretation of the eigenvalues and eigenvectors of covariance and correlation matrices leads to several generalizations [3]:

1. The location of the eigenvectors along the principal axes of the hyperellipsoids in effect positions them so as to coincide with the directions of maximum variance. The eigenvector associated with the largest eigenvalue determines the direction of maximum variance of the data points; the eigenvector associated with the second largest eigenvalue locates the direction of maximum variance orthogonal to the first, and so on.

2. The location of the eigenvectors may be viewed as a problem of rotation or transformation. The original axes, representing the original measured variables, are rotated to new positions according to the criteria reviewed previously. The rows of the matrix of eigenvectors are the coefficients that rotate the axes of the variables to positions along the major and minor axes. The elements of the eigenvectors can be considered as the cosines of the angles of rotation from the old to the new system of coordinate axes. The columns of the matrix of eigenvectors give the direction cosines of the eigenvectors themselves.

[2] In contrast to the maximum-variance approach, the traditional or classical factor analysis model is designed to maximally reproduce the correlations (there are various methods to accomplish this). The basic factor analysis model may be written in the form z_j = a_{j1} F_1 + a_{j2} F_2 + ... + a_{jm} F_m + u_j Y_j (j = 1, 2, ..., n), where each of the n observed variables is described linearly in terms of m (usually much smaller than n) common factors and a unique factor. The common factors account for the correlations among the variables, while each unique factor accounts for the remaining variance of that variable.

[3] For a basic description of eigenvalues and eigenvectors see Appendix A.
3. The eigenvectors are linearly independent vectors that are linear combinations of the original variables. Thus, they may be viewed as "new" variables with the desirable property that they are uncorrelated and account for the variance of the data in decreasing order of importance.

4. The sum of the squared projections of the data points onto an eigenvector is proportional to the variance along that eigenvector. This variance is equal to the associated eigenvalue. The square root of the eigenvalue can therefore be used as the standard deviation of the "new" variable represented by the eigenvector.

5. It follows from the previous paragraph that an eigenvalue of zero indicates that the corresponding minor axis is of zero length. This means that the dimensionality of the space containing the data points is less than that of the original space, which in turn implies that the rank of the data matrix is less than its smaller dimension.

In summary, the eigenvectors associated with the nonzero eigenvalues of a covariance or correlation matrix represent a set of orthogonal axes to which the data points may be referred. They are a set of linearly independent basis vectors whose positions are located so as to account for maximum variance, which is given by the associated eigenvalues. The dimensionality of the space is given by the number of nonzero eigenvalues. An example of eigen-analysis applied to face recognition can be found in [Sirovich and Kirby, 1987, Peshkin, 1994]; in this case the intensity matrices of face images were used as vectors, and faces were projected onto a so-called eigenface basis as a dimensionality-reduction preprocessing step for the classification procedure [4].
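As a small illustration of the properties listed above (not from the original text; the data below are synthetic), the following sketch standardizes a data matrix, eigendecomposes its correlation matrix, and checks that the resulting components are uncorrelated, that their variances equal the eigenvalues, and that the total variance equals n.

# Illustration: components from the correlation matrix are uncorrelated, their
# variances are the eigenvalues, and the total variance is preserved (= n).
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=np.zeros(3),
                            cov=[[1.0, 0.7, 0.2],
                                 [0.7, 1.0, 0.5],
                                 [0.2, 0.5, 1.0]],
                            size=500)

Z = (X - X.mean(axis=0)) / X.std(axis=0)      # variables in standard form
R = np.corrcoef(Z, rowvar=False)              # correlation matrix
eigvals, U = np.linalg.eigh(R)                # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]             # sort by decreasing variance
eigvals, U = eigvals[order], U[:, order]

F = Z @ U                                     # principal components ("new" variables)
print(np.round(np.cov(F, rowvar=False, bias=True), 3))         # ~diagonal: uncorrelated
print(np.round(eigvals, 3), "sum =", round(eigvals.sum(), 3))  # variances; sum ~ n = 3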
2.2 Algorithm

We are looking for a model of the data matrix Y of the form

    Y = FA' + E,

where F is the matrix of factor scores, A is the matrix of factor loadings, and E is a matrix of residuals [5]. The following three-step procedure gives an approximation whose accuracy depends on the number k of largest eigenvalues retained:

1. Compute the covariance matrix S;
2. Find the k largest eigenvalues V_k and the corresponding eigenvectors U_k of S;
3. Compute F̂ = Y U_k.

[4] Different approaches to classification and clustering were studied in [Peshkin, 1993].
[5] See Appendix A for a detailed description.
An important issue is how many components one should extract to account for a "satisfactory" amount of the total variation. There is no exact solution to this problem, but the following approaches are often used. A useful way to start is to compute the cumulative percentage variance contribution obtained for successive values of k = 1, 2, ... and to stop when it is sufficiently large, for example larger than some threshold such as 75%, 90%, or 95%. How far one can get usually depends to a large extent on the nature of the data, that is, on the degree of collinearity and redundancy in it. The cumulative percentage variance contribution for a given value of k is computed as

    \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{p} \lambda_i} = \frac{\sum_{i=1}^{k} \lambda_i}{\mathrm{tr}\, S}.    (1)

The residual eigenvalues λ_{k+1}, ..., λ_p should be small and approximately equal. In order to discover the existence of such a situation, it is often informative to plot the values of λ_i against their order and look for a flattening out of the tail; a variant of this is to plot the percentage of the trace. One test is to look for equality of a certain number of the smallest eigenvalues of the covariance matrix. For example, in the three-dimensional case, if all the eigenvalues are equal we are dealing with a spherical distribution; if the largest eigenvalue is different and the two smallest eigenvalues are equal, the cross section in the plane of the second and third axes is a circle. In the general p-dimensional case, if the p - k smallest eigenvalues are equal, there is no use in taking out more than k principal components, since the remaining ones may not be distinguishable. The test for the equality of the p - k smallest eigenvalues is due to Bartlett (1954); see [Harman, 1976].

Principal component analysis does not give an exhaustive characterization of the data, because it captures only linear dependence. In the next section we consider a more advanced technique belonging to the general class of dimensionality reduction methods: projection pursuit.
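A minimal sketch of the three-step procedure combined with the cumulative-percentage rule (1) for choosing k follows; this is not code from the original paper, and the 90% threshold and synthetic data are illustrative assumptions.

# Three-step PCA reduction with k chosen by the cumulative variance rule (1).
import numpy as np

def pca_reduce(Y, threshold=0.90):
    """Return the k leading principal component scores F_hat = Y_c U_k."""
    Y_c = Y - Y.mean(axis=0)                      # center the data
    S = np.cov(Y_c, rowvar=False)                 # 1. covariance matrix S
    eigvals, U = np.linalg.eigh(S)                # 2. eigenvalues / eigenvectors of S
    order = np.argsort(eigvals)[::-1]
    eigvals, U = eigvals[order], U[:, order]
    cum = np.cumsum(eigvals) / np.trace(S)        # cumulative variance contribution, eq. (1)
    k = int(np.searchsorted(cum, threshold)) + 1  # smallest k exceeding the threshold
    F_hat = Y_c @ U[:, :k]                        # 3. factor scores F_hat = Y U_k
    return F_hat, eigvals, cum, k

rng = np.random.default_rng(1)
Y = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))   # synthetic correlated data
F_hat, eigvals, cum, k = pca_reduce(Y)
print("eigenvalues:", np.round(eigvals, 2))
print("cumulative contribution:", np.round(cum, 2), "-> k =", k)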
3 Projection Pursuit
3.1 Method
A general class of unsupervised dimensionality reduction methods, called exploratory projection pursuit, is based on seeking interesting projections of a high-dimensional data point
cloud [Friedman and Tukey, 1974]. The notion of an interesting projection is motivated by an observation, made by Freedman, that for most high-dimensional clouds most low-dimensional projections are approximately normal. This finding suggests that the important information in the data is conveyed in those directions whose one-dimensional projected distribution is far from Gaussian. Various projection indices differ in their assumptions about the nature of the deviation from normality and in their computational efficiency. Thus, the idea behind projection pursuit is to pick interesting low-dimensional projections of a high-dimensional point cloud by maximizing an objective function called the projection index. The most commonly used dimensionality reduction transformations are linear projections, because they are among the simplest and most interpretable. A linear projection from R^d to R^k is any linear map A, i.e., a k × d matrix of rank k:
    Z = AX,   X ∈ R^d,  Z ∈ R^k.

The projection is orthogonal if the row vectors of A are orthogonal to each other and have length one. If X is a d-dimensional random variable with distribution F, then Z = AX is a k-dimensional random variable with distribution F_A. Any structure seen in a projection is a shadow of an actual structure in the full dimensionality. In this sense, the projections that are most revealing of the high-dimensional data distribution are those containing the sharpest structure, and it is of interest to pursue such projections. Friedman and Tukey [Friedman and Tukey, 1974] presented an algorithm for attempting this goal. The basic idea is to assign a numerical index Q(F_A) to every projection A. That index characterizes the amount of structure (data density variation) present in the projection. The index is then maximized via numerical optimization with respect to the parameters defining the projection. We are interested not only in the global extremum but also in local extrema. The projection index can (and often should) be affine invariant [Huber, 1985], which means that for a real random variable Z, (nonrandom) real numbers s, t, and an index Q, we have
    Q(sZ + t) = Q(Z),   s ≠ 0.

Therefore, unlike other methods of data analysis (principal component analysis or factor analysis), this one is unaffected, and thus not distracted, by the overall covariance structure of the data, which often has little to do with clustering and other nonlinear effects (see, for example, Figure 1). The projection index forms the heart of the projection pursuit method; it defines the intent of the procedure. Our intent is to discover interesting structured projections of a multivariate data set. This sophisticated goal must be translated into a numerical index that is a functional of the projected data distribution.
Figure 1: A well-known example of two data clusters that can be separated by projecting onto the X axis but cannot be separated by projecting onto the Y axis, although the variance along the Y axis is larger.
This function must vary continuously with the parameters defining the projection and have a large value when the (projected) distribution is defined to be "interesting" and a small value otherwise. The notion of interesting obviously varies with the application [Huber, 1985]; we cannot expect universal agreement on what constitutes an interesting projection. A projection in which the data separate into distinct, meaningful clusters would certainly be interesting, but there are also interesting features that are not of the distinct-cluster type (e.g., an edge, or jump of density, at the boundary of some region).

Let us now mention a few widely used projection indices. The Legendre and Hermite indices are based on the expansion of density functions in terms of orthogonal polynomials; the Friedman-Tukey and entropy indices are based on kernel density estimates. The Legendre index is constructed by inverting the density through a normal cumulative distribution function and using Legendre polynomials for the expansion. This expansion is then compared with a uniform density on the interval [-1, 1] by an L2 distance. It is a good general index and fast to compute [Friedman, 1987]. The Hermite index is constructed using Hermite polynomials for the same expansion as underlies the Legendre index; this expansion is compared to a standard normal density using an L2 distance. This index works best when few terms are used in the expansion; the idea was proposed by [Hall, 1989]. The Friedman-Tukey index is based on an L2 norm of a local kernel density estimate [Friedman and Tukey, 1974]. The entropy index is an extension of the Friedman-Tukey index constructed using the negative entropy of a kernel density estimate [Jones and Sibson, 1987]. Throughout this work we use the Legendre index.

In one-dimensional exploratory projection pursuit we seek a linear combination Z = α'X such that the probability density p(Z) is relatively highly structured. We regard the normal density as the least structured, and we are concerned with finding structure that appears in the main body of the distribution rather
than in the tails. We begin by performing the transformation R = 2Φ(Z) - 1, with Φ(Z) being

    \Phi(Z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{Z} e^{-t^2/2} \, dt.    (2)

Clearly, R takes values in the interval -1 ≤ R ≤ 1, and if Z follows a standard normal distribution then R is uniformly distributed on this interval. Specifically,

    p_R(R) = \frac{1}{2} \, \frac{p\left(\Phi^{-1}\left(\frac{R+1}{2}\right)\right)}{g\left(\Phi^{-1}\left(\frac{R+1}{2}\right)\right)},    (3)

with g(·) being the standard normal density. Thus, a measure of the nonuniformity of R corresponds to a measure of the nonnormality of Z. We take as a measure of nonuniformity the integral-squared distance between the probability density of R, p_R(R), and the uniform probability density p_U(R) = 1/2 over the interval -1 ≤ R ≤ 1:

    \int_{-1}^{1} \left( p_R(R) - \frac{1}{2} \right)^2 dR = \int_{-1}^{1} p_R^2(R) \, dR - \frac{1}{2}.    (4)

Our projection index Q(Z) is taken to be a moment approximation of (4). Expanding p_R(R) in Legendre polynomials, we have

    \int_{-1}^{1} \Big[ \sum_{j=0}^{\infty} a_j P_j(R) \Big] p_R(R) \, dR - \frac{1}{2},

where the Legendre polynomials are defined by

    P_0(R) = 1, \quad P_1(R) = R, \quad P_j(R) = \frac{1}{j} \big[ (2j-1) R P_{j-1}(R) - (j-1) P_{j-2}(R) \big]    (5)

for j ≥ 2. The coefficients are given by

    a_j = \frac{2j+1}{2} \int_{-1}^{1} P_j(R) \, p_R(R) \, dR = \frac{2j+1}{2} \, E_R[P_j(R)].

Finally, the Legendre projection index is obtained by truncating the sum at order J:

    Q(Z) = \sum_{j=1}^{J} \frac{(2j+1) \, E_R^2[P_j(R)]}{2}.    (6)
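The derivation above translates directly into a sample-based estimate of Q(Z): replace E_R[P_j(R)] by sample averages of Legendre polynomials evaluated at R = 2Φ(Z) - 1. The following sketch is an assumption-laden illustration (J = 5, SciPy's normal CDF, the projected data standardized before the transform), not the implementation used in the paper.

# Sample estimate of the Legendre projection index, equations (2)-(6).
import numpy as np
from scipy.stats import norm
from numpy.polynomial import legendre

def legendre_index(z, J=5):
    """Sample estimate of the Legendre projection index for 1-D projected data z."""
    z = (z - z.mean()) / z.std()                # the pursuit is normally run on sphered data
    r = 2.0 * norm.cdf(z) - 1.0                 # R = 2*Phi(Z) - 1; uniform on [-1,1] if Z is standard normal
    q = 0.0
    for j in range(1, J + 1):
        coeffs = np.zeros(j + 1)
        coeffs[j] = 1.0                         # coefficient vector selecting the j-th Legendre polynomial P_j
        Ej = legendre.legval(r, coeffs).mean()  # sample estimate of E_R[P_j(R)]
        q += (2 * j + 1) * Ej ** 2 / 2.0        # j-th term of equation (6)
    return q

# For a normal sample the index is near zero; for a bimodal sample it is much larger.
rng = np.random.default_rng(2)
print(round(legendre_index(rng.normal(size=2000)), 4))
print(round(legendre_index(np.concatenate([rng.normal(-2, 1, 1000),
                                           rng.normal(+2, 1, 1000)])), 4))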
[Friedman, 1987] notes that the results are insensitive to the value chosen for the order of the polynomial expansion J over a wide range (4 ≤ J ≤ 8). The computation increases linearly with J for one-dimensional projection pursuit and quadratically for two-dimensional. This projection index can be computed very rapidly: fast approximations for the normal integral (2) exist and are provided as built-in functions by many programming-language compilers, and the Legendre polynomials up to order J are quickly obtained via the recursion relation (5). The Legendre projection index for two-dimensional projection pursuit is developed in direct analogy with the one-dimensional index [Friedman, 1987].

An important detail of the algorithm is the technique used to optimize the projection index function. The power of the method is reflected in its ability to find (for a given sample size and data dimension) substantial maxima of the projection index. Usually a hybrid optimization strategy is employed: it begins with a simple coarse-step optimizer that is designed to get close very rapidly to a substantial maximum, and a gradient method (quasi-Newton) is then used to converge quickly [7] to the solution. Interesting projections will be those which exhibit bimodality or multimodality, as opposed to projections based on variance criteria (Section 2) or other criteria not explicitly tied to cluster separation. After one has found some "interesting" projections, one can do one of the following:

1. Identify clusters, isolate them, and investigate them separately.
2. Identify clusters and locate them (i.e., classify all points according to their cluster membership).
3. Find a parsimonious description (separate structure from random noise).

Obviously, the boundary between these items is fluid. The output of an exploratory projection pursuit is a collection of views of the multivariate data set. These views are selected to be those that independently best represent the nonlinear aspects of the joint density of the measurement variables as reflected by the data. The nonlinear aspects are emphasized by maximizing a robust affine-invariant projection index. Finally, the data analyst has at his disposal the values of the parameters that define each solution as well as the projected data density. This information can be used to try to interpret any nonlinear effects that might be uncovered. A visual representation of the projected data density F_A can be inspected to determine the nature of the effect. Another possibility is to apply some other technique to the data with reduced dimensionality; in particular, it can be one of the clustering algorithms [8]. It would be appropriate to note that one should not be too optimistic about projection pursuit: we cannot hope that this method will carry us more than two or three dimensions beyond the limits set by human endurance [Huber, 1985]. This estimate was given in 1985, taking into account the limitations of computer capacity [9].

[7] For an overview of optimization methods such as steepest descent, conjugate gradients, and quasi-Newton, which involve the use of first derivatives, see [Gill et al., 1981].
[8] See [Peshkin, 1993] for an overview of clustering algorithms.
[9] In the discussion of [Huber, 1985], P. Diaconis gives estimates of lower and upper bounds on the number of two-dimensional projections needed:

    P            3      4        5            6             7
    Lower bound  70     3260     143500       0.6 × 10^7    0.2 × 10^9
    Upper bound  270    51680    0.9 × 10^7   0.2 × 10^10   0.2 × 10^12
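A rough sketch of the hybrid optimization strategy described above follows: a coarse random search over unit directions, then quasi-Newton (BFGS) refinement. This is not the authors' implementation; the placeholder index below (squared departure from normal kurtosis) merely stands in for a projection index such as the Legendre index sketched earlier, and the data are synthetic.

# Hybrid projection pursuit: coarse random search, then quasi-Newton refinement.
import numpy as np
from scipy.optimize import minimize

def placeholder_index(z):
    # crude structure measure: departure of the standardized projection from normal kurtosis
    z = (z - z.mean()) / z.std()
    return (np.mean(z ** 4) - 3.0) ** 2

def pursue(X, index=placeholder_index, n_starts=50, seed=0):
    """Find a unit direction a that maximizes index(X @ a)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]

    def neg_index(a):
        a = a / np.linalg.norm(a)               # keep the direction on the unit sphere
        return -index(X @ a)

    # coarse step: evaluate the index on many random directions
    starts = rng.normal(size=(n_starts, d))
    best = min(starts, key=neg_index)
    # refinement: quasi-Newton (BFGS) started from the best coarse direction
    res = minimize(neg_index, best, method="BFGS")
    a = res.x / np.linalg.norm(res.x)
    return a, -res.fun

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-3, 1, (300, 5)), rng.normal(3, 1, (300, 5))])  # two clusters
a, q = pursue(X)
print("best direction:", np.round(a, 2), " index value:", round(q, 3))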
3.2 Experiments

The power of the projection pursuit algorithm to find important structure decreases with decreasing sample size and increasing dimension. As remarked earlier, the covariance structure (linear associations) often does not align with the nonlinear structure (clustering, nonlinear relationships) that we are seeking with the projection pursuit algorithm. However, the typical exception to this has to do with the existence of a subspace containing only a tiny fraction of the data variation. Obviously, if a subspace contains no data variation, it cannot contain any structure; in this case the covariance matrix is singular, and the dimension of the search space is reduced to the rank of the covariance matrix. If there exists a subspace for which the data variation is very small compared with the complement subspace, then this subspace is usually dominated by noise in the system and contains little data structure. In this case the power of the projection pursuit procedure can be enhanced, and computation reduced, by restricting the pursuit search to the complement space; that is, we restrict the projection pursuit search to the subspace spanned by the few largest principal component axes. [Friedman, 1987] recommends that if a high-dimensional projection pursuit is unsuccessful in finding interesting structure, one should restrict the search dimension and try again.

Let us now consider the example shown in Figure 2. We run projection pursuit on a data set to see what interesting structure can be revealed. Figure 2 shows the process of Legendre projection index optimization step by step; small triangles mark the optimization intervals from which the algorithm was trying to escape a local extremum. We start with an extremely low index value corresponding to a homogeneous cloud that is close to a bivariate normal distribution (see the leftmost snapshot). Very rapidly the algorithm arrives at a view in which the data separate into two distinct clusters, and it converges to a local extremum (see the second snapshot). In order not to get trapped in a local extremum, we have to take some large random jumps, even at the cost of temporarily worsening the projection index value. Further optimization leads to two better-separated clusters with a much higher index value (see the third snapshot). After repeating this procedure, we end up with the largest projection index value, corresponding to the projection in which one cloud is split into smaller ones.

We have to mention that in the discussion of [Huber, 1985] the three referees who tried to use exploratory projection pursuit in practice were extremely pessimistic about its potential for finding clusters. There is still no clear theoretical framework for adjusting the projection index to a particular data domain.

Figure 2: The value of the Legendre(5) projection index over the course of the optimization, with projection snapshots. Small triangles mark the optimization intervals from which the algorithm was trying to escape a local extremum.
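A minimal sketch of the preprocessing step recommended above (not from the original text): project the centered data onto the few largest principal component axes, chosen here by the cumulative-variance rule of Section 2.2 with an assumed 95% threshold, and sphere the retained components; a projection pursuit routine such as the one sketched in Section 3.1 would then be run on X_reduced.

# Restrict the pursuit search space to the leading principal component axes.
import numpy as np

def restrict_to_principal_subspace(X, var_threshold=0.95):
    Xc = X - X.mean(axis=0)
    S = np.cov(Xc, rowvar=False)
    eigvals, U = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]
    eigvals, U = eigvals[order], U[:, order]
    k = int(np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), var_threshold)) + 1
    X_reduced = Xc @ U[:, :k] / np.sqrt(eigvals[:k])   # project and sphere the retained components
    return X_reduced, k

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 4)) @ rng.normal(size=(4, 10)) + 0.01 * rng.normal(size=(400, 10))
X_reduced, k = restrict_to_principal_subspace(X)
print("search restricted to", k, "dimensions")   # the pursuit would now run on X_reduced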
A Statistical Notation and Definitions

A particular variable is denoted by X_j, which may be any one of the n variables (j = 1, 2, ..., n). The index i designates any one of the N individual objects (i = 1, 2, ..., N). The value of variable X_j for object i is then represented by X_ji, with the order of the subscripts being important. A particular X_ji is called an observed value. For a sample of N observations, the mean of any variable is defined by X̄_j = Σ_i X_ji / N. The observed values of the variables may be transformed to a more convenient form by fixing the origin and the unit of measurement. When the origin is placed at the sample
mean, a particular value x_ji = X_ji - X̄_j is called a deviate. The sample variance of variable X_j is defined by

    s_j^2 = \frac{1}{N} \sum_{i} x_{ji}^2.

The population variances are denoted by σ_j^2. Taking the sample standard deviation as the unit of measurement, the standard deviate of variable X_j for individual object i is given by z_ji = x_ji / s_j. The set of all values z_ji (i = 1, 2, ..., N) is called the variable z_j in standard form. For any two variables X_j and X_k the sample covariance is defined by

    s_{jk} = \frac{1}{N} \sum_{i} x_{ji} x_{ki}.
An eigenvector of a matrix R is a vector u, with elements not all zero, such that Ru = λu, where λ is an unknown scalar. Essentially, we need to find a vector such that the vector Ru is proportional to u. Another form of this equation is

    (R - λI) u = 0,    (7)

where 0 is the null vector. This implies that the unknown vector u is orthogonal to all row vectors of (R - λI). Equation (7) represents a system of homogeneous equations. The first step in solving (7) requires setting the determinant of (R - λI) to zero, i.e., |R - λI| = 0. The general form of this equation is known as the characteristic equation. In the general case, the characteristic equation yields a polynomial of degree p and p roots, or eigenvalues, λ_1, λ_2, ..., λ_p. These p eigenvalues need not all be different, and some may be equal to zero. If two or more eigenvalues are equal, the eigenvalue is called multiple; otherwise it is distinct. Once the eigenvalues have been obtained, the associated eigenvectors may be found from equation (7). A unique solution cannot be obtained for an eigenvector: if u is a solution, so is cu, where c is a scalar. By convention, eigenvectors are always normalized; they are taken to be of unit length. There is an eigenvector, unique except for its sign, associated with each distinct eigenvalue, and eigenvectors associated with different eigenvalues are orthonormal. For multiple eigenvalues there are multiple solutions to (7), but one can always choose eigenvectors that are orthogonal to all other eigenvectors. Consequently, whether or not multiple eigenvalues are present, one can always associate an eigenvector of unit length with each eigenvalue such that the eigenvectors are orthonormal to one another.
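A short numerical companion to these definitions (synthetic data, not part of the original text): form deviates, standard deviates, and the correlation matrix, then solve the eigenproblem and verify that Ru = λu and that the eigenvectors are orthonormal.

# Standard deviates, correlation matrix, and the eigenproblem R u = lambda u.
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))   # N = 100 observations of n = 4 variables

x = X - X.mean(axis=0)                 # deviates x_ji = X_ji - mean_j
s = np.sqrt((x ** 2).mean(axis=0))     # sample standard deviations, s_j^2 = sum_i x_ji^2 / N
z = x / s                              # standard deviates z_ji = x_ji / s_j
R = z.T @ z / len(z)                   # correlation matrix (covariances of standardized variables)

eigvals, U = np.linalg.eigh(R)         # solves (R - lambda I) u = 0 for symmetric R
print(np.allclose(R @ U, U * eigvals)) # True: each column u satisfies R u = lambda u
print(np.allclose(U.T @ U, np.eye(4))) # True: eigenvectors are orthonormal (unit length, orthogonal)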
References

[Duda and Hart, 1986] Duda, R. O. and Hart, P. E. (1986). Pattern Classification and Scene Analysis. A Wiley-Interscience publication.
[Friedman, 1987] Friedman, J. (1987). Exploratory projection pursuit. J. Am. Stat. Assoc., 82:249-266.
[Friedman and Tukey, 1974] Friedman, J. and Tukey, J. (1974). A projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computers, 23:881-889.
[Gill et al., 1981] Gill, P., Murray, W., and Wright, M. (1981). Practical Optimization. London: Academic Press.
[Hall, 1989] Hall, P. (1989). Polynomial projection pursuit. Annals of Statistics, 17:589-605.
[Harman, 1976] Harman, H. H. (1976). Modern Factor Analysis. The University of Chicago Press, third revised edition.
[Hotelling, 1933] Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24:417-520.
[Huber, 1985] Huber, P. J. (1985). Projection pursuit. Annals of Statistics, 13:435-475.
[Jolliffe, 1986] Jolliffe, I. (1986). Principal Component Analysis. Springer-Verlag, New York.
[Jones and Sibson, 1987] Jones, M. and Sibson, R. (1987). What is projection pursuit? Journal of the Royal Statistical Society, Series A, 150:1-36.
[Pearson, 1901] Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Phil. Mag., ser. 6, 2:559-572.
[Peshkin, 1993] Peshkin, L. (1993). Clustering algorithms. Research Project, Weizmann Inst. of Science.
[Peshkin, 1994] Peshkin, L. (1994). Eigenfaces for recognition. Research Project, Weizmann Inst. of Science.
[Peshkin et al., 1994] Peshkin, L., Kovaleno, M., and Ullman, S. (1994). Dominant orientation for computing the distance between face images. In Proceedings of the Japan-Israel Workshop on Computer Vision and Visual Communication, CVVC'94, pages 15-20, Haifa, Israel.
[Rissanen, 1978] Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14:465-471.
[Rissanen, 1983] Rissanen, J. (1983). A universal prior for integers and estimation by minimum description length. Ann. Statist., 11:416-431.
[Sirovich and Kirby, 1987] Sirovich, L. and Kirby, M. (1987). A low dimensional procedure for the characterization of human faces. J. Opt. Soc. Amer. A, 4(3):519-524.