Projection Pursuit Indices Based On Orthonormal Function Expansions

Dianne Cook†‡, Andreas Buja†, Javier Cabrera‡

† Bellcore, 445 South St, Morristown, NJ 07962-1910
‡ Dept of Statistics, Hill Center, Busch Campus, Rutgers University, New Brunswick, NJ 08904
[email protected]
Abstract
Projection pursuit describes a procedure for searching high dimensional data for "interesting" low dimensional projections via the optimization of a criterion function called the projection pursuit index. By empirically examining the optimization process for several projection pursuit indices, we observed differences in the types of structure that maximized each index. We were especially curious about differences between two indices based on expansions in terms of orthogonal polynomials, the Legendre index (Friedman, 1987) and the Hermite index (Hall, 1989). Being fast to compute, these indices are ideally suited for dynamic graphics implementations. Both the Legendre and Hermite indices are weighted $L^2$-distances between the density of the projected data and a standard normal density. A general form for this type of index is introduced which encompasses both of these. The form clarifies the effects of the weight function on the index's sensitivity to differences from normality, highlighting some conceptual problems with the Legendre and Hermite indices. A new index, called the Natural Hermite index, which alleviates some of these problems, is introduced. As proposed by Friedman (1987) and also used by Hall (1989), a polynomial expansion of the data density reduces the form of the index to a sum of squares of the coefficients used in the expansion. This drew our attention to examining these coefficients as indices in their own right. We found that the first two coefficients, and the lowest order indices produced by them, are the most useful ones for practical data exploration, since they respond to structure that can be analytically identified, and because they have "long-sighted" vision which enables them to "see" large structure from a distance. Complementing this low order behaviour, the higher order indices are "short-sighted": they are able to "see" intricate structure, but only when close to it. We also show some practical use of projection pursuit using the polynomial indices, including a discovery of previously unseen structure in a set of telephone usage data, and two cautionary examples which illustrate that structure found is not always meaningful.
1 Introduction
The term "projection pursuit" was coined by Friedman and Tukey (1974) to describe a procedure for searching high ($p$) dimensional data for "interesting" low ($k = 1$ or 2 usually, maybe 3) dimensional projections. The procedure, originally suggested by Kruskal (1969), involves defining a criterion function, or index, which measures the "interestingness" of each $k$-dimensional projection of $p$-dimensional data. This criterion function is optimized over the space of all $k$-dimensional projections of $p$-space, searching for both global and local maxima. The resulting solutions hopefully reveal low dimensional structure in the data not necessarily found by methods such as principal component analysis.

Searching for interesting projections can be equated to searching for the most non-normal projections. One reason is that for many high dimensional data sets most projections look approximately Gaussian, so to find the unusual projections one should search for non-normality (see Andrews et al (1971) and Diaconis and Freedman (1984) for further discussion). Huber (1985) equates interesting to structured or non-random, and discusses entropy (measured by $\int f(x)\log f(x)\,dx$, for example) as a measure of randomness: a lower value of entropy indicates more randomness. Notably, this also suggests searching away from normality, because entropy, as defined above with fixed scale, is minimized by the Gaussian distribution. Using normality as the null model is highlighted in indices proposed by Jones (1983, e.g. moment index) and Huber (1985, e.g., negative Shannon entropy, Fisher information). This approach also suggests discarding structure such as location, scale and covariance, which are found reasonably well with more conventional multivariate methods, by sphering the data before beginning projection pursuit.

Consequently we have a framework for considering a family of projection pursuit indices based on a description of the data in terms of an $\mathbb{R}^p$-valued random vector, $Z$, satisfying $EZ = 0$ and $\mathrm{Cov}\,Z = I_p$. We want to construct a $k$-dimensional projection pursuit index, that is, a real valued function on the set of all $k$-dimensional projections of $\mathbb{R}^p$. For simplicity, let $k = 1$, so we consider all 1-dimensional projections of $Z$,
$$Z \longmapsto X = \alpha' Z \in \mathbb{R} \qquad (\alpha \in S^{p-1}),$$
where $S^{p-1}$ is the unit sphere in $\mathbb{R}^p$ and $X$ is a real-valued random variable. (We use $\alpha \in S^{p-1}$ because only direction is important, and we search over all possible directions in $\mathbb{R}^p$.) In the null case, if $Z \sim N(0, I_p)$ then $X \sim N(0, 1)$. Let the random variable $X$ have distribution function $F(x)$ and density $f(x)$. An index, $I$, can be constructed by measuring the distance of $f(x)$ from the standard normal density.

A practical index of this kind was proposed by Friedman (1987), although he detoured from the above route by first mapping $X$ into the bounded interval $[-1, 1]$ by the transformation $Y = 2\Phi(X) - 1$, where $\Phi$ is the distribution function of a standard normal. By doing this he hoped to concentrate attention on differences in the center, producing an index robust to tail fluctuations. In the null case, if $X \sim N(0, 1)$ then $Y \sim U[-1, 1]$. In general, let $Y$ have distribution function $G(y)$ and density $g(y)$. Friedman's proposed index is an $L^2$-distance of $g(y)$ from the density of $U[-1, 1]$:
$$I^L = \int_{-1}^{1} \left\{ g(y) - \tfrac{1}{2} \right\}^2 dy.$$
We call this the Legendre index because $g(y)$ will later be expanded in terms of the natural polynomial basis with respect to $U[-1, 1]$, namely the Legendre polynomials. This is the starting point for the work presented in this paper, but before we continue we note two basic details about the use of an index, $I$, in projection pursuit:
1. $I$ is a functional of $f$ (in Friedman's case, $I^L$ is a functional of $g$).
2. $f$ (or $g$) depends on the projection vector, $\alpha$, so that projection pursuit entails the search for local maxima of $I$ over all possible $\alpha$.
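For readers who want to experiment with this setup, the following minimal sketch (Python with numpy; the helper names are purely illustrative and not from the paper) performs the sphering step described above and evaluates a single 1-dimensional projection $X = \alpha' Z$ for a unit direction $\alpha$.

```python
import numpy as np

def sphere(data):
    """Center and whiten the rows of `data` so the result has mean zero
    and (sample) identity covariance, the sphering step described above."""
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    inv_sqrt = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T
    return centered @ inv_sqrt

def project(z, alpha):
    """1-dimensional projection X = alpha' Z for a direction alpha on the unit sphere."""
    alpha = np.asarray(alpha, dtype=float)
    alpha = alpha / np.linalg.norm(alpha)
    return z @ alpha

# Toy example: 200 points in R^5, projected along an arbitrary direction.
rng = np.random.default_rng(0)
z = sphere(rng.normal(size=(200, 5)))
x = project(z, [1, 1, 0, 0, 0])
print(x.mean(), x.var())   # approximately 0 and 1 by construction
```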
2 Transformation Approach
Friedman's detour can be generalized by considering an arbitrary strictly monotone and smooth transformation $T: \mathbb{R} \to \mathbb{R}$ on the random variable $X$, so that $Y = T(X)$. Then if $X$ has distribution function $F(x)$ and density $f(x)$, let $Y$ have distribution function $G(y)$ and density $g(y)$. Given that the null version of the density $f(x)$ is $\phi(x)$, we denote the null version of the density $g(y)$ by $\gamma(y)$. A general family of indices is now defined by
$$I = \int_{\mathbb{R}} \{g(y) - \gamma(y)\}^2\,\gamma(y)\,dy, \qquad (1)$$
which specializes, up to a constant factor, to $I^L$ for $T(X) = 2\Phi(X) - 1$. (Note that we integrate with respect to $\gamma(y)\,dy$, which becomes $\tfrac{1}{2}\,dy$ for the Legendre index, $I^L$.) This family of indices incorporates the idea of a distance computation between the "observed" $f(x)$ or $g(y)$ and the "null" $\phi(x)$ or $\gamma(y)$ under the assumption of the null distribution. The transformation, $T$, can be considered to transform the index into a form suitable for estimation by an alternative orthonormal basis (see below), and to adjust the index's sensitivities to particular structures.

In somewhat reverse logic, now start with $I$, defined in its transformed state, and map it back through the inverse transformation:
$$I = \int_{\mathbb{R}} \left\{ \frac{f(x)}{T'(x)} - \frac{\phi(x)}{T'(x)} \right\}^2 \phi(x)\,dx
  = \int_{\mathbb{R}} \{f(x) - \phi(x)\}^2\, \frac{\phi(x)}{T'(x)^2}\,dx. \qquad (2)$$
This form clearly shows that the index is a weighted distance between $f(x)$ and a standard normal density, with weighting function $\phi(x)/T'(x)^2$.
Using this formulation, the Legendre index, $I^L$, proposed by Friedman (1987) becomes
$$I^L = \int_{\mathbb{R}} \{f(x) - \phi(x)\}^2\, \frac{1}{2\phi(x)}\,dx, \qquad (3)$$
since $T(X) = 2\Phi(X) - 1$ implies $T'(X) = 2\phi(X)$. Ironically, the mapping proposed by Friedman to reduce the influence of tail fluctuations does exactly the opposite. The term $1/\phi(x)$ effectively upweights tail observations, leaving the Legendre index very sensitive to differences from normality in the tails of $f(x)$. This is more a conceptual stumbling block than a practical deficiency, because the problem is somewhat moot for finite function expansions. Just the same, equation (3) illustrates an unintended effect of an otherwise innocuous-looking data transformation.

Through different considerations, Hall (1989) also observed the same phenomenon of upweighted tails in the Legendre index. It motivated him to propose an alternative index that measures the $L^2$-distance between $f(x)$ and the standard normal density with respect to Lebesgue measure:
$$I^H = \int_{\mathbb{R}} \{f(x) - \phi(x)\}^2\,dx.$$
Interestingly, this index is also a member of the family of indices (1), as it can be obtained through a suitable transformation. Equating the implicit weight 1 with $\phi(x)/T'(x)^2$ in (2), we find $T'(x)^2 = \phi(x)$, or $T'(x) = \sqrt{\phi(x)}$, and hence $T(X) \propto \Phi(X/\sqrt{2})$. Such a transformation seems unnatural at first, and may not contribute any additional insight beyond the obvious one that $I^H$ gives equal weight to all differences along $\mathbb{R}$. Aside from this, Hall's motivation for the design of the index comes from an established approach in density estimation.

We return then to Friedman's original idea of giving more weight to differences in the center. Going a step beyond Hall's approach, we propose to use $T(X) = X$, the identity transformation, giving:
$$I^N = \int_{\mathbb{R}} \{f(x) - \phi(x)\}^2\, \phi(x)\,dx. \qquad (4)$$
We call this index the "Natural Hermite" index, and Hall's index the "Hermite" index, because both use Hermite polynomials in the expansion of $f(x)$; but $I^N$ is "natural" because the distance from the normal density is taken with respect to normal measure. The class of $T$ that we have allowed is flexibly broad, to entertain various constructions which may not be entirely sensible for practical purposes. We make use only of the following one-parameter family of transformations,
$$T_\lambda(X) = \lambda\,\sqrt{2\pi}\,\left(\Phi(X/\lambda) - \tfrac{1}{2}\right),$$
which allows us to see the three indices in a natural order. The elements of the family are scaled to achieve $T_\lambda'(0) = 1$ for all $\lambda > 0$. The limit for $\lambda \to \infty$ is $T_\infty(X) = X$.
For $T_{\lambda=1}$ we get essentially the Legendre index, $I^L$; for $T_{\lambda=\sqrt{2}}$ we have an index proportional to the Hermite index, $I^H$; and for $T_\infty$ we get the Natural Hermite index, $I^N$. The proper way to interpret these transformations is according to their ability to thin out the tail weight for increasing $\lambda$: the smaller $\lambda$, the more the tail weight is inflated and allowed to exert influence on the projection pursuit index. As mentioned above, the problem of tail weight is more conceptual than practical. The upshot of this section, then, is conceptual consistency in a framework that allowed us to devise a new index which is simple and more radical in its treatment of tail weight. In the following sections we will mostly work with our new index and give cursory attention to the Legendre and Hermite indices when appropriate.
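The effect of $\lambda$ on tail weight is easy to check numerically. The sketch below (Python; it assumes the family $T_\lambda$ and the weight $\phi(x)/T'(x)^2$ in the form given above) evaluates the implied weight function for $\lambda = 1$, $\lambda = \sqrt{2}$ and the identity limit: the first inflates the tails, the second is flat, and the third downweights them.

```python
import numpy as np

def phi(x):
    """Standard normal density."""
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

def weight(x, lam):
    """Weight phi(x) / T_lambda'(x)^2 for T_lambda(X) = lam*sqrt(2*pi)*(Phi(X/lam) - 1/2),
    whose derivative is T_lambda'(x) = sqrt(2*pi) * phi(x/lam).
    lam = np.inf gives the identity transformation (Natural Hermite)."""
    if np.isinf(lam):
        return phi(x)                      # T(X) = X, so the weight is phi(x)
    t_prime = np.sqrt(2 * np.pi) * phi(x / lam)
    return phi(x) / t_prime**2

x = np.array([0.0, 1.0, 2.0, 3.0])
for lam, label in [(1.0, "Legendre-like"), (np.sqrt(2), "Hermite-like"), (np.inf, "Natural Hermite")]:
    print(label, np.round(weight(x, lam), 4))
# The lam = 1 weight grows like exp(x^2/2) in the tails, the lam = sqrt(2)
# weight is constant, and the lam = infinity weight decays like phi(x).
```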
3 Density Estimation
For the purposes of projection pursuit index estimation, the empirical data distribution needs to be mapped into a density estimate to which the definition of an index can be applied. A natural approach in this context is polynomial expansion (Friedman, 1987). Density estimates obtained in this way are usually not very pleasing for graphical purposes, since the non-negativity constraint is impossible to enforce. However, in the present context no graphical presentation of such estimates is intended. In addition, considerable analytical simplicity and computational efficiency is achieved by this approach.

In each of the indices described above, the density $f(x)$ (or $g(y)$ in the transformed version) is expanded using orthonormal functions:
$$f(x) = \sum_{i=0}^{\infty} a_i\,p_i(x).$$
In the Natural Hermite index, $\{p_i(x),\ i = 0, 1, \ldots\}$ is the set of standardized Hermite polynomials orthonormal with respect to (on. wrt) $\phi(x)$. (Note that $\phi(x)$ is also called the weight function of the polynomial basis. In the notation of Thisted (1988), $p_i(x) = (i!)^{-1/2}\,He_i(x)$; so $p_0(x) = 1$, $p_1(x) = x$, $p_2(x) = (x^2 - 1)/\sqrt{2}$. The subscript "e" is a convention used to distinguish this Hermite polynomial basis from the basis on. wrt $\phi^2(x)$.) In addition, to estimate $I^N$, $\phi(x)$ is expanded as $\sum_{i=0}^{\infty} b_i\,p_i(x)$. Inserting both expansions into $I^N$ (4) gives:
$$I^N = \int_{\mathbb{R}} \left\{ \sum_{i=0}^{\infty} a_i\,p_i(x) - \sum_{i=0}^{\infty} b_i\,p_i(x) \right\}^2 \phi(x)\,dx
     = \int_{\mathbb{R}} \left\{ \sum_{i=0}^{\infty} (a_i - b_i)\,p_i(x) \right\}^2 \phi(x)\,dx
     = \sum_{i=0}^{\infty} (a_i - b_i)^2,$$
since the $p_i$'s are on. wrt $\phi(x)$. The Fourier coefficients of the expansion, $a_i$ and $b_i$, are as follows:
$$a_i = \int_{\mathbb{R}} f(x)\,p_i(x)\,\phi(x)\,dx = \int_{\mathbb{R}} p_i(x)\,\phi(x)\,dF(x), \qquad
  b_i = \int_{\mathbb{R}} \phi(x)\,p_i(x)\,\phi(x)\,dx.$$
The coefficients, $b_i$, can be analytically calculated from Abramowitz and Stegun (1972, 22.5.18, 22.5.19):
$$b_{2i} = (-1)^i\,\frac{\left((2i)!\right)^{1/2}}{\sqrt{\pi}\; i!\; 2^{2i+1}}, \qquad b_{2i+1} = 0, \qquad i = 0, 1, 2, \ldots$$
Because of its dependence on $f$, the coefficient $a_i$ is unknown and must be estimated in order to estimate $I^N$. Reinterpreting $a_i$ as an expectation,
$$a_i = E_F\{p_i(X)\,\phi(X)\}$$
leads to the obvious sample estimate:
$$\hat a_i = \frac{1}{n} \sum_{j=1}^{n} p_i(x_j)\,\phi(x_j).$$
The index $I^N$ is estimated by using $\hat a_i$ and truncating the sum at $M$ terms,
$$\hat I_M^N = \sum_{i=0}^{M} (\hat a_i - b_i)^2.$$
The asymptotic theory for the choice of $M$ as a function of sample size for $I^L$ and $I^H$ is the subject of Hall's (1989) paper. Using Hall's approach we find that the choice of $M$ for $I^N$ is the same as that for $I^H$. Note also that the truncation at $M$ constitutes a smoothing of the true index, $I^N$.

The approximations in both the Legendre and Hermite indices are similarly constructed. In the Legendre index, the expansion is made on $g(y)$, after $X$ is transformed from $\mathbb{R}$ to $[-1, 1]$ by $Y = 2\Phi(X) - 1$, with $\{p_i(y),\ i = 0, 1, \ldots\}$ being the set of standardized Legendre polynomials (Friedman, 1987). The Hermite index uses Hermite polynomials on. wrt $\phi^2(x)$ (Hall, 1989).
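A direct implementation of $\hat I_M^N$ takes only a few lines. The sketch below (Python; an illustration, not the authors' code) builds the orthonormal polynomials $p_i$ from the standard three-term recurrence $He_{i+1}(x) = x\,He_i(x) - i\,He_{i-1}(x)$, forms $\hat a_i = n^{-1}\sum_j p_i(x_j)\phi(x_j)$, uses the analytic $b_i$ above, and returns the truncated sum of squared differences.

```python
import numpy as np
from math import factorial, sqrt, pi

def phi(x):
    return np.exp(-0.5 * x**2) / sqrt(2 * pi)

def hermite_orthonormal(x, order):
    """Values p_0(x), ..., p_order(x) of the Hermite polynomials orthonormal
    with respect to phi(x): p_i = He_i / sqrt(i!), with
    He_{i+1}(x) = x He_i(x) - i He_{i-1}(x)."""
    he = [np.ones_like(x), np.asarray(x, dtype=float)]
    for i in range(1, order):
        he.append(x * he[i] - i * he[i - 1])
    return [he[i] / sqrt(factorial(i)) for i in range(order + 1)]

def b_coefficient(i):
    """Analytic expansion coefficients of phi(x) itself:
    b_{2m} = (-1)^m sqrt((2m)!) / (sqrt(pi) m! 2^(2m+1)), b_odd = 0."""
    if i % 2 == 1:
        return 0.0
    m = i // 2
    return (-1)**m * sqrt(factorial(2 * m)) / (sqrt(pi) * factorial(m) * 2**(2 * m + 1))

def natural_hermite_index(x, m_terms):
    """Truncated Natural Hermite index estimate for a 1-d projected sample x."""
    x = np.asarray(x, dtype=float)
    p = hermite_orthonormal(x, m_terms)
    w = phi(x)
    return sum((np.mean(p[i] * w) - b_coefficient(i))**2 for i in range(m_terms + 1))

rng = np.random.default_rng(1)
print(natural_hermite_index(rng.normal(size=2000), 3))          # near zero for a normal sample
print(natural_hermite_index(rng.choice([-1.0, 1.0], 2000), 3))  # "central hole" data, clearly larger
```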
4 Structure Detection
Our interest in the structure sensitivity of the indices stems from implementing projection pursuit dynamically (Cook et al, 1991) in XGobi, dynamic graphics software being developed by Swayne et al (1991). Included in the implementation are controls for steepest ascent optimization of a variety of 2-dimensional projection pursuit indices. The main feature is that the procedure is visualized by sequential plotting of the projected data as the optimization ensues, and the interactive nature of the implementation enables the optimization to be readily started from multiple points. For many long but interesting hours we watched projections of various types of data as they were steered into local maxima of different projection pursuit indices. In the course of this we found that indices truncated as low as $M = 0$ or $M = 1$ were the most interesting and also useful. This flies in the face of the natural idea that $M$ should be chosen as large as possible, within limits dictated by the sample size. The usefulness of these low order indices arises from a "long-sighted" quality which enables them to see large structure, such as clusters, from a distance. Specifically, we found that low order Hermite and Natural Hermite indices often find projections with a "hole" in the center, whilst the low order Legendre index tends to find projections containing skewness. Higher order (4, 5, ...) indices lose the long-sightedness and become "short-sighted": they are receptive to finer detail in the projected data, although they need starting points much closer to the structure to find it. The intrigue induced by observing these behaviours led to the qualitative results in the next few sections.
5 One-dimensional Index
For simplicity we begin with the 1-dimensional index, using the Natural Hermite index as an example and then extending the results to both the Legendre and Hermite indices. We are interested in maxima of $I_0^N = (a_0 - b_0)^2$, of $I_1^N = (a_0 - b_0)^2 + (a_1 - b_1)^2$, and of its second term $(a_1 - b_1)^2$. Because each $(a_i - b_i)^2$ involves a quadratic in $a_i$, it is maximized by minimizing or maximizing $a_i$. Write $a_i$ as $E_F\{p_i(X)\,\phi(X)\}$ and the problem reduces to finding the types of distribution functions, $F(x)$, which minimize or maximize this expectation.

Now $F(x)$ needs to be absolutely continuous for the integral form (4) of the index, $I^N$, to exist, but once the expansion is truncated this restriction may be dropped and $F(x)$ may be discrete. In the execution of projection pursuit, $F(x)$ is restricted to the set of distribution functions of all 1-dimensional projections of $Z$:
$$F(x) \in \mathcal{F}_Z = \{F(x)\colon X = \alpha'Z,\ \alpha \in S^{p-1}\},$$
and $EZ = 0$ and $\mathrm{Cov}\,Z = I_p$ are assumed as usual. However, to understand the general types of distributions to which $a_i$ responds, consider $F(x)$ belonging to the expanded set:
$$\mathcal{F} = \{F(x)\colon E_F X = 0,\ E_F X^2 = 1\}.$$
Now $\mathcal{F}$ is convex, but it is not closed in the weak topology. In order to consider minimizing or maximizing $a_i$ over $\mathcal{F}$ we need to consider all $F(x)$ in the weak closure of $\mathcal{F}$, denoted $\bar{\mathcal{F}}$, which happens to be
$$\bar{\mathcal{F}} = \{F(x)\colon E_F X = 0,\ \mathrm{Var}_F X \le 1\}.$$
This set is furthermore weakly compact. (Proofs of these two statements, and of Lemma 5.1 and Propositions 5.2 and 5.3, are left to the Appendix.) Since $\bar{\mathcal{F}}$ is also convex, this is a natural domain for optimizing the projection pursuit coefficients, $a_i$. They are weakly continuous linear functionals, and their extrema in $\bar{\mathcal{F}}$ are taken on at the extremal elements.

Lemma 5.1: The extremal elements of $\bar{\mathcal{F}}$ are the union of:
(i) the 3-point masses, $F$, satisfying $E_F X = 0$, $E_F X^2 = 1$;
(ii) the 2-point masses, $F$, satisfying $E_F X = 0$, $E_F X^2 \le 1$;
(iii) the 1-point mass $F = \delta_0$
(where $\delta_x$ denotes a unit point mass at $x$).

Thus, the set of extremals is not very expressive, but it is sufficient to give insight into the behaviours of the lower order coefficients, $a_0$ and $a_1$. While it is true that $a_i$ takes on its extrema on these elements for all $i = 0, 1, 2, 3, \ldots$, none of its ability to respond to high frequency structure in $F$ is revealed by studying this extremal behaviour for larger orders of $i$.
5.1 Truncation at First Term - $I_0^N$
Consider the simplest but, in our experience, most useful index $I_0^N = (a_0 - b_0)^2$, where
$$a_0 = \int_{\mathbb{R}} \phi(x)\,dF(x) \quad (\text{since } p_0(x) \equiv 1), \qquad
  b_0 = \frac{1}{2\sqrt{\pi}} \quad (= a_0 \text{ when } f \equiv \phi),$$
so that $I_0^N = \left(a_0 - 1/(2\sqrt{\pi})\right)^2$. As mentioned earlier, we should expect two types of (local) maxima for $I_0^N$: one for a minimum of $a_0$ and one for a maximum.
Proposition 5.2:
(i) $a_0$ is minimized, with a value of $1/\sqrt{2\pi e}$, by a distribution with equal masses of weight 0.5 at $\pm 1$. (Call this distribution type CH, or a "central hole".)
(ii) $a_0$ is maximized, with a value of $1/\sqrt{2\pi}$, by a point mass at 0. This distribution actually maximizes $I_0^N$, with a value of $(1 - 1/\sqrt{2})^2/(2\pi)$. (Call this distribution type CM, or a "central mass" concentration.)
Figure 1: Symmetric interpolation (5) between distribution types CH and CM, showing $I_0^N$.
Intuitively, (i) says that to minimize $a_0$, mass should be placed as far out as possible, because of the shape of $\phi(x)$, but the mean and variance constraints impose limits on how far the mass can be from zero. Conversely, (ii) says that because $\phi(x)$ is unimodal with maximum at zero, to maximize $a_0$ one should place all mass at zero. This is not a proper element of $\mathcal{F}$, but it is clear that $a_0$ does not take on its maximum in $\mathcal{F}$. For example, the distribution
$$F_\gamma = \frac{1-\gamma}{2}\,\delta_{-x} + \gamma\,\delta_0 + \frac{1-\gamma}{2}\,\delta_{x}, \qquad x = 1/\sqrt{1-\gamma}, \qquad (5)$$
is centered, has unit variance, and $F_\gamma \to_w \delta_0$ as $x \to \infty$. This simple family of discrete distributions serves as an interpolation between the type CM distribution, for $\gamma \to 1$, $x \to \infty$, and the type CH distribution, for $\gamma = 0$, $x = 1$. Figure 1 shows a plot of $I_0^N$ for this interpolating family. Despite the greater relative magnitude of $I_0^N$ for central mass concentration shown in this figure, the nature of the intermediate dip in the curve demonstrates that $I_0^N$ will also respond to central holes. In fact, $I_0^N$ will more often respond to central holes, since the range of $\gamma$-values which ascend to type CH is larger (about 0.6) than that for type CM (about 0.4). The Hermite index of order 0, $I_0^H$, behaves identically to $I_0^N$ with the exception of a constant factor. The Legendre index, on the other hand, doesn't have an equivalent term: $I_0^L = 0$, always.
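Proposition 5.2 and the curve in Figure 1 can be verified numerically. The sketch below (Python; an illustrative check only, not part of the original computations) evaluates $a_0$ and $I_0^N$ in closed form along the symmetric three-point family (5); the constants $1/\sqrt{2\pi e}$, $1/\sqrt{2\pi}$ and $(1 - 1/\sqrt{2})^2/(2\pi)$ from the proposition appear at the ends of the $\gamma$ range.

```python
import numpy as np

SQRT_2PI = np.sqrt(2 * np.pi)
B0 = 1 / (2 * np.sqrt(np.pi))            # b_0 = 1/(2*sqrt(pi))

def phi(x):
    return np.exp(-0.5 * x**2) / SQRT_2PI

def a0_symmetric(gamma):
    """a_0 = E_F{phi(X)} for family (5): masses (1-gamma)/2 at -x and +x and
    mass gamma at 0, with x = 1/sqrt(1-gamma) to keep unit variance."""
    x = 1 / np.sqrt(1 - gamma)
    return (1 - gamma) * phi(x) + gamma * phi(0.0)

gammas = np.linspace(0.0, 0.999, 1000)
index0 = (a0_symmetric(gammas) - B0)**2

print(a0_symmetric(0.0))                 # 1/sqrt(2*pi*e) ~ 0.2420 (type CH)
print(phi(0.0))                          # 1/sqrt(2*pi)   ~ 0.3989 (type CM limit)
print((phi(0.0) - B0)**2, index0.max())  # (1 - 1/sqrt(2))^2/(2*pi) ~ 0.0136, approached as gamma -> 1
print(gammas[index0.argmax()])           # the maximizing gamma sits at the CM end of the range
```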
5.2 Truncation at Second Term - $I_1^N$

5.2.1 Second Term Alone
The second term is $(a_1 - b_1)^2$, where
$$a_1 = \int_{\mathbb{R}} x\,\phi(x)\,dF(x) \quad (\text{since } p_1(x) \equiv x), \qquad
  b_1 = 0 \quad (= a_1 \text{ when } f \equiv \phi),$$
and so it reduces to $a_1^2$. For this quantity we need only consider maximizing $a_1$, because the skew symmetry of $a_1$ about $x = 0$ implies that the minimal value of $a_1$ will be equal in magnitude to its maximal value, and obtained by a reflection through $x = 0$ of the maximal distribution.
Proposition 5.3:
The second coefficient, $a_1$, is maximized by the two-point distribution with masses $\gamma$ and $(1-\gamma)$ at $\sqrt{(1-\gamma)/\gamma}$ and $-\sqrt{\gamma/(1-\gamma)}$, respectively, where $\gamma$ is found by maximizing $\sqrt{\gamma(1-\gamma)}\left(e^{-(1-\gamma)/(2\gamma)} - e^{-\gamma/(2(1-\gamma))}\right)$. ($\gamma$ approximately equals 0.838.)
The maximum value of $I_1^N$ is achieved equally by this skewed distribution and by the distribution of opposite skewness produced by its reflection through $x = 0$. Call these distributions type SK. As above, it is useful to embed the distributions of interest in a one-parameter family:
$$F_\gamma = (1-\gamma)\,\delta_{x} + \gamma\,\delta_{y}, \qquad x = -\sqrt{\gamma/(1-\gamma)}, \quad y = \sqrt{(1-\gamma)/\gamma}. \qquad (6)$$
The members of this family are again centered and scaled to unit variance. They are skewed, except for the type CH distribution at $\gamma = 0.5$ and for the type CM distribution obtained when $\gamma \to 1$. The type SK maximum occurs in between the two extremes, at approximately $\gamma = 0.838$. Figure 2(a) plots $a_1$ for this family as a function of $\gamma$, on the interval $(0.5, 1)$. The single mode of the curve occurs at the distribution given in Proposition 5.3. The second terms of the Hermite and Legendre indices behave very much like this. In the Legendre index this is the lowest order term, because $I_0^L = 0$, so that $I_1^L$ responds exclusively to skewness in the data and the next section does not apply.
5.2.2 Piecing First and Second Terms Together
Truncating the sum at two terms gives the order 1 index, $I_1^N = \left(a_0 - 1/(2\sqrt{\pi})\right)^2 + a_1^2$, so the behaviour of $I_1^N$ depends on the interactive behaviour of the two terms. Figure 2(b) plots $I_1^N$ for the skewed interpolation of form (6). The distribution which maximizes $I_1^N$ is of type SK, but not exactly the same as that which maximizes $a_1$ alone, because the interaction with the first term draws it towards a type CM distribution. It is characterized by having approximate masses 0.13 and 0.87 at $-2.59$ and $0.387$, respectively. The result is intuitive. The maximum value of $a_1^2$ is greater than the maximum value of the first term, and the distribution of type CM which maximizes the first term has no skewness, so the maximal distribution cannot be type CM.
Figure 2: Skewed interpolation (6) between distribution types CH and CM, showing (a) $a_1$ and (b) $I_1^N$.
On the other hand, the SK type distributions that are "close" to maximizing $a_1^2$ also get a contribution from the first term, which favours all mass at 0. Clearly the behaviour of $I_1^N$ is to respond to skewness when it is present. This behaviour is also seen in the Hermite index, $I_1^H$.
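The interplay of the two terms can be checked in the same way. The sketch below (Python; again only an illustrative reconstruction of Figure 2) evaluates $a_1$ and $I_1^N$ in closed form over the two-point family (6), recovering a maximizer of $a_1$ near $\gamma \approx 0.838$ (Proposition 5.3) and the slightly shifted maximizer of $I_1^N$ described above.

```python
import numpy as np

B0 = 1 / (2 * np.sqrt(np.pi))

def phi(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

def family6(gamma):
    """Two-point family (6): mass (1-gamma) at x = -sqrt(gamma/(1-gamma)) and
    mass gamma at y = sqrt((1-gamma)/gamma); centered with unit variance."""
    x = -np.sqrt(gamma / (1 - gamma))
    y = np.sqrt((1 - gamma) / gamma)
    return x, y

def a0_a1(gamma):
    x, y = family6(gamma)
    a0 = (1 - gamma) * phi(x) + gamma * phi(y)           # E_F{phi(X)}
    a1 = (1 - gamma) * x * phi(x) + gamma * y * phi(y)   # E_F{X phi(X)}
    return a0, a1

gammas = np.linspace(0.501, 0.999, 2000)
a0, a1 = a0_a1(gammas)
i1 = (a0 - B0)**2 + a1**2

print(gammas[a1.argmax()])   # close to 0.838, the maximizer of a_1 alone
print(gammas[i1.argmax()])   # shifted toward type CM by the first term
x, y = family6(gammas[i1.argmax()])
print(x, y)                  # roughly -2.6 and 0.39, with masses ~0.13 and ~0.87
```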
5.3 Higher Order Indices
There is a natural trend in that odd order indices respond to skewness, type SK distributions, whilst even order indices respond to central mass concentration, type CM distributions. As the order increases, the type CM distribution tends to dominate for the Natural Hermite index. This is also true for very high order Hermite and Legendre indices, although it is not as easy to see due to increased variation and number of inflection points, at least in orders 3 to 5. This is shown in Figure 3, where the values of the (a) Natural Hermite, (b) Hermite and (c) Legendre indices are plotted for orders 0 to 5 for the skewed interpolation of form (6). (Note that an increase in order means an increase in index value, hence the lowest line (solid) is order 0 and the highest line (dash-dot) is order 5 in each plot.) The extremal behaviour of the higher order indices is not so interesting, because we know we can collect these extremal types of structure easily with the order 0 and 1 indices. The practical power of high order indices stems from the increased modality of the polynomials, which enables them to detect higher-frequency deviations from normality, that is, finer structure. There is a tradeoff, because the increased modality makes it necessary to be very close to the structure to find it by projection pursuit optimization.
Figure 3: Skewed interpolation (6) for indices of orders $k = 0, 1, \ldots, 5$: (a) Natural Hermite, (b) Hermite, (c) Legendre.
6 Illustrations of 1-dimensional Projection Pursuit

6.1 Low Order Indices
Now that we have examined some aspects of the behaviour of the polynomial indices over the ideal expanded set, $\bar{\mathcal{F}}$, we would like to examine their behaviour in practice, that is, over the set, $\mathcal{F}_Z$, of distributions of 1-dimensional projections of $Z$. For most dimensions, $p$, of $Z$ this is not possible, but it is practical for the simple case of $p = 2$. The following type of plot has been used by Huber (1990): project a two-dimensional distribution onto lines parametrized by angles of unit vectors in the plane,
$$\alpha = (\cos\theta, \sin\theta), \qquad \theta = 0^\circ, 1^\circ, \ldots, 179^\circ,$$
calculate $\hat I_0^N$ for each $\theta$, and display its values radially as a function of $\theta$. Figure 4(a) shows an example of this for a bivariate distribution whose X and Y variables are independent and distributed according to a near type CM and an exact type CH distribution, respectively. The data points (diamonds) are plotted in the center of the figure.
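The display just described is straightforward to reproduce. The sketch below (Python) sweeps $\alpha(\theta) = (\cos\theta, \sin\theta)$ through $0^\circ, \ldots, 179^\circ$ and computes the order 0 estimate $\hat I_0^N$ for each projection of a toy bivariate sample; the data construction only mimics the CM/CH setup of Figure 4 and is not the data set actually used in the paper.

```python
import numpy as np

B0 = 1 / (2 * np.sqrt(np.pi))

def phi(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

def i0_hat(x):
    """Order 0 Natural Hermite index estimate for a 1-d sample x."""
    return (np.mean(phi(x)) - B0)**2

# Hypothetical bivariate data echoing Figure 4: X near a central mass (CM),
# Y an exact central hole (CH) with equal masses at -1 and +1.
rng = np.random.default_rng(2)
n = 76
data = np.column_stack([0.1 * rng.normal(size=n),          # near type CM
                        np.repeat([-1.0, 1.0], n // 2)])   # type CH

thetas = np.arange(180)
index = np.array([i0_hat(data @ np.array([np.cos(np.radians(t)),
                                          np.sin(np.radians(t))]))
                  for t in thetas])

print(thetas[index.argmax()])   # global maximum where the projection is close to the CM margin
print(index[0], index[90])      # both the CM (0 degrees) and CH (90 degrees) margins are local maxima
```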
Figure 4: Bivariate distribution with X near type CM, Y exactly type CH, independently distributed: (a) $I_0^N$, (b) projection (X) which maximizes $I_0^N$, distribution near type CM, (c) projection (Y) corresponding to a local maximum of $I_0^N$, distribution type CH.
Figure 5: Two dimensions of the flea beetle data: (a) $I_0^N$, (b) projection which maximizes $I_0^N$, (c) projection corresponding to a local maximum of $I_0^N$, (d) $I_1^N$, (e) projection which maximizes $I_1^N$, (f) projection corresponding to the first local maximum of $I_1^N$.
(Note that overplotting obscures the relative frequency at each point. In fact, the two points on $X = 0$ each contain 36 observations, whilst the other four points contain 1 observation each.) The solid line represents the index value, $\hat I_0^N$, plotted in relative distances from the center, in both positive and negative directions along the projection vector, $\alpha$. The dotted circle is a guidance line plotted at the median index value. The maximum value is attained at $\theta = 0^\circ$, that is, the projection of the data onto the dashed line at $Y = 0$, which is the near type CM distribution (Figure 4(b)). However, from this global maximum the index dips low and rises to a smaller maximum at $\theta = 90^\circ$, that is, the projection of the data onto the dotted line at $X = 0$, which corresponds to the type CH distribution (Figure 4(c)). Each of these maxima tells us something interesting about the data, so ideally we would want the optimizer to return both. (Note that all projections between $0^\circ$ and $90^\circ$ constitute interpolations between the near type CM distribution and the type CH distribution.)

Figure 5(a) contains a similar plot of 2-dimensional data generated by taking a 2-dimensional projection of 6-dimensional data. The 6-dimensional data consist of 6 measurements on each of 74 flea beetles from 3 different species (Lubischew, 1969). The 2-dimensional data shown is the projection of the 6-dimensional principal component space which best separates the three species. Figure 5(a) shows $\hat I_0^N$ calculated for each $1^\circ$ incremental projection, and Figure 5(b) contains a histogram of the projected data corresponding to the maximum index value. In this case it is a near type CH distribution. Figure 5(d) shows $\hat I_1^N$, the order 1 index, calculated for each $1^\circ$ incremental projection. The maximizing distribution, which separates one cluster from the other two, is near type SK, as seen in the histogram in plot (e). There are two other local maxima, but the angular difference between these and the global maximum is close enough to $60^\circ$ to see that these are also near type SK distributions, each separating off one of the three clusters. This example illustrates the power of the "long-sighted" order 0 and 1 indices for separating gross clusters: an order 0 index will capture projections with two relatively equal clusters, whilst an order 1 index will capture projections with one large cluster and a smaller cluster (perhaps obtained by projection of two clusters on top of each other and a single cluster, as in this example).

6.2 High Order Indices
To illustrate the sensitivity of higher order indices to fine structure, we switch to the Legendre index, because it needs considerably fewer terms than the Hermite and Natural Hermite indices to capture the same depth of structure. A rationale for this is given by asymptotic results in Hall (1989) (from bounds on each term given in Sansone, 1959, pp. 199, 369). The data used for this example is generated by the infamous random number generator RANDU, which is based on the multiplicative congruential scheme $x_{n+1} = (2^{16} + 3)\,x_n \pmod{2^{31}}$. RANDU was widely used in the 1970's but unfortunately fails most 3-dimensional criteria for randomness (Knuth, 1981). Points generated by RANDU lie on 15 parallel planes, defined by $9x_n - 6x_{n+1} + x_{n+2} \equiv 0 \pmod{2^{31}}$, when sequentially placed into a 3-dimensional cube. The 2-dimensional data used for the analysis is obtained by projecting the 3-dimensional
Figure 6: Legendre index on RANDU data: (a) $I_2^L$, (b) projection which maximizes $I_2^L$, (c) $I_{25}^L$, (d) projection which maximizes $I_{25}^L$.
data into a plane containing the normal vector $(9, -6, 1)/\|(9, -6, 1)\|$, so that the planar structure in the points is visible. Figure 6(a) displays the order 2 Legendre index, $\hat I_2^L$. Its long-sighted quality is apparent in that it simply finds the "edge effects" of the square, where the projected data look like a sample from a uniform distribution. In contrast, the global maximum of the order 25 index, $\hat I_{25}^L$, corresponds to the planar structure, but because it is such a narrow spike it would be difficult to find by optimization. This is the tradeoff which must be borne with the higher order indices: the fine structure can be found, but one needs to be much closer to see it. (This higher order behaviour was first observed by Huber (1990).)
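The RANDU construction is easy to reproduce. The sketch below (Python) generates consecutive triples from $x_{n+1} = (2^{16}+3)\,x_n \bmod 2^{31}$, places them in the unit cube, and projects them onto a plane containing the normal vector $(9, -6, 1)$; along that normal direction the points collapse onto at most 15 distinct values, the lattice planes that the order 25 Legendre index reacts to. It reproduces only the data construction, not the index optimization.

```python
import numpy as np

def randu(n, seed=1):
    """n consecutive RANDU values: x_{k+1} = 65539 * x_k mod 2^31, scaled to (0, 1)."""
    out = np.empty(n)
    x = seed
    for k in range(n):
        x = (65539 * x) % 2**31
        out[k] = x / 2**31
    return out

# Overlapping triples (u_k, u_{k+1}, u_{k+2}) placed in the unit cube.
u = randu(3000)
triples = np.column_stack([u[:-2], u[1:-1], u[2:]])

# Project onto a plane containing the normal vector (9, -6, 1) of the lattice planes.
normal = np.array([9.0, -6.0, 1.0])
normal /= np.linalg.norm(normal)
other = np.cross(normal, [0.0, 0.0, 1.0])
other /= np.linalg.norm(other)
projected = triples @ np.column_stack([normal, other])

# Along the normal direction the points take only a handful of distinct values,
# one per lattice plane.
print(len(np.unique(np.round(projected[:, 0], 6))))   # at most 15
```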
7 Two-dimensional Index
Our experience with the implementation of projection pursuit in XGobi is with 2-dimensional indices. This is the most natural lower dimensional projection to examine for dynamic graphics implementations, and perhaps for general human visual perception, so it is particularly pertinent to extend the results on the 1-dimensional indices to 2-dimensional indices. The construction of the 2-dimensional Natural Hermite index follows closely the 1-dimensional construction, so for the purposes of brevity we will
only point to the differences and refer the reader back to Sections 1-3 for complete treatment. Consider a bivariate projection of $Z$,
$$Z \longmapsto (X, Y) = (\alpha'Z, \beta'Z) \in \mathbb{R}^2 \qquad (\alpha, \beta \in S^{p-1},\ \alpha'\beta = 0).$$
Then the Natural Hermite index has the following form:
$$I^N = \int_{\mathbb{R}^2} \{f(x, y) - \phi(x, y)\}^2\, \phi(x, y)\,dx\,dy.$$
Bivariate Hermite polynomials that are orthonormal with respect to $\phi(x, y)$ (Jackson, 1936) are used to expand $f(x, y)$ and $\phi(x, y)$.
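The section breaks off here, but by analogy with the 1-dimensional case the bivariate expansion uses products $p_i(x)p_j(y)$ of the univariate orthonormal Hermite polynomials, and the index again reduces to a sum of squared differences of Fourier coefficients. The sketch below (Python) is a speculative illustration of such a 2-dimensional estimate under that product-basis assumption; the coefficient layout and the triangular truncation $i + j \le M$ are our choices, not necessarily the authors'.

```python
import numpy as np
from math import factorial, sqrt, pi

def phi(x):
    return np.exp(-0.5 * x**2) / sqrt(2 * pi)

def hermite_orthonormal(x, order):
    """p_0(x), ..., p_order(x), orthonormal with respect to phi(x)."""
    he = [np.ones_like(x), np.asarray(x, dtype=float)]
    for i in range(1, order):
        he.append(x * he[i] - i * he[i - 1])
    return [he[i] / sqrt(factorial(i)) for i in range(order + 1)]

def b_coefficient(i):
    """Expansion coefficients of the univariate phi(x); the bivariate null
    coefficients factor as b_i * b_j because phi(x, y) = phi(x) phi(y)."""
    if i % 2 == 1:
        return 0.0
    m = i // 2
    return (-1)**m * sqrt(factorial(2 * m)) / (sqrt(pi) * factorial(m) * 2**(2 * m + 1))

def natural_hermite_index_2d(x, y, m_terms):
    """Assumed 2-d analogue: sum over i + j <= m_terms of (a_ij - b_i b_j)^2,
    with a_ij estimated by the sample mean of p_i(x) p_j(y) phi(x) phi(y)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    px, py = hermite_orthonormal(x, m_terms), hermite_orthonormal(y, m_terms)
    w = phi(x) * phi(y)
    total = 0.0
    for i in range(m_terms + 1):
        for j in range(m_terms + 1 - i):
            a_ij = np.mean(px[i] * py[j] * w)
            total += (a_ij - b_coefficient(i) * b_coefficient(j))**2
    return total

rng = np.random.default_rng(3)
print(natural_hermite_index_2d(rng.normal(size=3000), rng.normal(size=3000), 2))  # near zero under the null
```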