Kernel-based Projection Pursuit Indices in XGobi 1

Sigbert Klinke

Institute of Statistics, Catholic University of Louvain, 34 Voie du Roman Pays, B-1348 Louvain-la-Neuve

Dianne Cook

Department of Statistics Iowa State University Ames, IA 50011

Abstract

The software XGobi offers a selection of different index functions to use for Exploratory Projection Pursuit (EPP). EPP is a method for finding interesting low-dimensional (in this case 2-dimensional) projections of high-dimensional data. We discuss the inclusion of two more index functions which use several standard techniques to improve the speed of the indices based on kernel density estimation. The speed improvements cannot be applied to derivative calculations, unfortunately. Additionally we discuss the question of bandwidth selection for both indices and propose the use of a bandwidth computed from the Mean Squared Error (MSE) rather than a multivariate "rule-of-thumb" based on a Gaussian density. The result is a smaller "default" bandwidth, which reflects earlier results that index functions based on smaller bandwidths are better detectors of structure.

1 This work originated in the Sonderforschungsbereich 373 "Quantifikation und Simulation Ökonomischer Prozesse", Humboldt-Universität zu Berlin, and was printed at its instigation using funds made available to it by the Deutsche Forschungsgemeinschaft.

1 Motivation

The analysis of multivariate data is an interesting and difficult problem in statistics because structures of dimension greater than 3 cannot be easily visualized. One approach to this problem is to start with a known structure, using a parametric model, and examine how well it fits. Since we do not know the structure we may have to consider many such models. Non-parametric models, which do not fix the structure in the data, for example kernel density estimation and smoothing, suffer from the "curse of dimensionality", which means we need increasingly many observations to be confident that the estimate is reasonable as the dimension increases. Non-parametric methods based on projections escape the curse. The power of using projections in exploratory data analysis is implied by the theorem of Cramér-Wold (Mardia, Kent & Bibby 1979): the distribution of a random p-vector X is completely determined by the set of all one-dimensional distributions of linear combinations \alpha^T X, where \alpha \in \mathbb{R}^p ranges through all fixed p-vectors.

That implies that a multivariate distribution can be defined completely by specifying the distributions of all its linear projections. Of course if we fix the distribution in all 2-dimensional projections of a multivariate distribution, we also determine the multivariate distribution. Thus we have to visit all 2-dimensional projections, as is done in the Grand Tour (Asimov 1985), but as the dimension increases we need a longer time to view a dense set of projections. A shortcut around this problem is suggested in theory by Diaconis & Freedman (1984): under suitable conditions, most projections are approximately Gaussian. Thus the viewing time can be shortened by using an index of non-normality to direct the sequence of views. We assign to each projection an index value which describes the departure from normality. If we try to maximize this index function, we will end up with some most non-normal projections. This method is called Exploratory Projection Pursuit and reduces the number of projections we have to visit. But we should not view Projection Pursuit only as an extension of the Grand Tour. In practice the combination of both gives very fruitful results; for example, Projection Pursuit can be used to reduce the dimension of a large data set and the smaller-dimensional subset may be viewed using a Grand Tour.

2 Exploratory Projection Pursuit and XGobi

Interactive and dynamic graphical methods provide data analysts with the power of motion, high interaction and rapid response for exploring and understanding high-dimensional relationships among many measurements. In XGobi (Swayne, Cook & Buja 1991) 2-dimensional Projection Pursuit is implemented in an interactive manner so that a user can actively control and watch the results of the procedure.

XGobi, more generally, provides a framework for exploring high-dimensional data graphically by providing tools for manipulating scatterplots. The tools include a variety of plotting methods (textured dotplot by Tukey & Tukey (1990), pairwise plots, 3-variable rotation, grand tour on 3 or more variables) and manipulation methods (scaling, brushing, identification, line drawing). There are currently eight Projection Pursuit indices available in XGobi:

- Two indices based on kernel density estimation: Friedman-Tukey (Friedman & Tukey 1974) and Entropy (Huber 1985).
- Three indices based on orthonormal function expansions: Legendre (Friedman 1987), Hermite (Hall 1989) and Natural-Hermite (Cook, Buja & Cabrera 1994).
- Three indices based on isolated specific terms of orthonormal function expansions which are tailored to detect particular types of structure in projections, as suggested by their names: Holes, Central Mass and Skewness (Cook et al. 1994).

The latter indices have been the most practical in the dynamic implementation because they can be computed fast enough to be readily watchable in "real-time". On the other hand the Friedman-Tukey and Entropy indices, although implemented, are useless for any practically sized data set because their computation needs n^2 kernel evaluations (compare also Figure 2). The aim of this paper is to describe the implementation of new indices of these two forms which are practical to use because their computation is hastened by using binning for the kernel density estimation.

3 Estimating kernel based indices

The Friedman-Tukey-Index measures

I_{FT} = \int_{\mathbb{R}^2} f_Y^2(y) \, dy,

where Y = (\alpha^T X, \beta^T X) and X is a random p-vector. If the data are centered (E(X) = 0) and sphered (Cov(X) = I_p), this index is maximized by a parabolic density which looks similar to a truncated bivariate normal density. To estimate the index note that it can be rewritten as

I_{FT} = E(f_Y(y)).

The unknown density, f_Y(y), is replaced by a kernel density estimate,

\hat{f}_Y(y) = \frac{1}{nh^2} \sum_{j=1}^{n} K\left(\frac{y - y_j}{h}\right),

and the expectation is estimated by the sample mean:

\hat{I}_{FT} = \frac{1}{n^2 h^2} \sum_{i=1}^{n} \sum_{j=1}^{n} K\left(\frac{y_j - y_i}{h}\right).

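As an illustration, the estimator can be evaluated by a naive double sum. The following is a sketch in Python, not the actual XGobi C implementation; it assumes the bivariate Triweight kernel K(y) = (4/π)(1 − |y|²)³, which the paper uses later.

```python
import numpy as np

def triweight(u):
    """Bivariate Triweight kernel K(y) = (4/pi)(1 - |y|^2)^3 for |y| < 1, 0 otherwise."""
    r2 = np.sum(u * u, axis=-1)
    return np.where(r2 < 1.0, (4.0 / np.pi) * (1.0 - r2) ** 3, 0.0)

def friedman_tukey(y, h):
    """Naive O(n^2) estimate: 1/(n^2 h^2) * sum_i sum_j K((y_j - y_i)/h)."""
    n = len(y)
    diff = (y[None, :, :] - y[:, None, :]) / h   # all pairwise differences (y_j - y_i)/h
    return float(triweight(diff).sum()) / (n * n * h * h)
```

If all n points coincide, the estimate reduces to K(0)/h² = 4/(πh²), a fact that reappears in Section 5.1.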
It is reasonable to take the same bandwidth in both directions, since the data are sphered. Whereas this approach was chosen to catch "local densities", the index suggested by Huber (1985) is based on the fact that the negative entropy is minimized by the standard normal distribution. The Entropy-Index is defined by

I_E = \int_{\mathbb{R}^2} f_Y(y) \log(f_Y(y)) \, dy,

which leads, with the same arguments as for the Friedman-Tukey-Index, to the estimate

\hat{I}_E = \frac{1}{n} \sum_{i=1}^{n} \log\left( \frac{1}{nh^2} \sum_{j=1}^{n} K\left(\frac{y_j - y_i}{h}\right) \right).

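In the same sketch style (plain Python, Triweight kernel assumed, not the XGobi code), the Entropy-Index estimate averages the log of the density estimate at each observation:

```python
import numpy as np

def triweight(u):
    r2 = np.sum(u * u, axis=-1)
    return np.where(r2 < 1.0, (4.0 / np.pi) * (1.0 - r2) ** 3, 0.0)

def entropy_index(y, h):
    """1/n * sum_i log( 1/(n h^2) sum_j K((y_j - y_i)/h) ).
    The diagonal term K(0) > 0 keeps every density value strictly positive."""
    n = len(y)
    dens = triweight((y[None, :, :] - y[:, None, :]) / h).sum(axis=0) / (n * h * h)
    return float(np.log(dens).mean())
```

For n coincident points the inner sum is n·K(0), so the index equals log(4/π) − log(h²); this matches the degenerate maximum discussed in Section 5.1.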
3.1 Improving computational speed

Both kernel indices rely on the Rosenblatt-Parzen estimator

\hat{f}_Y(y) = \frac{1}{nh^2} \sum_{j=1}^{n} K\left(\frac{y_j - y}{h}\right),

and we need to calculate it at the points y_i. For the Friedman-Tukey-Index the formula is

\hat{I}_{FT}(\alpha, \beta) = \frac{1}{n^2 h^2} \sum_{i=1}^{n} \sum_{j=1}^{n} K\left(\frac{y_j - y_i}{h}\right).

If the data y_i = (y_{i,1}, y_{i,2}) are sorted in ascending order in y_{i,1}, and the kernel K is symmetric (K(-y) = K(y)) and has compact support, this can be simplified to

\hat{I}_{FT}(\alpha, \beta) = \frac{1}{n^2 h^2} \left( nK(0) + 2 \sum_{i=1}^{n} \sum_{j=j_i}^{i-1} K\left(\frac{y_j - y_i}{h}\right) \right),    (1)

with j starting at j_i. Since we are interested in finding maxima of the index function we could neglect some terms, i.e. \frac{1}{n^2 h^2}, nK(0) and the 2, which would lead to the estimate

\tilde{I}_{FT}(\alpha, \beta) = \sum_{i=1}^{n} \sum_{j=j_i}^{i-1} K\left(\frac{y_j - y_i}{h}\right).

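A sketch of this truncated computation (again assuming the Triweight kernel, whose support is the unit disk): after sorting on the first coordinate, each inner loop can stop as soon as the first-coordinate gap reaches h, and symmetry halves the remaining work.

```python
import numpy as np

def triweight(u):
    r2 = np.sum(u * u, axis=-1)
    return np.where(r2 < 1.0, (4.0 / np.pi) * (1.0 - r2) ** 3, 0.0)

def friedman_tukey_sorted(y, h):
    """Formula (1): sort on the first coordinate, use the symmetry of K, and
    for each i scan only those j < i with y_{i,1} - y_{j,1} < h (compact
    support of the kernel)."""
    y = y[np.argsort(y[:, 0])]
    n = len(y)
    total = n * float(triweight(np.zeros(2)))    # the n * K(0) diagonal terms
    for i in range(n):
        j = i - 1
        while j >= 0 and y[i, 0] - y[j, 0] < h:  # K vanishes once the gap >= h
            total += 2.0 * float(triweight((y[j] - y[i]) / h))
            j -= 1
    return total / (n * n * h * h)
```

The result agrees with the naive double sum; only the number of kernel evaluations changes.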
The optimization routine uses the difference between two successive projections, I(\alpha_{new}, \beta_{new}) - I(\alpha_{old}, \beta_{old}), to control the optimization, so we restrict ourselves to formula (1), because adding the additional term and multiplying by constants are easily done at the end of the index calculation (programs can be found in Appendix B.2 and B.3). Nearly the same can be done with the Entropy-Index:

\hat{I}_E(\alpha, \beta) = \frac{1}{n} \sum_{i=1}^{n} \left( \log\left( K(0) + \sum_{j=1, j \neq i}^{n} K\left(\frac{y_j - y_i}{h}\right) \right) - \log(nh^2) \right).

With the use of additional memory we can estimate the Entropy-Index nearly as efficiently as the Friedman-Tukey-Index. For this compare the programs B.4 and B.5. Some common bivariate kernels are given in Table 1.

Name          Definition
Uniform       K(y) = I(|y| < 1) \frac{1}{\pi}
Triangle      K(y) = I(|y| < 1) \frac{3}{\pi} (1 - |y|)
Epanechnikov  K(y) = I(|y| < 1) \frac{2}{\pi} (1 - |y|^2)
Quartic       K(y) = I(|y| < 1) \frac{3}{\pi} (1 - |y|^2)^2
Triweight     K(y) = I(|y| < 1) \frac{4}{\pi} (1 - |y|^2)^3
Gaussian      K(y) = \frac{1}{2\pi} \exp\left(-\frac{|y|^2}{2}\right)
Cosine        K(y) = I(|y| < 1) \frac{1}{4(1 - 2/\pi)} \cos\left(\frac{\pi |y|}{2}\right)

Table 1: Common bivariate kernels

In XGobi the Triweight kernel is used, because of the existence of second derivatives, which results in smooth first derivatives. The next step is to use binning, that means to tessellate the \mathbb{R}^2-plane with square bins of binwidth \delta, although other forms, for example hexagonal bins, are possible. An algorithm for the use of hexagonal bins for density estimation is derived in Appendix C. We replace every observation by the center of the bin in which it falls,

\bar{y}_i = \delta \, int\left(\frac{y_i}{\delta}\right),

where int(y) computes the largest integer smaller than y. The kernel function K is replaced by a stepwise function over the bins,

\bar{K}(y) = K\left(\delta \, int\left(\frac{|y|}{\delta} + 0.5\right)\right),

with |w| = (|w_1|, |w_2|). We get the following values for \bar{K}(y_1, y_2):

  |y_1| < 0.5\delta,                |y_2| < 0.5\delta                ->  K(0, 0)
  0.5\delta \le |y_1| < 1.5\delta,  |y_2| < 0.5\delta                ->  K(\delta, 0)
  |y_1| < 0.5\delta,                0.5\delta \le |y_2| < 1.5\delta  ->  K(0, \delta)
  0.5\delta \le |y_1| < 1.5\delta,  0.5\delta \le |y_2| < 1.5\delta  ->  K(\delta, \delta)
  ...

One advantage is that the differences \bar{y}_j - \bar{y}_i can be expressed as integer multiples of \delta. Another advantage is that the kernel function \bar{K} only has to be evaluated at (i\delta, j\delta) with i, j = 0, ..., int(h/\delta). This could be done once and the result stored in memory. Unfortunately we cannot use this advantage, because the binwidth has been fixed to \delta = 0.01 in XGobi, which assures that the value of the index function will not vary much if we rotate the data y_i in the projection plane. It would lead to an enormous number of bins if the bandwidth is big.

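The lookup idea can be sketched as follows. This is an illustration, not the XGobi implementation: it assumes a freely chosen binwidth δ, which, as just noted, XGobi fixes at 0.01. The kernel is evaluated once on the grid of possible integer bin differences and reused for every pair of occupied bins.

```python
import numpy as np

def triweight(u):
    r2 = np.sum(u * u, axis=-1)
    return np.where(r2 < 1.0, (4.0 / np.pi) * (1.0 - r2) ** 3, 0.0)

def binned_friedman_tukey(y, h, delta):
    """Replace observations by bin centres; pairwise differences then become
    integer multiples of delta, so the kernel is needed only on a small grid
    that is evaluated once and looked up afterwards."""
    idx = np.floor(y / delta).astype(int)                 # bin index per point
    bins, counts = np.unique(idx, axis=0, return_counts=True)
    m = int(h / delta) + 1                                # beyond m bins, K = 0
    g = np.arange(-m, m + 1)
    # kernel lookup table on the grid of possible (integer) bin differences
    table = triweight(np.stack(np.meshgrid(g, g, indexing="ij"), axis=-1) * delta / h)
    n = len(y)
    total = 0.0
    for (b1, b2), c in zip(bins, counts):
        d1 = bins[:, 0] - b1
        d2 = bins[:, 1] - b2
        ok = (np.abs(d1) <= m) & (np.abs(d2) <= m)
        total += c * (counts[ok] * table[d1[ok] + m, d2[ok] + m]).sum()
    return total / (n * n * h * h)
```

The loop runs over occupied bins rather than observations, which is where the speed gain for large n comes from.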
3.2 Choosing the bandwidth

XGobi is a program that allows the user to choose the bandwidth for both indices interactively. Nevertheless XGobi has to provide a starting bandwidth. To choose this bandwidth we can use an "optimal" bandwidth for kernel density estimation. The approximated mean integrated squared error can be easily calculated following Scott (1992):

AMISE(\hat{f}_h) = \frac{1}{nh^2} \int_{\mathbb{R}^2} K^2(w) \, dw + \frac{h^4}{4} \int_{\mathbb{R}^2} tr^2(\nabla^2 f(w)) \, dw,

which leads to an "optimal" h of

h^6 = \frac{2 \int_{\mathbb{R}^2} K^2(w) \, dw}{n \int_{\mathbb{R}^2} tr^2(\nabla^2 f(w)) \, dw}.

Analogously to the rule-of-thumb in the univariate case and the rule of Scott for the multivariate case, we substitute the bivariate standard normal density for the unknown density and use the Triweight kernel to get a starting bandwidth:

h_{opt} = 3.12 \, n^{-1/6}.

The range of bandwidths the user can choose is 0.05 h_{opt} \le h \le 5 h_{opt}. It is well known that this "optimal" bandwidth oversmoothes if the underlying density is multimodal. It seems to be more interesting to explore lower bandwidths rather than higher ones (Polzehl & Klinke 1995).

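Numerically, the starting bandwidth and the selectable range are trivial to compute; this small sketch reproduces the rule-of-thumb column of Table 2.

```python
def rule_of_thumb(n):
    """h_opt = 3.12 * n^(-1/6): Triweight kernel, bivariate standard normal
    reference density."""
    return 3.12 * n ** (-1.0 / 6.0)

def bandwidth_range(n):
    """XGobi lets the user vary the bandwidth in [0.05*h_opt, 5*h_opt]."""
    h = rule_of_thumb(n)
    return 0.05 * h, 5.0 * h
```

For example, rule_of_thumb(100) is about 1.448, matching the rule-of-thumb entry for n = 100 in Table 2.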
3.3 Another approach for choosing the bandwidth for the Friedman-Tukey-Index

The choice of the bandwidth in the last section is very crude. Therefore we can try to improve the choice for the Friedman-Tukey-Index. We can calculate the Mean Squared Error (MSE) of the index and derive a better choice for the bandwidth. It holds

MSE(\hat{I}_{FT}) = \frac{C_0}{h^4} + \frac{C_1}{h^2} + C_2 + C_3 h^2 + O(h^4)

with constants C_0, ..., C_3 which can be found in Appendix A. To get an estimate for an "optimal" bandwidth we drop the O(h^4)-term, and to find the minimum of the Approximated MSE (AMSE) we compute the derivative

\frac{d}{dh} AMSE(\hat{I}_{FT}) = \frac{-4 C_0}{h^5} + \frac{-2 C_1}{h^3} + 2 C_3 h = 0

and multiply by h^5:

-4 C_0 - 2 C_1 h^2 + 2 C_3 h^6 = 0.

By replacing h^2 by g we get a reduced polynomial of degree 3, which can be solved by the formula of Cardano (Gellert, Kustner, Hellwich & Kastner 1977). The discriminant is always positive and converges to zero (n \to \infty). The positive discriminant allows only one unique real solution of the equation. We can compute a minimum by plugging in the Triweight kernel and the bivariate standard normal density for the unknown density. As a result we get Table 2. In principle the same could be done with the Entropy-Index, but calculations comparable to those in Appendix A would be much more complicated. We would have to replace the log-function by its Taylor expansion. Obviously a linear approximation is not sufficient if n (or h) varies. This can be seen from the following inequality:

\log\left(\frac{4}{\pi n h^2}\right) \le \hat{I}_E < \log\left(\frac{4}{\pi h^2}\right).

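Instead of Cardano's closed form, the same unique positive root of the reduced cubic in g = h² can be found numerically; the constants below are placeholders, not the actual C_i of Appendix A.

```python
import numpy as np

def mse_bandwidth(C0, C1, C3):
    """Solve -4*C0 - 2*C1*g + 2*C3*g^3 = 0 for g = h^2 and return h = sqrt(g).
    With C0, C1, C3 > 0 the cubic has exactly one positive real root."""
    roots = np.roots([2.0 * C3, 0.0, -2.0 * C1, -4.0 * C0])
    g = min(r.real for r in roots if abs(r.imag) < 1e-9 and r.real > 0)
    return float(np.sqrt(g))
```

Plugging the returned h back into the cubic confirms the root; only the constants, which depend on the kernel and the unknown density, change between data sets.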
4 Estimating derivatives of the kernel based indices

4.1 Calculating the derivatives

The derivatives of the indices are used in the optimization routine of XGobi. XGobi uses a steepest-ascent method to find a maximum. The derivative of the Friedman-Tukey-Index

           "optimal bandwidth"
n       rule-of-thumb   MSE-FT
2       2.77960         2.716855
5       2.38594         2.325666
10      2.12563         1.877361
20      1.89372         1.464121
50      1.62552         1.059059
100     1.44818         0.834545
200     1.29018         0.659888
500     1.10746         0.485088
1000    0.98663         0.384715
2000    0.87899         0.305229
5000    0.75450         0.224842

Table 2: Bandwidth computed by using the rule-of-thumb and by the use of the MSE

is

\hat{I}_{FT}(\alpha, \beta) = \frac{1}{n^2 h^2} \sum_{i=1}^{n} \sum_{j=1}^{n} K\left(\frac{\alpha^T(x_j - x_i)}{h}, \frac{\beta^T(x_j - x_i)}{h}\right),

\frac{\partial \hat{I}_{FT}(\alpha, \beta)}{\partial \alpha_m} = \frac{1}{n^2 h^3} \sum_{i=1}^{n} \sum_{j=1}^{n} (x_{j,m} - x_{i,m}) \frac{\partial K}{\partial y_1}\left(\frac{\alpha^T(x_j - x_i)}{h}, \frac{\beta^T(x_j - x_i)}{h}\right),    (2)

or a weighted partial derivative of a kernel density estimate,

\frac{\partial \hat{I}_{FT}(\alpha, \beta)}{\partial \alpha_m} = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial \hat{f}_{h,m}}{\partial y_1}(y_i)

and

\frac{\partial \hat{I}_{FT}(\alpha, \beta)}{\partial \beta_m} = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial \hat{f}_{h,m}}{\partial y_2}(y_i).

The Entropy-Index has a more complicated development,

\frac{\partial \hat{I}_E(\alpha, \beta)}{\partial \alpha_m} = \frac{1}{n} \sum_{i=1}^{n} \frac{\frac{1}{nh^3} \sum_{j=1}^{n} (x_{j,m} - x_{i,m}) \frac{\partial K}{\partial y_1}\left(\frac{\alpha^T(x_j - x_i)}{h}, \frac{\beta^T(x_j - x_i)}{h}\right)}{\frac{1}{nh^2} \sum_{j=1}^{n} K\left(\frac{\alpha^T(x_j - x_i)}{h}, \frac{\beta^T(x_j - x_i)}{h}\right)},

which simplifies to

\frac{\partial \hat{I}_E(\alpha, \beta)}{\partial \alpha_m} = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial \hat{f}_{h,m}}{\partial y_1}(y_i) \, / \, \hat{f}_h(y_i)

and

\frac{\partial \hat{I}_E(\alpha, \beta)}{\partial \beta_m} = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial \hat{f}_{h,m}}{\partial y_2}(y_i) \, / \, \hat{f}_h(y_i).

It can easily be seen that we mainly have to calculate some weighted derivatives of the kernel density estimate.

4.2 Choosing a bandwidth for the derivative

The main problem that occurs in estimating a derivative of a kernel density estimator is that the derivative becomes too wiggly if we use the same bandwidth as for the kernel estimate itself. Thus it seems reasonable to enlarge the bandwidth. The AMISE is

AMISE\left(\frac{\partial \hat{f}_h}{\partial y_l}\right) = \frac{1}{nh^4} \int_{\mathbb{R}^2} \left(\frac{\partial K(s)}{\partial y_l}\right)^2 ds + \frac{h^4}{36} \int_{\mathbb{R}^2} \left( \sum_{i,j,k=1}^{2} \frac{\partial^3 f(w)}{\partial y_i \partial y_j \partial y_k} \int_{\mathbb{R}^2} s_i s_j s_k \frac{\partial K(s)}{\partial y_l} \, ds \right)^2 dw,

and an "optimal" bandwidth

h^8 = \frac{36}{n} \int_{\mathbb{R}^2} \left(\frac{\partial K(s)}{\partial y_l}\right)^2 ds \left( \int_{\mathbb{R}^2} \left( \sum_{i,j,k=1}^{2} \frac{\partial^3 f(w)}{\partial y_i \partial y_j \partial y_k} \int_{\mathbb{R}^2} s_i s_j s_k \frac{\partial K(s)}{\partial y_l} \, ds \right)^2 dw \right)^{-1}.

Again we use the Triweight kernel and substitute the bivariate gaussian density for the unknown density to get the "optimal" bandwidth

h_{opt} = 2.91 \, n^{-1/8}.

Since XGobi allows an interactive change of the bandwidth for the kernel estimate, we choose a multiplicative factor for the bandwidth of the derivative, found by relating the optimal bandwidth for the derivative (above) to that for the index (Section 3.2),

m = 0.93 \, n^{1/24},

such that the bandwidth is

h_{der} = 0.93 \, n^{1/24} h,

where h is the bandwidth used in the index calculation.

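As a small sketch, the factor and the resulting derivative bandwidth are computed below; the check confirms that 0.93·n^(1/24) is (up to rounding) just the ratio of the two "optimal" rates 2.91·n^(-1/8) and 3.12·n^(-1/6).

```python
def derivative_bandwidth(n, h):
    """h_der = 0.93 * n^(1/24) * h, so the derivative is estimated with a
    slightly larger bandwidth than the index itself."""
    return 0.93 * n ** (1.0 / 24.0) * h
```

For any realistic n the factor exceeds 1, i.e. the derivative is always smoothed more than the index.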
5 Practical behaviour of binned kernel indices

5.1 Which distributions will be maximized by the kernel-based indices?

From the definition of both indices we can easily derive which distributions will maximize the estimators. The Friedman-Tukey-Index becomes maximal for fixed n if all datapoints are at the same location:

max(\hat{I}_{FT}) = \frac{1}{n^2 h^2} \sum_{i=1}^{n} \sum_{j=1}^{n} K(0).

Since we have the constraints E(X) = 0 and Cov(X) = I_p we have to move at least one datapoint, e.g. the first, away from the central point, and thus we get

max(\hat{I}_{FT}) \le \frac{1}{n^2 h^2} \left( K(0) + \sum_{i=2}^{n} \sum_{j=2}^{n} K(0) \right)
              = \frac{K(0)}{n^2 h^2} \left( 1 + (n-1)^2 \right)
              = \frac{K(0)}{n^2 h^2} \left( n^2 - 2n + 2 \right)
              = \frac{K(0)}{h^2} \left( 1 - 2/n + 2/n^2 \right).

A similar development can be done for the Entropy-Index. Table 3 shows the maxima if we plug in the Triweight kernel and the bandwidths used.

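This bound is easy to evaluate; with K(0) = 4/π for the Triweight kernel the sketch below reproduces the entries of Table 3.

```python
import numpy as np

def ft_upper_bound(n, h):
    """Upper bound K(0)/h^2 * (1 - 2/n + 2/n^2) on the Friedman-Tukey index,
    with K(0) = 4/pi for the Triweight kernel."""
    return (4.0 / np.pi) / (h * h) * (1.0 - 2.0 / n + 2.0 / (n * n))
```

For n = 50 and the rule-of-thumb bandwidth h_opt ≈ 1.62552 this gives about 0.46, and for 0.05·h_opt about 185.19, matching the n = 50 column of Table 3.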
Figure 1: Movement of datapoints when the Friedman-Tukey-Index is maximized. Each line connects a datapoint from its initial position in the gaussian sample to its position in the final distribution, which has most points at 0.

Fraction of h_opt   n = 20   n = 50   n = 100   n = 200
0.05                128.52   185.19   238.03    302.92
0.07                 66.04    95.16   122.32    155.67
0.10                 33.94    48.90    62.86     80.00
0.14                 17.44    25.13    32.30     41.11
0.19                  8.96    12.91    16.60     21.12
0.26                  4.60     6.63     8.53     10.85
0.37                  2.36     3.41     4.38      5.57
0.51                  1.21     1.75     2.25      2.86
0.72                  0.62     0.90     1.15      1.47
1.00                  0.32     0.46     0.59      0.75

Table 3: Maxima the Friedman-Tukey-Index can attain for different bandwidths and numbers of observations (h_opt is the rule-of-thumb bandwidth).

To see that this happens in practice, we wrote a program which tries to maximize the Friedman-Tukey-Index by moving the datapoints a little bit. A steepest-ascent method is used. This method produces a sequence of static plots showing the progressive movement of points towards 0. Figure 1 shows the movements for the bandwidths (n = 50) given in Table 3. Unfortunately, if the datapoints move too much and the distances become bigger than the bandwidth, the movement stops. This happens because we have used the derivatives to find the direction of the move.

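A sketch of such a point-moving program, using a numerical rather than analytic gradient for brevity (the step size and the toy data are hypothetical choices, not those of the original program):

```python
import numpy as np

def triweight(u):
    r2 = np.sum(u * u, axis=-1)
    return np.where(r2 < 1.0, (4.0 / np.pi) * (1.0 - r2) ** 3, 0.0)

def ft_index(y, h):
    n = len(y)
    return float(triweight((y[None] - y[:, None]) / h).sum()) / (n * n * h * h)

def ascent_step(y, h, step=0.01, eps=1e-6):
    """One steepest-ascent step on the datapoints themselves, using a
    forward-difference gradient of the index with respect to each coordinate."""
    grad = np.zeros_like(y)
    base = ft_index(y, h)
    for i in range(y.shape[0]):
        for k in range(y.shape[1]):
            y[i, k] += eps
            grad[i, k] = (ft_index(y, h) - base) / eps
            y[i, k] -= eps
    norm = np.linalg.norm(grad)
    return y if norm == 0.0 else y + step * grad / norm
```

Iterating this step drives the points towards a common centre, exactly the degenerate behaviour pictured in Figure 1; once pairwise distances exceed the bandwidth the gradient terms vanish and the movement stops.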
5.2 Speed improvement

The main reason for the use of the binned estimators was that they can be computed faster. To demonstrate this we use the Cube-9 dataset, which consists of 512 datapoints distributed at the vertices of a 9-dimensional cube. The upper and middle pictures in Figure 2 show the number of computed index values in a 1-minute run for the Natural-Hermite-Index (order 0) and the Friedman-Tukey-Index on an Alpha 3000/300. We can easily see that the Natural-Hermite-Index is faster by a factor bigger than 20. The lower picture shows the same for the Binned-Friedman-Tukey-Index. The index is still slower than the Natural-Hermite-Index(0), but now only by a factor of 1.4, and it should be noted that it may now be faster than the Natural-Hermite-Index with more terms. Now the index is watchable.

5.3 Best projections of the binned estimator

As a first approach to finding out whether the binned estimators behave nearly like the unbinned ones we have used the Bank dataset.

Figure 2: Each is a 1 minute segment on an Alpha 3000/300 (traces of the Hermite(0), Friedman-Tukey(0) and Binned Friedman-Tukey(0) indices).

Figure 3: XGobi run with Binned-Friedman-Tukey-Index for Bank data.

The Bank dataset consists of 200 datapoints and 6 variables, which measure note sizes. The first 100 observations are genuine Swiss banknotes and the second 100 are forged notes. It is known that the dataset separates well into 3 clusters: two large ones and a small one. Thus the Bank dataset provides us with three different interesting projections:

1. all three clusters are clearly visible,
2. the two large clusters overlap and the small one is well separated from them, and
3. the two large clusters are well separated and the small cluster is hidden by one of the large ones.

Of course not every projection can be found easily. The most often viewed projection is the second one, next the first one, and the most rarely seen is the last one. Figure 3 shows that with the binned estimators all three possibilities can be found. The binned indices behave, as expected, very similarly.

Another approach to examining the practical behaviour of the kernel-based indices is to examine simple data sets. For this purpose, two 4-dimensional data sets were generated, one containing "large" structure in the form of clusters (Figure 4), another with "fine" structure (Figure 6). The two-dimensional kernel-based indices were computed on a selected subset of all two-dimensional projections of the 4-dimensional data, generated from the interpolation

{(\alpha, \beta) : \alpha^T = (\cos\theta, 0, \sin\theta, 0), \beta^T = (0, \cos\phi, 0, \sin\phi), \theta, \phi \in (-\pi/2, \pi/2)}.

Each pair of angles (\theta, \phi) specifies a 2-dimensional projection whose index can be plotted as a function of (\theta, \phi) over the square (-\pi/2, \pi/2) \times (-\pi/2, \pi/2). Nine points in the grid, corresponding to the values -\pi/2, 0, +\pi/2 for \theta and \phi, are landmarks with simple interpretations: when |\theta| = |\phi| = \pi/2 the projection is of variables 3 and 4, whilst when \theta = 0, |\phi| = \pi/2 it corresponds to variables 1 and 4, and to variables 2 and 3 if |\theta| = \pi/2, \phi = 0. Two different bandwidths were used, the optimal one for each data set and a small bandwidth (h = 0.5). Figures 5 and 7 show the binned Friedman-Tukey and Entropy indices on the data.
The behaviour is not unexpected, as discussed in Polzehl & Klinke (1995): the small bandwidth is good for detecting the fine structure, while the optimal bandwidth gives a smoother index function which makes optimization easier. Only the binned indices are shown, as the results for the unbinned indices are indistinguishable from these.

5.4 Comparing binned and unbinned estimator

To compare both estimators we can use the method from section 5.1. For all projections we compute the relative error

\Delta = \frac{| \hat{I}_{binned} - \hat{I}_{unbinned} |}{| \hat{I}_{unbinned} |}.

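A sketch of this comparison for the Friedman-Tukey-Index, with the binned value obtained by moving each observation to its bin centre (binwidth δ = 0.01 as in XGobi):

```python
import numpy as np

def triweight(u):
    r2 = np.sum(u * u, axis=-1)
    return np.where(r2 < 1.0, (4.0 / np.pi) * (1.0 - r2) ** 3, 0.0)

def ft_index(y, h):
    n = len(y)
    return float(triweight((y[None] - y[:, None]) / h).sum()) / (n * n * h * h)

def relative_error(y, h, delta=0.01):
    """Delta = |I_binned - I_unbinned| / |I_unbinned| for one projection."""
    centres = (np.floor(y / delta) + 0.5) * delta
    i_u = ft_index(y, h)
    i_b = ft_index(centres, h)
    return abs(i_b - i_u) / abs(i_u)
```

On a Gaussian sample the error is of the order of a few percent at most, consistent with the magnitudes reported in Table 4.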
Figure 4: 4-dimensional data containing "large" structure in the form of clusters (pairwise scatterplots of X1, ..., X4).

Figure 5: Perspective plots of the binned Friedman-Tukey (top line) and binned Entropy (bottom line) indices on subsetted 2-dimensional projections of the 4-dimensional data set of Figure 4, using the small bandwidth (left) and the optimal bandwidth (right).



Figure 6: 4-dimensional data containing "fine" structure in the form of a spiral (pairwise scatterplots of X1, ..., X4).


Figure 7: Perspective plots of the binned Friedman-Tukey (top line) and binned Entropy (bottom line) indices on subsetted 2-dimensional projections of the 4-dimensional data set of Figure 6, using the small bandwidth (left) and the optimal bandwidth (right).


Tables 4 and 5 give the results for different bandwidths and numbers of datapoints. The tables show that for the Friedman-Tukey-index the maximal deviation is small. It seems that for the Entropy-index the deviation is sometimes very big, but we cannot confirm this result on the following real datasets.

5.5 Comparing derivatives of binned and unbinned estimator

XGobi uses a steepest-ascent method to maximize the index function. Thus the question arises whether we should use a binned estimator for calculating the direction of the steepest ascent. The following formula gives the derivative of the Friedman-Tukey-index with Triweight kernel, derived from formula (2):

\frac{\partial \hat{I}_{FT}(\alpha, \beta)}{\partial \alpha_m} = \frac{-24}{n^2 h^4 \pi} \sum_{i=1}^{n} \sum_{j=1}^{n} (x_{j,m} - x_{i,m}) \left( 1 - \left(\frac{\alpha^T(x_j - x_i)}{h}\right)^2 - \left(\frac{\beta^T(x_j - x_i)}{h}\right)^2 \right)^2 \alpha^T(x_j - x_i) \; I\!\left( \left(\frac{\alpha^T(x_j - x_i)}{h}\right)^2 + \left(\frac{\beta^T(x_j - x_i)}{h}\right)^2 < 1 \right).

Analogously to the index function, we could replace the data in the projection plane by binned data. But we can easily see that a small variation of the bandwidth may include or exclude datapoints. Of course the term (x_{i,m} - x_{j,m}) can be large, so we can get a sudden change of the partial derivative which determines the desired direction. Another problem arises as we approach the maximum. The derivative converges to zero, which means that all partial derivatives converge to zero. Then the elements of the sum have to balance each other, but in the case of the binned estimator the functions

\left( 1 - \left(\frac{\alpha^T(x_j - x_i)}{h}\right)^2 - \left(\frac{\beta^T(x_j - x_i)}{h}\right)^2 \right)^2 \alpha^T(x_j - x_i)

take only discrete values. Thus a balancing will be difficult. Figure 8 shows the length of \alpha_{unbinned}, transformed by \log_{10}(1 + |\alpha|), against \cos(\angle(\alpha_{binned}, \alpha_{unbinned})) for the Flea data. We can easily see serious deviations between both computed \alpha's even for a large length of \alpha. This is due to the first problem. But as the length of \alpha decreases, the deviation becomes much more of a problem. We even obtain an angle of 180 degrees between the binned

n     h      Projections   min I_unbinned   max I_unbinned   max Delta
20    0.09   295           7.10             9.23             0.02184
20    0.13   403           3.64             4.74             0.02193
20    0.18   652           1.87             2.81             0.01455
20    0.25   877           0.96             1.73             0.00966
20    0.35   1159          0.49             1.08             0.01129
20    0.50   1293          0.26             0.89             0.01989
20    0.69   1442          0.14             1.00             0.00990
20    0.97   1318          0.10             0.87             0.01167
20    1.35   1471          0.08             0.50             0.00554
20    1.89   1806          0.06             0.28             0.00756
50    0.08   226           3.85             4.62             0.01372
50    0.11   381           1.98             2.69             0.01521
50    0.16   483           1.02             1.75             0.01077
50    0.22   530           0.54             1.63             0.02488
50    0.31   651           0.30             1.03             0.00976
50    0.43   758           0.18             0.85             0.00800
50    0.60   824           0.12             0.99             0.00809
50    0.84   972           0.09             1.12             0.01038
50    1.17   1154          0.07             0.73             0.00776
50    1.63   1994          0.06             0.42             0.00569
100   0.07   244           2.42             2.96             0.01484
100   0.10   271           1.27             1.69             0.02308
100   0.14   314           0.67             1.10             0.02727
100   0.19   331           0.38             0.85             0.01344
100   0.27   463           0.22             0.81             0.01817
100   0.38   536           0.14             0.96             0.02490
100   0.53   695           0.10             0.91             0.01864
100   0.74   867           0.08             1.25             0.01368
100   1.03   1448          0.07             0.95             0.00429
100   1.44   2875          0.06             0.52             0.00309
200   0.06   160           1.58             2.08             0.01519
200   0.08   231           0.85             1.28             0.00874
200   0.12   218           0.47             0.87             0.01802
200   0.17   254           0.28             0.70             0.02107
200   0.24   459           0.18             0.71             0.02471
200   0.34   417           0.12             0.63             0.03612
200   0.47   345           0.10             0.89             0.01552
200   0.66   2046          0.08             2.00             0.01816
200   0.92   1446          0.07             1.22             0.01049
200   1.29   1774          0.07             0.66             0.01417

Table 4: Maximal relative error for the binned Friedman-Tukey-index for different bandwidths and numbers of observations.

n     h      Projections   min I_unbinned   max I_unbinned   max Delta
20    0.09   279            1.96             2.12            0.00803
20    0.13   388            1.29             1.50            0.01207
20    0.18   654            0.62             0.93            0.01617
20    0.25   924           -0.03             0.37            16.73580
20    0.35   1209          -0.70             0.04            0.01709
20    0.50   1276          -1.33            -0.27            0.01960
20    0.69   1420          -1.92            -0.53            0.00565
20    0.97   1352          -2.31            -0.48            0.00775
20    1.35   1524          -2.57            -0.95            0.00230
20    1.89   1826          -2.77            -1.42            0.00261
50    0.08   218            1.34             1.48            0.00742
50    0.11   389            0.68             0.89            0.01370
50    0.15   491            0.02             0.39            0.13915
50    0.22   672           -0.61            -0.11            0.03604
50    0.30   691           -1.21            -0.44            0.00727
50    0.42   739           -1.72            -0.62            0.00417
50    0.59   843           -2.14            -0.33            0.00408
50    0.83   922           -2.47            -0.27            0.00714
50    1.16   1180          -2.65            -0.58            0.00368
50    1.62   1858          -2.78            -1.02            0.00205
100   0.07   231            0.88             1.01            0.01126
100   0.10   277            0.23             0.41            0.04992
100   0.14   364           -0.39            -0.02            0.08339
100   0.19   333           -0.98            -0.45            0.01128
100   0.27   454           -1.51            -0.68            0.01125
100   0.38   520           -1.97            -0.71            0.01142
100   0.53   656           -2.33            -0.59            0.00781
100   0.74   929           -2.57            -0.26            0.00859
100   1.03   1167          -2.70            -0.32            0.00226
100   1.44   2149          -2.79            -1.02            0.00124
200   0.06   168            0.45             0.64            0.02126
200   0.08   221           -0.17             0.09            2.02793
200   0.12   203           -0.77            -0.35            0.02101
200   0.17   296           -1.32            -0.65            0.01378
200   0.24   520           -1.79            -0.79            0.01254
200   0.34   491           -2.17            -0.82            0.01535
200   0.47   415           -2.43            -0.66            0.00690
200   0.66   844           -2.60            -0.31            0.00820
200   0.92   1127          -2.71            -0.63            0.00450
200   1.29   1452          -2.78            -0.68            0.00574

Table 5: Maximal relative error for the binned Entropy-index for different bandwidths and numbers of observations.

and unbinned \alpha. That means the binned and the unbinned estimator compute two totally different directions.

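The quantities plotted in Figure 8 can be computed as follows (a sketch; the α vectors themselves would come from the binned and unbinned gradient computations):

```python
import numpy as np

def gradient_agreement(a_binned, a_unbinned):
    """Return (log10(1 + |a_unbinned|), cos of the angle between the two
    computed steepest-ascent directions)."""
    la = float(np.linalg.norm(a_unbinned))
    cos = float(a_binned @ a_unbinned) / (float(np.linalg.norm(a_binned)) * la)
    return np.log10(1.0 + la), cos
```

A cosine of -1 corresponds to the 180-degree disagreement mentioned above; values near +1 mean both estimators agree on the ascent direction.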
Figure 8: Angle of the binned and unbinned estimator for the direction of \alpha on the Flea data (vertical axis: cos(alpha); horizontal axis: log(1 + len(alpha))).

A Mean Squared Error of the Friedman-Tukey-Index

Under the assumption that the X_i are independent and identically distributed (i.i.d.), that the unknown density is sufficiently often differentiable, and with the following definitions,

K_{i,j}^{(l)} = \int_{\mathbb{R}^2} s_1^i s_2^j K(s_1, s_2)^l \, ds, which is non-zero if i and j are even and 0 otherwise,

I_j^{(i)} = \int_{\mathbb{R}^2} f(x_1, x_2)^i \left( \frac{\partial^j f}{\partial x_1^j}(x_1, x_2) + \frac{\partial^j f}{\partial x_2^j}(x_1, x_2) \right) dx_1 dx_2,

J = \int_{\mathbb{R}^2} f(x_1, x_2) f_{x_1 x_1 x_2 x_2}(x_1, x_2) \, dx_1 dx_2,

M_i = K_{4,0}^{(i)} I_4^{(1)} + 6 K_{2,2}^{(i)} J,

we can start to calculate the MSE of the Friedman-Tukey-Index. First we transform the Friedman-Tukey-Index by

\hat{I}_h = \frac{1}{n^2 h^2} \sum_{i=1}^{n} \sum_{j=1}^{n} K\left(\frac{X_i - X_j}{h}\right)
        = \frac{1}{n^2 h^2} \left( n K(0, 0) + 2 \sum_{i,j=1;\, i<j}^{n} K\left(\frac{X_i - X_j}{h}\right) \right).
