Construction of Line Densities for Parallel Coordinate Plots [1]

John J. Miller and Edward J. Wegman
Center for Computational Statistics, George Mason University, Fairfax, VA 22030
ABSTRACT

The graphical representation of high dimensional data may be accomplished by using the parallel coordinate plotting system. In such a representation, points in Euclidean n-space are mapped into line segments in Euclidean 2-space by a projective transformation. This representation preserves much of the geometric structure that is found in hyperspace but is not easily representable by other methods. In large data set applications, the graphical displays may be heavily overplotted, like ordinary scatterplots, yielding uninterpretable displays. In this paper, we suggest replacing the raw data display with a density plot. In order to define densities for lines sensibly, we introduce the notion of line densities and develop their basic construction. We illustrate with theoretical parallel coordinate density plots for the normal and the uniform cases. Finally, we illustrate sample density plots with 4-dimensional spheres.

1. Introduction

The graphical display of high dimensional data is a challenging and difficult task. A number of interesting techniques, including rotating three-dimensional scatterplots, the scatterplot matrix with brushing, the grand tour, and other related graphical and computational methodologies, have been invented with the ambition of seeing structural relationships among random variables in three, four, and more dimensions. One such technique is the parallel coordinate display. This has been exposited by Inselberg (1985) as a tool for computational geometry and by Bolorforoush and Wegman (1988) and Wegman (1990) as a tool for exploratory data analysis. Recognizing the basic weakness of Cartesian plots as an inability to have more than three orthogonal coordinate axes, the parallel coordinate idea is to give up orthogonality in favor of a system of parallel coordinate axes drawn in the 2-plane. Of course, as many parallel axes as one likes can be drawn in the 2-plane, so that there is no inherent dimensionality limit.
The idea then is to label each axis with an appropriate variable name and to connect points on adjacent axes with straight line segments. Figure 1.1 illustrates the basic idea.

[1] This research was supported by the Army Research Office under contract number DAAL03-87-K-0087, by the Office of Naval Research under contract number N00014-J-89-1807, and by the National Science Foundation under grant number DMS-8701931.

Parallel coordinate plots draw their basic utility from the fact that the mapping from Cartesian space to parallel coordinate space is a projective transformation. The implication of this is that geometric structures in Cartesian space are mapped into geometric structures in parallel coordinate space, in particular, structures which carry statistical interpretations. Wegman (1990) details some of these interpretations.

The drawback of both ordinary scatter diagrams and their multidimensional parallel coordinate analogues is that they begin to lose their effectiveness as the size of the data set becomes large. Scott (1985, 1986) has suggested replacing the scatterplot by a density estimate. This device allows for visualization of structure where only heavy overplotting would have occurred with scatterplots, so that density estimation is an important adjunct to the data analysis process. We propose in this paper using a density plot for parallel coordinates in much the same way that Scott has suggested for ordinary scatter plots. However, for parallel coordinate plots a somewhat different form of density is required. This is true because the limiting asymptotics appropriate for points in an infinitesimal box are inappropriate for lines crossing through an infinitesimal box. We call the limiting form a line density.

In this paper we discuss the properties of line densities for lines constructed in parallel coordinate plots. We will show in Section 2 how a "theoretical form" of the line density should be defined. By this, we mean the density arising when the variables being plotted have a joint probability distribution. Although the primary use of parallel coordinate plots is in the display of sample data, it is extremely important to understand the properties of sample density plots in terms of the theoretical form of the density.
We study these theoretical density plots to understand how the process of calculating densities operates and what the properties of such densities are. We will see that several properties of theoretical densities have implications for the most effective method of scaling the data to use in creating sample densities. Furthermore, we study theoretical densities so as to be able to better recognize and identify patterns in sample densities. Section 3 contains examples of theoretical line density plots for some common bivariate distributions. An extensive treatment is given to the situation that the plotted variables have a bivariate normal distribution. The case of points uniformly distributed on a square is also illustrated.

We will discuss the case of a line density plot for the parallel coordinate plot of two variables, $u_1$ and $u_2$. The parallel axes will be set at a distance $d$ apart. For ease of exposition, we will superimpose a Cartesian $(x, y)$ coordinate system on the parallel plot, where the $u_1$ axis is given by the line $y = 0$ and the $u_2$ axis is given by the line $y = d$. Then the point $(u_1, u_2)$ in the original Cartesian coordinates is represented by the line connecting the points $(u_1, 0)$ and $(u_2, d)$ in the parallel coordinate plot. The equation of this line in $(x, y)$ coordinates is easily seen to be $(x - u_1)/(u_2 - u_1) = y/d$, or $x = (1 - p)u_1 + p u_2$, where $p = y/d$. This equation shows that the $y$ coordinate is most conveniently described by the proportion $p = y/d$; hence, a generic point $(x, y)$ is best represented by recoding $y = pd$. This will mean that densities may be described in terms of $x$ and $p$ rather than $x$ and $y$. The actual value of $d$ used in a parallel coordinate plot is unimportant as far as these densities are concerned. Hereafter we will refer to points in their $(x, y)$ coordinates by $(x, p)$.

2. Theoretical Line Densities

By a theoretical form, we mean the theoretical form of the density in parallel coordinates appropriate when $U_1$ and $U_2$ are random variables with a bivariate distribution. We will restrict our attention to the case where $U_1$ and $U_2$ have a continuous bivariate distribution, since discrete $U_1$ and $U_2$ yield density plots which simply mimic ordinary parallel coordinate plots.

A density for such a distribution is typically defined by some sort of limiting argument. For points in two dimensions $(u, v)$, the natural method for defining the density at a point $(u, v)$ is to set up a rectangle with $u$ extending to $u + \Delta u$ and $v$ to $v + \Delta v$. The probability mass within this rectangle is then computed and divided by the area of the rectangle, $\Delta u\,\Delta v$, with the limit taken as $\Delta u$ and $\Delta v$ approach zero. When the random variables $U$ and $V$ have a continuous bivariate distribution, this argument will yield precisely the probability density function in Cartesian coordinates.

One might presume that the proper method to define a line density is to set up the same sort of rectangle, establish the probability for the points $(u_1, u_2)$ whose lines pass through it, divide by the area of the rectangle, and take limits. Using rectangles will, in fact, not work, as is shown by the following argument. Consider the point $(x, p)$ and the rectangle defined by extending $x$ to $x + \Delta x$ and $p$ to $p + \Delta p$. The set of all points $(u_1, u_2)$ whose lines pass through this rectangle is a region formed by the following four lines in $(u_1, u_2)$ (see Figure 2.1):

$$
\begin{aligned}
x &= (1 - p)u_1 + p u_2, \\
x + \Delta x &= (1 - p)u_1 + p u_2, \\
x &= (1 - (p + \Delta p))u_1 + (p + \Delta p)u_2, \\
x + \Delta x &= (1 - (p + \Delta p))u_1 + (p + \Delta p)u_2.
\end{aligned}
$$

These lines define Regions I, II, and III. The probability contained in Region I for the random variables $U_1$ and $U_2$ is $P\{x \le (1 - p)U_1 + pU_2 \le x + \Delta x\}$, which does not depend on $\Delta p$. Define the random variable $V_p = (1 - p)U_1 + pU_2$. This probability is then $P\{x \le V_p \le x + \Delta x\}$. If $V_p$ has a continuous distribution, then when this probability is divided by $\Delta x$ and the limit taken as $\Delta x$ approaches 0, the limit will be $g_p(x)$, where $g_p(v)$ is the density function for $V_p$. If $V_p$ has a mass point at $x$, then the limit will be $+\infty$. In any case, the limit will be $+\infty$ when the probability is divided by $\Delta x\,\Delta p$ and both approach 0. Since the probability associated with Regions II and III is nonnegative and cannot reduce the probability in Region I, the attempt to define a density using rectangles will always result in a limit of $+\infty$.

Although an attempt to define a line density by considering rectangles will not be successful, the correct method can be seen by examining Region I in Figure 2.1. This region encompasses all points $(u_1, u_2)$ represented by lines passing through the line segment from $(x, p)$ to $(x + \Delta x, p)$. The limit of the probability contained in this region divided by $\Delta x$ will be $g_p(x)$, if $V_p$ as defined above has a continuous distribution. (If $V_p$ does not have a continuous distribution, $g_p(x)$ could be considered as a probability mass function instead of a density and all arguments remain true.) Let us then define the line density in the following manner:

$$
f(x, p) = \lim_{\Delta x \to 0} \frac{P\{x \le V_p \le x + \Delta x\}}{\Delta x}. \qquad (1)
$$

We can now state the following result:

Proposition 1: Let $U_1$ and $U_2$ have a bivariate distribution such that $V_p = (1 - p)U_1 + pU_2$ has a continuous distribution for all $0 \le p \le 1$. The line density (defined by (1) above) at the point $(x, p)$ of a parallel coordinate plot for $U_1$ and $U_2$ is given by $g_p(x)$, where $g_p(\cdot)$ is the density of the random variable $V_p$. The density so defined is a density over $0 \le p \le 1$, $-\infty < x < +\infty$ in the sense that $\int_0^1 \int_{-\infty}^{\infty} f(x, p)\,dx\,dp = 1$.

Proof: We have already noted that the line density defined by (1) will equal $g_p(x)$ under the conditions of the proposition. The second statement is proved by noting that $\int_{-\infty}^{\infty} f(x, p)\,dx = \int_{-\infty}^{\infty} g_p(x)\,dx = 1$ for all $0 \le p \le 1$. Hence, $\int_0^1 \int_{-\infty}^{\infty} f(x, p)\,dx\,dp = \int_0^1 1\,dp = 1$.

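Proposition 1 suggests a direct way to approximate a line density from sample data: at a fixed level $p$, compute the crossing point $x = (1 - p)u_1 + p u_2$ of each observation's line and estimate the density of those crossings. The following is a minimal Python sketch of that idea, not the authors' implementation. It uses a plain fixed-width histogram rather than any particular smoother, and the function names `line_x` and `sample_line_density` are illustrative assumptions.

```python
import math
import random


def line_x(u1, u2, p):
    """x-coordinate at which the line for the point (u1, u2) crosses level p.

    p = 0 is the u1 axis and p = 1 is the u2 axis.
    """
    return (1.0 - p) * u1 + p * u2


def sample_line_density(points, p, x_min, x_max, n_bins):
    """Histogram estimate of f(x, p), the density of V_p = (1-p)U1 + pU2.

    Each line crosses the horizontal level p exactly once, so every
    observation lands in at most one bin (no double counting).
    Returns one density value per bin over [x_min, x_max).
    """
    width = (x_max - x_min) / n_bins
    counts = [0] * n_bins
    for u1, u2 in points:
        k = math.floor((line_x(u1, u2, p) - x_min) / width)
        if 0 <= k < n_bins:
            counts[k] += 1
    n = len(points)
    return [c / (n * width) for c in counts]


def demo(seed=0):
    """Estimate the line density at p = 0.3 for points uniform on the unit square."""
    rng = random.Random(seed)
    pts = [(rng.random(), rng.random()) for _ in range(2000)]
    return sample_line_density(pts, 0.3, -0.5, 1.5, 40)
```

Repeating the estimate over a grid of $p$ values between 0 and 1 gives a two-dimensional array that can be contoured in the $(x, p)$ plane. Provided the bin range covers all crossings, each row integrates to 1 over $x$, which mirrors the normalization in Proposition 1.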
There is a practical motivation for defining a density in terms of lines passing through a line segment rather than lines through a rectangle, which derives from counting considerations. Since the line segment from $(x, p)$ to $(x + \Delta x, p)$ is parallel to the parallel axes, the line representing a point $(u_1, u_2)$ will cross it in exactly one position. However, a line representing $(u_1, u_2)$ could, in principle, cross several rectangles with the same $p$ but with different values of $x$, thus resulting in a form of double counting. Using the definition of density we have chosen avoids any possibility of this form of double counting.

3. Theoretical Density Plots: Examples

3.1 Bivariate Normal Distributions

We study line density plots for bivariate normal distributions. There are several reasons for this. First, this is an important family of distributions and is thoroughly understood in Cartesian coordinates. In many ways the normal is the assumed standard. Thus, to detect deviations from it we should have a template against which to measure, and so we must understand its display in parallel coordinates. Second, the form of the density is easily tractable and results in parallel coordinates may be obtained in closed form. Finally, the results obtained by consideration of these distributions have important implications for how data should be scaled and plotted for sample density plots.

Consider the case where $U_1$ and $U_2$ have a bivariate normal distribution with means, standard deviations, and correlation $\mu_1$, $\mu_2$, $\sigma_1$, $\sigma_2$, and $\rho$. It is an easy calculation to see that the distribution of $V_p$ is then normal with mean $\mu_p = (1 - p)\mu_1 + p\mu_2$ and variance $\sigma_p^2 = (1 - p)^2\sigma_1^2 + p^2\sigma_2^2 + 2p(1 - p)\rho\sigma_1\sigma_2$. The line density for a parallel coordinate plot of a bivariate normal distribution is given by

$$
f(x, p) = \frac{\exp\{-(x - \mu_p)^2 / 2\sigma_p^2\}}{(2\pi\sigma_p^2)^{1/2}}. \qquad (2)
$$

We may observe several things about $f(x, p)$. First, the maximum of $f(x, p)$ over $x$ occurs at $\mu_p$ for any value of $p$, yielding a maximum equal to $f(\mu_p, p) = 1/(2\pi\sigma_p^2)^{1/2}$. Clearly this expression will have a maximum over $p$ when $\sigma_p^2$ has a minimum. We find that the derivative of $\sigma_p^2$ is given by

$$
\frac{d\sigma_p^2}{dp} = -2(\sigma_1^2 - \rho\sigma_1\sigma_2) + 2p(\sigma_1^2 + \sigma_2^2 - 2\rho\sigma_1\sigma_2), \qquad (3)
$$

with second derivative $2(\sigma_1^2 + \sigma_2^2 - 2\rho\sigma_1\sigma_2) > 0$. If we set the right hand side of (3) equal to zero and solve, we find that the minimum for $\sigma_p^2$ and, hence, the maximum for $f(x, p)$ occurs at $\hat p(\sigma_1, \sigma_2, \rho) = (\sigma_1^2 - \rho\sigma_1\sigma_2)/(\sigma_1^2 + \sigma_2^2 - 2\rho\sigma_1\sigma_2)$. We note that $\hat p(\sigma_1, \sigma_2, \rho) = 1 - \hat p(\sigma_2, \sigma_1, \rho)$, confirming what we know must be true from symmetry considerations. We further note that for certain combinations of parameters, $\hat p$ can be less than 0 or greater than 1. In particular, $\hat p < 0$ when $\rho > \sigma_1/\sigma_2$ and $\hat p > 1$ when $\rho > \sigma_2/\sigma_1$. This has implications for scaling as we plot these densities, as we shall see below.

If we wish to establish contours for $f(x, p)$, we may set $f(x, p) = h$ and solve for an equation in $x$ and $p$, obtaining $(x - \mu_p)^2 = -\sigma_p^2 \ln(2\pi\sigma_p^2 h^2)$. Note that if $\mu_p = 0$ for all $p$ (i.e. $\mu_1 = \mu_2 = 0$), then this equation becomes $x^2 = -\sigma_p^2 \ln(2\pi\sigma_p^2 h^2)$, in which case $dx^2/dp = (x^2 - \sigma_p^2)(1/\sigma_p^2)(d\sigma_p^2/dp)$. In the special case where both marginal distributions are standard normals (i.e. $\mu_1 = \mu_2 = 0$, $\sigma_1 = \sigma_2 = 1$), $\mu_p = 0$ for all $p$, $\hat p = 0.5$, $\sigma_p^2 = 1 - 2p(1 - p)(1 - \rho)$, and $d\sigma_p^2/dp = -2(1 - 2p)(1 - \rho)$.

In all that follows, we shall refer to a bivariate normal distribution with parameters $\mu_1$, $\mu_2$, $\sigma_1$, $\sigma_2$, and $\rho$ by BVN($\mu_1$, $\mu_2$, $\sigma_1$, $\sigma_2$, $\rho$). Figures 3.1 and 3.2 show perspective plots for BVN(0, 0, 1, 1, 0.4) and BVN(0, 0, 1, 1, -0.9) respectively. These perspective plots give an interesting view of the densities. In particular, we see how the crossing patterns of lines arising from negative correlation lead to a surface of high curvature. We will, however, concentrate on contour plots since that is how the densities would most likely be displayed for sample data. Figures 3.3 through 3.7 are contour plots for BVN(0, 0, 1, 1, $\rho$), where $\rho$ = 0.9, 0.4, 0.0, -0.4, and -0.9, respectively. We note the relative straightness of the contours for $\rho = 0.9$ and the curves for the contours for $\rho = -0.9$, with the increasing curvature for intermediate values of $\rho$. The value of $\sigma_p^2$ decreases as $p$ goes from 0 to 0.5 and then increases symmetrically.
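The closed forms above are easy to evaluate numerically. The sketch below, offered only as an illustration, codes the variance $\sigma_p^2$, the line density (2), and the location $\hat p$ of the maximum; the function names are assumptions introduced here and do not come from the paper.

```python
import math


def sigma2_p(p, s1, s2, rho):
    """Variance of V_p = (1-p)U1 + pU2 for a bivariate normal (U1, U2)."""
    return (1 - p) ** 2 * s1 ** 2 + p ** 2 * s2 ** 2 + 2 * p * (1 - p) * rho * s1 * s2


def bvn_line_density(x, p, m1, m2, s1, s2, rho):
    """Line density f(x, p) of equation (2) for BVN(m1, m2, s1, s2, rho)."""
    mu = (1 - p) * m1 + p * m2
    var = sigma2_p(p, s1, s2, rho)
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def p_hat(s1, s2, rho):
    """Value of p minimizing sigma2_p, i.e. the zero of equation (3).

    Undefined when s1 == s2 and rho == 1 (the denominator vanishes).
    """
    return (s1 ** 2 - rho * s1 * s2) / (s1 ** 2 + s2 ** 2 - 2 * rho * s1 * s2)
```

For example, `p_hat(0.6, 1.2, 0.4)` evaluates to roughly 0.059, so for those standard deviations the ridge of maximum density sits very close to the first axis.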
Observing $dx^2/dp$ and $d\sigma_p^2/dp$, we see that for values of $\rho$ for which $\sigma_p^2$ decreases quickly enough, $x^2$ can increase fast enough so that its derivative changes sign, leading to the figure 8 shapes for the contours observed in Figures 3.6 and 3.7. In studying Figures 3.3 through 3.7, we also note that negative correlations lead to more dramatic and easily observed features than do positive correlations. In particular, it is easier to see the difference between Figures 3.6 and 3.7 than between Figures 3.3 and 3.4. In density plots involving sample data, where the contours are blurred, this difficulty would be exacerbated. This illustrates that a necessary device for any system for plotting parallel coordinates is an axis reflection command. Reflecting one of the parallel axes would reverse the sign of the correlation. If this can be done dynamically, it dramatically adds to the analyst's ability to study the correlation structure of a data set in detail. Data with stronger correlation exhibit more dramatic pattern changes when the sign is changed than data with weaker correlation.

To further study the densities for bivariate normal distributions, we investigated the effect of changing the means and standard deviations on the plots. Figure 3.8 is for BVN(-1, 1, 1, 1, -0.9) and illustrates the effect of having different means on the two axes. The density maintains its basic shape but is distorted by the translation of the axes. There is no reason why an analyst should be forced to use such plots, so a preplot scaling which equates or nearly equates the means for the axes is important. Figures 3.9 and 3.10 are for BVN(0, 0, 0.6, 1.2, 0.4) and BVN(0, 0, 0.6, 1.2, -0.9) respectively. Figure 3.9 illustrates the fact that $\hat p$, the point of maximum density, can occur for a point which is outside or nearly outside the range $0 \le p \le 1$; $\hat p$ here is 0.059. This picture is quite distorted and does not look very much like the previous densities. Figure 3.10, while different from Figure 3.7, is of the same basic form. This reflects the fact that $\hat p$ cannot be outside [0,1] for $\rho < 0$; see the inequalities above which describe when $\hat p$ will be outside. Again the potential gain of an axis reflection is noted. Furthermore, preplot scaling which equates or nearly equates the standard deviations will be extremely useful. For well behaved data, something as simple as range matching will work.

By studying the densities for bivariate normal distributions, we have illustrated several important principles of plotting for parallel coordinates. We have seen that it is useful to be able to reflect axes and to scale the data on the axes to have consistent means and standard deviations.

3.2 Uniform Distributions

Uniform distributions interest us because they are useful examples of "short tailed" distributions and because when we study the properties of pseudo random number generators, we will use diverse plots of pseudo uniform random numbers.
We will study the line density for a parallel coordinate plot for independent uniform random variables, each on [0,1]; that is, we will study the parallel plot line density for $(U_1, U_2)$ uniformly distributed on the unit square. In contrast to the bivariate normal situation, there is no easy closed form for the distribution of $V_p$ as defined in Section 2. Hence, we will use a geometric method to solve for the density. First, we note that by Proposition 1, the parallel coordinate line density will be equal to $g_p(\cdot)$, the density of $V_p$. In this case, it is easier to calculate $G_p(\cdot)$, the distribution function of $V_p$, and differentiate. We shall use this method to prove Proposition 2.

Proposition 2: Let $U_1$ and $U_2$ have independent uniform distributions on [0,1]. Consider the four Regions shown in Figure 3.11. The line density for the parallel coordinate plot of $U_1$ versus $U_2$ is given by the following formulae:

$$
f(x, p) =
\begin{cases}
x/[p(1 - p)] & \text{for } x \le p,\ x \le 1 - p \quad \text{(Region I)} \\
1/p & \text{for } x \le p,\ x \ge 1 - p \quad \text{(Region II)} \\
(1 - x)/[p(1 - p)] & \text{for } x \ge p,\ x \ge 1 - p \quad \text{(Region III)} \\
1/(1 - p) & \text{for } x \ge p,\ x \le 1 - p \quad \text{(Region IV)}
\end{cases}
$$

Proof: We will derive $G_p(v)$ for each region and take its derivative, $g_p(v)$; then by Proposition 1, $f(x, p) = g_p(x)$. Note that the intercepts on the $u_1$ and $u_2$ axes for the line $(1 - p)u_1 + p u_2 = v$ are respectively $v/(1 - p)$ and $v/p$. Now consider $(v, p)$ in Region I. This corresponds to a situation as shown in Figure 3.12. The probability $P\{V_p \le v\}$ is given by the shaded area in that figure and is easily seen to be $v^2/[2p(1 - p)]$. The derivative of this expression is $v/[p(1 - p)]$. Now consider $(v, p)$ in Region II. This corresponds to a situation as shown in Figure 3.13. The probability $P\{V_p \le v\}$ is again given by the shaded area and is found as follows. The area of the shaded rectangle is $(v - (1 - p))/p$ and the area of the shaded triangle is $[(v/p) - (v - (1 - p))/p]/2$. Adding these areas, we obtain $v/p - (1 - p)/2p$. The derivative of this expression is $1/p$. Symmetry considerations easily yield the given expressions for Regions III and IV.
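As a check on Proposition 2, the four-region formula can be coded directly and integrated numerically over $x$. This is only an illustrative sketch: the function names are assumptions, and returning the marginal uniform density at the endpoints $p = 0$ and $p = 1$ (where $V_p$ reduces to $U_1$ or $U_2$) is a convention chosen here, not something stated in the proposition.

```python
def uniform_line_density(x, p):
    """Line density f(x, p) for (U1, U2) uniform on the unit square."""
    if not (0.0 <= x <= 1.0 and 0.0 <= p <= 1.0):
        return 0.0
    if p == 0.0 or p == 1.0:
        return 1.0  # assumed convention: V_p is U1 or U2 itself, uniform on [0, 1]
    if x <= p and x <= 1.0 - p:
        return x / (p * (1.0 - p))          # Region I
    if x >= p and x >= 1.0 - p:
        return (1.0 - x) / (p * (1.0 - p))  # Region III
    if x <= p:
        return 1.0 / p                      # Region II: 1 - p < x <= p
    return 1.0 / (1.0 - p)                  # Region IV: p < x < 1 - p


def integrate_over_x(p, n_steps=10000):
    """Midpoint-rule approximation of the integral of f(x, p) over 0 <= x <= 1."""
    h = 1.0 / n_steps
    return sum(uniform_line_density((i + 0.5) * h, p) for i in range(n_steps)) * h
```

Calling `integrate_over_x(p)` for several values of `p` should return values close to 1, and `uniform_line_density(0.5, 0.5)` gives 2.0, the largest value the formula can take.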
It should be noted that we may easily verify that $g_p(\cdot)$ defined above is indeed a density which satisfies $\int g_p(v)\,dv = 1$ for all $0 \le p \le 1$.

Attention may now be turned to the contours of $f(x, p)$. The maximum of $f(x, p)$ is 2 and occurs when $x = p = 1/2$. The minimum is 0, which occurs when $x = 0$ or 1. Setting $f(x, p) = h$, for $0 \le h \le 2$, yields the following four expressions. The equations describing the contours are $x = hp(1 - p)$ in Region I, $p = 1/h$ in Region II, $x = 1 - hp(1 - p)$ in Region III, and $p = 1 - 1/h$ in Region IV. We may easily show that these contours are continuous on the boundaries of the regions. Figure 3.14 shows a perspective plot of the density $f(x, p)$ and Figure 3.15 shows a contour plot. We note the rising shelf in Regions II and IV and the rapidly descending density in Regions I and III.

4. Finding Holes in Hyperspheres

We have demonstrated in this paper a definition and several methodologies for computing parallel coordinate densities. The intent of this exercise is to design a data analytic tool for detecting structure in data sets which are simultaneously high dimensional and large in scale. We believe parallel coordinates is admirably suited to use in multiple dimensions, as has been argued elsewhere (see Wegman, 1990). But as with other multivariate plotting tools, the plots become overwhelming when the data sets become large. Density plots alleviate the confusing images by focusing on general structure as opposed to individual observations.

To illustrate our point we choose to consider an example which is a notoriously difficult problem, the problem of finding holes in multidimensional spheres. Because of complete symmetry of spheres of all dimensions, it is a difficult challenge to find holes simply by looking at any lower dimensional projection (be it 2 or 3 dimensional).
Slicing helps in three dimensions, but is not a particularly useful tool in four and higher dimensions because the 3-dimensional projections of 4-dimensional spheres with holes will not have holes. Figure 4.1 illustrates a parallel coordinate line density plot for a four-dimensional sphere of unit radius. Figure 4.2 illustrates the parallel coordinate line density plot of the same four-dimensional sphere but with a center section of radius 0.5 removed. Notice that there is a marked squaring of the contours for the sphere with a hole and a noticeable deficiency in mass in the middle. The sphere with a hole actually has a much flatter density and the sphere with no hole has a line density plot which is markedly more peaked.

Figure 4.3 represents the ratio of these two theoretical parallel coordinate plots. Figure 4.3 is especially significant for the reason that the ratio essentially creates two rather sharply divided regions. Thus, while the density plots themselves may be comparatively difficult to estimate, their ratio should be comparatively easy to estimate. The idea of comparing ratios is reminiscent of the idea of the Q-Q plot, comparing sampled data to a theoretical template. In our case the yellow region represents a mass deficiency in the central region, which, of course, is exactly what we would expect with a 4-dimensional hole. In fact, from the hyperbolic edges of the yellow region, we may conclude that the yellow region represents a 4-dimensional sphere, so that we are literally seeing the hole in Figure 4.3.

Figures 4.4 and 4.5 are sampled density plots for a 4-sphere with no hole and a 4-sphere with a hole of radius 1/2. These plots bear a bit of explanation. In our normal density plots, we use color to distinguish contour levels according to the scale on the left hand side of Figures 4.1 through 4.5. These are scaled to the maximum of the density plotted, so that two densities may have very different maxima but, as in Figures 4.1 and 4.2, will appear to differ only in the shape of the contours. In order to provide some relative scale so that markedly different maxima can be distinguished, and also to provide information on fine scale structure, we cycle through the color scale in each of Figures 4.4 and 4.5 several times. The color scale is simply a coding of the relative density on a scale of 0 to 1. In order to create plots of the form of Figures 4.4 and 4.5, we do not use the actual maximum of the density for scaling but simply set the "maximum" parameter artificially low so that we cycle through the color scale modulo the "maximum" parameter. Thus the more color cycles, the larger the peak. (This technique also gives rather attractive pictures which seem to capture the attention of those who see them. It is our answer to fractals.) It should be clear that Figure 4.4 is rather more peaked and has somewhat rounder central contours than Figure 4.5. This is, of course, what we would expect with a solid sphere compared to a sphere with a hole.

Wegman and Bolorforoush (1989) have implemented many multidimensional data analytic techniques in an MS-DOS program known as Mason Hypergraphics. Sample parallel coordinate line density plots are included among these techniques. Wegman and Bolorforoush use a variant of the ASH (average shifted histogram). ASH is particularly useful for graphics purposes because binning can be done with no logical branching and in an integer mode. Thus it is sufficiently fast to compute even on relatively simple machines.
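The two ingredients of the sampled plots, drawing points from a sphere with or without a hole and cycling the color scale modulo an artificially low "maximum" parameter, can be sketched as follows. This is an illustrative Python sketch under stated assumptions and not the Mason Hypergraphics code: it assumes the points are uniformly distributed over the solid region, uses simple rejection sampling, and the names `sample_ball_with_hole` and `cycled_color_index` are hypothetical.

```python
import random


def sample_ball_with_hole(n, dim=4, r_outer=1.0, r_inner=0.0, seed=None):
    """Rejection-sample n points uniformly from r_inner <= |u| <= r_outer.

    Candidates are drawn uniformly from the enclosing cube and kept only
    when their radius falls in the shell; r_inner = 0 gives a solid ball.
    """
    rng = random.Random(seed)
    points = []
    while len(points) < n:
        u = [rng.uniform(-r_outer, r_outer) for _ in range(dim)]
        r2 = sum(c * c for c in u)
        if r_inner ** 2 <= r2 <= r_outer ** 2:
            points.append(u)
    return points


def cycled_color_index(density, max_param, n_colors):
    """Color index for a nonnegative density, wrapping modulo max_param.

    Densities above max_param start the color scale again, so a taller
    peak passes through more complete color cycles.
    """
    frac = (density % max_param) / max_param  # position within the current cycle
    return min(int(frac * n_colors), n_colors - 1)
```

Pairs of adjacent coordinates from `sample_ball_with_hole(1500, r_inner=0.5)` can then be binned at each level $p$ to form a sample line density, and `cycled_color_index` applied to each cell of the resulting grid. Setting `max_param` to the true maximum of the density recovers an ordinary single-cycle color scale.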
REFERENCES

Bolorforoush, M. and Wegman, E. J. (1988), "On some graphical representations of multivariate data," Computing Science and Statistics: Proceedings of the 20th Symposium on the Interface, 121-126.

Inselberg, A. (1985), "The plane with parallel coordinates," The Visual Computer, 1, 69-91.

Scott, D. W. (1985), "Average shifted histograms: effective nonparametric estimators in several dimensions," Ann. Statist., 13, 1024-1040.

Scott, D. W. (1986), "Data analysis in three and four dimensions with nonparametric density estimation," in Statistical Image Processing and Graphics, (Wegman, E. J. and DePriest, D. J., eds.), 291-306.

Wegman, E. J. (1990), "Hyperdimensional data analysis using parallel coordinates," to appear J. Amer. Statist. Assoc.

Wegman, E. J. and Bolorforoush, M. (1989), Mason Hypergraphics (c) 1988, 1989, a MS-DOS package for exploring high dimensional data, to be published by Professional Statisticians Forum, Ltd.

Figure 1.1 Parallel coordinate representation of two n-dimensional points
Figure 2.1 Probability mass regions for parallel coordinate density plots
Figure 3.1 Parallel coordinate line density plot for BVN(0, 0, 1, 1, 0.4)
Figure 3.2 Parallel coordinate line density plot for BVN(0, 0, 1, 1, -0.9)
Figure 3.3 Parallel coordinate line density contour plot for BVN(0, 0, 1, 1, 0.9)
Figure 3.4 Parallel coordinate line density contour plot for BVN(0, 0, 1, 1, 0.4)
Figure 3.5 Parallel coordinate line density contour plot for BVN(0, 0, 1, 1, 0.0)
Figure 3.6 Parallel coordinate line density contour plot for BVN(0, 0, 1, 1, -0.4)
Figure 3.7 Parallel coordinate line density contour plot for BVN(0, 0, 1, 1, -0.9)
Figure 3.8 Parallel coordinate line density contour plot for BVN(-1, 1, 1, 1, -0.9)
Figure 3.9 Parallel coordinate line density contour plot for BVN(0, 0, 0.6, 1.2, 0.4)
Figure 3.10 Parallel coordinate line density contour plot for BVN(0, 0, 0.6, 1.2, -0.9)
Figure 3.11 Density regions for two uncorrelated uniform densities
Figure 3.12 Region I integration region for uniform densities
Figure 3.13 Region II integration region for uniform densities
Figure 3.14 Parallel coordinate line density plot for random variables uniformly distributed on the unit square
Figure 3.15 Parallel coordinate line density contour plot for random variables uniformly distributed on the unit square
Figure 4.1 Parallel coordinate line density contour plot for a solid 4-dimensional hypersphere
Figure 4.2 Parallel coordinate line density contour plot for a 4-dimensional hypersphere with a hole of radius 1/2
Figure 4.3 Parallel coordinate plot of the ratio of the line density of a 4-dimensional hypersphere with a hole of radius 1/2 to the line density of a solid 4-dimensional hypersphere
Figure 4.4 Parallel coordinate line density contour plot of 1500 sampled points from a solid 4-dimensional hypersphere
Figure 4.5 Parallel coordinate line density contour plot of 1500 sampled points from a 4-dimensional hypersphere with a hole of radius 1/2