High Dimensional Data Visualization

Martin Theus
University of Augsburg, Department of Computational Statistics and Data Analysis
[email protected]

One of the biggest challenges in data visualization is to find general representations of data that can display the multivariate structure of several variables at a time. In general we can distinguish two types of data displays: dynamic (i.e. those that allow user interaction) and static. It turns out that in high dimensional data exploration, interaction with the visual display is very important for gaining better insight into the data. For presentation purposes, static graphics are usually needed to communicate the findings. Four different plot types have proven to be the most powerful:

• For purely categorical data
  – Mosaic Plots
• For purely continuous data
  – Parallel Coordinate Plots
  – Projection Pursuit and Grand Tour
• For data on mixed scales
  – Trellis Displays

All four visualization tools shall be investigated in this section in order to give a comprehensive overview of how to visually explore and present high dimensional data.

1 Mosaic Plots

Mosaic plots derive quite naturally from barcharts and spineplots. They form a recursive and conditional display of all possible combinations of categories of the discrete variables included. Each tile in a mosaic plot corresponds to an intersection of categories of the variables involved. The area of each tile is proportional to the number of observations in that particular intersection. Figure 1 gives an example barchart of the housing type from the Housing Factors data set in Cox & Snell’s (1981) book. All cases with highly satisfied tenants are highlighted in red.
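The recursive, conditional construction described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation behind the figures: the function name `mosaic_tiles`, the record layout, and the toy data are all assumptions made for the example, and the gaps that real mosaic plots draw between tiles are left out. The split direction alternates between x and y at each level of the recursion.

```python
from collections import Counter


def mosaic_tiles(records, variables, x0=0.0, y0=0.0, width=1.0, height=1.0, depth=0):
    """Recursively split a rectangle; return {category tuple: (x, y, w, h)}.

    `records` is a list of dicts, `variables` the ordered list of categorical
    variables to split on (hypothetical layout chosen for this sketch).
    """
    if depth == len(variables):
        return {(): (x0, y0, width, height)}
    var = variables[depth]
    counts = Counter(r[var] for r in records)
    total = len(records)
    tiles = {}
    offset = 0.0  # cumulative share of the levels already placed
    for level in sorted(counts):
        share = counts[level] / total  # conditional relative frequency
        subset = [r for r in records if r[var] == level]
        if depth % 2 == 0:  # even depth: divide along x
            sub = mosaic_tiles(subset, variables, x0 + offset * width, y0,
                               share * width, height, depth + 1)
        else:  # odd depth: divide along y
            sub = mosaic_tiles(subset, variables, x0, y0 + offset * height,
                               width, share * height, depth + 1)
        for key, rect in sub.items():
            tiles[(level,) + key] = rect
        offset += share
    return tiles


# Toy data (invented for illustration, not the Housing Factors data).
toy = ([{"type": "A", "contact": "low"}] * 3
       + [{"type": "A", "contact": "high"}] * 1
       + [{"type": "B", "contact": "low"}] * 2
       + [{"type": "B", "contact": "high"}] * 2)
for key, (x, y, w, h) in mosaic_tiles(toy, ["type", "contact"]).items():
    print(key, "area =", w * h)
```

Because each split uses the conditional share of a category within its parent tile, the area of every final tile equals its cell count divided by the total number of observations.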

Adding two more dimensions, namely the Contact and the Influence a tenant feels to have on the housing situation, gives the plot in Figure 2. Cases representing highly satisfied tenants are still highlighted in red.

Figure 1: Barchart of Housing Type. (Categories: Apartments, Atrium Houses, Terraced Houses, Tower blocks.)

Figure 2: The mosaic plot for the complete Housing Factors data set. Housing Type is split first along x, then Contact along y, and Influence as the third division along x again. (Labels: Housing Type: Apartments, Atrium H., Terraced H., Tower blocks; Contact: Low, High; Influence: Low, Medium, High.)

Figure 2 nicely shows the four-dimensional interaction structure of the data. There are two obvious deviations from the general pattern. Tenants in Tower Blocks seem to be more satisfied with their living conditions even when they claim to have only low Influence on these conditions. Tenants in Atrium Houses with low Contact and high Influence are highly satisfied less often than would be expected from the overall pattern. This is an example of how mosaic plots can be used not only in an exploratory graphical analysis but also as an excellent aid for setting up and interpreting corresponding log-linear models, cf. Theus and Lauer (1998).

[More on interactions and variations of mosaic plots]

2 Parallel Coordinate Plots

Parallel coordinate plots use the two-dimensional Cartesian space more efficiently than scatterplots by placing all coordinate axes in parallel. For each observation, the points in the resulting dot plots are connected by a polygon, resulting in n polygons of k − 1 line segments for a dataset of size n × k. The parallel arrangement of the axes makes it possible to investigate far more variables at a time than, for instance, with scatterplot matrices (SPLOMs), since there is no redundant plotting of entities. Nevertheless, the problem of overplotting is much more serious than it is with scatterplots. Since features are only visible between adjacent axes, a flexible permutation mechanism is needed to get a comprehensive view of all the data.
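The geometry of this construction can be made concrete with a short Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, each axis is scaled to its own minimum and maximum, and the data are invented. Each of the n rows becomes one polyline with k points, hence k − 1 segments.

```python
def parallel_coordinate_polylines(rows):
    """Map an n x k table to n polylines, each a list of k (axis, height) points.

    Every axis j is scaled to [0, 1] by its own minimum and maximum.
    """
    k = len(rows[0])
    lows = [min(r[j] for r in rows) for j in range(k)]
    highs = [max(r[j] for r in rows) for j in range(k)]
    polylines = []
    for r in rows:
        points = []
        for j in range(k):
            span = highs[j] - lows[j]
            # A constant column is drawn at mid-height to avoid dividing by zero.
            height = (r[j] - lows[j]) / span if span else 0.5
            points.append((j, height))
        polylines.append(points)
    return polylines


def segments(polyline):
    """Return the k - 1 line segments between adjacent axes."""
    return list(zip(polyline[:-1], polyline[1:]))


# Invented 3 x 3 example: three observations, three variables.
rows = [[1, 10, 100], [2, 30, 300], [3, 20, 200]]
for line in parallel_coordinate_polylines(rows):
    print(segments(line))
```

Reordering the columns of `rows` before calling the function corresponds to the axis permutation mentioned above, since only variables on neighbouring axes are joined by a segment.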

Figure 3: PCP for the olives data. Moderate α-blending has been applied. (Axis variables: palmitoleic, palmitic, oleic, linolenic, stearic, linoleic, eicosenoic, arachidic.)

Figure 3 gives a classical example of a parallel coordinate plot, showing the fatty acid content of Italian olive oils from 9 Italian regions. The Sardinian oils have been highlighted, which makes it possible to trace the two clusters through all 8 variables. Parallel coordinate plots have been discussed not only in statistics and data analysis, but also in mathematics with regard to their geometrical properties. Whereas interactive scaling issues are not important for the geometrical interpretation, they very much are in data analysis. Figure 4 shows the results of the individual stages for all 147 cyclists who completed all 20 stages of the 2004 Tour de France. Whereas Figure 4 is scaled on individual scales per axis, the (cumulative) data in Figure 5 is on a common scale and centered at the mean of each variable. The US Postal team is highlighted in both Figure 4 and Figure 5.

Figure 4: PCP for the results of the 20 stages of the 2004 Tour de France, default scaling. (Axes: E00.Prolog, E01.L203, E02.L197, E03.L210, E04.Team65, E05.L201, E06.L196, E07.L205, E08.L168, E09.L161, E10.S237, E11.M164, E12.S198, E13.S206, E14.L193, E15.S181, E16.EZ.AlpedHuez, E17.S205, E18.M167, E19.EZ55, E20.L163.)

Figure 5: PCP for the cumulative results of the 20 stages of the 2004 Tour de France. Data is on a common scale and has been centered at the mean of each variable. (Axes: K00.Prolog, K01.L203, K02.L197, K03.L210, K04.Team65, K05.L201, K06.L196, K07.L205, K08.L168, K09.L161, K10.S237, K11.M164, K12.S198, K13.S206, K14.L193, K15.S181, K16.EZ.AlpedHuez, K17.S205, K18.M167, K19.EZ55, K20.L163.)
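The difference between the two scalings can be sketched as follows. This is one plausible reading of "individual scales per axis" and "common scale, centered at the mean", not the exact procedure used for the figures; the function names and the numbers are invented for the example.

```python
def individual_scaling(columns):
    """Scale every axis to [0, 1] using that axis's own minimum and maximum."""
    scaled = []
    for col in columns:
        lo, hi = min(col), max(col)
        span = hi - lo
        scaled.append([(v - lo) / span if span else 0.5 for v in col])
    return scaled


def common_scale_centered(columns):
    """Subtract each axis mean, then map all axes with one shared range."""
    centered = []
    for col in columns:
        mean = sum(col) / len(col)
        centered.append([v - mean for v in col])
    lo = min(min(col) for col in centered)
    hi = max(max(col) for col in centered)
    span = hi - lo
    return [[(v - lo) / span if span else 0.5 for v in col] for col in centered]


# Invented values for two axes with very different spreads.
columns = [[10.0, 12.0, 14.0], [100.0, 150.0, 200.0]]
print(individual_scaling(columns))     # both axes fill the full height
print(common_scale_centered(columns))  # the first axis stays compressed
```

With individual scaling every axis uses the full plot height, so differences in spread between variables disappear. On the common scale the axis means line up at the same height and the spread of each variable remains comparable across axes.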

[More on different scalings and sortings and outlier]

3 Grand Tour and Projection Pursuit

So far, all plots have been rendered on a two-dimensional medium, a sheet of paper or a computer screen. Even a 3-d rotating plot is “just” a 2-d projection of 3-d data, which gets its pseudo 3-dimensionality only from the “motion” of the rotation. The Grand Tour generalizes the idea of a 3-d rotating plot to an arbitrary number of dimensions. It is defined as:

    A continuous 1-parameter family of d-dimensional projections of p-dimensional data which is dense in the set of all d-dimensional projections in R^p. The parameter is usually thought of as time.

For a 3-d rotating plot, p equals 3 and d equals 2. In contrast to the 3-d rotating plot, the grand tour does not have rotational controls, but uses successive randomly selected projections. Figure 6 shows an example of 3 successive planes P1, P2 and P3 in 3 dimensions. The planes between the randomly selected base planes are interpolated to get a smooth pseudo rotation, which is comparable to a physical 3-d rotation. A more technical description of the grand tour can be found in Buja (1996). Although the human eye is not very well trained to recognize rotations in more than 3 dimensions, the grand tour makes it possible to find structures like groups, gaps and dependencies in high dimensional data.

Figure 6: Example path of a grand tour. (Planes shown: P1, P2, P3.)
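The idea of moving between randomly selected base planes can be sketched in plain Python. The sketch makes a deliberate simplification: it blends two orthonormal frames linearly and re-orthonormalizes the result with Gram-Schmidt, which is not necessarily the interpolation scheme described in the technical literature. All function names and parameters are assumptions made for this illustration.

```python
import random


def orthonormalize(vectors):
    """Gram-Schmidt: turn d linearly independent p-vectors into an orthonormal frame."""
    frame = []
    for v in vectors:
        w = list(v)
        for u in frame:
            dot = sum(a * b for a, b in zip(w, u))
            w = [a - dot * b for a, b in zip(w, u)]
        norm = sum(a * a for a in w) ** 0.5  # zero only in degenerate cases
        frame.append([a / norm for a in w])
    return frame


def random_frame(p, d, rng):
    """Random d-dimensional projection frame in p dimensions."""
    return orthonormalize([[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(d)])


def interpolate(frame_a, frame_b, t):
    """Blend two frames linearly (simplification), then restore orthonormality."""
    blended = [[(1 - t) * a + t * b for a, b in zip(u, v)]
               for u, v in zip(frame_a, frame_b)]
    return orthonormalize(blended)


def project(point, frame):
    """Project one p-dimensional point onto the d frame vectors."""
    return [sum(a * b for a, b in zip(point, u)) for u in frame]


def tour_frames(p, d, n_targets, steps, seed=0):
    """Sequence of frames passing through n_targets random base frames."""
    rng = random.Random(seed)
    targets = [random_frame(p, d, rng) for _ in range(n_targets)]
    frames = []
    for a, b in zip(targets[:-1], targets[1:]):
        for s in range(steps):
            frames.append(interpolate(a, b, s / steps))
    frames.append(targets[-1])
    return frames


frames = tour_frames(p=5, d=2, n_targets=3, steps=10)
print(project([1.0, 2.0, 3.0, 4.0, 5.0], frames[0]))
```

Drawing the projected data for each frame in turn gives the animated pseudo rotation; here the time parameter of the definition is simply the position in the list of frames.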

[More examples and projection pursuit]

4 Trellis Displays

The preceding sections dealt with either purely categorical or purely continuous data. For plotting data on mixed scales, Trellis Displays¹ have proven useful. Trellis displays use a lattice-like arrangement to place plots on so-called panels. Each plot in a trellis display is conditioned upon at least one other variable. The biggest advantage of trellis displays is the common scale among all plot panels. This allows an effective comparison of the panel plots between rows, columns (and pages), depending on the number of conditioning variables and the type of panel plot. Probably the simplest example of a trellis display is a box-plot y by x. Figure 7 shows a box-plot of exam results conditioned on the faculty of the students. Results can easily be compared between subjects, since the scale does not change when visually traversing the different faculties.

¹ Trellis displays are called Lattice Graphics in R.

Figure 7: A boxplot y by x is a simple form of a trellis display. (Note that the whiskers of the base boxplots have been extended to be light gray boxes in order to allow highlighting.) (Categories: Analysis, Algebra, Numerics, Statistics, Comp. Science, Opt, WiSo.)

Figure 8 shows an example with two panel variables (Mark in 1st Degree and Mark in 2nd Degree), two conditioning variables (Subject and Gender), and one so-called adjunct variable, namely the Region the student comes from.

Figure 8: A trellis display incorporating 5 variables of the students data set. (Panels: male and female for each of Algebra, Analysis, Comp. Science, Numerics, Opt, Statistics, WiSo; axes: 1st Degree and 2nd Degree, each running from 1.0 to 4.0.)

Trellis displays are most powerful when up to two conditioning variables and more complex panel plots are used. The panel plot can be any arbitrary plot. Often fitted models are superimposed. Figure 9 shows the same plot as in Figure 8 except for the adjunct variable. Each scatter plot has a lowess smoother superimposed. One problem with trellis displays is that it is hard to judge the number of cases in a panel plot. In Figure 9, for example, it would be desirable to have confidence bands for the scatter plot smoother in order to judge the variability of the estimate.
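The conditioning step and the common scale can be sketched as follows. The function name `trellis_panels`, the record layout, and the field names are assumptions for this illustration, and the marks are invented. The sketch also returns the number of cases per panel, the quantity that is hard to read off the display itself.

```python
from collections import defaultdict


def trellis_panels(records, cond_vars, x_var, y_var):
    """Group records into panels by the conditioning variables.

    Returns the panel data, one shared set of axis limits, and the
    number of cases in each panel.
    """
    panels = defaultdict(list)
    for r in records:
        key = tuple(r[v] for v in cond_vars)
        panels[key].append((r[x_var], r[y_var]))
    xs = [r[x_var] for r in records]
    ys = [r[y_var] for r in records]
    # The limits are computed from ALL records, so every panel shares them.
    limits = {"x": (min(xs), max(xs)), "y": (min(ys), max(ys))}
    counts = {key: len(points) for key, points in panels.items()}
    return dict(panels), limits, counts


# Invented student records (not the data set shown in the figures).
students = [
    {"subject": "Algebra", "gender": "male", "first": 1.3, "second": 2.0},
    {"subject": "Algebra", "gender": "female", "first": 2.7, "second": 1.7},
    {"subject": "Statistics", "gender": "female", "first": 1.0, "second": 1.3},
    {"subject": "Statistics", "gender": "female", "first": 3.3, "second": 4.0},
]
panels, limits, counts = trellis_panels(students, ["gender", "subject"], "first", "second")
print(limits)
print(counts)
```

Drawing every panel with the limits returned here is what makes comparisons across rows and columns valid. Printing or annotating the counts is one simple way to show how many cases each panel rests on.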


Figure 9: Same trellis display as in Figure 8 with an additional lowess smoother superimposed.

Another problem with trellis displays is the method of shingling. Shingling is used when a conditioning variable is not categorical but continuous. Shingling a continuous variable means discretizing the variable into a number of overlapping categories or intervals. If this is done automatically, without any control or insight from the data analyst, the resulting categories or intervals might be hard to interpret or even misleading.
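A small sketch shows what overlapping intervals look like in practice. It assumes equal-width intervals and a fixed overlap fraction, which is only one of several possible shingling rules and not necessarily the one used by any particular software. The function names `shingles` and `assign` are hypothetical.

```python
def shingles(values, n_intervals, overlap=0.5):
    """Cover the range of `values` with equal-width, overlapping intervals.

    `overlap` is the fraction of each interval shared with its neighbour
    and must be smaller than 1.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / (n_intervals - (n_intervals - 1) * overlap)
    step = width * (1 - overlap)
    intervals = []
    for i in range(n_intervals):
        start = lo + i * step
        # Pin the last upper end to the maximum to guard against rounding.
        end = hi if i == n_intervals - 1 else start + width
        intervals.append((start, end))
    return intervals


def assign(value, intervals):
    """Indices of all intervals containing `value` (possibly more than one)."""
    return [i for i, (a, b) in enumerate(intervals) if a <= value <= b]


values = [float(v) for v in range(11)]  # 0.0, 1.0, ..., 10.0
cuts = shingles(values, n_intervals=3, overlap=0.5)
print(cuts)
print(assign(3.0, cuts))
```

Because the intervals overlap, a single observation can appear in more than one panel. That is part of why automatically chosen shingles deserve a check by the analyst before the panels are interpreted.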

[Further discussion of applications and examples]
