High Dimensional Data Visualization
Martin Theus
University of Augsburg, Department of Computational Statistics and Data Analysis
[email protected]
One of the biggest challenges in data visualization is to find general representations of data that can display the multivariate structure of several variables at a time. In general we can distinguish two types of data displays: dynamic (i.e., allowing user interaction) and static. It turns out that in high dimensional data exploration, interaction with the visual display is very important for gaining better insight into the data. For presentation purposes, static graphics are usually needed to communicate the findings. Four different plot types have proven to be the most powerful:
• For purely categorical data
  – Mosaic Plots
• For purely continuous data
  – Parallel Coordinate Plots
  – Projection Pursuit and Grand Tour
• For data on mixed scales
  – Trellis Displays
All four visualization tools shall be investigated in this section in order to give a comprehensive overview of how to visually explore and present high dimensional data.
1 Mosaic Plots
Mosaic plots derive quite naturally from barcharts and spineplots. They form a recursive and conditional display of all possible combinations of categories of the discrete variables included. Each tile in a mosaic plot corresponds to an intersection of categories of the involved variables. The area of each tile is proportional to the number of observations in that particular intersection. Figure 1 gives an example barchart of the housing type from the Housing Factors data set in Cox & Snell’s (1981) book. All cases with highly satisfied tenants are selected in red.
Adding two more dimensions, namely the Contact and the Influence a tenant feels to have on the housing situation, gives the plot in Figure 2. Cases representing highly satisfied tenants are still highlighted in red.
Figure 1: Barchart of Housing Type (categories: Apartments, Atrium Houses, Terraced Houses, Tower blocks).
Figure 2: The mosaic plot for the complete Housing Factors data set. Housing Type (Apartments, Atrium H., Terraced H., Tower blocks) is first along x, then Contact (Low, High) along y, and Influence (Low, Medium, High) is the third division, along x again.
Figure 2 nicely shows the four-dimensional interaction structure of the data. There are two obvious deviations from the general pattern. Tenants in Tower Blocks seem to be more satisfied with their living conditions even when they claim to have only low Influence on these conditions. Tenants in Atrium Houses with low Contact and high Influence are highly satisfied less often than would be expected from the overall pattern. This is an example of how mosaic plots can be used not only in an exploratory graphical analysis but also as an excellent aid for setting up and interpreting the corresponding log-linear models, cf. Theus and Lauer (1998).
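The recursive, conditional construction of the tiles can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation behind Figure 2: the function name `mosaic_tiles`, the record layout and the toy counts are hypothetical, the split direction simply alternates between x and y, and the gaps that mosaic plots usually draw between tiles are omitted.

```python
from collections import Counter

def mosaic_tiles(records, variables, x=0.0, y=0.0, w=1.0, h=1.0, depth=0):
    """Recursively split the rectangle (x, y, w, h) by the categories of
    variables[0]; even depths split along x, odd depths along y.
    Returns a list of (category_path, x, y, w, h) tuples."""
    if not variables:
        return [((), x, y, w, h)]
    var, rest = variables[0], variables[1:]
    counts = Counter(r[var] for r in records)
    total = len(records)
    tiles, offset = [], 0.0
    for level in sorted(counts):
        share = counts[level] / total          # conditional proportion
        subset = [r for r in records if r[var] == level]
        if depth % 2 == 0:                     # divide the width
            sub = mosaic_tiles(subset, rest, x + offset * w, y,
                               share * w, h, depth + 1)
        else:                                  # divide the height
            sub = mosaic_tiles(subset, rest, x, y + offset * h,
                               w, share * h, depth + 1)
        tiles.extend(((level,) + path, tx, ty, tw, th)
                     for path, tx, ty, tw, th in sub)
        offset += share
    return tiles

if __name__ == "__main__":
    # Hypothetical toy data, NOT the Housing Factors data set.
    toy = ([{"type": "A", "contact": "Low"}] * 3
           + [{"type": "A", "contact": "High"}] * 1
           + [{"type": "B", "contact": "Low"}] * 2
           + [{"type": "B", "contact": "High"}] * 2)
    for path, tx, ty, tw, th in mosaic_tiles(toy, ["type", "contact"]):
        print(path, "area =", tw * th)
```

Because each split divides the current rectangle in proportion to the conditional counts, the area of every final tile equals its share of all observations.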
[More on interactions and variations of mosaic plots]
2 Parallel Coordinate Plots
Parallel coordinate plots use the two-dimensional Cartesian space more efficiently than scatterplots by placing all coordinate axes in parallel. For each observation, the points on the resulting dot plots are connected by a polygon, resulting in n polygons of k − 1 line segments each for a data set of size n × k. The parallel arrangement of the axes makes it possible to investigate far more variables at a time than, for instance, with scatterplot matrices (SPLOMs), since there is no redundant plotting of entities. However, the problem of overplotting is much more serious than it is with scatterplots. Since features are only visible between adjacent axes, a flexible permutation mechanism is needed to get a comprehensive view of all the data.
[Figure 3 axis labels: palmitoleic, palmitic, oleic, linolenic, stearic, linoleic, eicosenoic, arachidic]
Figure 3: PCP for the olives data. Moderate α-blending has been applied.
Figure 3 gives a classical example of a parallel coordinate plot, showing the fatty acid content of Italian olive oils from 9 Italian regions. The Sardinian oils have been highlighted, which allows the two clusters to be traced through all 8 variables. Parallel coordinate plots have been discussed not only in statistics and data analysis, but also in mathematics with regard to their geometrical properties. Whereas interactive scaling issues are not important for the geometrical interpretation, they very much are in data analysis. Figure 4 shows the results of the individual stages for all 147 cyclists who completed all 20 stages of the 2004 Tour de France. Whereas Figure 4 is scaled on individual scales per axis, the (cumulative) data in Figure 5 are on a common scale and centered at the mean of each variable. The US Postal team is highlighted in both Figure 4 and Figure 5.
Figure 4: PCP for the results of the 20 stages of the 2004 Tour de France (default scaling). Axes: E00.Prolog, E01.L203, E02.L197, E03.L210, E04.Team65, E05.L201, E06.L196, E07.L205, E08.L168, E09.L161, E10.S237, E11.M164, E12.S198, E13.S206, E14.L193, E15.S181, E16.EZ.AlpedHuez, E17.S205, E18.M167, E19.EZ55, E20.L163.
Figure 5: PCP for the cumulative results of the 20 stages of the 2004 Tour de France. Data is on common scale and has been centered at the mean of each variable. Axes: K00.Prolog to K20.L163, with the same stage suffixes as in Figure 4.
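The difference between the two scalings, and the count of n polygons with k − 1 segments each, can be made concrete with a small NumPy sketch. The function names and the min/max mapping used for the per-axis scaling are assumptions made for illustration; the text does not specify how the default scaling of the plotting software is computed.

```python
import numpy as np

def scale_per_axis(data):
    """Individual scales: map every column to [0, 1] by its own min and max."""
    data = np.asarray(data, dtype=float)
    lo = data.min(axis=0)
    span = data.max(axis=0) - lo
    span[span == 0] = 1.0                 # constant columns stay at 0
    return (data - lo) / span

def scale_common_centered(data):
    """Common scale: center each column at its mean, then map all columns
    to [0, 1] with one shared (global) range."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    lo, hi = centered.min(), centered.max()
    span = (hi - lo) or 1.0
    return (centered - lo) / span

def polylines(scaled):
    """Return an (n, k, 2) array: for each of the n observations, the k
    vertices (axis index, height) of its polygon, i.e. k - 1 segments."""
    n, k = scaled.shape
    axis_positions = np.broadcast_to(np.arange(k), (n, k))
    return np.stack([axis_positions, scaled], axis=2)
```

With per-axis scaling every axis uses its full height, so differences in spread between variables disappear; on the common scale a variable with a small spread occupies only a narrow band, which is what makes the growth of the time gaps visible in cumulative data.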
[More on different scalings and sortings and outlier]
3 Grand Tour and Projection Pursuit
So far, all plots are rendered on a two-dimensional medium, a sheet of paper or a computer screen. Even a 3-d rotating plot is “just” a 2-d projection of 3-d data, which gets its pseudo 3-dimensionality only from the “motion” of the rotation. The Grand Tour generalizes the idea of a 3-d rotating plot to an arbitrary number of dimensions. It is defined as:
A continuous 1-parameter family of d-dimensional projections of p-dimensional data which is dense in the set of all d-dimensional projections in $\mathbb{R}^p$. The parameter is usually thought of as time.
For a 3-d rotating plot, the data dimension p equals 3 and the projection dimension d equals 2. In contrast to the 3-d rotating plot, the grand tour does not have rotational controls, but uses successive randomly selected projections. Figure 6 shows an example of 3 successive planes P1, P2 and P3 in 3 dimensions. The planes between the randomly selected base planes are interpolated to get a smooth pseudo rotation, which is comparable to a physical 3-d rotation. A more technical description of the grand tour can be found in Buja (1996). Although the human eye is not very well trained to recognize rotations in more than 3 dimensions, the grand tour makes it possible to find structures like groups, gaps and dependencies in high dimensional data.
Figure 6: Example path of a grand tour (planes P1, P2 and P3).
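The idea of a path through randomly selected base planes with interpolated planes in between can be sketched as follows. This is a deliberately simplified illustration: the function names are hypothetical, and the interpolation is a linear blend of two frames followed by re-orthonormalization, which is an assumption made for brevity and not the interpolation scheme described in the technical literature cited above.

```python
import numpy as np

def random_frame(p, d, rng):
    """Random p x d matrix with orthonormal columns (one projection plane)."""
    q, _ = np.linalg.qr(rng.standard_normal((p, d)))
    return q

def interpolate_frames(f0, f1, steps):
    """Yield `steps` orthonormal frames leading from f0 to f1.
    Simplification: linear blend plus QR, not a geodesic path."""
    for t in np.linspace(0.0, 1.0, steps):
        q, r = np.linalg.qr((1.0 - t) * f0 + t * f1)
        signs = np.sign(np.diag(r))
        signs[signs == 0] = 1.0
        yield q * signs                   # undo sign flips introduced by QR

def grand_tour(data, d=2, n_bases=3, steps=10, seed=0):
    """Return the sequence of d-dimensional views of `data` along the tour."""
    rng = np.random.default_rng(seed)
    p = data.shape[1]
    bases = [random_frame(p, d, rng) for _ in range(n_bases)]
    views = []
    for f0, f1 in zip(bases[:-1], bases[1:]):
        for frame in interpolate_frames(f0, f1, steps):
            views.append(data @ frame)    # project the n x p data to n x d
    return views
```

Drawing the returned views one after another as scatterplots gives the impression of a smooth rotation; with p = 3 and d = 2 this reduces to a randomly steered 3-d rotating plot.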
[More examples and projection pursuit]
4 Trellis Displays
The previous sections dealt with either purely categorical or purely continuous data. When plotting data on mixed scales, Trellis Displays¹ have proven useful. Trellis displays use a lattice-like arrangement to place plots on so-called panels. Each plot in a trellis display is conditioned upon at least one other variable. The biggest advantage of trellis displays is the common scale among all plot panels. This allows an effective comparison of the panel plots between rows, columns (and pages), depending on the number of conditioning variables and the type of panel plot.
¹ Trellis displays are called Lattice Graphics within the R package.
Probably the simplest example of a trellis display is a boxplot y by x. Figure 7 shows a boxplot of exam results conditioned on the faculty of the students. Results can easily be compared between subjects, since the scale does not change when visually traversing the different faculties.
Figure 7: A boxplot y by x is a simple form of a trellis display. (Note that the whiskers of the base boxplots have been extended to be light gray boxes in order to allow highlighting.) Categories: Analysis, Algebra, Numerics, Statistics, Comp. Science, Opt, WiSo.
Figure 8 shows an example with two panel variables (Mark in 1st Degree and Mark in 2nd Degree), two conditioning variables (Subject and Gender) and one so-called adjunct variable, namely the Region the student comes from.
Figure 8: A trellis display incorporating 5 variables of the students data set. Panels: Gender (male, female) by Subject (Algebra, Analysis, Comp. Science, Numerics, Opt, Statistics, WiSo); axes: 1st Degree and 2nd Degree, each ranging from 1.0 to 4.0.
Trellis displays are most powerful when up to two conditioning variables and more complex panel plots are used. The panel plot can be any arbitrary plot. Often fitted models are superimposed. Figure 9 shows the same plot as in Figure 8 except for the adjunct variable. Each scatter plot has a lowess smoother superimposed. One problem with trellis displays is that it is hard to judge the number of cases in a panel plot. For example, in Figure 9 it would be desirable to have confidence bands for the scatter plot smoother in order to be able to judge the variability of the estimate.
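The two ingredients named above, conditioning and a common scale, can be illustrated without any plotting library. The sketch below only prepares the data for a trellis display: it groups rows into panels by the conditioning variables, derives one pair of axis limits shared by all panels, and reports the number of cases per panel, which is exactly the quantity that is hard to read off the display. The function name and the field names are hypothetical.

```python
from collections import defaultdict

def trellis_panels(rows, x, y, conditioning):
    """Group rows into panels keyed by the conditioning variables and
    compute axis limits that are shared by all panels."""
    panels = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in conditioning)
        panels[key].append((row[x], row[y]))
    xs = [row[x] for row in rows]
    ys = [row[y] for row in rows]
    limits = {"x": (min(xs), max(xs)), "y": (min(ys), max(ys))}
    counts = {key: len(points) for key, points in panels.items()}
    return dict(panels), limits, counts
```

A drawing routine would then render one panel plot per key and apply `limits` to every panel, so that positions are comparable across rows and columns; printing `counts` next to the panel titles is one simple way to convey how many cases each panel contains.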
Figure 9: Same trellis display as in Figure 8 with an additional lowess smoother superimposed.
Another problem with trellis displays is the method of shingling. Shingling is used when a conditioning variable is not categorical but continuous. Shingling a continuous variable means discretizing it into a number of overlapping categories or intervals. If this is done automatically, without any control or insight on the part of the data analyst, the resulting categories or intervals might be hard to interpret or even misleading.
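To make the notion of overlapping intervals concrete, the following sketch builds intervals that each contain roughly the same number of observations and share a given fraction of their points with the neighbouring interval. It loosely follows the equal-count idea; the function names and the parameters `number` and `overlap` are assumptions for this illustration and are not taken from the text.

```python
def shingle_intervals(values, number=4, overlap=0.5):
    """Return `number` overlapping (low, high) intervals over `values`,
    each covering roughly the same number of observations.
    `overlap` is the fraction of points shared by adjacent intervals."""
    if number < 1 or not 0 <= overlap < 1 or not values:
        raise ValueError("need number >= 1, 0 <= overlap < 1, non-empty values")
    xs = sorted(values)
    n = len(xs)
    per_interval = n / (number * (1 - overlap) + overlap)
    intervals = []
    for i in range(number):
        shift = i * (1 - overlap) * per_interval
        lo = max(int(round(1 + shift)) - 1, 0)
        hi = min(int(round(per_interval + shift)) - 1, n - 1)
        intervals.append((xs[lo], xs[hi]))
    return intervals

def shingle_membership(value, intervals):
    """Indices of all intervals containing `value` (possibly more than one)."""
    return [j for j, (lo, hi) in enumerate(intervals) if lo <= value <= hi]
```

For the values 1 to 20 with three intervals and an overlap of 0.5, the sketch yields (1, 10), (6, 15) and (11, 20), so the value 8 appears in the first two panels. This double membership is part of what can make automatically shingled panels hard to interpret.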
[Further discussion of applications and examples]