Graphical Methods for Describing Data*
Lynd D. Bacon, Ph.D.
President, Lynd Bacon & Associates, Ltd., Homewood IL

The American Marketing Association Marketing Research Division "All-Tutorial Event"
April 15 & 16, 1996 - Princeton NJ
April 29 & 30, 1996 - Oak Brook IL
*Thanks to the planning committee and the AMA for the opportunity to lead this workshop. My thanks also to the AMA staff for their patience and professionalism as I put it together.
Background: Lynd D. Bacon

Lynd Bacon is president of Lynd Bacon & Associates, Ltd., a firm providing management science and research services to help firms better adapt to their competitive environments. Bacon has a Ph.D. in experimental psychology from the University of Illinois at Chicago, and an M.B.A. from the Graduate School of Business at the University of Chicago. Previously, Bacon was President and CEO of Information Arts, Inc., Director of the GM Satisfaction Center for Maritz Marketing Research Inc., and Associate Director of the Center for Research in Marketing at the University of Chicago.

He has taught courses in marketing, experimental psychology, statistics and methodology, management, and health care management at The University of Chicago Graduate School of Business, The University of Illinois-Chicago, and Rush University. While at the University of Chicago, he also taught marketing strategy and planning in the University's continuing education program.

Bacon's professional affiliations include the American Marketing Association, the American Psychological Association, the Association for Consumer Research, the American Evaluation Association, the American Association for Quality Control, the American Statistical Association, the Psychometric Society, the IEEE Computer Society, and the Society for Chaos Theory in Psychology and the Life Sciences. He has conducted a range of tutorials on methodological issues in marketing research for the American Marketing Association. His interests include consumer decision-making, statistical graphics, dogs, and motorcycles.

Bacon's interest in graphics goes back to his training at the University of Illinois. His dissertation was entitled The Detection of Luminance Patterns in the Left and Right Visual Field, and was a study of how humans process spatially distributed information.
As part of his graduate education in cognitive, experimental, and physiological psychology, he studied visual physiology, the development of visual perception, visual psychophysics, and visual perception, memory, and judgment in adults.

Lynd Bacon can be reached at:

Lynd Bacon & Associates, Ltd.
545 S. Hamilton Wood Road
Homewood IL USA 60430-4403
[email protected]
(708) 957.0883 (voice)
(708) 957.1920 (fax/bbs)
Graphical Methods for Data Analysis: Some Background
“To make oneself understood to the people, you must first speak to their eyes.” - N. Bonaparte

We humans gather most of the important information about our environment using our eyes. The processing our nervous system does on visual information is arguably more sophisticated and sensitive than what it brings to bear for other sensory modalities. We also tend to trust visual information more than other kinds of sensory input. When faced with discrepant stimuli, we will often favor what we see over other inputs.

Statisticians and psychologists have long been interested in how we process visual information, because of the implications for perception of, and judgment about, visual stimuli. Statisticians have been interested in the visual display of data, or “charts” and “graphs,” at least as early as William Playfair (1786). Graphs can be used to summarize data and to discover relationships in it. The oldest area of research in experimental psychology is the study of visual psychophysics, which concerns the relationship between the physical characteristics of things we may see and our judgments about them (Baird & Noma 1978). Cognitive psychologists have also studied our memory for visual information.

These research traditions indicate that although the human visual system and mind are highly sensitive instruments, there are some things they can do better than others. We are, for example, subject to certain perceptual biases and distortions. There also are limits to how much information we can apprehend at any given time, what we can remember, and how we remember it. Our interest here is an understanding of both our perceptual strengths and weaknesses, and of what is important in analyzing data. Our goal is to review basic principles of statistical graphics relevant to this understanding, principles that can be used to develop more effective statistical graphics. These principles are useful because they are based on theory and empirical evidence.
There is another aspect of designing statistical graphics that has to do with aesthetics. It can be argued that “pleasing” displays are more likely to be effective because viewers will probably attend to them more. In some cases embellishments of basic graphs may in fact make them better. In others, they may inhibit effectiveness. Tufte (1983) calls graph features that do not represent information “chartjunk.”

This paper is a complement to the tutorial presentation. The presentation materials contain specific examples that are not included here, and vice versa. At the end of this paper is a reference list that includes citations for this paper, as well as general resources and references that may be mentioned in the seminar. There are still others on this list that may not be mentioned anywhere, but which I consider to be interesting and relevant.
What is a graph? According to Kosslyn (1994), a graph is a visual display showing one or more relationships between numbers. James and James (1992) define graphs in their Mathematics Dictionary as:

“...(3) A drawing which shows the relation between certain sets of numbers... Used to convey a better idea of the meaning of the data than is evident directly from the numbers.” (p. 189)

Note that the meaning, or message, of the data is an element of this definition. Playfair (1786) felt that graphs not only have meaning, but that they aid decision-making in the face of large amounts of complex data:
“On inspecting any one of these Charts attentively, a sufficiently distinct impression will be made, to remain unimpaired for a considerable time, and the idea which does remain will be simple and complete, at once including the duration and amount. Men of great rank, or active business, can only pay attention to general outlines; nor is attention to particulars of any use, any further than as they give a general information: And it is hoped, that with the assistance of these Charts, such information will be got, without the fatigue and trouble of studying the particulars of which it is composed.”

Playfair was obviously concerned with improving decision-making, as well as with the limited attention span of businessmen.

Using graphs in marketing research

As marketing researchers we have two kinds of uses for graphs. One is to help us summarize data so we can learn about it. By summarizing data graphically, we can describe measures of interest, discover relationships, or test models that are based on our beliefs about how the world works. Cleveland (1993) refers to the graphical analysis of data as “visualizing data,” and considers it to consist of both graphing and model fitting. In the process of using graphs for analytic purposes, we often make several, and discard most of them.

The second kind of use is communicating our observations to others. For this use, we have to be concerned with whether graphs are effective for an intended audience whose expertise and interests we have somehow taken into account. These graphs are often called “presentation graphics,” and are typically more comprehensive than analytical graphs. They tell a story about the data that we have developed during its analysis.
Graph effectiveness and goodness

Most would agree that a good graph makes its message clear. Alternatively, it answers the questions that viewers ask. This definition makes effectiveness of communication paramount, and certainly more important than entertainment value. According to Tukey (1977), a good graph compels us to notice what we never expected to see. Bertin (1983) thought that a good graph answers the viewer’s questions. Kosslyn (1994) thinks that a graph is good if whatever pattern or trend there is in the data is immediately perceived and understood. So, a good graph not only conveys meaning, but does so without a lot of work on the part of the graph’s perceiver. It knocks your socks off, or at least kicks you in the shins.

The research that has been done on graphical perception (see, for example, Cleveland 1985; Spence & Lewandowsky 1991; Kosslyn 1994 for a brief review) has in fact emphasized accuracy and speed in interpretation. In this research, characteristics of graphs are varied, and their effects on the accuracy, consistency, and speed of judgments are measured. Simply put, graphical perception is the decoding of information encoded in a visual display (Cleveland 1985). These studies of perception do not address how graph characteristics may affect a graph’s memorability. There is little research directly on this issue, but there is an extensive literature on memory for pictures. Several authors have codified the findings from these lines of research into design “principles” for good statistical graphics. We will review some of these recommendations below.

Human limitations

One key to understanding graphical design principles is to have an appreciation of what we as graph perceivers can and cannot do well. This is because graphs are effective to the extent to which they exploit human abilities (Kosslyn 1994). Kosslyn has summarized what needs to be considered in regard to human capacities in three lemmas:
• The mind is not a camera.
• The mind judges a book by its cover.
• The spirit is willing, but the mind is weak.
Kosslyn points out that our visual systems actively organize and interpret input. Salience is determined by changes, and large differences in size or brightness will attract our attention. Inputs that are close to each other, have good continuation, or are similar in color or shape are organized into perceptual units. The visual system is organized into “channels” for processing orientation, color, and movement information. Variations along some perceptual dimensions cannot be perceived without noticing variations on others, such as height and width. Systematic perceptual distortions occur in some visual dimensions, like area, intensity, and volume. The relations between object properties in space are not very accurately encoded. Finally, our mind tends to interpret two-dimensional patterns as three-dimensional if possible, but doesn’t do so with great accuracy.

Kosslyn also points out that we have a tendency to believe that what we see is real and true. We are put to some effort when what we think we see, and how we interpret it, don’t concur. We also tend to make symbolic interpretations of visual patterns that are based on our experience. Finally, we are “wired” to be sensitive to, and to interpret, changes. Any perceived change in a pattern is construed to be informative. Thus, the mind infers a lot about visual input, thereby “judging the book.” Kosslyn’s notion is one good reason not to fill research reports with graphs that don’t show anything. Managers will think they do.

Kosslyn’s third observation, that the “mind is weak,” reflects the limited information processing capacities of human cognition. There are limits to the detectability of visual stimuli, as well as to whether we can distinguish between them. Our immediate, or short-term, memory is limited to holding just a few pieces of information at a time. If we are overloaded with information, we take longer to use it, and will make mistakes. Our ability to recall information is limited by the conditions under which it was encoded into long-term memory, that is, the context in which it was learned.
Have you ever flipped through pages of graphs, trying to compare them? When we communicate with others, or use information provided by them, we operate under the assumption that the information we get will be relevant to a specific issue, question, or decision. And we can only interpret visual information if we have the necessary knowledge about its content. In general, we tend to gather information and to make decisions as if we are trying to minimize cognitive load, that is, to make the least effort possible. And we often compromise by stopping short of finding the best answer to the problems we solve. This has been referred to as “satisficing” by Herb Simon and other cognitive scientists.

It is my view that one of the most important graphical design principles is “minimizing cognitive load” for the viewer. This doesn’t mean that graphs shouldn’t be complicated. It means that graphs shouldn’t make it hard for the viewer to figure out what the relevant complexity is. This idea is related to Tufte’s (1993) observation that graphs need to be designed so that the technology (both statistical and graphical) used to create them is transparent to the graph viewer. That is, the viewer should not be separated from the data information by how the display is constructed. Perceiving any new graph is likely to require some learning, in the sense that the relevant
attributes must be detected, discriminated, and interpreted. The process is not unlike starting to study cytoarchitecture by viewing histological slides. To viewers who have little knowledge of histology, an unlabeled slide is likely to look completely unstructured. With some identifying information, some structure will be perceived. With information and experience, the slide content can be identified accurately. We want to avoid making graphs that are like microscope slides, both for our own benefit and for the benefit of others who will view them. We usually have little control over other viewers’ knowledge, skills, and motivations. We need to consider what these are, and figure out how to design our graphs so viewers can learn to perceive what’s in them. One important strategy for minimizing the cognitive load of other viewers is to minimize the number of different kinds of graphs we present to them.

Psychophysics of graph perception

In the early 1980s, a research program on graphical perception was begun at Bell Laboratories. The work has been conducted by Chambers, Cleveland, and their colleagues (1983, 1985, 1993a, b; Cleveland & McGill 1984, 1985). The goals of this research have been to describe graphical perception, and to develop a paradigm for designing graphical displays. The early objectives were to describe the basic tasks we perform when perceiving graphics, and to order them by difficulty. They defined the elementary perceptual tasks, or judgments, summarized in Table 1. These tasks have to do with judging, and deciding about, quantitative information, that is, information whose value can vary continuously. In addition to the elementary tasks, Cleveland et al. have emphasized the fundamental importance of distance between elements, and of detection, in graphical perception. Distance affects the ease with which basic tasks are accomplished.
Judgments along a common axis, for example, are affected by how far away the data points are that are being compared. Accuracy will decrease with increasing distance. In fact, comparing two points on a common axis may be no easier than comparing them on unaligned, identical axes if the points are far enough apart in a graph.

Detection is the most basic requirement for graphical perception to occur. A data point, for example, must be detected before it can be analyzed in the context of other graph elements. Detection issues concern the separability of points on a scatterplot. As more and more symbols occupy the same physical region in a graph, the harder it gets to discern the individual symbols as distinct entities. If they cannot be distinguished, their relations to one another cannot be, either. This isn’t of concern if the important characteristic of the data is a density “blob,” of course.

After running some psychophysical experiments, Cleveland et al. observed that the tasks in Table 1 were reliably different in terms of difficulty. In fact, they are listed in Table 1 in order of increasing average error size. That is, judgments were most accurate, on a percent error basis, for judging positions along a common scale, and least accurate for judging differences in hue. The typical task for a subject in these studies was to judge what percent of a characteristic, like length, one stimulus had relative to another.
Table 1. Elementary graphical perception tasks for quantitative information (Cleveland, 1985)

Task                                       Example
1. Position along a common scale           Locating two points on a scatterplot on a common axis
2. Position along identical,               Comparing points on different plots having the same
   nonaligned scales                       axes or reference frames
3. Length                                  Judging the length of confidence bars around points at
                                           different locations in a plot
4. Angle/slope                             Deciding if two lines have the same slope, or how much
                                           change occurs at different points in a time series
5. Area                                    Comparing the sizes of circles or squares
6. Volume                                  Judging the relative volumes enclosed by cubes
7. Color hue and saturation; density       Comparing areas on color-coded penetration/market
                                           potential maps
The implications of these results for designing statistical graphics are clear. One should design graphics so that perceiving the important statistical information relies on the easiest elementary tasks possible. Cleveland et al. are not fond of pie charts because perceiving them requires making judgments about angles and/or areas. In principle, a bar chart or dot chart should be easier to perceive veridically. Figure 1 compares a pie and a dot chart for the same data. You be the judge. More on dot charts, or dot plots, will follow below.

A model for studying displays

Cleveland (1993b) has described a model of the perceptual processing of graphs. One of the reasons his model is interesting is that it provides some heuristics for evaluating how good a graph’s design is, as well as some terminology for the graphical and informational elements of displays.
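The pie-versus-dot-chart comparison in Figure 1 can be approximated with a short sketch. The following is a minimal, hypothetical example (made-up state codes and proportions, drawn with matplotlib, which is not the software behind the original figure) that renders the same numbers both ways:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Hypothetical data: proportion of customers in each of six states.
states = ["IL", "WI", "IN", "MI", "OH", "IA"]
proportions = [0.28, 0.22, 0.18, 0.14, 0.10, 0.08]

fig, (ax_pie, ax_dot) = plt.subplots(1, 2, figsize=(9, 4))

# Pie chart: the viewer must judge angles and areas (tasks 4 and 5 in Table 1).
ax_pie.pie(proportions, labels=states)
ax_pie.set_title("proportion of customers from each state")

# Dot chart: the viewer judges position along a common scale (task 1),
# the easiest of the elementary perceptual tasks.
order = sorted(range(len(states)), key=lambda i: proportions[i])
ax_dot.plot([proportions[i] for i in order], range(len(order)), "o")
ax_dot.set_yticks(range(len(order)))
ax_dot.set_yticklabels([states[i] for i in order])
ax_dot.set_xlabel("proportion of customers")
ax_dot.set_xlim(0, 0.30)
ax_dot.grid(axis="x", linestyle=":", linewidth=0.5)

fig.savefig("pie_vs_dot.png")
```

Sorting the dot chart by value, as above, makes ranking judgments nearly effortless; the pie offers no comparable ordering cue.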
[Figure 1 appears here: a pie chart titled "proportion of customers from each state" and a dot chart of the same data, with state codes on the vertical axis and "proportion of customers" (0.0 to 0.20) on the horizontal axis.]
Figure 1. Pie chart and dot chart of the proportion of customers from states, which are given as numbers.

Cleveland sees a graphical method for statistical analysis as having two components. A statistical component has to do with selecting what statistical information will be displayed graphically. The related decisions are theory- and experience-driven. For example, to describe the distribution of weekly purchase amounts in a sample of households, one has to decide what information is sufficient to describe such a distribution. If we were to summarize the purchase amounts in tabular form, we might include estimates of the mean and median, perhaps the standard deviation, and the range. Where the relevant information is the relationship between two quantitative (i.e., continuous) variables, we might want to show the data pairs and their linear relationship. The second component of a method consists of selecting how to display the relevant statistical information. For the relationship example above, we might choose to create a scatterplot that also shows a best-fit line for the data points.

Cleveland also distinguishes between types of information placed in a graphical display. His taxonomy consists of a two-way classification. Information is either quantitative or categorical, and it can be represented both in terms of measurement scale and in terms of its physical characteristics in the graph. Measures on a continuous variable, like total durable purchases in dollars, are quantitative information. People’s names are categorical. The
scales are dollars per unit time, and formal names, respectively. Scale representations are like labels on a graph. The physical representations of these data are described by how they are shown on a graph relative to other graph elements. Table 2 summarizes this classification scheme and gives some examples. Imagine a bar chart that shows monthly purchase volumes per customer group.

Cleveland believes that the scale and physical representations compel two types of graphical perception processes. The first process is table look-up, in which scale information is decoded. The second is pattern perception, which is the processing of geometrical patterns in the graph. These two types of processing are quite different. Table look-up focuses on individual values of data as separate pieces of information. Pattern processing has to do with constructing groups of elements that may be interpreted.

Table 2. Cleveland’s (1993b) two-way taxonomy for graphical display information, from a bar chart of purchase volumes by customer segment

Representation   quantitative                            categorical
scale            total purchases per month in dollars    the segment names
physical         bar heights in inches                   bar order on horizontal axis
Each of the processing types has specific operations. Those for pattern perception are detection, assembly, and estimation. These have to do with the apperception of geometrical aspects, the grouping of detected elements, and magnitude estimation of relations between physical values. The types of estimation are discrimination (“Are they the same?”), ranking (“Which is bigger?”), and ratioing. The latter is judging the ratio of two physical quantities, like the ratio of the heights of two bars in a histogram. These three operations are in rough order of increasing difficulty.

The operations for table look-up are scanning, interpolation, and matching. These are ways of getting scale information, and generally have to do with local features. You may scan to locate a data point, and then interpolate between a scatterplot’s axis tick marks to estimate an amount. Matching is used to decode graph legend information, for example. The difficulty of these operations will depend on distances, among other things.

Cleveland has so far applied the model to the analysis of the use of grids, types of plotting symbols, and the importance of a graph’s aspect ratio. We will summarize some of these findings in observations on graph elements, below. A developing theme for designing statistical graphics is to “minimize cognitive load.” If we can reduce the amount of information processing required for graphical perception, we will improve the effectiveness
of our graphs. Cleveland’s theory is one way of answering the question of how we can do this.
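The bar chart behind Table 2's taxonomy can be made concrete with a small sketch. This is a hypothetical matplotlib example (segment names and dollar figures are invented): bar heights carry the physical representation of the quantitative information, bar order carries the categorical information, and the axis labels and tick marks supply the scale representations used in table look-up.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Hypothetical segments and monthly purchase totals (dollars).
segments = ["Loyalists", "Switchers", "Bargain hunters", "New buyers"]
purchases = [41200, 28700, 22100, 9300]

fig, ax = plt.subplots()
# Physical representation of the quantitative information: bar heights.
# Physical representation of the categorical information: bar order.
ax.bar(range(len(segments)), purchases)
# Scale representations, decoded by table look-up: the dollar axis
# and the segment names.
ax.set_xticks(range(len(segments)))
ax.set_xticklabels(segments, rotation=20)
ax.set_ylabel("total purchases per month ($)")
fig.savefig("purchases_by_segment.png")
```

Pattern perception operates on the bar silhouette as a whole (detection, assembly, estimation); table look-up uses the ticks and labels to recover individual values.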
Aesthetics of visually presented information
“Above all else show the data.” E. Tufte

Edward Tufte (1983, 1990) has studied the use of graphical displays, and has become something of the father of “good taste” in graphical design. He believes that displays of numbers can be not only effective in leading the viewer to good decisions, but also attractive. He is concerned with graphical “integrity”: the extent to which a graphic enables veridical perception and correct inference. Like Kosslyn, he believes that parts of a graph lead to expectations about other parts. The rules for integrity include:

• The visual representation in a graph is consistent with the numeric representation. Areas and quantities should be sensibly related.
• Data variation and variation in design features are not confused; data variation should be shown. The rule about axis sizes has to do with this attribute.
• Labeling should be clear and comprehensive.
• Graphical data should be in a relevant context. In Tufte’s words, the question “Compared to what?” should be answered.
Tufte believes that to create good statistical graphics, the designer must have substantive, statistical, and artistic skills. In the context of marketing research, this means that you must understand the marketing issue(s) that the data at hand can address, know how the relevant structure in the data should be summarized, and be able to lay out a display of the summary so that it communicates effectively to the relevant audiences. To improve a graph aesthetically, you should (Tufte 1983, p. 176):
• choose a proper graph format
• use words, numbers, and graph elements together
• show balance, proportion, and a sense of relevant scale
• show an accessible amount of detail
• have a story told in the graphic
• do a careful, professional job constructing the graph
• avoid informationless decoration, i.e., chartjunk
Information density and multiples

Tufte also suggests that graphics should be evaluated in terms of what proportion of all graphical elements (the “ink”) actually represents the data being summarized. He defines the data-ink ratio (DIR) as:

DIR = the proportion of a graphic’s ink used for the non-redundant display of data
    = one minus the proportion of the total elements of a graph that can be erased without a loss of information about the data

The idea is to maximize DIR. What can be erased is chartjunk. Tufte includes in the chartjunk category cartoon-like icons (sometimes stacked in histograms), and also needless color, grids, and fills. This doesn’t mean that graphs shouldn’t be complex. It is just that the complexity must be information-rich. Sometimes DIR can be maximized by using tables rather than graphs. Tufte points out that this is often the case with how pie charts are used. How often have you seen pie charts that represent just a few numbers? These will have small DIRs, and the content would be better communicated in numeric, rather than graphical, form.

Tufte’s (1983, 1990) small multiples notion is based on the idea of maximizing the information density of graphs by shrinking them. Our visual acuity is such that we can discern patterns in physically small areas. Small multiples combine shrunken graphs as if they are frames of a movie. Each frame has the same design so that patterns across the frames can be perceived. Small multiples are similar to Cleveland’s multipanel graph layout. Small multiples and multipanel displays allow multivariate data to be shown in the plane of paper sheets and computer displays. Note that the juxtaposition of graphs with the same design reduces cognitive load for the viewer. One of the interesting design issues in making multipanel displays is how the panels should be arranged. Cleveland et al. have suggested that the panels be laid out using the idea of Cartesian coordinates in two dimensions.
This is the basic X-Y graph we have all become familiar with somehow. The idea is that small values should be at the lower left, and large values at the upper right. This is because the convention for X is to increase from left to right, and for Y to increase from bottom to top.

As an example, consider a dotchart for the importance of service attributes to overall customer satisfaction. Suppose you have importance measures across several customer segments. You can construct a multipanel display of dotcharts, and arrange the panels from lowest to highest on some sensible dimension, like median importance score, or a one-dimensional solution obtained by multidimensional scaling or correspondence analysis. The idea is to order them based on some reasonable dimension of similarity, going from left to right most rapidly, filling in higher and higher rows of panels. Figure 2 gives an example of such a display. In this figure, the segments are arranged by median importance score across all attributes. The median for the group in the lower left panel is the lowest, and the one for the upper right panel is the largest. Within each panel, the attributes are arranged using a similar rule: they are ordered on the vertical axes by their median importance score across groups. The goals of this organization are to base it on a generally known principle, and to accentuate differences in patterns across the panels. Note that the dimension on which the panels are organized, median importance, is a marketing-relevant one. It measures the importance of the set of attributes to each group.
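The panel-arrangement rule just described can be sketched in matplotlib with made-up importance scores (the group count, attribute count, and data are all hypothetical): panels are ordered by group median, filling the bottom row left to right first, and attributes are ordered by their median across groups.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import random

random.seed(0)
n_attrs, n_groups = 8, 6
# Hypothetical importance scores: one list of attribute scores per group.
scores = {g: [random.uniform(0.0, 0.4) for _ in range(n_attrs)]
          for g in range(n_groups)}

def median(xs):
    s = sorted(xs)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2

# Order panels by each group's median score (lowest first).
group_order = sorted(scores, key=lambda g: median(scores[g]))
# Order attributes (shared by every panel) by median score across groups.
attr_order = sorted(range(n_attrs),
                    key=lambda a: median([scores[g][a] for g in scores]))

fig, axes = plt.subplots(2, 3, sharex=True, sharey=True, figsize=(9, 6))
# Fill the bottom row first, left to right, then the top row,
# so values increase toward the upper right (Cartesian convention).
panels = list(axes[1]) + list(axes[0])
for ax, g in zip(panels, group_order):
    ax.plot([scores[g][a] for a in attr_order], range(n_attrs), "o")
    ax.set_title(f"GROUP {g + 1}")
    ax.set_yticks(range(n_attrs))
    ax.set_yticklabels([str(a + 1) for a in attr_order])
fig.savefig("importance_panels.png")
```

Because every panel shares the same axes and attribute order, differences in pattern across groups carry all of the remaining variation.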
[Figure 2 appears here: six dot-chart panels, titled GROUP 1 through GROUP 6, arranged in two rows of three. Each panel plots 24 numbered attributes on the vertical axis against "importance (TE) of performance attribute" (0.0 to 0.4) on the horizontal axis.]
Figure 2. Example of a multipanel display of the importance of 48 service attributes to overall satisfaction in six different segments.
Layering and separation, background, and grids

Tufte (1990) has emphasized layering and separation as an important graphical method. This is the creation of visual strata by using differences in density, grey, color, or proximity. What is crucial for the effectiveness of the method is proper relationships between the information layers. They must be in appropriate proportion and harmony to what the graph is to convey (p. 54). An example of this idea is the relationship between data and a grid they are plotted on. If the grid is too light, locating the data points relative to the axes is difficult. If the grid is too dark, or if the grid lines are too wide, they can obscure the data points.

Another example has to do with tabulations. A tabulation, or table, has cells that contain numbers. These cells need to be separated somehow so that they can be distinguished. Table 3 shows an example table layered in two ways. The bottom table has better layering than the top one. Notice how the numeric entries in the bottom table “leap out” relative to those in the top table. If you look closely at the intersections of the grid in the top table, you may see small white squares. These are in fact an optical illusion, the Hermann grid, thought to result from lateral inhibition in the visual system. Fill patterns can similarly create unwanted moiré patterns in displays. These patterns are generated by our visual systems, and can be quite distracting.
Grids on displays lend themselves to layering issues. They were much more common when plotting was done by hand (which some of us may remember doing), but are making somewhat of a comeback (Carr, 1994). They are often effective in greyscale plots when the background of the plot is light grey and the grid itself is white. This has the effect of making the grid seem more like it is in the background of the data elements, yet it is still available for referencing the location of the data.

Graphic methods for exploring and analyzing data

The distinction between analytic and presentation graphics is often ambiguous, and there is no consensus that the distinction is a useful one. What generally distinguishes displays used for graphical data analysis is that they are usually less elegant, and most are thrown away. Graphical data analysis is undertaken to draw conclusions about data, whereas presentation graphics have more of the flavor of communicating conclusions.

Probably the earliest exploratory displays were maps. Famous examples are those used by John Snow to identify the source of the 1854 London cholera epidemic (Gilbert 1958). Snow plotted the incidence of cases on a London city map, and observed their location relative to sources of drinking water. An important aspect of what Snow did was to use a spatial display of data to discern the possible sources of variation. R.A. Fisher, a founder of modern statistics, believed that assessing variability is at the core of quantitative inference (Fisher 1958).
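The white-grid-on-light-grey treatment described above can be sketched as follows. The data are hypothetical, and matplotlib is assumed rather than the plotting software of the original:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import random

# Hypothetical bivariate data with a rough linear relationship.
random.seed(1)
x = [random.gauss(0, 1) for _ in range(100)]
y = [xi * 0.8 + random.gauss(0, 0.5) for xi in x]

fig, ax = plt.subplots()
ax.set_facecolor("0.9")              # light grey plotting region
ax.grid(color="white", linewidth=1)  # white grid recedes into the background
ax.set_axisbelow(True)               # draw the grid under the data points
ax.scatter(x, y, color="0.2", s=15)
ax.set_xlabel("x")
ax.set_ylabel("y")
fig.savefig("grey_grid.png")
```

The grid stays available for locating points against the axes, but the dark data layer sits clearly in front of it, which is the layering relationship Tufte asks for.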
[running header: AMA Graphical Methods]
Table 1: Examples of bad (top) and better (bottom) layering and separation in tables. Both versions show the same entries, 11.57 (4.56) 1005 and 12.44 (6.70) 1211; the cells have been left mostly empty to emphasize the figure-ground contrast.
It’s interesting to note that the quality movement has made understanding variability an important task for management, and that quality improvement programs often rely on graphical data summaries to make statistical information more accessible.
Another characteristic of graphical analysis that sets it apart from presentation graphics is the centrality of making probabilistic inferences. Such inferences are the main paradigm for analyzing data in Western science. The operations of hypothesis testing and of using confidence intervals are based on probabilistic notions combined with the concept of random draws of observations from populations. Modeling can be thought of as describing the structure and uncertainty in data in some decision-relevant manner. Graphical data analysis, also called visualization, is a way of learning from data (Cleveland 1993b). It has two components: graphing and fitting. It can also be used to test crucial assumptions about models. John Tukey, who according to Tufte made statistical graphics "respectable," is generally considered to be the father of exploratory graphical data analysis. In his seminal doctoral dissertation (which had hand-drawn graphics), Tukey began to describe a collection of visualization tools and elements that he later elaborated in his 1977 book, Exploratory Data Analysis. He also laid much of the groundwork for visualization as it is done today. Some of the more important of his elements are as follows.

Box Plot. This is a device for summarizing the shape and location of a data distribution. As originally conceived, the box plot located five quantiles, plus outliers, on the scale of measurement. The quantiles were the 10th, 25th, 50th (median), 75th, and 90th percentiles. Observations outside the 10th and 90th were represented individually. The "box" enclosed the interquartile range, with the median shown by a line within it. The 10th and 90th percentiles were indicated by "whiskers," and observations outside of them by data points. Figure 3 shows an example of this classic type of boxplot, along with some more recent variations.
These variations include extending the whiskers to 1.5 times the interquartile range, "notching" the box to show a confidence interval for the median, and making box width reflect sample size. Whiskers and other elements can in fact define various quantities, so providing definitions in figure captions or legends can be quite important. Boxplots are probably most useful when two or more are compared. See Figure 4 for an example of boxplots for cereal unit sales. In the left panel, the brands have been arranged haphazardly. In the right panel, they are ordered from top to bottom by decreasing median sales. Note how this organization makes it easier to compare details, while at the same time making the global structure of the unit sales data easier to see. A figure with multiple boxplots answers Tufte's question of "Compared to what?"
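The elements a boxplot draws can be computed directly. Here is a minimal Python sketch, using the 1.5 × interquartile-range whisker rule described above; the function name and data are hypothetical, not from the tutorial.

```python
# Sketch: compute the elements a modern boxplot draws, using the
# 1.5 x IQR whisker rule. Data and function name are illustrative.
import statistics

def boxplot_stats(values):
    xs = sorted(values)
    q1, median, q3 = statistics.quantiles(xs, n=4)  # 25th, 50th, 75th percentiles
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers extend to the most extreme observations inside the fences;
    # anything beyond the fences is plotted individually as an outlier.
    whisker_lo = min(x for x in xs if x >= lo_fence)
    whisker_hi = max(x for x in xs if x <= hi_fence)
    outliers = [x for x in xs if x < lo_fence or x > hi_fence]
    return {"q1": q1, "median": median, "q3": q3,
            "whiskers": (whisker_lo, whisker_hi), "outliers": outliers}

stats = boxplot_stats([2, 3, 4, 5, 5, 6, 7, 8, 30])
```

Any plotting routine can then render the box from `q1` to `q3`, the median line, the whiskers, and the outlier points.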
[Figure 3: Some flavors of boxplots. Panels: Classic; The New Box; Box w/ CI Notch.]
[Figure 4: Two organizations for multiple boxplot displays. Left panel: boxplots of cereal brand sales, brands arranged haphazardly; right panel: brands ordered by median sales. Brands: GI Doom, Capt. Goof, Axe-2-Grinde, New Shoes, Fred's Rubble, Soap, Drudge Lite, 9 Livers.]
N = 254   Median = 270.25   Quartiles = 194, 365
Decimal point is 2 places to the right of the colon

0 : 8
1 : 000011122233333333333344444
1 : 55555566666677777778888888899999999999
2 : 0000000111111111111222222233333333444444444
2 : 555556666666666777778889999999999999999
3 : 000000001111112222333333333444
3 : 55555555666667777777888888899999999
4 : 0122234
4 : 55555678888889
5 : 111111134
5 : 555667
6 : 44
6 : 7

High: 756.0 869.5
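A display like Box 2's can be produced with a few lines of code. This is a minimal sketch (hypothetical function name, invented data) that rounds each observation to a leaf digit and groups leaves by stem:

```python
# Minimal stem-and-leaf generator: one stem per leading digit of the
# rounded value, leaves are the trailing digit. Illustrative sketch only.
from collections import defaultdict

def stem_leaf(values, leaf_unit=1):
    # Round each value to its leaf digit, then split stem from leaf.
    stems = defaultdict(list)
    for v in sorted(values):
        scaled = round(v / leaf_unit)
        stems[scaled // 10].append(scaled % 10)
    return [f"{stem} : {''.join(str(leaf) for leaf in leaves)}"
            for stem, leaves in sorted(stems.items())]

# leaf_unit=10 mimics Box 2: values rounded to the nearest tens digit.
lines = stem_leaf([83, 112, 118, 240, 247, 256, 641, 643], leaf_unit=10)
```

Printing `lines` one per row gives the digit "bar chart" described below, without any graphics software at all.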
Box 2: Stem-and-leaf plot.

Stem-and-Leaf Plot. Tukey's ingenious creation is a way of describing the shape of a distribution while preserving much information about individual observations, and it requires only ordinary typographic characters to render. It is really a hybrid between a table and a graph. Box 2 gives an example, summarizing the distribution of 254 observations. In its center you'll note what appears to be a bar chart made of digits and turned on its side. Each digit to the right of the column of colons (":") is one observation. The numbers to the left of the colons are "stems," and those to the right are "leaves." The top row, the "0" stem, indicates a single observation of about 80, since the actual decimal point is located two digits to the right of the colon. Because the decimal is two digits to the right, the plot shows values rounded to the nearest tens digit; the observation on the 0 stem is actually 83. The "6" stems show observations in the 600s: the first has two leaves of 4, indicating two observations near 640. Note that despite how simply this display can be created, it offers more information than the typical histogram. Stem-and-leaf plots can be enhanced by adding stem frequencies, proportions, and so on. In the example, two high "outliers" are identified, along with the 1st, 2nd, and 3rd quartiles.

Mean-difference Plot. This is a device for making departures from a line more apparent. It is also called a "sum-difference" plot. It is like a scatterplot of two variables, say x and y, that has been rotated 45°. Plotted on the new vertical axis is x - y, and on the horizontal axis either x + y or (x + y)/2. Consider Figure 5. It shows two ways of looking at how sales of a particular barbeque grill differed between a metropolitan area's suburban ("S") and urban ("U") areas as a function
of prevailing average barbeque price. The data were collected over the period of a year. Sales were measured in average monthly sales volume per 1000 households. The question of interest is "How did sales differ as a function of price level?" Some possibilities are that sales didn't differ, that they differed by a constant amount, or that they differed in price sensitivity.

[Figure 5: Empirical QQ and Tukey m-d plots. Left panel: empirical QQ plot of volumes (x is urban quantiles, y is suburban); right panel: Tukey m-d plot (x is mean of quantile values, y is difference).]
The left panel gives an empirical quantile-quantile (QQ) plot of the data for S and U. The plot is constructed by dividing the data into quantiles (here, 20) and plotting the numerical values corresponding to the quantiles, pairwise. If the distributions for S and U are very similar, the points in the QQ plot will be close to the diagonal, slope = 1, line. If the locations of the S and U distributions differ by a constant, the points should fall on a line parallel to the diagonal. If the distributions have different widths, which might indicate differences in price sensitivity, the points would instead lie on a line whose slope differs from 1, or on a curve. The right panel is the Tukey m-d plot for the quantile pairs. Here, the difference between the S and U quantile-value pairs is plotted as a function of the pair means. Note how this plot makes it easier to see whether the points in the left panel fall on a line.
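The two constructions just described amount to a quantile pairing followed by a 45° rotation. A short Python sketch (invented data, hypothetical function names):

```python
# Pair up the quantiles of two samples (empirical QQ), then rotate into
# Tukey's mean-difference form. Sample data are made up for illustration.
import statistics

def qq_pairs(x, y, n_quantiles=20):
    qx = statistics.quantiles(sorted(x), n=n_quantiles)
    qy = statistics.quantiles(sorted(y), n=n_quantiles)
    return list(zip(qx, qy))

def mean_difference(pairs):
    # Plot y - x against the pair mean: a 45-degree rotation that turns
    # "points near the diagonal" into "points near a horizontal zero line".
    return [((a + b) / 2, b - a) for a, b in pairs]

pairs = qq_pairs(range(100), range(5, 105))   # y is x shifted by a constant 5
md = mean_difference(pairs)
```

Because the second sample is the first shifted by a constant, every m-d difference is 5: the "parallel to the diagonal" case becomes a flat horizontal line, which is exactly what the rotation is for.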
QQ plots are often used to test assumptions about how residuals from statistical model predictions are distributed. An example of this application is given below.

Other methods and tools

One-dimensional scatterplots. Scatterplots can be drawn on a single line. Figure 6 gives the cereal brand sales data again. The plot symbol '|' has been jittered: each data point has had a small random amount added to it to make locations more discernible. Jittering can be used in a range of plots. It can be particularly helpful when looking at rating scale data.
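Jittering is just the addition of a small random offset before plotting. A sketch (the amount and seed are arbitrary choices, not from the tutorial):

```python
# Jittering: add a small uniform random offset to each point so that
# coincident values separate visually.
import random

def jitter(values, amount=0.15, seed=0):
    rng = random.Random(seed)  # seeded so the display is reproducible
    return [v + rng.uniform(-amount, amount) for v in values]

ratings = [3, 3, 3, 4, 4, 5]        # e.g. rating-scale data with many ties
jittered = jitter(ratings)
```

Keeping the jitter amount well below the smallest meaningful gap between values (here, 1 rating point) preserves the data's structure while breaking the ties.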
[Figure 6: One-dimensional scatterplots of daily sales by brand: Capt. Goof, New Shoes, GI Doom, Fred's Rubble, 9 Livers, Soap, Drudge Lite, Axe-2-Grinde.]

Scatterplot matrices. This is a kind of multipanel display that is a matrix of plots. Figure 7 gives an example. Shown in it are a measure of annual product consumption and four characteristics of some customers, who in this case are firms. Other information can be included in a scatterplot matrix. For example, we may want to
[Figure 7: Scatterplot matrix of customer characteristics: ann.revs, FTEs, share, emp.age, purc.quant.]

include straight lines or "smoothers" to make it easier to appreciate the relationships between the quantitative variables. Figure 9 shows the data with least squares and loess (locally weighted regression) functions included. Sometimes the fits can be better appreciated without the data points, if the latter are large in number.

Dot plots. These are a useful tool that should see more use now that conventional statistical software can create them. Dot plots, or dot charts, summarize the relationship between categorical and continuous measures. Figure 1 included an example of a simple dot plot. Dot plots can also be combined in multipanel displays, like many other graphical methods. See Figure 2 for an example.

Icons and plotting symbols. Icons are devices for representing more than two dimensions on the surface of a page or screen. They are really plotting symbols whose characteristics can vary according to different measures. Perhaps the most memorable statistical icons are Chernoff's (1973) faces. Chernoff's idea was to assign variables to facial features so that differences across observations might be observed. It turns out that how variables are assigned to the facial features makes a big difference in what differences are detected, which isn't all that surprising. Figure 8 shows some faces for the firms whose data were plotted in the scatterplot matrix,
[Figure 8: Chernoff's face icons for encoding multivariate observations; faces numbered 1-13.]
above. The faces are laid out based on the first solution from a metric multidimensional scaling of the firms. They are thus ordered in terms of "similarity," going down the columns from top to bottom first, and then across columns. Purchase amount is coded in the curve of each face's smile. The other variables, annual revenues, FTEs, market share, and employee age, are coded as face area, face shape, nose length, and nose curve, respectively.

Figure 11 gives an example of star icons used to represent multivariate data. Each star has five points. The variables are assigned counterclockwise, starting from the right of center (at "3 o'clock"). The distance of each point from the center depends on that variable's value for a firm relative to all the firms.

A very useful plotting icon is the thermometer. This symbol is a rectangle that can be filled just as mercury fills the column of a thermometer. The outside of each thermometer is a fixed reference frame, making it easier to compare quantities. Figure 10 shows thermometers representing the proportion of customers in each state, superimposed on a map.

Sunflowers are an icon that can show the density of points at a single location: the larger the number of points, the bigger the bud. Sometimes the size of ordinary symbols like circles and squares is used to encode the number of observations with the same value; jittering may in general be a better technique. Symbol size can also be used to encode a third variable on a plot. A famous example from the world of strategy is the Boston Consulting Group's bubble plot, in which circle areas encode profits or revenues for product lines.
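The star-icon geometry just described, one ray per variable assigned counterclockwise from 3 o'clock, is easy to compute. A sketch, assuming values have already been scaled to [0, 1] relative to all the firms:

```python
# Sketch of star-icon vertex geometry: one ray per variable, assigned
# counterclockwise starting at "3 o'clock", ray length proportional to
# the variable's scaled value. Names and data are illustrative.
import math

def star_vertices(scaled_values, cx=0.0, cy=0.0, radius=1.0):
    k = len(scaled_values)
    points = []
    for i, v in enumerate(scaled_values):
        angle = 2 * math.pi * i / k      # counterclockwise from 3 o'clock
        r = radius * v
        points.append((cx + r * math.cos(angle), cy + r * math.sin(angle)))
    return points

pts = star_vertices([1.0, 0.5, 0.25, 0.75, 1.0])   # five variables, one star
```

Connecting consecutive vertices (and closing the polygon) draws one star; one call per firm produces a panel like Figure 11.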
[Figure 9: Scatterplot matrix (ann.revs, emp.age, share, FTEs, purc.quant) with linear least squares (dotted) and loess (solid) functions fitted.]

Different symbols can be used to encode groups, but some care should be taken to choose symbols that can be distinguished. Where distinguishability breaks down is in plot areas that have high data density. There is some research indicating that simpler symbols generally work better than complicated ones (like letters). There are some (e.g., Cleveland) who believe hollow symbols are better than solid ones, since hollow symbols can be more easily distinguished as separate entities.
Multipanel displays. I mentioned these above. It's worth pointing out that the multipanel method can be used with a wide variety of data. Figure 12 is a multipanel display that shows predicted demand in kilotons for "YoPo," a pork-flavored yogurt, as a function of the prices of bacon, Spam, and pork rinds, a salty snack. (The names of the real products have been cleverly disguised here.) The surface shown as a wire grid was fit using a nonparametric surface smoother. In each display panel, predicted YoPo demand reflects the prices of bacon and Spam, which appear to be competitors. The cross-elasticities of demand for YoPo are for the most part positive in each panel, based on the apparent slope of the surface in the
direction of each price axis. What varies over panels is the average price of pork rinds, which goes from its lowest value in the lower left to its highest value in the upper right. This ordering was the organizing heuristic for the display.

[Figure 10: Thermometer icons.]

It's worth noting that visualizing model results like these without getting some idea of the accuracy of prediction is not a good idea. One reason is that viewers of the graph will infer greater accuracy than exists. One way to show uncertainty for the example at hand is to plot pointwise standard errors of the model predictions for various margins. That is, for a panel in the example, you could hold one predictor, say Spam price, constant, and plot the predicted YoPo demand as a function of bacon price, with standard errors. These marginal predictions could be arranged in another display. Another option is to graph a confidence-interval slab in three dimensions and provide various slices through it. This can get messy, but it works out sometimes. An important goal in statistical graphics is to summarize both structure and uncertainty. (Another goal worth considering is having fun.)
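The panel-organizing idea behind Figure 12, splitting observations into panels by ordered intervals of a conditioning variable, can be sketched in a few lines. The function name and data are hypothetical:

```python
# Sketch of multipanel conditioning: sort rows by a conditioning variable
# (pork rind price, in Figure 12's case) and split them into equal-sized
# panels ordered from its lowest to its highest values.
def panels_by_condition(rows, key, n_panels=3):
    ordered = sorted(rows, key=key)
    size = -(-len(ordered) // n_panels)          # ceiling division
    return [ordered[i * size:(i + 1) * size] for i in range(n_panels)]

rows = [{"rind_price": p, "demand": d}
        for p, d in [(1.0, 9), (2.5, 6), (0.5, 11), (3.0, 5), (1.5, 8), (2.0, 7)]]
panels = panels_by_condition(rows, key=lambda r: r["rind_price"])
```

Each panel's rows would then be plotted in its own cell, lowest conditioning values in the lower left and highest in the upper right.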
[Figure 11: Star icons for the customer data; stars numbered 1-13.]
Model fits, transformations, and distribution shapes

Model development and graphics are usually both involved in visualization. One important aspect of development is testing model assumptions. As an example, consider the conventional linear regression model, which we have all used so often. Two of the necessary assumptions are:
- the residuals are independently and normally distributed, with mean zero and some variance σ²;
- the residuals are not related to the predicted values.
Whether these assumptions are in fact tenable for a given model and data set can be investigated graphically.
[Figure 12: Demand for YoPo, the pork yogurt, as a function of the prices of Spam, bacon, and pork rinds.]

Let's consider the customer data on firms mentioned earlier. Suppose we are investigating how the annual purchase amounts of these customers depend on their annual revenues, number of FTEs, and employee age. After doing some exploration, which is likely to include looking at scatterplots as in the scatterplot matrix described earlier, we decide we are interested in linear dependencies. So, we fit the following linear regression model:
predicted.purchase.quantity = a.constant + another.constant * revenues + still.another.constant * FTEs + a.fourth.constant * mean.employee.age

or

purc.quant_i = β0 + β1 * ann.revs_i + β2 * FTEs_i + β3 * empl.age_i + ε_i
The * character means multiply, and the subscript i indexes customers. In the second model equation we've added an epsilon, ε, representing the difference between the observed purchase quantity and the predicted one. Note that the second equation is for one customer. The customers are replications, and are assumed to have the same values for the β's, which are the regression coefficients. Box 3 gives the usual results. The number of customers in this analysis ("N") is 13.
Coefficients:
              Value  Std. Error  t value  Pr(>|t|)
(Intercept)  71.648      14.142    5.066     0.001
ann.revs      1.452       0.117   12.410     0.000
FTEs          0.416       0.186    2.242     0.052
emp.age      -0.237       0.173   -1.365     0.205

Residual standard error: 2.31 on 9 degrees of freedom
Multiple R-squared: 0.982
F-statistic: 167 on 3 and 9 degrees of freedom; the p-value is 3.32e-008
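Coefficient estimates like those in Box 3 come from ordinary least squares, which any statistics package computes. As a minimal sketch of what is going on underneath, here is a pure-Python fit via the normal equations (X'X)b = X'y; the data are invented for an exact-fit check, not the customer data:

```python
# Minimal OLS sketch: build the normal equations and solve them by
# Gaussian elimination with partial pivoting. Illustrative data only.
def ols(X, y):
    X = [[1.0] + list(row) for row in X]            # prepend intercept column
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    for col in range(k):                            # forward elimination
        pivot = max(range(col, k), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[pivot] = xtx[pivot], xtx[col]
        xty[col], xty[pivot] = xty[pivot], xty[col]
        for r in range(col + 1, k):
            f = xtx[r][col] / xtx[col][col]
            for c in range(col, k):
                xtx[r][c] -= f * xtx[col][c]
            xty[r] -= f * xty[col]
    beta = [0.0] * k                                # back-substitution
    for r in reversed(range(k)):
        beta[r] = (xty[r] - sum(xtx[r][c] * beta[c]
                                for c in range(r + 1, k))) / xtx[r][r]
    return beta

# Data generated exactly by y = 2 + 3*x1 - 1*x2, so OLS recovers [2, 3, -1].
beta = ols([[1, 1], [2, 0], [0, 3], [4, 2]], [4, 8, -1, 12])
```

In practice you would also want the standard errors and residual variance that Box 3 reports; a library routine is the sensible choice for real work.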
Box 3: Linear regression results.

From these results, it would seem that the three characteristics can explain a large amount of the variation in purchase quantity. The quality of the coefficient estimates and the validity of the standard-error estimates depend on the extent to which the assumptions are met. Most of these assumptions can be examined graphically, but here we'll look only at the two mentioned earlier. One way of assessing whether σ² is really constant is to plot the residuals versus the predicted purchase quantities and see whether any structure is apparent. Figure 13 shows a scatterplot of the standardized residuals versus the predicted quantities. Included is a smoother (the solid, mostly horizontal curve) to help reveal whether the quantities are not independent. The points labeled by numbers are observations outside the interval ±2σ. If σ² is in fact not constant but varies across observations, we would expect to see some structure in this plot. Common patterns are a funnel shape getting wider from one side to the other, and a bow shape. The latter is typical of data like proportions, in which the variance is related to the mean. The patterns can be used to suggest transformations of the variables. Figure 14 is a QQ plot similar to what we saw earlier, but in this one the horizontal axis describes what the residual quantiles would be if they were normally distributed. This is sometimes called a theoretical QQ plot. The idea is to make it easier to see whether the observed data are likely to have been generated by the theoretical distribution. If so, the points should be close to, and randomly varying about, the diagonal dashed line, which is the locus of perfect correspondence. You can see from Figure 14 that the assumption that the residuals are normally distributed is tenable. We can therefore feel confident when examining the coefficient estimates and their standard errors.
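The point coordinates for a theoretical QQ plot like Figure 14 are simple to compute: sort the residuals and pair each with the normal quantile at a plotting position such as (i - 0.5)/n. A sketch with illustrative residuals:

```python
# Theoretical (normal) QQ plot coordinates: pair sorted residuals with
# standard-normal quantiles at plotting positions (i + 0.5)/n.
from statistics import NormalDist

def normal_qq_points(residuals):
    std_normal = NormalDist()              # mean 0, standard deviation 1
    xs = sorted(residuals)
    n = len(xs)
    return [(std_normal.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(xs)]

points = normal_qq_points([0.3, -1.2, 0.1, 2.0, -0.4, -0.1, 0.8])
```

Plotting these pairs, with the diagonal as a reference line, reproduces the construction; points hugging the diagonal support the normality assumption.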
Axis lengths and the Aspect Ratio

A topic of long standing in statistical graphics is how the axis lengths in bivariate plots should be related. According to Kosslyn (1994, p. 66), the aspect ratio, the ratio of a graph's vertical axis length to its horizontal axis length, should be such that differences in the
[Figure 13: Residuals vs. fitted values (ann.revs + FTEs + emp.age), with loess smoother, for the customer data linear regression model; observations 13 and 8 are labeled.]
[Figure 14: Normal QQ plot of model residuals; horizontal axis is quantiles of the standard normal; observations 6, 13, and 8 are labeled.]
data cause corresponding, perceivable graph differences. This correspondence is what he calls the principle of compatibility. The idea may not always be easy to implement. Consider data that are repeated measures over time, like measures of customer satisfaction and repeat purchase probabilities. The typical plot design has time running from left to right and the repeated measures on the vertical axis. The question is, what should the ratio of the physical axis lengths be to most effectively reveal interesting patterns? Look at Figure 15, which shows four different aspect ratios for a time series plot of monthly widget sales. The aspect ratios for the top three panels are 1.0, 0.5, and 0.1, going from top to bottom. The aspect ratio in the bottom panel uses the heuristic that the average absolute slope of all data line segments should be 45° (Cleveland 1993a). This is based on the idea that the discriminability of slope changes should be maximized. In this example, the heuristic ratio is somewhat larger than 0.1. Which of these panels do you think reveals the most about these data?
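One simple reading of the 45° heuristic can be computed directly: choose the aspect ratio so that the mean absolute orientation of the plotted line segments is 45°. This sketch (a simplification of Cleveland's banking idea, with invented data; it assumes strictly increasing x values) solves for that ratio by bisection:

```python
# "Bank to 45 degrees" sketch: find the aspect ratio at which the mean
# absolute orientation of the physical line segments equals 45 degrees.
import math

def bank_to_45(x, y, lo=1e-3, hi=1e3):
    # Slopes in data units, normalized by the data ranges, so that the
    # aspect ratio scales physical segment slopes directly.
    xr = max(x) - min(x)
    yr = max(y) - min(y)
    slopes = [(abs(y1 - y0) / yr) / (abs(x1 - x0) / xr)
              for x0, x1, y0, y1 in zip(x, x[1:], y, y[1:])]

    def mean_angle(aspect):
        return sum(math.atan(s * aspect) for s in slopes) / len(slopes)

    for _ in range(60):                 # geometric bisection on the aspect
        mid = math.sqrt(lo * hi)
        if mean_angle(mid) < math.pi / 4:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

# A zigzag whose normalized segment slopes are all 3 banks at aspect 1/3.
aspect = bank_to_45([0, 1, 2, 3], [0, 1, 0, 1])
```

For steep, rapidly oscillating series the heuristic typically returns a short, wide panel, much like the bottom panel of Figure 15.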
Expect that picking the most useful aspect ratio will require some experimentation.
Plotting to answer the question

In addition to following Tufte's rule to "show the data," you should also plot what you need to answer the important questions about the data. You may find that you need almost one graph for every important question you address. Trying to answer too many questions with a single graph leads to needless complexity and difficulty for the viewer.
This rule of thumb is valuable when time series data are to be compared. Consider Figure 17. It shows a customer satisfaction measure, the proportion of satisfied customers, by month over a six-year period for "us" and arch rival "them." This time series plot is useful for observing how we and they are doing over time. But if the important questions are along the lines of "How are we doing compared to them?", "Are we maintaining our lead?", or "Are they catching up to us?", then something more like Figure 16 answers them more directly. Note that you wouldn't want to combine the two plots. You might use them both.
[Figure 15: Time series plots for widget sales, 1966-1974. Aspect ratios from top to bottom are 1, 0.5, 0.1, and heuristic.]
[Figure 17: Time series of the proportion of satisfied customers, "us" and arch rival "them," 1988-1994.]
[Figure 16: Time series of the difference between proportions of satisfied customers, ours minus theirs, 1988-1994.]
Color

Color can be a good way to encode statistical information if it is used thoughtfully. Generally speaking, it should be used with some reserve, bearing in mind that judging differences in hue and brightness is relatively difficult. For encoding values of a quantitative variable, both Tufte and Cleveland suggest using a mixture of two hues whose saturations vary. Saturation is essentially how pure a color is: the more saturated, the purer. Tufte suggests using complementary hues, and taking into account our spectral sensitivity function when assigning numeric values to color values. The spectral sensitivity function describes our sensitivity to light as wavelength varies; in daytime, normal humans are more sensitive to yellow-green than to other colors. He points out that "the mind's eye does not readily give an order to ROYGBIV," so arbitrary color assignment and rainbow encoding are not good techniques. (ROYGBIV is an acronym for the colors around Newton's color wheel: red, orange, yellow, green, blue, indigo, and violet.) Cleveland attends to the problems of reproducing graphics, and encourages use of the basic colors of printed reproduction: cyan, magenta, yellow, and black. The encoding scheme he suggests is "Two Hues, Varying Lightness," or THVL. The idea is to pick two basic colors, such as cyan and magenta, and mix them in levels across the range of the variable being color encoded. He observed that using more than 10 levels will usually result in discrimination problems. Color coding is generally most appropriate, and easiest to use, for representing categorical data.

3D and Animation

Our displays are for the most part flat. Often the graphic design problems we are trying to solve concern how to accurately represent more than two variables on this plane. Some of the methods we've already considered encode more than two measures. Figure 12, for example, encodes four: three prices, and demand.
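The THVL mixing described above, two basic colors blended in a small number of levels, can be sketched as simple RGB interpolation. This is one simplified reading of the scheme, not Cleveland's exact recipe; the hues, level cap, and function name are illustrative:

```python
# Sketch of a "two hues, varying lightness" style palette: interpolate
# between two printable hues (cyan and magenta here) in a small number
# of levels, ten or fewer per the discrimination advice above.
def thvl_palette(n_levels, hue_a=(0, 255, 255), hue_b=(255, 0, 255)):
    assert 2 <= n_levels <= 10, "more than 10 levels are hard to discriminate"
    def mix(t):
        return tuple(round(a + t * (b - a)) for a, b in zip(hue_a, hue_b))
    return [mix(i / (n_levels - 1)) for i in range(n_levels)]

palette = thvl_palette(5)
```

Each data value is then binned into one of the levels and drawn in that level's color, keeping the number of distinct colors small enough to discriminate reliably.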
Each panel in Figure 12 is a 3D plot. Another example is the 3D scatterplot you've likely seen. But these examples are "static." They are essentially projections of 3D space onto planes, like Abbott's (1952) "Flatland." In creating such displays, we try to use depth cues in angles and brightness to trick the mind of the viewer. Another static cue arises when one object obscures another: the obscured one will usually be judged to be farther away. How shadows fall, and texture gradients, are other real-world cues to depth. Yet another cue is "retinal disparity": the difference between the images of the same object that fall on the insides of your two eyes. Somehow our visual systems fuse these nonidentical optical images into a single perceptual object that has depth, distance, and so on. This is the Cyclopean vision described by Julesz (1971). By using dynamic displays, we can trick the mind's eye more effectively. It's well known
that relative motion is a powerful cue for depth perception. Objects that move faster across the visual field are generally perceived to be closer. So an animated display can encourage the perception of separation in 3D by varying the speed at which display elements such as points traverse the screen. You've probably seen interactive displays for exploratory analysis. The most common form seems to be the 3D scatterplot, or "cloud" plot. Additional variables can be encoded into these displays as well, by using symbol size and color. Animation can also be used to illustrate certain kinds of simulation so that the implications of models can be better understood. Consider, for example, the diffusion of a new product into the targeted market. Differences between scenarios and their associated assumptions might be summarized by showing the simulated growth of market share under each, over a rescaled time base. Another example application is showing the sensitivity of average satisfaction level to equivalent marginal changes in various satisfaction "drivers." Such uses of animation are like the demonstrations used to illustrate phenomena in the physical sciences.

Telling a graph's story with the PGP principle

Tufte has suggested that good graphs tell a story. Sometimes this story takes a little explaining to viewers. Mosteller (1980) has described a technique that can be useful in doing this. He calls it the Particular-General-Particular (PGP) strategy, and he attributes it to R. Rourke, a colleague. The idea is to begin by explaining a particular important feature consistent with the overall message. Next, the message itself is covered, which often will be the answer to the most important question the graph is designed to answer. Finally, another particular feature is described to solidify the general idea. This strategy is consistent with the definition of complete concept learning, and with some current theories of knowledge acquisition and organization.
In application, however, it can be difficult, maybe in part because you are forced to be explicit about what a graph is really about. As an example, here's an attempt at using PGP to describe Figure 10, the thermometer plot of the distribution of customers across the continental United States. Assume that the question the display was designed for is "Where do our customers come from?" Then the message has to do with their geographic distribution and concentrations in various states. We'll also assume that the thermometers are an unfamiliar feature, so they must be defined:
"Figure 10 shows how our customers are distributed across the U.S. The proportion of our total customers in a state is shown in a "thermometer," and the higher the temperature, the larger the proportion.
Colorado has a relatively large proportion, since its thermometer shows a relatively high value. It by no means has a majority of our customers, since its level doesn't exceed the thermometer's middle point. Our customers come from the Western states, Middle South, and Midwest, with a small number from New England. They are not concentrated in any particular region, nor in any specific state. The Upper Great Plains states are not represented in our customer base: no customers are found in Montana or the Dakotas. These, as well as other unrepresented but populous states like Texas and Florida, probably deserve scrutiny in terms of market potential."

To fully comprehend the message, the viewer needs both to understand how the thermometer icons work and to interpret their pattern over geography. I began above by defining the thermometers as a necessary precursor to the particular about Colorado. In the next paragraph, I make explicit the general message of the graph. This is followed by particulars about states that have no customers, and therefore no thermometers. I'll bet you can come up with better example applications of the PGP principle. A useful variation is to pick out a feature of the graph to define, and direct the viewer's attention to that feature first.

Higher dimensions, virtual worlds, distributed graphing

Data in more than three dimensions often resist being displayed effectively on the flat surface of a page or screen. There are many real-world phenomena whose true mathematical descriptions cannot be reduced to fewer dimensions without significant distortion or information loss. What this means, for example, is that you can look at, say, seven-dimensional data in all 21 two-at-a-time pairwise combinations and still not be able to fully appreciate the data's structure. It's a case of the whole being larger than the sum of the parts.
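The count of pairwise views is just 7 choose 2: seven dimensions taken two at a time, i.e., the panels a 7-variable scatterplot matrix would show. A one-liner with hypothetical dimension names:

```python
# Enumerate the two-at-a-time views of seven-dimensional data:
# 7 choose 2 = 21 pairwise combinations.
from itertools import combinations

dims = ["d1", "d2", "d3", "d4", "d5", "d6", "d7"]
pairwise_views = list(combinations(dims, 2))
```

Each pair corresponds to one bivariate scatterplot; the point of the paragraph above is that even all 21 of them together can miss higher-order structure.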
Scientists have attempted to get around this problem by using some of the methods described above to encode dimensions beyond three. One of Tufte's (1983) favorite examples of a higher-dimensional display is an historical map of the Napoleonic campaign against Moscow (p. 177), constructed by Minard in 1869. This rather remarkable graphic shows six different variables. As the number of data dimensions increases, however, the cognitive demands on the viewer tend to increase quite rapidly. As the number of variables increases, the remaining ways of rendering them distinguishable from each other become more removed from our consensual experience with graphs. Techniques from the rapidly growing area of computer science called virtual reality ("VR") may provide some assistance in rendering data of high dimensionality. In the future, VR
technologies may allow us to represent the dimensions of data using a wider range of our senses. In the meantime, the available methods already allow us to use powerful metaphors of time and space. VR methods can let us exploit the intuition we already have about the world we live in.

VR applications are becoming available for viewing on the World Wide Web (WWW) that are based on Virtual Reality Markup Language (VRML, pronounced like "vermille") version 1.0. VRML is based on a language developed by Silicon Graphics, Inc., and was officially introduced in 1995. It allows you to "fly through" a scene of objects in space on different trajectories, and to move around in the space generally. You view the space using a browser much like those used to view hypertext markup language (HTML) documents on the WWW and LANs. Imagine being able to travel through a three-dimensional scatterplot, looking for interesting patterns! Version 2.0 of VRML should find industry acceptance during 1996. It will add animation as well as some other new features to the methods already included in version 1.0.

A main part of the information revolution these days has to do with connectivity and distributed processing. The latter concerns using computing resources that are spread out across local area networks ("LANs") and wide area networks (i.e., the internet). Graphing data, as a computer activity, will benefit from advances in connectivity. It will soon be possible to use graphing tools and data from remote locations on demand. Developers of graphing software will be able to make it available to users on a per-use basis, and to provide updates and custom applications in a way that is transparent to the software user. Breakthroughs like Sun Microsystems' JAVA language are enabling rapid progress in distributed processing. JAVA makes it possible to develop programs that are easily distributed across networks, and that do not depend on the kind of hardware they run on.
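The fly-through scatterplot idea above can be sketched concretely: a short Python script can emit a VRML 1.0 scene from a set of data points, placing one small sphere per point, which a VRML browser could then navigate. The data values, file name, and sphere radius here are made up for illustration, and the node syntax is my reconstruction of the VRML 1.0 conventions, not a vetted scene file.

```python
# Sketch: generate a VRML 1.0 "3-D scatterplot" -- one sphere per data point.
# Hypothetical data; in practice these would be three measured variables.
points = [(0.1, 0.5, 0.2), (0.8, 0.3, 0.9), (0.4, 0.7, 0.6)]

lines = ["#VRML V1.0 ascii"]  # required VRML 1.0 header line
for x, y, z in points:
    # Each Separator positions a small sphere via a Translation node.
    lines.append(
        "Separator { Translation { translation %.2f %.2f %.2f } "
        "Sphere { radius 0.05 } }" % (x, y, z)
    )
scene = "\n".join(lines)

with open("scatter.wrl", "w") as f:  # .wrl is the usual VRML file extension
    f.write(scene)
```

Loaded into a VRML browser, a scene like this would let the viewer orbit and fly through the point cloud rather than settling for fixed two-dimensional projections.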
A few of the first examples Sun provided of JAVA involved interactive, real-time graphics. These can be viewed on their website, http://java.sun.com.

Tools You Can Use

Not surprisingly, there appears to be no single tool that can meet all your needs. Depending on what you are trying to accomplish, you might get by with one of the commercial statistical packages. The graphic capabilities of these have improved substantially in the last five years as the Microsoft Windows, X-Windows, and Macintosh GUIs (graphical user interfaces) have diffused through the small computer/workstation marketplace. There are times, however, when you may need to edit or add features to the graphics produced by your statistical software. To do these things, you can use a variety of desktop publishing and drawing programs. You may occasionally find that your statistics packages can't create a specific visualization you need. You will then have to resort to more specialized mathematical rendering and graphing tools, or even libraries of graphics functions for a computer language like BASIC, Fortran, or C. Below I've listed some of the more common tools of the types mentioned. Note that the
classification is somewhat arbitrary in that several of the programs listed in different categories actually share a number of features.

General statistical packages with (at least some) useful graphics capability: SAS, SAS PC, SAS JMP, SPSS/PC, SYSTAT (DOS and Windows), STATISTICA, MINITAB, S-Plus.

Statistical graphics programs: DeltaGraph, Harvard Chart XL, ORIGIN, BMDP/Diamond, Stanford Graphics, PSI-Plot.

Math and graphing programs: Maple V, Mathematica, Macsyma, Matlab, MathCad, GAUSS, Spyglass Transform, Spyglass Slicer/Dicer, Spyglass Plot, PV-WAVE, BBN Cornerstone.

Desktop publishing graphics programs: Adobe Illustrator, Aldus (Adobe) FreeHand, Corel Draw/Paint, Hijaak Pro/Draw (Inset Systems), Axum, Freelance Graphics.

Some useful direct-mail sources for software that can provide you with a catalog are:
SciTech: Software for Science, (800) 622-3345, http://www.scitechint.com/scitech/

The Programmers Supershop, (800) 421-8006, http://Computing.SuperShops.com

Sources on the internet

Following are some internet sites that might be of interest to you. Some have public domain software for downloading.

http://www.gdb.org/Dan/Softsearch/graph-links.html has an archive of software.

http://lib.stat.cmu.edu/ is StatLib at Carnegie Mellon University. It includes a range of references and software.

http://gams.cam.nist.gov/OtherSources.html lists sources of software and data.

http://ftp.ncsa.uiuc.edu/Visualization/ has materials and software related to computer visualization at the National Center for Supercomputing Applications.

The next two might be worth looking at to get a better idea of things to come:

http://java.sun.com
The website for Sun's JAVA language.

http://sdsc.edu/vrml/ The official repository for virtual reality markup language at the San Diego Supercomputing Center.

Printed References

Abbott, E.A. (1952), Flatland: A Romance of Many Dimensions, 7th Ed., New York: Dover.
American Statistician, Alexandria VA: American Statistical Association.
Baird, J.C. and E. Noma (1978), Fundamentals of Scaling and Psychophysics, New York: John Wiley.
Bertin, J. (1983), Semiology of Graphics, Madison WI: University of Wisconsin Press.
Carr, D. (1994), "Topics in scientific visualization: Using grey in plots," Statistical Computing & Graphics, 5(2), Aug, 11-14.
Chambers, J.M., W.S. Cleveland, B. Kleiner and P. Tukey (1983), Graphical Methods for Data Analysis, Pacific Grove CA: Wadsworth.
Chernoff, H. (1973), "The use of faces to represent points in k-dimensional space graphically," Journal of the American Statistical Association, 68.
Cleveland, W. and R. McGill (1984), "Graphical perception: Theory, experimentation, and application to the development of graphical methods," Journal of the American Statistical Association, 79, 807-822.
------ and R. McGill (1985), "Graphical perception and graphical methods for analyzing and presenting scientific data," Science, 229, 828-833.
Cleveland, W.S. (1985), The Elements of Graphing Data, Monterey CA: Wadsworth.
------ (1993a), Visualizing Data, Murray Hill NJ: AT&T Bell Laboratories.
------ (1993b), "A model for studying display methods of statistical graphics," Journal of Computational and Graphical Statistics, 2(4), Dec.
Dewdney, A.K. (1984), The Planiverse: Computer Contact with a Two-Dimensional World, New York: Poseidon Press.
Fisher, R.A. (1958), Statistical Methods for Research Workers, 13th Ed., New York: Hafner.
Gilbert, E. (1958), "Pioneer maps of health and disease in England," Geographical Journal, 124, 172-183.
Hoaglin, D.C., F. Mosteller and J.W. Tukey (Eds.) (1983), Understanding Robust and Exploratory Data Analysis, New York: John Wiley.
IEEE Transactions on Pattern Analysis and Machine Intelligence, New York: IEEE Computer Society.
IEEE Transactions on Visualization and Computer Graphics, New York: IEEE Computer Society.
James, G. and R. James (1992), Mathematics Dictionary, 5th Ed., New York: Van Nostrand Reinhold.
Journal of Computational and Graphical Statistics, Alexandria VA: American Statistical Association; Institute of Mathematical Statistics; Interface Foundation of North America.
Journal of Experimental Psychology: Applied, Washington DC: American Psychological Association.
Journal of Experimental Psychology: Human Perception and Performance, Washington DC: American Psychological Association.
Journal of Experimental Psychology: Memory and Cognition, Washington DC: American Psychological Association.
Journal of the American Statistical Association, Alexandria VA: American Statistical Association.
Julesz, B. (1971), Foundations of Cyclopean Perception, Chicago: The University of Chicago Press.
Kosslyn, S.M. (1985), "Graphics and human information processing," Journal of the American Statistical Association, 80, 449-512.
------ (1989), "Understanding charts and graphs," Applied Cognitive Psychology, 3, 185-225.
------ (1994), Elements of Graph Design, New York: W.H. Freeman.
Lewandowsky, S. and I. Spence (1989), "Discriminating strata in scatterplots," Journal of the American Statistical Association, 84, 682-688.
Mosteller, F. (1980), "Teaching of statistics: Classroom and platform performance," The American Statistician, 34(1), Feb, 11-17.
Pinker, S. (1990), "A theory of graph comprehension," in Artificial Intelligence and the Future of Testing, ed. R. Freedle, Norwood NJ: Ablex, 73-126.
Playfair, W. (1786), The Commercial and Political Atlas, London: Corry.
Rovan, Joze (1994), "Visualizing solutions in more than two dimensions," in Correspondence Analysis in the Social Sciences, ed. M. Greenacre and J. Blasius, New York: Academic Press, 210-230.
Spence, I. and S. Lewandowsky (1991), "Displaying proportions and percentages," Applied Cognitive Psychology, 5, 61-77.
------ and S. Lewandowsky (1990), "Graphical perception," in Modern Methods of Data Analysis, ed. J. Fox and J.S. Long, Newbury Park CA: Sage, 13-57.
Statistical Computing & Graphics, Alexandria VA: Statistical Computing and Statistical Graphics Sections, American Statistical Association.
Stevens, S.S. (1975), Psychophysics, New York: John Wiley.
Tufte, E. (1993), "Envisioning Information," workshop on statistical graphics, Chicago IL, Dec 9.
Tufte, E.R. (1983), The Visual Display of Quantitative Information, Cheshire CT: Graphics Press.
------ (1990), Envisioning Information, Cheshire CT: Graphics Press.
Tukey, J. (1993), "Graphic comparisons of several linked aspects: Alternatives and suggested principles," Journal of Computational and Graphical Statistics, 2, 1-49.
Tukey, J.W. (1977), Exploratory Data Analysis, Reading MA: Addison-Wesley.
Tversky, B. and D.J. Schiano (1989), "Perceptual and conceptual factors in distortions in memory for graphs and maps," Journal of Experimental Psychology: General, 118, 387-398.
Wilkinson, L.E., M. Hill, M. Stacey, G. Birkenbeuel and E. Vang (1992), Systat: Graphics, Evanston IL: Systat, Inc.