Basic principles of graphing data - SciELO


Point of View

Basic principles of graphing data

Marcin Kozak
Warsaw University of Life Sciences – Dept. of Experimental Design and Bioinformatics – Nowoursynowska 159 – 02-776 – Warsaw, Poland.

ABSTRACT: Data visualization is a very important aspect of data analysis and of presentation. Focusing on the latter, this paper discusses various elements of constructing graphs for publications. Bad and good graphs are compared, and a checklist with graphical elements to be used while creating graphs is proposed.

Key words: visualization, graphs, statistics

Princípios básicos na confecção de gráficos RESUMO: A visualização de dados é um aspecto importante da análise de dados e de sua apresentação. Nesse ponto de vista enfocam-se vários elementos da construção de gráficos para publicações. Gráficos ruins e bons são comparados, e uma lista de checagem de elementos de composição gráfica é proposta para ser utilizada durante a confecção de gráficos. Palavras-chave: visualização, gráficos, estatística

Introduction

Agricultural scientists produce abundant results. Some of them are important, others may be less important, and still others may be negligible. The way one presents the results therefore matters for several reasons. First, one needs to emphasize the important ones. Second, for most data, interpretation is easiest with graphs, and it is far easier with a good graph than with a bad one. Just as statistical analysis can lead to false conclusions (Huff, 1954; Kozak, 2009a), poor graphs can misinform, sometimes leading to serious misinterpretations.

Huff (1954) was probably the first to discuss how graphing statistical data can be a tool of misinformation. Some of his notions later met with counter-arguments, yet the main ideas still hold: graphing must be correct and convey true information and interpretation. Scientific graphs are not meant to be “beautiful”; they are meant to be informative. Of course, ugly graphs will not convey any interesting message, so elegance should not be disregarded (Tufte 1991; 1997; 2001 and 2006).

Tufte (1991; 1997; 2001 and 2006), Cleveland (1993; 1994), Jacoby (1997; 1998) and Wilkinson (2005), among others, discuss graphing data in great detail. Harris (1999) is a useful reference for information graphics. Yet scientific literature in the 21st century is full of poor graphs, some of which are incomprehensible while others can even unintentionally falsify information. Data visualization is a developing research area, and what was considered a good graph 30 years ago is not necessarily good today.

This paper offers basic information about graphing data, with the help of which authors should be able to construct sufficiently good scientific graphs. The focus is mainly on graphical rather than statistical aspects, so remember that when constructing a graph, you must take all care to ensure it delivers the message you intend it to, from both scientific and statistical points of view. Note also that visualizing data for the purpose of analysis and interpretation at the computer screen does not follow exactly the same rules as those described here; this paper deals mainly with graphing for others, that is, for publications.

You will see good and bad graphs, what can go wrong when visualizing data, and why it is wrong. At the same time you will see how the problem can be overcome, and what to do to ensure the graph will be fine.

In the next section, “General principles of graphing data”, a general vision of graphing scientific data is presented. Next, the section “Specific principles of graphing data” presents some detailed rules to follow to ensure that a graph is efficient. Do not forget, however, that these are really basic principles, as the title of the paper says; readers who want to learn more about this topic should refer to the sources cited in this paper. The last section offers a short conclusion, while the Appendix lists the data sets used in this paper and gives information on the software employed to analyze and graph these data. Note that figure captions give more information than is normally needed, for example the type of graph. Each figure caption states whether the figure (or its panels) represents a good or a rather poor style.

Sci. Agric. (Piracicaba, Braz.), v.67, n.4, p.483-494, July/August 2010

General principles of graphing data

The general and most important principle of graphing data is to construct a graph so that it conveys the message one wants it to convey, and does so in the most efficient way. Thus, all elements of the graph should help rather than distract, and important aspects should be emphasized rather than hidden. This is the most general rule, and all the following rules follow from it. Remember that all details count: viewers of a graph will follow your ideas if you help them. So if you do not feel like an expert in visualization, think twice, or rather thrice, before deciding not to follow any of the rules.

Before making a graph, think over what message it is to convey and whether the graph is needed at all. In general, graphing three numbers is not advisable; surprising as it may seem, one can find quite a few such graphs, especially pie charts and barplots. In fact, even two numbers are sometimes graphed. Two numbers are most efficiently presented within a sentence, and a few numbers within a text-table (a simple table inserted within the text; Tufte, 2001; Kozak, 2009b) or a regular table. Tufte (2001) argues that even a dozen or so numbers are better represented by a table than a graph, but this is arguable: in general, for 10 or even fewer numbers to be compared a graph can be preferred, although it depends on various things, including what the numbers represent.

Make graphs as simple or as complex as is needed to deliver the message: neither be afraid of a complex graph nor make it more complex than it should be. Remember that the human eye can work efficiently with very composite images, detecting a large amount of information in small spaces (Tufte, 2001). However, always be careful to make the graph readable and understandable: avoid clutter (see, e.g., Reynolds et al., 2009; Silva, 2009c).

An efficient way to proceed while constructing a graph is to:
1) figure out the context and the message
2) figure out the way of presenting it, that is, the type, layout and style of graph to be used
3) construct the graph
4) check Tables 1 and 2 and revise the graph accordingly
5) check whether the revised version conveys the message you want it to convey
6) check Tables 1 and 2 and revise the graph accordingly
7) and so on…

Table 1 offers basic elements to check while constructing a graph; they are discussed in the following section. Not all of them apply to every type of graph, but it is best to check everything listed there and decide what does and does not apply to your graph. Table 2 cites Cleveland’s (1994) very useful principles of graphing data.

Note how similar the process of graph construction is to the process of writing: figure out the message and style, construct, revise, revise, revise, revise… Revising cannot be ignored and, exactly as in writing, it can be a source of inspiration, leading to results and conclusions one could not even imagine before. The above “algorithm” of graph construction, together with Tables 1 and 2, will be especially useful for inexperienced graph constructors; experienced ones follow it intuitively.
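
The construct, check and revise loop described above can be written down as a short program. The following Python sketch is an illustration only and is not part of the original paper: the dictionary representation of a graph, the function names `open_issues` and `revise_until_clean`, the toy reviser `fix_first_issue`, and the three sample checks (loosely based on entries of Tables 1 and 2) are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical representation: a "graph" is a dictionary of style settings.
Graph = Dict[str, object]
# A check pairs a description with a test returning True when the graph passes.
Check = Tuple[str, Callable[[Graph], bool]]

# Three illustrative checks loosely based on Tables 1 and 2 (a real list is longer).
CHECKS: List[Check] = [
    ("data points: open symbols when points overlap",
     lambda g: not g.get("overlap") or g.get("symbol") == "open circle"),
    ("box and axes: tick marks point outward",
     lambda g: g.get("tick_direction") == "out"),
    ("captions: caption is present",
     lambda g: bool(g.get("caption"))),
]


def open_issues(graph: Graph, checks: List[Check] = CHECKS) -> List[str]:
    """Return the descriptions of all checks the graph currently fails."""
    return [description for description, passes in checks if not passes(graph)]


def revise_until_clean(graph: Graph,
                       revise: Callable[[Graph, List[str]], Graph],
                       checks: List[Check] = CHECKS,
                       max_rounds: int = 10) -> Graph:
    """Steps 4 to 7: check, revise, check again, until no issue is left."""
    for _ in range(max_rounds):
        issues = open_issues(graph, checks)
        if not issues:
            break
        graph = revise(graph, issues)
    return graph


def fix_first_issue(graph: Graph, issues: List[str]) -> Graph:
    """Toy reviser that repairs one issue per round; in practice a person does this."""
    fixes = {
        CHECKS[0][0]: {"symbol": "open circle"},
        CHECKS[1][0]: {"tick_direction": "out"},
        CHECKS[2][0]: {"caption": "Scatterplot of log volume against log girth"},
    }
    # Build a new dictionary so the draft itself is left unchanged.
    return {**graph, **fixes[issues[0]]}


draft = {"symbol": "filled square", "overlap": True,
         "tick_direction": "in", "caption": ""}
final = revise_until_clean(draft, fix_first_issue)
print(open_issues(draft))  # the draft fails all three sample checks
print(open_issues(final))  # no open issues remain after revising
```

The loop stops either when every check passes or after `max_rounds` rounds. Step 5 of the list (does the graph still convey the intended message?) is a human judgment and is deliberately left out of this sketch.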

Specific principles of graphing data

Data points are very important elements of many graphs, especially scatterplots and line plots with lines superimposed on data points. The symbols representing them, together with their size and color, should be chosen carefully so that the points can be easily seen and the patterns present in the data (or their absence) can be easily noticed. Figure 1, picturing the logarithm of the volume of cherry trees against the logarithm of

Table 1 – Elements to check while constructing a graph. Refer to the text for details.

Data points: symbol, size, color, overlap
Lines: type, width, color, overlap
Color: check whether needed at all; check whether all elements are easily distinguished
Box and axes: type of box, aspect, minimum and maximum value, data rectangle and scale-line rectangle, tick marks (number of, location, direction, length and width), tick marks' labels (font type, length, rotation, numbering style, abbreviations of text labels)
Legend: check whether needed, position, box around, size, elements within (see above for "data points" and "lines", and "labels" within "box and axes")
Background: of the graph; of bars, cross-hatching, shading, color
Text inside the graph: check whether needed; font style, size and color
Captions: check whether informative, explaining everything that is needed to understand the graph and the message it delivers
Dimensions: check for over-dimensionality
Reference line: check whether needed, line type, width and color
Grid lines: check whether needed, line type, width and color
Error bars: check if the information is given on what they represent; line type, width and color, type of ending
Trellis display: all the above elements; layout (rows, columns, pages), choice of panel variables and conditioning variable(s), panel order


Table 2 – Clear vision principles given by Cleveland (1994).

1. Make the data stand out. Avoid superfluity.
2. Use visually prominent graphical elements to show the data.
3. Use a pair of scale lines for each variable. Make the data rectangle slightly smaller than the scale-line rectangle. Tick marks should point outward.
4. Do not clutter the interior of the scale-line rectangle.
5. Do not overdo the number of tick marks.
6. Use a reference line when there is an important value that must be seen across the entire graph, but do not let the line interfere with the data.
7. Do not allow data labels in the interior of the scale-line rectangle to interfere with the quantitative data or to clutter the graph.
8. Avoid putting notes and keys inside the scale-line rectangle. Put a key outside, and put notes in the caption or in the text.
9. Overlapping plotting symbols must be visually distinguishable.
10. Superposed data sets must be readily visually assembled.
11. Visual clarity must be preserved under reduction and reproduction.
12. Put major conclusions into graphical form. Make captions comprehensive and informative.
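
Principle 3 asks for a data rectangle that is slightly smaller than the scale-line rectangle. One simple way to obtain this, sketched below in Python, is to extend the axis limits beyond the data range by a small fraction of that range. The function name `padded_limits` and the default padding of 4% per side are assumptions made for this illustration; neither comes from Cleveland (1994) or from this paper, and the sample values are made up.

```python
def padded_limits(values, pad_fraction=0.04):
    """Return (lower, upper) axis limits slightly wider than the data range.

    pad_fraction is the share of the data range added on each side
    (0.04 is an assumed default, chosen only for this sketch).
    """
    values = list(values)
    if not values:
        raise ValueError("need at least one value")
    lowest, highest = min(values), max(values)
    span = highest - lowest
    if span == 0:
        # All values are equal: pad relative to the magnitude instead,
        # or use 1.0 when the single value is zero.
        span = abs(lowest) if lowest != 0 else 1.0
    pad = pad_fraction * span
    return lowest - pad, highest + pad


# Made-up values, for illustration only.
x = [2.0, 2.5, 3.0]
lower, upper = padded_limits(x)
print(lower, upper)  # limits lie just outside the smallest and largest value
```

The returned pair can then be passed as axis limits to whatever plotting tool is in use, so that no data point sits directly on a scale line.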

Figure 1 – A scatterplot with a regression line superimposed, representing trees’ volume against girth, both after a logarithmic transformation (data set trees). The graph is full of drawbacks: filled squares as plotting symbols are difficult to distinguish; the box consists of two scale lines only; the data rectangle is too large so the points are clustered within a small area; the tick mark labels on the y-axis are vertical instead of horizontal even though they are just one-digit numbers; the tick marks point inward. Note that the regression line extrapolates the values outside the ranges observed in the experiment, which can be dangerous.

girth (data set trees), is poor. The choice of filled squares as symbols is bad because it makes the points difficult to distinguish. (By the way, a filled rhombus, which is the default symbol for a scatterplot in Microsoft Excel, is an equally poor choice.) Compare it with Figure 2, whose readability is greatly enhanced thanks to the open circles used as symbols; in fact, open circles should be considered the best symbol when there is some overlap in scatterplots for non-grouped data. (Note that for dotplots, in which no overlap occurs, closed circles are usually used as plotting symbols; see Figures

Figure 2 – A scatterplot with a regression line superimposed, based on the same data as in Figure 1, but all the drawbacks mentioned there are overcome. It is considered a GOOD GRAPH.

3 and 4). Sometimes, however, overlap can be a serious problem—see Figure 5. When data are grouped, the choice of plotting symbols matters too: c, z, }, € should work well for little overlap, and c, +,