AN ADJUSTED BOXPLOT FOR SKEWED DISTRIBUTIONS

COMPSTAT’2004 Symposium

© Physica-Verlag/Springer 2004

Ellen Vandervieren and Mia Hubert

Key words: Boxplot, skewness, medcouple.

COMPSTAT 2004 section: Graphics.

Abstract: The boxplot is a very popular graphical tool to visualize the distribution of continuous univariate data. First of all, it shows information about the location and the spread of the data by means of the median and the interquartile range. The length of the whiskers on both sides of the box and the position of the median within the box help to detect possible skewness in the data. Finally, observations that fall outside the whiskers are pinpointed as outliers, hence the boxplot also includes information from the tails. However, when the data are skewed, usually too many points are classified as outliers. This is because the outlier rule is based solely on measures of location and scale, and the cutoff values are derived from the normal distribution. We present a generalization of the boxplot that includes a robust measure of skewness in the determination of the whiskers. We show with several simulation results that this adjusted boxplot gives a more accurate representation of the data and of possible outliers.

1 Introduction

One of the most frequently used graphical techniques for analyzing a univariate data set is the boxplot, proposed by Tukey [6]. If $X_n = \{x_1, x_2, \ldots, x_n\}$ is a univariate data set, the boxplot is constructed by
• putting a line at the height of the sample median $Q_2$
• drawing a box from the first quartile $Q_1$ to the third quartile $Q_3$
• classifying all points outside the interval $[Q_1 - 1.5\,\mathrm{IQR} \,;\, Q_3 + 1.5\,\mathrm{IQR}]$ (with $\mathrm{IQR} = Q_3 - Q_1$) as outliers and marking them on the plot
• drawing the whiskers (i.e. the lines that go from the ends of the box to the most remote points that are not outliers)

This construction implies that a boxplot gives information about the location, spread, skewness and tails of the data. However, in some cases the information about the tails that is given by the boxplot is not reliable. As was already mentioned in Hoaglin, Mosteller

1934

Ellen Vandervieren and Mia Hubert

and Tukey [3], too many points are classified as outliers when the data are sampled from a skewed distribution. Consider e.g. the chi-squared distribution with df degrees of freedom, which is very skewed for small df. Table 1 lists the theoretical lower and upper whisker for df = 1, 5 and 20. The last column gives the probability of a type I error, which we define as the probability of exceeding the whiskers, or equivalently, of marking regular observations as outliers. We see that this probability is about 7.6% for $\chi^2_1$, which is very high, and that it decreases with df. For the normal distribution, on the other hand, the expected percentage of outliers is only 0.7%.

| Distr. | Lower outlier cutoff | Upper outlier cutoff | Total % outliers |
|---|---|---|---|
| $\chi^2_1$ | -1.730 | 3.155 | 7.58 |
| $\chi^2_5$ | -3.252 | 12.552 | 2.80 |
| $\chi^2_{20}$ | 2.888 | 36.392 | 1.39 |
| $N(0,1)$ | -2.698 | 2.698 | 0.70 |

Table 1: Theoretical lower and upper outlier cutoff values for several distributions, and the expected percentage of classified outliers according to the boxplot rule.
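As a rough check of Table 1, the first and last rows can be recomputed with the Python standard library alone. This is a minimal sketch, not the authors' computation: it relies on the fact that a $\chi^2_1$ variable is the square of a standard normal variable, so its distribution function and quantiles follow from those of $N(0,1)$. The function names are hypothetical, and the results should only be expected to agree with the table up to rounding in the last digit.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal N(0, 1)


def normal_row():
    """Boxplot cutoffs and expected outlier percentage for N(0, 1)."""
    q1, q3 = Z.inv_cdf(0.25), Z.inv_cdf(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    pct = 100 * (Z.cdf(lower) + (1 - Z.cdf(upper)))
    return lower, upper, pct


def chi2_1_cdf(x):
    """CDF of chi-squared(1): P(Z**2 <= x) = 2 * Phi(sqrt(x)) - 1 for x > 0."""
    return 2 * Z.cdf(x ** 0.5) - 1 if x > 0 else 0.0


def chi2_1_quantile(p):
    """Quantile of chi-squared(1), obtained by inverting the CDF above."""
    return Z.inv_cdf((1 + p) / 2) ** 2


def chi2_1_row():
    """Boxplot cutoffs and expected outlier percentage for chi-squared(1)."""
    q1, q3 = chi2_1_quantile(0.25), chi2_1_quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # The lower cutoff is negative here, so only the upper tail contributes.
    pct = 100 * (chi2_1_cdf(lower) + (1 - chi2_1_cdf(upper)))
    return lower, upper, pct


if __name__ == "__main__":
    for name, row in [("chi2_1", chi2_1_row()), ("N(0,1)", normal_row())]:
        print(name, [round(v, 3) for v in row])
```

The rows for df = 5 and df = 20 need the chi-squared quantile function for general df, which the standard library does not provide, so they are left out of this sketch.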

The large discrepancy between these percentages is caused by the fact that the outlier rule is based solely on measures of location and scale, and the cutoff values are derived from the normal distribution. We present a generalization of the boxplot that includes a robust measure of skewness in the determination of the whiskers. To construct this adjusted boxplot we will derive new outlier rules at the population level. To draw the boxplot for a particular data set, we then just need to plug in the finite-sample estimates.
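For reference, the standard (unadjusted) boxplot rule from the introduction, with sample quartiles plugged in, can be sketched as below. The function name `boxplot_summary` and the choice of the "inclusive" quartile convention are assumptions made for illustration; several quartile definitions are in common use and the text does not fix one.

```python
from statistics import quantiles


def boxplot_summary(data, k=1.5):
    """Standard boxplot rule: quartiles, fences, whisker ends and outliers.

    Assumption: quartiles use the 'inclusive' method of statistics.quantiles;
    other conventions give slightly different cutoffs on small samples.
    """
    xs = sorted(data)
    q1, q2, q3 = quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - k * iqr, q3 + k * iqr
    # Points inside the fences are regular; the whiskers end at the most
    # remote regular points on each side.
    inside = [x for x in xs if lo_fence <= x <= hi_fence]
    outliers = [x for x in xs if x < lo_fence or x > hi_fence]
    return {
        "q1": q1,
        "median": q2,
        "q3": q3,
        "fences": (lo_fence, hi_fence),
        "whiskers": (inside[0], inside[-1]),
        "outliers": outliers,
    }


if __name__ == "__main__":
    print(boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))
```

Note that the whiskers stop at actual data points, while the fences are the cutoff values themselves; Table 1 reports the latter at the population level.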

2 Generalization of the boxplot

2.1 A robust measure of skewness

To measure the skewness of a continuous distribution $F$, we will use the medcouple, which we denote as MC. It is defined as

$\mathrm{MC}(F) = \operatorname{med}_{x_1 \ldots}$
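A naive sample version of the medcouple can be sketched as follows. The kernel and the pairing rule used here are assumptions taken from the usual definition of the medcouple in the robust-statistics literature, namely the median of $h(x_i, x_j) = \frac{(x_j - m) - (m - x_i)}{x_j - x_i}$ over all pairs with $x_i \le m \le x_j$, where $m$ is the sample median. The function name `medcouple_naive` is hypothetical, and pairs with $x_i = x_j$ are simply skipped, which is a simplification of the special treatment that observations tied with the median normally receive.

```python
from statistics import median


def medcouple_naive(data):
    """O(n^2) sketch of the sample medcouple (assumed standard definition).

    Simplification: pairs with xi == xj (ties at the median) are skipped
    instead of being handled by a dedicated tie kernel.
    """
    xs = sorted(data)
    m = median(xs)
    lower = [x for x in xs if x <= m]  # candidates for xi
    upper = [x for x in xs if x >= m]  # candidates for xj
    h = [
        ((xj - m) - (m - xi)) / (xj - xi)
        for xi in lower
        for xj in upper
        if xj != xi
    ]
    # Degenerate sample (all values equal): report no skewness.
    return median(h) if h else 0.0


if __name__ == "__main__":
    print(medcouple_naive([1, 2, 3, 4, 5]))   # symmetric sample
    print(medcouple_naive([1, 2, 3, 4, 10]))  # longer right tail
```

Under this definition the value lies between -1 and 1, is zero for a symmetric sample, and is positive when the right tail is longer. The quadratic cost is acceptable for small samples only.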
