Modified boxplot for extreme data Babangida Ibrahim Babura, Mohd Bakri Adam, Anwar Fitrianto, and A. S. Abdul Rahim
Citation: AIP Conference Proceedings 1842, 030034 (2017); doi: 10.1063/1.4982872 View online: http://dx.doi.org/10.1063/1.4982872 View Table of Contents: http://aip.scitation.org/toc/apc/1842/1 Published by the American Institute of Physics
Babangida Ibrahim Babura (1, 2, a), Mohd Bakri Adam (1), Anwar Fitrianto (3) and A. S. Abdul Rahim (4)
(1) Laboratory of Computational Statistics and Operations Research, Institute for Mathematical Research, Universiti Putra Malaysia, 43300 Serdang, Selangor.
(2) Department of Mathematics, Faculty of Science, Federal University Dutse, Ibrahim Aliyu bye-pass, Dutse, Jigawa, Nigeria.
(3) Department of Mathematics, Faculty of Science, Universiti Putra Malaysia, 43300 Serdang, Selangor.
(4) Department of Economics, Faculty of Economics and Management, Universiti Putra Malaysia, 43300 Serdang, Selangor.
(a) Corresponding author: [email protected]
Abstract. A boxplot is an exploratory data analysis (EDA) tool that gives a compact distributional summary of a data set. It is designed to capture all typical observations and to display the location, spread, skewness and tails of the data. Some of these functions are more reliable for symmetric data and are therefore less appropriate for skewed data such as extreme data. Tukey's standard boxplot erroneously marks many observations from extreme data as outliers. We propose a modified boxplot in which the fences are adjusted using the Bowley coefficient, a robust measure of skewness. The adjustment makes it possible to detect inconsistent observations without any parametric assumption about the distribution of the data. The new boxplot can also display additional features such as the location parameter region of Gumbel-fitted extreme data. Simulated and real-life data are used to show the advantages of this development over existing proposals in the literature.
INTRODUCTION
The boxplot has received considerable interest since Tukey introduced it as a schematic plot in 1977 [1]. Its main purpose is to exploit its simplicity as an exploratory data analysis (EDA) tool for displaying data. The display is used mainly to capture typical observations, study symmetry and tail behavior, identify outliers, and check some distributional assumptions about the data. It can also be used to compare parallel batches of data and to supplement more complex displays of univariate information. A boxplot consists of five significant values: the two fences, the lower and upper hinges (quartiles), and the median [1]. For a univariate data set X, Tukey's standard boxplot was designed to capture the regular observations of X within the fence mark-up region [Q1 − 1.5IQR, Q3 + 1.5IQR], where Q1 and Q3 are the first and third quartiles of X respectively and IQR is the interquartile range of X. The fundamental weakness of Tukey's boxplot lies in how it marks outliers, as argued in Hoaglin (1983) [2]. For skewed distributions, the fence mark-up does not correctly define outliers, so a significant percentage of regular observations exceed the fence cutoff values specified in Tukey (1977) [1]. A substantial body of literature has addressed this problem. The survey by Frigge et al. [3] examines the different choices of the five-number summary used to implement the boxplot and describes their consequences; some of its suggestions concern a standard choice of fence constant other than 1.5 [3]. Other works have tried to address misconceptions about the outlier rule. Kimber (1990) [4] uses the lower and upper semi-interquartile ranges to obtain asymmetric whiskers, adjusting the fence formula towards skewed data. An attempt to replace the first and third quartiles in Tukey's definition of the fences with
The 3rd ISM International Statistical Conference 2016 (ISM-III) AIP Conf. Proc. 1842, 030034-1–030034-9; doi: 10.1063/1.4982872 Published by AIP Publishing. 978-0-7354-1512-6/$30.00
the median is suggested by Carling (2000) [5]. Similar ideas with a pre-specified outside rate were proposed by Schwertman et al. (2004) [6] and Schwertman and de Silva (2007) [7]. Despite the interesting ideas in [6] and [7], the estimated fence constants depend on the sample size and on properties of the uncontaminated distribution, which are difficult to estimate. Moreover, neither the suggestion in Kimber (1990) [4] nor that in Carling (2000) [5] sufficiently addresses the outlier problem for skewed data. A more precise adjustment of the fence positions to account for skewness was recently proposed by Hubert and Vandervieren (2008) [9], using the medcouple (MC), a robust measure of skewness introduced by Brys et al. (2004) [8]. Despite the exhaustive work in [9], severely skewed or heavy-tailed distributions resist the adjustment. The alternative suggested by Bruffaerts et al. (2014) [10] uses a rank-preserving transformation of the sample that allows the data to fit Tukey's g-and-h distribution. However, this contribution departs substantially from the simplicity of the boxplot, especially because of the distributional transformation of the data. We observed that the major limitation of the aforementioned adjustments is the compromise between distributions made for the sake of generality. Moreover, when they were applied to simulated extreme data from the Generalized Extreme Value (GEV) distribution, we observed that the fences were substantially underestimated, leaving a significant proportion of the data marked as outliers. The limitations detailed in the literature call for a modification of the existing boxplot display for extreme data. Extreme data are records of events that are more extreme than any that have already been observed. Such observations can be either low extremes (minima) or high extremes (maxima).
Current developments in global warming, which have generated considerable interest in environmental research, and financial crises resulting from high volatility in the financial sector are among the events that have given rise to widespread interest in modeling and forecasting extreme events. Exploratory analysis of extreme data has not been given due attention despite its importance for confirmatory data analysis. We present research on an appropriate boxplot for extreme data. We propose a modified fence adjustment based on the Bowley coefficient, a robust measure of skewness. The adjustment improves the boxplot so that it can be used for extreme data without displaying regular observations as outliers. The new boxplot can also display additional features such as the location parameter region of Gumbel-fitted extreme data. Simulated and real-life data are used to show the advantages of this development over existing proposals in the literature.
Generalized Extreme Value Distribution (GEV)
The GEV distribution is derived from an important theorem of extreme value theory (EVT), stated as follows.
Theorem 1 If there exist sequences of constants {a_n > 0} and {b_n} such that, as n → ∞,
\[ \Pr\{(M_n - b_n)/a_n \le x\} \to G(x), \]
where M_n is the block maximum of n observations and G is a non-degenerate distribution function, then G is a member of the GEV family
\[ G(x) = \exp\left\{-\left[1 + \xi\left(\frac{x-\mu}{\sigma}\right)\right]^{-1/\xi}\right\}, \]
defined on {x : 1 + ξ(x − μ)/σ > 0}, where σ > 0 and μ, ξ ∈ ℝ.
G(x) is said to be Fréchet if ξ > 0 and Weibull if ξ < 0, and it becomes Gumbel when ξ = 0, in which case the limiting distribution is re-expressed as [11]
\[ G(x) = \exp\left\{-\exp\left[-\left(\frac{x-\mu}{\sigma}\right)\right]\right\}. \]
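As an illustration of the GEV family, the following minimal Python sketch evaluates G(x), inverts it, and draws samples by inverse-transform sampling. The function names `gev_cdf`, `gev_quantile` and `gev_sample` are hypothetical helpers introduced here, not part of the original work, and the inversion formula is the standard one obtained by solving G(x) = p for x.

```python
import math
import random


def gev_cdf(x, mu=0.0, sigma=1.0, xi=0.0):
    """G(x) for the GEV family; xi == 0 is treated as the Gumbel case."""
    z = (x - mu) / sigma
    if xi == 0.0:
        s = -z
    else:
        t = 1.0 + xi * z
        if t <= 0.0:
            # Outside the support {x : 1 + xi (x - mu) / sigma > 0}.
            return 0.0 if xi > 0.0 else 1.0
        s = -math.log(t) / xi          # so that exp(s) = t ** (-1 / xi)
    # Cap the inner exponent so that extreme arguments cannot overflow.
    return math.exp(-math.exp(min(s, 700.0)))


def gev_quantile(p, mu=0.0, sigma=1.0, xi=0.0):
    """Inverse of G for 0 < p < 1."""
    y = -math.log(p)
    if xi == 0.0:
        return mu - sigma * math.log(y)
    return mu + sigma * (y ** (-xi) - 1.0) / xi


def gev_sample(n, mu=0.0, sigma=1.0, xi=0.0, rng=random):
    """Draw n GEV variates by applying the quantile function to uniforms."""
    out = []
    while len(out) < n:
        u = rng.random()
        if u > 0.0:                    # random() may return exactly 0.0
            out.append(gev_quantile(u, mu, sigma, xi))
    return out


print(gev_cdf(0.0))                    # Gumbel: G(mu) = exp(-1)
print(gev_sample(3, mu=10.0, sigma=2.0, xi=0.2, rng=random.Random(1)))
```

For the Gumbel case the location parameter sits at the level G(μ) = exp(−1), roughly the 0.37 quantile.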
ADJUSTMENT OF BOXPLOT FENCES FOR EXTREME DATA
The methodology in this work takes its guiding principle from Tukey's standard boxplot and from the adjusted boxplot for skewed distributions proposed in [9].
Tukey's Standard Boxplot
The region [Q1 − 1.5IQR, Q3 + 1.5IQR] gives Tukey's mark-up fences for a standard boxplot, where Q1 and Q3 are the first and third quartiles respectively and IQR is the interquartile range. Any univariate observation can be considered a potential outlier if it does not fall within this region. For data from a normal distribution, approximately 0.7% of the observations fall outside this region and are marked as outliers. In practice, we can change the theoretical percentage of Gaussian data that is declared outlying by changing the constant factor 1.5 that multiplies the IQR in the expression for the region. The adjusted factor that leads to a theoretical detection rate equal to α (0 ≤ α ≤ 0.5) is given by
\[ c(\alpha) = \frac{z_{1-\alpha/2} - z_{0.75}}{z_{0.75} - z_{0.25}}, \]
where z_p, 0 ≤ p ≤ 1, is the quantile of order p of the standard normal distribution. For example, with the fence constant c(0.05) = 0.95, we can reasonably expect about 5% of the data from a normal distribution to be marked as outliers. Fixing the fence constant affects the efficiency of the outlier rule for asymmetric distributions such as the GEV family.
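The relation between the fence constant and the theoretical outside rate for normal data can be checked numerically. The sketch below uses Python's standard `statistics.NormalDist`; the function names `fence_constant` and `outside_rate` are hypothetical labels chosen for this illustration.

```python
from statistics import NormalDist


def fence_constant(alpha):
    """c(alpha) for a target outside rate alpha, with 0 < alpha <= 0.5."""
    z = NormalDist().inv_cdf           # standard normal quantile function
    return (z(1.0 - alpha / 2.0) - z(0.75)) / (z(0.75) - z(0.25))


def outside_rate(c):
    """Share of normal data outside [Q1 - c*IQR, Q3 + c*IQR]."""
    std = NormalDist()
    q3 = std.inv_cdf(0.75)
    upper = q3 + c * (2.0 * q3)        # IQR of the standard normal is 2 * z_0.75
    return 2.0 * (1.0 - std.cdf(upper))


print(round(fence_constant(0.05), 2))  # constant giving a 5% outside rate
print(round(outside_rate(1.5), 4))     # outside rate of Tukey's constant 1.5
```

Evaluating `fence_constant` at α = 0.007 returns a value very close to 1.5, which is the sense in which Tukey's rule corresponds to a 0.7% outside rate.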
Hubert's Adjusted Boxplot
The limitation of the standard boxplot was addressed by Hubert and Vandervieren (2008) [9], who adjust the fence constant according to the degree of asymmetry in the data. The following bounds for the regular observations are proposed in [9]:
\[ \begin{cases} [\,Q_1 - 1.5e^{-4MC}\,IQR,\; Q_3 + 1.5e^{3MC}\,IQR\,] & \text{if } MC \ge 0, \\ [\,Q_1 - 1.5e^{-3MC}\,IQR,\; Q_3 + 1.5e^{4MC}\,IQR\,] & \text{if } MC < 0, \end{cases} \]
where MC is the medcouple, a robust measure of skewness proposed by Brys et al. (2004) [8]. For a given sample {x_1, ..., x_n}, MC is given by
\[ MC = \operatorname*{med}_{x_i \le Q_2 \le x_j} h(x_i, x_j), \]
where Q2 is the sample median, med(·) is the sample median operator and, for x_i ≠ x_j, the kernel function h is given by
\[ h(x_i, x_j) = \frac{(x_j - Q_2) - (Q_2 - x_i)}{x_j - x_i}. \]
Brys et al. (2004) [8] give a more detailed definition of the kernel function h for the case when x_i or x_j is tied to the median, together with its other properties. The lower and upper bounds of the measure are −1 and 1 respectively. MC is equal to zero when the observed distribution is symmetric, and it becomes positive or negative for a right- or left-skewed distribution respectively.
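The definition above can be turned into a direct, quadratic-time computation. The following Python sketch is a naive illustration only: it simply skips pairs with x_i = x_j rather than applying the special tie-handling kernel of Brys et al. [8], and it uses the "inclusive" quartile convention of `statistics.quantiles`, which is an assumption of this example. The names `medcouple_naive` and `adjusted_fences` are hypothetical.

```python
import math
from statistics import median, quantiles


def medcouple_naive(data):
    """Median of h(x_i, x_j) over all pairs x_i <= Q2 <= x_j with x_i != x_j."""
    q2 = median(data)
    below = [x for x in data if x <= q2]
    above = [x for x in data if x >= q2]
    h = [((xj - q2) - (q2 - xi)) / (xj - xi)
         for xi in below for xj in above if xi != xj]
    return median(h)                   # raises if all observations are equal


def adjusted_fences(data):
    """Fences of the adjusted boxplot, using the naive medcouple above."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    mc = medcouple_naive(data)
    if mc >= 0:
        return (q1 - 1.5 * math.exp(-4.0 * mc) * iqr,
                q3 + 1.5 * math.exp(3.0 * mc) * iqr)
    return (q1 - 1.5 * math.exp(-3.0 * mc) * iqr,
            q3 + 1.5 * math.exp(4.0 * mc) * iqr)


print(medcouple_naive([1, 2, 3, 4]))   # symmetric sample
print(medcouple_naive([1, 2, 3, 10]))  # right-skewed sample
print(adjusted_fences([1, 2, 3, 10]))
```

For large samples a faster algorithm would be preferable; the double loop is kept here because it mirrors the formula term by term.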
Fences Adjustment with Bowley Coefficient of Skewness
The robust skewness measure sometimes called the Bowley coefficient was proposed by Bowley (1920) [12] and is defined as
\[ \delta = \frac{Q_3 + Q_1 - 2Q_2}{Q_3 - Q_1}, \]
where the Q_i are the three boxplot quartiles. δ satisfies the benchmark requirements for any reasonable skewness measure set out by Groeneveld and Meeden (1984) [13]:
(i) δ is location and scale invariant, that is, for random variables X and Y with Y = aX + b for a ∈ (0, ∞) and b ∈ (−∞, ∞), δ(X) = δ(Y);
(ii) for a symmetric distribution, δ = 0;
(iii) δ is sign equivariant, that is, if Y = −X then δ(Y) = −δ(X);
(iv) if F and G are the c.d.f.s of X and Y as above and F < G (componentwise), then δ(X) ≤ δ(Y).
We incorporate δ into the new definition of the boxplot fences by replacing the constant multiple of the IQR in the outlier cutoff values with functions f_l(δ) and f_u(δ) of δ. The proposed fences are therefore given by
\[ Q_1 - f_l(\delta)\,IQR \quad \text{for the lower fence}, \qquad Q_3 + f_u(\delta)\,IQR \quad \text{for the upper fence}. \tag{1} \]
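A short Python sketch of the Bowley coefficient makes properties (i) and (iii) easy to verify on a small sample. The helper name `bowley` and the "inclusive" quartile convention of `statistics.quantiles` are assumptions of this illustration; other quartile conventions give slightly different values on small samples.

```python
from statistics import quantiles


def bowley(data):
    """Bowley coefficient (Q3 + Q1 - 2*Q2) / (Q3 - Q1); needs Q3 > Q1."""
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return (q3 + q1 - 2.0 * q2) / (q3 - q1)


x = [1, 2, 3, 4, 6, 9, 13, 18, 30]        # right-skewed toy sample
print(bowley(x))                          # positive for right skewness
print(bowley([2 * v + 5 for v in x]))     # unchanged under Y = aX + b, a > 0
print(bowley([-v for v in x]))            # sign flips under Y = -X
```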
We intend our definition to coincide with Tukey's standard definition of the fences by making the functions f_l(δ) and f_u(δ) equal to 1.5 when δ = 0, that is, when the GEV distribution is symmetric. We consider two models relating the fences to the coefficient δ and select the better one:
1. A linear regression model described by
\[ f_l(\delta) = 1.5 + h\delta, \qquad f_u(\delta) = 1.5 + k\delta, \tag{2} \]
where 1.5 is the known intercept at δ = 0 and h, k ∈ ℝ are the model coefficients to be determined.
2. An exponential regression model given by
\[ f_l(\delta) = 1.5e^{m\delta}, \qquad f_u(\delta) = 1.5e^{n\delta}, \tag{3} \]
where 1.5 is the known value at δ = 0 and m, n ∈ ℝ are constants to be determined.
These models are simple and align with the principles of EDA.
Determination of the Models Parameters
Let the values of f_l and f_u be chosen so that, as with Tukey's fence mark-up, the expected percentage of outliers is 0.7% of all regular sample observations. We simulate extreme data as random samples from the GEV distribution in order to obtain proper estimates of f_l and f_u. For the linear model, this means that the constants h and k should satisfy
\[ Q_1 - (1.5 + h\delta)\,IQR \approx Q_\alpha, \qquad Q_3 + (1.5 + k\delta)\,IQR \approx Q_\beta, \tag{4} \]
where Q_α and Q_β denote the αth and βth quantiles respectively, with α = (0.7%)/2 = 0.0035 and β = 1 − α = 0.9965. The system in Equation 4 can be rewritten as
\[ \frac{Q_1 - Q_\alpha}{IQR} - 1.5 \approx h\delta, \qquad \frac{Q_\beta - Q_3}{IQR} - 1.5 \approx k\delta. \tag{5} \]
Similarly, the exponential model in Equation 3 can easily be transformed into the linear system
\[ \ln\left(\frac{2}{3}\,\frac{Q_1 - Q_\alpha}{IQR}\right) \approx m\delta, \qquad \ln\left(\frac{2}{3}\,\frac{Q_\beta - Q_3}{IQR}\right) \approx n\delta. \tag{6} \]
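To see what the left-hand sides of Equations 5 and 6 look like, the sketch below evaluates them from the population quantiles of a GEV distribution with μ = 0 and σ = 1. This is a simplification made for illustration: the paper works with averages over simulated finite samples, not with population quantiles. The names `gev_quantile` and `lhs_quantities` are hypothetical.

```python
import math

ALPHA, BETA = 0.0035, 0.9965


def gev_quantile(p, xi):
    """Population quantile of the GEV distribution with mu = 0, sigma = 1."""
    y = -math.log(p)
    return -math.log(y) if xi == 0.0 else (y ** (-xi) - 1.0) / xi


def lhs_quantities(xi):
    """Bowley delta and the LHS of Equations 5 and 6 at shape parameter xi."""
    q1, q2, q3 = (gev_quantile(p, xi) for p in (0.25, 0.5, 0.75))
    iqr = q3 - q1
    lower_ratio = (q1 - gev_quantile(ALPHA, xi)) / iqr
    upper_ratio = (gev_quantile(BETA, xi) - q3) / iqr
    return {
        "delta": (q3 + q1 - 2.0 * q2) / iqr,
        "lin_lower": lower_ratio - 1.5,                    # ~ h * delta
        "lin_upper": upper_ratio - 1.5,                    # ~ k * delta
        "exp_lower": math.log(2.0 / 3.0 * lower_ratio),    # ~ m * delta
        "exp_upper": math.log(2.0 / 3.0 * upper_ratio),    # ~ n * delta
    }


for xi in (0.0, 0.25, 0.5):
    row = lhs_quantities(xi)
    print(xi, {k: round(v, 3) for k, v in row.items()})
```

Dividing `exp_lower` and `exp_upper` by `delta` gives a rough population-level impression of the slopes m and n at each ξ.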
For the two systems in Equations 5 and 6 we employ the median resistant line fit, an EDA procedure [14], to estimate the parameters h, k, m and n. In both fits we require the intercept to be zero, which makes the linear and exponential models exactly equal to 1.5 when δ = 0. To simulate the quantities on the left-hand side (LHS) and right-hand side (RHS) of Equations 5 and 6, we generate samples of size 100 from a GEV distribution in which the location and scale parameters are fixed at 0 and 1 respectively and the shape parameter is varied over the interval −0.8 ≤ ξ ≤ 10 in increments of 1/100. The band for ξ was chosen on the basis of practical EVT suggestions in the literature. Coles et al. (2001) [11] explain the non-regular cases for estimating the parameters of the GEV distribution. The case ξ ≤ −0.5 corresponds to a GEV distribution with a very short bounded upper tail, a situation rarely encountered in practice, so the simulation need not give weight to the non-regular cases. This justifies taking −0.8 as the lower limit for ξ in the simulation experiment. Since the Bowley coefficient of skewness δ is approximately 1.0 when ξ ≥ 10, we set ξ = 10 as the upper limit of the simulation band. For each combination of location, scale and shape parameters, the sample was replicated 5000 times, and the average δ and the corresponding averages of the LHS quantities of Equations 5 and 6 were recorded as finite-sample estimates. We then kept only the simulated quantities with nonnegative skewness (δ ≥ 0), on the assumption that the estimated model parameters can simply be exchanged to obtain the coefficients for negative skewness (δ < 0).
The retained quantities were further restricted to δ ≤ 0.9 to fit the lower fence models and to 0 ≤ δ ≤ 0.6 to fit the upper fence models. These choices allow a better fit, since both models explode exponentially for δ outside these bands. The linear and exponential fits for the lower and upper fences are presented in Figure 1 and Figure 2 respectively. From the residuals and the clustering of points around the resistant line, we conclude that the exponential model is the better one.
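The simulation described above can be sketched on a much smaller scale. The Python code below is an illustration under several assumptions that are not taken from the paper: a coarse grid ξ = 0.0, 0.1, ..., 0.5, only 200 replications per grid point, linearly interpolated sample quantiles, and a zero-intercept median-of-slopes estimate used as a simple stand-in for the median resistant line fit of [14]. All function names are hypothetical.

```python
import math
import random
from statistics import fmean, median

ALPHA, BETA = 0.0035, 0.9965


def gev_variate(xi, rng):
    """One GEV(0, 1, xi) draw by inverse transform."""
    u = rng.random()
    while u == 0.0:                    # avoid log(0)
        u = rng.random()
    y = -math.log(u)
    return -math.log(y) if xi == 0.0 else (y ** (-xi) - 1.0) / xi


def sample_quantile(sorted_x, p):
    """Linearly interpolated sample quantile of an already sorted list."""
    pos = (len(sorted_x) - 1) * p
    lo = int(pos)
    hi = min(lo + 1, len(sorted_x) - 1)
    return sorted_x[lo] + (pos - lo) * (sorted_x[hi] - sorted_x[lo])


def average_point(xi, rng, n=100, reps=200):
    """Average delta and average LHS of Equation 6 over replicated samples."""
    deltas, lowers, uppers = [], [], []
    for _ in range(reps):
        x = sorted(gev_variate(xi, rng) for _ in range(n))
        q1, q2, q3 = (sample_quantile(x, p) for p in (0.25, 0.5, 0.75))
        iqr = q3 - q1
        deltas.append((q3 + q1 - 2.0 * q2) / iqr)
        lowers.append(math.log(2.0 / 3.0 * (q1 - sample_quantile(x, ALPHA)) / iqr))
        uppers.append(math.log(2.0 / 3.0 * (sample_quantile(x, BETA) - q3) / iqr))
    return fmean(deltas), fmean(lowers), fmean(uppers)


def zero_intercept_median_slope(points):
    """Median of y / delta: a crude robust slope through the origin."""
    return median(y / d for d, y in points if d > 0)


rng = random.Random(2016)
grid = [i / 10 for i in range(6)]      # xi = 0.0, 0.1, ..., 0.5 (assumed grid)
pts = [average_point(xi, rng) for xi in grid]
m_hat = zero_intercept_median_slope([(d, lo) for d, lo, _ in pts])
n_hat = zero_intercept_median_slope([(d, up) for d, _, up in pts])
print(round(m_hat, 2), round(n_hat, 2))
```

Because the grid, the replication count and the slope estimator all differ from the paper's setup, the printed values should only be expected to agree with the reported estimates in sign and rough magnitude.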
[Figure 1: top panel, fitted linear and exponential models for the lower fence (LHS of systems 5 and 6 against skewness, with the resistant line); bottom panel, residuals against fitted values for the linear and exponential fits, with the zero line.]
FIGURE 1. Median resistant line fit for the linear and exponential models in Equations 5 and 6 for the lower fence.
The Proposed Modified Boxplot for Extreme Data
The half-slope ratios of the linear fits in Figures 1 and 2 were 0.87 and 3.74 for the lower and upper fences respectively, which indicates that the relation is not linear, especially for the upper fence. The exponential fits have half-slope ratios of 0.78 and 0.69 for the lower and upper fences respectively, values reasonably close to one, as expected for a properly linear relation.
[Figure 2: top panel, fitted linear and exponential models for the upper fence (LHS of systems 5 and 6 against skewness, with the resistant line); bottom panels, residuals against fitted values for the linear fit and for the exponential fit.]
FIGURE 2. Median resistant line fit for the linear and exponential models in Equations 5 and 6 for the upper fence.
The estimated parameters of the exponential model in Equation 6 are m = −4.76 and n = 6.52, which we round to m = −4 and n = 6 for robustness. Because the model was fitted to simulated values with nonnegative skewness, we swap the values of m and n for negative skewness. Replacing f_l(δ) and f_u(δ) in Equation 1, the proposed fence mark-up for the new boxplot is as follows.
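\[ \begin{cases} Q_1 - 1.5e^{-4\delta}\,IQR & \text{for the lower fence with } \delta \ge 0, \\ Q_3 + 1.5e^{6\delta}\,IQR & \text{for the upper fence with } \delta \ge 0, \end{cases} \tag{7} \]
\[ \begin{cases} Q_1 - 1.5e^{6\delta}\,IQR & \text{for the lower fence with } \delta < 0, \\ Q_3 + 1.5e^{-4\delta}\,IQR & \text{for the upper fence with } \delta < 0. \end{cases} \tag{8} \]
FIGURE 3. Modified boxplot with location parameter region for the Gumbel type GEV distribution.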
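The fence rule in Equations 7 and 8 can be sketched in a few lines of Python. This is an illustration, not the authors' implementation: the function names are hypothetical, the "inclusive" quartile convention of `statistics.quantiles` is an assumption, and the δ < 0 branch follows Equation 8 exactly as printed. Readers working with left-skewed data may wish to check that branch against the sign-equivariance property (iii) before relying on it.

```python
import math
from statistics import quantiles


def modified_fences(data):
    """Return (lower, upper) fences following Equations 7 and 8 as printed."""
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1                                  # must be positive
    delta = (q3 + q1 - 2.0 * q2) / iqr             # Bowley coefficient
    if delta >= 0:
        low_exp, up_exp = -4.0, 6.0                # Equation 7
    else:
        low_exp, up_exp = 6.0, -4.0                # Equation 8 (m and n swapped)
    lower = q1 - 1.5 * math.exp(low_exp * delta) * iqr
    upper = q3 + 1.5 * math.exp(up_exp * delta) * iqr
    return lower, upper


def flag_outliers(data):
    """Observations falling outside the modified fences."""
    lower, upper = modified_fences(data)
    return [x for x in data if x < lower or x > upper]


symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
skewed = [1, 2, 3, 4, 6, 9, 13, 18, 30]
print(modified_fences(symmetric))   # delta = 0, so this is Tukey's region
print(modified_fences(skewed))
print(flag_outliers(skewed))
```

In the skewed toy sample the largest value, 30, lies above Tukey's upper fence Q3 + 1.5IQR = 28 but inside the modified upper fence, which is the behaviour the adjustment is meant to produce for right-skewed data.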
Display of the Location Parameter Region for Gumbel Type GEV Distribution
The shape parameter of the Gumbel type GEV distribution is zero, which makes it a two-parameter distribution with location μ and scale σ. To enhance the display of the modified boxplot for batch comparison, we formulate the following simulation experiment. Let α ∈ [0, 1] be such that Q_α ≈ μ, that is, Q_α is the quantile that returns the location parameter of a Gumbel type GEV sample. We can simulate the finite-sample estimate of Q_α from random samples of the Gumbel type GEV distribution, and we take the confidence band of the empirical quantiles from the simulated population of Q_α as a confidence band for μ. We set the location parameter μ = 0, the scale parameter σ = 1 and the shape parameter to zero, generate a sample of size 100 and determine Q_α. We replicated this process 10,000 times to obtain a good finite-sample estimate of Q_α. The 95% confidence band obtained from the simulated quantities is [Q0.28, Q0.47]; that is, we are 95% confident that the location parameter μ of a Gumbel type GEV distribution falls within this region. We incorporate the interval into the modified boxplot as a gray rectangular box between the lower quartile and the median, whose height is the length of the confidence interval [Q0.28, Q0.47]. Figure 3 shows the modified boxplot with this enhancement. Figure 4 illustrates a batch comparison of three data sets generated from Gumbel type GEV distributions; the first plot uses the modified boxplot and the second uses the notched boxplot. The fence mark-up is clearly improved in the proposed modified boxplot. In addition, the confidence interval for the location parameter is better illustrated by the gray region of the modified boxplot than by the notch around the median in the notched boxplot, whose extent does not even capture the exact value of the location parameter. The non-overlap of the gray region of the third batch with those of the other two batches clearly indicates a difference in the location parameters of the samples.
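One way to reproduce the flavour of this experiment is sketched below in Python. It assumes that the level α with Q_α ≈ μ is estimated in each sample by the share of observations not exceeding μ; the paper does not spell out this step, so this reading is an assumption. The replication count is reduced to 2000 to keep the example quick, and the function names are hypothetical.

```python
import math
import random


def gumbel_variate(rng, mu=0.0, sigma=1.0):
    """One Gumbel(mu, sigma) draw by inverse transform."""
    u = rng.random()
    while u == 0.0:                    # avoid log(0)
        u = rng.random()
    return mu - sigma * math.log(-math.log(u))


def location_level(sample, mu):
    """Empirical level alpha with Q_alpha ~ mu: share of observations <= mu."""
    return sum(x <= mu for x in sample) / len(sample)


def percentile(sorted_x, p):
    """Linearly interpolated percentile of an already sorted list."""
    pos = (len(sorted_x) - 1) * p
    lo = int(pos)
    hi = min(lo + 1, len(sorted_x) - 1)
    return sorted_x[lo] + (pos - lo) * (sorted_x[hi] - sorted_x[lo])


rng = random.Random(3)
levels = sorted(
    location_level([gumbel_variate(rng) for _ in range(100)], 0.0)
    for _ in range(2000)
)
band = (percentile(levels, 0.025), percentile(levels, 0.975))
print(band)                            # levels bounding the 95% band for mu
```

The printed pair can be compared with the levels 0.28 and 0.47 reported above; both bracket the population value G(μ) = exp(−1) ≈ 0.37.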
FIGURE 4. Comparison between modified and notched boxplots for display of the confidence band of the location parameter of the Gumbel type GEV distribution.
Performance of the Modified Boxplot Using Simulated Data
To compare the performance of the modified boxplot with that of the standard and adjusted boxplots for extreme data, we perform a simulation study. We set the parameters of the GEV distribution to obtain zero-skewed and right-skewed extreme data sets: the location and scale parameters are set to 10 and 2 respectively, and the shape parameter is varied over the interval 0.1 ≤ ξ ≤ 0.5 in increments of 1/100. Figure 5 shows the result of the comparison. The percentage of outliers flagged by the modified boxplot is mostly below 5%, lower than for the standard and adjusted boxplots, especially as the skewness increases.
[Figure 5: percentage of outliers against skewness for the standard, adjusted and modified boxplots.]
FIGURE 5. Performance of the modified boxplot on uncontaminated simulated data.
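A reduced version of this comparison can be sketched in Python. The example compares only the standard and modified rules (the adjusted boxplot is omitted to keep the code short) and assumes a sample size of 100 and 300 replications, neither of which is stated in the text for this study. Function names are hypothetical.

```python
import math
import random
from statistics import fmean, quantiles


def gev_sample(n, mu, sigma, xi, rng):
    """n GEV draws by inverse transform (xi != 0 assumed here)."""
    out = []
    while len(out) < n:
        u = rng.random()
        if u > 0.0:
            out.append(mu + sigma * ((-math.log(u)) ** (-xi) - 1.0) / xi)
    return out


def fences(x, modified):
    """Tukey's fences, or the modified fences of Equations 7 and 8."""
    q1, q2, q3 = quantiles(x, n=4, method="inclusive")
    iqr = q3 - q1
    if not modified:
        return q1 - 1.5 * iqr, q3 + 1.5 * iqr
    d = (q3 + q1 - 2.0 * q2) / iqr
    a, b = (-4.0, 6.0) if d >= 0 else (6.0, -4.0)
    return q1 - 1.5 * math.exp(a * d) * iqr, q3 + 1.5 * math.exp(b * d) * iqr


def outlier_percentage(xi, modified, reps=300, n=100, seed=1):
    """Average percentage of points outside the fences over replications."""
    rng = random.Random(seed)          # same seed, so both rules see the same data
    rates = []
    for _ in range(reps):
        x = gev_sample(n, 10.0, 2.0, xi, rng)
        lo, hi = fences(x, modified)
        rates.append(100.0 * sum(v < lo or v > hi for v in x) / n)
    return fmean(rates)


for xi in (0.1, 0.3, 0.5):
    print(xi, outlier_percentage(xi, False), outlier_percentage(xi, True))
```

Each printed row gives the shape parameter followed by the average percentage of flagged points under the standard and the modified rule.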
Performance of the Modified Boxplot Using Real Data
Figure 6 shows the three boxplot displays of 30 years (1974-2003) of maximum monthly rainfall records at Subang station, Malaysia. On the left is Tukey's standard boxplot, in the middle is Hubert's adjusted boxplot and on the right is the proposed modified boxplot for extreme data. The display shows that the adjusted boxplot captures slightly more of the data and the modified boxplot captures significantly more. The modified boxplot captures all the extreme values but indicates that some values at the lower end of the data may be potential outliers that need special attention. The highest value was recorded in 2001, in what is considered the worst flood to hit Malaysia in over 100 years.
[Figure 6: three boxplots of the rainfall data, each on a vertical axis from 0 to 150: standard (left), adjusted (middle) and modified (right).]
FIGURE 6. Boxplot display of 30 years (1974-2003) of maximum monthly rainfall records at Subang station, Malaysia. Source: Earth Observation Center (EOC), Institute of Climate Change, UKM.
DISCUSSION
The findings of this research address many of the reservations that the extreme value theory (EVT) community has about using the boxplot for extreme data, especially the display of regular observations as outliers. The modified boxplot can be a valuable tool for EDA display of both normal and extreme data. However, we recommend keeping the following pitfalls in mind when using the modified boxplot for extreme data.
• All analysis with the modified boxplot should remain exploratory, not confirmatory.
• The sample should contain at least 5 data points.
• The proposed fence mark-up is more appropriate when 0 ≤ δ ≤ 0.9 for the lower fence and 0 ≤ δ ≤ 0.6 for the upper fence in the case of right-skewed extreme data, and vice versa for left-skewed extreme data.
• The proposed location parameter region can be considered more reliable when skewness is low, e.g. |δ| ≤ 0.2.
• The median and IQR should not be taken as estimates of the parameters of the extreme model, so the box in the modified boxplot does not represent such parameters.
Nevertheless, this research paves the way for further improvement of the boxplot for extreme data. Further display enhancements are needed to convey more information about the parameters of an extreme model, e.g. the scale and shape parameters of a GEV model.
ACKNOWLEDGMENTS
The authors are grateful to Universiti Putra Malaysia (Research Grant No. FRGS5524846) and Federal University Dutse, Nigeria (Tetfund Fellowship 2013) for sponsoring this research. We also acknowledge the constructive criticism of the anonymous reviewers. We are highly grateful to the R software developers for creating the reliable free tool used in this research.
REFERENCES
[1] J. Tukey, Exploratory Data Analysis (Addison-Wesley, Reading, Massachusetts, 1977), pp. 39–49.
[2] D. C. Hoaglin, F. Mosteller, and J. W. Tukey, Understanding Robust and Exploratory Data Analysis, Vol. 3 (Wiley, New York, 1983).
[3] M. Frigge, D. C. Hoaglin, and B. Iglewicz, The American Statistician 43, 50–54 (1989).
[4] A. Kimber, Applied Statistics 39, 21–30 (1990).
[5] K. Carling, Computational Statistics & Data Analysis 33, 249–258 (2000).
[6] N. C. Schwertman, M. A. Owens, and R. Adnan, Computational Statistics & Data Analysis 47, 165–174 (2004).
[7] N. C. Schwertman and R. de Silva, Computational Statistics & Data Analysis 51, 3800–3810 (2007).
[8] G. Brys, M. Hubert, and A. Struyf, Journal of Computational and Graphical Statistics 13, 996–1017 (2004).
[9] M. Hubert and E. Vandervieren, Computational Statistics & Data Analysis 52, 5186–5201 (2008).
[10] C. Bruffaerts, V. Verardi, and C. Vermandele, Statistics & Probability Letters 95, 110–117 (2014).
[11] S. Coles, J. Bawa, L. Trenner, and P. Dorazio, An Introduction to Statistical Modeling of Extreme Values, Vol. 208 (Springer, 2001), pp. 47–54.
[12] A. L. Bowley, Elements of Statistics, Vol. 2 (PS King, 1920).
[13] R. A. Groeneveld and G. Meeden, The Statistician 33, 391–399 (1984).
[14] P. F. Velleman and D. C. Hoaglin, Applications, Basics, and Computing of Exploratory Data Analysis (Duxbury Press, 1981), pp. 121–146.