Modified Boxplot for Extreme Data
Babangida Ibrahim Babura 1,2,a), Mohd Bakri Adam 1, Anwar Fitrianto 3 and Abdul Rahim Abdul Samad 4
1 Laboratory of Computational Statistics and Operations Research, Institute for Mathematical Research, Universiti Putra Malaysia, 43300 Serdang Selangor.
2 Department of Mathematics, Faculty of Science, Federal University Dutse, Ibrahim Aliyu bye-pass Dutse, Jigawa Nigeria.
3 Department of Mathematics, Faculty of Science, Universiti Putra Malaysia, 43300 Serdang Selangor.
4 Department of Economics, Faculty of Economics and Management, Universiti Putra Malaysia, 43300 Serdang Selangor.
a) Corresponding author: [email protected]
Abstract. A boxplot is an exploratory data analysis (EDA) tool that gives a compact distributional summary of a data set. It is designed to capture all typical observations and to display the location, spread, skewness and tails of the data. Some of these functions are considered more reliable for symmetric data and thus less appropriate for skewed data such as extreme data. Many observations from extreme data are erroneously marked as outliers by Tukey's standard boxplot. We propose a modified boxplot fence adjustment using the Bowley coefficient, a robust skewness measure. The adjustment enables us to detect atypical observations without any parametric assumption about the distribution of the data. The new boxplot is capable of displaying additional features such as the distribution-free confidence interval of Gumbel-fitted extreme data. Simulated and real-life data were used to show the advantages of this development over those found in the literature.
First draft, ISM3 conference proceedings.
INTRODUCTION
The boxplot has received a considerable amount of interest since Tukey introduced it as a schematic plot in 1977 [1]. Its main purpose is to serve as a simple EDA tool for displaying data: capturing typical observations, studying symmetry and tail behavior, identifying outliers and checking some distributional assumptions. It can also be used to compare parallel batches of data and to supplement more complex displays of univariate information. A boxplot consists of five significant values, namely the two fences, the lower and upper hinges (quartiles) and the median [1]. For a univariate data set X, Tukey's standard boxplot was designed to capture the regular observations of X within the fence region [Q1 − 1.5 IQR, Q3 + 1.5 IQR], where Q1 and Q3 are the first and third quartiles of X respectively and IQR = Q3 − Q1 is the interquartile range of X.

The fundamental weakness of Tukey's boxplot lies in how it marks outliers, as argued in [2]. For skewed distributions the fences do not correctly define outliers, so a significant percentage of regular observations exceed the fence cutoff values specified in [1]. A substantial literature addresses this problem. The survey in [3] examines different choices of the five-number summary used to implement the boxplot and describes their consequences; some of the suggestions concern selecting a fence constant other than 1.5 [3]. Other works address misconceptions about the outlier rule. Kimber (1990) [4] uses the lower and upper semi-interquartile ranges to adjust the fence formula for skewed data, which gives asymmetric whiskers. Replacing the first and third quartiles in Tukey's definition of the fences with the median is suggested by [5]. Similar ideas with a prespecified outside rate were proposed by [6] and [7]. Despite the interesting ideas in [6] and [7], the estimated fence constants depend on the sample size and on properties of the uncontaminated distribution, whose parameters are difficult to estimate. Neither [4] nor [5] sufficiently addresses the outlier problem for skewed data. A precise adjustment of the position of the fences to account for skewness, using the medcouple (MC), a robust measure introduced by [8], was recently proposed by [9]. Despite the exhaustive work in [9], severely skewed or heavy-tailed distributions resist the adjustment. The alternative suggested by [10] uses a rank-preserving transformation of the sample that allows the data to fit Tukey's g-and-h distribution. However, the approach in [10] deviates substantially from the simplicity of the boxplot, especially in its distributional transformation of the data.

We observed that the major limitation of the aforementioned adjustments is the compromise made between distributions for the sake of generality. When these adjustments were applied to simulated extreme data from the GEV distribution, we also observed a significant underestimation of the fences, which left a significant proportion of the data marked as outliers. The limitations detailed in the literature prompt the need to modify the existing boxplot display for extreme data. Extreme data are records of events that are more extreme than any that have already been observed. Such observations can be either low extremes (minima) or high extremes (maxima).
Current developments in global warming, which have drawn considerable interest to environmental research, and financial crises arising from high volatility in the financial sector are among the events that have given rise to a universal interest in modeling and forecasting extreme events. Exploratory data analysis of extreme data has not been given due attention despite its significance for confirmatory data analysis. We present research on an appropriate boxplot for extreme data. We propose a modified boxplot fence adjustment using the Bowley coefficient, a robust skewness measure. The adjustment improves the boxplot display so that it can be used properly for extreme data without displaying regular observations as outliers. The new boxplot is capable of displaying additional features such as the distribution-free confidence interval of Gumbel-fitted extreme data. Simulated and real-life data were used to show the advantages of this development over those found in the literature.
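The Generalized Extreme Value Distribution (GEV)
Let $M_n = \max\{X_1, \ldots, X_n\}$ be the maximum of n independent and identically distributed observations. If there exist sequences of constants $\{a_n > 0\}$ and $\{b_n\}$ such that, as $n \to \infty$,
\[ \Pr\{(M_n - b_n)/a_n \le x\} \to G(x), \]
where $G$ is a non-degenerate distribution function, then $G$ is a member of the GEV family
\[ G(x) = \exp\left\{ -\left[ 1 + \xi \left( \frac{x - \mu}{\sigma} \right) \right]^{-1/\xi} \right\}, \]
defined on $\{x : 1 + \xi (x - \mu)/\sigma > 0\}$, where $\sigma > 0$ and $\mu, \xi \in \mathbb{R}$. $G(x)$ is said to be Fréchet if $\xi > 0$, Gumbel if $\xi = 0$ (taken as the limit $\xi \to 0$) and Weibull if $\xi < 0$.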
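As an illustration of the GEV family above, the following minimal sketch evaluates G(x), inverts it and draws samples by the inverse-transform method. The function names `gev_cdf`, `gev_quantile` and `gev_sample` are hypothetical helpers introduced here, not part of the original work, and the case ξ = 0 is handled with the standard Gumbel limit exp(−exp(−(x − µ)/σ)).

```python
import math
import random


def gev_cdf(x, mu=0.0, sigma=1.0, xi=0.0):
    """Evaluate G(x) for the GEV family; xi = 0 uses the Gumbel limit."""
    z = (x - mu) / sigma
    if xi == 0.0:
        return math.exp(-math.exp(-z))
    t = 1.0 + xi * z
    if t <= 0.0:
        # Outside the support: below the lower end point when xi > 0,
        # above the upper end point when xi < 0.
        return 0.0 if xi > 0 else 1.0
    return math.exp(-t ** (-1.0 / xi))


def gev_quantile(p, mu=0.0, sigma=1.0, xi=0.0):
    """Invert G: return x with G(x) = p, for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    w = -math.log(p)
    if xi == 0.0:
        return mu - sigma * math.log(w)
    return mu + sigma * (w ** (-xi) - 1.0) / xi


def gev_sample(n, mu=0.0, sigma=1.0, xi=0.0, seed=None):
    """Draw n GEV variates by applying gev_quantile to uniform draws."""
    rng = random.Random(seed)
    draws = []
    while len(draws) < n:
        u = rng.random()
        if u > 0.0:  # random() may return 0.0, which has no finite quantile
            draws.append(gev_quantile(u, mu, sigma, xi))
    return draws
```

Samples generated this way could be used to check how many regular observations a given fence rule flags as outliers for different values of ξ.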
The Generalized Pareto Distribution (GPD)
Another distribution for modeling extreme data is the Generalized Pareto Distribution (GPD). This approach treats the observations above (or below) a marked threshold u as the extreme data and leads to the following approximation. For a sufficiently large u, the distribution function of (X − u), conditional on X > u, is approximately
\[ H(y) = 1 - \left( 1 + \frac{\xi y}{\hat{\sigma}} \right)^{-1/\xi}, \]
defined on $\{y : y > 0 \text{ and } (1 + \xi y/\hat{\sigma}) > 0\}$, where $\hat{\sigma} = \sigma + \xi(u - \mu)$. See [11] for details about the GEV and GPD models.
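The threshold approach above can be sketched as follows. This is only an illustration under stated assumptions: `gpd_scale`, `gpd_cdf` and `excesses` are hypothetical helper names, and ξ = 0 is treated with the standard exponential limit 1 − exp(−y/σ̂).

```python
import math


def gpd_scale(sigma, xi, u, mu):
    """Scale of the excess distribution: sigma_hat = sigma + xi * (u - mu)."""
    return sigma + xi * (u - mu)


def gpd_cdf(y, sigma_hat, xi):
    """Evaluate H(y) for an excess y over the threshold."""
    if sigma_hat <= 0.0:
        raise ValueError("sigma_hat must be positive")
    if y <= 0.0:
        return 0.0
    if xi == 0.0:
        return 1.0 - math.exp(-y / sigma_hat)
    t = 1.0 + xi * y / sigma_hat
    if t <= 0.0:
        # Only reachable for xi < 0: y lies beyond the upper end point.
        return 1.0
    return 1.0 - t ** (-1.0 / xi)


def excesses(data, u):
    """Return the excesses x - u of the observations above the threshold u."""
    return [x - u for x in data if x > u]
```

For example, `excesses(data, u)` extracts the values that the GPD is meant to describe, and `gpd_cdf` can then be compared with their empirical distribution.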
ADJUSTMENT OF BOXPLOT FENCES FOR EXTREME DATA
The methodology of this work takes its guiding principle from Tukey's standard boxplot and from the adjusted boxplot for skewed distributions proposed by [9].
Tukey's Standard Boxplot
The region [Q1 − 1.5 IQR, Q3 + 1.5 IQR] gives Tukey's fences for the standard boxplot, where Q1 and Q3 are the first and third quartiles respectively and IQR is the interquartile range. Any univariate observation that does not fall within this region is considered a potential outlier. For data from a normal distribution, approximately 0.7% of the observations fall outside this region and are marked as outliers. In practice, we can change the theoretical percentage of Gaussian data that is flagged as outliers by changing the constant factor (1.5) that multiplies the IQR in the expression for the region. The adjusted factor that leads to a theoretical detection rate equal to α (0 ≤ α ≤ 0.5) is
\[ c(\alpha) = \frac{z_{1-\alpha/2} - z_{0.75}}{z_{0.75} - z_{0.25}}, \]
where $z_p$, 0 ≤ p ≤ 1, is the order-p quantile of the standard normal distribution. For example, c(0.05) ≈ 0.95, so with a fence constant of 0.95 we can reasonably expect about 5% of data from a normal distribution to be flagged as outliers. Fixing the fence constant in this way affects the efficiency of the outlier rule for asymmetric distributions such as the GEV family.
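The relation between the fence constant and the Gaussian detection rate can be checked numerically. The sketch below uses Python's standard `statistics.NormalDist`; the names `fence_constant` and `outside_rate` are hypothetical, and α = 0 is excluded because the corresponding quantile is infinite.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and standard deviation 1


def fence_constant(alpha):
    """c(alpha): fence constant giving a Gaussian outside rate of alpha."""
    if not 0.0 < alpha <= 0.5:
        raise ValueError("alpha must satisfy 0 < alpha <= 0.5")
    z = _STD_NORMAL.inv_cdf
    return (z(1.0 - alpha / 2.0) - z(0.75)) / (z(0.75) - z(0.25))


def outside_rate(c):
    """Proportion of Gaussian data outside [Q1 - c*IQR, Q3 + c*IQR]."""
    q3 = _STD_NORMAL.inv_cdf(0.75)
    iqr = 2.0 * q3  # Q1 = -Q3 for the standard normal
    upper_fence = q3 + c * iqr
    # Both tails contribute equally by symmetry.
    return 2.0 * (1.0 - _STD_NORMAL.cdf(upper_fence))
```

Here `fence_constant(0.05)` reproduces the value of about 0.95 quoted above, and `outside_rate(1.5)` gives the roughly 0.7% rate of the standard fences.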
Hubert's Adjusted Boxplot
The limitation of the standard boxplot was addressed by [9] by adjusting the fence constant according to the extent of asymmetry in the data. They proposed the following bounds for the regular observations:
\[
\begin{cases}
[\,Q_1 - 1.5\,e^{-4MC}\,\mathrm{IQR},\; Q_3 + 1.5\,e^{3MC}\,\mathrm{IQR}\,] & \text{if } MC \ge 0, \\
[\,Q_1 - 1.5\,e^{-3MC}\,\mathrm{IQR},\; Q_3 + 1.5\,e^{4MC}\,\mathrm{IQR}\,] & \text{if } MC < 0,
\end{cases}
\]
where MC is a robust measure of skewness called the medcouple, proposed by [8]. For a given sample $\{x_1, \ldots, x_n\}$, MC is given by
\[ MC = \operatorname*{med}_{x_i \le Q_2 \le x_j} h(x_i, x_j), \]
where $Q_2$ is the sample median and $h$ is the kernel function
\[ h(x_i, x_j) = \frac{(x_j - Q_2) - (Q_2 - x_i)}{x_j - x_i}. \]
More details on the properties of the kernel function h are given in [8]. The measure is bounded below by −1 and above by 1. MC equals zero when the observed distribution is symmetric, and it is positive for a right-skewed distribution and negative for a left-skewed one.
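The medcouple and the adjusted fences described above can be sketched directly from their definitions. This is a naive O(n²) illustration, not the algorithm of [8] or [9]: pairs with x_i = x_j are simply skipped (the special treatment of observations tied with the median is omitted), the quartiles use the "inclusive" convention of `statistics.quantiles`, which is an assumption, and the names `medcouple` and `adjusted_fences` are hypothetical.

```python
import math
import statistics


def medcouple(data):
    """Naive medcouple: median of h(xi, xj) over all pairs xi <= Q2 <= xj."""
    xs = sorted(data)
    q2 = statistics.median(xs)
    lower = [x for x in xs if x <= q2]
    upper = [x for x in xs if x >= q2]
    values = [
        ((xj - q2) - (q2 - xi)) / (xj - xi)
        for xi in lower
        for xj in upper
        if xj != xi  # simplification: skip pairs where the kernel is 0/0
    ]
    if not values:
        raise ValueError("medcouple is undefined for constant data")
    return statistics.median(values)


def adjusted_fences(data):
    """Lower and upper fences of the MC-adjusted boxplot."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    mc = medcouple(data)
    # Exponents follow the two cases given in the text.
    lo_exp, up_exp = (-4.0, 3.0) if mc >= 0 else (-3.0, 4.0)
    lower = q1 - 1.5 * math.exp(lo_exp * mc) * iqr
    upper = q3 + 1.5 * math.exp(up_exp * mc) * iqr
    return lower, upper
```

For symmetric data the medcouple is zero and `adjusted_fences` reduces to Tukey's fences.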
Fences Adjustment with Bowley Coefficient of Skewness
The robust skewness measure sometimes called the Bowley coefficient was proposed by [12] and is defined as
\[ \delta = \frac{Q_3 + Q_1 - 2Q_2}{Q_3 - Q_1}, \]
where the $Q_i$'s are the three boxplot quartiles. δ satisfies the benchmark requirements for a reasonable skewness measure as categorized by [13]:
(i) δ is location-scale invariant, that is, for random variables X and Y with Y = aX + b, a ∈ (0, ∞) and b ∈ (−∞, ∞), δ(X) = δ(Y);
(ii) for a symmetric distribution, δ = 0;
(iii) δ is sign equivariant, that is, if Y = −X then δ(Y) = −δ(X);
(iv) if F and G are the c.d.f.'s of X and Y as above and F [...] = 0).
The assumption here was that we can simply exchange the estimated parameters of the model to obtain the coefficients for a model with negative skewness (δ < 0). The linear and exponential fits are presented in Figure 1 and Figure 2 respectively. From the residuals and the clustering of points around the resistance line, we can see that the exponential model is the better of the two. The resistance fit for the exponential models of equation 6 is illustrated in Figure 2.
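The Bowley coefficient δ defined in this section needs only the three quartiles, so it is simple to compute. The sketch below is illustrative: the name `bowley_skewness` is hypothetical and the "inclusive" quartile convention of `statistics.quantiles` is an assumption, since the text does not fix one.

```python
import statistics


def bowley_skewness(data):
    """Bowley coefficient: (Q3 + Q1 - 2*Q2) / (Q3 - Q1)."""
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    if q3 == q1:
        raise ValueError("IQR is zero, so the coefficient is undefined")
    return (q3 + q1 - 2.0 * q2) / (q3 - q1)
```

The invariance properties can be checked on a small sample: applying `bowley_skewness` to `[2 * v + 3 for v in x]` returns the same value as for `x`, while `[-v for v in x]` flips its sign.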
The Proposed Modified Boxplot for Extreme Data
The half-slope ratios for the linear fit in Figure 1 were 0.87 and 3.74 for the lower and upper fences respectively, which indicates that the relation is not linear, especially for the upper fence. The exponential fit has half-slope ratios of 0.78 and 0.69 for the lower and upper fences respectively, values reasonably close to one, as expected for a proper linear relation. The estimated parameters of the exponential model in equation 6 are m = −3.06 and n = 5.52. We then rounded off the estimates to m = −3 and n = 5 for robustness. However, since we assumed earlier that the model is for the
[Figure 1 panels: "Fitted linear model for lower fence", "Residuals plot lower fence", "Fitted linear model for upper fence", "Residuals plot upper fence". Axis labels: Skewness, LHS of equation 4.5 (I), LHS of equation 4.5 (II), Fitted Values, Residuals.]
FIGURE 1. Resistance fit for the linear models in equation 5
[Figure 2 panels: "Fitted exponential model for lower fence", "Residuals plot lower fence", "Fitted exponential model for upper fence", "Residuals plot upper fence". Axis labels: Skewness, LHS of equation 4.6, LHS of equation 4.6 (II), Fitted Values, Residuals.]
FIGURE 2. Resistance fit for the exponential models in equation 6
selected population values with nonnegative skewness, we swap the values of m and n for negative skewness. We then summarise the proposed fence mark-up for the new boxplot by replacing $f_l(\delta)$ and $f_u(\delta)$ in equation 1 with the new fences as follows.
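\[
\begin{cases}
Q_1 - 1.5\,e^{-3\delta}\,\mathrm{IQR} & \text{for the lower fence} \\
Q_3 + 1.5\,e^{5\delta}\,\mathrm{IQR} & \text{for the upper fence}
\end{cases}
\quad \text{for } \delta \ge 0,
\]
and
\[
\begin{cases}
Q_1 - 1.5\,e^{5.0\delta}\,\mathrm{IQR} & \text{for the lower fence} \\
Q_3 + 1.5\,e^{-3.5\delta}\,\mathrm{IQR} & \text{for the upper fence}
\end{cases}
\quad \text{for } \delta < 0. \tag{7}
\]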
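The fence rule in equation 7 can be sketched as below. This is a hedged illustration, not the authors' implementation. The function name `modified_fences` and its parameters are hypothetical, and the quartiles use the "inclusive" convention of `statistics.quantiles` as an assumption. The defaults cover only the δ ≥ 0 case, for which m = −3 and n = 5 were estimated. The exponents for δ < 0 must be supplied by the caller from the second case of equation 7.

```python
import math
import statistics


def modified_fences(data, pos_exps=(-3.0, 5.0), neg_exps=None, k=1.5):
    """Fences Q1 - k*exp(a*delta)*IQR and Q3 + k*exp(b*delta)*IQR.

    pos_exps = (a, b) is used when delta >= 0; neg_exps = (a, b) is used
    when delta < 0 and has no default in this sketch.
    """
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero, so delta is undefined")
    delta = (q3 + q1 - 2.0 * q2) / iqr  # Bowley coefficient
    if delta >= 0:
        lo_exp, up_exp = pos_exps
    elif neg_exps is not None:
        lo_exp, up_exp = neg_exps
    else:
        raise ValueError("delta < 0: pass neg_exps from equation 7")
    lower = q1 - k * math.exp(lo_exp * delta) * iqr
    upper = q3 + k * math.exp(up_exp * delta) * iqr
    return lower, upper, delta
```

For the small right-skewed sample `[1, 2, 3, 5, 9, 14, 30]`, Tukey's upper fence is 25 under this quartile convention, so the value 30 would be flagged, whereas the modified upper fence lies well above 30 and keeps it as a regular observation.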