Vol. 73 | No. 3 | Mar 2017
International Journal of Sciences and Research
PERCENTILE BASED HISTOGRAM BIN WIDTH

GUVEN M. GUNVER (Corresponding Author), Department of Biostatistics, Cerrahpasa Faculty of Medicine, Istanbul University, Turkey, +905325939078, [email protected]
MUSTAFA S. SENOCAK, Department of Biostatistics, Cerrahpasa Faculty of Medicine, Istanbul University, Turkey, +905353596506, [email protected]
ERAY YURTSEVEN, Department of Public Health, Cerrahpasa Faculty of Medicine, Istanbul University, Turkey, +905367464817, [email protected]
Abstract

It is often possible to observe a large portion of the data concentrated in a narrow region distant from the median, especially in skewed data stacks. Existing histogram binning methods pay little attention to this situation. Moreover, current bin width rules such as Scott's normal reference rule [Scott (1979)] and the Freedman–Diaconis rule [Freedman and Diaconis (1981)] contain parameters such as σ̂ and/or n, which make the histogram peak vary with the skewness and/or sample size of the data stack. This study suggests an alternative ad hoc approach that aims to generate robust histogram peaks irrespective of the sample size, skewness, or standard deviation of the data stack.

Keywords: Binning, Density Estimation, Percentiles

Introduction

Current approaches to histogram bin width calculation contain parameters that inflate the bin width for skewed data. Scott's normal reference rule, widely accepted as the benchmark of histogram binning, contains σ̂, which grows unnecessarily large in skewed data stacks. Because the formula below is so strongly driven by the standard deviation, ĥ is calculated larger than desired in skewed data stacks:

ĥ = 3.49 σ̂ n^(−1/3)

The Freedman–Diaconis rule, a modified version of Scott's normal reference rule suited for skewed data stacks, replaces σ̂ with the interquartile range but, like Scott's rule, still contains n:

ĥ = 2 IQR n^(−1/3)
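For concreteness, both rules are easy to compute. The following is our illustrative sketch in Python with NumPy (the function names and code are ours, not from the cited papers):

```python
import numpy as np

def scott_bin_width(x):
    """Scott's normal reference rule: h = 3.49 * sigma_hat * n^(-1/3)."""
    x = np.asarray(x, dtype=float)
    return 3.49 * x.std(ddof=1) * len(x) ** (-1 / 3)

def freedman_diaconis_bin_width(x):
    """Freedman-Diaconis rule: h = 2 * IQR * n^(-1/3)."""
    x = np.asarray(x, dtype=float)
    q25, q75 = np.percentile(x, [25, 75])
    return 2 * (q75 - q25) * len(x) ** (-1 / 3)
```

Both formulas shrink as n grows, which is precisely the behaviour questioned below.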
As members of the medical world, we aimed to develop a new approach to the histogram bin after being questioned by a clinician: "If the histogram bin is a matter of density, why should it be decreased by sample size?" Her next comment on the current approaches (Scott's normal reference rule and the Freedman–Diaconis rule) was telling: she stated that she did not understand the rationale behind them, and that the essential outcome of statistics on continuous data, the histogram, should be understandable by any scientist irrespective of his or her statistical knowledge.

Methodology

Herein, we present an empiric method for ĥ that is built on a percentile mesh; the procedure is laid out in Table 1.
Table 1. Methodology

 i    Sequential Difference (SeDi): P(4i) − P(4(i−1))    Correction
 1    P4 − P0          If (P4 − P0) = 0 ⇒ Max(SeDi), else (P4 − P0)
 2    P8 − P4          If (P8 − P4) = 0 ⇒ Max(SeDi), else (P8 − P4)
 3    P12 − P8         If (P12 − P8) = 0 ⇒ Max(SeDi), else (P12 − P8)
 ...  ...              ...
 23   P92 − P88        If (P92 − P88) = 0 ⇒ Max(SeDi), else (P92 − P88)
 24   P96 − P92        If (P96 − P92) = 0 ⇒ Max(SeDi), else (P96 − P92)
 25   P100 − P96       If (P100 − P96) = 0 ⇒ Max(SeDi), else (P100 − P96)
ĥ = Min(Correction)

With the formula above, we seek the narrowest interval in the data stack that contains 4 percent of the data. As is well known, the median and its surroundings are where the histogram peak is most likely to appear, especially in symmetric data stacks. Four is the smallest divisor of 100 that does not divide 50, so no interval boundary coincides with the median; that is why we step through the percentiles in increments of four. If a data stack contains a single value repeated more than 4% of n times, the SeDi column may contain zeros. Since ĥ cannot equal zero, we added a new column (Correction) and replaced every zero with the maximum sequential difference between the percentiles. The minimum of the corrected column, that is, the narrowest distance greater than zero when the percentiles are stepped through by fours, is taken as the histogram bin width ĥ. As this is a simple, solid, and empiric approach, we do not propose a mathematical rationale for it. We name this new approach Pb (percentile bin).
Fig. 1 Methodology
The scale in Fig. 1 represents the percentiles of a sample data stack. By the formula presented above, ĥ is the distance between the 52nd and 56th percentiles, because that distance is the narrowest interval containing 4 percent of the data when the percentiles are stepped through by fours.
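The procedure in Table 1 is short enough to implement directly. Below is a minimal sketch, assuming NumPy's default (linearly interpolated) percentiles; the paper does not specify a percentile estimator, so that choice and the function name are our assumptions:

```python
import numpy as np

def percentile_bin(x):
    """Pb (percentile bin): a sketch of the Table 1 procedure.

    Takes the sequential differences (SeDi) of P0, P4, ..., P100,
    replaces zeros with Max(SeDi) (the Correction column), and
    returns the minimum of the corrected column as h_hat.
    """
    x = np.asarray(x, dtype=float)
    p = np.percentile(x, np.arange(0, 101, 4))          # P0, P4, ..., P100
    sedi = np.diff(p)                                   # 25 sequential differences
    correction = np.where(sedi == 0, sedi.max(), sedi)  # zeros -> Max(SeDi)
    return correction.min()                             # h_hat = Min(Correction)
```

Note that the correction step only matters when some single value is repeated in more than 4% of the observations; that case is revisited in the Limitations section.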
Examples

The outcomes of calculating ĥ by the three formulas (Scott's normal reference rule, the Freedman–Diaconis rule, and Pb) are presented in this section. The sample data stacks used as examples are:

1. Real dataset [Cortez et al (2009)], http://archive.ics.uci.edu/ml/datasets/Wine+Quality
2. Simulated data [1]: x̄ = 119.69, ŝ = 15.23, n = 1,000
3. Simulated data [2]: x̄ = 120.01, ŝ = 15.00, n = 1,000,000

Table 2. Examples. Histogram bin ĥ and percentage of data that the histogram peak contains, per rule (Scott = Scott's normal reference rule, F–D = Freedman–Diaconis rule, Pb = percentile bin). Asterisks mark Pb peaks containing more than 8% of the data (see Limitations).

Example              Variable               ĥ Scott   ĥ F–D    ĥ Pb     Peak % Scott   Peak % F–D   Peak % Pb
White wines          Fixed acidity          0.173     0.118    0.1      11.21%         10.33%       6.29%
(n = 4,898)          Volatile acidity       0.021     0.013    0.01     9.98%          9.82%        5.47%
                     Citric acid            0.025     0.014    0.01     15.60%         10.35%       6.27%
                     Residual sugar         1.043     0.966    0.1      23.60%         20.23%       3.88%
                     Chlorides              0.004     0.002    0.001    17.68%         7.57%        4.10%
                     Free sulfur dioxide    3.496     2.709    1        9.58%          7.59%        3.27%
                     Total sulfur dioxide   8.736     6.948    4        8.96%          7.10%        4.25%
                     Density                0.0006    0.0005   0.0003   8.43%          6.86%        4.31%
                     pH                     0.031     0.022    0.01     8.70%          7.70%        3.51%
                     Sulfates               0.024     0.017    0.01     12.13%         8.47%        5.08%
                     Alcohol rate           0.253     0.224    0.1      12.07%         9.43%        4.76%
Red wines            Fixed acidity          0.52      0.359    0.1      18.89%         12.51%       4.19%
(n = 1,599)          Volatile acidity       0.054     0.043    0.01     13.38%         10.38%       3.13%
                     Citric acid            0.058     0.056    0.01     18.39%         18.39%       8.26%*
                     Residual sugar         0.421     0.12     0.1      33.52%         16.32%       9.88%*
                     Chlorides              0.014     0.003    0.001    37.90%         12.95%       4.13%
                     Free sulfur dioxide    3.127     2.394    1        19.64%         16.57%       8.63%*
                     Total sulfur dioxide   9.82      6.841    2        18.57%         13.32%       4.25%
                     Density                0.0006    0.0004   0.0001   14.26%         10.01%       4.94%
                     pH                     0.046     0.032    0.01     13.95%         11.01%       3.56%
                     Sulfates               0.051     0.031    0.01     18.82%         11.76%       4.32%
                     Alcohol rate           0.318     0.274    0.1      19.01%         15.95%       8.88%*
Simulated data [1]   (n = 1,000)            5.318     4.034    1.28     15.10%         11.20%       4.30%
Simulated data [2]   (n = 1,000,000)        0.524     0.405    1.5      1.40%          1.09%        4.01%
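The "percentage of data that histogram peak contains" in Table 2 is the share of observations falling into the fullest bin. A possible reading of that computation in Python (anchoring the bin edges at the sample minimum is our assumption; the paper does not state its anchoring convention):

```python
import numpy as np

def peak_percentage(x, h):
    """Percentage of observations in the fullest histogram bin of width h."""
    x = np.asarray(x, dtype=float)
    edges = np.arange(x.min(), x.max() + h, h)  # bin edges anchored at min(x)
    counts, _ = np.histogram(x, bins=edges)
    return 100 * counts.max() / len(x)
```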
Conclusions

As seen in the Examples section, the histogram peak varies with n (for both Scott's normal reference rule and the Freedman–Diaconis rule) and with ŝ (for Scott's normal reference rule). For white wines, the average percentage of data that the histogram peak contains is 12.54% for Scott's normal reference rule, 9.59% for the Freedman–Diaconis rule, and 4.65% for Pb. For red wines, the averages are 20.58%, 13.56%, and 5.74%, respectively. The reason for this increase from white wines to red wines, which holds for both Scott's normal reference rule and the Freedman–Diaconis rule, is the decrease in sample size: both rules respond to sample size through the cube root (∛(4,898 / 1,599) = 1.45), so as n fell from 4,898 to 1,599, the percentage of data in the histogram peak rose correspondingly (20.58% / 12.54% = 1.64 for Scott's normal reference rule; 13.56% / 9.59% = 1.41 for the Freedman–Diaconis rule).

Likewise, simulated data [1] and simulated data [2] are both normally distributed and driven by the same parameters (μ = 120, σ = 15); their only difference is sample size. As the sample size grew 1,000 times (a thousand to a million), both the histogram bin and the percentage of data in the histogram peak decreased 10 times, since 1,000^(1/3) = 10. It can easily be claimed that a data stack simulated with the same parameters and a billion samples would have its histogram bin and peak percentage decrease 100 times relative to simulated data [1]. The average percentage of data in the histogram peak also increased from white wines to red wines for Pb (5.74% / 4.65% = 1.23), but the rationale behind this increase is not the decrease in sample size; it is explained in the Limitations section.

The standard deviation is very influential in Scott's normal reference rule, especially for skewed data stacks. Residual sugar of white wines, residual sugar of red wines, and chlorides of red wines are highly skewed data stacks with a long tail to the right of the mean. As their standard deviations grow because of this skewness, their histogram peaks come to contain more than 20% of the data, which causes some anomalies housed in the data stacks to be overlooked. We should also point out that the Freedman–Diaconis rule, although a modification of Scott's normal reference rule intended to suit skewed data stacks, fails in some cases: for residual sugar of white wines, the histogram peak produced by the Freedman–Diaconis bin still contains more than 20% of the data.

In contrast to the two existing methods, Pb generates suitable histogram bins. The bins generated by Pb make the anomalies in a data stack visible, while avoiding an unnecessary decrease in the histogram bin when the data stack contains no anomalies (see the histogram bins and peak percentages of simulated data [1] and simulated data [2]). More importantly, it is a simple approach that can be understood by any practitioner of statistics.
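The sample-size effect described above is easy to check numerically. A small sketch, reusing the helper functions defined earlier (the seed and exact outputs are ours, so the values will differ somewhat from Table 2):

```python
import numpy as np

rng = np.random.default_rng(0)
for n in (1_000, 1_000_000):
    x = rng.normal(120, 15, n)  # same mu = 120, sigma = 15 as the simulated data
    print(n,
          round(scott_bin_width(x), 3),              # shrinks ~10x as n grows 1,000x
          round(freedman_diaconis_bin_width(x), 3),  # likewise, via n^(-1/3)
          round(percentile_bin(x), 3))               # stays near 1.3-1.5 regardless of n
```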
Limitations

As mentioned in the Methodology section, if a data stack contains a single value repeated more than 4% of n times, the sequential difference (SeDi) column may contain zeros. We added a new column (Correction) to avoid this situation. However, this situation also exposes a weakness in our definition: when a data stack contains a single value repeated more than 4% of n times, the percentage of data that the histogram peak contains increases correspondingly, because the continuity of the data is interrupted by that single value. We suggest that when the histogram bin generated by Pb yields a peak containing more than 8% of the data (marked with asterisks in Table 2), it is certain that the SeDi column contains at least one zero; in such cases, variations on the histogram [Denby and Mallows (2009)] would be more appropriate.
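To make the failure mode concrete, the following toy example (entirely our own construction, reusing the percentile_bin and peak_percentage sketches above) plants a single value in 10% of the observations; the first sequential differences vanish, the Correction column rescues ĥ, but the resulting peak is inflated beyond the 8% threshold:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([np.full(100, 3.0),       # one value repeated in 10% of n
                    rng.normal(8, 2, 900)])  # a continuous remainder
h = percentile_bin(x)            # P4 - P0 = P8 - P4 = 0, so the Correction applies
print(h, peak_percentage(x, h))  # the peak bin captures the repeated value, exceeding 8%
```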
References

[1] Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4), 547-553.
[2] Denby, L., & Mallows, C. (2009). Variations on the histogram. Journal of Computational and Graphical Statistics, 18(1), 21-31.
[3] Freedman, D., & Diaconis, P. (1981). On the histogram as a density estimator: L2 theory. Probability Theory and Related Fields, 57(4), 453-476.
[4] Scott, D. W. (1979). On optimal and data-based histograms. Biometrika, 66(3), 605-610.