World Environmental and Water Resources Congress 2010: Challenges of Change. © 2010 ASCE
Comparison of the Probability Plot Correlation Coefficient Test Statistics for the General Extreme Value Distribution

Sooyoung Kim1, Jun-Haeng Heo2

1 Ph.D. student, School of Civil and Environmental Engineering, Yonsei University, 134 Shinchon-Dong, Seoul, 120-749, Korea; Tel (+82-2)393-1597; Fax (+82-2)393-1597; E-mail: [email protected]
2 School of Civil and Environmental Engineering, Yonsei University, 134 Shinchon-Dong, Seoul, 120-749, Korea; Tel (+82-2)2123-2805; Fax (+82-2)364-5300; E-mail: [email protected]

Abstract

In frequency analysis, a proper probability distribution for estimating a quantile is selected by goodness-of-fit tests. Among these tests, the probability plot correlation coefficient (PPCC) test is known to be powerful and easy to apply. In general, the PPCC test statistics are affected by the significance level, the sample size, the plotting position formula, and, when the distribution has one, the shape parameter. It is therefore important to select an accurate plotting position formula when deriving the PPCC test statistics for a given probability distribution. Since Cunnane (1978) defined the plotting position related to the mean of the data, many studies have developed plotting position formulas that account for the coefficient of skewness, which is related to the shape parameter. In this study, the PPCC test statistics are derived by using a plotting position formula developed from theoretical reduced variates, with a term in the coefficient of skewness, for the general extreme value (GEV) distribution. The PPCC test statistics are estimated for various sample sizes, significance levels, and shape parameters of the GEV distribution. The performance of the derived PPCC test statistics is evaluated by estimating the rejection rate of the population through Monte Carlo simulation.

1. INTRODUCTION

Selecting an appropriate probability distribution is very important in frequency analysis for the estimation of design quantiles. An appropriate probability distribution is
generally selected on the basis of a goodness-of-fit test, which examines how well sample data fit the population of a given probability distribution. Various goodness-of-fit tests, such as the Kolmogorov-Smirnov test, the Cramer-von Mises test, and the chi-square test, have been developed in the literature. Among them, the probability plot correlation coefficient (PPCC) test, developed by Filliben (1975) as a normality test, is known to be a powerful test. Since then, the PPCC test has been applied to various probability distributions. Vogel (1986) derived the PPCC test statistics for the Gumbel distribution, and Vogel and Kroll (1989) applied the PPCC test to the 2-parameter Weibull and uniform distributions for low-flow frequency analysis. Vogel and McMartin (1991) computed the PPCC test statistics at the 5% significance level for the gamma distribution, and Chowdhury et al. (1991) calculated the PPCC test statistics for the GEV distribution. Heo et al. (2007) proposed regression equations to estimate the test statistics for several probability distributions.

In this study, 100,000 samples from the general extreme value (GEV) distribution were generated to derive the PPCC test statistics for various sample sizes n (from 10 to 500), significance levels (from 0.005 to 0.995), and shape parameters. The PPCC test statistics are derived by using a plotting position formula developed from theoretical reduced variates, with a term in the coefficient of skewness, for the GEV distribution, and by using various existing plotting position formulas such as Blom (1958), Gringorten (1963), Filliben (1975), and Cunnane (1978). In addition, Monte Carlo simulation is performed to select an appropriate plotting position formula for the assumed probability distributions.

2. THE DERIVATION OF THE PPCC TEST STATISTICS

2.1 The GEV distribution

The GEV distribution is defined by Eqs. (1) and (2) (Jenkinson, 1955).

$$F(x) = \exp\left[-\left\{1 - k(x-u)/\alpha\right\}^{1/k}\right], \quad k \neq 0 \tag{1}$$

$$F(x) = \exp\left[-\exp\left\{-(x-u)/\alpha\right\}\right], \quad k = 0 \tag{2}$$

where u is a location parameter, α is a scale parameter, and k is a shape parameter.

2.2 The derivation of plotting position formula regarding the reduced variates

After Cunnane (1978) defined the plotting position related to the mean of the data and proposed a general formula that can be applied to various probability distributions,
many researchers have derived plotting position formulas that consider the influence of the coefficient of skewness, which is related to the shape parameter. The mean of the rth smallest value in a random sample of size n is defined as follows (Arnell et al., 1986).

$$E[x_r] = \frac{n!}{(r-1)!\,(n-r)!} \int_{-\infty}^{\infty} x_r\, F(x_r)^{r-1} [1-F(x_r)]^{n-r} f(x_r)\, dx_r \tag{3}$$

Substituting the reduced variates of the GEV distribution into Eq. (3), the theoretical reduced variates based on the mean concept are expressed as

$$E[y_{2r}] = \frac{n!}{(r-1)!\,(n-r)!} \int_{0}^{1} (-\ln F)^{k} F^{r-1} [1-F]^{n-r}\, dF \tag{4}$$

$$E[y_{3r}] = -\frac{n!}{(r-1)!\,(n-r)!} \int_{0}^{1} (-\ln F)^{k} F^{r-1} [1-F]^{n-r}\, dF \tag{5}$$
where Eqs. (4) and (5) are the reduced variates for the EV2 and EV3 distributions, respectively. In addition, this study adopted a genetic algorithm to estimate the parameters of the plotting position formula. The objective function of the real-coded genetic algorithm (RGA), which is one type of genetic algorithm, is the root mean square error (RMSE). In the RGA, the population size is 1,000, the number of generations is 2,000, the crossover probability is 0.8, and the mutation probability is 0.01.

Table 1. The RMSEs from various plotting position formulas

| Shape parameter | Coeff. of skewness | Sample size | Derived | Gringorten | Cunnane | Arnell et al. | In-na and Nguyen | Goel and De |
|---|---|---|---|---|---|---|---|---|
| -0.20 | 3.53507 | 10 | 0.0101 | 0.0144 | 0.0183 | - | 0.0156 | 0.0213 |
| | | 20 | 0.0080 | 0.0104 | 0.0140 | 0.3087 | 0.0121 | 0.0166 |
| | | 30 | 0.0071 | 0.0086 | 0.0119 | 0.1765 | 0.0106 | 0.0143 |
| | | 40 | 0.0064 | 0.0077 | 0.0108 | 0.1402 | 0.0096 | 0.0129 |
| | | 50 | 0.0059 | 0.0070 | 0.0099 | 0.1162 | 0.0088 | 0.0119 |
| | | 100 | 0.0047 | 0.0054 | 0.0077 | 0.0701 | 0.0070 | 0.0093 |
| 0.10 | 0.6376 | 10 | 0.0017 | 0.0036 | 0.0026 | 0.0813 | 0.0022 | 0.0032 |
| | | 20 | 0.0011 | 0.0021 | 0.0015 | 0.0358 | 0.0016 | 0.0022 |
| | | 30 | 0.0009 | 0.0016 | 0.0011 | 0.0249 | 0.0013 | 0.0018 |
| | | 40 | 0.0008 | 0.0015 | 0.0010 | 0.0199 | 0.0011 | 0.0015 |
| | | 50 | 0.0007 | 0.0013 | 0.0010 | 0.0165 | 0.0011 | 0.0015 |
| | | 100 | 0.0007 | 0.0011 | 0.0008 | 0.0095 | 0.0008 | 0.0010 |
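As an illustration of how an RMSE of the kind reported in Table 1 can be computed, the following Python sketch evaluates the theoretical reduced variate of Eq. (4) and compares it with the reduced variates implied by a plotting position formula. It is not the authors' implementation. The integral is evaluated with a closed-form alternating series obtained by substituting F = exp(-t), which is an assumption of this sketch and is reliable only for small samples (roughly n ≤ 15 in double precision). The function names are hypothetical, and the output is not expected to reproduce the tabulated values.

```python
import math

def expected_reduced_variate(r, n, k):
    """Eq. (4): mean of (-ln F)^k for the r-th smallest of n values (k > -1).

    Assumption of this sketch: with F = exp(-t) the integral is expanded
    binomially into an alternating series. Cancellation between the terms
    limits the accuracy to small n.
    """
    coef = r * math.comb(n, r)  # equals n! / ((r-1)! (n-r)!)
    terms = [(-1) ** j * math.comb(n - r, j) * (r + j) ** (-(1.0 + k))
             for j in range(n - r + 1)]
    return coef * math.gamma(1.0 + k) * math.fsum(terms)

def rmse_reduced_variates(n, k, positions):
    """RMSE between Eq. (4) and (-ln p_i)^k for plotting positions p_i."""
    squared = [(expected_reduced_variate(r, n, k) - (-math.log(p)) ** k) ** 2
               for r, p in enumerate(positions, start=1)]
    return math.sqrt(math.fsum(squared) / n)

def cunnane_positions(n):
    """Cunnane (1978) plotting positions, p_i = (i - 0.4) / (n + 0.2)."""
    return [(i - 0.4) / (n + 0.2) for i in range(1, n + 1)]

if __name__ == "__main__":
    n, k = 10, -0.20
    print(rmse_reduced_variates(n, k, cunnane_positions(n)))
```

For larger samples, a numerical quadrature of Eq. (4) would be a safer choice than the series used here.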
The RMSEs between the theoretical reduced variates and those calculated by the derived plotting formula and by other plotting formulas are compared in Table 1 to assess the accuracy of the plotting positions. On this basis, the derived plotting formula is judged applicable to the GEV distribution and is proposed as Eq. (6).

$$P_i = \frac{i - 0.3155}{n + 0.0050\gamma^{2} - 0.0902\gamma + 0.3074} \tag{6}$$

where γ is the coefficient of skewness.

2.3 The basic concept of the PPCC test

The PPCC test was developed by Filliben (1975) as a normality test. The test assesses goodness of fit by using the correlation coefficient r between the ordered observations $X_i$ and the corresponding fitted quantiles $M_i$, which are determined by the plotting position $p_i$ for each $X_i$. The correlation coefficient r is defined by Eq. (7).

$$r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(M_i - \bar{M})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^{2} \sum_{i=1}^{n} (M_i - \bar{M})^{2}}} \tag{7}$$

where $\bar{X}$ and $\bar{M}$ represent the mean values of the observations $X_i$ and the fitted quantiles $M_i$, respectively, and n is the sample size. If the value of r is close to 1.0, the observations can be regarded as drawn from the fitted distribution. The estimate of $M_i$ based on the order statistic medians proposed by Filliben (1975) is as follows.
$$M_i = \phi^{-1}(m_i) \tag{8}$$

where $\phi^{-1}(\cdot)$ is the inverse of the cumulative distribution function of the standard normal distribution and the median value $m_i$ is given by

$$m_i = 1 - (0.5)^{1/n}, \quad i = 1 \tag{9a}$$

$$m_i = (i - 0.3175)/(n + 0.365), \quad i = 2, 3, \ldots, n-1 \tag{9b}$$

$$m_i = (0.5)^{1/n}, \quad i = n \tag{9c}$$
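Eqs. (7) to (9) can be sketched in a few lines of Python for the normality case. This is a minimal illustration using only the standard library, not code from the study. The function names are hypothetical, and `statistics.NormalDist().inv_cdf` stands in for $\phi^{-1}$.

```python
import math
from statistics import NormalDist

def filliben_medians(n):
    """Order statistic medians m_i of Eqs. (9a)-(9c)."""
    m = [(i - 0.3175) / (n + 0.365) for i in range(1, n + 1)]
    m[-1] = 0.5 ** (1.0 / n)   # Eq. (9c), i = n
    m[0] = 1.0 - m[-1]         # Eq. (9a), i = 1
    return m

def correlation(x, m):
    """Correlation coefficient r of Eq. (7)."""
    xbar, mbar = sum(x) / len(x), sum(m) / len(m)
    sxm = sum((a - xbar) * (b - mbar) for a, b in zip(x, m))
    sxx = sum((a - xbar) ** 2 for a in x)
    smm = sum((b - mbar) ** 2 for b in m)
    return sxm / math.sqrt(sxx * smm)

def normal_ppcc(sample):
    """PPCC of a sample against the standard normal quantiles of Eq. (8)."""
    std_normal = NormalDist()
    fitted = [std_normal.inv_cdf(p) for p in filliben_medians(len(sample))]
    return correlation(sorted(sample), fitted)

if __name__ == "__main__":
    data = [12.1, 9.8, 10.4, 11.7, 8.9, 10.0, 10.9, 9.3, 11.2, 10.6]
    print(normal_ppcc(data))
```

Because r is invariant to location and scale, the sample does not need to be standardized before it is compared with the standard normal quantiles.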
The null hypothesis for a given sample cannot be rejected at the significance level q if the following condition holds.

$$r > r_q(n) \tag{10}$$
where $r_q(n)$ is the test statistic of the PPCC test for a given sample size n and significance level q.

2.4 The derivation of the PPCC test statistics

Vogel and McMartin (1991) provided the following procedure for deriving the PPCC test statistics:
(a) Generate a sample $X_i$ ($i = 1, \ldots, n$) of size n from an assumed parent distribution with given shape parameters.
(b) Calculate $M_i$ using the inverse of the cumulative distribution function and the plotting position.
(c) Estimate the correlation coefficient r between the generated sample $X_i$ and the calculated values $M_i$.
(d) Repeat steps (a) to (c) 100,000 times to obtain 100,000 correlation coefficients r.
(e) Select the (100,000 × q)th smallest r as $r_q$.

This procedure was applied to derive the PPCC test statistics for the GEV distribution. The conditions used in this study are as follows.
· Sample sizes (n): 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 120, 150, 200, 300, 500
· Significance levels: 0.005, 0.01, 0.05, 0.1, 0.5, 0.9, 0.95, 0.99, and 0.995
· Applied plotting position formula: derived plotting formula
· Shape parameters: -0.3, -0.2, -0.1, -0.05, 0.05, 0.1, 0.2, 0.3

2.5 Results of PPCC test statistics

The estimated test statistics are plotted in Figure 1. Figure 1 shows that the test statistics increase as the sample size increases. In particular, the test statistics increase steeply up to a sample size of about 50, regardless of the shape parameter. The test statistics for negative shape parameters are spread widely between 0.8 and 1.0, whereas those for positive shape parameters lie between 0.88 and 1.0. In addition, the spread of the test statistics is greater for negative shape parameters than for positive ones, and the test statistics for positive shape parameters approach each other as the sample size increases. The test statistics derived by various plotting formulas, namely the derived plotting formula, Blom (1958), Gringorten (1963), Filliben (1975), and
Cunnane (1978) are shown in Figure 2. The plotting formulas other than the derived plotting formula and Filliben's formula, which are given in Eqs. (6) and (9), are listed in Table 2. Broadly, the test statistics for negative shape parameters differ across sample sizes, whereas those for positive shape parameters differ little over all sample sizes. In addition, the test statistics from the derived plotting formula are greater than the others when the sample size exceeds about 60 to 70. In other cases, the test statistics from Gringorten (1963) are greater than the others.
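The Monte Carlo procedure of Section 2.4, steps (a) to (e), can be sketched in Python as follows, using the derived plotting formula of Eq. (6) and the inverse of Eq. (1). This is a minimal sketch rather than the code used in the study. The function names are hypothetical, and the GEV skewness is computed from the standard gamma-function moment expression, which is not given in the paper and is assumed here (valid for k ≠ 0 and k > -1/3).

```python
import math
import random

def gev_skewness(k):
    """Skewness of the GEV distribution (assumed formula, k != 0, k > -1/3)."""
    g1, g2, g3 = (math.gamma(1.0 + j * k) for j in (1, 2, 3))
    num = -g3 + 3.0 * g1 * g2 - 2.0 * g1 ** 3
    return math.copysign(1.0, k) * num / (g2 - g1 ** 2) ** 1.5

def gev_quantile(p, k, u=0.0, alpha=1.0):
    """Inverse of Eq. (1) for k != 0."""
    return u + alpha * (1.0 - (-math.log(p)) ** k) / k

def derived_positions(n, skew):
    """Plotting positions of Eq. (6)."""
    denom = n + 0.0050 * skew ** 2 - 0.0902 * skew + 0.3074
    return [(i - 0.3155) / denom for i in range(1, n + 1)]

def correlation(x, m):
    """Correlation coefficient r of Eq. (7)."""
    xbar, mbar = sum(x) / len(x), sum(m) / len(m)
    sxm = sum((a - xbar) * (b - mbar) for a, b in zip(x, m))
    sxx = sum((a - xbar) ** 2 for a in x)
    smm = sum((b - mbar) ** 2 for b in m)
    return sxm / math.sqrt(sxx * smm)

def open_uniform(rng):
    """Uniform draw in (0, 1); rejects the (very unlikely) value 0.0."""
    p = rng.random()
    while p == 0.0:
        p = rng.random()
    return p

def ppcc_critical_values(n, k, levels, reps=100_000, seed=0):
    """Steps (a)-(e): critical values r_q for sample size n and shape k."""
    rng = random.Random(seed)
    # Step (b): fitted quantiles M_i depend only on n and k.
    fitted = [gev_quantile(p, k) for p in derived_positions(n, gev_skewness(k))]
    r_values = []
    for _ in range(reps):  # step (d)
        sample = sorted(gev_quantile(open_uniform(rng), k) for _ in range(n))  # (a)
        r_values.append(correlation(sample, fitted))  # step (c)
    r_values.sort()
    # Step (e): the (reps * q)-th smallest r is taken as r_q.
    return {q: r_values[max(0, math.ceil(q * reps) - 1)] for q in levels}

if __name__ == "__main__":
    print(ppcc_critical_values(20, 0.1, [0.01, 0.05, 0.10], reps=5000))
```

As a consistency check, `gev_skewness(-0.20)` and `gev_skewness(0.10)` give approximately 3.535 and 0.638, in line with the coefficients of skewness listed in Table 1. Because r does not depend on location or scale, the samples are generated with u = 0 and α = 1.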
[Figure 1 plots: test statistics versus sample size n (0 to 200) for significance levels 0.01, 0.05, and 0.1. Panels: (a) Shape parameter is -0.2; (b) Shape parameter is 0.2.]

Figure 1. The results of PPCC test statistics by derived plotting formula
[Figure 2 plots: test statistics versus sample size n (0 to 200) for the Derived, Gringorten, Cunnane, Filliben, and Blom plotting formulas. Panels: (a) Shape parameter is -0.2; (b) Shape parameter is -0.1; (c) Shape parameter is 0.1; (d) Shape parameter is 0.2.]

Figure 2. The results of PPCC test statistics by various plotting formulas
Table 2. The recommended plotting position formulas

| Name | Plotting position formula |
|---|---|
| Blom (1958) | $p_i = (i - 3/8)/(n + 1/4)$ |
| Gringorten (1963) | $p_i = (i - 0.44)/(n + 0.12)$ |
| Cunnane (1978) | $p_i = (i - 0.4)/(n + 0.2)$ |
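The three formulas in Table 2 share the general form $p_i = (i - a)/(n + 1 - 2a)$, with a = 3/8, 0.44, and 0.4, respectively. The short Python sketch below uses that observation; the names `plotting_positions`, `TABLE2_CONSTANTS`, and `table2_positions` are hypothetical and introduced only for illustration.

```python
def plotting_positions(n, a):
    """General form p_i = (i - a) / (n + 1 - 2a) for i = 1, ..., n."""
    return [(i - a) / (n + 1.0 - 2.0 * a) for i in range(1, n + 1)]

# Constants a that reproduce the denominators n + 1/4, n + 0.12, and n + 0.2.
TABLE2_CONSTANTS = {"Blom": 3.0 / 8.0, "Gringorten": 0.44, "Cunnane": 0.40}

def table2_positions(name, n):
    """Plotting positions for one of the formulas listed in Table 2."""
    return plotting_positions(n, TABLE2_CONSTANTS[name])

if __name__ == "__main__":
    for name in TABLE2_CONSTANTS:
        print(name, [round(p, 4) for p in table2_positions(name, 5)])
```

Positions of this form are symmetric about 0.5, that is, $p_i + p_{n+1-i} = 1$. The derived formula of Eq. (6) does not have this form because its denominator depends on the coefficient of skewness.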
3. POWER TEST

3.1 Procedure of power test

In this study, a power test was performed by Monte Carlo simulation to select an appropriate plotting position formula for the PPCC test statistics of the GEV distribution. The procedure of the power test for a given parent distribution with various shape parameters, sample sizes, probability distributions, and plotting position formulas is as follows:
(a) Assume the GEV distribution as the parent distribution.
(b) Generate data sets with the given shape parameters for various sample sizes and plotting position formulas. Here the power test was performed for sample sizes n = 10, 25, 50, 100, and 200 and for the derived plotting formula and the formulas of Blom (1958), Gringorten (1963), Filliben (1975), and Cunnane (1978). The assumed shape parameters are -0.3, -0.2, -0.1, 0.1, 0.2, and 0.3.
(c) Apply frequency analysis to each generated data set. The method of probability weighted moments is used for parameter estimation. Then apply the PPCC test with the different plotting position formulas to the generated data set at the 5% significance level.
(d) Repeat steps (a) to (c) 10,000 times.
(e) Evaluate the rejection ability by computing the rejection ratio (%) over the 10,000 repetitions.

3.2 Results of power test

The results of the power tests are shown in Fig. 3. For a sample size of 25, the rejection ratios for negative shape parameters are above the 5% and 10% significance levels, respectively, regardless of the plotting position formula. However, the rejection ratios for positive shape parameters at the same sample size are
under the 5% and 10% significance levels, regardless of the plotting position formula. In addition, for a sample size of 100, the rejection ratios for shape parameters of -0.3 and -0.2 are above both significance levels, whereas those for the other shape parameters are below both. Accordingly, the rejection ratios are sensitive to the value of the shape parameter when the sample size is relatively small. The rejection ratios at the 0.05 significance level are shown in Table 3. The rejection ratio of the derived plotting position formula is higher than the others for large sample sizes (100, 200) and some shape parameters (-0.3~0.1), but the rejection ability of the Gringorten (1963) plotting position formula is higher than the others in the other cases. These results are related to the values of the PPCC test statistics estimated by the derived plotting formula and by Gringorten (1963).
[Figure 3 plots: the % of rejection versus plotting position formula (Derived, Gringorten, Blom, Filliben, Cunnane) for shape parameters -0.3, -0.2, -0.1, 0.1, 0.2, and 0.3. Panels: (a) Sample size = 25, significance level = 0.05; (b) Sample size = 100, significance level = 0.05; (c) Sample size = 25, significance level = 0.10; (d) Sample size = 100, significance level = 0.10.]
Figure 3. The results of power test

Table 3. The comparison of rejection ratio (%) in case of significance level 5%

| Sample size | Plotting formula | Shape = -0.3 | Shape = -0.2 | Shape = -0.1 | Shape = 0.1 | Shape = 0.2 | Shape = 0.3 |
|---|---|---|---|---|---|---|---|
| 10 | Derived | 81 | 74.81 | 67.3 | 47.49 | 37.3 | 29.79 |
| | Gringorten | 81.25 | 75.13 | 67.73 | 48.08 | 38.22 | 30.92 |
| | Blom | 81.13 | 74.92 | 67.37 | 47.49 | 37.36 | 29.81 |
| | Filliben | 80.93 | 74.72 | 67.08 | 46.88 | 36.68 | 28.91 |
| | Cunnane | 81.15 | 74.98 | 67.51 | 47.69 | 37.65 | 30.2 |
| 25 | Derived | 63.5 | 46.57 | 26.85 | 2.79 | 1.06 | 1.79 |
| | Gringorten | 63.52 | 46.6 | 26.87 | 2.85 | 1.16 | 1.91 |
| | Blom | 63.48 | 46.54 | 26.8 | 2.78 | 1.01 | 1.71 |
| | Filliben | 63.44 | 46.5 | 26.78 | 2.72 | 0.98 | 1.58 |
| | Cunnane | 63.5 | 46.55 | 26.83 | 2.8 | 1.06 | 1.75 |
| 50 | Derived | 56.78 | 30.95 | 9.37 | 0.51 | 0.86 | 1.86 |
| | Gringorten | 56.79 | 30.95 | 9.4 | 0.53 | 0.92 | 1.97 |
| | Blom | 56.78 | 30.94 | 9.38 | 0.48 | 0.78 | 1.83 |
| | Filliben | 56.78 | 30.94 | 9.39 | 0.49 | 0.77 | 1.78 |
| | Cunnane | 56.78 | 30.94 | 9.39 | 0.48 | 0.8 | 1.88 |
| 100 | Derived | 53.89 | 18.41 | 2.47 | 0.95 | 0.88 | 1.71 |
| | Gringorten | 53.88 | 18.34 | 2.37 | 0.93 | 0.91 | 1.74 |
| | Blom | 53.88 | 18.37 | 2.36 | 0.93 | 0.9 | 1.76 |
| | Filliben | 53.88 | 18.36 | 2.4 | 0.94 | 0.89 | 1.77 |
| | Cunnane | 53.88 | 18.35 | 2.36 | 0.94 | 0.9 | 1.72 |
| 200 | Derived | 50.52 | 9.67 | 2.2 | 1.44 | 1.25 | 2.02 |
| | Gringorten | 50.43 | 9.31 | 1.93 | 1.44 | 1.25 | 2.1 |
| | Blom | 50.43 | 9.31 | 1.95 | 1.48 | 1.18 | 2.08 |
| | Filliben | 50.43 | 9.32 | 1.98 | 1.56 | 1.2 | 2.07 |
| | Cunnane | 50.43 | 9.31 | 1.93 | 1.47 | 1.26 | 2.14 |
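Step (c) of the procedure in Section 3.1 estimates the GEV parameters by the method of probability weighted moments, but the estimator itself is not written out in the paper. The Python sketch below assumes the widely used approximation of Hosking et al. (1985), which is not cited in this paper, with the parameterization of Eq. (1). The function name `gev_pwm_fit` is hypothetical, and the Gumbel case (k = 0) is not handled.

```python
import math
import random

def gev_pwm_fit(sample):
    """Estimate (u, alpha, k) of Eq. (1) from sample PWMs.

    Assumption: shape approximation k ~ 7.8590 c + 2.9554 c^2 with
    c = 2 / (3 + t3) - ln 2 / ln 3 (Hosking et al., 1985), where t3 is
    the sample L-skewness. Not valid when the estimate of k is exactly 0.
    """
    x = sorted(sample)
    n = len(x)
    # Unbiased sample probability weighted moments b0, b1, b2.
    b0 = sum(x) / n
    b1 = sum((j - 1) / (n - 1) * x[j - 1] for j in range(1, n + 1)) / n
    b2 = sum((j - 1) * (j - 2) / ((n - 1) * (n - 2)) * x[j - 1]
             for j in range(1, n + 1)) / n
    # First three L-moments.
    l1, l2, l3 = b0, 2.0 * b1 - b0, 6.0 * b2 - 6.0 * b1 + b0
    c = 2.0 / (3.0 + l3 / l2) - math.log(2.0) / math.log(3.0)
    k = 7.8590 * c + 2.9554 * c ** 2
    g = math.gamma(1.0 + k)
    alpha = l2 * k / ((1.0 - 2.0 ** (-k)) * g)
    u = l1 - alpha * (1.0 - g) / k
    return u, alpha, k

if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic GEV sample with u = 100, alpha = 20, k = 0.1 (illustrative values).
    data = [100.0 + 20.0 * (1.0 - (-math.log(rng.random())) ** 0.1) / 0.1
            for _ in range(200)]
    print(gev_pwm_fit(data))
```

In a power test of the kind described above, the fitted shape parameter would then determine the fitted quantiles and the critical value against which the correlation coefficient r is compared.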
4. CONCLUSIONS

In this study, an accurate plotting position formula was derived by using a genetic algorithm. In addition, the PPCC test statistics for the GEV distribution were derived for various sample sizes, significance levels, shape parameters, and plotting formulas. A power test was performed by Monte Carlo simulation to select an appropriate plotting position formula. As a result, the rejection capability of the derived plotting position formula is higher than the others for large sample sizes and some shape parameters (-0.3~0.1), but the rejection ability of the Gringorten (1963) plotting position formula is higher than the others in the other cases.

5. ACKNOWLEDGEMENT

This study was financially supported by the Construction Technology Innovation Program (08-Tech-Inovation-F01) through the Research Center of Flood Defence
Technology for Next Generation in Korea Institute of Construction & Transportation Technology Evaluation and Planning (KICTEP) of Ministry of Land, Transport and Maritime Affairs (MLTM).

6. REFERENCES

Arnell, N. W., Beran, M., and Hosking, J. R. M. (1986). "Unbiased plotting positions for the general extreme value distribution." Journal of Hydrology, Vol. 86, pp. 59-69.
Blom, G. (1958). Statistical estimates and transformed beta variables. John Wiley and Sons, New York.
Chowdhury, J. D., Stedinger, J. R., and Lu, L. H. (1991). "Goodness-of-fit tests for regional generalized extreme value flood distributions." Water Resources Research, Vol. 27, No. 7, pp. 1765-1776.
Cunnane, C. (1978). "Unbiased plotting positions - A review." Journal of Hydrology, Vol. 37, No. 3/4, pp. 205-222.
Filliben, J. J. (1975). "The Probability Plot Correlation Coefficient Test for Normality." Technometrics, Vol. 17, No. 1, pp. 111-117.
Goel, N. K. and De, M. (1993). "Development of unbiased plotting position formula for General Extreme Value distribution." Stochastic Environmental Research and Risk Assessment, Vol. 7, pp. 1-13.
Gringorten, I. I. (1963). "A plotting rule for extreme probability paper." Journal of Geophysical Research, Vol. 68, No. 3, pp. 813-814.
Heo, J., Kho, Y., Shin, H., Kim, S., and Kim, T. (2007). "Regression Equations of Probability Plot Correlation Coefficient Test Statistics from Several Probability Distributions." Journal of Hydrology, Vol. 355, No. 1-4, pp. 1-15.
In-na, N. and Nguyen, V-T-V. (1989). "An unbiased plotting position formula for the generalized extreme value distribution." Journal of Hydrology, Vol. 106, pp. 193-209.
Jenkinson, A. F. (1955). "The frequency distribution of the annual maximum (or minimum) values of meteorological elements." Quarterly Journal of the Royal Meteorological Society, Vol. 87, pp. 158-171.
Vogel, R. M. (1986). "The probability plot correlation coefficient test for the normal, lognormal, and Gumbel distributional hypothesis." Water Resources Research, Vol. 22, No. 4, pp. 587-590.
Vogel, R. M. and Kroll, C. N. (1989). "Low-flow frequency analysis using probability plot correlation coefficients." Journal of Water Resources Planning and Management, Vol. 115, No. 3, pp. 338-357.
Vogel, R. M. and McMartin, D. E. (1991). "Probability plot goodness-of-fit and skewness estimation procedures for the Pearson type distribution." Water Resources Research, Vol. 27, No. 12, pp. 3149-3158.