International Workshop ADVANCES IN STATISTICAL HYDROLOGY May 23-25, 2010 Taormina, Italy

DERIVATION OF THE PROBABILITY PLOT CORRELATION COEFFICIENT TEST STATISTICS FOR THE GENERALIZED LOGISTIC DISTRIBUTION by

Sooyoung Kim(1), Hongjoon Shin(2), Taesoon Kim(3) and Jun-Haeng Heo(4)

(1) Department of Civil and Environmental Engineering, Yonsei University, Korea ([email protected])
(2) Department of Civil and Environmental Engineering, Yonsei University, Korea ([email protected])
(3) Korea Hydro and Nuclear Power Co., Ltd. ([email protected])
(4) Department of Civil and Environmental Engineering, Yonsei University, Korea ([email protected])

ABSTRACT

The selection of an appropriate probability distribution is very important in hydrology for estimating design rainfall accurately. An appropriate probability distribution is generally chosen by using goodness-of-fit tests, among which the PPCC test is known to be powerful. The PPCC test statistics are generally calculated by considering significance levels, sample sizes, plotting position formulas, and the shape parameters of a given probability distribution. It is important to select a suitable plotting position formula for a given probability distribution because the PPCC test statistics are defined from correlation coefficients based on the selected plotting position formula. After Cunnane (1978) defined the unbiased plotting position related to the mean of the data and proposed a general formula applicable to various probability distributions, various plotting position formulas have been developed that consider the effect of the coefficient of skewness, which is related to the shape parameter of a given distribution. In this study, the PPCC test statistics for the generalized logistic distribution (GLO) are derived by using a plotting position formula that contains a coefficient-of-skewness term to reflect the effect of the shape parameter. The test statistics are derived for various sample sizes, significance levels, and shape parameters of the GLO. A power test is then performed to estimate the rejection ratios of the derived PPCC test statistics, and the derived plotting position formula is compared with other plotting position formulas.

Keywords: generalized logistic distribution, probability plot correlation coefficient test, test statistics, plotting position formula

1 INTRODUCTION

In frequency analysis using annual maximum rainfall or flood data, an appropriate probability distribution provides an accurate design quantile for hydrologic structures such as banks, retention facilities, and dams. Therefore, the selection of an appropriate probability distribution is very important for reasonable design. Generally, the selection is based on a goodness-of-fit test, which is a decision-making method for evaluating the fit between sample data and the assumed population distribution. Many goodness-of-fit tests have been developed in the literature; the Kolmogorov-Smirnov test, the Cramer-von Mises test, and the chi-square test are especially popular. The Probability Plot Correlation Coefficient (PPCC) test developed by Filliben (1975) is known as a powerful test among the many goodness-of-fit tests, and it has since been applied to various probability distributions. Looney and Gulledge (1985) applied various plotting position formulas to the normal distribution and chose the plotting position formula of Blom (1958) for the derivation of the normal PPCC test statistics. Vogel (1986) proposed the PPCC test statistics for the Gumbel distribution, and Vogel and Kroll (1989) derived the PPCC test statistics for the 2-parameter Weibull and uniform distributions in frequency analysis of low flow data. In addition, the PPCC test statistics at the 5% significance level for the gamma distribution were studied by Vogel and McMartin (1991), and the PPCC test statistics for the GEV distribution were provided by Chowdhury et al. (1991). Recently, Heo et al. (2007) proposed regression equations to estimate the test statistics for the normal, gamma, Gumbel, GEV, and Weibull distributions. In this study, the PPCC test statistics for the generalized logistic distribution are derived by using Monte Carlo simulation. To estimate the test statistics of the generalized logistic distribution, a plotting position formula that contains a coefficient-of-skewness term is developed by using the genetic algorithm.

Kim et al., Derivation of the Probability Plot Correlation Coefficient Test Statistics for the Generalized Logistic Distribution


In addition, the PPCC test statistics are derived by using a plotting position formula that contains a coefficient-of-skewness term reflecting the effect of the shape parameter of the generalized logistic distribution (GLO). A power test is then performed to estimate the rejection ratios of the derived PPCC test statistics, and the derived plotting position formula is compared with other plotting position formulas such as Blom (1958), Gringorten (1963), Filliben (1975), and Cunnane (1978).

2 DERIVATION OF THE PPCC TEST STATISTICS

2.1 The generalized logistic distribution

The generalized logistic distribution is defined by Eq. (1) (Hosking, 1986).

$$F(x) = \left[ 1 + \left\{ 1 - k \left( \frac{x - x_0}{\alpha} \right) \right\}^{1/k} \right]^{-1} \qquad (1)$$

where $x_0$ is a location parameter, $\alpha$ is a scale parameter, and $k$ is a shape parameter. Then, $x_0 + \alpha/k \le x < \infty$ for $k < 0$ and $-\infty < x \le x_0 + \alpha/k$ for $k > 0$.
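As an illustration of Eq. (1), the following minimal Python sketch evaluates the distribution function and its inverse. The function names `glo_cdf` and `glo_quantile` are hypothetical, and the quantile expression is obtained here by solving Eq. (1) for x; it is not written out in the paper.

```python
import math

def glo_cdf(x, x0, alpha, k):
    """Distribution function of Eq. (1); assumes alpha > 0 and k != 0."""
    z = 1.0 - k * (x - x0) / alpha
    if z <= 0.0:
        # Outside the support: at or below the lower bound (k < 0)
        # or at or above the upper bound (k > 0).
        return 0.0 if k < 0 else 1.0
    return 1.0 / (1.0 + z ** (1.0 / k))

def glo_quantile(p, x0, alpha, k):
    """Inverse of Eq. (1) for 0 < p < 1 (derived by solving Eq. (1) for x)."""
    return x0 + (alpha / k) * (1.0 - ((1.0 - p) / p) ** k)

if __name__ == "__main__":
    x = glo_quantile(0.9, x0=100.0, alpha=20.0, k=-0.2)
    print(x, glo_cdf(x, 100.0, 20.0, -0.2))  # the second value recovers p
```

The location, scale, and shape values in the usage lines are arbitrary illustration values, not data from the study.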

The generalized logistic distribution was recommended for regional flood frequency analysis in England by the Flood Estimation Handbook (Institute of Hydrology, 1999).

2.2 The derivation of plotting position formula using genetic algorithm

Since Cunnane (1978) discussed the unbiased plotting position related to the mean of the sample data and proposed a general formula for various probability distributions, many researchers have developed plotting position formulas that consider the influence of the coefficient of skewness, which is related to the shape parameter of a given probability distribution. However, those studies were restricted to several probability distributions such as the GEV, Gumbel, and log-Pearson type Ⅲ distributions. For the generalized logistic distribution, the formula of Gringorten (1963) was recommended as an appropriate plotting position formula by the FEH (Institute of Hydrology, 1999). Therefore, this research proposes a plotting position formula for the generalized logistic distribution based on the concept of the unbiased plotting position related to the mean of the reduced variates. The expected value of the r-th smallest value in a random sample of size n is defined as follows,

$$E[x_r] = \int_{-\infty}^{\infty} \frac{n!}{(r-1)!\,(n-r)!}\, x_r\, F(x_r)^{r-1} \left[ 1 - F(x_r) \right]^{n-r} f(x_r)\, dx_r \qquad (2)$$

The reduced variates of the generalized logistic distribution are assumed as follows,

$$y_1 = \left( \frac{1-F}{F} \right)^{k} \qquad (3)$$

$$y_2 = -\left( \frac{1-F}{F} \right)^{k} \qquad (4)$$

where Eqs. (3) and (4) are the reduced variates for negative ($k < 0$) and positive ($k > 0$) shape parameters, respectively, so that $0 < y_1 < \infty$ and $-\infty < y_2 < 0$. Substituting the reduced variates into Eq. (2) gives the theoretical reduced variates in Eqs. (5) and (6).

$$E[y_1] = \int_{0}^{1} \frac{n!}{(r-1)!\,(n-r)!} \left( \frac{1-F}{F} \right)^{k} F^{r-1} \left[ 1 - F \right]^{n-r} dF \qquad (5)$$

$$E[y_2] = -\int_{0}^{1} \frac{n!}{(r-1)!\,(n-r)!} \left( \frac{1-F}{F} \right)^{k} F^{r-1} \left[ 1 - F \right]^{n-r} dF \qquad (6)$$
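The integrals in Eqs. (5) and (6) can be evaluated numerically. The sketch below is one possible way to do so, not the authors' implementation: it uses the observation that the integrand is a Beta-function kernel, so the magnitude equals Γ(r−k)Γ(n−r+1+k) / (Γ(r)Γ(n−r+1)) whenever r−k > 0 and n−r+1+k > 0. This closed form is a standard identity that the paper does not state, and it is cross-checked here against a plain midpoint-rule integration. All function names are hypothetical.

```python
import math

def expected_magnitude(r, n, k):
    """Magnitude of Eqs. (5)-(6) via the Beta-function identity (an assumption, see lead-in)."""
    if r - k <= 0 or n - r + 1 + k <= 0:
        raise ValueError("the integral diverges for this (r, n, k)")
    log_value = (math.lgamma(r - k) + math.lgamma(n - r + 1 + k)
                 - math.lgamma(r) - math.lgamma(n - r + 1))
    return math.exp(log_value)

def expected_reduced_variate(r, n, k):
    """E[y1] for k < 0 (Eq. 5) or E[y2] for k > 0 (Eq. 6)."""
    sign = 1.0 if k < 0 else -1.0
    return sign * expected_magnitude(r, n, k)

def expected_magnitude_numeric(r, n, k, steps=100_000):
    """Midpoint-rule evaluation of the same integral, used only as a cross-check."""
    coef = math.factorial(n) / (math.factorial(r - 1) * math.factorial(n - r))
    h = 1.0 / steps
    total = 0.0
    for j in range(steps):
        F = (j + 0.5) * h
        total += ((1.0 - F) / F) ** k * F ** (r - 1) * (1.0 - F) ** (n - r)
    return coef * total * h

if __name__ == "__main__":
    print(expected_reduced_variate(3, 10, -0.3))
    print(expected_magnitude_numeric(3, 10, -0.3))
```

The midpoint rule converges slowly for the extreme ranks (r = 1 or r = n) because the integrand can be singular at the endpoints, which is why the closed form is preferred in this sketch.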


In addition, this study adopted the real-coded genetic algorithm (RGA), a variant of the genetic algorithm, to estimate the parameters of the plotting position formula from the theoretical reduced variates of the generalized logistic distribution. The objective function of the RGA is the root mean square error (RMSE) between the theoretical reduced variates and those calculated from the parameterized plotting position formula. The population size is 1,000, the number of generations is 2,000, the crossover probability is 0.8, and the mutation probability is 0.01. The RGA is run 10 times in total because of the influence of the seed number (0.123). The derived plotting position formula is examined by comparing its RMSE with the RMSEs between the theoretical reduced variates and those calculated from other plotting position formulas such as Blom (1958), Gringorten (1963), and Cunnane (1978). The results for several coefficients of skewness are shown in Table I. According to Table I, the RMSEs of the plotting position formula derived in this study are the smallest over all sample sizes and coefficients of skewness. As a result, the reduced variates calculated by the derived formula are more accurate than those calculated by the other plotting position formulas for the generalized logistic distribution.

Table I - The comparison of RMSE from various plotting position formulas

Shape       Coefficient     Sample    Plotting position formulas
parameter   of skewness     size      Derived    Blom      Gringorten   Cunnane
-0.3         10.90355         5       0.0794     0.1124    0.0908       0.1043
                             10       0.0682     0.0984    0.0761       0.0901
                             20       0.0590     0.0859    0.0650       0.0782
                             30       0.0544     0.0793    0.0595       0.0719
                             50       0.0489     0.0718    0.0536       0.0650
                            100       0.0426     0.0625    0.0464       0.0566
-0.1          0.93667         5       0.0057     0.0109    0.0070       0.0091
                             10       0.0039     0.0087    0.0049       0.0071
                             20       0.0026     0.0071    0.0038       0.0057
                             30       0.0021     0.0061    0.0032       0.0049
                             50       0.0016     0.0050    0.0026       0.0041
                            100       0.0011     0.0039    0.0019       0.0031
 0.1         -0.93667         5       0.0057     0.0109    0.0070       0.0091
                             10       0.0039     0.0087    0.0049       0.0071
                             20       0.0026     0.0071    0.0038       0.0057
                             30       0.0021     0.0061    0.0032       0.0049
                             50       0.0016     0.0050    0.0026       0.0041
                            100       0.0011     0.0039    0.0019       0.0031
 0.3        -10.90355         5       0.0794     0.1124    0.0908       0.1043
                             10       0.0682     0.0984    0.0761       0.0901
                             20       0.0590     0.0859    0.0650       0.0782
                             30       0.0544     0.0793    0.0595       0.0719
                             50       0.0489     0.0718    0.0536       0.0650
                            100       0.0426     0.0625    0.0464       0.0566
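The RMSE objective described above can be sketched as follows for the three existing formulas. This is an illustrative sketch under stated assumptions rather than the authors' code: the theoretical reduced variates are evaluated with the Beta-function form Γ(r−k)Γ(n−r+1+k) / (Γ(r)Γ(n−r+1)) of Eqs. (5) and (6), which is an identity not stated in the paper, and the RMSE is taken over all ranks r = 1, ..., n, which the paper does not spell out. The values it prints are therefore not claimed to reproduce Table I.

```python
import math

# (a, b) in p_i = (i - a) / (n + b); standard published constants.
FORMULAS = {
    "Blom": (0.375, 0.25),
    "Gringorten": (0.44, 0.12),
    "Cunnane": (0.4, 0.2),
}

def theoretical_variate(r, n, k):
    """Magnitude of Eqs. (5)-(6) via the Beta-function identity (assumed, see lead-in)."""
    return math.exp(math.lgamma(r - k) + math.lgamma(n - r + 1 + k)
                    - math.lgamma(r) - math.lgamma(n - r + 1))

def rmse(n, k, a, b):
    """RMSE over ranks 1..n between theoretical and plotting-position reduced variates."""
    total = 0.0
    for r in range(1, n + 1):
        p = (r - a) / (n + b)
        approx = ((1.0 - p) / p) ** k  # magnitude of Eq. (3) or (4) at F = p
        total += (approx - theoretical_variate(r, n, k)) ** 2
    return math.sqrt(total / n)

if __name__ == "__main__":
    for name, (a, b) in FORMULAS.items():
        print(name, round(rmse(10, -0.1, a, b), 4))
```

Because the sign in Eq. (4) applies to both the theoretical and the approximated variate, working with magnitudes leaves the RMSE unchanged. For formulas with b = 1 − 2a, which holds for all three above, this RMSE is identical for shape parameters k and −k, which agrees with the mirrored Blom, Gringorten, and Cunnane columns of Table I.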

The plotting position formula derived in this study is given by Eq. (7).

$$P_i = \frac{i + 0.0176\gamma - 0.4855}{n + 0.0290} \qquad (7)$$

where $n$ is the sample size, $i$ is the rank (order) of the observation, and $\gamma$ is the coefficient of skewness of the sample data.
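Eq. (7) translates directly into code. The short sketch below uses a hypothetical function name and simply evaluates the formula for every rank of a sample; the skewness value in the usage lines is the one listed in Table I for a shape parameter of -0.1.

```python
def derived_plotting_position(i, n, gamma):
    """Eq. (7): i is the rank (1..n), n the sample size, gamma the coefficient of skewness."""
    return (i + 0.0176 * gamma - 0.4855) / (n + 0.0290)

if __name__ == "__main__":
    n = 10
    positions = [derived_plotting_position(i, n, 0.93667) for i in range(1, n + 1)]
    print([round(p, 4) for p in positions])
```

A positive skewness shifts every plotting position upward by the same amount, 0.0176γ / (n + 0.0290), and a negative skewness shifts it downward.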

2.3 The derivation of the PPCC test statistics

The PPCC test, which uses the correlation coefficient between the ordered observations and the corresponding fitted quantiles, was introduced by Filliben (1975) as a test for normality. The fitted quantiles are determined by the plotting position of each observation. The correlation coefficient CC is expressed as follows,

$$CC = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(M_i - \bar{M})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (M_i - \bar{M})^2}} \qquad (8)$$

where $\bar{X}$ and $\bar{M}$ represent the mean values of the observations $X_i$ and the fitted quantiles $M_i$, respectively, and $n$ is the sample size. If the correlation coefficient CC is close to 1.0, the observations can be described by the fitted probability distribution. The order statistic median used for $M_i$ by Filliben (1975) is given as follows.

$$M_i = \Phi^{-1}(m_i) \qquad (9)$$

where $\Phi^{-1}(\cdot)$ is the inverse of the cumulative distribution function of the standard normal distribution and the median value $m_i$ is given in Eqs. (10)~(12).

$$m_i = 1 - (0.5)^{1/n} \quad \text{for } i = 1 \qquad (10)$$

$$m_i = (i - 0.3175) / (n + 0.365) \quad \text{for } i = 2, 3, \ldots, n-1 \qquad (11)$$

$$m_i = (0.5)^{1/n} \quad \text{for } i = n \qquad (12)$$
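Eqs. (8)~(12) together define the normal PPCC of Filliben (1975). A minimal Python sketch, assuming only the standard library and using hypothetical function names, is:

```python
import math
from statistics import NormalDist

def filliben_medians(n):
    """Order statistic medians m_i of Eqs. (10)-(12); assumes n >= 3."""
    m = [(i - 0.3175) / (n + 0.365) for i in range(1, n + 1)]  # Eq. (11)
    m[-1] = 0.5 ** (1.0 / n)                                   # Eq. (12)
    m[0] = 1.0 - m[-1]                                         # Eq. (10)
    return m

def correlation(xs, ms):
    """Correlation coefficient CC of Eq. (8)."""
    n = len(xs)
    x_bar = sum(xs) / n
    m_bar = sum(ms) / n
    sxm = sum((x - x_bar) * (m - m_bar) for x, m in zip(xs, ms))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    smm = sum((m - m_bar) ** 2 for m in ms)
    return sxm / math.sqrt(sxx * smm)

def normal_ppcc(sample):
    """CC between the ordered sample and the fitted normal quantiles M_i of Eq. (9)."""
    xs = sorted(sample)
    ms = [NormalDist().inv_cdf(p) for p in filliben_medians(len(xs))]
    return correlation(xs, ms)

if __name__ == "__main__":
    print(normal_ppcc([4.1, 5.3, 2.8, 6.0, 4.7, 5.1, 3.9, 4.4]))
```

The eight values in the usage line are arbitrary illustration data. Because Eq. (8) is a correlation, the result does not depend on the location or scale of the fitted distribution, so the standard normal quantiles are sufficient.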

If the following condition is satisfied, the null hypothesis cannot be rejected at the significance level $q$.

$$r > r_q(n) \qquad (13)$$

where $r$ is the correlation coefficient computed from the sample (CC in Eq. (8)) and $r_q(n)$ is the test statistic of the PPCC test for a given sample size and significance level. As stated above, the PPCC test statistics are influenced by the significance level, the sample size, the plotting position formula, and the shape parameter of the fitted probability distribution. Therefore, applying a proper plotting position formula for the fitted probability distribution is important when estimating the PPCC test statistics. The plotting position formulas recommended for deriving the PPCC test statistics differ among probability distributions (Stedinger et al., 1993), and the recommended formulas are shown in Table II.

Table II - The recommended plotting position formulas

Type                Plotting position formula         Recommended probability distributions
Blom (1958)         p_i = (i - 3/8) / (n + 1/4)       Normal, gamma, lognormal, log-Pearson type Ⅲ
Gringorten (1963)   p_i = (i - 0.44) / (n + 0.12)     Gumbel, Weibull
Cunnane (1978)      p_i = (i - 0.4) / (n + 0.2)       GEV, log-Gumbel

The procedure to estimate the PPCC test statistics is as follows (Vogel and McMartin, 1991):

(a) Generate a sample X_i of size n (i = 1, ..., n) from an assumed parent distribution with given shape parameters,
(b) Calculate M_i using the inverse of the cumulative distribution function and the plotting position,
(c) Estimate the correlation coefficient between the generated sample X_i and the calculated plotting position values M_i,
(d) Repeat steps (a) to (c) 100,000 times to obtain 100,000 correlation coefficients,
(e) Select the (100,000 × q)-th smallest r as r_q.

This study applies the following conditions to derive the PPCC test statistics for the generalized logistic distribution:

- Sample sizes (n): 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 120, 150, 200, 300, and 500,
- Significance levels: 0.005, 0.01, 0.05, 0.1, 0.5, 0.9, 0.95, 0.99, and 0.995,
- Shape parameters: -0.3, -0.2, -0.1, -0.05, 0.05, 0.1, 0.2, and 0.3.

In this study, the derived plotting position formula for the generalized logistic distribution shown in Eq. (7) is used for the estimation of the test statistics. In addition, the test statistics obtained with the derived formula are compared with those obtained with existing plotting position formulas developed for other probability distributions, namely Blom (1958), Gringorten (1963), Filliben (1975), and Cunnane (1978).
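Steps (a) to (e) above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' program: the parent GLO is taken with location 0 and scale 1 (the correlation coefficient is unaffected by location and scale), samples are drawn by inverse-transform sampling of Eq. (1), and the coefficient of skewness in Eq. (7) is supplied by the caller because the paper does not fully specify how it is obtained during the simulation. All function names are hypothetical.

```python
import math
import random

def glo_std_quantile(p, k):
    """Inverse of Eq. (1) with x0 = 0 and alpha = 1; requires 0 < p < 1 and k != 0."""
    return (1.0 - ((1.0 - p) / p) ** k) / k

def correlation(xs, ms):
    """Correlation coefficient CC of Eq. (8)."""
    n = len(xs)
    x_bar, m_bar = sum(xs) / n, sum(ms) / n
    sxm = sum((x - x_bar) * (m - m_bar) for x, m in zip(xs, ms))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    smm = sum((m - m_bar) ** 2 for m in ms)
    return sxm / math.sqrt(sxx * smm)

def open_uniform(rng):
    """Uniform draw on the open interval (0, 1)."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u

def ppcc_critical_value(n, k, gamma, q, replicates=100_000, seed=1):
    rng = random.Random(seed)
    # (b) fitted quantiles M_i from the Eq. (7) plotting positions
    ms = [glo_std_quantile((i + 0.0176 * gamma - 0.4855) / (n + 0.0290), k)
          for i in range(1, n + 1)]
    rs = []
    for _ in range(replicates):                        # (d) repeat
        # (a) ordered sample of size n from the parent GLO
        xs = sorted(glo_std_quantile(open_uniform(rng), k) for _ in range(n))
        rs.append(correlation(xs, ms))                 # (c)
    rs.sort()
    # (e) the (replicates * q)-th smallest correlation coefficient
    return rs[max(int(replicates * q) - 1, 0)]

if __name__ == "__main__":
    # Fewer replicates than the 100,000 used in the study, to keep the run short.
    print(ppcc_critical_value(n=20, k=-0.1, gamma=0.93667, q=0.05, replicates=2_000))
```

With only a few thousand replicates the result is a rough estimate; the study uses 100,000 replicates for each combination of sample size, significance level, and shape parameter.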

2.4 The test statistics of the generalized logistic distribution

The values of the PPCC test statistics obtained with the derived plotting position formula are shown in Figure 1. Figure 1(a) shows the PPCC test statistics at the 5% significance level and Figure 1(b) shows those at the 10% significance level. The PPCC test statistics of the generalized logistic distribution are similar for negative and positive shape parameters of the same absolute value. For example, the test statistics for a shape parameter of -0.3 are similar to those for a shape parameter of +0.3. This tendency is caused by the symmetry of the coefficients of skewness related to the shape parameters of the generalized logistic distribution. The PPCC test statistics increase as the sample size and the significance level increase, and as the absolute value of the shape parameter decreases. The test statistics derived with various plotting position formulas, namely the derived formula, Blom (1958), Gringorten (1963), Filliben (1975), and Cunnane (1978), are plotted in Figure 2. The PPCC test statistics obtained with the derived plotting position formula are greater than the others over all sample sizes. In addition, the differences between the test statistics obtained with the derived formula and those obtained with the other formulas increase as the absolute value of the shape parameter increases.

[Figure 1: test statistics (vertical axis) versus sample size n from 0 to 500 (horizontal axis), with one curve for each shape parameter -0.3, -0.2, -0.1, -0.05, 0.05, 0.1, 0.2, and 0.3. Panels: (a) Significance level = 0.05; (b) Significance level = 0.10.]

Figure 1 – The PPCC test statistics with several significance levels by derived plotting position formula

[Figure 2: test statistics (vertical axis) versus sample size n from 0 to 500 (horizontal axis), with one curve for each plotting position formula (Derived, Blom, Gringorten, Filliben, Cunnane). Panels: (a) Shape parameter = -0.3; (b) Shape parameter = -0.2; (c) Shape parameter = +0.1; (d) Shape parameter = +0.2.]

Figure 2 – The comparison of the PPCC test statistics with various plotting position formulas.

3 POWER TEST

3.1 The procedure of power test

The performance of the test statistics obtained with various plotting position formulas is examined by a power test based on Monte Carlo simulation. The power test with various shape parameters, sample sizes, significance levels, and plotting position formulas for the generalized logistic distribution proceeds as follows:

(a) Assume that the parent probability distribution is the generalized logistic distribution,
(b) Generate data sets with given shape parameters for various sample sizes and plotting position formulas,
(c) Apply a general frequency analysis, including the PPCC test as the goodness-of-fit test, to the generated data sets at various significance levels,
(d) Repeat steps (a) to (c) 10,000 times,
(e) Evaluate the performance of the various plotting position formulas by calculating the rejection ratios, expressed as percentages.

This study sets up the following conditions for the power test:

- Sample sizes: 10, 25, 50, 100, and 200,
- Assumed shape parameters: -0.3, -0.2, -0.1, 0.1, 0.2, and 0.3,
- Applied plotting position formulas: Blom (1958), Gringorten (1963), Filliben (1975), Cunnane (1978), and the plotting position formula derived in this study,
- Significance levels: 0.05 (5%) and 0.1 (10%),
- Applied method of parameter estimation: PWM (Probability Weighted Moments).
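The power test procedure above can be sketched as follows. This is a minimal illustration under several assumptions that are not taken from the paper: the shape parameter is estimated from sample PWMs through the standard L-moment relation for the GLO (k = -t3, where t3 is the sample L-skewness); only the shape needs to be fitted because the correlation coefficient is invariant to location and scale; and the critical value `r_crit` must be supplied by the caller (the value in the usage lines is a hypothetical placeholder, not a value from the study). All function names are hypothetical.

```python
import math
import random

def glo_std_quantile(p, k):
    """GLO quantile with location 0 and scale 1; logistic limit when k is near 0."""
    y = (1.0 - p) / p
    return -math.log(y) if abs(k) < 1e-9 else (1.0 - y ** k) / k

def pwm_shape_estimate(xs_sorted):
    """Shape estimate k = -t3 from unbiased sample PWMs b0, b1, b2 (assumes n >= 3)."""
    n = len(xs_sorted)
    b0 = sum(xs_sorted) / n
    b1 = sum(j / (n - 1) * x for j, x in enumerate(xs_sorted)) / n
    b2 = sum(j * (j - 1) / ((n - 1) * (n - 2)) * x for j, x in enumerate(xs_sorted)) / n
    l2 = 2.0 * b1 - b0
    l3 = 6.0 * b2 - 6.0 * b1 + b0
    return -l3 / l2

def correlation(xs, ms):
    """Correlation coefficient between ordered observations and fitted quantiles."""
    n = len(xs)
    x_bar, m_bar = sum(xs) / n, sum(ms) / n
    sxm = sum((x - x_bar) * (m - m_bar) for x, m in zip(xs, ms))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    smm = sum((m - m_bar) ** 2 for m in ms)
    return sxm / math.sqrt(sxx * smm)

def open_uniform(rng):
    """Uniform draw on the open interval (0, 1)."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u

def rejection_ratio(n, k_true, r_crit, plotting_position, trials=10_000, seed=1):
    """Percentage of trials in which the PPCC test rejects the GLO hypothesis."""
    rng = random.Random(seed)
    rejected = 0
    for _ in range(trials):
        xs = sorted(glo_std_quantile(open_uniform(rng), k_true) for _ in range(n))
        k_hat = pwm_shape_estimate(xs)
        ms = [glo_std_quantile(plotting_position(i, n), k_hat) for i in range(1, n + 1)]
        if not correlation(xs, ms) > r_crit:  # fail to reject only when r > r_q
            rejected += 1
    return 100.0 * rejected / trials

def gringorten(i, n):
    return (i - 0.44) / (n + 0.12)

if __name__ == "__main__":
    # r_crit = 0.95 is a hypothetical placeholder, not a tabulated test statistic.
    print(rejection_ratio(n=50, k_true=-0.2, r_crit=0.95,
                          plotting_position=gringorten, trials=1_000))
```

Passing the plotting position as a function makes it straightforward to repeat the run for each formula being compared, each with its own critical value.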


3.2 The results of power test

The results of the power test are shown in Figure 3. The rejection ratios are computed as percentages by dividing the number of rejections by 10,000. The rejection ratios increase as the absolute value of the shape parameter decreases because the PPCC test statistics in those cases are higher than in the others. In addition, the rejection ratios obtained with the derived plotting position formula are generally higher than those obtained with the other plotting position formulas in all cases. Therefore, the test statistics obtained with the derived plotting position formula are effective for evaluating the fit between sample data and the parent distribution, the generalized logistic distribution.

[Figure 3: ratio of rejection in % from 0 to 10 (vertical axis) for each plotting position formula (Derived, Blom, Gringorten, Filliben, Cunnane; horizontal axis), with one series for each shape parameter -0.3, -0.2, -0.1, 0.1, 0.2, and 0.3. Panels: (a) Significance level = 0.05 and sample size = 50; (b) Significance level = 0.05 and sample size = 200; (c) Significance level = 0.10 and sample size = 50; (d) Significance level = 0.10 and sample size = 200.]

Figure 3 – The comparison of the rejection ratio(%) by various plotting position formulas

4 CONCLUSIONS

In this study, a plotting position formula for the generalized logistic distribution was derived by using the genetic algorithm and the theoretical reduced variates. In addition, the PPCC test statistics for the generalized logistic distribution were developed for various sample sizes, significance levels, shape parameters, and plotting position formulas, including the derived formula. The PPCC test statistics obtained with the derived plotting position formula were generally higher than those obtained with the other plotting position formulas. Power tests based on Monte Carlo simulation were performed to assess the goodness-of-fit performance of the various plotting position formulas. As a result, the test statistics obtained with the derived plotting position formula showed higher rejection ratios than those obtained with the other formulas for sample data from the generalized logistic distribution.

5 ACKNOWLEDGEMENT

This study was financially supported by the Construction Technology Innovation Program(08-TechInovation-F01) through the Research Center of Flood Defence Technology for Next Generation in Korea Institute of Construction & Transportation Technology Evaluation and Planning(KICTEP) of Ministry of Land, Transport and Maritime Affairs(MLTM).

6 REFERENCES

Blom, G. (1958). Statistical estimates and transformed beta variables. John Wiley and Sons, New York.
Chowdhury, J.D., Stedinger, J.R., and Lu, L.H. (1991). Goodness-of-fit tests for regional generalized extreme value flood distributions. Water Resources Research, 27(7): 1765-1776.
Cunnane, C. (1978). Unbiased plotting positions - A review. Journal of Hydrology, 37(3/4): 205-222.
Filliben, J.J. (1975). The Probability Plot Correlation Coefficient Test for Normality. Technometrics, 17(1): 111-117.
Gringorten, I.I. (1963). A plotting rule for extreme probability paper. Journal of Geophysical Research, 68(3): 813-814.
Heo, J., Kho, Y., Shin, H., Kim, S., and Kim, T. (2007). Regression Equations of Probability Plot Correlation Coefficient Test Statistics from Several Probability Distributions. Journal of Hydrology, 355(1-4): 1-15.
Hosking, J.R.M. and Wallis, J.R. (1986a). Paleoflood hydrology and flood frequency analysis. Water Resources Research, 22(4): 543-550.
Hosking, J.R.M. and Wallis, J.R. (1986b). The value of historical data in flood frequency analysis. Water Resources Research, 22(11): 1606-1612.
Institute of Hydrology (1999). Flood Estimation Handbook. Wallingford, UK.
Looney, S.W. and Gulledge, T.R. (1985). Use the correlation coefficient with normal probability plots. The American Statistician, 39(1): 78-79.
Stedinger, J.R., Vogel, R.M., and Foufoula-Georgiou, E. (1993). Frequency analysis of extreme events - chapter 18 of Handbook of Hydrology (Ed. Maidment, D.R.), McGraw-Hill, New York.
Vogel, R.M. (1986). The probability plot correlation coefficient test for the normal, lognormal, and Gumbel distributional hypothesis. Water Resources Research, 22(4): 587-590.
Vogel, R.M. and Kroll, C.N. (1989). Low-flow frequency analysis using probability plot correlation coefficients. Journal of Water Resources Planning and Management, 115(3): 338-357.
Vogel, R.M. and McMartin, D.E. (1991). Probability plot goodness-of-fit and skewness estimation procedures for the Pearson type Ⅲ distribution. Water Resources Research, 27(12): 3149-3158.
