Archives of Economic History / Αρχείον Οικονομικής Ιστορίας, XVIII/1/2006
VARIANCE ESTIMATION AND PREDICTION OF GINI COEFFICIENT FOR INCOME DISTRIBUTION USING ASYMPTOTIC EXPANSIONS AND RESAMPLING METHODS

C.C. FRANGOS*
Abstract

In the present paper we consider the Gini coefficient (Kendall and Stuart, 1982) for the income distribution. This statistic is derived from data showing the annual taxes paid by Greek taxpayers for the years 1960 to 1996. An intractable issue is the estimation of its variance, because of the complexity of its distribution. Applying von Mises asymptotic expansions for differentiable statistical functionals (Hinkley, 1978; Frangos, 1980), we estimate the variance of the Gini coefficient for a set of frequency data regarding a particular year. Robust confidence intervals for the Gini coefficient are constructed using the jackknife statistical methodology. Treating the series of Gini estimates from 1960 to 1996 as a time series from an autoregressive model, we estimate its parameters using the bootstrap methodology (Davison and Hinkley, 1997) and we compare the bootstrap confidence intervals for the parameters with the Least Squares ones. Finally, three autoregressive models for prediction of the Gini coefficient are compared with respect to their predictive errors, and a simple autoregressive model is selected for prediction purposes. The above comparisons support the argument of a clear superiority of the bootstrap confidence intervals, with respect to their length, over the "classical" method of Least Squares.
JEL classification: C, C8, C12, C13, C14, C15, C22, C32.
Keywords: Gini Coefficient, Variance Estimation, Von Mises Expansions, Bootstrap, Jackknife, Income Distribution, Autoregressive Models.
1. Introduction

The Gini coefficient is one of the best statistical measures of the equality of the distribution of a random variable.
* Professor, Technological Educational Institution of Athens, Department of Business Administration, e-mail: [email protected]
For this reason the Gini coefficient is extremely useful to government departments, United Nations agencies and welfare organizations. For a simple formula for calculating the Gini coefficient, the reader is referred to Lerman and Yitzhaki (1984) and Chotikapanich and Griffiths (2001). In section 2 we define the Gini coefficient (G for convenience) and in section 3 we study its second-order properties by means of a von Mises expansion for differentiable statistical functionals. Expanding G in terms of first and second-order influence functions (Hampel, 1974), and estimating their first and second moments, we find estimates of the variance of G, following the method of Frangos (1980) and Frangos and Knott (1983). In section 4, applying the resampling statistical methodology, especially the techniques of the jackknife and the bootstrap, we estimate the variance of G by the jackknife method and we construct jackknife confidence intervals for it, following Efron and Tibshirani (1986). In section 5, treating the values of G as a time series, we estimate the parameters of an autoregressive model and we compare the "classical" method of Least Squares with the "computer intensive" methods of the bootstrap and the jackknife. We find that the length of the confidence intervals derived by the bootstrap methods is smaller than the one derived by the "classical" Least Squares method. In section 6 we compare, with respect to the prediction error, three ARIMA models for fitting the Gini coefficient time series. Based on this comparison, we choose an ARIMA(1, 0, 0) model for predicting future values of the Gini coefficient. Finally, we present some concluding comments and suggestions for future research.
2. Definition of the Gini Coefficient

Following Kendall and Stuart (1982), Lerman and Yitzhaki (1984), Chotikapanich and Griffiths (2001) and Cowell (1995), we find that:
\[ G = \frac{\Delta}{2\alpha_1}, \qquad \Delta = E\,|X_1 - X_2| = \int\!\!\int |x - y|\, f(x)\, f(y)\, dx\, dy, \qquad \alpha_1 = \int_{-\infty}^{+\infty} x f(x)\, dx , \]

where f(x) is the probability density function (p.d.f.) of a random variable X and F(x) is the probability distribution function at the point x. The sample equivalent of the mean difference Δ is

\[ \Delta_N = \frac{2}{N^2} \sum_{i=1}^{N-1} G_i\,(N - G_i)\,\bigl(x_{(i+1)} - x_{(i)}\bigr), \qquad G_i = N F_N\bigl(x_{(i)}\bigr), \]

where F_N is the empirical distribution function of the ordered observations x_{(1)} ≤ ... ≤ x_{(N)}. Therefore, we have

\[ G_N = \frac{\Delta_N}{2\bar{x}} . \]
2.1. Computation of the Gini Coefficient from a Frequency Distribution

Example 1: Suppose that the salaries of 100 employees of a South African company are as follows (in Rands/pay):

Table 1: Distribution of salaries of 100 employees.

Rands/pay   f_i       Cumulative frequency Q_i   n - Q_i   (n - Q_i) Q_i
120-130       5                5                    95          475
130-140      15               20                    80         1600
140-150      25               45                    55         2475
150-160      35               80                    20         1600
160-170      15               95                     5          475
170-180       3               98                     2          196
180-190       2              100                     0            0
Total     n = 100                                              6821

The width of each class is δ = 10.
The mean difference is

\[ \Delta_n = \frac{2\delta}{n^2} \sum_i (n - Q_i)\, Q_i . \]

In general, if we have classes with unequal widths δ_i, it is true that

\[ \Delta_n = \frac{2}{n^2} \sum_i \delta_i\, (n - Q_i)\, Q_i . \]

For this example we have

\[ \Delta_n = \frac{2\,(10)\,(6821)}{100^2} = 13.642 \]

and \(\bar{x} = 150.70\). Then, we have

\[ G_n = \frac{\Delta_n}{2\bar{x}} = \frac{13.642}{2\,(150.70)} \approx 0.0453 . \]
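The grouped-data computation of Table 1 is easy to reproduce programmatically. The following is a minimal sketch (Python; the function and variable names are ours, not from the paper), assuming classes of equal width δ:

```python
import numpy as np

def gini_grouped(freq, midpoints, width):
    """Gini coefficient from a frequency table with equal class width."""
    freq = np.asarray(freq, dtype=float)
    n = freq.sum()
    Q = np.cumsum(freq)                                 # cumulative frequencies Q_i
    delta = 2 * width * np.sum((n - Q) * Q) / n**2      # mean difference
    mean = np.sum(freq * np.asarray(midpoints)) / n     # grouped-data mean
    return delta / (2 * mean)

freq = [5, 15, 25, 35, 15, 3, 2]
midpoints = [125, 135, 145, 155, 165, 175, 185]
print(gini_grouped(freq, midpoints, width=10))   # about 0.045 for Table 1
```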
2.2. Lorenz or Concentration Curve

The concentration curve, or Lorenz curve, is a curve which presents the degree of concentration (or degree of inequality) of the values of a frequency distribution.

Example 2: Suppose that the following table shows the distribution of the number of taxpayers and their declared income, classified in groups of annual income:

Table 2: Classes of declared income and corresponding number of taxpayers.

Classes of annual income (in 10 Euro)   No. of taxpayers   Declared income (in 10 Euro)
Less than 200                                 236576                 23657000
200-300                                       360830                 90207500
300-400                                       359096                108000000
The following table shows the "less than" cumulative frequencies of the number of taxpayers and of their declared annual income, as well as the corresponding relative cumulative frequencies.

Table 3: Cumulative frequencies of the number of taxpayers and of the declared income.

Class of annual       Cumulative freq.   Cumul. rel. freq.   Cumulative freq. of declared   Cumul. rel. freq. of
income (in 10 Euro)   of taxpayers       of taxpayers        income (in 10 Euro)            declared income
x < 200                                                                                          0.0221
x < 300                                                                                          0.1064
x < 400                                                                                          0.2075
x < 600                                                                                          0.2776
x < 800                                                                                          0.3562
x < 1000                                                                                         0.4420
x < 2000                                                                                         0.5753
x < 3000                                                                                         0.7390
x < 5000                                                                                         0.8971
x < 8000                                                                                         0.9883
Unknown value > x
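The Lorenz curve plots the cumulative share of taxpayers against the cumulative share of declared income. The sketch below (Python) builds both cumulative shares from grouped data and approximates the concentration area E by the trapezoidal rule, anticipating the ratio G = 2E discussed after Graph 1; the numerical inputs are hypothetical placeholders rather than the full figures of Table 3.

```python
import numpy as np

# Hypothetical grouped data (placeholders, not the actual Table 3 figures):
# taxpayers per income class and total declared income per class (in 10 Euro).
taxpayers = np.array([236576, 360830, 359096, 500000, 300000])
income    = np.array([23657000, 90207500, 108000000, 250000000, 400000000])

# Cumulative shares, prefixed with the origin (0, 0) of the Lorenz curve.
p = np.concatenate(([0.0], np.cumsum(taxpayers) / taxpayers.sum()))  # population share
L = np.concatenate(([0.0], np.cumsum(income) / income.sum()))        # income share

# Area under the Lorenz curve by the trapezoidal rule; E is the area between
# the diagonal of perfect equality and the curve, and G = E / 0.5 = 2E.
area_under_curve = np.sum((L[1:] + L[:-1]) / 2 * np.diff(p))
E = 0.5 - area_under_curve
G = 2.0 * E
print(f"Concentration area E = {E:.4f}, Gini ratio G = {G:.4f}")
```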
The graph of the Lorenz curve for the above data is shown below.

Graph 1: Lorenz curve for the above data.
We have (Kendall and Stuart, vol. 1, 1982) that the area E between the diagonal of equal distribution and the Lorenz curve measures the concentration of the declared income. We compare E with the area of the triangle inside which the Lorenz curve lies; this area is 0.5. The greater the ratio

\[ G = \frac{E}{0.5} = 2E , \]

the greater is the inequality of the values of the declared income; in other words, a very small number of people have a very large annual income. (Here we exclude the effects of the "black" economy.)

Example 3: Calculation of the mean difference Δ of the random variable X from non-classified data.
Consider the following data: 50, 65, 70, 75, 85 (N = 5).
We construct the following table:

Table 4: Calculation of the mean difference from a list of unclassified data (median M = 70).

Observation x_(i)   |x_(i) - M|   y_i = |i - (N+1)/2|   y_i |x_(i) - M|
      50                20                 2                   40
      65                 5                 1                    5
      70                 0                 0                    0
      75                 5                 1                    5
      85                15                 2                   30
                                           Sum                 80

The mean difference is then

\[ \Delta_N = \frac{4}{N^2}\sum_{i=1}^{N} y_i\,\bigl|x_{(i)} - M\bigr| = \frac{4\,(80)}{25} = 12.8 , \]

which agrees with the direct definition \( \Delta_N = N^{-2}\sum_i\sum_j |x_i - x_j| = 320/25 = 12.8 \).
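For unclassified data the mean difference and the Gini coefficient can also be computed directly from the pairwise definition. A minimal sketch (Python; the function name is ours, not from the paper):

```python
import numpy as np

def gini_from_raw(x):
    """Mean difference (with repetition) and Gini coefficient G = Delta / (2 * mean)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Mean absolute difference over all ordered pairs, divided by n^2.
    delta = np.abs(x[:, None] - x[None, :]).sum() / n**2
    return delta, delta / (2.0 * x.mean())

delta, g = gini_from_raw([50, 65, 70, 75, 85])
print(delta, g)   # Delta = 12.8, G = 12.8 / (2 * 69) which is about 0.093
```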
3. Derivation of a Variance Estimator for the Gini Coefficient

Our aim is to find a variance estimate of G and to investigate its asymptotic properties, using von Mises second-order expansions, following Frangos (1980) and Frangos and Knott (1983). We consider G as a regular differentiable statistical functional of the form G = G(F), where F is the distribution function of the observations. The corresponding estimator is

\[ G_n = G(F_n), \]

where F_n is the usual empirical distribution function.
Suppose that G_n admits the second-order expansion

\[ G_n = G(F) + \frac{1}{n}\sum_{i=1}^{n} g_1(y_i) + \frac{1}{2n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} g_2(y_i, y_j) + R_n , \tag{2} \]

where R_n is a remainder of smaller order. In the above expansion, g_1(y_i) and g_2(y_i, y_j) are the first and second von Mises derivatives of the functional G at F. The first derivative in the direction of an arbitrary distribution function Q(x) satisfies

\[ \lim_{\varepsilon \to 0}\frac{G\{(1-\varepsilon)F + \varepsilon Q\} - G(F)}{\varepsilon} = \int g_1(y)\, dQ(y), \]

and g_2 is defined analogously from the second-order derivative. The derivative g_1(y) also has the name influence function (Hampel, 1974). The first and second-order influence functions have the properties

\[ E\, g_1(Y) = 0, \qquad E\, g_2(Y, y) = 0 \quad \text{for every fixed } y . \]

Then, taking the variance of both sides of (2), we obtain, to order n^{-2},

\[ \operatorname{Var}(G_n) \approx \frac{\eta_{11}}{n} + \frac{1}{n^2}\left( \eta_{12} + \tfrac{1}{2}\,\eta_{22} \right), \]

where η_{11} = E[g_1^2(Y)], η_{12} = E[g_1(Y) g_2(Y, Y)] and η_{22} = E[g_2^2(Y_1, Y_2)], with Y_1, Y_2 independent. Replacing η_{11}, η_{12} and η_{22} by estimates gives the variance estimate of the Gini coefficient. We find the estimates of η_{11}, η_{12} and η_{22} as follows. Consistent estimates of B_{1,i} = g_1(y_i), B_{2,ii} = g_2(y_i, y_i) and B_{2,jk} = g_2(y_j, y_k) (j ≠ k)
are the following:

\[ \hat B_{1,i} = (n-1)\,\bigl(G_n - G_{n,-i}\bigr), \]

\[ \hat B_{2,ii} = (n+1)\,G_{n,+i} - 2n\,G_n + (n-1)\,G_{n,-i}, \]

\[ \hat B_{2,jk} = n\,\bigl\{\, n G_n - (n-1)\bigl(G_{n,-j} + G_{n,-k}\bigr) + (n-2)\,G_{n,-j,-k} \,\bigr\}, \qquad j \neq k , \]

where

\[ G_{n,-i} = G_{n-1}(y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n), \qquad G_{n,+i} = G_{n+1}(y_1, \ldots, y_n, y_i), \]
\[ G_{n,-j,-k} = G_{n-2}(y_1, \ldots, y_{j-1}, y_{j+1}, \ldots, y_{k-1}, y_{k+1}, \ldots, y_n). \]
Now, estimates of η_{11}, η_{12} and η_{22} are formed from the above by computing the corresponding sample moments. Thus, we have

\[ \hat\eta_{11} = \frac{1}{n}\sum_{i=1}^{n}\hat B_{1,i}^2, \qquad \hat\eta_{12} = \frac{1}{n}\sum_{i=1}^{n}\hat B_{1,i}\,\hat B_{2,ii}, \qquad \hat\eta_{22} = \frac{1}{n(n-1)}\sum_{j \neq k}\hat B_{2,jk}^2 , \]

and the resulting variance estimate of the Gini coefficient is

\[ \widehat{\operatorname{Var}}(G_n) = \frac{\hat\eta_{11}}{n} + \frac{1}{n^2}\left( \hat\eta_{12} + \tfrac{1}{2}\,\hat\eta_{22} \right). \tag{5} \]
Another consistent estimate of B_{1,i} is (Hinkley, 1978)

\[ \tilde B_{1,i} = (n+1)\,\bigl(G_{n,+i} - G_n\bigr). \]

Hence, another estimate of the variance of the Gini coefficient G_n is obtained by substituting \(\tilde B_{1,i}\) for \(\hat B_{1,i}\) in (5):

\[ \widetilde{\operatorname{Var}}(G_n) = \frac{1}{n}\left(\frac{1}{n}\sum_{i=1}^{n}\tilde B_{1,i}^2\right) + \frac{1}{n^2}\left( \hat\eta_{12} + \tfrac{1}{2}\,\hat\eta_{22} \right). \tag{7} \]
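The delete-one and add-one quantities above are straightforward to evaluate numerically. The sketch below (Python) computes the first-order influence estimates \(\hat B_{1,i}\) and the leading term of the variance approximation for the Gini coefficient of a raw sample; it keeps only the n^{-1} term, the helper names are ours, and the income sample is simulated for illustration.

```python
import numpy as np

def gini(x):
    """Gini coefficient G = Delta / (2 * mean), Delta the mean difference with repetition."""
    x = np.asarray(x, dtype=float)
    delta = np.abs(x[:, None] - x[None, :]).mean()
    return delta / (2.0 * x.mean())

def gini_variance_first_order(x):
    """First-order von Mises variance estimate: eta11_hat / n with eta11_hat = mean(B1_i^2)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    g_n = gini(x)
    # Delete-one influence estimates: B1_i = (n - 1) * (G_n - G_{n,-i}).
    b1 = np.array([(n - 1) * (g_n - gini(np.delete(x, i))) for i in range(n)])
    eta11_hat = np.mean(b1**2)
    return eta11_hat / n

rng = np.random.default_rng(0)
incomes = rng.lognormal(mean=3.0, sigma=0.8, size=50)   # illustrative income sample
print(gini(incomes), gini_variance_first_order(incomes))
```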
4. Jackknife Confidence Intervals for the Gini Coefficient

Let (f_i, x_i), i = 1, 2, ..., n, be the frequencies and the corresponding centers of each class from a distribution of a random variable X. The Gini coefficient for these data is a function of the n pairs, which are independent and identically distributed according to an empirical distribution F_n(x) (Miller, 1974). It is true that

\[ G_n = \frac{\Delta_n}{2\bar{x}} = G_n(f_1, x_1;\; f_2, x_2;\; \ldots;\; f_n, x_n). \]

Let the jackknife pseudovalues P_{n,i}, i = 1, 2, ..., n, have the form

\[ P_{n,i} = n\,G_n - (n-1)\,G_{n,-i} . \]

The jackknife estimate of the Gini coefficient is (Miller, 1974)

\[ G_n^{*} = \frac{1}{n}\sum_{i=1}^{n} P_{n,i} . \]

The advantages of G_n^{*} over G_n are two:

1. If G_n has systematic bias of order n^{-1}, then the jackknife version G_n^{*} has bias of order n^{-2}.
2. A distribution-free estimate of the standard error of G_n and G_n^{*} is S_n, where

\[ S_n^2 = \frac{s_P^2}{n}, \qquad s_P^2 = \frac{1}{n-1}\sum_{i=1}^{n}\bigl( P_{n,i} - G_n^{*} \bigr)^2 , \tag{10} \]

and s_P^2 is the sample variance of the pseudovalues.
Miller (1974) has shown that, under easily obtainable regularity conditions, (G_n − Θ)/S_n and (G_n^{*} − Θ)/S_n are asymptotically normal as n → ∞, where Θ is the population value of the Gini coefficient. A third advantage of the jackknife estimate is that approximate confidence intervals for Θ can be obtained as

\[ G_n^{*} \pm t_{n-1,\,1-\alpha/2}\; S_n , \]

where t_{n-1, 1-α/2} is the upper percentage point of the Student t distribution with n − 1 degrees of freedom.
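A compact sketch of the pseudovalue construction and the resulting interval follows (Python; gini() and the simulated incomes are reused from the earlier sketch, and the 95% level is an illustrative assumption):

```python
import numpy as np
from scipy import stats

def jackknife_ci_gini(x, level=0.95):
    """Jackknife pseudovalue estimate and confidence interval for the Gini coefficient."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    g_n = gini(x)
    g_minus_i = np.array([gini(np.delete(x, i)) for i in range(n)])
    pseudo = n * g_n - (n - 1) * g_minus_i          # P_{n,i} = n G_n - (n-1) G_{n,-i}
    g_jack = pseudo.mean()                          # jackknife estimate G_n^*
    s_n = np.sqrt(pseudo.var(ddof=1) / n)           # S_n from the pseudovalue variance
    t = stats.t.ppf(0.5 + level / 2, df=n - 1)
    return g_jack, (g_jack - t * s_n, g_jack + t * s_n)

print(jackknife_ci_gini(incomes))
```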
5. Bootstrap Estimation of Parameters in an Autoregressive Time Series Model for the Gini Coefficient

Consider the time series y_1, y_2, ..., y_n, where y_i represents the Gini coefficient for the i-th year, with i = 1 corresponding to 1960. We fit the first-order autoregressive model

\[ y_t - \mu = \Phi_1\,(y_{t-1} - \mu) + \varepsilon_t . \]

The bootstrap procedure is the following:

1. We fit the model by Least Squares, obtaining estimates \(\hat\mu\) and \(\hat\Phi_1\).
2. We form the residuals \(\hat\varepsilon_t = (y_t - \hat\mu) - \hat\Phi_1 (y_{t-1} - \hat\mu)\) and centre them.
3. We draw bootstrap errors

\[ \varepsilon_t^{*} = \hat\varepsilon_{i_t}, \]

where i_1, i_2, ..., i_n are uniform random indices from {1, 2, ..., n}. For example, if i_1 = 3, then ε_1^{*} = \(\hat\varepsilon_3\); if i_2 = 5, then ε_2^{*} = \(\hat\varepsilon_5\).
4. We consider the bootstrap time series generated from the fitted model with the resampled errors,

\[ y_t^{*} - \hat\mu = \hat\Phi_1\,(y_{t-1}^{*} - \hat\mu) + \varepsilon_t^{*} . \]

From this bootstrap series we find the Least Squares estimates of μ and Φ_1, denoted μ^{*} and Φ_1^{*}, respectively.

5. We repeat steps 2, 3, 4 1000 times (or B times, with B large) and we obtain estimates Φ_{1,i}^{*} and μ_i^{*}, i = 1, 2, ..., B, of Φ_1 and μ respectively. The bootstrap estimates of μ and Φ_1 are the averages

\[ \bar\mu^{*} = \frac{1}{B}\sum_{i=1}^{B}\mu_i^{*}, \qquad \bar\Phi_1^{*} = \frac{1}{B}\sum_{i=1}^{B}\Phi_{1,i}^{*}, \]

with bootstrap variance estimate

\[ v_B(\Phi_1) = \frac{1}{B-1}\sum_{i=1}^{B}\bigl( \Phi_{1,i}^{*} - \bar\Phi_1^{*} \bigr)^2 . \]

In the above model we assume |Φ_1| < 1 in order to obtain bootstrap estimates of Φ_1 that are consistent and asymptotically Normal (Basawa et al., 1991). We obtain bootstrap prediction intervals for a future observation of the Gini coefficient by applying the method of Davison and Hinkley (1997), Section 6.3.3, p. 284.
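A compact numerical sketch of the residual-bootstrap steps above (Python; the AR(1) fit uses ordinary least squares on the lagged series in intercept form, and both B and the illustrative series are assumptions, not the paper's data):

```python
import numpy as np

def fit_ar1(y):
    """Least squares fit of y_t = c + phi * y_{t-1} + e_t; returns (c, phi, residuals)."""
    X = np.column_stack([np.ones(len(y) - 1), y[:-1]])
    c, phi = np.linalg.lstsq(X, y[1:], rcond=None)[0]
    resid = y[1:] - (c + phi * y[:-1])
    return c, phi, resid

def bootstrap_ar1(y, B=1000, seed=0):
    """Residual bootstrap of the AR(1) coefficient (steps 2-4 repeated B times)."""
    rng = np.random.default_rng(seed)
    c, phi, resid = fit_ar1(y)
    resid = resid - resid.mean()                     # centre the residuals
    phi_star = np.empty(B)
    for b in range(B):
        e_star = rng.choice(resid, size=len(y), replace=True)
        y_star = np.empty(len(y))
        y_star[0] = y[0]
        for t in range(1, len(y)):                   # rebuild the series recursively
            y_star[t] = c + phi * y_star[t - 1] + e_star[t]
        phi_star[b] = fit_ar1(y_star)[1]
    return phi, phi_star.mean(), phi_star.var(ddof=1)

# Illustrative Gini series (positive by construction, not the paper's actual values).
gini_series = 0.30 * np.exp(0.03 * np.cumsum(np.random.default_rng(1).normal(size=37)))
print(bootstrap_ar1(gini_series))
```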
In order to compare the "classical" method of estimation with the "computer intensive" methods in a problem of statistical inference in time series, we consider the following autoregressive model:

\[ G_t = b_0 + b_1 G_{t-1} + b_2 G_{t-2} + \varepsilon_t , \qquad t = 1, 2, \ldots, n , \tag{18} \]

where b_0, b_1, b_2 are unknown parameters to be estimated from the data, the errors ε_t (t = 1, 2, ..., n) satisfy E(ε_t) = 0 and Var(ε_t) = σ_ε², and G_t is the value of the Gini coefficient for the t-th year, G_{t-1} the value for the (t − 1)-th year and G_{t-2} the value for the (t − 2)-th year. In the above model it is not known whether the errors are distributed according to the Normal distribution. The coefficients b_0, b_1, b_2 are estimated by Least Squares by the corresponding estimates \(\hat b_0, \hat b_1, \hat b_2\). We shall construct confidence intervals for the coefficients b_0, b_1, b_2 by the following methods: the Least Squares method and two versions of the Bootstrap methodology, namely the percentile Bootstrap and the BCa (Bias Corrected accelerated) Bootstrap. We describe these methods below.

(a) Least Squares estimation of the parameters b_0, b_1, b_2 in the autoregressive model (18).

The parameters to be estimated in model (18) are b_0, b_1, b_2. We take as initial observations g_0 = g_1 = 0.306 and g_{-1} = g_1 = 0.306. Customary estimators of b_0, b_1, b_2 are the Least Squares estimators \(\hat b_0, \hat b_1, \hat b_2\) that minimize

\[ \sum_{t=1}^{n}\bigl( g_t - b_0 - b_1 g_{t-1} - b_2 g_{t-2} \bigr)^2 . \]

In order to introduce the idea of Least Squares, we assume β_0 = 0. Let β = (β_1, β_2)' and \(\hat\beta = (\hat\beta_1, \hat\beta_2)'\). Then

\[ \hat\beta = S_n^{-1} \sum_{t=1}^{n} g_t\,(g_{t-1}, g_{t-2})' , \]

where S_n is the 2 × 2 matrix whose (i, j)-th element is \(\sum_{t=1}^{n} g_{t-i}\, g_{t-j}\), as illustrated in the sketch below.
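A minimal least squares fit of model (18) (Python; the lag-matrix construction and variable names are ours, the illustrative gini_series is the one from the previous sketch, and the presample values are handled by simply starting the regression at the third observation rather than using the paper's initial values):

```python
import numpy as np

def fit_ar2_ls(g):
    """OLS estimates of (b0, b1, b2) in G_t = b0 + b1*G_{t-1} + b2*G_{t-2} + e_t."""
    g = np.asarray(g, dtype=float)
    y = g[2:]
    X = np.column_stack([np.ones(len(y)), g[1:-1], g[:-2]])   # columns [1, G_{t-1}, G_{t-2}]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return coef, resid

coef, resid = fit_ar2_ls(gini_series)
print("b0, b1, b2 =", coef)
```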
(b). Bootstrap estimation of parameters in the autoregressive model (18) (Tu and Shao, 1995), (Davison and Hinkley, 1997).
The Bootstrap methodology is derived using the special structure of autoregressive models. We form the residuals

\[ \hat\varepsilon_t = g_t - \hat b_0 - \hat b_1 g_{t-1} - \hat b_2 g_{t-2}, \qquad t = 1, 2, \ldots, n . \]

Since the residuals behave similarly to the errors ε_t, we extend the Bootstrap based on residuals, derived for linear models, to autoregressive models. Let ε_t^{*}, t = 0, ±1, ±2, ..., be independent and identically distributed random variables from the empirical distribution of the residuals, putting mass n^{-1} on \(\hat\varepsilon_t - \bar\varepsilon\), t = 1, 2, ..., n, where \(\bar\varepsilon = n^{-1}\sum_{t=1}^{n}\hat\varepsilon_t\). We consider the Bootstrap observations

\[ g_t^{*} = \hat b_0 + \hat b_1 g_{t-1}^{*} + \hat b_2 g_{t-2}^{*} + \varepsilon_t^{*} . \]

The Bootstrap estimates \(\hat\beta_0^{*}, \hat\beta_1^{*}, \hat\beta_2^{*}\) of β_0, β_1, β_2 are the Least Squares estimates given in paragraph (a) of this section, where the observations g_t, g_{t-1}, g_{t-2} are substituted by g_t^{*}, g_{t-1}^{*}, g_{t-2}^{*} for t = −1, 0, 1, 2, ..., n (Tu and Shao, 1995). We can then estimate the distribution of \(\hat\beta - \beta\) by the Bootstrap distribution of \(\hat\beta^{*} - \hat\beta\). We assume that the autoregressive time series is stationary. This implies that the roots of the equation

\[ 1 - b_1 z - b_2 z^2 = 0 \]

are outside the unit circle. The basic percentile Bootstrap confidence limits with coefficient 1 − 2α are

\[ \bigl[\, t^{*}\!\bigl((R+1)\alpha\bigr),\ \ t^{*}\!\bigl((R+1)(1-\alpha)\bigr) \,\bigr], \]

where t is the parameter to be bootstrapped and t^{*}((R+1)α) is the ((R+1)α)-th ordered value of the R simulated estimates of t (Davison and Hinkley, 1997, p. 203).

(c) The BCa (Bias Corrected accelerated) Bootstrap confidence limits
with coefficient 1 − 2α are the following:

\[ \hat t_{\alpha} = t^{*}\!\bigl( (R+1)\,\tilde\alpha_1 \bigr), \qquad \hat t_{1-\alpha} = t^{*}\!\bigl( (R+1)\,\tilde\alpha_2 \bigr), \tag{22} \]

where t^{*}(r) denotes the r-th ordered value of the R simulated Bootstrap estimates of the parameter t and the adjusted levels are

\[ \tilde\alpha_1 = \Phi\!\left( w + \frac{w + z_{\alpha}}{1 - a\,(w + z_{\alpha})} \right), \qquad \tilde\alpha_2 = \Phi\!\left( w + \frac{w + z_{1-\alpha}}{1 - a\,(w + z_{1-\alpha})} \right). \]

Here Φ(z) is the cumulative distribution function of N(0, 1), z_α is its α-quantile,

\[ w = \Phi^{-1}\!\left( \frac{\#\{\, t_r^{*} \le t \,\}}{R+1} \right), \]

and the acceleration constant a is estimated from the empirical influence values by

\[ a = \frac{1}{6}\; \frac{\sum_{i} l_i^{3}}{\bigl( \sum_{i} l_i^{2} \bigr)^{3/2}} , \]

where l_i is the influence value of the statistic at the i-th observation g_i, estimated, for example, by the jackknife, \( l_i = (n-1)\,(\hat t - \hat t_{-i}) \) (Frangos and Schucany, 1990).
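Given the R bootstrap replicates of a parameter estimate, both interval types can be computed in a few lines. A hedged sketch (Python; theta_star would be, for example, the array of bootstrap estimates of b_1, and the acceleration uses jackknife influence values as above):

```python
import numpy as np
from scipy.stats import norm

def percentile_ci(theta_star, alpha=0.025):
    """Basic percentile bootstrap limits [t*((R+1)a), t*((R+1)(1-a))]."""
    return np.quantile(theta_star, [alpha, 1 - alpha])

def bca_ci(theta_hat, theta_star, theta_jack, alpha=0.025):
    """BCa limits from bootstrap replicates and delete-one (jackknife) estimates."""
    R = len(theta_star)
    w = norm.ppf(np.sum(theta_star <= theta_hat) / (R + 1))        # bias correction
    l = (len(theta_jack) - 1) * (theta_hat - theta_jack)           # influence values
    a = np.sum(l**3) / (6.0 * np.sum(l**2) ** 1.5)                 # acceleration constant
    z = norm.ppf([alpha, 1 - alpha])
    adj = norm.cdf(w + (w + z) / (1 - a * (w + z)))                # adjusted levels
    return np.quantile(theta_star, adj)
```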
The following table shows the results of Least Squares and Bootstrap estimation of the parameters in the autoregressive model (18):

Table 6: Least Squares and Percentile Bootstrap estimation of parameters in the autoregressive model G_t = b_0 + b_1 G_{t-1} + b_2 G_{t-2} + ε_t, t = −1, 0, 1, 2, ..., n.
               Least Squares                  Percentile Bootstrap
Parameter      Estimate      Std. Error      Estimate      Std. Error
b_0             0.0246        0.0255          0.0235        0.0226
b_1             1.0679        0.1620          1.0974        0.1353
b_2            -0.1378        0.1710         -0.1633        0.1497

Least Squares method: Multiple R-squared = 0.8077; F-statistic = 79.78; p-value = 0.000; Residual standard error = 0.0228.
Table 7: Comparison of 95% confidence intervals for the parameters b_0, b_1, b_2 in the autoregressive model G_t = b_0 + b_1 G_{t-1} + b_2 G_{t-2} + ε_t.

               Percentile Bootstrap                      Bias Corrected Bootstrap (BCa)
Parameter   Lower limit   Upper limit   Length       Lower limit   Upper limit   Length
b_0           -0.014         0.060       0.074         -0.008         0.066       0.074
b_1            0.966         1.374       0.407          0.955         1.311       0.356
b_2           -0.431         0.021       0.453         -0.390         0.033       0.424

For the Least Squares method the corresponding 95% interval lengths are 0.103 for b_0 and 0.655 for b_1.
6. Prediction of Gini Coefficients using Box-Jenkins ARIMA Models

We would like to find a model for the Gini data series with good predictive power. Using the statistical package SPSS 12.0 we identify and estimate the parameters of Box-Jenkins models (Tu and Shao, 1995, p. 400). We examine the series of Gini coefficients from 1960 to 1996 for stationarity. From a simple graph of the Gini coefficient time series against time we see that the data are non-stationary, so we take the natural logarithm of the series and the first difference. From the examination of the autocorrelation and partial autocorrelation graphs we conclude that there is a significant autocorrelation and partial autocorrelation at lag 1, so an AR(1) or MA(1) model seems appropriate. We introduce the following notation: G_t is the Gini coefficient computed for the t-th year and G_{t-1} is the Gini coefficient computed for the (t − 1)-th year. The three ARIMA models which are introduced for estimation of their parameters and for prediction of future values of the Gini coefficient are the following (a sketch of how such models can be fitted is given after the list):

a. Model ARIMA(1, 0, 1):

\[ \log G_t = a_0 + a_1 \log G_{t-1} + \varepsilon_t - \varphi_1 \varepsilon_{t-1}, \tag{25} \]

where a_0, a_1, φ_1 are the unknown parameters to be estimated from the data and ε_t is the random error with mean 0 and variance σ_ε².

b. Model ARIMA(1, 0, 2):

\[ \log G_t = b_0 + b_1 \log G_{t-1} + \varepsilon_t - \varphi_1 \varepsilon_{t-1} - \varphi_2 \varepsilon_{t-2}, \tag{26} \]

where b_0, b_1, φ_1, φ_2 are the unknown parameters to be estimated from the data and ε_t is the random error with mean 0 and variance σ_ε².

c. Model ARIMA(1, 0, 0):

\[ \log G_t = \gamma_0 + \gamma_1 \log G_{t-1} + \varepsilon_t, \tag{27} \]

where γ_0, γ_1 are unknown parameters to be estimated from the data and ε_t is a random error with mean 0 and variance σ_ε².
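A hedged sketch of fitting the three candidate models and comparing their information criteria (Python with statsmodels; the log transformation and the illustrative gini_series from the earlier sketch are assumptions, not the paper's SPSS runs):

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

log_g = np.log(gini_series)          # log of the (illustrative) Gini series

for order in [(1, 0, 1), (1, 0, 2), (1, 0, 0)]:
    fit = ARIMA(log_g, order=order, trend="c").fit()
    print(order, "AIC =", round(fit.aic, 2), "BIC =", round(fit.bic, 2))
```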
The following table shows the estimated parameters, the variances and the predictive errors of the three models:

Table 8: Statistical inference for models (25), (26), (27).

Model ARIMA(1, 0, 1): log G_t = a_0 + a_1 log G_{t-1} + ε_t − φ_1 ε_{t-1}

Parameter   Estimate   Standard error   Approx. p-value
a_0          -1.155        0.069              0.000
a_1           0.836        0.113              0.000
φ_1          -0.158        0.199              0.431

Akaike's information criterion = -99.45
Schwarz's Bayesian criterion = -94.312
Standard error = 0.02; Residual mean square = 0.004
Model ARIMA(1, 0, 2): log G_t = b_0 + b_1 log G_{t-1} + ε_t − φ_1 ε_{t-1} − φ_2 ε_{t-2}

Parameter   Estimate   Standard error   Approx. p-value
b_0          -1.120        0.089              0.000
b_1           0.874        0.114              0.000
φ_1          -0.152        0.197              0.443
φ_2          -0.059        0.192              0.759

Akaike's information criterion = -97.49
Schwarz's Bayesian criterion = -90.63
Standard error = 0.04; Residual mean square = 0.06
Model ARIMA(1, 0, 0): log G_t = γ_0 + γ_1 log G_{t-1} + ε_t

Parameter   Estimate   Standard error   Approx. p-value
γ_0          -1.110        0.099              0.000
γ_1           0.913        0.073              0.000

Akaike's information criterion = -100.65
Schwarz's Bayesian criterion = -97.23
Standard error = 0.06; Residual mean square = 0.004

The following table shows the prediction errors of the three models for the years 1998, 1999 and 2000.

Table 9: Comparison of the prediction errors for the three models given in (25), (26), (27) for the years 1998, 1999, 2000.
Year    Model (25) prediction error   Model (26) prediction error   Model (27) prediction error
1998              0.007                         0.009                        -0.007
1999              0.101                         0.107                         0.098
2000              0.040                         0.040                         0.052
Hence, the three models have equivalent prediction errors. They differ in the values of Akaike's information criterion and Schwarz's Bayesian criterion. In terms of simplicity, we choose the third model, ARIMA(1, 0, 0), which, with the estimates of Table 8, is

\[ \log \hat G_t = -1.110 + 0.913\, \log G_{t-1} . \]
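One-step-ahead prediction with the chosen model, and the corresponding prediction error, can then be computed as follows (Python; the coefficients are the estimates reported in Table 8, while the example values of the series are placeholders, not the paper's data):

```python
import numpy as np

gamma0, gamma1 = -1.110, 0.913          # ARIMA(1,0,0) estimates from Table 8

def predict_next_gini(g_last):
    """One-step-ahead forecast of the Gini coefficient from the fitted AR(1) on log G."""
    return np.exp(gamma0 + gamma1 * np.log(g_last))

g_1997_actual = 0.34                     # placeholder value, not from the paper
g_1998_actual = 0.35                     # placeholder value, not from the paper
g_1998_pred = predict_next_gini(g_1997_actual)
print("prediction error 1998:", g_1998_actual - g_1998_pred)
```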
7. Concluding Remarks

The Gini coefficient is the best statistic for measuring the degree of equal distribution of a random variable. It can be found in many situations: income distribution, health facilities distribution, education resources distribution, environmental pollution distribution, employment (or unemployment) patterns, agricultural facilities distribution and fertility pattern distribution. The Gini coefficient is also a useful arena for comparing the two methodologies, the jackknife and the bootstrap. Using von Mises expansions we investigated the second-order properties of the Gini coefficient, and we estimated its variance using estimates of the first and second-order influence functions.
In this paper we used as an estimate of the first-order influence function the statistic

\[ \hat B_{1,i} = (n-1)\,\bigl(G_n - G_{n,-i}\bigr). \]

Frangos and Schucany (1990) present the following estimate of the first-order influence function by means of an "adding one observation" jackknife estimate:

\[ \tilde B_{1,i} = (n+1)\,\bigl(G_{n,+i} - G_n\bigr). \]

Considering the income distribution in a particular year, we applied the jackknife technique for finding confidence intervals about the mean of the Gini coefficient. Using the values of the Gini coefficient from 1960 to 1996 as a time series, and two versions of the Bootstrap statistical methodology for estimating the distribution of a statistic, namely the percentile Bootstrap and the Bias Corrected Bootstrap, we estimated the parameters of an autoregressive model of order 2 and compared the confidence limits for the parameters obtained by the "classical" method of Least Squares with those obtained by the "computer intensive" methods of the percentile Bootstrap and the Bias Corrected Bootstrap. From this comparison we find that the lengths of the confidence
intervals obtained by the computer intensive methods are shorter than the lengths of the confidence limits obtained by the Least Squares method. The lengths of the confidence limits obtained by the Bias Corrected Bootstrap are shorter than the corresponding lengths obtained by the percentile Bootstrap. In section 6 we compared the estimates, their standard errors, Akaike's and Schwarz's criteria, the residual mean square and the prediction errors for three ARIMA models fitting the Gini time series data, and we concluded that the third model, given by (27), which is an ARIMA(1, 0, 0) model, is preferable to the other two because of its simplicity and the smaller magnitude of the standard errors of its estimates compared with the models given by (25) and (26). Also, the two criteria of Akaike and Schwarz are smaller for the model given by (27) than for the other two models given by (25) and (26). Finally, we can conclude that the main finding of this paper is that the estimation and prediction of a final observation of the Gini time series is an area where the superiority of the "computer intensive" methods, with respect to confidence intervals, is established over the more "classical" method of Least Squares. Also, the paper contains variance estimates for the Gini coefficient, given by (5), (7) and (10), overcoming the difficulty of the analytical derivation and using, for the first time, von Mises expansions for differentiable statistical functionals. A Monte Carlo simulation study would be of interest, both to compare the variance estimate given in (5) with that obtained in (7) and to assess the performance of the Bias Corrected accelerated Bootstrap in deriving confidence limits in time series, where the acceleration parameter a is estimated from the empirical influence values as in section 5. Also, the performance of "computer intensive" methods could be investigated in more complex situations, such as log-linear models for Binomial and Poisson observations.
REFERENCES

Basawa, I.V., Mallik, A.K., McCormick, W.P., Reeves, J.H. and Taylor, R.L. (1991), Bootstrapping unstable first order autoregressive processes, Ann. Statist., 19, 1098-1101.

Chotikapanich, D. and Griffiths, W. (2001), On calculation of the extended Gini coefficient, Review of Income and Wealth, 47, 4, 541-547.

Cowell, F. (1995), Measuring Inequality, Prentice Hall/Harvester Wheatsheaf, London.

Davison, A.C. and Hinkley, D.V. (1997), Bootstrap Methods and their Application, Cambridge University Press, Cambridge.

Efron, B. and Tibshirani, R.J. (1986), Bootstrap methods for standard errors, confidence intervals and other measures of statistical accuracy, Statist. Science, 1, 54-77.

Efron, B. and Tibshirani, R.J. (1993), An Introduction to the Bootstrap, Chapman and Hall, New York.

Frangos, C.C. (1980), Variance estimation for the second-order jackknife, Biometrika, 67, 3, 715-718.

Frangos, C.C. and Knott, M. (1983), Variance estimation for the jackknife using von Mises expansions, Biometrika, 70, 2, 501-504.

Frangos, C.C. and Schucany, W.R. (1990), Jackknife estimation of the bootstrap acceleration constant, Comput. Statist. Data Anal., 9, 271-282.

Hampel, F.R. (1974), The influence curve and its role in robust estimation, J. Amer. Statist. Assoc., 69, 383-393.

Hinkley, D.V. (1978), Improving the jackknife with special reference to correlation estimation, Biometrika, 65, 1, 13-21.
Kendall, M.G. and Stuart, A. (1982), The Advanced Theory of Statistics, vol. I, fourth edition, C. Griffin, London.

Lerman, R.I. and Yitzhaki, S. (1984), A note on the calculation and interpretation of the Gini index, Economics Letters, 15, 363-368.

Miller, R.G. (1974), The jackknife: a review, Biometrika, 61, 1-15.
Tu, D. and Shao, J. (1995), The Jackknife and Bootstrap, Springer Series in Statistics, Springer, New York.