Bayesian estimation of NIG-parameters by Markov chain Monte Carlo methods

Jostein Lillestøl

Institute of Finance and Management Science
The Norwegian School of Economics and Business Administration
Bergen, Norway
E-mail: [email protected]

First draft September 16, 1998; revision of January 15, 2001

Abstract: The Normal Inverse Gaussian (NIG) distribution recently introduced by Barndorff-Nielsen (1997) is a promising alternative for modelling financial data exhibiting skewness and fat tails. In this paper we explore the Bayesian estimation of NIG-parameters by Markov Chain Monte Carlo Methods.

KEY WORDS: Normal Inverse Gaussian distribution, Bayesian Analysis, Markov Chain Monte Carlo.

* This work is partly done at Institut für Statistik und Ökonometrie, Humboldt-Universität zu Berlin with support from the Ruhrgas Arena program.

1 Introduction

The empirical distributions of stock returns are typically skew and heavy tailed, and different families of distributions have been proposed to model returns. Most notable is the stable Paretian family, with a long history, see McCulloch (1996) and Adler et al. (1998). More recently the class of NIG-distributions, an acronym for Normal Inverse Gaussian, has been proposed by Barndorff-Nielsen (1997). He investigated its properties, the modelling of NIG-processes, the estimation of NIG-parameters and the fit to real financial data; see also Rydberg (1997). The NIG-framework has several desirable properties and opportunities for modelling. A number of problems in finance can be recast within this framework, thus taking skewness and heavy tails into account, as demonstrated by Lillestøl (2000). In this paper we focus on the estimation of NIG-parameters from the Bayesian viewpoint using Markov chain Monte Carlo methods. We will explore a convenience prior leading to simple updating formulas. This is a conjugate prior when an unobserved variate is included in the parameter set. We will give some examples of estimation on real and simulated data, and make comparisons with the maximum likelihood estimates.

2 The NIG-distribution and a convenience prior

The NIG-distribution may be characterized by four parameters $\mu, \delta, \alpha, \beta$, which relate mainly to location, scale, peakedness/heavy tails and skewness respectively. The moment-generating function of a NIG$(\alpha, \beta, \mu, \delta)$-variate $X$ is given by

\[ M_X(t) = E\exp(tX) = \exp\Bigl(\mu t + \delta\bigl(\sqrt{\alpha^2-\beta^2} - \sqrt{\alpha^2-(\beta+t)^2}\bigr)\Bigr). \]

From this it is easily derived (let $\gamma = \sqrt{\alpha^2-\beta^2}$) that

\[ EX = \mu + \delta\,\frac{\beta}{\gamma}, \qquad \mathrm{var}\,X = \delta\,\frac{\alpha^2}{\gamma^3}, \]
\[ \mathrm{Skewness} = 3\,\frac{\beta}{\alpha}\,\frac{1}{(\delta\gamma)^{1/2}}, \qquad \mathrm{Kurtosis} = 3\,\Bigl(1 + 4\Bigl(\frac{\beta}{\alpha}\Bigr)^2\Bigr)\frac{1}{\delta\gamma}. \]
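For reference, these expressions are straightforward to evaluate numerically. The following small Python sketch (the function name nig_moments is ours, not taken from any package) simply transcribes the formulas above:

    import math

    def nig_moments(alpha, beta, mu, delta):
        """Mean, variance, skewness and (excess) kurtosis of NIG(alpha, beta, mu, delta).

        Direct transcription of the formulas above; assumes |beta| < alpha and delta > 0.
        """
        gamma = math.sqrt(alpha**2 - beta**2)
        mean = mu + delta * beta / gamma
        var = delta * alpha**2 / gamma**3
        skew = 3.0 * (beta / alpha) / math.sqrt(delta * gamma)
        exkurt = 3.0 * (1.0 + 4.0 * (beta / alpha)**2) / (delta * gamma)
        return mean, var, skew, exkurt

    # For the NIG(2,1,2,1) distribution used in Section 5 this gives approximately
    # mean 2.58, variance 0.77 (std.dev. 0.88), skewness 1.14 and kurtosis 3.46,
    # in line with the sample values reported in Table 1.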

The density function of $X$ is fairly complicated, involving Bessel functions. However, the distribution has a simple characterization as the marginal distribution of $X$ in the pair $(X, Z)$, where

\[ X \mid Z = z \sim N(\mu + \beta z,\; z), \qquad Z \sim IG\bigl(\delta, \sqrt{\alpha^2-\beta^2}\bigr), \qquad 0 \le |\beta| < \alpha. \]
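This mixture representation also gives a direct way of simulating NIG variates: draw $Z$ from the inverse Gaussian distribution and then $X$ given $Z = z$ from $N(\mu + \beta z, z)$. A minimal Python sketch follows; it relies on the standard correspondence between $IG(\delta, \gamma)$ above and the mean/shape parametrization (mean $\delta/\gamma$, shape $\delta^2$) used by NumPy's Wald generator, and the helper name rnig is ours:

    import numpy as np

    def rnig(n, alpha, beta, mu, delta, rng=None):
        """Simulate n NIG(alpha, beta, mu, delta) variates via the mixture above:
        Z ~ IG(delta, gamma), then X | Z=z ~ N(mu + beta*z, z)."""
        rng = np.random.default_rng() if rng is None else rng
        gamma = np.sqrt(alpha**2 - beta**2)
        # IG(delta, gamma) corresponds to mean delta/gamma and shape delta**2
        # in NumPy's mean/shape parametrization of the Wald (inverse Gaussian) law.
        z = rng.wald(mean=delta / gamma, scale=delta**2, size=n)
        x = rng.normal(loc=mu + beta * z, scale=np.sqrt(z))
        return x

    # e.g. x = rnig(400, alpha=2, beta=1, mu=2, delta=1) produces data of the kind
    # used as the simulated example in Section 5.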

Here $IG(\delta, \gamma)$ is the well-known inverse Gaussian distribution (also named the Wald distribution), see Johnson, Kotz and Balakrishnan (1995). Barndorff-Nielsen has studied the estimation of NIG-parameters by maximum likelihood methods. Given the complicated density this leads to likelihood equations that are very complicated and require extensive programming combined with numerical skills. The program "hyp" developed by Blæsild and Sørensen (1992) solves the task, but this may not be readily available. Another possibility is to use an EM type algorithm for maximum likelihood. This is more easily programmable with less challenging numerics, as long as the computing environment supports easy calculation of Bessel functions, as demonstrated by Karlis (2000). Still another possibility is to use the simple characterization of the distribution above and the fact that IG-variates are easy to simulate, see Michael et al. (1976), and explore simulation-based approaches to the estimation problem within a Bayesian framework. The Bayesian approach to estimation of NIG parameters is concurrently explored by Karlis (2001), who also considers the case of covariates. Now let $Y = (X, Z)$ with distribution as in the characterization above, i.e. probability density $f(y) = f(x, z) = f(z)\, f(x \mid z)$, where

\[ f(x \mid z) = (2\pi)^{-1/2}\, z^{-1/2}\, e^{-\frac{1}{2z}\bigl(x - (\mu + \beta z)\bigr)^2} \]
\[ f(z) = (2\pi)^{-1/2}\, \delta e^{\delta\gamma}\, z^{-3/2}\, e^{-\frac{1}{2}\bigl(\delta^2 z^{-1} + \gamma^2 z\bigr)} \]

that is

\[ f(x, z) \propto \delta e^{\delta\gamma - \beta\mu}\; z^{-2} e^{-x^2/(2z)}\; e^{\beta x + \mu x z^{-1}}\; e^{-\frac{1}{2}(\beta^2+\gamma^2)z - \frac{1}{2}(\mu^2+\delta^2)z^{-1}}. \]

If we let $\theta = (\mu, \delta, \beta, \gamma)$ we see that the joint density is within the exponential family,

\[ f(y) = g(\theta)\, h(y)\, e^{\sum_{j=1}^{4} \theta_j(\theta)\, t_j(y)} \]

where

\[ \theta_1(\theta) = \beta, \quad \theta_2(\theta) = \mu, \quad \theta_3(\theta) = -(\beta^2+\gamma^2), \quad \theta_4(\theta) = -(\mu^2+\delta^2) \]

and

\[ t_1(y) = x, \quad t_2(y) = x z^{-1}, \quad t_3(y) = \tfrac{1}{2} z, \quad t_4(y) = \tfrac{1}{2} z^{-1} \]

and

\[ g(\theta) = \delta\, e^{\delta\gamma - \beta\mu}, \qquad h(y) = z^{-2} e^{-x^2/(2z)}. \]

This means that the conjugate prior distribution of $\theta$ is then of the form

\[ p(\theta) \propto g(\theta)^{a_0}\, e^{\sum_{j=1}^{4} \theta_j(\theta)\, a_j} \]

where $(a_0, a_1, a_2, a_3, a_4)$ is a vector of superparameters, where $a_0$, $a_3$ and $a_4 > 0$. Given $n$ independent realizations $y_1, y_2, \ldots, y_n$ of the variate $Y$, the posterior of $\theta$ is of the same form with

\[ a_0' = a_0 + n \qquad \text{and} \qquad a_j' = a_j + \sum_{i=1}^{n} t_j(y_i), \quad j = 1, \ldots, 4. \]

This is an augmented posterior in the sense that we are really interested in a distribution of $\theta$ given the observable $X$, and the $Z$ in $Y = (X, Z)$ is treated as unobservable or missing. However, we can get at the desired posterior by using Markov chain Monte Carlo ideas. A closer look at the chosen distribution for $\theta = (\mu, \delta, \beta, \gamma)$ shows that $(\mu, \beta)$ is independent of $(\delta, \gamma)$ and that $(\mu, \beta)$ is bivariate normal with correlation

\[ \rho = -\frac{1}{2}\,\frac{a_0}{\sqrt{a_3 a_4}}. \]

The other parameters are

\[ \mu_\mu = \frac{1}{2(1-\rho^2)a_4}\Bigl(a_2 - \frac{a_0}{2a_3}\,a_1\Bigr), \qquad \mu_\beta = \frac{1}{2(1-\rho^2)a_3}\Bigl(a_1 - \frac{a_0}{2a_4}\,a_2\Bigr), \]
\[ \sigma^2 = \frac{1}{2(1-\rho^2)a_4}, \qquad \tau^2 = \frac{1}{2(1-\rho^2)a_3}, \]

where $\mu_\mu$ and $\sigma^2$ denote the prior mean and variance of $\mu$, and $\mu_\beta$ and $\tau^2$ those of $\beta$. Note therefore that $\mu_\mu/\sigma^2$ and $\mu_\beta/\tau^2$ are given by the two parentheses respectively. Note also the binding equation $\tau^2/\sigma^2 = a_4/a_3$. By the reparametrization from $(\mu, \beta)$ to $(\nu, \beta)$, where $\nu = \mu + \frac{a_0}{2a_4}\beta$, we get $\nu$ independent of $\beta$. The distribution of $(\delta, \gamma)$ is

\[ p(\delta, \gamma) \propto \delta^{a_0}\, e^{-a_3\gamma^2 + a_0\delta\gamma - a_4\delta^2}, \qquad \text{for } \delta \ge 0,\ \gamma \ge 0. \]

By a linear transformation to get rid of the cross-term it is seen that

\[ \delta^2 \sim \mathrm{Gamma}\Bigl(\frac{a_0+1}{2},\; a_4 - \frac{a_0^2}{4a_3}\Bigr), \qquad \gamma \mid \delta \sim \mathrm{Normal}\Bigl(\frac{a_0}{2a_3}\,\delta,\; \frac{1}{2a_3}\Bigr) \ \text{truncated at zero}. \]

In the case that the conjugate prior is too restrictive to represent our prior knowledge, it may be worthwhile to introduce another layer of superparameters. Most convenient is to take the $a_i$'s to be independent with

\[ a_i \sim \mathrm{Normal}(c_i, b_i) \ \text{for } i = 1, 2, \qquad a_i \sim \mathrm{Exponential}(b_i) \ \text{for } i = 0, 3, 4. \]

If we let $a = (a_0, a_1, a_2, a_3, a_4)$, the posterior of $a$ given $\theta$ is determined as

\[ a_1 \sim \mathrm{Normal}(c_1 + b_1\beta,\; b_1), \qquad a_2 \sim \mathrm{Normal}(c_2 + b_2\mu,\; b_2), \]
\[ a_3 \sim \mathrm{Exponential}(b_3 + \beta^2 + \gamma^2), \qquad a_4 \sim \mathrm{Exponential}(b_4 + \mu^2 + \delta^2), \]
\[ a_0 \sim \mathrm{Exponential}\bigl(b_0 + \beta\mu - \delta\gamma - \log(\delta)\bigr). \]

We see that the estimation problem essentially splits into two parts, that of $(\mu, \beta)$ and that of $(\delta, \gamma)$, and that the conjugate prior imposes cross restrictions. These may be unrealistic and not remedied by introducing superparameters. A possibility would be to model the two parts separately, which leads to estimation of a heteroscedastic normal regression and estimation of inverse Gaussian parameters, as suggested by Karlis (2001). There exist various parametrizations of the IG-distribution that may lead to different solutions to this Bayesian estimation problem.

3 A MCMC scheme for the posterior

In order to get at the posterior $p(\theta \mid x)$ we use an MCMC scheme based on full conditionals. Let $w = (x, z, \theta)$, where $x$ is observed, $z$ unobserved and $\theta$ the parameters. Let $w_s$ be a subset of the components of $w$ and $w_{-s}$ the complementary set. Then

\[ p(w_s \mid w_{-s}) \propto p(w), \]

where only the factors involving components of $w_s$ in any product formula need to be retained. A special case of this is $p(\theta, z \mid x)$, where the marginal $p(\theta \mid x)$ is our interest. Various schemes for sampling from the posterior are given in Robert and Casella (1999). The NIG-model fits under the heading of data augmentation, which is a special case of the Gibbs sampler. The full conditionals sufficient for the current problem are given by

1. $p(\theta, z \mid x) \propto p(x, z \mid \theta)\, p(\theta)$
2. $p(\theta_s \mid x, z, \theta_{-s}) \propto p(x, z \mid \theta)\, p(\theta_s \mid \theta_{-s})$
3. $p(z \mid x, \theta) \propto p(x \mid z, \theta)\, p(z \mid \theta)$

If this is written out in our case, we will see that formula 3 leads to

\[ Z_i \sim IG^*(\delta_i', \gamma') \quad \text{independently}, \]

where $IG^*(\delta, \gamma)$ denotes the distribution with density similar to IG, but with $z^{-2}$ as multiplicative factor replacing $z^{-3/2}$. Both distributions are members of the family of generalized inverse Gaussian distributions $GIG(\lambda, \delta, \gamma)$, with $\lambda = -1/2$ and $\lambda = -1$ respectively.

The parameters in our case are given by

\[ \delta_i' = \bigl(\delta^2 + (X_i - \mu)^2\bigr)^{1/2}, \qquad \gamma' = (\beta^2 + \gamma^2)^{1/2} = \alpha. \]

The $Z_i$'s can be simulated using the ideas in Michael et al. (1976), or by rejection sampling methods, as described in some detail in the appendix. Formula 2, for the complete parameter set $\theta$ as well as for $(\mu, \beta)$ and $(\gamma, \delta)$ separately, just reiterates the result of our choice of conjugate prior for the augmented $(X, Z)$, where the superparameters are updated according to

\[ a_0' = a_0 + n, \qquad a_1' = a_1 + \sum_i X_i, \qquad a_2' = a_2 + \sum_i X_i Z_i^{-1}, \]
\[ a_3' = a_3 + \tfrac{1}{2}\sum_i Z_i, \qquad a_4' = a_4 + \tfrac{1}{2}\sum_i Z_i^{-1}. \]

New $\theta$ can then be simulated according to the distributional structure given above. That is, $(\mu, \beta)$ is drawn from the bivariate normal, $\delta^2$ from the Gamma distribution, and $\gamma$ is computed from $\delta$ and a simulated Normal variate. In the case of another layer of parameters we get another set to visit in each round, determined by the posterior given at the end of the previous section.
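For illustration, the whole sweep can be sketched as follows. This is only an outline of the scheme above under the single-layer prior (it is not the code used for the examples in Section 5), and it takes the shortcut of drawing the $Z_i$ directly with SciPy's generalized inverse Gaussian generator (density proportional to $x^{p-1}\exp(-b(x+1/x)/2)$), so that $GIG(\lambda, \delta, \gamma)$ corresponds to $p = \lambda$, $b = \delta\gamma$ and scale $\delta/\gamma$; all function names are ours.

    import numpy as np
    from scipy.stats import geninvgauss, truncnorm

    def gibbs_nig(x, a, n_iter=1000, rng=None):
        """Data-augmentation Gibbs sampler for the NIG parameters under the single-layer
        conjugate prior with superparameters a = (a0, a1, a2, a3, a4).
        Returns an (n_iter x 4) array of sampled (mu, beta, delta, gamma)."""
        rng = np.random.default_rng() if rng is None else rng
        x = np.asarray(x, dtype=float)
        n = x.size
        a0, a1, a2, a3, a4 = a
        # crude starting values
        mu, beta, delta, gamma = x.mean(), 0.0, x.std(), 1.0 / x.std()
        draws = np.empty((n_iter, 4))
        for it in range(n_iter):
            # Z-step: Z_i | x_i, theta ~ IG*(delta'_i, alpha) = GIG(-1, delta'_i, alpha)
            alpha = np.sqrt(beta ** 2 + gamma ** 2)
            dp = np.sqrt(delta ** 2 + (x - mu) ** 2)
            z = geninvgauss.rvs(p=-1.0, b=dp * alpha, scale=dp / alpha, random_state=rng)
            # conjugate update of the superparameters
            a0p = a0 + n
            a1p = a1 + x.sum()
            a2p = a2 + (x / z).sum()
            a3p = a3 + 0.5 * z.sum()
            a4p = a4 + 0.5 * (1.0 / z).sum()
            # (mu, beta) from the bivariate normal with precision [[2*a4p, a0p], [a0p, 2*a3p]]
            cov = np.linalg.inv(np.array([[2.0 * a4p, a0p], [a0p, 2.0 * a3p]]))
            mu, beta = rng.multivariate_normal(cov @ np.array([a2p, a1p]), cov)
            # delta^2 ~ Gamma((a0p+1)/2, rate a4p - a0p^2/(4*a3p)); then gamma | delta
            # from a normal truncated at zero
            rate = a4p - a0p ** 2 / (4.0 * a3p)
            delta = np.sqrt(rng.gamma(shape=0.5 * (a0p + 1.0), scale=1.0 / rate))
            loc, sd = a0p * delta / (2.0 * a3p), np.sqrt(1.0 / (2.0 * a3p))
            gamma = truncnorm.rvs(-loc / sd, np.inf, loc=loc, scale=sd, random_state=rng)
            draws[it] = mu, beta, delta, gamma
        return draws

    # e.g. draws = gibbs_nig(x, a=(1.0, 0.0, 0.0, 1.0, 1.0)) for the prior used in Section 5;
    # samples of alpha are then np.sqrt(draws[:, 1] ** 2 + draws[:, 3] ** 2).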

4 The choice of superparameter values

We will now examine the choice of superparameters in order to reflect our (lack of) knowledge about the parameters. It is of course convenient to have few superparameters to address, as is the case with our conjugate prior. A drawback may be little flexibility, i.e. our choices affect the parameters jointly in a manner which may not be transparent. A basic restriction for the expressions of the variances to stay positive is

\[ a_3 \cdot a_4 > \tfrac{1}{4}\, a_0^2. \]

From the five updating equations we see that more information is accumulated as $a_0$, $a_3$ and $a_4$ increase (since the $z$'s are non-negative). The choice of a small $a_0$ to reflect prior ignorance may go together with small $a_i$'s, but with some prior knowledge, and choosing a larger $a_0$, this has to go along with larger $a_3$ and/or $a_4$ as well to match the restriction. The parameter $\rho$ may then be helpful for calibration purposes. A possible consequence of few parameters is that, when we try to express ignorance in some sense, it may have unwanted and even contradictory implications. This is in fact the case here, where $a_0 = 0$, at first sight, is a natural choice for ignorance. This means that $\rho = 0$ and consequently

\[ \mu_\mu = \frac{a_2}{2a_4}, \qquad \mu_\beta = \frac{a_1}{2a_3}, \qquad \sigma^2 = \frac{1}{2a_4} = E\delta^2, \qquad \tau^2 = \frac{1}{2a_3} = E\gamma^2, \]

i.e. we have implicitly assumed that the more certain you are about $\mu$ (resp. $\beta$), the smaller you expect $\delta$ (resp. $\gamma$) to be, and judged from the corresponding variances of $\delta^2$ and $\gamma^2$ you are even more certain about that. So, $a_0 = 0$ is an ignorant's choice to represent ignorance! It is likely that opinions are initially formed by observed centers and shapes of empirical distributions. It seems therefore natural to first reflect on the parameters $(\mu, \beta)$ (stage 1), and then on $(\delta, \gamma)$ (stage 2). We may first ask to what extent our knowledge of either of these is affected by the other. In order to match the expression for the expected value $EX = \mu + \delta\beta\gamma^{-1}$, a large/small $\mu$ departing from its prior expected value has to be balanced by a small/large $\beta$ compared to its expectation, i.e. $\mu$ and $\beta$ have to be negatively correlated, as reflected by the conjugate prior. It is easily checked that the linear combination $\nu = \mu + t\beta$ with the smallest prior variance is given by

\[ \nu = \mu + \frac{a_0}{2a_4}\,\beta. \]

Note from Section 2 that this $\nu$ is stochastically independent of $\beta$. $EX$ is such a linear combination, and it is not unreasonable to assume that we are more certain about $EX$ than about any other linear combination of $\mu$ and $\beta$. This assumption, as well as its immediate consequences, will be referred to later as A0. We see that the corresponding prior mean and variance are given by

\[ \text{A0}: \qquad E\nu = \frac{a_2}{2a_4}, \qquad \mathrm{var}\,\nu = \frac{1}{2a_4}. \]

Note that $\rho$ has now disappeared and that the expressions are the same as for $\mu$ in the case of $\rho = 0$. If we go back to the original expression for $EX$ and use the prior independence of $(\mu, \beta)$ and $(\delta, \gamma)$, we get the equation

\[ \text{A0}: \qquad E\Bigl(\frac{\delta}{\gamma}\Bigr) = \frac{a_0}{2a_4} = -\rho\sqrt{\frac{a_3}{a_4}}, \]

i.e. our assumption A0 has implications for the two other parameters as well. Note also that if the "ignorance" assumption $a_0 = 0$ is used in conjunction with A0, it implies that $\delta = 0$, which leads to the one-point distribution at $\mu$, which is quite the opposite of ignorance!

Let assumption A1 be that the prior mean of $EX$ is equal to zero. We then get

\[ \text{A0 + A1}: \qquad a_2 = 0. \]

In the case of a non-zero prior mean, we could as well subtract the prior mean from all observations and start from there. In order to determine $a_1$ we have to be more specific about $\beta$ or $\mu$. In the case of $a_2 = 0$ we see that

\[ \mu_\mu = -\frac{a_0}{2a_4}\,\mu_\beta, \qquad \frac{\mu_\mu}{\sigma^2} = -\frac{a_0}{2a_3}\,a_1, \qquad \frac{\mu_\beta}{\tau^2} = a_1. \]

Thus $a_1$ determines the sign of $\mu_\beta$, and $\mu_\mu$ gets the opposite sign, in this case, which holds in particular under A0 + A1. Let us also look into assuming $E\beta = 0$ (assumption A2) and $E\mu = 0$ (assumption A3). This gives the following restrictions on the superparameters respectively:

\[ \text{A2}: \quad a_1 = \frac{a_0}{2a_4}\,a_2, \qquad \text{A3}: \quad a_2 = \frac{a_0}{2a_3}\,a_1, \]

with obvious inequalities for prior expectations less than or greater than zero. Note that any two of A1, A2 and A3 taken together are equivalent and correspond to $a_1 = a_2 = 0$, omitting the case of $\rho^2 = 1$, which leads to infinite prior variances of $\mu$, $\beta$ and $\delta^2$. We may just want to combine A0 and A2 to get

\[ \text{A0 + A2}: \qquad E\Bigl(\frac{\delta}{\gamma}\Bigr) = \frac{a_1}{a_2}. \]

Let us now turn to the prior of $(\delta, \gamma)$. It may be harder to have opinions about this than about the prior of $(\mu, \beta)$, and it is likely that we want to express ignorance. We therefore have to choose the parameters at the first stage carefully in order to leave room for this. Note that the prior of $(\delta, \gamma)$ is determined by the parameters $a_0$, $a_3$ and $a_4$. Recall the binding restriction above, where in fact the ratio of $a_3$ and $a_4$ is determined by the prior variances of $\mu$ and $\beta$. As said earlier this is an unwanted restriction, which we have to balance off according to where our knowledge is best. Note that

\[ \delta^2 \sim \mathrm{Gamma}\Bigl(\frac{a_0+1}{2},\; a_4(1-\rho^2)\Bigr). \]

It will be of interest to see numerically how ad hoc assumptions representing various types of knowledge work out, when trying to balance or ignore the consequences for the other parameters. Among these are ignorance assumptions taking $a_0 = 1$ or $a_0 = 2$ (assumptions B1 and B2), and saying that you are equally uncertain about $\mu$ and $\beta$, thus taking $a_3 = a_4$. Of particular interest is the case $a_0 = a_3 = a_4$, which means that $\rho = -1/2$.
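Since the superparameters act on the prior jointly, it may help to tabulate the implied prior quantities before settling on values. The following small helper is ours (it uses the expressions from Section 2, reading the second Gamma parameter as a rate); for $a_0 = a_3 = a_4 = 1$ and $a_1 = a_2 = 0$, the specification used in Section 5, it returns $\rho = -1/2$:

    import math

    def prior_summary(a0, a1, a2, a3, a4):
        """Prior quantities implied by the conjugate prior with superparameters
        (a0, ..., a4), using the expressions in Section 2. Requires a3*a4 > a0**2/4."""
        rho = -0.5 * a0 / math.sqrt(a3 * a4)
        one = 1.0 - rho**2
        var_mu = 1.0 / (2.0 * one * a4)            # sigma^2
        var_beta = 1.0 / (2.0 * one * a3)          # tau^2
        mean_mu = var_mu * (a2 - a0 * a1 / (2.0 * a3))
        mean_beta = var_beta * (a1 - a0 * a2 / (2.0 * a4))
        e_delta2 = 0.5 * (a0 + 1.0) / (a4 * one)   # mean of Gamma((a0+1)/2, a4(1-rho^2))
        return {"rho": rho, "mean_mu": mean_mu, "var_mu": var_mu,
                "mean_beta": mean_beta, "var_beta": var_beta, "E_delta2": e_delta2}

    # prior_summary(1, 0, 0, 1, 1) -> rho = -0.5, mean_mu = mean_beta = 0,
    # var_mu = var_beta = 2/3, E_delta2 = 4/3.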


5 Examples of estimation of NIG parameters

Reliable estimates of the parameters of a heavy-tailed distribution require a minimum of observations in the tails, and small data sets of independent observations are not likely to give good results. Our experience so far with the NIG family based on simulated data suggests that difficulties may arise even for 100 observations and that about 400 observations are desirable, see Lillestøl (2000). We provide here two examples of NIG parameter fitting, one for simulated data and one for finance data. The first data set, NIG2121, consists of 400 simulated independent NIG(2,1,2,1) variates. The second data set, FTARET, consists of the monthly nominal returns of the FT-Actuaries All-Share Index for the UK from January 1965 to December 1995 (372 observations). The empirical distributions are shown as histograms in Figure 1.


Figure 1: Histograms of data sets NIG2121 and FTARET

Descriptive measures are given in Table 1.

              Mean    StdDev   Skewness   Kurtosis
    NIG2121   2.55    0.85     1.22        3.51
    FTARET    5.53    6.05     1.12       14.43

Table 1: Descriptive measures

In Table 2 we give estimates by the MC-method with a single layer of parameters and with two layers of parameters (MC2), and compare these with the corresponding ML-estimates. The MC estimation with a single layer is based on a prior taking $a_0 = a_3 = a_4 = 1$ and $a_1 = a_2 = 0$, iterating 10 rounds before a sample of $\alpha$, $\beta$, $\mu$ and $\delta$ is taken. This is repeated 400 times and is the basis for calculating the posterior means and the smoothed histograms representing posterior densities. The posterior means are estimated by the average of the sampled values. Estimates obtained by Rao-Blackwellization are also computed for comparison and as a check for convergence. For the two-layer estimation the values of the superparameters $b_i$ and $c_i$ are chosen to match the expected values of the $a_i$'s for the single-layer specification above. The ML-estimates are obtained by the program 'hyp', and the results are confirmed by the EM-algorithm of Karlis mentioned above. We have also computed moment estimates (MM) obtained by inverting the expressions for the first four moments given in Section 2 (a sketch of one such inversion is given below, after Table 2).

                       α        β        μ       δ
    NIG2121 (MC)     1.360    0.868    2.205   0.407
    NIG2121 (MC2)    1.669    0.965    2.142   0.579
    NIG2121 (ML)     2.490    1.343    1.876   1.049
    NIG2121 (MM)     2.445    1.396    1.872   0.972
    FTARET  (MC)     0.823    0.742    1.428   1.971
    FTARET  (MC2)    0.818    0.733    1.486   2.019
    FTARET  (ML)     0.174   -0.004    5.662   5.697
    FTARET  (MM)     0.083    0.015    5.001   2.872

Table 2: Parameter estimates
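One way to carry out the moment inversion behind the MM rows in closed form is sketched below. The derivation is ours (reading the kurtosis of Section 2 as excess kurtosis) and requires 3·Kurtosis > 5·Skewness², the feasible region for the NIG family; applied to the descriptive measures in Table 1 it reproduces the MM rows of Table 2 up to rounding of the inputs.

    import math

    def nig_mm(mean, var, skew, exkurt):
        """Method-of-moments estimates (alpha, beta, mu, delta) obtained by inverting
        the moment expressions in Section 2. Requires 3*exkurt > 5*skew**2."""
        zeta = 9.0 / (3.0 * exkurt - 4.0 * skew**2)    # zeta = delta * gamma
        rho = skew * math.sqrt(zeta) / 3.0             # rho = beta / alpha
        alpha = math.sqrt(zeta / var) / (1.0 - rho**2)
        beta = rho * alpha
        gamma = math.sqrt(alpha**2 - beta**2)
        delta = zeta / gamma
        mu = mean - delta * beta / gamma
        return alpha, beta, mu, delta

    # nig_mm(2.58, 0.77, 1.14, 3.46) returns approximately (2, 1, 2, 1), the true
    # NIG2121 parameters; nig_mm(2.55, 0.85**2, 1.22, 3.51) gives the NIG2121 (MM) row.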

We see that the parameter estimates of the ML-method and the MC-method turned out somewhat different for both data sets. Let us first comment on the NIG2121 data. The MC-method has overestimated $\mu$, while the ML-method has underestimated this parameter. The other three parameters are underestimated by the MC-method and overestimated by the ML-method, except for $\delta$, which is about on target by the latter method. The low estimate for $\delta$ by the MC-method is somewhat disturbing. We see that adding the second parameter layer has led to improvement for the MC-method. Now the estimates of both $\alpha$ and $\beta$ are closer to their true values by this method than by the ML-method, with $\beta$ about on target. The estimate of $\delta$ is also improved somewhat, but is still at some distance from the true value. We see that the MM-estimates are fairly close to the ML-estimates. It is also of interest to compare the first four moments by plugging the estimated parameter values into the theoretical expressions for the moments in Section 2 and comparing with their true values. The main difference between the methods is that MC attributes more skewness and kurtosis than the true ones of the parent distribution, while ML attributes less skewness and kurtosis, and is closer to "target". The smoothed histograms of sampled posterior densities for NIG2121 based on the MC2 simulations are given in Figure 2.

[Figure 2 here: four panels showing the posterior densities of alpha, beta, mu and delta.]

Figure 2: Smoothed histograms of sampled posterior densities for NIG2121

Looking at the posterior distributions for the simulated data we see that the true value of $\beta$ is at the central part of its distribution, while the true $\mu$ and $\alpha$ are out in the tails. For $\delta$ we have been rather unsuccessful indeed! For the data set FTARET there are even more striking differences. We see that the ML-method has given a much higher estimate for $\mu$ and $\delta$ than the MC-method. On the other hand, both $\alpha$ and $\beta$ are higher with the MC-method than with the ML-method. We see that adding the second layer has not led to appreciable changes. Here the MM-estimates differ more from the ML-estimates, but stay closer to the MC-estimates, except for the parameter $\delta$. The smoothed histograms of sampled posterior densities for FTARET based on the MC2 simulations are given in Figure 3.

[Figure 3 here: four panels showing the posterior densities of alpha, beta, mu and delta.]

Figure 3: Smoothed histograms of sampled posterior densities for FTARET

It may be instructive to compare the fits in terms of QQ-plots. For the NIG2121 data we may compare with the true distribution. Exact computations require access to a Bessel function routine, and we may as well simulate the distribution. We have simulated n=10000 observations from the true distribution as well as from the distributions with the parameters estimated by the ML-method and the MC-method respectively. The QQ-plot is given in Figure 4. We clearly see that the MC-fit is inferior to the ML-fit, and that the MC2-fit is an improvement which makes the fit comparable to the success of the ML-fit, but the two fits obviously have some distinct features that separate them.

[Figure 4 here: three panels, "QQ-plot NIG2121 ML vs True", "QQ-plot NIG2121 MC vs True" and "QQ-plot NIG2121 MC2 vs True".]

Figure 4: QQ-plot for distributions fitted by ML and MC vs True NIG2121

[Figure 5 here: three panels, "QQ-plot FTARET ML", "QQ-plot FTARET MC" and "QQ-plot FTARET MC2" against the observed data.]

Figure 5: QQ-plot for distributions fitted by ML and MC vs Observed FTARET


For the FTARET data we compare the observed data with data simulated from the distributions fitted by the ML-method and the MC-method respectively. We chose here to simulate data of the same size, namely n=372 observations. The resulting QQ-plot is given in Figure 5. It seems that the ML-fit is superior to both MC-fits. The comparisons above are based on limited experience so far, and more suitable choices of prior specifications may hopefully lead to further improvement. Admittedly, the results obtained with the current parametrization are not impressive, and alternative parametrizations should be compared before general conclusions can be drawn.
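A QQ comparison of the kind shown in Figures 4 and 5 can be produced along the following lines. This is a sketch only (plotting details differ from the graphs above); it re-simulates NIG data through the mixture representation of Section 2 and plots matched empirical quantiles, and the helper name nig_qq is ours:

    import numpy as np
    import matplotlib.pyplot as plt

    def nig_qq(data, alpha, beta, mu, delta, n_sim=10_000, rng=None):
        """QQ-plot of observed data against a sample simulated from NIG(alpha, beta, mu, delta)."""
        rng = np.random.default_rng() if rng is None else rng
        gamma = np.sqrt(alpha**2 - beta**2)
        z = rng.wald(mean=delta / gamma, scale=delta**2, size=n_sim)
        sim = rng.normal(loc=mu + beta * z, scale=np.sqrt(z))
        probs = (np.arange(1, len(data) + 1) - 0.5) / len(data)
        plt.plot(np.quantile(sim, probs), np.sort(data), ".")
        plt.xlabel("fitted NIG quantiles")
        plt.ylabel("observed quantiles")

    # e.g. nig_qq(returns, 0.174, -0.004, 5.662, 5.697) for the FTARET ML fit in Table 2
    # (returns = the FTARET series, not included here) gives a plot of the kind in Figure 5.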

Acknowledgements

This research is supported by the Deutsch-Norwegisches Stipendprogramm (Ruhrgas Stipendium) for visiting Institut für Statistik und Ökonometrie, Humboldt-Universität zu Berlin. I am grateful for this support, and to Wolfgang Härdle and his staff at Sonderforschungsbereich 373: Quantifikation und Simulation Ökonomischer Prozesse, for providing an excellent environment for research and the computing facilities based on their program XploRe. I also have to thank Michael Kjær Sørensen and Dimitris Karlis for providing the ML-estimates based on my data, and Dimitris Karlis for valuable personal communications.


References

Adler, R.J., Feldman, R.E. and Taqqu, M.S. (eds.) (1998) A Practical Guide to Heavy Tails: Statistical Techniques and Applications. Boston: Birkhäuser.

Barndorff-Nielsen, O. (1997) Normal inverse Gaussian processes and stochastic volatility modelling. Scandinavian Journal of Statistics, 24, 1-13.

Blæsild, P. and Sørensen, M.K. (1992) 'Hyp' - a computer program for analyzing data by means of the hyperbolic distribution. Research Report No. 300, Department of Theoretical Statistics, University of Aarhus.

Damien, P. and Walker, S. (1997) Sampling nonstandard distributions via the Gibbs sampler. Preprint.

Johnson, N.L., Kotz, S. and Balakrishnan, N. (1995) Continuous Univariate Distributions, vol. 2. New York: Wiley.

Karlis, D. (2000) An EM type algorithm for maximum likelihood estimation of the Normal-Inverse Gaussian distribution. Research report, Athens University of Economics and Business.

Karlis, D. (2001) Bayesian estimation for the NIG distribution. Personal communication.

Lillestøl, J. (2000) Risk analysis and the NIG distribution. Journal of Risk, 2, 41-55.

McCulloch, J.H. (1996) Financial applications of stable distributions. In: Maddala, G.S. and Rao, C.R. (eds.) Statistical Methods in Finance, Handbook of Statistics, vol. 14. New York: North Holland.

Michael, J.R., Schucany, W.R. and Haas, R.W. (1976) Generating random variates using transformations with multiple roots. The American Statistician, 30, 88-89.

Robert, C.P. and Casella, G. (1999) Monte Carlo Statistical Methods. New York: Springer-Verlag.

Rydberg, T.H. (1997) The normal inverse Gaussian Lévy process: Simulation and approximation. Communications in Statistics - Stochastic Models, 13, 887-910.

Appendix: Simulation of IG*-variates

Let $V$ be a chi-square variate with 2 degrees of freedom and compute the roots with respect to $Z$ of

\[ V = \frac{(\gamma Z - \delta)^2}{Z}. \]

They are given by

\[ Z = \frac{\delta}{\gamma} + \frac{1}{2\gamma^2}\Bigl(V \pm \sqrt{V^2 + 4\delta\gamma V}\Bigr). \]

Let $Z_1$ and $Z_2$ be the minus and plus root respectively, and note that $Z_2 = \delta^2/(\gamma^2 Z_1)$. Let $\theta = \delta/\gamma$ and take

\[ Z = Z_1 \ \text{with probability} \ \frac{\theta^2}{\theta^2 + Z_1^2}, \qquad Z = Z_2 \ \text{with probability} \ \frac{Z_1^2}{\theta^2 + Z_1^2} = \frac{\theta^2}{\theta^2 + Z_2^2}. \]

Then $Z$ is $IG^*(\delta, \gamma)$.

An alternative way of simulating IG*-variates is by rejection sampling as follows:

1. Generate $Z$ by letting $Z^{-1}$ be exponential with rate $\frac{1}{2}\delta^2$.
2. Compute $T = \exp(-\frac{1}{2}\gamma^2 Z)$.
3. Generate $U \sim$ Uniform[0,1]; if $T > U$ keep $Z$, otherwise reject.

This procedure follows from taking $q(z \mid \theta) = z^{-2}\exp(-\frac{1}{2}\delta^2 z^{-1})$ as envelope. A problem with this procedure is that many candidates are likely to be rejected. Some improvement is obtained by taking $z^{-1} \sim \mathrm{Gamma}(k, \frac{1}{2}\delta^2)$ instead, and adjusting $k$ to the situation at hand. However, the improvement is only slight. A simulation method that works for any generalized inverse Gaussian distribution is proposed by Damien and Walker (1997), using a Gibbs sampler with auxiliary latent variables.
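A minimal Python sketch of this rejection sampler (our own transcription of steps 1-3 above; the helper name is ours):

    import numpy as np

    def r_igstar(n, delta, gamma, rng=None):
        """Draw n IG*(delta, gamma) variates by the rejection method above:
        propose 1/Z ~ Exponential(rate = delta**2 / 2) and accept with
        probability exp(-gamma**2 * Z / 2)."""
        rng = np.random.default_rng() if rng is None else rng
        out = np.empty(n)
        filled = 0
        while filled < n:
            m = n - filled
            z = 1.0 / rng.exponential(scale=2.0 / delta**2, size=m)     # envelope draws
            keep = rng.uniform(size=m) < np.exp(-0.5 * gamma**2 * z)    # acceptance step
            k = int(keep.sum())
            out[filled:filled + k] = z[keep]
            filled += k
        return out

    # e.g. z = r_igstar(1000, delta=1.0, gamma=1.7); note that the acceptance
    # probability can be low, as remarked above.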
