Journal of Probability and Statistics
Advances in Applied Econometrics Guest Editors: Efthymios M. Tsionas, William Greene, and Kajal Lahiri
Copyright © 2011 Hindawi Publishing Corporation. All rights reserved. This is a special issue published in “Journal of Probability and Statistics.” All articles are open access articles distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Editorial Board
M. F. Al-Saleh, Jordan; V. V. Anh, Australia; Zhidong Bai, China; Ishwar Basawa, USA; Shein-chung Chow, USA; Dennis Dean Cox, USA; Junbin B. Gao, Australia; Arjun K. Gupta, USA; Debasis Kundu, India; Nikolaos E. Limnios, France; Chunsheng Ma, USA; Hung T. Nguyen, USA; M. Puri, USA; José María Sarabia, Spain; H. P. Singh, India; Man Lai Tang, Hong Kong; Robert J. Tempelman, USA; A. Thavaneswaran, Canada; P. van der Heijden, The Netherlands; Rongling Wu, USA; Philip L. H. Yu, Hong Kong; Ricardas Zitikis, Canada
Contents

Advances in Applied Econometrics, Efthymios M. Tsionas, William Greene, and Kajal Lahiri
Volume 2011, Article ID 978530, 2 pages

Some Recent Developments in Efficiency Measurement in Stochastic Frontier Models, Subal C. Kumbhakar and Efthymios G. Tsionas
Volume 2011, Article ID 603512, 25 pages

Estimation of Stochastic Frontier Models with Fixed Effects through Monte Carlo Maximum Likelihood, Grigorios Emvalomatis, Spiro E. Stefanou, and Alfons Oude Lansink
Volume 2011, Article ID 568457, 13 pages

Panel Unit Root Tests by Combining Dependent P Values: A Comparative Study, Xuguang Sheng and Jingyun Yang
Volume 2011, Article ID 617652, 17 pages

Nonparametric Estimation of ATE and QTE: An Application of Fractile Graphical Analysis, Gabriel V. Montes-Rojas
Volume 2011, Article ID 874251, 23 pages

Estimation and Properties of a Time-Varying GQARCH(1,1)-M Model, Sofia Anyfantaki and Antonis Demos
Volume 2011, Article ID 718647, 39 pages

The CSS and The Two-Staged Methods for Parameter Estimation in SARFIMA Models, Erol Egrioglu, Cagdas Hakan Aladag, and Cem Kadilar
Volume 2011, Article ID 691058, 11 pages
Hindawi Publishing Corporation Journal of Probability and Statistics Volume 2011, Article ID 978530, 2 pages doi:10.1155/2011/978530
Editorial
Advances in Applied Econometrics

Efthymios G. Tsionas,1 William Greene,2 and Kajal Lahiri3

1 Department of Economics, Athens University of Economics and Business, 10434 Athens, Greece
2 Department of Economics, Stern School of Business, New York University, New York, NY 10012, USA
3 Department of Economics, University at Albany, State University of New York, Albany, NY 12222, USA

Correspondence should be addressed to Efthymios G. Tsionas, [email protected]

Received 29 November 2011; Accepted 29 November 2011

Copyright © 2011 Efthymios G. Tsionas et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
The purpose of this special issue is to bring together contributions in two main fields: time series analysis and the estimation of technical inefficiency. Both fields are important in applied econometrics and have attracted considerable interest in recent years. In time series analysis, the focus is on seasonal ARIMA models and the GQARCH-M model. In efficiency estimation, the issue provides a review and a paper dealing with the estimation of frontier models with fixed effects. Moreover, the special issue includes a paper on panel unit root testing and a paper on nonparametric estimation of treatment effects.

In “Some recent developments in efficiency measurement in stochastic frontier models” by S. C. Kumbhakar and E. G. Tsionas, some of the recent developments in efficiency measurement using stochastic frontier (SF) models in selected areas are addressed. Three issues are discussed in detail: first, estimation of SF models with input-oriented technical efficiency; second, estimation of latent class models to address technological heterogeneity as well as heterogeneity in economic behavior; finally, estimation of SF models using local maximum likelihood methods. Estimation of some of these models was in the past considered too difficult. The authors focus on the advances that have been made in recent years to estimate some of these so-called difficult models and complement these with developments in other areas as well. The reader is advised to consult Greene (2008) for a comprehensive review of stochastic frontier models.

G. Emvalomatis et al., in their paper “Estimation of stochastic frontier models with fixed effects through Monte Carlo maximum likelihood,” propose a procedure for choosing appropriate densities for integrating the incidental parameters out of the likelihood function in a general context. The densities are based on priors that are updated using information from the data and are robust to possible correlation of the group-specific constant terms with the explanatory variables. Monte Carlo experiments are performed in the specific context of stochastic frontier models to examine and compare the sampling properties of the proposed
estimator with those of the random effects and correlated random effects estimators. The results suggest that the estimator is unbiased even in short panels. An application to a cross-country panel of EU manufacturing industries is presented as well. The proposed estimator produces a distribution of efficiency scores suggesting that these industries are highly efficient, while the other estimators suggest much poorer performance.

In “Nonparametric estimation of ATE and QTE: an application of fractile graphical analysis,” G. V. Montes-Rojas constructs nonparametric estimators for average and quantile treatment effects using fractile graphical analysis, under the identifying assumption that selection into treatment is based on observable characteristics. The proposed method has two steps: first, the propensity score is estimated, and, second, a blocking estimation procedure using this estimate is used to compute treatment effects. In both cases, the estimators are proved to be consistent. Monte Carlo results show a better performance than other procedures based on the propensity score. These estimators are applied to a job training dataset.

In “Estimation and properties of a time-varying GQARCH(1,1)-M model,” S. Anyfantaki and A. Demos outline the issues arising from time-varying GARCH-M models and suggest employing a Markov chain Monte Carlo algorithm which allows the calculation of a classical estimator via the simulated EM algorithm, or a simulated Bayesian solution, in only O(T) computational operations, where T is the sample size. The theoretical dynamic properties of a time-varying GQARCH(1,1)-M model are derived and discussed, and the suggested Bayesian estimation is applied to three major stock markets.

In “The CSS and the two-staged methods for parameter estimation in SARFIMA models,” E. Egrioglu et al. focus on the analysis of seasonal autoregressive fractionally integrated moving average (SARFIMA) models. Two methods, the conditional sum of squares (CSS) method and the two-staged method introduced by Hosking (1984), are proposed to estimate the parameters of SARFIMA models. However, no simulation study had been conducted in the literature, so it was not known how these methods behave under different parameter settings and sample sizes in SARFIMA models. The aim of this study is to show the behavior of these methods by a simulation study. According to the results of the simulation, the advantages and disadvantages of both methods under different parameter settings and sample sizes are discussed by comparing the root mean square error (RMSE) obtained by the CSS and two-staged methods. The comparison shows that the CSS method produces better results than the two-staged method.

In “Panel unit root tests by combining dependent P values: a comparative study,” X. Sheng and J. Yang conduct a systematic comparison of the performance of four commonly used P-value combination methods applied to panel unit root tests: the original Fisher test, the modified inverse normal method, Simes’ test, and the modified truncated product method (TPM). The simulation results show that, under cross-sectional dependence, the original Fisher test is severely oversized, but the other three tests exhibit good size properties.
Simes’ test is powerful when the total evidence against the joint null hypothesis is concentrated in one or very few of the tests being combined, but the modified inverse normal method and the modified TPM perform well when evidence against the joint null is spread among more than a small fraction of the panel units. The differences are further illustrated through an empirical example on testing purchasing power parity using a panel of OECD quarterly real exchange rates.

Efthymios G. Tsionas
William Greene
Kajal Lahiri
Hindawi Publishing Corporation Journal of Probability and Statistics Volume 2011, Article ID 603512, 25 pages doi:10.1155/2011/603512
Review Article
Some Recent Developments in Efficiency Measurement in Stochastic Frontier Models

Subal C. Kumbhakar1 and Efthymios G. Tsionas2

1 Department of Economics, State University of New York, Binghamton, NY 13902, USA
2 Department of Economics, Athens University of Economics and Business, 76 Patission Street, 104 34 Athens, Greece

Correspondence should be addressed to Subal C. Kumbhakar, [email protected]

Received 13 May 2011; Accepted 2 October 2011

Academic Editor: William H. Greene

Copyright © 2011 S. C. Kumbhakar and E. G. Tsionas. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This paper addresses some of the recent developments in efficiency measurement using stochastic frontier (SF) models in some selected areas. The following three issues are discussed in detail. First, estimation of SF models with input-oriented technical efficiency. Second, estimation of latent class models to address technological heterogeneity as well as heterogeneity in economic behavior. Finally, estimation of SF models using the local maximum likelihood method. Estimation of some of these models was in the past considered too difficult. We focus on the advances that have been made in recent years to estimate some of these so-called difficult models. We complement these with some developments in other areas as well.
1. Introduction

In this paper we focus on three issues. First, we discuss issues (mostly econometric) related to input-oriented (IO) and output-oriented (OO) measures of technical inefficiency and discuss the estimation of production functions with IO technical inefficiency. We discuss implications of the IO and OO measures from both the primal and dual perspectives. Second, the latent class (finite mixture) modeling approach is extended to accommodate behavioral heterogeneity. Specifically, we consider profit- (revenue-) maximizing and cost-minimizing behaviors with technical inefficiency. In our mixing/latent class model, first we consider a system approach in which some producers maximize profit while others simply minimize cost, and then we use a distance function approach and mix the input and output distance functions, in which it is assumed, at least implicitly, that some producers maximize revenue while others minimize cost. In the distance function approach the behavioral assumptions are not explicitly taken into account. The prior probability in favor of profit- (revenue-) maximizing behavior is assumed to depend on some exogenous variables. Third, we consider stochastic frontier (SF) models that are estimated using the local maximum likelihood (LML)
method to address the flexibility issue (functional form, heteroskedasticity, and determinants of technical inefficiency).
2. The IO and OO Debate

The technology, with or without inefficiency, can be looked at from either a primal or a dual perspective. In a primal setup two measures of technical efficiency are mostly used in the efficiency literature. These are (i) input-oriented (IO) technical inefficiency and (ii) output-oriented (OO) technical inefficiency.1 There are some basic differences between the IO and OO models so far as features of the technology are concerned. Although some of these differences and their implications are well known, except for Kumbhakar and Tsionas [1], no one has estimated a stochastic production frontier model econometrically with IO technical inefficiency using cross-sectional data.2 Here we consider estimation of a translog production model with IO technical inefficiency.
2.1. The IO and OO Models

Consider a single output production technology where $Y$ is a scalar output and $X$ is a vector of inputs. Then the production technology with the IO measure of technical inefficiency can be expressed as

$$Y_i = f(X_i \cdot \Theta_i), \quad i = 1, \ldots, n, \tag{2.1}$$

where $Y_i$ is a scalar output, $\Theta_i \le 1$ is IO efficiency (a scalar), $X_i$ is the $J \times 1$ vector of inputs, and $i$ indexes firms. The IO technical inefficiency for firm $i$ is defined as $\ln \Theta_i \le 0$ and is interpreted as the rate at which all the inputs can be reduced without reducing output. On the other hand, the technology with the OO measure of technical inefficiency is specified as

$$Y_i = f(X_i) \cdot \Lambda_i, \tag{2.2}$$
where $\Lambda_i \le 1$ represents OO efficiency (a scalar), and $\ln \Lambda_i \le 0$ is defined as OO technical inefficiency. It shows the percent by which actual output could be increased without increasing inputs (for more details, see Figure 1). It is clear from (2.1) and (2.2) that if $f(\cdot)$ is homogeneous of degree $r$ then $\Theta_i^r = \Lambda_i$, that is, the relationship is independent of $X$ and $Y$. If homogeneity is not present, their relationship will depend on the input quantities and the parametric form of $f(\cdot)$. We now show the IO and OO measures of technical efficiency graphically. The observed production plan $(Y, X)$ is indicated by the point A. The vertical length AB measures OO technical inefficiency, while the horizontal distance AC measures IO technical inefficiency. Since the former measures the percentage loss of output while the latter measures the percentage increase in input usage in moving to the production frontier starting from the inefficient production plan indicated by point A, these two measures are, in general, not directly comparable. If the production function is homogeneous, then one measure is a constant multiple of the other, and they are the same if the degree of homogeneity is one. In the more general case, they are related in the following manner: $f(X) \cdot \Lambda = f(X\Theta)$. Although we consider technologies with a single output, IO and OO inefficiency can be discussed in the context of multiple output technologies as well.
Figure 1: IO and OO technical inefficiency (the observed production plan A lies below the frontier $Y = f(X)$; B and C are its vertical and horizontal projections onto the frontier).
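As a quick check of the homogeneity relationship stated above, consider a Cobb-Douglas technology; this is an illustrative special case added here, not one used elsewhere in the paper:

$$f(X) = A\prod_{j=1}^{J} X_j^{\alpha_j}, \qquad r = \sum_{j=1}^{J}\alpha_j \;\Longrightarrow\; f(X_i\cdot\Theta_i) = \Theta_i^{\,r} f(X_i),$$

so matching (2.1) with (2.2) gives $\Lambda_i = \Theta_i^{\,r}$, independent of $X_i$ and $Y_i$; for $r = 1$ the two efficiency measures coincide.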
2.2. Economic Implications of the IO and OO Models

Here we ask two questions. First, does it matter whether one uses the IO or the OO representation so far as estimation of the technology is concerned? That is, are features of the estimated technology, such as elasticities, returns to scale, and so forth, invariant to the choice of efficiency orientation? Second, are efficiency rankings of firms invariant to the choice of efficiency orientation? That is, does one get the same efficiency measures (converted in terms of either output loss or increase in costs) in both cases? It is not possible to provide general theoretical answers to these questions. These are clearly empirical issues, so it is necessary to engage in applied research to get a feel for the similarities and differences of the two approaches. Answers to these questions depend on the form of the production technology. If it is homogeneous, then there is no difference between these two models econometrically. This is because for a homogeneous function $r \ln \Theta_i = \ln \Lambda_i$, where $r$ is the degree of homogeneity. Thus, rankings of firms with respect to $\ln \Lambda_i$ and $\ln \Theta_i$ will be exactly the same, one being a constant multiple of the other. Moreover, since $f(X) \cdot \Lambda = f(X)\Theta^r$, the input elasticities as well as returns to scale measures based on these two specifications of the technology will be the same.3 This is, however, not the case if the technology is nonhomogeneous. In the OO model the elasticities and returns to scale are independent of technical inefficiency because technical efficiency (assumed to be independent of inputs) enters multiplicatively into the production function. This is not true for the IO model, where technical inefficiency enters multiplicatively with the inputs. This will be shown explicitly later for a nonhomogeneous translog production function.
2.3. Econometric Modeling and Efficiency Measurement

Using lower case letters to indicate the log of a variable, and assuming that $f(\cdot)$ has a translog form, the IO model can be expressed as

$$y_i = \beta_0 + \left(x_i - \theta_i 1_J\right)'\beta + \tfrac{1}{2}\left(x_i - \theta_i 1_J\right)'\Gamma\left(x_i - \theta_i 1_J\right) + \beta_T T_i + \tfrac{1}{2}\beta_{TT} T_i^2 + T_i\left(x_i - \theta_i 1_J\right)'\varphi + v_i, \quad i = 1, \ldots, n, \tag{2.3}$$
where $y_i$ is the log of output, $1_J$ denotes the $J \times 1$ vector of ones, $x_i$ is the $J \times 1$ vector of inputs in log terms, $T_i$ is the trend/shift variable, $\beta_0$, $\beta_T$, and $\beta_{TT}$ are scalar parameters, $\beta$ and $\varphi$ are $J \times 1$ parameter vectors, $\Gamma$ is a $J \times J$ symmetric matrix of parameters, and $v_i$ is the noise term. To make $\theta$ nonnegative we defined it as $\theta = -\ln\Theta$. We rewrite the IO model above as

$$y_i = \beta_0 + x_i'\beta + \tfrac{1}{2}x_i'\Gamma x_i + \beta_T T_i + \tfrac{1}{2}\beta_{TT} T_i^2 + x_i'\varphi T_i - g(\theta_i, x_i) + v_i, \quad i = 1, \ldots, n, \tag{2.4}$$

where $g(\theta_i, x_i) = -\left(\tfrac{1}{2}\theta_i^2\Psi - \theta_i\Xi_i\right)$, $\Psi = 1_J'\Gamma 1_J$, and $\Xi_i = 1_J'\left(\beta + \Gamma x_i + \varphi T_i\right)$, $i = 1, \ldots, n$. Note that if the production function is homogeneous of degree $r$, then $\Gamma 1_J = 0$, $1_J'\beta = r$, and $1_J'\varphi = 0$. In such a case the $g(\theta_i, x_i)$ function becomes a constant multiple of $\theta$, namely, $\tfrac{1}{2}\theta_i^2\Psi - \theta_i\Xi_i = -r\theta_i$, and consequently the IO model cannot be distinguished from the OO model. The $g(\theta_i, x_i)$ function shows the percent by which output is lost due to technical inefficiency. For a well-behaved production function $g(\theta_i, x_i) \ge 0$ for each $i$. The OO model, on the other hand, takes a much simpler form, namely,

$$y_i = \beta_0 + x_i'\beta + \tfrac{1}{2}x_i'\Gamma x_i + \beta_T T_i + \tfrac{1}{2}\beta_{TT} T_i^2 + x_i'\varphi T_i - \lambda_i + v_i, \quad i = 1, \ldots, n, \tag{2.5}$$

where we defined $\lambda = -\ln\Lambda$ to make it nonnegative.4 The OO model in this form is the one introduced by Aigner et al. [2] and Meeusen and van den Broeck [3], and since then it has been used extensively in the efficiency literature. Here we follow the framework used in Kumbhakar and Tsionas [1] when $\theta$ is random.5 We write (2.4) more compactly as

$$y_i = z_i'\alpha + \tfrac{1}{2}\theta_i^2\Psi - \theta_i\Xi_i + v_i, \quad i = 1, \ldots, n. \tag{2.6}$$
Both $\Psi$ and $\Xi_i$ are functions of the original parameters, and $\Xi_i$ also depends on the data $x_i$ and $T_i$. Under the assumption that $v_i \sim N(0, \sigma^2)$ and $\theta_i$ is distributed independently of $v_i$ with density function $f(\theta_i; \omega)$, where $\omega$ is a parameter, the probability density function of $y_i$ can be expressed as

$$f(y_i; \mu) = \left(2\pi\sigma^2\right)^{-1/2}\int_0^{\infty}\exp\!\left[-\frac{\left(y_i - z_i'\alpha - \tfrac{1}{2}\theta_i^2\Psi + \theta_i\Xi_i\right)^2}{2\sigma^2}\right]f(\theta_i; \omega)\,d\theta_i, \quad i = 1, \ldots, n, \tag{2.7}$$

where $\mu$ denotes the entire parameter vector. We consider a half-normal and an exponential specification for the density $f(\theta_i; \omega)$, namely,

$$f(\theta_i; \omega) = \left(\frac{\pi\omega^2}{2}\right)^{-1/2}\exp\!\left(-\frac{\theta_i^2}{2\omega^2}\right), \quad \theta_i \ge 0, \qquad f(\theta_i; \omega) = \omega\exp(-\omega\theta_i), \quad \theta_i \ge 0. \tag{2.8}$$
The likelihood function of the model is then

$$l(\mu; y, X) = \prod_{i=1}^{n} f(y_i; \mu), \tag{2.9}$$

where $f(y_i; \mu)$ has been defined above. Since the integral defining $f(y_i; \mu)$ is not available in closed form, we cannot find an analytical expression for the likelihood function. However, we can approximate the integrals using simulation as follows. Suppose $\theta_{i,s}$, $s = 1, \ldots, S$, is a random sample from $f(\theta_i; \omega)$. Then it is clear that

$$f(y_i; \mu) \approx \tilde f(y_i; \mu) \equiv S^{-1}\sum_{s=1}^{S}\exp\!\left[-\frac{\left(y_i - z_i'\alpha - \tfrac{1}{2}\theta_{i,s}^2\Psi + \theta_{i,s}\Xi_i\right)^2}{2\sigma^2}\right], \tag{2.10}$$

and an approximation of the log-likelihood function is given by

$$\log l \approx \sum_{i=1}^{n}\log \tilde f(y_i; \mu), \tag{2.11}$$
which can be maximized by numerical optimization procedures to obtain the ML estimator. For the distributions we adopted, random number generation is trivial, so implementing the SML estimator is straightforward.6 Inefficiency estimation is accomplished by considering the distribution of $\theta_i$ conditional on the data and the estimated parameters,

$$f(\theta_i \mid \tilde\mu, D_i) \propto \exp\!\left[-\frac{\left(y_i - z_i'\tilde\alpha - \tfrac{1}{2}\theta_i^2\tilde\Psi + \theta_i\tilde\Xi_i\right)^2}{2\tilde\sigma^2}\right]f(\theta_i; \tilde\omega), \quad i = 1, \ldots, n, \tag{2.12}$$
where a tilde denotes the ML estimate and $D_i = (x_i, T_i)$ denotes the data. For example, when $f(\theta_i; \omega)$ is half-normal we get

$$f(\theta_i \mid \tilde\mu, y, X) \propto \exp\!\left[-\frac{\left(y_i - z_i'\tilde\alpha - \tfrac{1}{2}\theta_i^2\tilde\Psi + \theta_i\tilde\Xi_i\right)^2}{2\tilde\sigma^2} - \frac{\theta_i^2}{2\tilde\omega^2}\right], \quad \theta_i \ge 0,\ i = 1, \ldots, n. \tag{2.13}$$
This is not a known density, and even the normalizing constant cannot be obtained in closed form. However, the first two moments and the normalizing constant can be obtained by numerical integration, for example, using Simpson's rule. To make inferences on efficiency, define efficiency as $r_i = \exp(-\theta_i)$ and obtain the distribution of $r_i$ and its moments by changing the variable from $\theta_i$ to $r_i$. This yields

$$f_r(r_i \mid \tilde\mu, D_i) = r_i^{-1} f(-\ln r_i \mid \tilde\mu, y, X), \quad 0 < r_i \le 1,\ i = 1, \ldots, n. \tag{2.14}$$
The likelihood function for the OO model is given in Aigner et al. [2], hereafter ALS.7 The maximum likelihood method for estimating the parameters of the production function in the OO model is straightforward and has been used extensively in the literature starting from ALS.8 Once the parameters are estimated, technical inefficiency $\lambda$ is estimated from $E(\lambda \mid v - \lambda)$—the Jondrow et al. [4] formula. Alternatively, one can estimate technical efficiency from $E(e^{-\lambda} \mid v - \lambda)$ using the Battese and Coelli [5] formula. For an application of this approach see Kumbhakar and Tsionas [1].
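A minimal sketch of how the simulated likelihood in (2.10)–(2.11) might be coded is given below (Python). It is illustrative only, not the authors' code: the arrays y, z, Xi and the scalar Psi are placeholders, $\Xi_i$ is treated as given rather than built from the translog parameters, and the half-normal draws for $\theta$ are rescaled inside the objective.

```python
import numpy as np
from scipy.optimize import minimize

def neg_sim_loglik(params, y, z, Xi, S=200, seed=123):
    """Negative simulated log-likelihood of the IO frontier, cf. (2.10)-(2.11)."""
    k = z.shape[1]
    alpha, log_sigma, log_omega, Psi = params[:k], params[k], params[k + 1], params[k + 2]
    sigma, omega = np.exp(log_sigma), np.exp(log_omega)
    rng = np.random.default_rng(seed)                       # fixed draws keep the objective smooth
    theta = np.abs(rng.standard_normal((len(y), S))) * omega  # half-normal draws, scale omega
    resid = (y - z @ alpha)[:, None] - 0.5 * theta**2 * Psi + theta * Xi[:, None]
    dens = np.exp(-resid**2 / (2.0 * sigma**2)) / np.sqrt(2.0 * np.pi * sigma**2)
    return -np.sum(np.log(dens.mean(axis=1) + 1e-300))

# usage with hypothetical data arrays y, z, Xi and a starting vector 'start':
# res = minimize(neg_sim_loglik, start, args=(y, z, Xi), method="BFGS")
```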
2.4. Looking Through the Dual Cost Functions

2.4.1. The IO Approach

We now examine the IO and OO models when behavioral assumptions are explicitly introduced. First, we examine the models when producers minimize cost to produce the given level of output. The objective of a producer is to

$$\min_X\; w'X \quad \text{subject to} \quad Y = f(X\cdot\Theta), \tag{2.15}$$

from which conditional input demand functions can be derived. The corresponding cost function can then be expressed as

$$w'X = C^a = \frac{C(w, Y)}{\Theta}, \tag{2.16}$$

where $C(w, Y)$ is the minimum cost function (cost frontier) and $C^a$ is the actual cost. Finally, one can use Shephard's lemma to obtain $X_j^a = X_j^*(w, Y)/\Theta \ge X_j^*(w, Y)$ for all $j$, where the superscripts $a$ and $*$ indicate actual and cost-minimizing levels of input $X_j$. Thus, the IO model implies (i) a neutral shift in the cost function, which in turn implies that RTS and input elasticities are unchanged due to technical inefficiency, and (ii) an equiproportional increase, at the rate given by $\theta$, in the use of all inputs due to technical inefficiency, irrespective of the output level and input prices. To summarize, result (i) is just the opposite of what we obtained in the primal case (see [6]). Result (ii) states that when inefficiency is reduced firms will move horizontally to the frontier, as expected in the IO model.
2.4.2. The OO Model

Here the objective function is written as

$$\min_X\; w'X \quad \text{subject to} \quad Y = f(X)\cdot\Lambda, \tag{2.17}$$

from which conditional input demand functions can be derived. The corresponding cost function can then be expressed as

$$w'X = C^a = C\!\left(w, \frac{Y}{\Lambda}\right) \equiv C(w, Y)\cdot q(w, Y, \Lambda), \tag{2.18}$$

where, as before, $C(w, Y)$ is the minimum cost function (cost frontier) and $C^a$ is the actual cost. Finally, $q(\cdot) = C(w, Y/\Lambda)/C(w, Y) \ge 1$. One can then use Shephard's lemma to obtain

$$X_j^a = X_j^*(w, Y)\left[q(\cdot) + \frac{C(w, Y)}{X_j^*}\frac{\partial q(\cdot)}{\partial w_j}\right] \ge X_j^*(w, Y) \quad \forall j, \tag{2.19}$$

where the last inequality will hold if the cost function is well behaved. Note that $X_j^a \ne X_j^*(w, Y)$ for all $j$ unless $q(\cdot)$ is a constant.

Thus, the results from the OO model are just the opposite of those from the IO model. Here (i) inefficiency shifts the cost function nonneutrally, meaning that $q(\cdot)$ depends on output and input prices as well as $\Lambda$; (ii) increases in input use are not equiproportional (they depend on output and input prices); (iii) the cost shares are not independent of technical inefficiency; (iv) the model is harder to estimate (similar to the IO model in the primal case).9 More importantly, the result in (i) is just the opposite of what we reported in the primal case. Result (ii) is not what the OO model predicts (increase in output when inefficiency is eliminated). Since output is exogenously given in a cost-minimizing framework, input use has to be reduced when inefficiency is eliminated. The results from the dual cost function models are just the opposite of what the primal models predict. Since the estimated technologies using cost functions are different in the IO and OO models, as in the primal case, we do not repeat the results based on the production/distance functions here.
2.5. Looking Through the Dual Profit Functions

2.5.1. The IO Model

Here we assume that the objective of a producer is to

$$\max\; \pi = p\cdot Y - w'X \equiv p\cdot Y - \left(\frac{w}{\Theta}\right)'(X\cdot\Theta) \quad \text{subject to} \quad Y = f(X\cdot\Theta), \tag{2.20}$$

from which unconditional input demand and supply functions can be derived. Since the above problem reduces to a standard neoclassical profit-maximizing problem when $X$ is replaced by $X\cdot\Theta$ and $w$ is replaced by $w/\Theta$, the corresponding profit function can be expressed as

$$\pi^a = p\cdot Y - \left(\frac{w}{\Theta}\right)'(X\cdot\Theta) = \pi\!\left(\frac{w}{\Theta}, p\right) \equiv \pi(w, p)\cdot h(w, p, \Theta) \le \pi(w, p), \tag{2.21}$$

where $\pi^a$ is actual profit, $\pi(w, p)$ is the profit frontier (homogeneous of degree one in $w$ and $p$), and $h(w, p, \Theta) = \pi(w/\Theta, p)/\pi(w, p) \le 1$ is profit inefficiency. Note that the $h(w, p, \Theta)$
function depends on $w$, $p$, and $\Theta$ in general. Application of Hotelling's lemma yields the following expressions for the output supply and input demand functions:

$$Y^a = Y^*(w, p)\left[h(\cdot) + \frac{\pi(w, p)}{Y^*}\frac{\partial h(\cdot)}{\partial p}\right] \le Y^*(w, p), \qquad X_j^a = X_j^*(w, p)\left[h(\cdot) - \frac{\pi(w, p)}{X_j^*}\frac{\partial h(\cdot)}{\partial w_j}\right] \le X_j^*(w, p) \quad \forall j, \tag{2.22}$$

where the superscripts $a$ and $*$ indicate actual and optimum levels of output $Y$ and inputs $X_j$. The last inequality in the above equations will hold if the underlying production technology is well behaved.
2.5.2. The OO Model

Here the objective function can be written as

$$\max\; \pi = p\cdot Y - w'X \equiv (p\cdot\Lambda)\cdot\frac{Y}{\Lambda} - w'X \quad \text{subject to} \quad Y = f(X)\cdot\Lambda, \tag{2.23}$$

which can be viewed as a standard neoclassical profit-maximizing problem when $Y$ is replaced by $Y/\Lambda$ and $p$ is replaced by $p\cdot\Lambda$; the corresponding profit function can be expressed as

$$\pi^a = (p\cdot\Lambda)\cdot\frac{Y}{\Lambda} - w'X = \pi(w, p\cdot\Lambda) \equiv \pi(w, p)\cdot g(w, p, \Lambda) \le \pi(w, p), \tag{2.24}$$

where $g(w, p, \Lambda) = \pi(w, p\cdot\Lambda)/\pi(w, p) \le 1$. Similar to the IO model, using Hotelling's lemma we get

$$Y^a = Y^*(w, p)\left[g(\cdot) + \frac{\pi(w, p)}{Y^*}\frac{\partial g(\cdot)}{\partial p}\right] \le Y^*(w, p), \qquad X_j^a = X_j^*(w, p)\left[g(\cdot) - \frac{\pi(w, p)}{X_j^*}\frac{\partial g(\cdot)}{\partial w_j}\right] \le X_j^*(w, p) \quad \forall j. \tag{2.25}$$

The last inequality in the above equations will hold if the underlying production technology is well behaved. To summarize, (i) the shift in the profit function is non-neutral for both the IO and OO models; therefore, estimated elasticities, RTS, and so on, are affected by the presence of technical inefficiency, no matter which form is used. (ii) Technical inefficiency leads to a decrease in the production of output and decreases in input use in both models; however, the predicted reduction in input use and production of output is not the same under the two models. Even under profit maximization, which recognizes endogeneity of both inputs and outputs, it matters which model is used to represent the technology! These results are
different from those obtained under the primal models and from the cost minimization framework. Thus, it matters both theoretically and empirically whether one uses an input- or output-oriented measure of technical inefficiency.
3. Latent Class Models

3.1. Modeling Technological Heterogeneity

In modeling production technology we almost always assume that all the producers use the same technology. In other words, we do not allow the possibility that there might be more than one technology being used by the producers in the sample. Furthermore, the analyst may not know who is using which technology. Recently, a few studies have combined the stochastic frontier approach with the latent class structure in order to estimate a mixture of several technologies (frontier functions). Greene [7, 8] proposes a maximum likelihood estimator for a latent class stochastic frontier with more than two classes. Caudill [9] introduces an expectation-maximization (EM) algorithm to estimate a mixture of two stochastic cost frontiers with two classes.10 Orea and Kumbhakar [10] estimated a four-class stochastic frontier cost function (translog) with time-varying technical inefficiency. Following the notation of Greene [7, 8], we specify the technology for class $j$ as

$$\ln y_i = \ln f(x_i, z_i, \beta_j \mid j) + v_i \mid j - u_i \mid j, \tag{3.1}$$

where $u_i\mid j$ is a nonnegative random term added to the production function to accommodate technical inefficiency. We assume that the noise term for class $j$ follows a normal distribution with mean zero and constant variance, $\sigma_{vj}^2$. The inefficiency term $u_i\mid j$ is modeled as a half-normal random variable, following standard practice in the frontier literature, namely,

$$u_i \mid j \sim N\!\left(0, \omega_j^2\right), \quad u_i \mid j \ge 0. \tag{3.2}$$

That is, a half-normal distribution with scale parameter $\omega_j$ for each class. With these distributional assumptions, the likelihood for firm $i$, if it belongs to class $j$, can be written as [11]

$$l(i \mid j) = \frac{2}{\sigma_j}\,\phi\!\left(\frac{\varepsilon_{i\mid j}}{\sigma_j}\right)\Phi\!\left(-\frac{\lambda_j\,\varepsilon_{i\mid j}}{\sigma_j}\right), \tag{3.3}$$

where $\sigma_j^2 = \omega_j^2 + \sigma_{vj}^2$, $\lambda_j = \omega_j/\sigma_{vj}$, and $\varepsilon_{i\mid j} = \ln y_i - \ln f(x_i, z_i, \beta_j \mid j)$. Finally, $\phi(\cdot)$ and $\Phi(\cdot)$ are the pdf and cdf of a standard normal variable. The unconditional likelihood for firm $i$ is obtained as the weighted sum of its $j$-class likelihood functions, where the weights are the prior probabilities of class membership. That is,

$$l_i = \sum_{j=1}^{J} l(i \mid j)\cdot\pi_{ij}, \qquad 0 \le \pi_{ij} \le 1, \qquad \sum_{j}\pi_{ij} = 1, \tag{3.4}$$
where the class probabilities can be parameterized by, for example, a logistic function. Finally, the log-likelihood function is

$$\ln L = \sum_{i=1}^{n}\ln l_i = \sum_{i=1}^{n}\ln\left\{\sum_{j=1}^{J} l(i \mid j)\cdot\pi_{ij}\right\}. \tag{3.5}$$

The estimated parameters can be used to compute the conditional posterior class probabilities. Using Bayes' theorem (see Greene [7, 8] and Orea and Kumbhakar [10]), the posterior class probabilities can be obtained from

$$P(j \mid i) = \frac{l(i \mid j)\cdot\pi_{ij}}{\sum_{j=1}^{J} l(i \mid j)\cdot\pi_{ij}}. \tag{3.6}$$
This expression shows that the posterior class probabilities depend not only on the estimated parameters in $\pi_{ij}$, but also on the parameters of the production frontier and the data. This means that a latent class model classifies the sample into several groups even when the $\pi_{ij}$ are fixed parameters independent of $i$. In the standard stochastic frontier approach, where the frontier function is the same for every firm, we estimate inefficiency relative to that frontier for all observations, namely, inefficiency from $E(u_i \mid \varepsilon_i)$ and efficiency from $E(\exp(-u_i) \mid \varepsilon_i)$. In the present case, we estimate as many frontiers as the number of classes. So the question is how to measure the efficiency level of an individual firm when there is no unique technology against which inefficiency is to be computed. This is solved by using the following method:

$$\ln EF_i = \sum_{j=1}^{J} P(j \mid i)\cdot\ln EF_i(j), \tag{3.7}$$
where $P(j \mid i)$ is the posterior probability of being in the $j$th class for a given firm $i$, defined in (3.6), and $EF_i(j)$ is its efficiency using the technology of class $j$ as the reference technology. Note that here we do not have a single reference technology; the measure takes into account technologies from every class. The efficiency results obtained by using (3.7) would be different from those based on choosing the most likely frontier and using it as the reference technology. The magnitude of the difference depends on the relative importance of the posterior probability of the most likely cost frontier: the higher the posterior probability, the smaller the differences. For an application see Orea and Kumbhakar [10].
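Equations (3.6)–(3.7) translate directly into a few lines of code. The sketch below (Python; the array names are placeholders, not quantities from the paper) computes the posterior class probabilities and the class-weighted efficiency measure:

```python
import numpy as np

def lcm_posteriors_and_efficiency(class_lik, prior_prob, class_eff):
    """Posterior class probabilities (3.6) and class-weighted efficiency (3.7).

    class_lik  : (n, J) array of likelihood contributions l(i|j)
    prior_prob : (n, J) array of prior class probabilities pi_ij (rows sum to 1)
    class_eff  : (n, J) array of efficiency scores EF_i(j), one per reference class
    """
    weighted = class_lik * prior_prob
    post = weighted / weighted.sum(axis=1, keepdims=True)   # P(j|i)
    log_eff = (post * np.log(class_eff)).sum(axis=1)        # ln EF_i
    return post, np.exp(log_eff)
```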
3.2. Modeling Directional Heterogeneity

In Section 2.3 we talked about estimating IO technical inefficiency. In practice most researchers use the OO model because it is easy to estimate. Now we address the question of choosing one over the other. Orea et al. [12] used a model selection test procedure to determine whether the data support the IO, OO, or the hyperbolic model. Based on such a test result, one may decide to use the direction that fits the data best. This implicitly assumes that all producers in the sample behave in the same way. In reality, firms in a particular industry, although using the same technology, may choose different directions to move to the frontier. For example, some producers might find it costly to adjust input levels to attain
the production frontier, while for others it might be easier to do so. This means that some producers will choose to shrink their inputs while others will augment the output level. In such a case imposing one direction on all sample observations is not efficient. The other practical problem is that no one knows in advance which producers follow which direction. Thus, we cannot estimate the IO model for one group and the OO model for another. The advantage of the LCM is that it is not necessary to impose an a priori criterion to identify which producers are in which class. Moreover, we can formally examine whether some exogenous factors are responsible for choosing the input or the output direction by making the probabilities functions of exogenous variables. Furthermore, when panel data are available, we do not need to assume that producers follow one direction at all times, so we can accommodate switching behaviour and determine when they go in the input (output) direction.
3.2.1. The Input-Oriented Model

Under the assumption that $v_i \sim N(0, \sigma^2)$ and $\theta_i$ is distributed independently of $v_i$ according to a distribution with density $f_\theta(\theta_i; \omega)$, where $\omega$ is a parameter, the distribution of $y_i$ has density

$$f_{IO}(y_i \mid z_i, \Delta) = \left(2\pi\sigma^2\right)^{-1/2}\int_0^{\infty}\exp\!\left[-\frac{\left(y_i - z_i'\alpha - \tfrac{1}{2}\theta_i^2\Psi + \theta_i\Xi_i\right)^2}{2\sigma^2}\right]f_\theta(\theta_i; \omega)\,d\theta_i, \quad i = 1, \ldots, n, \tag{3.8}$$

where $\Delta$ denotes the entire parameter vector. We use a half-normal specification for $\theta$, namely,

$$f_\theta(\theta_i; \omega) = \left(\frac{\pi\omega^2}{2}\right)^{-1/2}\exp\!\left(-\frac{\theta_i^2}{2\omega^2}\right), \quad \theta_i \ge 0. \tag{3.9}$$

The likelihood function of the IO model is

$$L_{IO}(\Delta; y, X) = \prod_{i=1}^{n} f_{IO}(y_i \mid z_i, \Delta), \tag{3.10}$$
where $f_{IO}(y_i \mid z_i, \Delta)$ has been defined in (3.8). Since the integral defining $f_{IO}(y_i \mid z_i, \mu)$ in (3.8) is not available in closed form, we cannot find an analytical expression for the likelihood function. However, we can approximate the integrals using Monte Carlo simulation as follows. Suppose $\theta_{i,s}$, $s = 1, \ldots, S$, is a random sample from $f_\theta(\theta_i; \omega)$. Then it is clear that

$$f_{IO}(y_i \mid z_i, \mu) \approx \tilde f_{IO}(y_i \mid z_i, \mu) \equiv \left(2\pi\sigma^2\right)^{-1/2}\left(\frac{\pi\omega^2}{2}\right)^{-1/2} \times S^{-1}\sum_{s=1}^{S}\exp\!\left[-\frac{\left(y_i - z_i'\alpha - \tfrac{1}{2}\theta_{i,s}^2\Psi + \theta_{i,s}\Xi_i\right)^2}{2\sigma^2} - \frac{\theta_{i,s}^2}{2\omega^2}\right], \tag{3.11}$$
and an approximation of the log-likelihood function is given by

$$\log l_{IO} \approx \sum_{i=1}^{n}\log \tilde f_{IO}(y_i \mid z_i, \mu), \tag{3.12}$$
which can be maximized by numerical optimization procedures to obtain the ML estimator. To perform SML estimation, we consider the integral in (3.8). We can transform the range of integration to $(0, 1)$ by using the transformation $r_i = \exp(-\theta_i)$, which has a natural interpretation as IO technical efficiency. Then (3.8) becomes

$$f_{IO}(y_i \mid z_i, \mu) = \left(2\pi\sigma^2\right)^{-1/2}\int_0^{1}\exp\!\left[-\frac{\left(y_i - z_i'\alpha - \tfrac{1}{2}(\ln r_i)^2\Psi - \ln r_i\,\Xi_i\right)^2}{2\sigma^2}\right] f_\theta(-\ln r_i; \omega)\, r_i^{-1}\,dr_i. \tag{3.13}$$

Suppose $r_{i,s}$ is a set of standard uniform random numbers, for $s = 1, \ldots, S$. Then the integral can be approximated using the Monte Carlo estimator

$$\tilde f_{IO}(y_i \mid z_i, \mu) = \left(2\pi\sigma^2\right)^{-1/2}\left(\frac{\pi\omega^2}{2}\right)^{-1/2} G_i(\mu), \tag{3.14}$$

where $G_i(\mu) = S^{-1}\sum_{s=1}^{S}\exp\!\left[-\left(y_i - z_i'\alpha - \tfrac{1}{2}(\ln r_{i,s})^2\Psi - \ln r_{i,s}\,\Xi_i\right)^2/(2\sigma^2) - (\ln r_{i,s})^2/(2\omega^2) - \ln r_{i,s}\right]$. The standard uniform random numbers and their log transformation can be saved in an $n \times S$ matrix before maximum likelihood estimation and reused to ensure that the likelihood function is a differentiable function of the parameters. An alternative is to maintain the same random number seed and redraw these numbers for each call to the likelihood function. This option increases computing time but implies considerable savings in terms of memory. An alternative to the use of pseudorandom numbers is to use the Halton sequence to produce quasi-random numbers that fill the interval $(0, 1)$. The Halton sequence has been used in econometrics by Train [13] for the multinomial probit model, and Greene [14] to implement SML estimation of the normal-gamma stochastic frontier model.
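The reuse of stored uniform (or Halton) draws described above can be set up as follows. This is an illustrative sketch (Python, assuming SciPy's quasi-Monte Carlo module is available; the parameter layout and data arrays are placeholders), not the authors' code:

```python
import numpy as np
from scipy.stats import qmc

def G_i(mu, y, z, Xi, log_r):
    """Monte Carlo estimator G_i(mu) of (3.14) from pre-stored log-uniform draws."""
    alpha, sigma, omega, Psi = mu                        # simplified, illustrative layout
    resid = (y - z @ alpha)[:, None] - 0.5 * log_r**2 * Psi - log_r * Xi[:, None]
    return np.mean(np.exp(-resid**2 / (2 * sigma**2)
                          - log_r**2 / (2 * omega**2) - log_r), axis=1)

n, S = 500, 100
halton = qmc.Halton(d=1, scramble=True).random(n * S).reshape(n, S)  # quasi-random U(0,1)
log_r = np.log(halton)          # stored once and reused in every call to G_i during optimization
```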
3.2.2. The Output-Oriented Model

Estimation of the OO model is easy since the likelihood function is available analytically. The model is

$$y_i = z_i'\alpha + v_i - \lambda_i, \quad i = 1, \ldots, n. \tag{3.15}$$

We make the standard assumptions that $v_i \sim N(0, \sigma_v^2)$, $\lambda_i \sim N^+(0, \sigma_\lambda^2)$, and both are mutually independent as well as independent of $z_i$. The density of $y_i$ is [11, page 75]

$$f_{OO}(y_i \mid z_i, \mu) = \frac{2}{\rho}\,\phi_N\!\left(\frac{e_i}{\rho}\right)\Phi_N\!\left(-\frac{e_i\tau}{\rho}\right), \tag{3.16}$$
where $e_i = y_i - z_i'\alpha$, $\rho^2 = \sigma_v^2 + \sigma_\lambda^2$, $\tau = \sigma_\lambda/\sigma_v$, and $\phi_N$ and $\Phi_N$ denote the standard normal pdf and cdf, respectively. The log-likelihood function of the model is

$$\ln l_{OO}(\mu; y, Z) = n\ln\frac{2}{\rho} - \frac{n}{2}\ln(2\pi) + \sum_{i=1}^{n}\ln\Phi_N\!\left(-\frac{e_i\tau}{\rho}\right) - \frac{1}{2\rho^2}\sum_{i=1}^{n} e_i^2. \tag{3.17}$$
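The OO log-likelihood (3.16)–(3.17) can be coded directly. A minimal sketch in Python (the parameter layout is an assumption for illustration, not part of the paper):

```python
import numpy as np
from scipy.stats import norm

def loglik_oo(params, y, Z):
    """Normal-half-normal OO log-likelihood, cf. (3.16)-(3.17)."""
    k = Z.shape[1]
    alpha = params[:k]
    sigma_v, sigma_lam = np.exp(params[k]), np.exp(params[k + 1])   # enforce positivity
    rho = np.sqrt(sigma_v**2 + sigma_lam**2)
    tau = sigma_lam / sigma_v
    e = y - Z @ alpha
    return np.sum(np.log(2.0 / rho) + norm.logpdf(e / rho) + norm.logcdf(-e * tau / rho))
```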
3.3. The Finite Mixture (Latent Class) Model

The IO and OO models can be embedded in a general model that allows model choice for each observation in the absence of sample separation information. Specifically, we assume that each observation $y_i$ is associated with the OO class with probability $p$ and with the IO class with probability $1 - p$. To be more precise, we have the model

$$y_i = z_i'\alpha - \lambda_i + v_i, \quad i = 1, \ldots, n, \tag{3.18}$$

with probability $p$, and the model

$$y_i = z_i'\alpha + \tfrac{1}{2}\theta_i^2\Psi - \theta_i\Xi_i + v_i, \quad i = 1, \ldots, n, \tag{3.19}$$

with probability $1 - p$, where the stochastic elements obey the assumptions that we stated previously in connection with the OO and IO models. Notice that the technical parameters, $\alpha$, are the same in the two classes. Denote the parameter vector by $\psi = (\alpha', \sigma^2, \omega^2, \sigma_v^2, \sigma_\lambda^2, p)'$. The density of $y_i$ will be

$$f_{LCM}(y_i \mid z_i, \psi) = p\cdot f_{OO}(y_i \mid z_i, O) + (1 - p)f_{IO}(y_i \mid z_i, \Delta), \quad i = 1, \ldots, n, \tag{3.20}$$

where $O = (\alpha', \sigma_v^2, \sigma_\lambda^2)'$ and $\Delta = (\alpha', \sigma^2, \omega^2)'$ are subsets of $\psi$. The log-likelihood function of the model is

$$\log l_{LCM}(\psi; y, Z) = \sum_{i=1}^{n}\ln f_{LCM}(y_i \mid z_i, \psi) = \sum_{i=1}^{n}\ln\left[p\cdot f_{OO}(y_i \mid z_i, O) + (1 - p)f_{IO}(y_i \mid z_i, \Delta)\right]. \tag{3.21}$$

The log-likelihood function depends on the IO density $f_{IO}(y_i \mid z_i, \Delta)$, which is not available in closed form but can be obtained with the aid of simulation, using the principles presented previously, to obtain

$$\log \tilde l_{LCM}(\psi; y, Z) = \sum_{i=1}^{n}\ln \tilde f_{LCM}(y_i \mid z_i, \psi) = \sum_{i=1}^{n}\ln\left[p\,f_{OO}(y_i \mid z_i, O) + (1 - p)\tilde f_{IO}(y_i \mid z_i, \Delta)\right], \tag{3.22}$$

where $\tilde f_{IO}(y_i \mid z_i, \Delta)$ has been defined in (3.14) and $f_{OO}(y_i \mid z_i, O)$ in (3.16). This log-likelihood function can be maximized using standard techniques to obtain the SML estimates of the LCM.
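For completeness, a sketch of how (3.20)–(3.22) combine the two densities (Python; f_oo and f_io_sim are hypothetical user-supplied functions returning the OO density and the simulated IO density per observation, and the parameter dictionary is an illustrative layout):

```python
import numpy as np

def loglik_lcm(psi, y, Z, f_oo, f_io_sim):
    """Finite-mixture (LCM) log-likelihood (3.22): p*f_OO + (1-p)*f_IO."""
    p = 1.0 / (1.0 + np.exp(-psi["logit_p"]))      # keep the mixing weight in (0, 1)
    dens_oo = f_oo(psi["oo"], y, Z)                 # vector of f_OO(y_i | z_i, O)
    dens_io = f_io_sim(psi["io"], y, Z)             # vector of simulated f_IO(y_i | z_i, Delta)
    mix = p * dens_oo + (1.0 - p) * dens_io
    return np.sum(np.log(mix))
```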
3.3.1. Technical Efficiency Estimation in the Latent Class Model

A natural output-based efficiency measure derived from the LCM is

$$TE_i^{LCM} = P_i\, TE_i^{OO} + (1 - P_i)\, TE_i^{IO}, \tag{3.23}$$

where

$$P_i = \frac{p\, f_{OO}(y_i \mid z_i, O)}{p\, f_{OO}(y_i \mid z_i, O) + (1 - p) f_{IO}(y_i \mid z_i, \Delta)}, \quad i = 1, \ldots, n, \tag{3.24}$$

is the posterior probability that the $i$th observation came from the OO class. These posterior probabilities are of independent interest since they can be used to provide inferences on whether a firm came from the OO or IO universe, depending on whether, for example, $P_i \ge 1/2$ or $P_i < 1/2$. This information can be important in deciding which type of adjustment cost (input- or output-related) is more important for a particular firm. From the IO component of the LCM we have the IO-related efficiency measure, say $IOTE_i^{LCM}$, and its standard deviation, say $STE_i^{LCM}$, that can be compared with $IOTE_i$ and $STE_i$ from the IO model. Similarly, we can compare $TE_i^{OO}$ and $TE_i^{IO}$ with the output efficiency of the IO model and/or the output efficiency of the OO component of the LCM.
3.4. Returns to Scale and Technical Change

Note that returns to scale, defined as $RTS = \sum_j \partial y/\partial x_j$, is not affected by the presence of technical inefficiency in the OO model. The same is true for input elasticities and elasticities of substitution, which are not explored here. This is because inefficiency in the OO model shifts the production function in a neutral fashion. On the contrary, the magnitude of technical inefficiency affects RTS in the IO model. Using the translog specification in (2.4), we get

$$RTS_i^{IO} = 1_J'\left(\beta + \Gamma x_i + \varphi T_i\right) - \theta_i\, 1_J'\Gamma 1_J, \tag{3.25}$$

whereas the formula for RTS in the OO model is

$$RTS_i^{OO} = 1_J'\left(\beta + \Gamma x_i + \varphi T_i\right). \tag{3.26}$$

We now focus on estimates of technical change from the IO and OO models. Again, TC in the IO model can be measured conditional on $\theta$ ($TC_{IO}$-I) or defined at the frontier ($TC_{IO}$-II), namely,

$$TC_{IO}\text{-I} = \beta_T + \beta_{TT} T_i + x_i'\varphi - \left(1_J'\varphi\right)\theta_i, \qquad TC_{IO}\text{-II} = \beta_T + \beta_{TT} T_i + x_i'\varphi. \tag{3.27}$$

These two formulas will give different results if technical change is nonneutral and/or the production function is nonhomogeneous (i.e., $1_J'\varphi \ne 0$). The formula for $TC_{OO}$ is the same as
$TC_{IO}$-II except for the fact that the estimated parameters in (3.27) are from the IO model, whereas the parameters used to compute $TC_{OO}$ are from the OO model. It should be noted that in the LCM we enforce the restriction that the technical parameters, $\alpha$, are the same in the IO and OO components of the mixture. This implies that RTS and TC will be the same in both components if we follow the first approach, but they will be different if we follow the second approach. In the second approach, a single measure of RTS and TC can be defined as the weighted average of both measures using the posterior probabilities, $P_i$, as weights. To be more precise, suppose $RTS_i^{IO,LCM}$ is the type II RTS measure derived from the IO component of the LCM, and $RTS_i^{OO,LCM}$ is the RTS measure derived from the OO component of the LCM. The overall LCM measure of RTS will be $RTS_i^{LCM} = P_i\, RTS_i^{OO,LCM} + (1 - P_i) RTS_i^{IO,LCM}$. A similar methodology is followed for the TC measure.
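To make the formulas concrete, the small sketch below (Python; the argument names are placeholders for estimated translog quantities) evaluates (3.25), (3.26), and the posterior-weighted LCM measure for a single observation:

```python
import numpy as np

def rts_measures(beta, Gamma, phi, x, T, theta, post_oo):
    """RTS under the OO model (3.26), the IO model (3.25), and the LCM-weighted average."""
    ones = np.ones(len(beta))
    rts_oo = ones @ (beta + Gamma @ x + phi * T)       # not affected by inefficiency
    rts_io = rts_oo - theta * (ones @ Gamma @ ones)    # extra term: -theta * 1'Gamma 1
    rts_lcm = post_oo * rts_oo + (1.0 - post_oo) * rts_io
    return rts_oo, rts_io, rts_lcm
```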
4. Relaxing Functional Form Assumptions (SF Model with LML)

In this section we introduce the LML methodology [15] for estimating SF models in such a way that many of the limitations of the SF models originally proposed by Aigner et al. [2], Meeusen and van den Broeck [3], and their extensions in the last two and a half decades are relaxed. Removal of all these deficiencies generalizes the SF models and makes them comparable to the DEA models. Moreover, we can apply standard econometric tools to perform estimation and draw inferences. To fix ideas, suppose we have a parametric model that specifies the density of an observed dependent variable $y_i$ conditional on a vector of observable covariates $x_i \in X \subseteq \mathbb{R}^k$ and a vector of unknown parameters $\theta \in \Theta \subseteq \mathbb{R}^m$, and let the density be $l(y_i; x_i, \theta)$. The parametric ML estimator is given by

$$\hat\theta = \arg\max_{\theta\in\Theta}\; \sum_{i=1}^{n}\ln l(y_i; x_i, \theta). \tag{4.1}$$

The problem with the parametric ML estimator is that it relies heavily on the parametric model, which can be incorrect if there is uncertainty regarding the functional form of the model, the density, and so forth. A natural way to convert the parametric model to a nonparametric one is to make the parameter $\theta$ a function of the covariates $x_i$. Within LML this is accomplished as follows. For an arbitrary $x \in X$, the LML estimator solves the problem

$$\hat\theta(x) = \arg\max_{\theta\in\Theta}\; \sum_{i=1}^{n}\ln l(y_i; x_i, \theta)\, K_H(x_i - x), \tag{4.2}$$

where $K_H$ is a kernel that depends on a matrix bandwidth $H$. The idea behind LML is to choose an anchoring parametric model and maximize a weighted log-likelihood function that places more weight on observations near $x$, rather than weighting each observation equally, as the parametric ML estimator would do.11 By solving the LML problem for several points $x \in X$, we can construct the function $\hat\theta(x)$ that is an estimator for $\theta(x)$, and effectively we have a fully general way to convert the parametric model to a nonparametric approximation to the unknown model. Suppose we have the following stochastic frontier cost model:

$$y_i = x_i'\beta + v_i + u_i, \quad v_i \sim N(0, \sigma^2),\ u_i \sim N(\mu, \omega^2),\ u_i \ge 0, \quad i = 1, \ldots, n,\ \beta \in \mathbb{R}^k, \tag{4.3}$$
where $y$ is log cost and $x_i$ is a vector of input prices and outputs12; $v_i$ and $u_i$ are the noise and inefficiency components, respectively. Furthermore, $v_i$ and $u_i$ are assumed to be mutually independent as well as independent of $x_i$. To make the frontier model more flexible (nonparametric), we adopt the following strategy. Consider the usual parametric ML estimator for the normal ($v$) and truncated normal ($u$) stochastic cost frontier model that solves the following problem [16]:

$$\hat\theta = \arg\max_{\theta\in\Theta}\; \sum_{i=1}^{n}\ln l(y_i; x_i, \theta), \tag{4.4}$$

where

$$l(y_i; x_i, \theta) = \Phi(\psi)^{-1}\left[2\pi\left(\omega^2 + \sigma^2\right)\right]^{-1/2}\exp\!\left[-\frac{\left(y_i - x_i'\beta - \mu\right)^2}{2\left(\omega^2 + \sigma^2\right)}\right]\Phi\!\left(\frac{\sigma^2\psi + \omega\left(y_i - x_i'\beta\right)}{\sigma\left(\omega^2 + \sigma^2\right)^{1/2}}\right), \tag{4.5}$$

$\psi = \mu/\omega$, and $\Phi$ denotes the standard normal cumulative distribution function. The parameter vector is $\theta = (\beta', \sigma, \omega, \psi)'$ and the parameter space is $\Theta = \mathbb{R}^k \times \mathbb{R}_{+} \times \mathbb{R}_{+} \times \mathbb{R}$. Local ML estimation of the corresponding nonparametric model involves the following steps. First, we choose a kernel function. A reasonable choice is

$$K_H(d) = (2\pi)^{-m/2}|H|^{-1/2}\exp\!\left(-\tfrac{1}{2}d'H^{-1}d\right), \quad d \in \mathbb{R}^m, \tag{4.6}$$

where $m$ is the dimensionality of $\theta$, $H = h\cdot S$, $h > 0$ is a scalar bandwidth, and $S$ is the sample covariance matrix of $x_i$. Second, we choose a particular point $x \in X$ and solve the following problem:

$$\hat\theta(x) = \arg\max_{\theta\in\Theta}\sum_{i=1}^{n}\left\{-\ln\Phi(\psi) + \ln\Phi\!\left(\frac{\sigma^2\psi + \omega\left(y_i - x_i'\beta\right)}{\sigma\left(\omega^2 + \sigma^2\right)^{1/2}}\right) - \frac{1}{2}\ln\left(\omega^2 + \sigma^2\right) - \frac{1}{2}\frac{\left(y_i - x_i'\beta - \mu\right)^2}{\omega^2 + \sigma^2}\right\}K_H(x_i - x). \tag{4.7}$$

A solution to this problem provides the LML parameter estimates $\hat\beta(x)$, $\hat\sigma(x)$, $\hat\omega(x)$, and $\hat\psi(x)$. Also notice that the weights $K_H(x_i - x)$ do not involve unknown parameters (if $h$ is known), so they can be computed in advance and, therefore, the estimator can be programmed in any standard econometric software.13 For an application of this methodology to US commercial banks, see Kumbhakar and Tsionas [17, 18] and Kumbhakar et al. [15].
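The LML idea in (4.2) and (4.7) is essentially a kernel-weighted likelihood maximized point by point. A minimal sketch under assumed placeholder inputs (Python; obs_loglik stands for any anchoring parametric model's observation-level log-likelihood supplied by the user):

```python
import numpy as np
from scipy.optimize import minimize

def lml_fit(x0, y, X, obs_loglik, theta_start, h=0.5):
    """Local ML at the point x0: maximize the kernel-weighted log-likelihood, cf. (4.2)/(4.7).

    obs_loglik(theta, y, X) must return the vector of observation-level log-likelihoods
    of the chosen anchoring parametric model (a placeholder here).
    """
    S = np.cov(X, rowvar=False)                 # sample covariance of the covariates
    H = h * S
    d = X - x0
    # Gaussian kernel weights; the normalizing constant is dropped since it does not affect the argmax
    w = np.exp(-0.5 * np.einsum("ij,jk,ik->i", d, np.linalg.inv(H), d))
    obj = lambda theta: -np.sum(w * obs_loglik(theta, y, X))
    return minimize(obj, theta_start, method="BFGS").x
```

Repeating the call over a grid of evaluation points x0 traces out the local parameter functions $\hat\theta(x)$.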
5. Some Advances in Stochastic Frontier Analysis

5.1. General

There have been many innovative empirical applications of stochastic frontier analysis in recent years. One of them is in the field of auctions, a particular field of game theory. Advances
and empirical applications in this field are likely to accumulate rapidly and contribute positively to the advancement of empirical game theory and empirical IO. Kumbhakar et al. [19] propose a Bayesian analysis of an auction model where systematic over-bidding and under-bidding are allowed. Extensive simulations are used to show that the new techniques perform well and that ignoring measurement error or systematic over-bidding and under-bidding matters for the final results. Kumbhakar and Parmeter [20] derive the closed-form likelihood and associated efficiency measures for a two-sided stochastic frontier model under the assumption of normal-exponential components. The model has an important application in the labor market, where employees and employers have asymmetric information and each one tries to manipulate the situation to his own advantage: employers would like to hire for less and employees to obtain more in the bargaining process. The precise measurement of these components is, apparently, important. Kumbhakar et al. [21] acknowledge explicitly the fact that certain decision-making units can be fully (i.e., 100%) efficient and propose a new model which is a mixture of (i) a half-normal component for inefficient firms and (ii) a mass at zero for efficient firms. Of course, it is not known in advance which firms are fully efficient. The authors propose classical methods of inference organized around maximum likelihood and provide extensive simulations to explore the validity and relevance of the new techniques under various data-generating processes. Tsionas [22] explores the implications of the convolution $\varepsilon = v + u$ in stochastic frontier models. The fundamental point is that even when the distributions of the error components are nonstandard (e.g., Student-t and half-Student, or normal and half-Student, gamma, symmetric stable, etc.), it is possible to estimate the model by ML via the fast Fourier transform (FFT) when the characteristic functions are available in closed form. These methods can also be used in mixture models, input-oriented efficiency models, two-tiered stochastic frontiers, and so forth. The properties of ML and some GLS techniques are explored, with an emphasis on the normal-truncated normal model, for which the likelihood is available analytically, and simulations are used to determine various quantities that must be set in order to apply ML by FFT. Starting with Annaert et al. [23], stochastic frontier models have been applied very successfully in finance, especially to the important issue of mutual fund performance. Schaefer and Maurer [24] apply these techniques to German funds and find that a fund “may be able to reduce its costs by 46 to 74% when compared with the best-practice complex in the sample.” Of course, much remains to be done in this area to connect stochastic frontier models more closely with practical finance and better mutual fund performance evaluation.
5.2. Panel Data

Panel data have always been a source of inspiration and new models in stochastic frontier analysis. Roughly speaking, panel data are concerned with models of the form $y_{it} = \alpha_i + x_{it}'\beta + v_{it}$, where the $\alpha_i$'s are individual effects, random or fixed, $x_{it}$ is a $k \times 1$ vector of covariates, $\beta$ is a $k \times 1$ parameter vector, and, typically, the error term $v_{it} \sim \mathrm{iid}\; N(0, \sigma_v^2)$. An important contribution in panel data models of efficiency is the incorporation of factors, as in Kneip et al. [25]. Factors arise from the necessity of incorporating more structure into frontier models, a point that is clear after Lee and Schmidt [26]. The authors use smoothing techniques to perform the econometric analysis of the model.
In recent years, the focus of the profession has shifted from the fixed effects model (e.g., Cornwell et al. [27]) to a so-called “true fixed effects model” (TFEM) first proposed by Greene [28]. Greene's model is $y_{it} = \alpha_i + x_{it}'\beta + v_{it} \pm u_{it}$, where $u_{it} \sim \mathrm{iid}\; N^+(0, \sigma_u^2)$. In this model, the individual effects are separated from technical inefficiency. Similar models had been proposed previously by Kumbhakar [29] and Kumbhakar and Hjalmarsson [30], although in these models firm effects were treated as persistent inefficiency. Greene shows that the TFEM can be estimated easily using special Gauss-Newton iterations without the need to explicitly introduce individual dummy variables, which is prohibitive if the number of firms $N$ is large. As Greene [28] notes: “the fixed and random effects estimators force any time invariant cross unit heterogeneity into the same term that is being used to capture the inefficiency. Inefficiency measures in these models may be picking up heterogeneity in addition to or even instead of inefficiency.” For important points and applications see Greene [8, 31]. Greene's [8] findings are somewhat at odds with the perceived incidental parameters problem in this model, as he himself acknowledges. His findings motivated a body of research that tries to deal with the incidental parameters problem in stochastic frontier models and, of course, efficiency estimation. The incidental parameters problem in statistics began with the well-known contribution of Neyman and Scott [32] (see also [33]). Consider stochastic frontier models of the form $y_{it} = \alpha_i + x_{it}'\beta + v_{it}$, for $i = 1, \ldots, N$ and $t = 1, \ldots, T$. The essence of the problem is that as $N$ gets large, the number of unknown parameters (the individual effects $\alpha_i$, $i = 1, \ldots, N$) increases at the same rate, so consistency cannot be achieved. Another route to the incidental parameters problem is well known in efficiency estimation with cross-sectional data ($T = 1$), where JLMS estimates are not consistent. To appreciate the incidental parameters problem better, note that the TFEM implies a density for the $i$th unit, say

$$p_i(\alpha_i, \delta; Y_i) \equiv p_i(\alpha_i, \beta, \sigma, \omega; Y_i), \quad \text{where } Y_i = (y_i, x_i) \text{ is the data}. \tag{5.1}$$
The problem is that the ML estimator

$$\max_{\alpha_1, \ldots, \alpha_n, \delta}\;\prod_{i=1}^{n} p_i(\alpha_i, \delta; Y_i) = \max_{\delta}\;\prod_{i=1}^{n} p_i(\hat\alpha_i, \delta; Y_i) \tag{5.2}$$
is not consistent. The source of the problem is that the concentrated likelihood, using the ML estimator $\hat\alpha_i$, will not deliver consistent estimators for all elements of $\delta$. In frontier models we know that the ML estimators for $\beta$ and $\sigma^2 + \omega^2$ seem to be all right, but the estimator for $\omega$, or the ratio $\lambda = \omega/\sigma$, can be wrong. This is also validated in a recent paper by Chen et al. [34]. There are some approaches to correct such biases in the literature on nonlinear panel data models. (i) Correct the bias to first order using a modified score (first derivatives of the log-likelihood). (ii) Use a penalty function for the log-likelihood; this can of course be related to (i) above. (iii) Apply the panel jackknife. Satchachai and Schmidt [35] have done that recently in a model with fixed effects but without a one-sided component. They derive some interesting results regarding convergence depending on whether we have ties or
not: first differencing produces $O(T^{-1})$, but with ties we have $O(T^{-1/2})$ for the estimator applied when there is a tie. (iv) In line with (ii), one could use a modified likelihood of the form $p_i^*(\delta; Y_i) = \int p_i(\alpha_i, \delta; Y_i)\, w(\alpha_i)\, d\alpha_i$, where $w(\alpha_i)$ is some weighting function, for which it is clear that there is a Bayesian interpretation. Since Greene [8] derived a computationally efficient algorithm for the true fixed effects model, one would think that application of the panel jackknife would reduce the first-order bias of the estimator, and for empirical purposes this might be enough. For further reductions in the bias there remains only the possibility of asymptotic expansions along the lines of related work in nonlinear panel data models. This point has not been explored in the literature, but it seems that it can be used profitably. Wang and Ho [36] show that “first-difference and within-transformation can be analytically performed on this model to remove the fixed individual effects, and thus the estimator is immune to the incidental parameters problem.” The model is, naturally, less general than a standard stochastic frontier model in that the authors assume $u_{it} = f(z_{it}'\delta)u_i$, where $u_i$ is a positive half-normal random variable and $f$ is a positive function. In this model, the dynamics of inefficiency are determined entirely by the function $f$ and the covariates that enter into this function. Recently, Chen et al. [34] proposed a new estimator for the model. If the model is $y_{it} = \alpha_i + x_{it}'\beta + v_{it} - u_{it}$, deviation from the means gives

$$\tilde y_{it} = \tilde x_{it}'\beta + \tilde v_{it} - \tilde u_{it}. \tag{5.3}$$
Given $\beta = \hat\beta_{OLS}$, we have $e_{it} \equiv \tilde y_{it} - \tilde x_{it}'\beta = \tilde v_{it} - \tilde u_{it}$, where the $e_{it}$ are treated as “data.” The distribution of $e_{it}$ belongs to the family of the multivariate closed skew-normal (CSN), so estimating $\lambda$ and $\sigma$ is easy. Of course, the multivariate CSN depends on evaluating a multivariate normal integral in $\mathbb{R}^{T+1}$. With $T > 5$, this is not a trivial problem (see [37]). There is reason to believe that “average likelihood” or a fully Bayesian approach can perform much better relative to sampling-theory treatments. Indeed, the true fixed effects model is nothing but another instance of the incidental parameters problem. Recent advances suggest that the best treatment can be found in “average” or “integrated” likelihood functions. For work in this direction, see Lancaster [33], Arellano and Bonhomme [38], Arellano and Hahn [39, 40], Berger et al. [41], and Bester and Hansen [42, 43]. The performance of such methods in the context of the TFEM remains to be seen. Tsionas and Kumbhakar [44] propose a full Bayesian solution to the problem. The approach is obvious in a sense, since the TFEM can be cast as a hierarchical model. The authors show that the obvious parameterization of the model does not perform well in simulated experiments and, therefore, they propose a new parameterization that is shown to effectively eliminate the incidental parameters problem. They also extend the TFEM to models with both individual and time effects. Of course, the TFEM is cast in terms of a random effects model, so it is at first sight not directly related to [35].
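The incidental parameters problem discussed above is easy to reproduce numerically. The sketch below (Python; a simulation constructed for illustration, not taken from the papers cited) mimics the classic Neyman–Scott setting: with fixed T, the ML estimator of the variance that treats the $\alpha_i$ as parameters stays biased no matter how large N gets.

```python
import numpy as np

def neyman_scott_mle_sigma2(N=2000, T=3, sigma2=1.0, seed=0):
    """ML estimate of sigma^2 when each unit has its own fixed effect alpha_i."""
    rng = np.random.default_rng(seed)
    alpha = rng.normal(0.0, 2.0, size=N)                 # incidental parameters
    y = alpha[:, None] + rng.normal(0.0, np.sqrt(sigma2), size=(N, T))
    resid = y - y.mean(axis=1, keepdims=True)            # alpha_i profiled out by unit means
    return (resid**2).sum() / (N * T)                    # MLE; converges to sigma2*(T-1)/T, not sigma2

print(neyman_scott_mle_sigma2())    # roughly 0.67 for T = 3, illustrating the fixed-T bias
```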
6. Thoughts on Current State of Efficiency Estimation and Panel Data

6.1. Nature of Individual Effects

If we think about the model $y_{it} = \alpha_i + x_{it}'\beta + v_{it} + u_{it}$, $u_{it} \ge 0$, as $N \to \infty$, one natural question is: do we really expect ourselves to be so agnostic about the fixed effects as to allow $\alpha_{n+1}$ to
be completely different from what we already know about $\alpha_1, \ldots, \alpha_n$? This is rarely the case. But we do not adopt the true fixed effects model for that reason; there are other reasons. If we ignore this choice, we can adopt a finite mixture of normal distributions for the effects. In principle this can approximate well any distribution of the effects, so with enough latent classes we should be able to approximate the weight function $w(\alpha_i)$ quite well. That would impose some structure on the model, and it would avoid the incidental parameters problem if the number of classes grows slowly and at a lower rate than $N$, so for fixed $T$ there should be no significant bias. For really small $T$, a further bias correction device like asymptotic expansions or the jackknife could be used. Since we do not adopt the true fixed effects model for that reason, why do we adopt it? Because the effects and the regressors are potentially correlated in a random effects framework, so it is preferable to think of them as parameters. It could be that $\alpha_i = h(x_{i1}, \ldots, x_{iT})$, or perhaps $\alpha_i = H(\bar x_i)$, where $\bar x_i = T^{-1}\sum_{t=1}^{T} x_{it}$. Mundlak [45] first wrote about this model. In some cases it makes sense. Consider the alternative model:

$$y_{it} = \alpha_{it} + x_{it}'\beta + v_{it} + u_{it}, \qquad \alpha_{it} = \alpha(1 - \rho) + \rho\,\alpha_{it-1} + x_{it}'\gamma + \varepsilon_{it}. \tag{6.1}$$
Under stationarity this has the same implications as Mundlak's original model, but in many cases it makes much more sense: mutual fund rating and evaluation is one of them. But even if we stay with Mundlak's original specification, many other possibilities are open. For small $T$, the most interesting case, approximation of $\alpha_i = h(x_{i1}, \ldots, x_{iT})$ by some flexible functional form should be enough for practical purposes. By "practical purposes" we mean bias reduction to order $O(T^{-1})$ or better. If the model becomes

$y_{it} = \alpha_i + x_{it}'\beta + v_{it} + u_{it} = h(x_{i1}, \ldots, x_{iT}) + x_{it}'\beta + v_{it} + u_{it}, \qquad t = 1, \ldots, T, \quad (6.2)$
adaptation of known nonparametric techniques should provide that rate of convergence. Reduction to

$y_{it} = \alpha_i + x_{it}'\beta + v_{it} + u_{it} = H(\bar{x}_i) + x_{it}'\beta + v_{it} + u_{it}, \qquad t = 1, \ldots, T, \quad (6.3)$
would facilitate the analysis considerably without sacrificing the rate. It is quite probable that $h$ or $H$ should be low-order polynomials or basis functions with some nice properties (Bernstein polynomials, for example) that can overcome the incidental parameters problem.
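As a rough illustration of the device in (6.3), the sketch below approximates $H(\bar{x}_i)$ with a low-order polynomial in the group means and appends those terms to the regressor matrix before estimation. It is a minimal sketch under assumed data shapes (a balanced panel stored as NumPy arrays) and an arbitrary quadratic order; it is not the estimator analyzed in the text.

```python
import numpy as np

def add_group_mean_polynomial(X, groups, order=2):
    """Append a low-order polynomial in the group means of X to the regressors.

    X      : (NT, K) array of time-varying regressors
    groups : (NT,) integer array with the group index of each row
    order  : polynomial order used to approximate H(x_bar_i)  (assumption)
    """
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    xbar = np.zeros_like(X)
    for g in np.unique(groups):
        idx = groups == g
        xbar[idx] = X[idx].mean(axis=0)          # group means, repeated over t
    poly_terms = [xbar ** p for p in range(1, order + 1)]
    return np.hstack([X] + poly_terms)           # augmented regressor matrix

# Example: 3 groups, 4 periods, 2 regressors
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 2))
groups = np.repeat(np.arange(3), 4)
X_aug = add_group_mean_polynomial(X, groups, order=2)
print(X_aug.shape)   # (12, 6): original columns plus xbar and xbar^2 terms
```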
6.2. Random Coefficient Models
Consider $y_{it} = \alpha_i + x_{it}'\beta_i + v_{it} + u_{it} \equiv z_{it}'\gamma_i + v_{it} + u_{it}$ [46]. Typically we assume that $\gamma_i \sim \text{iid}\, N_K(\gamma, \Omega)$. For small to moderate panels (say $T = 5$ to 10), adaptation of the techniques in the paper by Chen et al. [34] would be quite difficult to implement in the context of fixed effects. The concern is again with the evaluation of $(T+1)$-dimensional normal integrals when $T$ is large. Here, again, we are subject to the incidental parameters problem; we never really escape the "small sample" situation.
One way to proceed is the so-called CAR (conditionally autoregressive) prior model of the form

$\gamma_i \mid \gamma_j \sim N\!\left(\gamma_j, \varphi_{ij}^2\right) \quad \forall\, i \neq j, \qquad \log \varphi_{ij}^2 = \delta_0 + \delta_1 \left\lVert w_i - w_j \right\rVert, \quad (6.4)$

if $\gamma_i$ is scalar (the TFEM).
In the multivariate case we would need something like the BEKK factorization of a covariance matrix, as in multivariate GARCH processes. The point is that the coefficients cannot be too dissimilar, and their degree of dissimilarity depends on a parameter $\varphi^2$ that can be made a function of covariates, if any. Under different DGPs, it would be interesting to know how the Bayesian estimator of this model behaves in practice.
6.3. More on Individual Effects
Related to the above discussion, it is productive to think about sources, that is, where these $\alpha_i$'s or $\gamma_i$'s come from. Of course, we have Mundlak's [45] interpretation in place. In practice we have different technologies represented by, say, cost functions. The standard cost function $y_{it} = \alpha_0 + x_{it}'\beta + v_{it}$ with input-oriented technical inefficiency results in $y_{it} = \alpha_0 + x_{it}'\beta + v_{it} + u_{it}$, $u_{it} \ge 0$. Presence of allocative inefficiency results in a much more complicated model: $y_{it} = \alpha_0 + x_{it}'\beta + v_{it} + u_{it} + G_{it}(x_{it}, \beta, \xi_{it})$, where $\xi_{it}$ is the vector of price distortions (see [47, 48]). So, under some reasonable economic assumptions and a common technology, we end up with a nonlinear effects model through the $G(\cdot)$ function. Of course, one can apply the TFEM here, but that would not correspond to the true DGP, so issues of consistency are at stake. It is hard to imagine a situation where, in a TFEM $y_{it} = \alpha_i + x_{it}'\beta + v_{it} + u_{it}$, the $\alpha_i$'s can be anything and are subject to no "similarity" constraints. We can, of course, accept that $\alpha_i \approx \alpha_0 + G_{it}(x_{it}, \beta, \xi_{it})$, so at least for the translog we should have a rough guide on what these effects represent under allocative inefficiency (first-order approximations to the complicated $G$ term are available when the $\xi_{it}$'s are small). Of course, one then has to think about the nature of the allocative distortions, but at least that is an economic problem.
6.4. Why a Bayesian Approach?
Chen et al.'s transformation that used the multivariate CSN is one class of transformations, but many transformations are possible because the TFEM does not have the property of information orthogonality [33]. The "best" transformation, the one that is "maximally bias reducing," cannot be taking deviations from the means, because the information matrix is not block diagonal with respect to $\alpha_i$ and $\lambda = \omega/\sigma$ (signal-to-noise). Other transformations would be more effective, and it is not difficult to find them, in principle. Recently, Tsionas and Kumbhakar [44] considered a different model, namely $y_{it} = \alpha_i - \delta_i + x_{it}'\beta + v_{it} - u_{it}$, where $\delta_i \ge 0$ is persistent inefficiency. They used a Bayesian approach. Colombi et al. [49] used the same model but a classical ML approach to estimate the parameters as well as the inefficiency components. The finite sample properties of the Bayes estimators (posterior means and medians) in Tsionas and Kumbhakar [44] were found to be very good for small samples with $\lambda$ values typically encountered in practice (of course, one needs to keep $\lambda$ away from zero in the DGP). The moral of the story is that in the random effects model, an integrated likelihood approach based on reasonable priors,
a nonparametric approach based on low-order polynomials, or a finite mixture model might provide an acceptable approximation to parameters like $\lambda$. Coupled with a panel jackknife device, these approaches can be really effective in mitigating the incidental parameters problem. For one, in the context of TFE, we do not know how the Chen et al. [34] estimator would behave under strange DGPs, that is, under strange processes for the incidental parameters. We have some evidence from Monte Carlo, but we need to think about more general "mitigating strategies." The integrated likelihood approach is one, and it is close to a Bayesian approach. Finite mixtures also hold great promise since they have good approximating properties. The panel jackknife device is certainly something to think about. Also, analytical devices for bias reduction to order $O(T^{-1})$ or $O(T^{-2})$ are available from the likelihood function of the TFEM (score and information). Their implementation in software should be quite easy.
7. Conclusions
In this paper we presented some new techniques to estimate technical inefficiency using stochastic frontier techniques. First, we presented a technique to estimate a nonhomogeneous technology using IO technical inefficiency. We then discussed the IO and OO controversy in the light of distance functions and the dual cost and profit functions. The second part of the paper addressed the latent class modeling approach incorporating behavioral heterogeneity. The last part of the paper addressed the LML method that can solve the functional form issue in parametric stochastic frontier models. Finally, we added a section that deals with some very recent advances.
Endnotes
1. Another measure is hyperbolic technical inefficiency, which combines both the IO and OO measures in a special way (see, e.g., [50], Cuesta and Zofio (1999), [12]). This measure is not as popular as the other two.
2. On the contrary, the OO model has been estimated by many authors using DEA (see, e.g., [51] and the references cited therein).
3. Alvarez et al. [52] addressed these issues in a panel data framework with time-invariant technical inefficiency using a fixed effects model.
4. The above equation gives the IO model when the production function is homogeneous, by labeling $\lambda_i = r\theta_i$.
5. Alvarez et al. [52] estimated an IO primal model in a panel data model where technical inefficiency is assumed to be fixed and parametric.
6. Greene [14] used SML for the OO normal-gamma model.
7. See also Kumbhakar and Lovell [11], pages 74–82, for the log-likelihood functions under both half-normal and exponential distributions for the OO technical inefficiency term.
8. It is not necessary to use the simulated ML method to estimate the parameters of the frontier models if the technical inefficiency component is distributed as half-normal, truncated normal, or exponential along with the normality assumption on the noise component. For other distributions, for example, gamma for technical inefficiency and normal for the noise component, the standard ML method may not be ideal (see Greene
[14], who used the simulated ML method to estimate OO technical efficiency in the gamma-normal model).
9. Atkinson and Cornwell [53] estimated translog cost functions with both input- and output-oriented technical inefficiency using panel data. They assumed technical inefficiency to be fixed and time-invariant. See also Orea et al. [12].
10. See, in addition, Beard et al. [54, 55] for applications using a non-frontier approach. For applications in the social sciences, see Hagenaars and McCutcheon [56]. Statistical aspects of mixing models are discussed in detail in McLachlan and Peel [57].
11. LML estimation has been proposed by Tibshirani [58] and has been applied by Gozalo and Linton [59] in the context of nonparametric estimation of discrete response models.
12. The cost function specification is discussed in detail in Section 5.2.
13. An alternative, which could be relevant in some applications, is to localize based on a vector of exogenous variables $z_i$ instead of the $x_i$'s. In that case, the LML problem becomes

$\hat{\theta}(z) = \arg\max_{\theta \in \Theta} \sum_{i=1}^{n} \left\{ \ln \Phi\!\left( \frac{\sigma \psi + \omega (y_i - x_i'\beta)/\sigma}{(\omega^2 + \sigma^2)^{1/2}} \right) - \ln \Phi(\psi) - \frac{1}{2} \ln\!\left(\omega^2 + \sigma^2\right) - \frac{1}{2}\, \frac{(y_i - x_i'\beta - \mu)^2}{\omega^2 + \sigma^2} \right\} K_H(z_i - z),$

where $z$ denotes the given values of the vector of exogenous variables. The main feature of this formulation is that the $\beta$ parameters, as well as $\sigma$, $\omega$, and $\psi$, will now be functions of $z$ instead of $x$.
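A minimal numerical sketch of this localized criterion, under the reconstruction written above (truncated-normal inefficiency with $\psi = \mu/\omega$) and a Gaussian product kernel for $K_H$, is given below. The parameterization, the simulated data, the bandwidth, and the optimizer settings are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def neg_local_loglik(theta, y, X, Z, z0, h):
    """Negative local log-likelihood at z0 (normal/truncated-normal composed error).

    theta = (beta, mu, log sigma^2, log omega^2); h is a kernel bandwidth (assumption).
    """
    k = X.shape[1]
    beta, mu = theta[:k], theta[k]
    sigma2, omega2 = np.exp(theta[k + 1]), np.exp(theta[k + 2])
    sigma, omega = np.sqrt(sigma2), np.sqrt(omega2)
    psi = mu / omega
    eps = y - X @ beta
    s2 = omega2 + sigma2
    contrib = (norm.logcdf((sigma * psi + omega * eps / sigma) / np.sqrt(s2))
               - norm.logcdf(psi)
               - 0.5 * np.log(s2)
               - 0.5 * (eps - mu) ** 2 / s2)
    w = np.exp(-0.5 * np.sum(((Z - z0) / h) ** 2, axis=1))   # Gaussian kernel weights
    return -np.sum(contrib * w)

# Illustrative use with simulated data (hypothetical DGP)
rng = np.random.default_rng(1)
n, k = 200, 2
X = rng.normal(size=(n, k))
Z = rng.normal(size=(n, 1))
y = X @ np.array([0.5, -0.2]) + rng.normal(0, 0.3, n) + np.abs(rng.normal(0.2, 0.4, n))
start = np.concatenate([np.zeros(k), [0.1], np.log([0.1, 0.1])])
res = minimize(neg_local_loglik, start, args=(y, X, Z, np.zeros(1), 0.5), method="BFGS")
print(res.x[:k])   # local slope estimates at z0 = 0
```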
References 1 S. C. Kumbhakar and E. G. Tsionas, “Estimation of stochastic frontier production functions with input-oriented technical efficiency,” Journal of Econometrics, vol. 133, no. 1, pp. 71–96, 2006. 2 D. Aigner, C. A. K. Lovell, and P. Schmidt, “Formulation and estimation of stochastic frontier production function models,” Journal of Econometrics, vol. 6, no. 1, pp. 21–37, 1977. 3 W. Meeusen and J. van den Broeck, “Efficiency estimation from Cobb-Douglas production functions with composed error,” International Economic Review, vol. 18, no. 2, pp. 435–444, 1977. 4 J. Jondrow, C. A. Knox Lovell, I. S. Materov, and P. Schmidt, “On the estimation of technical inefficiency in the stochastic frontier production function model,” Journal of Econometrics, vol. 19, no. 2-3, pp. 233–238, 1982. 5 G. E. Battese and T. J. Coelli, “Prediction of firm-level technical efficiencies with a generalized frontier production function and panel data,” Journal of Econometrics, vol. 38, no. 3, pp. 387–399, 1988. 6 C. Arias and A. Alvarez, “A note on dual estimation of technical efficiency,” in Proceedings of the 1st Oviedo Efficiency Workshop, University of Oviedo, 1998. 7 W. Greene, “Fixed and random effects in nonlinear models,” Working Paper, Department of Economics, Stern School of Business, NYU, 2001. 8 W. Greene, “Reconsidering heterogeneity in panel data estimators of the stochastic frontier model,” Journal of Econometrics, vol. 126, no. 2, pp. 269–303, 2005. 9 S. B. Caudill, “Estimating a mixture of stochastic frontier regression models via the em algorithm: a multiproduct cost function application,” Empirical Economics, vol. 28, no. 3, pp. 581–598, 2003. 10 L. Orea and S. C. Kumbhakar, “Efficiency measurement using a latent class stochastic frontier model,” Empirical Economics, vol. 29, no. 1, pp. 169–183, 2004. 11 S. C. Kumbhakar and C. A. K. Lovell, Stochastic Frontier Analysis, Cambridge University Press, Cambridge, UK, 2000.
12 L. Orea, D. Roib´as, and A. Wall, “Choosing the technical efficiency orientation to analyze firms’ technology: a model selection test approach,” Journal of Productivity Analysis, vol. 22, no. 1-2, pp. 51– 71, 2004. 13 K. E. Train, Discrete Choice Methods with Simulation, Cambridge University Press, Cambridge, UK, 2nd edition, 2009. 14 W. H. Greene, “Simulated likelihood estimation of the normal-gamma stochastic frontier function,” Journal of Productivity Analysis, vol. 19, no. 2-3, pp. 179–190, 2003. 15 S. C. Kumbhakar, B. U. Park, L. Simar, and E. G. Tsionas, “Nonparametric stochastic frontiers: a local maximum likelihood approach,” Journal of Econometrics, vol. 137, no. 1, pp. 1–27, 2007. 16 R. E. Stevenson, “Likelihood functions for generalized stochastic frontier estimation,” Journal of Econometrics, vol. 13, no. 1, pp. 57–66, 1980. 17 S. C. Kumbhakar and E. G. Tsionas, “Scale and efficiency measurement using a semiparametric stochastic frontier model: evidence from the U.S. commercial banks,” Empirical Economics, vol. 34, no. 3, pp. 585–602, 2008. 18 S. C. Kumbhakar and E. Tsionas, “Estimation of cost vs. profit systems with and without technical inefficiency,” Academia Economic Papers, vol. 36, pp. 145–164, 2008. 19 S. Kumbhakar, C. Parmeter, and E. G. Tsionas, “Bayesian estimation approaches to first-price auctions,” to appear in Journal of Econometrics. 20 S. C. Kumbhakar and C. F. Parmeter, “The effects of match uncertainty and bargaining on labor market outcomes: evidence from firm and worker specific estimates,” Journal of Productivity Analysis, vol. 31, no. 1, pp. 1–14, 2009. 21 S. Kumbhakar, C. Parmeter, and E. G. Tsionas, “A zero inflated stochastic frontier model,” unpublished. 22 E. G. Tsionas, “Maximum likelihood estimation of non-standard stochastic frontier models by the Fourier transform,” to appear in Journal of Econometrics. 23 J. Annaert, J. Van den Broeck, and R. V. Vennet, “Determinants of mutual fund underperformance: a Bayesian stochastic frontier approach,” European Journal of Operational Research, vol. 151, no. 3, pp. 617–632, 2003. 24 A. Schaefer, R. Maurer et al., “Cost efficiency of German mutual fund complexes,” SSRN working paper, 2010. 25 A. Kneip, W. H. Song, and R. Sickles, “A new panel data treatment for heterogeneity in time trends,” to appear in Econometric Theory. 26 Y. H. Lee and P. Schmidt, “A production frontier model with flexible temporal variation in technical inefficiency,” in The Measurement of Productive Efficiency: Techniques and Applications, H. O. Fried, C. A. K. Lovell, and S. S. Schmidt, Eds., Oxford University Press, New York, NY, USA, 1993. 27 C. Cornwell, P. Schmidt, and R. C. Sickles, “Production frontiers with cross-sectional and time-series variation in efficiency levels,” Journal of Econometrics, vol. 46, no. 1-2, pp. 185–200, 1990. 28 W. Greene, “Fixed and random effects in stochastic frontier models,” Journal of Productivity Analysis, vol. 23, no. 1, pp. 7–32, 2005. 29 S. C. Kumbhakar, “Estimation of technical inefficiency in panel data models with firm- and timespecific effects,” Economics Letters, vol. 36, no. 1, pp. 43–48, 1991. 30 S. C. Kumbhakar and L. Hjalmarsson, “Technical efficiency and technical progress in Swedish dairy farms,” in The Measurement of Productive Efficiency—Techniques and Applications, H. O. Fried, C. A. K. Lovell, and S. S. Schmidt, Eds., pp. 256–270, Oxford University Press, Oxford, UK, 1993. 31 W. 
Greene, “Distinguishing between heterogeneity and inefficiency: stochastic frontier analysis of the World Health Organization’s panel data on national health care systems,” Health Economics, vol. 13, no. 10, pp. 959–980, 2004. 32 J. Neyman and E. L. Scott, “Consistent estimates based on partially consistent observations,” Econometrica, vol. 16, pp. 1–32, 1948. 33 T. Lancaster, “The incidental parameter problem since 1948,” Journal of Econometrics, vol. 95, no. 2, pp. 391–413, 2000. 34 Y.-Y. Chen, P. Schmidt, and H.-J. Wang, “Consistent estimation of the fixed effects stochastic frontier model,” in Proceedings of the European Workshop on Efficiency and Productivity Analysis (EWEPA ’11), Verona, Italy, 2011.
35 P. Satchachai and P. Schmidt, “Estimates of technical inefficiency in stochastic frontier models with panel data: generalized panel jackknife estimation,” Journal of Productivity Analysis, vol. 34, no. 2, pp. 83–97, 2010. 36 H. J. Wang and C. W. Ho, “Estimating fixed-effect panel stochastic frontier models by model transformation,” Journal of Econometrics, vol. 157, no. 2, pp. 286–296, 2010. 37 A. Genz, “An adaptive multidimensional quadrature procedure,” Computer Physics Communications, vol. 4, no. 1, pp. 11–15, 1972. 38 M. Arellano and S. Bonhomme, “Robust priors in nonlinear panel data models,” Working paper, CEMFI, Madrid, Spain, 2006. 39 M. Arellano and J. Hahn, “Understanding bias in nonlinear panel models: some recent developments,” in Advances in Economics and Econometrics, Ninth World Congress, R. Blundell, W. Newey, and T. Persson, Eds., Cambridge University Press, 2006. 40 M. Arellano and J. Hahn, “A likelihood-based approximate solution to the incidental parameter problem in dynamic nonlinear models with multiple effects,” Working Papers, CEMFI, 2006. 41 J. O. Berger, B. Liseo, and R. L. Wolpert, “Integrated likelihood methods for eliminating nuisance parameters,” Statistical Science, vol. 14, no. 1, pp. 1–28, 1999. 42 C. A. Bester and C. Hansen, “A penalty function approach to bias reduction in non-linear panel models with fixed effects,” Journal of Business and. Economic Statistics, vol. 27, pp. 29–51, 2009. 43 C. A. Bester and C. Hansen, “Bias reduction for Bayesian and frequentist estimators,” unpublished. 44 E. G. Tsionas, S. C. Kumbhakar et al., “Firm-heterogeneity, persistent and transient technical inefficiency,” MPRA REPEC working paper, 2011. 45 Y. Mundlak, “Empirical production function free of management bias,” Journal of Farm Economics, vol. 43, no. 1, pp. 44–56, 1961. 46 E. G. Tsionas, “Stochastic frontier models with random coefficients,” Journal of Applied Econometrics, vol. 17, no. 2, pp. 127–147, 2002. 47 S. C. Kumbhakar, “Modeling allocative inefficiency in a translog cost function and cost share equations: an exact relationship,” Journal of Econometrics, vol. 76, no. 1-2, pp. 351–356, 1997. 48 S. C. Kumbhakar and E. G. Tsionas, “Measuring technical and allocative inefficiency in the translog cost system: a Bayesian approach,” Journal of Econometrics, vol. 126, no. 2, pp. 355–384, 2005. 49 R. Colombi, G. Martini, and G. Vittadini, “A stochastic frontier model with short-run and longrun inefficiency random effects,” Working Papers, Department of Economics and Technology Management, University of Bergamo, 2011. 50 R. F¨are, S. Grosskopf, and C. A. K. Lovell, The Measurement of Efficiency of Production, Kluwer Nijhoff Publishing, Boston, Mass, USA, 1985. 51 S. C. Ray, Data Envelopment Analysis, Cambridge University Press, Cambridge, UK, 2004. 52 A. Alvarez, C. Arias, and S. C. Kumbhakar, Empirical Consequences of Direction Choice in Technical Efficiency Analysis, SUNY, Binghamton, NY, USA, 2003. 53 S. E. Atkinson and C. Cornwell, “Estimation of output and input technical efficiency using a flexible functional form and panel data,” International Economic Review, vol. 35, pp. 245–256, 1994. 54 T. Beard, S. Caudill, and D. Gropper, “Finite mixture estimation of multiproduct cost functions,” Review of Economics and Statistics, vol. 73, pp. 654–664, 1991. 55 T. R. Beard, S. B. Caudill, and D. M. Gropper, “The diffusion of production processes in the U.S. Banking industry: a finite mixture approach,” Journal of Banking and Finance, vol. 21, no. 5, pp. 
721–740, 1997. 56 J. A. Hagenaars and A. L. McCutcheon, Applied Latent Class Analysis, Cambridge University Press, New York, NY, USA, 2002. 57 G. McLachlan and D. Peel, Finite Mixture Models, Wiley Series in Probability and Statistics: Applied Probability and Statistics, Wiley-Interscience, New York, NY, USA, 2000. 58 R. Tibshirani, Local Likelihood Estimation, Ph.D. thesis, Stanford University, 1984. 59 P. Gozalo and O. Linton, “Local nonlinear least squares: using parametric information in nonparametric regression,” Journal of Econometrics, vol. 99, no. 1, pp. 63–106, 2000.
Hindawi Publishing Corporation Journal of Probability and Statistics Volume 2011, Article ID 568457, 13 pages doi:10.1155/2011/568457
Research Article
Estimation of Stochastic Frontier Models with Fixed Effects through Monte Carlo Maximum Likelihood
Grigorios Emvalomatis,1 Spiro E. Stefanou,1,2 and Alfons Oude Lansink1
1 Business Economics Group, Wageningen University, 6707 KN Wageningen, The Netherlands
2 Department of Agricultural Economics and Rural Sociology, The Pennsylvania State University, University Park, PA 16802, USA
Correspondence should be addressed to Grigorios Emvalomatis,
[email protected] Received 30 June 2011; Accepted 31 August 2011 Academic Editor: Mike Tsionas Copyright q 2011 Grigorios Emvalomatis et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Estimation of nonlinear fixed-effects models is plagued by the incidental parameters problem. This paper proposes a procedure for choosing appropriate densities for integrating the incidental parameters from the likelihood function in a general context. The densities are based on priors that are updated using information from the data and are robust to possible correlation of the group-specific constant terms with the explanatory variables. Monte Carlo experiments are performed in the specific context of stochastic frontier models to examine and compare the sampling properties of the proposed estimator with those of the random-effects and correlated random-effects estimators. The results suggest that the estimator is unbiased even in short panels. An application to a cross-country panel of EU manufacturing industries is presented as well. The proposed estimator produces a distribution of efficiency scores suggesting that these industries are highly efficient, while the other estimators suggest much poorer performance.
1. Introduction The incidental parameters problem was formally defined and studied by Neyman and Scott 1. In general, the problem appears in many models for which the number of parameters to be estimated grows with the number of observations. In such a model, even parameters that are common to all observations cannot be consistently estimated due to their dependence on observation- or group-specific parameters. In econometrics, the issue appears to be more
relevant in panel-data models where the incidental parameters—although not as much parameters as latent data—represent group-specific intercepts. In this setting, the number of incidental parameters grows linearly with the cross-sectional dimension of the panel. Evidence on the inconsistency of estimators when the problem is ignored are available for discrete choice models 2, the Tobit 3, and the stochastic frontier models 4, 5. Lancaster 6 identified three axes around which the proposed solutions concentrate: i integrate the incidental parameters out from the likelihood based on an assumed density, ii replace the incidental parameters in the likelihood function by their maximum-likelihood estimates and maximize the resulting profile with respect to the common parameters, and iii transform the incidental parameters in a way that they become approximately orthogonal to the common parameters, and then integrate them from the likelihood using a uniform prior. In the case of integrated likelihood, the Bayesian approach is straightforward: formulate a prior for each incidental parameter and use this prior to integrate them from the likelihood. As pointed out by Chamberlain 7, such a procedure does not provide a definite solution to the problem. When the number of incidental parameters grows, the number of priors placed on these parameters will grow as well and, therefore, the priors will never be dominated by the data. It appears, however, that this is the best that could be done. In the end, the problem of incidental parameters becomes one of choosing appropriate priors. There exists no direct counterpart to the Bayesian approach in frequentist statistics. Instead, a random-effects formulation of the problem could be used 4, 5. In this setting, the incidental parameters are integrated from the likelihood based on a “prior” density. However, this density is not updated by the data in terms of its shape, but only in terms of its parameters. As such, it cannot be considered a prior in the sense the term is used in a Bayesian framework. The usual practice is to use a normal density as a “prior” which does not depend on the data. This random-effects formulation will produce a fixed-T consistent estimator as long as the true underlying data-generating process is such that the group-specific parameters are uncorrelated with the regressors. Allowing for the incidental parameters to be correlated with the regressors, Abdulai and Tietje 8 use Mundlak’s 9 view on the relationship between fixed- and random-effects estimators, in the context of a stochastic frontier model. Although this approach is likely to mitigate the bias of the random-effects estimator, there is no evidence on how Mundlak’s estimator performs in nonlinear models. This study proposes a different method for integrating the incidental parameters from the likelihood in a frequentist setting. In panel-data models, the incidental parameters are treated as missing data and the approach developed by Gelfand and Carlin 10 is used to update a true prior in the Bayesian sense on the incidental parameters using information from the data. The formulated posterior is then used to integrate the incidental parameters from the likelihood. The rest of the paper is organized as follows: in Section 2, the proposed estimator is developed in a general framework and related to existing frequentist and Bayesian estimators. The following section discusses some practical considerations and possible computational pitfalls. 
Section 4 presents a set of Monte Carlo experiments in the specific context of stochastic frontier models. The sampling properties of the proposed estimator are compared to those of the linear fixed effects, random effects, and random effects with Mundlak’s approach. The next section provides an application of the estimators to a dataset of EU manufacturing industries, while Section 6 presents some concluding comments.
2. Monte Carlo Maximum Likelihood in Panel Data Models
We consider the following general formulation of a panel-data model:

$y_{it} = z_i'\gamma + x_{it}'\beta + \varepsilon_{it}, \quad (2.1)$
where $y_{it}$ and $x_{it}$ are time-varying observed data. The $z_i$'s are time-invariant and unobserved data, which are potentially correlated with the $x_i$'s. Since the $z_i$'s are unobserved, they will be absorbed in the group-specific constant term. The estimable model becomes

$y_{it} = \alpha_i + x_{it}'\beta + \varepsilon_{it}. \quad (2.2)$
The view of the fixed effects as latent data rather than parameters justifies, from a frequentist perspective, the subsequent integration of the $\alpha_i$'s from the likelihood function. The nature of the dependent variable (discrete, censored, etc.) and different distributional assumptions on $\varepsilon$ give rise to an array of econometric models. In such a model, it is usually straightforward to derive the density of the dependent variable conditional on the independent variables and the group-specific intercept. Let $y_i$ and $x_i$ be the vector and matrix of the stacked data for group $i$. The contribution to the likelihood of the $i$th group conditional on $\alpha_i$ is

$L(\theta \mid y_i, x_i, \alpha_i) = f(y_i \mid x_i, \theta, \alpha_i), \quad (2.3)$
where $f(y_{it} \mid x_{it}, \theta, \alpha_i)$ is easy to specify. Maximum likelihood estimation is based on the density of observed data, that is, on the density of $y_i$ marginally with respect to $\alpha_i$. In an integrated-likelihood approach, the fixed effects are integrated out from the joint density of $y_i$ and $\alpha_i$. We follow Gelfand and Carlin [10] to derive an appropriate density according to which such an integration can be carried out, by writing the density of the data marginally with respect to the fixed effects as

$f(y_i \mid x_i, \theta) = f(y_i \mid x_i, \psi) \int_{A_i} \frac{f(y_i, \alpha_i \mid x_i, \theta)}{f(y_i, \alpha_i \mid x_i, \psi)}\, f(\alpha_i \mid y_i, x_i, \psi)\, d\alpha_i, \quad (2.4)$
where $\psi$ is any point in the parameter space of $\theta$. It is obvious from this formulation that $f(y_i, \alpha_i \mid x_i, \psi)$ plays the role of an importance density for the evaluation of the integral. However, it is a very specific importance density: it has the same functional form as the unknown $f(y_i, \alpha_i \mid x_i, \theta)$, but is evaluated at a point $\psi$ chosen by the researcher. The same functional form of the integrand and the importance density can be exploited to reach a form in which, under some additional assumptions, every density will either be known or be easy to assume a functional form for. Typically, the integral in (2.4) would be evaluated by simulation. For this formulation of the marginal likelihood, Geyer [11] showed that, under loose regularity conditions, the Monte Carlo likelihood hypoconverges to the theoretical likelihood and the Monte Carlo maximum likelihood estimates converge to the maximum likelihood estimates with probability 1.
The joint density of $y_i$ and $\alpha_i$ can be written as the product of the conditional likelihood, known from (2.3), and the marginal density of $\alpha_i$. Then, (2.4) becomes

$f(y_i \mid x_i, \theta) = f(y_i \mid x_i, \psi) \int_{A_i} \frac{f(y_i \mid x_i, \theta, \alpha_i)\, p(\alpha_i \mid x_i, \theta)}{f(y_i \mid x_i, \psi, \alpha_i)\, p(\alpha_i \mid x_i, \psi)}\, f(\alpha_i \mid y_i, x_i, \psi)\, d\alpha_i. \quad (2.5)$
The following assumption is imposed on the data-generating process:

$p(\alpha_i \mid x_i, \theta) = p(\alpha_i \mid x_i, \psi) \quad \forall\, \alpha_i \in A_i. \quad (2.6)$
In words, this assumption means that the way $\alpha_i$ and $x_i$ are related does not depend on $\theta$. One may think of this as the relationship between $\alpha_i$ and $x_i$ being determined by a set of parameters $\eta$, prior to the realization of $y_i$. In mathematical terms, this would require that $f(y_i, \alpha_i \mid x_i, \theta, \eta) = f(y_i \mid x_i, \theta, \alpha_i) \times p(\alpha_i \mid x_i, \eta)$. This implies that the set of parameters that enter the distribution of $\alpha_i$ conditionally on $x_i$, but unconditionally on $y_i$, is disjoint from $\theta$. In practice, and depending on the application at hand, this assumption may or may not be restrictive. Consider, for example, the specification of a production function where $y$ is output, $x$ is a vector of inputs, and $\alpha$ represents the effect of time-invariant unobserved characteristics, such as the location of the production unit, on output. The assumption stated in (2.6) implies that, although location may affect the levels of input use, the joint density of location and inputs does not involve the marginal productivity of inputs. On the other hand, conditionally on output, the density of $\alpha$ does involve the marginal productivity coefficients, since this conditional density is obtained by applying Bayes' rule on $f(y_i \mid x_i, \theta, \alpha_i)$. Under the assumption stated in (2.6), (2.5) can be simplified to

$L(\theta \mid y_i, x_i) = f(y_i \mid x_i, \psi) \int_{A_i} \frac{f(y_i \mid x_i, \theta, \alpha_i)}{f(y_i \mid x_i, \psi, \alpha_i)}\, f(\alpha_i \mid y_i, x_i, \psi)\, d\alpha_i. \quad (2.7)$
Theoretically, $f(\alpha_i \mid y_i, x_i, \psi)$ can be specified in a way that takes into account any prior beliefs on the correlation between the constant terms and the independent variables. Then, the integral can be evaluated by simulation. Practically, however, there is no guidance on how to formulate these beliefs. Furthermore, the choice of $f(\alpha_i \mid y_i, x_i, \psi)$ is not updated during the estimation process, and it is not truly a prior, just as in frequentist random effects. Alternatively, we can specify only the marginal density of $\alpha_i$ and use Bayes' rule to get

$f(\alpha_i \mid y_i, x_i, \psi) \propto L(\psi \mid y_i, x_i, \alpha_i) \times p(\alpha_i \mid x_i, \psi). \quad (2.8)$
Again, there is not much guidance on how to specify $p(\alpha_i \mid x_i, \psi)$. Additionally, in order to be consistent with assumption (2.6), we need to assume a density for $\alpha_i$ that does not involve $\psi$. But now the issue is not as important: it is $f(\alpha_i \mid y_i, x_i, \psi)$ and not the assumed $p(\alpha_i \mid x_i, \psi)$ that is used for the integration. That is, $p(\alpha_i \mid x_i, \psi)$ is a prior in the Bayesian sense of the term, since it is filtered through the likelihood for a given $\psi$ before it is used for the integration. Accordingly, $f(\alpha_i \mid y_i, x_i, \psi)$ is the posterior density of $\alpha_i$.
Before examining the role of the prior in the estimation, we note that the frequentist random-effects approach can be derived by using (2.8) to simplify (2.7). If $\alpha_i$ is assumed to be independent of $x_i$ and the parameters of its density are different from $\psi$, then the unconditional likelihood does not depend on $\psi$ and the estimator becomes similar to the one Greene [4] suggests:

$L(\theta \mid y_i, x_i) = \int_{A_i} f(y_i \mid x_i, \theta, \alpha_i)\, p(\alpha_i)\, d\alpha_i. \quad (2.9)$
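When $p(\alpha_i)$ is normal, the integral in (2.9) can be approximated by Gauss-Hermite quadrature, the route mentioned later for the random-effects estimator. The sketch below shows the generic change of variables; the conditional density passed in is a placeholder for whichever model applies, and the number of nodes and the toy example are arbitrary choices.

```python
import numpy as np

def re_likelihood_gh(cond_lik, y_i, x_i, theta, mu_a, sigma_a, n_nodes=20):
    """Approximate (2.9) when alpha_i ~ N(mu_a, sigma_a^2) via Gauss-Hermite.

    cond_lik(y_i, x_i, theta, alpha) must return f(y_i | x_i, theta, alpha).
    Uses E[g(alpha)] ~ (1/sqrt(pi)) * sum_k w_k * g(mu + sqrt(2)*sigma*x_k).
    """
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    alphas = mu_a + np.sqrt(2.0) * sigma_a * nodes
    vals = np.array([cond_lik(y_i, x_i, theta, a) for a in alphas])
    return np.sum(weights * vals) / np.sqrt(np.pi)

# Toy example with a normal linear model as the conditional density
def normal_cond_lik(y_i, x_i, theta, alpha):
    beta, sigma_e = theta
    resid = y_i - alpha - x_i * beta
    return np.prod(np.exp(-0.5 * (resid / sigma_e) ** 2) / (sigma_e * np.sqrt(2 * np.pi)))

print(re_likelihood_gh(normal_cond_lik, np.array([1.0, 0.8]), np.array([0.5, 0.2]),
                       (1.0, 0.5), mu_a=0.0, sigma_a=1.0))
```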
It is apparent that there is an advantage in basing the estimation on the likelihood function in (2.7) rather than (2.9). By sampling from $f(\alpha_i \mid y_i, x_i, \psi)$ instead of $p(\alpha_i)$, we are using information contained in the data on the way $\alpha_i$ is correlated with $x_i$. For example, we may assume that in the prior $\alpha_i$ is normally or uniformly distributed and that it is independent of the data. But even this prior independence assumption will not impose independence in the estimation, because of the filtering of the prior through the likelihood in (2.8). As is the case in Bayesian inference, the role of the prior density of $\alpha_i$ diminishes as the number of time observations per group increases. But the short time dimension of the panel is the original cause of the incidental parameters problem. The estimator proposed here is still subject to the critique that was developed for the corresponding Bayesian estimator: the density of the data will not dominate the prior as $N \to \infty$ with $T$ held fixed. On the other hand, when the true data-generating process is such that the group-specific constant terms are correlated with the independent variables, the method proposed here will mitigate the bias from which the random-effects estimator suffers.
3. Calculations and Some Practical Considerations
The first step in the application of the MC maximum likelihood estimator developed in the previous section is to sample from the posterior of $\alpha_i$ given $\psi$. Since this posterior density involves the likelihood function, its functional form will, in general, not resemble the kernel of any known distribution. But this posterior is unidimensional for every $\alpha_i$, and simple random sampling techniques, such as rejection sampling, can be used. Of course, the Metropolis-Hastings algorithm provides a more general framework for sampling from any distribution. In the context of the posterior in (2.8), a Metropolis-Hastings algorithm could be used to construct a Markov chain for each $\alpha_i$, while holding $\psi$ fixed. Given that $M$ random numbers are drawn from the posterior of each $\alpha_i$, the simulated likelihood function for the entire dataset can be written as

$\tilde{L}(\theta \mid y, X, \alpha) = \prod_{i=1}^{N} f(y_i \mid x_i, \psi) \prod_{i=1}^{N} \left[ \frac{1}{M} \sum_{j=1}^{M} \frac{f(y_i \mid x_i, \theta, \alpha_{ij})}{f(y_i \mid x_i, \psi, \alpha_{ij})} \right], \quad (3.1)$

where $\alpha_{ij}$ is the $j$th draw from $f(\alpha_i \mid y_i, x_i, \psi)$.
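The following fragment sketches the random-walk Metropolis-Hastings step described above for a single $\alpha_i$, holding $\psi$ fixed. The conditional log-likelihood and the log prior are passed in as functions; the proposal scale, the number of draws, and the flat-prior default are illustrative assumptions, not the authors' settings.

```python
import numpy as np

def sample_alpha_mh(log_cond_lik, y_i, x_i, psi, n_draws=3000,
                    alpha0=0.0, prop_sd=0.5, log_prior=lambda a: 0.0, seed=0):
    """Random-walk MH draws from f(alpha_i | y_i, x_i, psi) in (2.8).

    log_cond_lik(y_i, x_i, psi, alpha) returns log f(y_i | x_i, psi, alpha).
    A flat prior (log_prior = 0) corresponds to a uniform prior on alpha_i.
    """
    rng = np.random.default_rng(seed)
    draws = np.empty(n_draws)
    alpha = alpha0
    logp = log_cond_lik(y_i, x_i, psi, alpha) + log_prior(alpha)
    for m in range(n_draws):
        prop = alpha + prop_sd * rng.standard_normal()
        logp_prop = log_cond_lik(y_i, x_i, psi, prop) + log_prior(prop)
        if np.log(rng.uniform()) < logp_prop - logp:   # accept/reject step
            alpha, logp = prop, logp_prop
        draws[m] = alpha
    return draws
```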
The MC likelihood function can be maximized with respect to $\theta$. The first term in the product is constant with respect to $\theta$ and can be ignored during the optimization. The relevant part of the simulated log-likelihood is

$\log \tilde{L}(\theta \mid y, X, \alpha) = \sum_{i=1}^{N} \log\!\left[ \frac{1}{M} \sum_{j=1}^{M} \frac{f(y_i \mid x_i, \theta, \alpha_{ij})}{f(y_i \mid x_i, \psi, \alpha_{ij})} \right]. \quad (3.2)$
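A direct translation of (3.2) is sketched below. It assumes that posterior draws $\alpha_{ij}$ have already been obtained (for instance with the Metropolis-Hastings fragment above) and that a log conditional density is supplied; computing the ratio on the log scale is a numerical-stability choice rather than part of the formula.

```python
import numpy as np
from scipy.special import logsumexp

def mc_loglik(theta, psi, y, X, alpha_draws, log_cond_lik):
    """Simulated log-likelihood (3.2).

    y, X        : lists of per-group data (y_i, x_i), i = 1..N
    alpha_draws : list of length-M arrays of draws from f(alpha_i | y_i, x_i, psi)
    log_cond_lik(y_i, x_i, params, alpha) returns log f(y_i | x_i, params, alpha).
    """
    total = 0.0
    for y_i, x_i, a_i in zip(y, X, alpha_draws):
        log_ratio = np.array([log_cond_lik(y_i, x_i, theta, a)
                              - log_cond_lik(y_i, x_i, psi, a) for a in a_i])
        # log of the average ratio, computed stably on the log scale
        total += logsumexp(log_ratio) - np.log(len(a_i))
    return total
```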
One practical issue that remains to be resolved is the choice of $\psi$. Theoretically, this choice should not matter. In practice, however, when the calculations are carried out on finite-precision machines, it does. In principle, $f(y_i, \alpha_i \mid x_i, \psi)$ should mimic the shape of $f(y_i, \alpha_i \mid x_i, \theta)$, as it plays the role of an importance density for the evaluation of the integral in (2.4). If $\psi$ is chosen to be far away from $\theta$, then the two densities will have probability mass over different locations and the ratio in (3.2) will be ill-behaved at the points of the parameter space where the proposal density approaches zero while the likelihood does not. Gelfand and Carlin [10] propose solving this problem by choosing an initial $\psi$ and running some iterations, replacing $\psi$ with the MC maximum likelihood estimates from the previous step. In the final step, the number of samples is increased to reduce the Monte Carlo standard errors. The estimator produced by this iterative procedure has the same theoretical properties as an estimator obtained by choosing any arbitrary $\psi$. On the other hand, this iterative procedure introduces another problem: if during this series of iterations $\psi$ converges to the value of $\theta$ supported by the data, then in the subsequent iteration the ratio in (3.2) will be approximately unity. In practice, the simulated likelihood will never be exactly one due to the noise introduced through the random sampling. As a consequence, the MC likelihood function will no longer depend on $\theta$, or at least it will be very flat. This leads to numerical complications that now have to do with the routine used for maximizing the likelihood. A way to overcome this problem is to introduce some noise to the estimate of $\theta$ from iteration, say, $k$ before using it in place of $\psi$ for iteration $k+1$. Additionally, increasing the variance parameters contained in $\psi$ will result in the proposal density having heavier tails than the likelihood, alleviating in this way the numerical instability problem in the ratio of the two densities.
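The iterative scheme just described can be organized as in the following sketch, which relies on two user-supplied callables whose assumed signatures are given in the docstring (they can be thin wrappers around the fragments above). The number of outer iterations and the jitter scale are illustrative choices, not prescriptions from the paper.

```python
import numpy as np
from scipy.optimize import minimize

def mcml_iterate(psi0, draw_alphas, mc_loglik, n_outer=4, jitter_sd=0.05, seed=0):
    """Iterative choice of psi in the spirit of Gelfand and Carlin [10].

    draw_alphas(psi)                    -> posterior draws of every alpha_i at psi
    mc_loglik(theta, psi, alpha_draws)  -> simulated log-likelihood as in (3.2)
    jitter_sd is the scale of the noise added to theta-hat before it is reused as
    psi; parameters are assumed to be stored on unrestricted (e.g., log) scales.
    """
    rng = np.random.default_rng(seed)
    psi = np.asarray(psi0, dtype=float)
    theta_hat = psi.copy()
    for _ in range(n_outer):
        alpha_draws = draw_alphas(psi)
        theta_hat = minimize(lambda th: -mc_loglik(th, psi, alpha_draws),
                             psi, method="BFGS").x
        # perturb the estimate so psi never coincides exactly with theta-hat
        psi = theta_hat + jitter_sd * rng.standard_normal(theta_hat.shape)
    return theta_hat
```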
4. Monte Carlo Experiments
In this section, we perform a set of Monte Carlo experiments on the stochastic frontier model [12, 13]. Wang and Ho [14] have analytically derived the likelihood function for the class of stochastic frontier models that have the scaling property [15] by using within and first-difference transformations. Instead of restricting attention to this class of models, the formulation proposed by Meeusen and van den Broeck [13] is used here:

$y_{it} = \alpha_i + x_{it}'\beta + v_{it} - u_{it}, \quad (4.1)$
where the noise component of the error term is assumed to follow a normal distribution with mean zero and variance $\sigma_v^2$, while the inefficiency component of the error is assumed to follow an exponential distribution with rate $\lambda$. The technical efficiency score for observation $i$ in period $t$ is defined as $TE_{it} = \exp\{-u_{it}\}$ and assumes values on the unit interval. Under the described specification and assuming independence over $t$, the contribution of group $i$ to the likelihood conditional on the fixed effects is

$f(y_i \mid x_i, \theta, \alpha_i) \equiv L(\theta \mid y_i, x_i, \alpha_i) = \prod_{t=1}^{T} \frac{1}{\lambda}\, \Phi\!\left( -\frac{\varepsilon_{it}}{\sigma_v} - \frac{\sigma_v}{\lambda} \right) \exp\!\left\{ \frac{\varepsilon_{it}}{\lambda} + \frac{1}{2}\left(\frac{\sigma_v}{\lambda}\right)^2 \right\}, \quad (4.2)$

where $\varepsilon_{it} = y_{it} - \alpha_i - x_{it}'\beta$ and $\theta = \left(\beta', \log \sigma_v^2, \log \lambda^2\right)'$.
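For later reuse, a literal coding of (4.2) on the log scale is sketched below; the storage order of the parameter vector follows the definition of $\theta$ given after the equation and is otherwise an assumption made for the example.

```python
import numpy as np
from scipy.stats import norm

def group_cond_loglik(y_i, X_i, alpha_i, theta):
    """Log of (4.2): normal/exponential stochastic frontier, one group.

    theta = (beta, log sigma_v^2, log lambda^2); the layout is an assumption
    matching the definition of theta in the text.
    """
    k = X_i.shape[1]
    beta = theta[:k]
    sigma_v = np.sqrt(np.exp(theta[k]))
    lam = np.sqrt(np.exp(theta[k + 1]))
    eps = y_i - alpha_i - X_i @ beta
    ll = (-np.log(lam)
          + norm.logcdf(-eps / sigma_v - sigma_v / lam)
          + eps / lam + 0.5 * (sigma_v / lam) ** 2)
    return np.sum(ll)
```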
A major objective of an application of a stochastic frontier model is usually the estimation not only of the model's parameters, but also of the observation-specific efficiency scores. These estimates can be obtained as

$E\!\left[e^{-u_{it}} \mid \varepsilon_{it}\right] = \frac{\Phi\!\left(\tilde{\mu}_{it}/\sigma_v - \sigma_v\right)}{\Phi\!\left(\tilde{\mu}_{it}/\sigma_v\right)} \exp\!\left\{ -\tilde{\mu}_{it} + \frac{\sigma_v^2}{2} \right\}, \quad (4.3)$

where $\tilde{\mu}_{it} = -\varepsilon_{it} - \sigma_v^2/\lambda$.
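A direct implementation of (4.3) is sketched below, using the definition of $\tilde{\mu}_{it}$ just given; working with log CDFs rather than the CDF ratio is a numerical choice, not part of the formula.

```python
import numpy as np
from scipy.stats import norm

def efficiency_score(eps, sigma_v, lam):
    """E[exp(-u_it) | eps_it] from (4.3), with mu_tilde = -eps - sigma_v^2 / lambda."""
    mu_t = -eps - sigma_v ** 2 / lam
    log_ratio = norm.logcdf(mu_t / sigma_v - sigma_v) - norm.logcdf(mu_t / sigma_v)
    return np.exp(log_ratio - mu_t + 0.5 * sigma_v ** 2)

print(efficiency_score(np.array([-0.2, 0.0, 0.3]), sigma_v=0.3, lam=0.3))
```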
Three experiments are performed for panels of varying length ($T = 4$, 8, and 16), while keeping the total number of observations (cross-section and time dimensions combined) fixed at 2000. The sampling properties of four estimators are examined: (i) linear fixed effects (within) estimator, (ii) MC maximum likelihood, (iii) simple random effects, and (iv) correlated random effects using Mundlak's approach. The data are generated in the following sequence (a code sketch of this DGP appears below):
(i) $N$ $\alpha_i$'s are drawn from a normal distribution with mean zero and variance 2,
(ii) for each $i$, $T$ draws are obtained from a normal distribution with mean $\alpha_i + \frac{1}{2}\alpha_i^2$ and standard deviation equal to $1/2$ for each of two independent variables, $x_1$ and $x_2$,
(iii) $NT$ draws are obtained from a normal distribution with zero mean and standard deviation equal to 0.3 for $v_{it}$,
(iv) $NT$ draws are obtained from an exponential distribution with rate equal to 0.3 for $u_{it}$,
(v) the dependent variable is constructed as $y_{it} = \alpha_i + 0.7 x_{1,it} + 0.4 x_{2,it} + v_{it} - u_{it}$.
For the MC maximum likelihood estimator, uniform priors are assumed for the $\alpha_i$'s and their integration from the likelihood is based on 3000 random draws from their posterior. These draws are obtained using a Metropolis-Hastings random-walk algorithm. For the random-effects estimators, each $\alpha_i$ is assumed to follow a normal distribution with mean $\mu$ and variance $\sigma_\alpha^2$. Integration of the unobserved effects for the random-effects estimator is carried out using Gaussian quadratures. Although integration of the unobserved effects can be carried out using simulation, as suggested by Greene [4], under normally distributed $\alpha_i$'s integration through a Gauss-Hermite quadrature reduces computational cost substantially. Table 1 presents the means, mean squared errors, and percent biases for the four estimators, based on 1000 repetitions. The linear fixed-effects estimator is unbiased with respect to the slope parameters, as well as with respect to the standard deviation of the composite error term. This estimator, however, cannot distinguish between the two components of the error. Nevertheless, it can be used to provide group-specific but time-invariant efficiency scores using the approach of Schmidt and Sickles [16]. This approach has the added disadvantage of treating all unobserved heterogeneity as inefficiency. On the other hand, as expected, the simple random-effects estimator is biased with respect to the slope parameters. Interestingly, however, the bias is much smaller for the variance parameter of the inefficiency component of the error term. This suggests that one may use the simple random-effects estimator to obtain an indication of the distribution of the industry-level efficiency even in the case where the group effects are correlated with the independent variables. The MC maximum likelihood and the correlated random-effects estimators are virtually unbiased both with respect to the slope and the variance parameters, even for small $T$.
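The data-generating process in steps (i)-(v) can be reproduced as in the sketch below; the seed and array layout are arbitrary, and the interpretation of the exponential parameter and of the regressor mean in step (ii) follow the reconstruction given above and are flagged as assumptions in the comments.

```python
import numpy as np

def simulate_sf_panel(N, T, seed=0):
    """Simulate the panel DGP of steps (i)-(v)."""
    rng = np.random.default_rng(seed)
    alpha = rng.normal(0.0, np.sqrt(2.0), N)              # (i) group effects, variance 2
    loc = alpha + 0.5 * alpha ** 2                        # (ii) regressor mean (as reconstructed)
    x1 = rng.normal(loc[:, None], 0.5, (N, T))
    x2 = rng.normal(loc[:, None], 0.5, (N, T))
    v = rng.normal(0.0, 0.3, (N, T))                      # (iii) noise
    u = rng.exponential(0.3, (N, T))                      # (iv) inefficiency; lambda = 0.3
                                                          #      read as E[u], matching (4.2) (assumption)
    y = alpha[:, None] + 0.7 * x1 + 0.4 * x2 + v - u      # (v) dependent variable
    return y, x1, x2, alpha

y, x1, x2, alpha = simulate_sf_panel(N=500, T=4)
print(y.shape)   # (500, 4)
```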
Table 1: Simulation results for the stochastic frontier model. For each of the three designs ($N = 500$, $T = 4$; $N = 250$, $T = 8$; $N = 125$, $T = 16$), the table reports the mean, MSE, and percent bias, over 1000 repetitions, of the estimates of $\beta_1 = 0.7$, $\beta_2 = 0.4$, $\sigma_v = 0.3$, $\lambda = 0.3$, $\log \sigma_v^2$, and $\log \lambda^2$ for the linear fixed effects, Monte Carlo maximum likelihood, simple random effects, and correlated random effects estimators; the linear fixed effects estimator reports only the composite standard deviation $\sigma = (\sigma_v^2 + \lambda^2)^{1/2}$.
Furthermore, the mean squared errors of both estimators decrease as the time dimension of the panel increases. For the MC maximum likelihood estimator this can be attributed to the fact that, as $T$ increases, more information per group is used to formulate the posterior of $\alpha_i$. Obtaining estimates of observation-specific efficiency scores involves first generating estimates of the group intercepts. Estimates of the group effects can be obtained for the random-effects estimators using group averages of the dependent and independent variables, accounting at the same time for the skewness of the composite error term [14]. On the other hand, the MC maximum likelihood estimator can provide estimates of the $\alpha_i$'s by treating them as quantities to be estimated by simulation after the estimation of the common parameters of the model. In both estimators, the $\alpha_i$'s and $\theta$ are replaced in (4.3) by their point estimates to obtain estimates of the observation-specific efficiency scores. Nevertheless, both the random-effects and the MC maximum likelihood estimators of the $\alpha_i$'s are only $T$-consistent. A different approach, which is consistent with treating the $\alpha_i$'s as missing data, is to integrate them from the expectation in (4.3). That is, one may obtain the expectation $E[e^{-u_{it}} \mid \varepsilon_{it}]$ unconditionally on the missing data. In this way, the uncertainty associated with the $\alpha_i$'s is accommodated when estimating observation-specific efficiency scores. The integration of the $\alpha_i$'s is achieved using the following procedure (a code sketch is given below):
(1) draw $M$ samples from $f(\alpha_i \mid y_i, x_i, \hat{\theta})$, where $\hat{\theta}$ is either the random-effects or the MC maximum likelihood point estimate,
(2) for each draw $j = 1, 2, \ldots, M$, evaluate $E[e^{-u_{it}} \mid \varepsilon_{it,j}]$, where $\varepsilon_{it,j} = y_{it} - \alpha_{i,j} - x_{it}'\hat{\beta}$,
(3) take the sample mean of the $E[e^{-u_{it}} \mid \varepsilon_{it,j}]$'s over $j$.
By the law of iterated expectations, this sample mean will converge to the expectation of $e^{-u_{it}}$ unconditionally on $\alpha_i$. In the random-effects model, integration can also be performed by quadratures rather than simulation. Figure 1 presents scatter plots of the actual versus the predicted efficiency scores for the MC maximum likelihood and the correlated random-effects estimators for a particular Monte Carlo repetition. Apart from the known problem of underestimating the scores of highly efficient observations, the approach of integrating the $\alpha_i$'s from the expectation of $e^{-u_{it}} \mid \varepsilon_{it}$ produces good predictions of the efficiency scores for the MC maximum likelihood estimator. On the other hand, the predictions of the correlated random-effects estimator are more dispersed around the 45° line. The MC maximum likelihood estimator has an advantage over the random-effects estimator because it does not need to specify a systematic relationship between the group effects and the independent variables. In other words, the quality of the estimates of the efficiency scores from the random-effects estimator deteriorates if there is a lot of noise in the relationship between the group effects and the group means of the independent variables.
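Steps (1)-(3) can be wired together as in the following sketch, which reuses posterior draws of $\alpha_i$ and the score formula (4.3) sketched earlier; the function names and the assumption that a compatible `efficiency_score` helper is available are illustrative.

```python
import numpy as np

def integrated_efficiency(y_i, X_i, theta_hat, alpha_draws, sigma_v, lam):
    """Average E[exp(-u_it) | eps_it] over posterior draws of alpha_i (steps 1-3).

    alpha_draws : array of M draws from f(alpha_i | y_i, x_i, theta_hat)
    Relies on efficiency_score() defined earlier for formula (4.3).
    """
    k = X_i.shape[1]
    beta_hat = theta_hat[:k]
    scores = []
    for a in alpha_draws:                       # step (2): evaluate (4.3) per draw
        eps_j = y_i - a - X_i @ beta_hat
        scores.append(efficiency_score(eps_j, sigma_v, lam))
    return np.mean(scores, axis=0)              # step (3): average over draws
```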
5. Application
This section presents an application of the estimators discussed in this paper to a panel of European manufacturing industries. The dataset comes from the EU-KLEMS project [17] and covers the period from 1970 to 2007. It contains annual information at the country level for industries classified according to the 4-digit NACE revision 2 system. The part of the dataset used for the application covers 10 manufacturing industries for 6 countries for which the required data series are complete (Denmark, Finland, Italy, Spain, The Netherlands, and the UK).
Figure 1: Actual versus predicted efficiency scores with simulated data (scatter plots for the MC maximum likelihood and correlated random effects estimators at T = 4, T = 8, and T = 16; both axes range from 0 to 1).
The production frontier is specified as Cobb-Douglas in capital stock and labor input, with value added being the dependent variable. A linear time trend is included to capture autonomous technological progress. The model specification is

$\log y_{it} = \alpha_i + \beta_K \log K_{it} + \beta_L \log L_{it} + \beta_t\, t + v_{it} - u_{it}, \quad (5.1)$
Table 2: Results for the EU-KLEMS model.

       Linear fixed effects   MC maximum likelihood   Simple random effects   Correlated random effects
βK     0.3262                 0.2687                  0.3478                  0.1833
βL     0.7265                 0.7323                  0.6134                  0.7388
βt     0.0202                 0.0211                  0.0173                  0.0224
σv     —                      0.0755                  0.0796                  0.0705
λ      —                      0.2057                  0.2033                  0.2087
where it is assumed that $v_{it} \sim N(0, \sigma_v^2)$ and $u_{it} \sim \text{Exp}(\lambda)$. Each industry in each country is treated as a group with its own intercept, but the production technologies of all industries across all countries are assumed to be represented by the same slope parameters. The model is estimated using the linear fixed-effects, MC maximum likelihood, and simple and correlated random-effects estimators. The results are presented in Table 2. Given that under the strict model specification the group effects are expected to be correlated with the regressors, it does not come as a surprise that relatively large discrepancies between the parameter estimates of the linear fixed-effects and the simple random-effects estimators appear. Nonnegligible discrepancies are also observed between the linear fixed-effects estimate of $\beta_K$ and the corresponding estimates from the estimators that account for possible correlation between the group effects and the independent variables. Although this result appears to be in contrast with the findings of the Monte Carlo simulations, we need to keep in mind that the Monte Carlo findings are valid for the estimators on average, while the application considers a single dataset where particularities could lead to these discrepancies. For example, limited within variation in the capital and labor variables could induce multicollinearity and render the point estimates less precise. On the other hand, all three estimators that can distinguish between noise and inefficiency effects produce very similar parameter estimates for the variances of the two error terms. The estimates of the parameter associated with the time trend suggest that the industries experience, on average, productivity growth at a rate slightly larger than 2%. Figure 2 presents kernel density estimates of the observation-specific technical efficiency scores obtained by integrating the group effects from the expectation in (4.3) using the MC maximum likelihood and the two random-effects estimators. It appears that only the MC maximum likelihood estimator produces a distribution of technical efficiency scores similar to the original assumptions imposed by the model, with the majority of the industries being highly efficient. On the other hand, the simple random-effects estimator yields a bimodal distribution of efficiency scores.
6. Conclusions and Further Remarks This paper proposes a general procedure for choosing appropriate densities for frequentist integrated-likelihood methods in panel data models. The proposed method requires the placement of priors on the density of the group-specific constant terms. These priors, however, are updated during estimation and in this way their impact on the final parameter estimates is minimized. A set of Monte Carlo experiments were conducted to examine the sampling properties of the proposed estimator and to compare them with the properties of existing relevant estimators. Although the experiments were conducted in the specific context of a stochastic
Figure 2: Kernel density estimates of efficiency scores from the three estimators (MC maximum likelihood, correlated random effects, and simple random effects).
frontier model, the proposed estimator can be generalized to other nonlinear models. The results suggest that, even in very short panels, both the MC maximum likelihood estimator and random-effects estimator augmented by the group averages of the regressors are virtually unbiased in the stochastic frontier model. Returning to Chamberlain’s 7 observation that in panel-data settings the contribution of the prior is never dominated by the data, the results from the Monte Carlo experiments suggest that this is not an issue of major importance. It appears that when the objective is not the estimation of the incidental parameters but their integration from the likelihood, then even very vague priors do not introduce any bias in the common parameter estimates. In the end, which estimator should be chosen? From the estimators considered here, the MC and Mundlak’s random-effects estimators are able to distinguish inefficiency from group- and time-specific unobserved heterogeneity, while being reasonably unbiased with respect to the common parameters. The difference between the two is based on theoretical grounds. The MC estimator is able to account for the correlation of the group-specific parameters with the regressors in any unknown form. On the other hand, the correlated random-effects estimator lacks such a theoretical support; there still exist no analytical results on the properties of this estimator in nonlinear settings. Another disadvantage of the correlated random-effects estimator is that it requires the inclusion of the group means of independent variables in the model. This approach could induce a high degree of multicollinearity if there is little within variability in the data. Lastly, in the specific context of stochastic frontier models, the MC maximum likelihood estimator provides better estimates of the observation-specific efficiency scores.
References 1 J. Neyman and E. L. Scott, “Consistent estimates based on partially consistent observations,” Econometrica, vol. 16, pp. 1–32, 1948. 2 J. J. Heckman, “The incidental parameters problem and the problem of initial conditions in estimating a discrete time-discrete data stochastic process,” in Structural Analysis of Discrete Data with Econometric
Applications, C. F. Manski and D. McFadden, Eds., pp. 179–195, MIT Press, Cambridge, Mass, USA, 1981.
3 W. Greene, “Fixed effects and bias due to the incidental parameters problem in the tobit model,” Econometric Reviews, vol. 23, no. 2, pp. 125–147, 2004.
4 W. Greene, “Fixed and random effects in stochastic frontier models,” Journal of Productivity Analysis, vol. 23, no. 1, pp. 7–32, 2005.
5 W. Greene, “Reconsidering heterogeneity in panel data estimators of the stochastic frontier model,” Journal of Econometrics, vol. 126, no. 2, pp. 269–303, 2005.
6 T. Lancaster, “The incidental parameter problem since 1948,” Journal of Econometrics, vol. 95, no. 2, pp. 391–413, 2000.
7 G. Chamberlain, “Panel data,” in Handbook of Econometrics, Z. Griliches and M. D. Intriligator, Eds., vol. 2, pp. 1247–1313, North-Holland, Amsterdam, The Netherlands, 1984.
8 A. Abdulai and H. Tietje, “Estimating technical efficiency under unobserved heterogeneity with stochastic frontier models: application to northern German dairy farms,” European Review of Agricultural Economics, vol. 34, no. 3, pp. 393–416, 2007.
9 Y. Mundlak, “On the pooling of time series and cross section data,” Econometrica, vol. 46, no. 1, pp. 69–85, 1978.
10 A. E. Gelfand and B. P. Carlin, “Maximum-likelihood estimation for constrained- or missing-data models,” The Canadian Journal of Statistics, vol. 21, no. 3, pp. 303–311, 1993.
11 C. J. Geyer, “On the convergence of Monte Carlo maximum likelihood calculations,” Journal of the Royal Statistical Society B, vol. 56, no. 1, pp. 261–274, 1994.
12 D. Aigner, C. A. K. Lovell, and P. Schmidt, “Formulation and estimation of stochastic frontier production function models,” Journal of Econometrics, vol. 6, no. 1, pp. 21–37, 1977.
13 W. Meeusen and J. van den Broeck, “Efficiency estimation from Cobb-Douglas production functions with composed error,” International Economic Review, vol. 18, pp. 435–444, 1977.
14 H.-J. Wang and C.-W. Ho, “Estimating fixed-effect panel stochastic frontier models by model transformation,” Journal of Econometrics, vol. 157, no. 2, pp. 286–296, 2010.
15 A. Alvarez, C. Amsler, L. Orea, and P. Schmidt, “Interpreting and testing the scaling property in models where inefficiency depends on firm characteristics,” Journal of Productivity Analysis, vol. 25, no. 3, pp. 201–212, 2006.
16 P. Schmidt and R. C. Sickles, “Production frontiers and panel data,” Journal of Business & Economic Statistics, vol. 2, no. 4, pp. 367–374, 1984.
17 M. O’Mahony and M. P. Timmer, “Output, input and productivity measures at the industry level: the EU KLEMS database,” Economic Journal, vol. 119, no. 538, pp. F374–F403, 2009.
Hindawi Publishing Corporation Journal of Probability and Statistics Volume 2011, Article ID 617652, 17 pages doi:10.1155/2011/617652
Research Article
Panel Unit Root Tests by Combining Dependent P Values: A Comparative Study
Xuguang Sheng1 and Jingyun Yang2
1 Department of Economics, American University, Washington, DC 20016, USA
2 Department of Biostatistics and Epidemiology, University of Oklahoma Health Sciences Center, Oklahoma City, OK 73104, USA
Correspondence should be addressed to Xuguang Sheng,
[email protected] and Jingyun Yang,
[email protected] Received 27 June 2011; Accepted 25 August 2011 Academic Editor: Mike Tsionas Copyright q 2011 X. Sheng and J. Yang. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. We conduct a systematic comparison of the performance of four commonly used P value combination methods applied to panel unit root tests: the original Fisher test, the modified inverse normal method, Simes test, and the modified truncated product method TPM. Our simulation results show that under cross-section dependence the original Fisher test is severely oversized, but the other three tests exhibit good size properties. Simes test is powerful when the total evidence against the joint null hypothesis is concentrated in one or very few of the tests being combined, but the modified inverse normal method and the modified TPM have good performance when evidence against the joint null is spread among more than a small fraction of the panel units. These differences are further illustrated through one empirical example on testing purchasing power parity using a panel of OECD quarterly real exchange rates.
1. Introduction
Combining significance tests, or P values, has been a source of considerable research in statistics since Tippett 1 and Fisher 2. For a systematic comparison of methods for combining P values from independent tests, see the studies by Hedges and Olkin 3 and Loughin 4. Despite the burgeoning statistical literature on combining P values, these techniques have not been used much in panel unit root tests until recently. Maddala and Wu 5 and Choi 6 are among the first who attempted to test for unit roots in panels by combining independent P values. More recent contributions include those by Demetrescu et al. 7, Hanck 8, and Sheng and Yang 9. Combining P values has several advantages over combining test statistics in that (i) it allows different specifications, such as different
deterministic terms and lag orders, for each panel unit; (ii) it does not require a panel to be balanced; and (iii) P values derived from continuous test statistics have a uniform distribution under the null hypothesis regardless of the test statistic or the distribution from which they arise, so the combination can be carried out for any unit root test. While the formulation of the joint null hypothesis H0 (all of the time series in the panel are nonstationary) is relatively uncontroversial, the specification of the alternative hypothesis critically depends on what assumption one makes about the nature of the heterogeneity of the panel. Recent contributions include O'Connell 10, Phillips and Sul 11, Bai and Ng 12, Chang 13, Moon and Perron 14, and Pesaran 15. The problem of selecting a test is complicated by the fact that there are many different ways in which H0 can be false. In general, we cannot expect one test to be sensitive to all possible alternatives, so no single P value combination method is uniformly the best. The goal of this paper is to make a detailed comparison, via both simulations and an empirical example, of some commonly used P value combination methods and to provide specific recommendations regarding their use in panel unit root tests. The plan of the paper is as follows. Section 2 briefly reviews the methods of combining P values. The small-sample performance of these methods is investigated in Section 3 using Monte Carlo simulations. Section 4 provides the empirical application, and Section 5 concludes the paper.
2. P Value Combination Methods
Consider the model

y_{it} = (1 - \alpha_i)\mu_i + \alpha_i y_{i,t-1} + \epsilon_{it}, \quad i = 1, \ldots, N; \; t = 1, \ldots, T.   (2.1)
Heterogeneity in both the intercept and the slope is allowed in (2.1). This specification is commonly used in the literature; see Breitung and Pesaran 16 for a recent review. Equation (2.1) can be rewritten as

\Delta y_{it} = -\phi_i \mu_i + \phi_i y_{i,t-1} + \epsilon_{it},   (2.2)
where Δy_{it} = y_{it} − y_{i,t−1} and φ_i = α_i − 1. The null hypothesis is

H_0: \phi_1 = \phi_2 = \cdots = \phi_N = 0,   (2.3)

and the alternative hypothesis is

H_1: \phi_1 < 0, \; \phi_2 < 0, \; \ldots, \; \phi_{N_0} < 0, \quad N_0 \le N.   (2.4)
Let S_{i,T_i} be a test statistic for the ith unit of the panel in (2.2), and let the corresponding P value be defined as p_i = F(S_{i,T_i}), where F(·) denotes the cumulative distribution function (c.d.f.) of S_{i,T_i}. We assume that, under H_0, S_{i,T_i} has a continuous distribution function. This assumption is a regularity condition that ensures a uniform distribution of the P values, regardless of the test statistic or distribution from which they arise. Thus, P value combinations are nonparametric in the sense that they do not depend on the parametric form of the data. The nonparametric nature of combined P values gives them great flexibility in applications. In the rest of this section, we briefly review the P value combination methods in the context of panel unit root tests. The first test, proposed by Fisher 2, is defined as

P = -2 \sum_{i=1}^{N} \ln p_i,   (2.5)
which has a χ² distribution with 2N degrees of freedom under the assumption of cross-section independence of the P values. Maddala and Wu 5 introduced this method to panel unit root tests, and Choi 6 modified it to the case of infinite N. The inverse normal method, attributed to Stouffer et al. 17, is another often-used method, defined as

Z = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} \Phi^{-1}(p_i),   (2.6)
where Φ(·) is the c.d.f. of the standard normal distribution. Under H_0, Z ∼ N(0, 1). Choi 6 first applied this method to panel unit root tests assuming cross-section independence among the panel units. To account for cross-section dependence, Hartung 18 developed a modified inverse normal method by assuming a constant correlation across the probits t_i,

\operatorname{cov}(t_i, t_j) = \rho, \quad \text{for } i \ne j, \; i, j = 1, \ldots, N,   (2.7)
where t_i = Φ^{−1}(p_i). He proposed to estimate ρ in finite samples by

\hat{\rho}^{*} = \max\!\left(-\frac{1}{N-1}, \, \hat{\rho}\right),   (2.8)

where \hat{\rho} = 1 - \frac{1}{N-1}\sum_{i=1}^{N}(t_i - \bar{t})^2 and \bar{t} = \frac{1}{N}\sum_{i=1}^{N} t_i. The modified inverse normal test statistic is formed as

Z^{*} = \frac{\sum_{i=1}^{N} t_i}{\sqrt{N + N(N-1)\left[\hat{\rho}^{*} + \kappa\sqrt{2/(N+1)}\,\bigl(1 - \hat{\rho}^{*}\bigr)\right]}},   (2.9)

where κ = 0.1(1 + 1/(N − 1) − \hat{\rho}^{*}) is a parameter designed to improve the small-sample performance of the test statistic. Under the null hypothesis, Z* ∼ N(0, 1). Demetrescu et al. 7 showed that this method is robust to certain deviations from the assumption of constant correlation between the probits in panel unit root tests.
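To make the three statistics reviewed so far concrete, the following minimal Python sketch computes Fisher's P, the inverse normal Z, and Hartung's modified inverse normal Z* from a vector of unit-by-unit p-values. This is our own illustration rather than the authors' code (the paper's simulations were run in MATLAB), and the function names and example p-values are assumptions made for the example.

```python
import numpy as np
from scipy import stats

def fisher_stat(pvals):
    """Fisher's P = -2 * sum(log p_i), chi-square with 2N df under independence (2.5)."""
    pvals = np.asarray(pvals)
    stat = -2.0 * np.sum(np.log(pvals))
    return stat, stats.chi2.sf(stat, df=2 * len(pvals))

def inverse_normal_stat(pvals):
    """Inverse normal Z = N^{-1/2} * sum(Phi^{-1}(p_i)), N(0,1) under H0 (2.6).
    Small (very negative) values of Z are evidence against the joint unit-root null."""
    probits = stats.norm.ppf(np.asarray(pvals))
    z = probits.sum() / np.sqrt(len(probits))
    return z, stats.norm.cdf(z)          # left-tailed p-value

def hartung_stat(pvals):
    """Hartung's modified inverse normal Z* with a common probit correlation, as in (2.8)-(2.9)."""
    t = stats.norm.ppf(np.asarray(pvals))
    N = len(t)
    rho_hat = 1.0 - np.sum((t - t.mean()) ** 2) / (N - 1)
    rho_star = max(-1.0 / (N - 1), rho_hat)                      # eq (2.8)
    kappa = 0.1 * (1.0 + 1.0 / (N - 1) - rho_star)
    denom = N + N * (N - 1) * (rho_star + kappa * np.sqrt(2.0 / (N + 1)) * (1.0 - rho_star))
    z_star = t.sum() / np.sqrt(denom)                            # eq (2.9)
    return z_star, stats.norm.cdf(z_star)                        # left-tailed p-value

# Hypothetical unit-by-unit ADF p-values (illustration only)
p = np.array([0.02, 0.15, 0.40, 0.55, 0.08, 0.70, 0.33, 0.90, 0.12, 0.05])
print(fisher_stat(p), inverse_normal_stat(p), hartung_stat(p))
```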
A third method, proposed by Simes 19 as an improved Bonferroni procedure, is based on the ordered P values, denoted by p_{(1)} ≤ p_{(2)} ≤ ··· ≤ p_{(N)}. The joint hypothesis H_0 is rejected if

p_{(i)} \le \frac{i\alpha}{N},   (2.10)

for at least one i = 1, ..., N. This procedure has a type I error equal to α when the test statistics are independent. Hanck 8 showed that the Simes test is robust to general patterns of cross-sectional dependence in the panel. The fourth method is Zaykin et al.'s 20 truncated product method (TPM), which takes the product of all those P values that do not exceed some prespecified value τ. The TPM is defined as

W = \prod_{i=1}^{N} p_i^{\,I(p_i \le \tau)},   (2.11)
where I(·) is the indicator function. Note that setting τ = 1 leads to Fisher's original combination method, which could lose power in cases when there are some very large P values. This can happen when some series in the panel are clearly nonstationary, such that the resulting P values are close to 1, and some are clearly stationary, such that the resulting P values are close to 0. Ordinary combination methods could be dominated by the large P values. The TPM removes these large P values through truncation, thus eliminating the effect that they could have on the resulting test statistic. When all the P values are independent, there exists a closed form of the distribution for W under H_0. When the P values are dependent, Monte Carlo simulation is needed to obtain the empirical distribution of W. Sheng and Yang 9 modify the TPM to allow for a certain degree of correlation among the P values. Their procedure is as follows.

Step 1. Calculate W* using (2.11). Set A = 0.

Step 2. Estimate the correlation matrix, Σ, for the P values. Following Hartung 18 and Demetrescu et al. 7, they assume a constant correlation between the probits t_i and t_j,

\operatorname{cov}(t_i, t_j) = \rho, \quad \text{for } i \ne j, \; i, j = 1, \ldots, N,   (2.12)
where t_i = Φ^{−1}(p_i) and t_j = Φ^{−1}(p_j). ρ can be estimated in finite samples according to (2.8).

Step 3. The distribution of W* is calculated based on the following Monte Carlo simulations.
(a) Draw pseudorandom probits from the normal distribution with mean zero and the estimated correlation matrix, \hat{Σ}, and transform them back through the standard normal c.d.f., resulting in N P values, denoted by \tilde{p}_1, \tilde{p}_2, \ldots, \tilde{p}_N.
(b) Calculate \tilde{W} = \prod_{i=1}^{N} \tilde{p}_i^{\,I(\tilde{p}_i \le \tau)}.
(c) If \tilde{W} \le W^{*}, increment A by one.
(d) Repeat steps (a)–(c) B times.
(e) The P value for W* is given by A/B.
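The Simes rule (2.10) and the modified TPM algorithm in Steps 1-3 can be sketched in Python as follows. This is our own hedged illustration of the procedure described above, not the authors' implementation; the truncation point τ = 0.1 mirrors the value selected later in the paper, while B, the random seed, and the helper names are choices made for the example.

```python
import numpy as np
from scipy import stats

def simes_reject(pvals, alpha=0.05):
    """Reject H0 if p_(i) <= i*alpha/N for at least one i (ordered p-values), eq (2.10)."""
    p_sorted = np.sort(pvals)
    N = len(p_sorted)
    return bool(np.any(p_sorted <= alpha * np.arange(1, N + 1) / N))

def modified_tpm_pvalue(pvals, tau=0.1, B=1000, seed=None):
    """Monte Carlo p-value of W* = prod_{p_i <= tau} p_i under an equicorrelated-probit null."""
    rng = np.random.default_rng(seed)
    p = np.asarray(pvals)
    N = len(p)
    # Step 1: observed truncated product
    w_star = np.prod(p[p <= tau]) if np.any(p <= tau) else 1.0
    # Step 2: constant probit correlation, estimated as in (2.8)
    t = stats.norm.ppf(p)
    rho_hat = 1.0 - np.sum((t - t.mean()) ** 2) / (N - 1)
    rho_star = max(-1.0 / (N - 1), rho_hat)
    sigma = np.full((N, N), rho_star)
    np.fill_diagonal(sigma, 1.0)
    # Step 3: simulate the null distribution of W
    draws = rng.multivariate_normal(np.zeros(N), sigma, size=B)   # correlated probits
    p_tilde = stats.norm.cdf(draws)                               # back to p-values
    trunc = np.where(p_tilde <= tau, p_tilde, 1.0)                # drop p-values above tau
    w_tilde = trunc.prod(axis=1)
    return float(np.mean(w_tilde <= w_star))                      # A / B

p = np.array([0.02, 0.15, 0.40, 0.55, 0.08, 0.70, 0.33, 0.90, 0.12, 0.05])
print(simes_reject(p), modified_tpm_pvalue(p, tau=0.1, B=1000, seed=0))
```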
3. Monte Carlo Study
In this section we compare the finite sample performance of the P value combination methods introduced in Section 2. We consider "strong" cross-section dependence, driven by a common factor, and "weak" cross-section dependence due to spatial correlation.
3.1. The Design of Monte Carlo
First we consider dynamic panels with fixed effects but no linear trends or residual serial correlation. The data-generating process (DGP) in this case is given by

y_{it} = (1 - \alpha_i)\mu_i + \alpha_i y_{i,t-1} + \epsilon_{it},   (3.1)

\epsilon_{it} = \gamma_i f_t + \xi_{it},   (3.2)

where
for i = 1, ..., N, t = −50, −49, ..., T. The initial values y_{i,−50} are set to 0 for all i. The individual fixed effect μ_i, the common factor f_t, the factor loading γ_i, and the error term ξ_it are independent of each other, with μ_i ∼ i.i.d. N(0, 1), f_t ∼ i.i.d. N(0, σ_f²), γ_i ∼ i.i.d. U[0, 3], and ξ_it ∼ i.i.d. N(0, 1).

Remark 3.1. Setting σ_f² = 0, we explore the properties of the tests under cross-section independence, and, with σ_f² = 10, we explore the performance of the tests under "high" cross-section dependence. In the latter case, the average pairwise correlation coefficient of ε_it and ε_jt is 70%, representing a strong cross-section correlation in practice.

Next we allow for deterministic trends in the DGP and the Dickey-Fuller (DF) regressions. For this case y_it is generated as follows:

y_{it} = \kappa_i + (1 - \alpha_i)\lambda_i t + \alpha_i y_{i,t-1} + \epsilon_{it},   (3.3)
with κ_i ∼ i.i.d. U[0, 0.02] and λ_i ∼ i.i.d. U[0, 0.02]. This ensures that y_it has the same average trend properties under the null and the alternative hypotheses. The errors ε_it are generated according to (3.2) with σ_f² = 10, representing the scenario of high cross-section correlation. To examine the impact of residual serial correlation, we consider a number of experiments where the errors ξ_it in (3.2) are generated as

\xi_{it} = \rho_i \xi_{i,t-1} + e_{it},   (3.4)
with e_it ∼ i.i.d. N(0, 1). Following Pesaran 15, we choose ρ_i ∼ i.i.d. U[0.2, 0.4] for positive serial correlation and ρ_i ∼ i.i.d. U[−0.4, −0.2] for negative serial correlation. We use this DGP to check the robustness of the tests to alternative residual correlation models and to heterogeneity in the coefficients ρ_i. Finally, we explore the performance of the tests under spatial dependence. We consider two commonly used spatial error processes: the spatial autoregressive (SAR) and the spatial
moving average (SMA). Let ε_t be the N × 1 error vector in (3.1). In SAR, it can be expressed as

\epsilon_t = \theta_1 W_N \epsilon_t + \upsilon_t = (I_N - \theta_1 W_N)^{-1}\upsilon_t,   (3.5)

where θ_1 is the spatial autoregressive parameter, W_N is an N × N known spatial weights matrix, and υ_t is the error component, which is assumed to be distributed independently across the cross-section dimension with constant variance σ_υ². Then the full NT × NT covariance matrix is

\Omega_{SAR} = \sigma_{\upsilon}^{2}\left[I_T \otimes (B_N' B_N)^{-1}\right],   (3.6)

where B_N = I_N − θ_1 W_N. In SMA, the error vector ε_t can be expressed as

\epsilon_t = \theta_2 W_N \upsilon_t + \upsilon_t = (I_N + \theta_2 W_N)\upsilon_t,   (3.7)

with θ_2 being the spatial moving average parameter. Then the full NT × NT covariance matrix becomes

\Omega_{SMA} = \sigma_{\upsilon}^{2}\left[I_T \otimes \left(I_N + \theta_2(W_N + W_N') + \theta_2^{2} W_N W_N'\right)\right].   (3.8)

Without loss of generality, we let σ_υ² = 1. We consider spatial dependence with θ_1 = 0.8 and θ_2 = 0.8. The average pairwise correlation coefficient of ε_it and ε_jt is 4%–22% for SAR and 2%–8% for SMA, representing a wide range of cross-section correlations in practice. The spatial weight matrix W_N is specified as a "1 ahead and 1 behind" matrix, with the ith row, 1 < i < N, of this matrix having nonzero elements in positions i + 1 and i − 1. Each row of this matrix is normalized such that all its nonzero elements are equal to 1/2. For all of the DGPs considered here, we use
\alpha_i \begin{cases} \sim \text{i.i.d. } U[0.85, 0.95] & \text{for } i = 1, \ldots, N_0, \text{ where } N_0 = \delta \cdot N, \\ = 1 & \text{for } i = N_0 + 1, \ldots, N, \end{cases}   (3.9)
where δ indicates the fraction of stationary series in the panel, varying in the interval [0, 1]. As a result, changes in δ allow us to study the impact of the proportion of stationary series on the power of the tests. When δ = 0, we explore the size of the tests. We set δ = 0.1, 0.5, and 0.9 to examine the power of the tests under heterogeneous alternatives. The tests are one-sided with the nominal size set at 5% and are conducted for all combinations of N and T = 20, 50, and 100. We also conduct the simulations with the nominal size set at 1% and 10%. The results are qualitatively similar to those at the 5% level and thus are not reported here. The results are obtained with MATLAB using M = 2000 simulations. To calculate the empirical critical value for the modified TPM, we run an additional B = 1000 replications within each simulation. We calculate the augmented Dickey-Fuller (ADF) t statistics. The number of lags in the ADF regressions is selected according to the recursive t-test procedure: start with an upper bound, k_max = 8, on k; if the last included lag is significant, choose k = k_max; if not, reduce k
Table 1: Size and power of panel unit root tests: cross-section independence.

delta   N     T      P       Z*      S       W*
0       20    20     0.059   0.056   0.052   0.063
0       20    50     0.054   0.046   0.053   0.050
0       20    100    0.047   0.047   0.053   0.052
0       50    20     0.040   0.044   0.050   0.045
0       50    50     0.048   0.045   0.048   0.045
0       50    100    0.057   0.054   0.047   0.058
0       100   20     0.051   0.050   0.050   0.047
0       100   50     0.047   0.051   0.048   0.057
0       100   100    0.052   0.045   0.049   0.049
0.1     20    20     0.066   0.072   0.052   0.056
0.1     20    50     0.087   0.080   0.062   0.083
0.1     20    100    0.172   0.144   0.112   0.0174
0.1     50    20     0.074   0.081   0.054   0.065
0.1     50    50     0.123   0.128   0.064   0.102
0.1     50    100    0.303   0.251   0.121   0.276
0.1     100   20     0.080   0.085   0.048   0.066
0.1     100   50     0.167   0.165   0.060   0.126
0.1     100   100    0.464   0.366   0.130   0.435
0.5     20    20     0.120   0.144   0.066   0.083
0.5     20    50     0.417   0.489   0.106   0.253
0.5     20    100    0.951   0.931   0.360   0.860
0.5     50    20     0.181   0.261   0.058   0.108
0.5     50    50     0.749   0.838   0.119   0.454
0.5     50    100    1.000   1.000   0.417   0.998
0.5     100   20     0.292   0.422   0.059   0.142
0.5     100   50     0.950   0.979   0.108   0.683
0.5     100   100    1.000   1.000   0.447   1.000
0.9     20    20     0.182   0.283   0.058   0.105
0.9     20    50     0.816   0.933   0.127   0.495
0.9     20    100    1.000   1.000   0.580   0.994
0.9     50    20     0.358   0.562   0.054   0.154
0.9     50    50     0.994   1.000   0.151   0.817
0.9     50    100    1.000   1.000   0.649   1.000
0.9     100   20     0.591   0.834   0.069   0.257
0.9     100   50     1.000   1.000   0.156   0.969
0.9     100   100    1.000   1.000   0.676   1.000

Note. Rejection rates of panel unit root tests at nominal level α = 0.05, using 2000 simulations. P is Maddala and Wu's 5 original Fisher test, Z* is Demetrescu et al.'s 7 modified inverse normal method, S is Hanck's 8 Simes test, and W* is Sheng and Yang's 9 modified TPM.
by one until the last lag becomes significant. If no lag is significant, set k = 0. The 10 percent level of the asymptotic normal distribution is used to determine the significance of the last lag. As shown in the work of Ng and Perron 21, this sequential testing procedure has better size properties than those based on information criteria in panel unit root tests. The P values in this paper are calculated using the response surfaces estimated by MacKinnon 22.
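For readers who wish to replicate the simulation design, the following compact Python sketch generates panels from the baseline DGP in (3.1), (3.2), and (3.9). It is our own illustration under the stated assumptions (the paper's simulations were carried out in MATLAB); the function name, seed, and default arguments are ours.

```python
import numpy as np

def generate_panel(N=20, T=50, delta=0.5, sigma_f2=10.0, burn_in=50, seed=None):
    """Simulate y_it from (3.1)-(3.2) with a fraction delta of stationary units, as in (3.9)."""
    rng = np.random.default_rng(seed)
    N0 = int(delta * N)                                    # number of stationary units
    alpha = np.ones(N)
    alpha[:N0] = rng.uniform(0.85, 0.95, size=N0)          # stationary block, eq (3.9)
    mu = rng.normal(0.0, 1.0, size=N)                      # fixed effects
    gamma = rng.uniform(0.0, 3.0, size=N)                  # factor loadings
    total_T = burn_in + T + 1                              # t = -50, ..., 0, ..., T
    f = rng.normal(0.0, np.sqrt(sigma_f2), size=total_T)   # common factor
    xi = rng.normal(0.0, 1.0, size=(N, total_T))
    eps = gamma[:, None] * f[None, :] + xi                 # eq (3.2)
    y = np.zeros((N, total_T))                             # y_{i,-50} = 0
    for t in range(1, total_T):                            # eq (3.1)
        y[:, t] = (1 - alpha) * mu + alpha * y[:, t - 1] + eps[:, t]
    return y[:, burn_in + 1:]                              # keep t = 1, ..., T

panel = generate_panel(N=20, T=50, delta=0.5, sigma_f2=10.0, seed=42)
print(panel.shape)   # (20, 50)
```

The same skeleton extends to the trend, serial-correlation, and spatial designs by modifying the error generation accordingly.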
Table 2: Size and power of panel unit root tests: no serial correlation, cross-section dependence driven by a common factor. N 20
δ0
50
100
20
δ 0.1
50
100
20
δ 0.5
50
100
20
δ 0.9
50
100
T 20 50 100 20 50 100 20 50 100 20 50 100 20 50 100 20 50 100 20 50 100 20 50 100 20 50 100 20 50 100 20 50 100 20 50 100
P 0.239 0.234 0.243 0.280 0.290 0.290 0.311 0.305 0.305 0.244 0.263 0.303 0.301 0.315 0.373 0.318 0.364 0.410 0.281 0.406 0.679 0.338 0.486 0.759 0.402 0.501 0.792 0.314 0.529 0.872 0.377 0.590 0.913 0.432 0.655 0.935
Intercept only Z∗ S 0.076 0.035 0.070 0.034 0.070 0.036 0.070 0.042 0.069 0.030 0.066 0.031 0.076 0.048 0.070 0.029 0.068 0.029 0.078 0.034 0.078 0.043 0.099 0.094 0.074 0.044 0.070 0.035 0.100 0.090 0.077 0.064 0.084 0.041 0.094 0.088 0.093 0.042 0.150 0.065 0.396 0.229 0.101 0.053 0.166 0.063 0.433 0.212 0.106 0.075 0.158 0.058 0.437 0.196 0.094 0.046 0.172 0.091 0.510 0.305 0.088 0.058 0.171 0.076 0.514 0.305 0.098 0.078 0.176 0.090 0.508 0.276
W∗ 0.054 0.049 0.049 0.049 0.046 0.047 0.051 0.050 0.048 0.054 0.054 0.073 0.050 0.048 0.082 0.054 0.062 0.084 0.074 0.116 0.351 0.075 0.127 0.368 0.086 0.116 0.384 0.069 0.115 0.382 0.064 0.117 0.390 0.064 0.122 0.373
P 0.233 0.259 0.243 0.297 0.291 0.275 0.326 0.340 0.300 0.238 0.243 0.272 0.280 0.310 0.333 0.319 0.319 0.350 0.251 0.314 0.476 0.288 0.373 0.565 0.352 0.405 0.598 0.260 0.368 0.660 0.298 0.442 0.742 0.372 0.491 0.769
Intercept and trend Z∗ S 0.072 0.041 0.074 0.032 0.075 0.035 0.069 0.033 0.061 0.028 0.063 0.031 0.082 0.038 0.070 0.024 0.062 0.028 0.067 0.031 0.063 0.036 0.083 0.057 0.068 0.031 0.085 0.037 0.080 0.047 0.070 0.032 0.068 0.027 0.078 0.051 0.083 0.040 0.102 0.052 0.221 0.125 0.068 0.029 0.097 0.047 0.225 0.113 0.082 0.031 0.098 0.039 0.223 0.104 0.070 0.036 0.107 0.047 0.282 0.163 0.073 0.033 0.107 0.051 0.282 0.148 0.077 0.028 0.118 0.054 0.291 0.139
W∗ 0.061 0.062 0.061 0.063 0.055 0.054 0.074 0.061 0.054 0.057 0.057 0.078 0.058 0.075 0.077 0.059 0.057 0.079 0.075 0.096 0.228 0.061 0.096 0.237 0.075 0.094 0.240 0.064 0.100 0.268 0.066 0.096 0.270 0.068 0.111 0.270
Note. See Table 1.
3.2. Monte Carlo Results
We compare the finite sample size and power of the following tests: Maddala and Wu's 5 original Fisher test (denoted by P), Demetrescu et al.'s 7 modified inverse normal method (denoted by Z*), Hanck's 8 Simes test (denoted by S), and Sheng and Yang's 9
Table 3: Size and power of panel unit root tests: serial correlation, intercept only, cross-section dependence driven by a common factor.
N 20
δ0
50
100
20
δ 0.1
50
100
20
δ 0.5
50
100
20
δ 0.9
50
100
T 20 50 100 20 50 100 20 50 100 20 50 100 20 50 100 20 50 100 20 50 100 20 50 100 20 50 100 20 50 100 20 50 100 20 50 100
P 0.250 0.240 0.224 0.309 0.289 0.283 0.335 0.317 0.308 0.256 0.241 0.282 0.302 0.308 0.354 0.330 0.363 0.399 0.285 0.383 0.629 0.351 0.483 0.757 0.398 0.529 0.781 0.323 0.511 0.858 0.376 0.598 0.901 0.415 0.633 0.918
Positive serial correlation S W∗ Z∗ 0.116 0.085 0.081 0.105 0.063 0.071 0.090 0.048 0.054 0.148 0.126 0.112 0.091 0.063 0.068 0.087 0.043 0.062 0.141 0.149 0.114 0.100 0.057 0.071 0.094 0.042 0.066 0.139 0.111 0.116 0.104 0.070 0.076 0.109 0.093 0.082 0.141 0.117 0.105 0.113 0.072 0.083 0.125 0.096 0.099 0.134 0.139 0.106 0.118 0.073 0.088 0.133 0.110 0.111 0.152 0.117 0.117 0.174 0.104 0.132 0.382 0.221 0.342 0.164 0.129 0.135 0.190 0.110 0.152 0.435 0.231 0.367 0.175 0.169 0.144 0.195 0.108 0.157 0.439 0.219 0.382 0.151 0.128 0.113 0.199 0.124 0.146 0.505 0.324 0.393 0.152 0.139 0.116 0.208 0.135 0.152 0.494 0.300 0.361 0.157 0.179 0.121 0.185 0.128 0.125 0.523 0.320 0.392
P 0.255 0.246 0.237 0.306 0.288 0.285 0.308 0.301 0.331 0.260 0.263 0.278 0.303 0.327 0.368 0.340 0.331 0.394 0.294 0.393 0.636 0.338 0.463 0.731 0.387 0.486 0.740 0.327 0.505 0.833 0.316 0.572 0.614 0.413 0.613 0.902
Negative serial correlation Z∗ S W∗ 0.097 0.077 0.075 0.076 0.044 0.050 0.071 0.033 0.048 0.096 0.090 0.076 0.080 0.050 0.063 0.076 0.034 0.051 0.103 0.100 0.078 0.078 0.052 0.056 0.074 0.037 0.049 0.108 0.079 0.082 0.091 0.063 0.064 0.091 0.096 0.073 0.114 0.101 0.087 0.087 0.064 0.066 0.104 0.098 0.081 0.117 0.120 0.090 0.086 0.063 0.064 0.100 0.098 0.085 0.136 0.099 0.111 0.169 0.093 0.136 0.348 0.200 0.315 0.146 0.112 0.124 0.184 0.106 0.157 0.388 0.210 0.344 0.153 0.137 0.130 0.162 0.100 0.133 0.368 0.204 0.336 0.132 0.110 0.104 0.182 0.116 0.133 0.464 0.276 0.355 0.144 0.111 0.108 0.180 0.108 0.130 0.185 0.117 0.135 0.127 0.131 0.093 0.185 0.122 0.138 0.478 0.291 0.370
Note. See Table 1.
modified TPM (denoted by W*). The results in Table 1 are obtained for the case of cross-section independence as a benchmark comparison. Tables 2 and 3 consider the cases of cross-section dependence driven by a single common factor, with a trend and with residual serial correlation. Table 4 reports the results with spatial dependence. Given the size distortions of some methods, we also include the size-adjusted power in Tables 5, 6, and 7. The major findings of our experiments can be summarized as follows.
Table 4: Size and power of panel unit root tests: intercept only, spatial dependence.
N 20
δ0
50
100
20
δ 0.1
50
100
20
δ 0.5
50
100
20
δ 0.9
50
100
T 20 50 100 20 50 100 20 50 100 20 50 100 20 50 100 20 50 100 20 50 100 20 50 100 20 50 100 20 50 100 20 50 100 20 50 100
P 0.121 0.126 0.133 0.140 0.142 0.123 0.143 0.152 0.136 0.144 0.167 0.244 0.160 0.198 0.353 0.161 0.264 0.479 0.199 0.414 0.857 0.272 0.669 0.989 0.356 0.848 1.000 0.281 0.687 0.992 0.407 0.934 1.000 0.538 0.996 1.000
Spatial autoregressive S Z∗ 0.059 0.040 0.063 0.044 0.066 0.044 0.054 0.040 0.063 0.058 0.052 0.051 0.046 0.053 0.046 0.051 0.047 0.049 0.068 0.049 0.089 0.057 0.135 0.104 0.071 0.053 0.092 0.057 0.189 0.124 0.048 0.052 0.097 0.058 0.231 0.121 0.095 0.056 0.217 0.098 0.626 0.318 0.098 0.052 0.308 0.118 0.855 0.380 0.101 0.068 0.383 0.109 0.967 0.418 0.117 0.072 0.264 0.141 0.754 0.482 0.116 0.067 0.276 0.145 0.904 0.579 0.108 0.065 0.295 0.145 0.984 0.661
W∗ 0.046 0.048 0.049 0.038 0.042 0.038 0.021 0.022 0.023 0.048 0.058 0.119 0.039 0.056 0.158 0.024 0.038 0.171 0.064 0.141 0.505 0.054 0.166 0.732 0.037 0.166 0.888 0.067 0.163 0.578 0.057 0.147 0.737 0.031 0.146 0.846
P 0.079 0.086 0.091 0.081 0.089 0.089 0.095 0.092 0.089 0.089 0.128 0.196 0.095 0.159 0.320 0.113 0.203 0.499 0.155 0.425 0.903 0.224 0.710 0.999 0.289 0.929 1.000 0.222 0.752 1.000 0.379 0.980 1.000 0.541 1.000 1.000
Spatial moving average Z∗ S 0.050 0.051 0.054 0.050 0.060 0.047 0.030 0.042 0.038 0.057 0.039 0.042 0.026 0.055 0.027 0.060 0.023 0.052 0.060 0.050 0.073 0.063 0.135 0.113 0.043 0.052 0.075 0.069 0.166 0.133 0.032 0.050 0.064 0.065 0.217 0.141 0.082 0.053 0.239 0.099 0.745 0.355 0.085 0.059 0.369 0.107 0.938 0.402 0.076 0.055 0.446 0.117 0.987 0.460 0.090 0.055 0.297 0.152 0.850 0.558 0.102 0.067 0.279 0.140 0.958 0.618 0.092 0.067 0.247 0.144 0.997 0.675
W∗ 0.034 0.039 0.043 0.020 0.024 0.025 0.010 0.013 0.012 0.040 0.055 0.105 0.023 0.038 0.129 0.011 0.018 0.148 0.048 0.132 0.585 0.033 0.163 0.833 0.019 0.159 0.955 0.049 0.161 0.652 0.037 0.131 0.792 0.027 0.100 0.901
Note. See Table 1.
(1) In the absence of clear guidance regarding the choice of τ, we try 10 different values, ranging from 0.05, 0.1, 0.2, ..., up to 0.9. Our simulation results show that W* tends to be slightly oversized with a small τ but moderately undersized with a large τ, and that its power does not show any clear pattern. We also note that W* yields similar results as τ varies between 0.05 and 0.2. In this paper we select τ = 0.1. To save space, the complete simulation results are not reported here, but they are available upon request.
Table 5: Size-adjusted power of panel unit root tests: no serial correlation, cross-section dependence driven by a common factor.
N 20
δ 0.1
50
100
20
δ 0.5
50
100
20
δ 0.9
50
100
T 20 50 100 20 50 100 20 50 100 20 50 100 20 50 100 20 50 100 20 50 100 20 50 100 20 50 100
P 0.043 0.048 0.062 0.061 0.053 0.066 0.099 0.043 0.065 0.044 0.070 0.200 0.057 0.085 0.161 0.114 0.069 0.189 0.059 0.140 0.456 0.084 0.150 0.453 0.144 0.143 0.446
Intercept only Z∗ 0.046 0.056 0.084 0.063 0.054 0.080 0.121 0.051 0.069 0.071 0.116 0.349 0.080 0.133 0.348 0.154 0.123 0.355 0.050 0.114 0.433 0.073 0.114 0.424 0.151 0.129 0.395
W∗ 0.044 0.049 0.063 0.061 0.054 0.058 0.094 0.042 0.065 0.042 0.068 0.207 0.057 0.076 0.160 0.119 0.064 0.189 0.058 0.137 0.431 0.082 0.144 0.422 0.131 0.136 0.414
P 0.038 0.049 0.066 0.047 0.042 0.056 0.052 0.046 0.058 0.045 0.062 0.123 0.052 0.051 0.116 0.048 0.057 0.099 0.061 0.088 0.250 0.054 0.087 0.243 0.054 0.088 0.208
Intercept and trend Z∗ W∗ 0.052 0.037 0.050 0.050 0.062 0.064 0.046 0.047 0.046 0.040 0.062 0.053 0.058 0.052 0.045 0.046 0.067 0.055 0.056 0.046 0.075 0.058 0.163 0.110 0.056 0.047 0.062 0.041 0.182 0.099 0.053 0.047 0.068 0.054 0.169 0.086 0.061 0.060 0.077 0.084 0.201 0.231 0.053 0.053 0.073 0.081 0.225 0.222 0.051 0.054 0.070 0.086 0.198 0.191
Note. The power is calculated at the exact 5% level. The 5% critical values for these tests are obtained from their finite sample distributions generated by 2000 simulations for sample sizes T 20, 50, and 100. P is Maddala and Wu’s 5 original Fisher test, Z ∗ is Demetrescu et al.’s 7 modified inverse normal method, and W ∗ is Sheng and Yang’s 9 modified TPM.
(2) With no cross-section dependence, all the tests yield good empirical size, close to the 5% nominal level (Table 1). As expected, the P test shows severe size distortions under cross-section dependence driven by a common factor or by spatial correlation. For a common factor with no residual serial correlation, while the Z* test is mildly oversized and the S test is slightly undersized, the W* test shows satisfactory size properties (Table 2). The presence of serial correlation leads to size distortions for all statistics when T is small, which persist even when T = 100 for the P and Z* tests. On the contrary, the S and W* tests exhibit good size properties with T = 50 and 100 (Table 3). Under spatial dependence, the S test performs best in terms of size, while the Z* and W* tests are conservative for large N (Table 4).

(3) All the tests become more powerful as N increases, which justifies the use of panel data in unit root tests. When a linear time trend is included, the power of all the
Table 6: Size-adjusted power of panel unit root tests: serial correlation, intercept only, cross-section dependence driven by a common factor.
N 20
δ 0.1
50
100
20
δ 0.5
50
100
20
δ 0.9
50
100
T 20 50 100 20 50 100 20 50 100 20 50 100 20 50 100 20 50 100 20 50 100 20 50 100 20 50 100
P 0.041 0.053 0.062 0.042 0.059 0.058 0.045 0.056 0.054 0.040 0.072 0.188 0.049 0.081 0.207 0.051 0.100 0.209 0.058 0.153 0.424 0.068 0.162 0.454 0.062 0.169 0.431
Positive correlation W∗ Z∗ 0.051 0.043 0.054 0.050 0.065 0.057 0.058 0.043 0.069 0.063 0.075 0.052 0.058 0.041 0.061 0.052 0.067 0.053 0.056 0.040 0.100 0.069 0.255 0.199 0.059 0.048 0.120 0.083 0.302 0.210 0.054 0.046 0.124 0.095 0.330 0.221 0.060 0.056 0.083 0.148 0.242 0.391 0.054 0.066 0.078 0.156 0.262 0.415 0.052 0.061 0.088 0.157 0.268 0.411
P 0.042 0.043 0.067 0.050 0.053 0.056 0.044 0.063 0.069 0.045 0.069 0.158 0.051 0.068 0.168 0.037 0.071 0.148 0.066 0.119 0.390 0.065 0.136 0.376 0.058 0.135 0.358
Negative correlation Z∗ W∗ 0.048 0.041 0.054 0.042 0.077 0.066 0.053 0.048 0.058 0.053 0.066 0.054 0.051 0.044 0.061 0.065 0.083 0.065 0.064 0.039 0.117 0.059 0.298 0.150 0.093 0.045 0.127 0.058 0.309 0.150 0.084 0.033 0.121 0.063 0.345 0.127 0.066 0.065 0.110 0.114 0.384 0.376 0.069 0.061 0.130 0.134 0.376 0.352 0.063 0.057 0.114 0.135 0.371 0.326
Note. See Table 5.
tests decreases substantially. Also notable is the fact that the power of the tests increases when the proportion of stationary series in the panel increases.

(4) Compared to the other three tests, the size-unadjusted power of the S test is somewhat disappointing here. An exception is that, when only very few series are stationary, the S test becomes the most powerful. When the proportion of stationary series in the panel increases, however, the S test is outperformed by the other tests. For example, in the case of no cross-section dependence in Table 1 with δ = 0.9, N = 100, and T = 50, the power of the S test is 0.156, whereas all other tests have power close to 1.

(5) Because the P test has severe size distortions, we only compare the Z* and W* tests in terms of size-adjusted power. The power is calculated at the exact 5% level. The 5% critical values for these tests are obtained from their finite sample distributions generated by 2000 simulations for sample sizes T = 20, 50, and 100. Since Hanck's 8 test does not have an explicit form of finite sample distribution, we do not calculate its size-adjusted power. With the cross-section dependence driven by a common
Table 7: Size-adjusted power of panel unit root tests: intercept only, spatial dependence.
N 20
δ 0.1
50
100
20
δ 0.5
50
100
20
δ 0.9
50
100
T 20 50 100 20 50 100 20 50 100 20 50 100 20 50 100 20 50 100 20 50 100 20 50 100 20 50 100
P 0.046 0.085 0.108 0.055 0.098 0.165 0.069 0.103 0.283 0.088 0.238 0.669 0.114 0.442 0.957 0.168 0.654 0.999 0.100 0.519 0.968 0.218 0.812 1.000 0.329 0.977 1.000
Autoregressive Z∗ 0.048 0.083 0.096 0.055 0.101 0.151 0.071 0.091 0.247 0.067 0.178 0.557 0.093 0.298 0.850 0.114 0.393 0.974 0.066 0.223 0.636 0.113 0.269 0.864 0.124 0.285 0.988
W∗ 0.046 0.078 0.102 0.060 0.096 0.190 0.066 0.088 0.311 0.081 0.190 0.640 0.093 0.321 0.922 0.116 0.465 0.996 0.087 0.359 0.918 0.143 0.578 0.998 0.174 0.795 1.000
P 0.059 0.072 0.144 0.058 0.098 0.234 0.065 0.125 0.360 0.117 0.298 0.852 0.141 0.593 0.995 0.211 0.859 1.000 0.158 0.644 0.999 0.284 0.958 1.000 0.472 0.999 1.000
Moving average Z∗ 0.056 0.069 0.142 0.061 0.095 0.207 0.068 0.119 0.334 0.094 0.211 0.760 0.104 0.381 0.961 0.146 0.578 1.000 0.105 0.272 0.845 0.142 0.371 0.987 0.143 0.569 1.000
W∗ 0.049 0.067 0.148 0.047 0.089 0.235 0.062 0.105 0.369 0.076 0.211 0.785 0.085 0.379 0.984 0.121 0.615 1.000 0.100 0.438 0.977 0.135 0.709 1.000 0.227 0.930 1.000
Note. See Table 5.
factor, the Z* test tends to deliver higher power for δ = 0.5 but lower power for δ = 0.9 than the W* test (Tables 5 and 6). Under spatial dependence, however, the former is clearly dominated by the latter most of the time. This is especially true for the SAR process, where the W* test exhibits substantially higher size-adjusted power than the Z* test (Table 7).
4. Empirical Application
Purchasing power parity (PPP) is a key assumption in many theoretical models of international economics. Empirical evidence on PPP for the floating regime period (1973–1998) is, however, mixed. While several authors, such as Wu and Wu 23 and Lopez 24, found supporting evidence, others 10, 15, 25 questioned the validity of PPP for this period. In this section, we use the methods discussed in the previous sections to investigate whether the real exchange rates are stationary among a group of OECD countries.
Table 8: Unit root tests for 27 OECD real exchange rates.

US dollar real exchange rate                       Deutsche mark real exchange rate
Country          k   P value   Simes criterion     Country          k   P value   Simes criterion
New Zealand      8   0.008     0.002               Mexico           3   0.006     0.002
Sweden           8   0.053     0.004               Iceland          0   0.010     0.004
United Kingdom   7   0.055     0.006               Australia        3   0.012     0.006
Finland          7   0.058     0.007               Korea            0   0.014     0.007
Spain            8   0.061     0.009               Canada           7   0.040     0.009
Mexico           3   0.066     0.011               Sweden           0   0.074     0.011
Iceland          8   0.069     0.013               United States    4   0.148     0.013
Switzerland      4   0.071     0.015               New Zealand      0   0.171     0.015
France           4   0.080     0.017               Finland          6   0.232     0.017
Netherlands      4   0.099     0.019               Turkey           8   0.241     0.019
Austria          4   0.102     0.020               Netherlands      1   0.415     0.020
Italy            4   0.103     0.022               Norway           7   0.417     0.022
Belgium          4   0.135     0.024               Spain            0   0.459     0.024
Korea            0   0.138     0.026               France           0   0.564     0.026
Germany          4   0.148     0.028               Italy            0   0.565     0.028
Greece           4   0.150     0.030               Poland           5   0.579     0.030
Norway           7   0.167     0.031               Hungary          4   0.612     0.031
Denmark          3   0.206     0.033               Belgium          0   0.618     0.033
Ireland          7   0.235     0.035               Luxembourg       0   0.655     0.035
Japan            4   0.246     0.037               Japan            5   0.656     0.037
Luxembourg       3   0.276     0.039               United Kingdom   0   0.697     0.039
Portugal         8   0.332     0.041               Denmark          0   0.698     0.041
Australia        3   0.386     0.043               Ireland          0   0.708     0.043
Poland           0   0.414     0.044               Austria          0   0.720     0.044
Turkey           8   0.418     0.046               Switzerland      8   0.733     0.046
Canada           6   0.580     0.048               Portugal         0   0.786     0.048
Hungary          0   0.816     0.050               Greece           5   0.880     0.050
P                    0.097                                               0.015
Z*                   0.095                                               0.016
W*                   0.257                                               0.002

Note. Simes criterion is calculated using the 5% significance level.
The log real exchange rate between country i and the US is given by

q_{it} = s_{it} - p_{us,t} + p_{it},   (4.1)

where s_{it} is the nominal exchange rate of the ith country's currency in terms of the US dollar, and p_{us,t} and p_{it} are the consumer price indices in the US and country i, respectively. All these variables are measured in natural logarithms. We use quarterly data from 1973:1 to 1998:2 for 27 OECD countries, as listed in Table 8. Two countries, the Czech Republic and the Slovak Republic, are excluded from our analysis, since their data span a very limited period of time, starting at 1993:1. All data are obtained from the IMF's International Financial Statistics. Note that, for Iceland, the consumer price indices are missing during 1982:Q1–1982:Q4 in
the original data. We filled this gap by calculating the level of the CPI from its percentage changes in the IMF database. As the first stage in our analysis we estimated individual ADF regressions:

\Delta q_{it} = \mu_i + \phi_i q_{i,t-1} + \sum_{j=1}^{k_i} \varphi_{ij}\, \Delta q_{i,t-j} + \varepsilon_{it}, \quad i = 1, \ldots, N; \; t = k_i + 2, \ldots, T.   (4.2)
The null and alternative hypotheses for testing PPP are specified in (2.3) and (2.4), respectively. The selected lags and the P values are reported in Table 8. The results in the left panel show that the ADF test does not reject the unit root null of the real exchange rate at the 5% level except for New Zealand. As a robustness check, we investigated the impact of a change in numeraire on the results. The right panel reports the estimation results when the Deutsche mark is used as the numeraire. Out of 27 countries, only 5 (Mexico, Iceland, Australia, Korea, and Canada) reject the null of a unit root at the 5% level. As is well known, the ADF test has low power with a short time span. Exploring the cross-section dimension is an alternative. However, if positive cross-section dependence is ignored, panel unit root tests can also lead to spurious results, as pointed out by O'Connell 10. As a preliminary check, we compute the pairwise cross-section correlation coefficients of the residuals from the above individual ADF regressions, \hat{\rho}_{ij}. The simple average of these correlation coefficients is calculated, according to Pesaran 26, as

\bar{\rho} = \frac{2}{N(N-1)} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \hat{\rho}_{ij}.   (4.3)

The associated cross-section dependence (CD) test statistic is calculated as

CD = \sqrt{\frac{2T}{N(N-1)}} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \hat{\rho}_{ij}.   (4.4)
In our sample, ρ̄ is estimated as 0.396 and 0.513 when the US dollar and the Deutsche mark, respectively, are used as the numeraire. The CD statistics, 71.137 for the former and 93.368 for the latter, strongly reject the null of no cross-section dependence at conventional significance levels. Turning now to the panel unit root tests, the Simes test does not reject the unit root null, regardless of which numeraire, US dollar or Deutsche mark, is used. However, the evidence is mixed, as illustrated by the other test statistics. For the 27 OECD countries as a whole, we find substantial evidence against the unit root null with the Deutsche mark but not with the US dollar. In summary, our results from panel unit root tests are numeraire specific, consistent with Lopez 24, and provide mixed evidence in support of PPP for the floating regime period.
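The dependence diagnostics in (4.3) and (4.4) are straightforward to compute. The sketch below is our own Python illustration, assuming resid is a T × N matrix of residuals from the individual ADF regressions; the artificial residuals in the usage example are for illustration only and do not reproduce the values reported above.

```python
import numpy as np

def average_rho_and_cd(resid):
    """Average pairwise residual correlation (4.3) and Pesaran's CD statistic (4.4)."""
    T, N = resid.shape
    corr = np.corrcoef(resid, rowvar=False)            # N x N correlation matrix
    iu = np.triu_indices(N, k=1)                       # all pairs i < j
    sum_rho = corr[iu].sum()
    rho_bar = 2.0 * sum_rho / (N * (N - 1))            # eq (4.3)
    cd = np.sqrt(2.0 * T / (N * (N - 1))) * sum_rho    # eq (4.4), ~ N(0,1) under no dependence
    return rho_bar, cd

# Illustration with artificial residuals (T = 100, N = 27)
rng = np.random.default_rng(0)
print(average_rho_and_cd(rng.normal(size=(100, 27))))
```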
5. Conclusion
We conduct a systematic comparison of the performance of four commonly used P value combination methods applied to panel unit root tests: the original Fisher test, the modified
inverse normal method, the Simes test, and the modified TPM. Monte Carlo evidence shows that, in the presence of both "strong" and "weak" cross-section dependence, the original Fisher test is severely oversized, but the other three tests exhibit good size properties with moderate and large T. In terms of power, the Simes test is useful when the total evidence against the joint null hypothesis is concentrated in one or very few of the tests being combined, while the modified inverse normal method and the modified TPM perform well when evidence against the joint null is spread among more than a small fraction of the panel units. Furthermore, under spatial dependence, the modified TPM yields the highest size-adjusted power. We investigate the PPP hypothesis for a panel of OECD countries and find mixed evidence. The results of this work provide practitioners with guidelines for selecting an appropriate combination method in panel unit root tests. A worthwhile extension would be to develop bootstrap P value combination methods that are robust to general forms of cross-section dependence in panel data. This issue is currently under investigation by the authors.
Acknowledgment The authors have benefited greatly from discussions with Kajal Lahiri and Dmitri Zaykin. They also thank the guest editor Mike Tsionas and an anonymous referee for helpful comments. The usual disclaimer applies.
References 1 L. H. C. Tippett, The Method of Statistics, Williams and Norgate, London, UK, 1931. 2 R. A. Fisher, Statistical Methods for Research Workers, Oliver and Boyd, London, UK, 4th edition, 1932. 3 L. V. Hedges and I. Olkin, Statistical Methods for Meta-Analysis, Academic Press, Orlando, Fla, USA, 1985. 4 T. M. Loughin, “A systematic comparison of methods for combining -values from independent tests,” Computational Statistics & Data Analysis, vol. 47, no. 3, pp. 467–485, 2004. 5 G. S. Maddala and S. Wu, “A comparative study of unit root tests with panel data and a new simple test,” Oxford Bulletin of Economics and Statistics, vol. 61, pp. 631–652, 1999. 6 I. Choi, “Unit root tests for panel data,” Journal of International Money and Finance, vol. 20, no. 2, pp. 249–272, 2001. 7 M. Demetrescu, U. Hassler, and A. I. Tarcolea, “Combining significance of correlated statistics with application to panel data,” Oxford Bulletin of Economics and Statistics, vol. 68, no. 5, pp. 647–663, 2006. 8 C. Hanck, “Intersection test for panel unit roots,” Econometric Reviews. In press. 9 X. Sheng and J. Yang, “A simple panel unit root test by combining dependent p-values,” SSRN working paper, no. 1526047, 2009. 10 P. G. J. O’Connell, “The overvaluation of purchasing power parity,” Journal of International Economics, vol. 44, no. 1, pp. 1–19, 1998. 11 P. C. B. Phillips and D. Sul, “Dynamic panel estimation and homogeneity testing under cross section dependence,” Econometrics Journal, vol. 6, no. 1, pp. 217–259, 2003. 12 J. Bai and S. Ng, “A PANIC attack on unit roots and cointegration,” Econometrica, vol. 72, no. 4, pp. 1127–1177, 2004. 13 Y. Chang, “Bootstrap unit root tests in panels with cross-sectional dependency,” Journal of Econometrics, vol. 120, no. 2, pp. 263–293, 2004. 14 H. R. Moon and B. Perron, “Testing for a unit root in panels with dynamic factors,” Journal of Econometrics, vol. 122, no. 1, pp. 81–126, 2004. 15 M. H. Pesaran, “A simple panel unit root test in the presence of cross-section dependence,” Journal of Applied Econometrics, vol. 22, no. 2, pp. 265–312, 2007.
16 J. Breitung and M. H. Pesaran, “Unit root and cointegration in panels,” in The Econometrics of Panel Data: Fundamentals and Recent Developments in Theory and Practice, L. Matyas and P. Sevestre, Eds., pp. 279–322, Kluwer Academic Publishers, Boston, Mass, USA, 2008. 17 S. A. Stouffer, E. A. Suchman, L. C. DeVinney, S. A. Star, and R. M. Williams Jr., The American Soldier, vol. 1 of Adjustment during Army Life, Princeton University Press, Princeton, NJ, USA, 1949. 18 J. Hartung, “A note on combining dependent tests of significance,” Biometrical Journal, vol. 41, no. 7, pp. 849–855, 1999. 19 R. J. Simes, “An improved Bonferroni procedure for multiple tests of significance,” Biometrika, vol. 73, no. 3, pp. 751–754, 1986. 20 D. V. Zaykin, L. A. Zhivotovsky, P. H. Westfall, and B. S. Weir, “Truncated product method for combining P-values,” Genetic Epidemiology, vol. 22, no. 2, pp. 170–185, 2002. 21 S. Ng and P. Perron, “Unit root tests in ARMA models with data-dependent methods for the selection of the truncation lag,” Journal of the American Statistical Association, vol. 90, no. 429, pp. 268–281, 1995. 22 J. G. Mackinnon, “Numerical distribution functions for unit root and cointegration tests,” Journal of Applied Econometrics, vol. 11, no. 6, pp. 601–618, 1996. 23 J.-L. Wu and S. Wu, “Is purchasing power parity overvalued?” Journal of Money, Credit and Banking, vol. 33, pp. 804–812, 2001. 24 C. Lopez, “Evidence of purchasing power parity for the floating regime period,” Journal of International Money and Finance, vol. 27, no. 1, pp. 156–164, 2008. 25 I. Choi and T. K. Chue, “Subsampling hypothesis tests for nonstationary panels with applications to exchange rates and stock prices,” Journal of Applied Econometrics, vol. 22, no. 2, pp. 233–264, 2007. 26 M. H. Pesaran, “General diagnostic tests for cross section dependence in panels,” Cambridge working papers in economics, no. 435, University of Cambridge, 2004.
Hindawi Publishing Corporation Journal of Probability and Statistics Volume 2011, Article ID 874251, 23 pages doi:10.1155/2011/874251
Research Article Nonparametric Estimation of ATE and QTE: An Application of Fractile Graphical Analysis Gabriel V. Montes-Rojas Department of Economics, City University London, D306 Social Sciences Building, Northampton Square, London EC1V 0HB, UK Correspondence should be addressed to Gabriel V. Montes-Rojas,
[email protected]
Received 3 May 2011; Accepted 28 July 2011
Academic Editor: Mike Tsionas
Copyright © 2011 Gabriel V. Montes-Rojas. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Nonparametric estimators for average and quantile treatment effects are constructed using Fractile Graphical Analysis, under the identifying assumption that selection into treatment is based on observable characteristics. The proposed method has two steps: first, the propensity score is estimated, and, second, a blocking estimation procedure using this estimate is used to compute treatment effects. In both cases, the estimators are proved to be consistent. Monte Carlo results show a better performance than other procedures based on the propensity score. Finally, these estimators are applied to a job training dataset.
1. Introduction
Econometric methods for estimating the effects of certain programs, such as job search assistance or classroom teaching programs, have been widely developed since the pioneering work of Ashenfelter 1, LaLonde 2, and others. In this case, a treatment refers to a certain program whose benefits are potentially obtainable by those selected for participation (treated), and it has no effect on a control group (nontreated). Estimating average treatment effects (ATEs), which refer to the mean effect of the program on a given outcome variable, in parametric and nonparametric environments (see 3, 4) has been a central issue in the literature. Lehmann 5 and Doksum 6 introduced the concept of quantile treatment effects (QTEs) as the difference between the quantiles of the treated and control outcome distributions. In this case, it is implicitly assumed that individuals have an intrinsic heterogeneity which cannot be controlled for using observables. Bitler et al. 7 discuss the costs of focusing on average treatment estimation instead of other statistics.
Provided that in nonexperimental settings selection into treatment is not random, ordinary least squares (OLS) and quantile regression techniques are inconsistent. As stated by Heckman and Navarro-Lozano 8, three different approaches have been used to overcome this problem: first, the control function approach, which explicitly models the selection mechanism and its relation to the outcome equation; second, instrumental variables; third, local estimation and aggregation. In the latter, under the unconfoundedness assumption, which states that conditional on a given set of exogenous covariates (observables) treatment occurrence is statistically independent of the potential outcomes, locally unbiased estimates can be obtained by conditioning on this set of covariates. The identification strategy that we follow relies on this assumption. Rosenbaum and Rubin 9, 10 show that adjusting solely for differences between treated and control units in a scalar function of the pretreatment covariates, the propensity score, also removes the entire bias associated with differences in pretreatment variables. Several estimation methods have been proposed for estimating ATE by conditioning on the propensity score. Matching estimators are widely used in empirical settings, in particular propensity score matching. In this case, each treated (nontreated) individual is matched to a nontreated (treated) individual, or aggregate of individuals, by means of their proximity in terms of the propensity score. Only in a few cases has matching on more than one dimension been used (see, e.g., 11) because of the computational burden that multivariate matching requires. Moreover, Hirano et al.'s 12 method uses a series estimator of the propensity score to obtain efficient (in the sense of Hahn 13) ATE estimators. Estimation of QTE has been developed using the minimization of convex check functions as in Koenker and Bassett 14. Abadie et al. 15 and Chernozhukov and Hansen 16, 17 develop this methodology using instrumental variables. On the other hand, Firpo 18 does not require instrumental variables, and his methodology follows a two-step procedure: in the first stage, he estimates the propensity score using a series estimator, while, in the second, he uses a weighted quantile regression method. Bitler et al. 7 compute QTE using the empirical distribution function and derive an equivalent estimator. Diamond 19 uses matching to construct comparable treated and nontreated groups and then computes the difference between the matched sample quantiles. An alternative source of heterogeneity comes from the consideration of observables only. Treatment effects may vary depending on individuals' human capital or on the income and job status of their families. Differences in terms of these covariates determine that one may be interested in the conditional treatment effect, that is, the effect conditional on some value of the observables. For instance, individuals who, in terms of the propensity score, are more likely to receive a treatment may experience a different effect than those who are less likely to receive it. As we show in this paper, how observables are treated determines differences in the parameter of interest for QTE but not for ATE. We define the average conditional quantile treatment effect as our parameter of interest, which can be described as the average of local QTEs. This parameter is equivalent to the standard unconditional QTE only in the case that the quantile treatment effect is constant.
In many cases, one would be more interested in the dependence of the outcome variable on the fractiles (i.e., quantiles) of the covariates rather than on the covariates themselves. Mahalanobis's 20 fractile graphical analysis (FGA) methodology was developed to account for this heterogeneity in observables. This method has attracted renewed interest in the literature as a nonparametric regression technique 21, 22. For our purposes, this methodology can be used as an alternative to matching, and it allows for estimating not only average but also quantile treatment effects. The idea is simple: divide the covariate space into fractiles, and obtain the conditional regression or quantile
by a step function. Provided that the number of fractile groups increases with the number of observations, we obtain consistent estimates of these functions, as the local estimators would satisfy the unconfoundedness assumption quoting Koenker and Hallock 23, page 147: “... segmenting the sample into subsets defined according to the conditioning covariates is always a valid option. Indeed, such local fitting underlies all nonparametric quantile regression approaches. In the most extreme cases, we have p distinct cells corresponding to different settings of the covariate vector, x, and quantile regression reduces simply to computing univariate quantiles for each of these cells.” FGA can be viewed as a histogram-type smoother, and it shares the convergence rate of histograms as opposed to kernel-based methods that have a better performance. In the classification of Imbens 4, it can be associated with the “blocking on the propensity score” methods. An advantage of this procedure is that only the number of fractile groups needs to be chosen as a smoothing parameter. In spirit, this method is very similar to matching. The latter matches every treated individual to a control nontreated individual whose characteristics are similar. Then, using the unconfoundedness assumption, it integrates over the covariates as the matched sample is similar to the treated. FGA decomposes the covariates distribution into fractiles. Then within each fractile, treated and nontreated individuals are compared. Finally, it integrates over the covariates in this case over the fractile groups as matching does. However, this nonparametric technique allows us to recover the complete graph for the conditional expectation or quantiles. In the latter, we show that the graph contains more information than the comparison of treated and nontreated separately. The propensity score FGA estimators are compared to other estimators based on the propensity score. In particular we compare it to propensity score matching estimators and Hirano et al.’s 12 estimator for ATE and to Firpo’s 18 for QTE. The paper is organized as follows. Section 2 describes the general framework and defines the parameters of interest. Section 3 reviews the literature on FGA. Section 4 derives ATE estimators, and Section 5 does it for QTE. Section 6 presents Monte Carlo evidence on the performance of these estimators, while Section 7 applies them to a well-known job training dataset. Conclusions appear in Section 8.
2. A General Setup for Nonrandom Experiments and Main Estimands

2.1. Unconditional Treatment Effects
To characterize the model more formally, we follow the potential-outcome notation used in Imbens 4, which dates back to Fisher 24, Splawa-Neyman 25, and Rubin 26–28, and is standard in the literature. Consider N individuals indexed by i = 1, 2, ..., N who may receive a certain "treatment" (e.g., receiving job training), indicated by the binary variable W_i ∈ {0, 1}. Each individual has a pair of potential outcomes (Y_{1i}, Y_{0i}) that correspond to the outcome with and without treatment, respectively. The fundamental problem, of course, is the inability to observe the same individual at the same time both with and without the treatment; that is, we only observe Y_i = W_i × Y_{1i} + (1 − W_i) × Y_{0i} and a set of exogenous variables X_i. We are interested in measuring the "effect" of the W-treatment (e.g., whether job training increases salaries or the chances of being employed).
A parameter of interest is the average treatment effect (ATE),

\delta = E(Y_1 - Y_0),   (2.1)
which tells us whether, on average, the W-treatment has an effect on the population. The key identification assumption is the unconfoundedness assumption (Rosenbaum and Rubin 9 called this the strongly ignorable treatment assignment assumption, and Heckman et al. 29 and Lechner 30, 31 the conditional independence assumption) 9, 28, which states that, conditional on the exogenous variables, the treatment indicator is independent of the potential outcomes. More formally, we have the following assumption.

Assumption 2.1 (unconfoundedness). Consider

W \perp (Y_1, Y_0) \mid X,   (2.2)
where ⊥ denotes statistical independence. Under this assumption we can identify the ATE (see 4) if both treated and nontreated units have a common support, that is, comparable X-values:

\delta = E(Y_1 - Y_0) = E_X[E(Y_1 - Y_0 \mid X)] = E_X[E(Y_1 \mid X, W = 1)] - E_X[E(Y_0 \mid X, W = 0)]
       = E_X[E(Y \mid X, W = 1)] - E_X[E(Y \mid X, W = 0)].   (2.3)

In some cases, we are interested not only in the average effect but also in the effect on a subgroup of the population. Average treatment effects do not fully describe all the distributional features of the W-treatment. For instance, high-ability individuals may benefit differently from program participation than low-ability ones, even if they have the same value of covariates. This determines that the effect of a certain treatment would vary according to unobservable characteristics. A parameter of interest in the presence of heterogeneous treatment effects is the quantile treatment effect (QTE). As originally defined in the studies by Doksum 6 and Lehmann 5, the QTE corresponds, for any fixed percentile, to the horizontal distance between two cumulative distribution functions. Let F_0 and F_1 be the control and treated distributions of a certain outcome, and let Δ(y) denote the horizontal distance at y between F_0 and F_1, that is, F_0(y) = F_1(y + Δ(y)), or Δ(y) = F_1^{-1}(F_0(y)) − y. We can express this effect not in terms of y but in terms of the quantiles of the same variable, and the QTE is then

\delta_{\tau} = \Delta(F_0^{-1}(\tau)) = F_1^{-1}(\tau) - F_0^{-1}(\tau) \equiv Q_{\tau}^{1} - Q_{\tau}^{0},   (2.4)
where Q_τ^j, j = 0, 1, are the quantiles of the treated and nontreated outcome distributions. The key identification assumption here is the rank invariance assumption, which is implied by the unconfoundedness assumption: in both treatment statuses, all individuals would maintain their rank in the distribution (see 29 for a general discussion about this
assumption). Therefore, using a similar argument as in the ATE case, Firpo 18 shows that this assumption provides a way of identifying the QTE:

\tau = E[P(Y_1 \le Q_{\tau}^{1} \mid X)] = E[P(Y_0 \le Q_{\tau}^{0} \mid X)] = E[P(Y \le Q_{\tau}^{1} \mid X, W = 1)] = E[P(Y \le Q_{\tau}^{0} \mid X, W = 0)],   (2.5)

where the last two expectations can be estimated from the observable data. In both cases, Assumption 2.1 suggests that, by constructing cells of homogeneous values of X, we would be able to get an unbiased estimate of the treatment effect. However, this becomes increasingly difficult and computationally impossible as the dimension of X increases. Rosenbaum and Rubin 9 argue that the unconfoundedness assumption can be restated in terms of the propensity score, p(X) ≡ P(W = 1 | X = x), under the following assumption.

Assumption 2.2 (common support). For all x ∈ domain(X), we have that

0 < \underline{p} \le p(x) \le \bar{p} < 1.   (2.6)
In this case, we have the following lemma.

Lemma 2.3. Assumptions 2.1 and 2.2 imply that

W \perp (Y_1, Y_0) \mid p(X).   (2.7)
Proof. See the work by Rosenbaum and Rubin 9.

Therefore, the problem can be reduced to the dimension of p(X). Throughout this paper we consider estimators based only on the propensity score.
2.2. Conditional Treatment Effects
Let Y_j(X) = E(Y_j | X) and F_j(· | X), j = 0, 1, be the outcome conditional means and distribution functions given X, and let H(·) be the distribution function of X. Then the ATE can be written as

\delta = \int Y_1(X)\, dH(X) - \int Y_0(X)\, dH(X) = \int \left[ \int Y_1\, dF_1(Y_1 \mid X) - \int Y_0\, dF_0(Y_0 \mid X) \right] dH(X).   (2.8)

Therefore, ATE can be obtained by comparing the unconditional mean outcomes of the treated and nontreated or by first obtaining the conditional ATE and then integrating over the covariate space.
Now define

Q_{\tau}^{j}(x) = F_j^{-1}(\tau \mid X = x) \equiv \inf\{Y_j : F_j(Y_j \mid X = x) \ge \tau\}, \quad j = 0, 1,   (2.9)

as the conditional τth quantile. In general,

E_X[Q_{\tau}^{j}(X)] \ne Q_{\tau}^{j}, \quad j = 0, 1.   (2.10)
In other words, the above equivalence cannot be applied to QTE: comparing the unconditional quantiles of the outcome distributions is not equivalent to computing the conditional quantiles and then aggregating. Chernozhukov and Hansen 16, 17 define the conditional quantile treatment effect (CQTE) as

\delta_{\tau}(x) = Q_{\tau}^{1}(x) - Q_{\tau}^{0}(x).   (2.11)

Define the average conditional quantile treatment effect (ACQTE) as

\bar{\delta}_{\tau} = E_X\!\left[Q_{\tau}^{1}(X) - Q_{\tau}^{0}(X)\right].   (2.12)
Strictly speaking, differences in Q_τ^1(X) − Q_τ^0(X) can be attributed either to differences in the treatment effect or to differences in the effect of the X's on the treated and nontreated. For instance, in a linear regression setup, we may have Q_τ^j(X) = α(τ, j) + β(τ, j)X, j = 0, 1. In the job training example, we may have that training increases salaries and returns to schooling, where years of schooling are X. However, in general, both parameters cannot be identified separately, and the literature often attributes to the treatment the whole conditional difference, that is, it sets β(τ, j) = β(τ), j = 0, 1. In order to see these differences, consider the following simple example with one outcome variable. Let X be a uniform random variable on [0, 1], and let

Y(X) = \begin{cases} 0 & \text{with prob. } 0.5 \\ 0.5 & \text{with prob. } 0.5 \end{cases} \ \text{if } X \le 0.5, \qquad Y(X) = \begin{cases} 0.5 & \text{with prob. } 0.5 \\ 1 & \text{with prob. } 0.5 \end{cases} \ \text{if } X > 0.5.   (2.13)
Here note that E(Y) = E_X[E(Y | X)] by the Law of Iterated Expectations. Let Q_τ be the quantile of the Y distribution, and let Q_τ(X) be the conditional quantile of Y conditional on X. In this case,

Q_{\tau} = \begin{cases} 0 & \text{if } \tau < 0.25, \\ 0.5 & \text{if } 0.25 \le \tau < 0.75, \\ 1 & \text{if } \tau \ge 0.75. \end{cases}   (2.14)
But

E_X[Q_{\tau}(X)] = \begin{cases} 0.25 & \text{if } \tau < 0.5, \\ 0.75 & \text{if } \tau \ge 0.5. \end{cases}   (2.15)
This determines that recovering the complete graph {X, Q_τ^j(X)}, j = 0, 1, provides additional information that cannot be recovered by computing unconditional quantiles. The estimators of Firpo 18, Bitler et al. 7, and Diamond 19 obtain unconditional quantiles because they compute the difference between the treated and nontreated quantiles. If we add X to the model and the treatment effect is constant across X, then we have the following expression:

Q_{\tau}(X) = \alpha(\tau) + \beta(\tau)\,\mathbf{1}(X > 0.5) = 0.5 + 0.5 \times \mathbf{1}(X > 0.5), \quad \forall \tau.   (2.16)
However, in this case, we would be attributing no difference across quantiles. If we instead consider differences in the treatment effect across X,

Q_{\tau}(X) = \alpha(\tau, X) = \begin{cases} 0 & \text{if } \tau < 0.5 \\ 0.5 & \text{if } \tau \ge 0.5 \end{cases} \ \text{if } X \le 0.5, \qquad \begin{cases} 0.5 & \text{if } \tau < 0.5 \\ 1 & \text{if } \tau \ge 0.5 \end{cases} \ \text{if } X > 0.5.   (2.17)
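A quick simulation makes the gap between (2.14) and (2.15) concrete. The following Python snippet is our own illustrative check of the example above and is not part of the original paper; the sample size and quantile levels are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.uniform(0.0, 1.0, size=n)
low_draw = rng.choice([0.0, 0.5], size=n)    # Y | X <= 0.5: 0 or 0.5 with prob. 0.5 each
high_draw = rng.choice([0.5, 1.0], size=n)   # Y | X > 0.5: 0.5 or 1 with prob. 0.5 each
y = np.where(x <= 0.5, low_draw, high_draw)  # eq (2.13)

for tau in (0.3, 0.6):
    q_uncond = np.quantile(y, tau)                                  # matches (2.14)
    q_cond_avg = 0.5 * np.quantile(y[x <= 0.5], tau) \
               + 0.5 * np.quantile(y[x > 0.5], tau)                 # matches (2.15)
    print(tau, q_uncond, q_cond_avg)
# tau = 0.3: Q_tau ~ 0.5 while E_X[Q_tau(X)] ~ 0.25
# tau = 0.6: Q_tau ~ 0.5 while E_X[Q_tau(X)] ~ 0.75
```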
We assume that Y_j(·) and Q_τ^j(·), j = 0, 1, can be expressed as functions of p. In particular, for QTE, we assume that the CQTE is of the form Q_τ^1(p) − Q_τ^0(p) = α(τ, p), and therefore the ACQTE becomes

\bar{\delta}_{\tau} = E_p\!\left[Q_{\tau}^{1}(p) - Q_{\tau}^{0}(p)\right] = E_p\!\left[\delta_{\tau}(p)\right],   (2.18)

which is our parameter of interest.
3. Fractile Graphical Analysis Fractile graphical analysis FGA is a nonparametric estimation method developed first by Mahalanobis 20 based on conditioning on the fractiles of the X’s. It was specifically designed to compare two populations, where the X variable was influenced by inflation and therefore not directly comparable. It has the same properties as other histogram-type estimators 32. Moreover, Bhattacharya 33 developed a conditional quantile estimation method based on FGA. Our proposal is to use FGA to develop estimators for both ATE and QTE. FGA produces a histogram-type smoother by blocking on the fractiles i.e., quantiles of the propensity score. FGA was originally developed for one covariate i.e., dimX 1, but Bhattacharya 33 and others showed that it can be extended to more covariates. However, we will only consider FGA based on a single covariate, the propensity score. One-dimensional FGA allows us to recover the graphs {p, γp}, where γ is any function of the propensity score.
8
Journal of Probability and Statistics
Assume first that the propensity score is known and it has a distribution function Hp. Further, assume that H· is continuous and strictly increasing, and p satisfies Assumption 2.2. Construct R fractile groups indexed by r on the propensity score: Irp
r r−1 , ξr/R H −1 , p ∈ p, p : ξr−1/R < p ≤ ξr/R , ξr−1/R H −1 R R
3.1
r 1, 2, . . . , R, where H −1 τ inf{p : Hp ≥ τ}. Each fractile group contains a similar number of observations i.e., about N/R, and it has an associated interval on the domain of p defined by the order statistics ξr−1/R , ξr/R , such that P p ∈ ξr−1/R , ξr/R 1/R. As the number of fractiles increases, the divergence in terms of p for all observations within the same fractile group becomes smaller, and therefore we would be gradually constructing groups with the same p-characteristics. In that case, estimates within each fractile group asymptotically satisfy the unconfoundedness assumption, provided that the conditioning set converges to a single propensity score value. The following lines provide a short review of the asymptotic properties of FGA, which can be found in the studies by Bhattacharya and Muller 32 and Bera and Gosh 21. Let ¨ gp EY | P p and σ 2 p VARY | P p be the conditional expectation and variance in terms of the propensity score, and consider the following notation: ht g ◦ H −1 t and kt σ 2 ◦ H −1 t for t r − 1 α/R with 0 ≤ α ≤ 1. Suppose that h· has bounded second derivative and k· has bounded first derivative. Then, as N → ∞ and R → ∞ so become that R/N → 0 for fixed t, the bias and the variance of an FGA estimator of ht, ht − ht −2R−1 h t1 o1 O 1 , BIAS: E ht R R R , VARIANCE: VAR ht 1 − α2 α2 kt1 o1 O N N
3.2
is so that the mean-squared error of h 2 R MSE: MSE ht 4R2 h t 1 − α2 α2 kt 1 o1, N
3.3
where 0 ≤ α Rt − Rt < 1. Therefore, the best rate of convergence of fractile graphs is obtained by letting R ON 1/3 , which yields a rate of ON −2/3 for the Integrated MSE. If p is not known, then it has to be estimated. In practice any estimate p p op 1 removes the bias. However, they will differ in the variance of the estimator, provided that the first stage i.e., the estimation of the propensity score needs to be taken into account. Hahn 13 shows that, by using the estimated propensity score, instead of the true propensity score, efficiency is achieved. Hirano et al. 12 and Firpo 18 use a semiparametric series estimator of the propensity score which produces this result. We impose the following assumption regarding the use of the estimated propensity score.
Journal of Probability and Statistics
9
Assumption 3.1 convergence of propensity score fractile groups. Let p be an estimator of the propensity score. Then, for fixed R and for all r, lim P Irp Irp 1.
N →∞
3.4
4. ATE Estimators FGA ATE estimators are based on imputing the unobserved outcome in each fractile group. Let Yi Y1i Yˇ 1i where Yˇ 1i
N k1
Wk Yk 1 pk ∈ Irpi /
N k1
Y0i
if Wi 1, if Wi 0,
4.1
Wk 1 pk ∈ Irpi , Yi Yˇ 0i
if Wi 0, if Wi 1,
4.2
pk ∈ Irpi / N pk ∈ Irpi . where Yˇ 0i N k1 1 − Wk Yk 1 k1 1 − Wk 1 Therefore, the FGA ATE estimator is N N 1 1 δ Y1i − Y0i Yˇ 1i − Yˇ 0i . N i1 N i1
4.3
Similarly, it can be expressed as R 1 δ δr , R r1
4.4
where N Wi Yi 1 pi ∈ Irp i ∈ Irp i1 1 − Wi Yi 1 p − . N N r r W 1 p ∈ I ∈ I − W p 1 1 i i i i i1 i1 p p N
δr
i1
4.5
The logic of this estimator is based on that of Hahn 13 “nonparametric imputation.” In this case, within each fractile group, EWY | Irp , E1 − WY | Irp , and EW | Irp are estimated nonparametrically using the previously estimated propensity score p. Alternatively we construct a similar estimator using the weighting technique described in the study by Hirano et al. 12. Let Y1i
Yi Y˘ 1i
if Wi 1, if Wi 0,
4.6
10
Journal of Probability and Statistics
where Y˘ 1i
N
pk 1 pk k1 Wk Yk /
∈ Irpi , Yi Y0i Y˘ 0i
where Y˘ 0i Then
N
k1 1
if Wi 0, if Wi 1,
4.7
− Wk Yk /1 − pk 1 pk ∈ Irpi . R 1 δr , δ R r1
4.8
N N R Wi 1 − Wi R Yi 1 pi ∈ Irp − Yi 1 pi ∈ Irp . δr N i1 pi N i1 1 − pi
4.9
where
This estimator suffers from the same problems of Hirano et al.’s 12 estimator; that is, the presence of occasional high/low values of the propensity score produces a very bad empirical performance. The following theorem shows that the FGA ATE estimators are consistent. The intuition behind the proof is that, as N increases, and R does it but at a smaller rate, each fractile group will have individuals with similar propensity score values. In the limit, the differences among them is negligible, and therefore the unconfoundedness assumption can be applied. In this case, the local i.e., for a given propensity score value ATE can be obtained by constructing the difference of the average treated and control individuals with that propensity score value. Theorem 4.1 consistency of ATE estimator. Consider Assumptions 2.1, 2.2, and 3.1, and assume that 1 the distribution functions of p and Y1 , Y0 | p are continuous and strictly increasing. 2 EY12 < ∞, EY02 < ∞. P P Then, δ → δ and δ → δ as N, R → ∞, R/N → 0.
Proof. See Appendix A.1.
5. QTE Estimators Define the within fractile conditional quantiles: N r Q τ1
argmin q
i1
1 pi ∈ Irp Wi Yi − q τ − 1 Yi ≤ q , N i ∈ Irp Wi i1 1 p
Journal of Probability and Statistics N i1
r argmin Q τ0 q
11
1 pi ∈ Irp 1 − Wi Yi − q τ − 1 Yi ≤ q . N i ∈ Irp 1 − Wi i1 1 p 5.1
Therefore, the QTE estimator is R R 1 1 r r . r − Q δτ Q δτ τ0 R r1 R r1 τ1
5.2
Similarly we define
r argmin Q τ1 q
r argmin Q τ0 q
N 1 1 pi ∈ Irp Wi Yi − q τ − 1 Yi ≤ q , p i1 i
1 1 pi ∈ Irp 1 − Wi Yi − q τ − 1 Yi ≤ q , 1 − p i i1
N
5.3
R R 1 1 r r . r − Q δτ Q δτ τ0 R r1 R r1 τ1
The following theorem proves the consistency of both QTE estimators. Theorem 5.1 consistency of QTE estimator. Consider Assumptions 2.1, 2.2, and 3.1, and assume that, the distribution function of p is continuous and strictly increasing. The distribution function of Y1 , Y0 | p is continuous, strictly increasing, and continuously differentiable. P P Then, for τ ∈ 0, 1, δτ → δτ , and δτ → δτ as N, R → ∞, R/N → 0. Proof. See Appendix A.2.
6. Monte Carlo Experiments We evaluate the performance of the proposed estimators with respect to other estimators based on the propensity score. We compute propensity score matching estimators using nearest-neighbor procedures with 1, 2, and 4 matches per observation, kernel and spline estimates. These estimators were designed by Barbara Sianesi for STATA 9.1, and they are available in the psmatch2 package. Additionally we compute Hirano et al. 12 semiparametric efficient estimator. In the case of QTE we compute Firpo 18 and Bitler et al. 7 estimators. We also compute QTE matching estimators following Diamond 19. In this case, for each observation, the matching procedure constructs the corresponding matched pair i.e., imputes the “closest” observation with the opposite treatment status. Then, we compute the unconditional quantiles of the imputed treated and nontreated distributions. A succinct description of some estimators appears in Appendix B.
12
Journal of Probability and Statistics Table 1: ATE Monte Carlo simulations.
Estimator FGA ATEa FGA ATE R × 2a FGA ATEb FGA ATE R × 2b Hirano et al. 12 PS matching Nearest neighbor 1 Nearest neighbor 2 Nearest neighbor 4 Kernel Spline
100 0.221 0.190 0.686 0.646 0.732
200 0.157 0.109 0.657 0.615 0.679
MSE 500 0.079 0.053 0.630 0.612 0.644
1000 0.057 0.037 0.607 0.597 0.618
2000 0.036 0.023 1.699 1.683 1.710
100 0.372 0.346 0.464 0.462 0.472
200 0.308 0.261 0.456 0.450 0.460
MAE 500 0.221 0.185 0.369 0.368 0.371
1000 0.187 0.155 0.343 0.341 0.344
2000 0.147 0.121 0.336 0.335 0.337
0.467 0.290 0.170 0.285 0.233
0.331 0.192 0.124 0.226 0.139
0.202 0.126 0.074 0.127 0.058
0.145 0.091 0.057 0.074 0.034
0.099 0.063 0.039 0.033 0.021
0.540 0.429 0.329 0.418 0.384
0.443 0.347 0.281 0.369 0.297
0.358 0.285 0.217 0.283 0.192
0.298 0.238 0.188 0.217 0.149
0.251 0.198 0.157 0.146 0.117
a δτ , b δτ . MSE: mean squared error. MAE: mean absolute error. Monte Carlo simulations based on 1000 replications of the baseline model.
Our baseline model is X1 , X2 , X3 , u, e ∼ N0, 1, W 1X1 − X2 X3 e > 0, Y1 δ X1 X2 u,
6.1
Y0 X1 X2 X3 u. In this simple model QTEs are equal to ATE for all quantiles. We set δ 2. We generate 1000 replications of the baseline models for sample sizes in {100, 200, 500, 1000, 2000}, and we compute mean square error MSE and mean absolute error MAE. Table 1 reports ATE estimators, while Table 2 shows QTE estimators for τ in {.10, .25, .50, .75, .90}. For FGA the number of fractile groups is R N 1/3 which minimizes the integrated MSE see, 32, and we also consider doubling the number of fractile groups i.e., R × 2. We consider the two FGA estimators discussed above, that is, δ and δ. The FGA ATE estimator has reasonable good performance in terms of both MSE and MAE. In almost every case, doubling the number of fractile groups results in a better performance of the δ estimator. However, the contrary occurs to the δ estimator. FGA ATE × 2 achieves the same values of the best matching estimators using 4 neighbors and δR splines. Increasing the sample size reduces both MSE and MAE at similar rates in all estimators. Overall the Hirano et al. 12 and FGA ATE δ estimators show extremely high values, mainly because a random draw may contain occasional values of the propensity score very close to the boundary i.e., 0 or 1. FGA QTE δ estimators outperform that of Firpo 18 for all sample sizes and quantiles. All the estimators show consistency, although FGA QTE reduces both MSE and MAE at higher rates than Firpo’s estimator. As in the last paragraph, doubling the number of As fractile groups improves the estimator performance, and FGA QTE δ outperform δ. expected, better estimates are found in the median case than in the extreme quantiles. Matching estimators show a relatively good performance. However, only in a few cases they
Journal of Probability and Statistics
13
Table 2: QTE Monte Carlo simulations. MSE Estimator
100
200
500
MAE 1000
2000
100
200
500
1000
2000
τ 0.10 a
FGA QTE
0.964
0.585
0.328
0.247
0.155
0.791
0.611
0.461
0.404
0.317
FGA QTE R × 2a
0.825
0.433
0.224
0.178
0.117
0.713
0.525
0.379
0.339
0.279
FGA QTE
0.863
0.491
0.243
0.146
0.090
0.802
0.593
0.409
0.312
0.241
FGA QTE R × 2b
1.334
0.849
0.391
0.241
0.146
1.074
0.847
0.561
0.433
0.339
b
Firpo 18
0.879
0.633
0.438
0.332
0.332
0.733
0.639
0.525
0.467
0.453
Bitler et al. 7
0.835
0.628
0.432
0.332
0.332
0.733
0.638
0.524
0.467
0.453
0.775
0.649
0.455
0.341
0.383
0.697
0.636
0.532
0.486
0.478
PS Matching Nearest neighbor 1 Nearest neighbor 2
0.551
0.442
0.315
0.269
0.236
0.603
0.555
0.484
0.462
0.444
Nearest neighbor 4
0.562
0.429
0.309
0.285
0.259
0.591
0.557
0.492
0.494
0.484
Kernel
0.748
0.566
0.467
0.401
0.361
0.691
0.617
0.580
0.587
0.581
Spline
0.626
0.519
0.400
0.389
0.367
0.646
0.622
0.575
0.593
0.591
τ 0.25 a
FGA QTE
0.523
0.333
0.184
0.133
0.083
0.585
0.459
0.336
0.292
0.230
FGA QTE R × 2a
0.409
0.246
0.131
0.088
0.058
0.507
0.397
0.291
0.236
0.194
FGA QTE
0.866
0.519
0.270
0.169
0.116
0.823
0.631
0.450
0.352
0.290
FGA QTE R × 2b
1.400
0.867
0.462
0.292
0.195
1.126
0.874
0.634
0.503
0.410
b
Firpo 18
0.721
0.527
0.360
0.268
0.226
0.641
0.543
0.436
0.377
0.338
Bitler et al. 7
0.687
0.527
0.360
0.268
0.226
0.635
0.543
0.436
0.377
0.338
0.952
0.757
0.471
0.309
0.209
0.731
0.621
0.517
0.420
0.364
PS Matching Nearest neighbor 1 Nearest neighbor 2
0.563
0.381
0.287
0.195
0.147
0.579
0.472
0.403
0.339
0.306
Nearest neighbor 4
0.312
0.204
0.153
0.110
0.089
0.438
0.352
0.296
0.255
0.231
Kernel
0.519
0.390
0.299
0.175
0.101
0.553
0.469
0.400
0.313
0.246
Spline
0.491
0.274
0.154
0.104
0.095
0.533
0.406
0.312
0.276
0.278
τ 0.50 a
FGA QTE
0.305
0.191
0.105
0.072
0.046
0.441
0.344
0.259
0.211
0.168
FGA QTE R × 2a
0.252
0.143
0.070
0.049
0.029
0.394
0.301
0.212
0.175
0.138
FGA QTE
0.941
0.616
0.349
0.242
0.169
0.871
0.703
0.529
0.435
0.367
FGA QTE R × 2b
1.541
0.977
0.567
0.370
0.270
1.187
0.940
0.716
0.576
0.496
b
Firpo 18
0.658
0.540
0.358
0.249
0.206
0.606
0.521
0.401
0.332
0.285
Bitler et al. 7
0.629
0.540
0.358
0.249
0.206
0.603
0.521
0.401
0.332
0.285
1.288
0.748
0.443
0.271
0.184
0.921
0.697
0.534
0.423
0.350
PS Matching Nearest neighbor 1 Nearest neighbor 2
0.993
0.608
0.359
0.229
0.152
0.809
0.624
0.489
0.390
0.317
Nearest neighbor 4
0.656
0.465
0.292
0.194
0.137
0.654
0.547
0.442
0.360
0.298
Kernel
0.367
0.250
0.130
0.084
0.060
0.476
0.396
0.292
0.234
0.200
Spline
0.648
0.349
0.181
0.121
0.093
0.642
0.476
0.343
0.285
0.235
14
Journal of Probability and Statistics Table 2: Continued. τ 0.75 a
FGA QTE
0.391
0.256
0.138
0.092
0.069
0.497
0.411
0.295
0.242
0.205
FGA QTE R × 2
0.352
0.218
0.112
0.071
0.051
0.466
0.376
0.267
0.215
0.180
FGA QTEb
1.043
0.721
0.452
0.320
0.250
0.919
0.768
0.610
0.515
0.461
FGA QTE R × 2
1.635
1.094
0.675
0.486
0.368
1.229
1.001
0.787
0.670
0.585
Firpo 18
0.652
0.561
0.356
0.346
0.250
0.615
0.548
0.420
0.390
0.322
Bitler et al. 7
0.652
0.561
0.356
0.346
0.250
0.615
0.548
0.420
0.390
0.322
a
b
PS Matching Nearest neighbor 1
1.439
1.138
0.710
0.492
0.317
0.888
0.801
0.624
0.532
0.440
Nearest neighbor 2
0.768
0.647
0.509
0.396
0.257
0.668
0.599
0.527
0.464
0.387
100
200
500
1000
2000
100
200
500
1000
2000
0.429
0.374
0.326
0.293
0.230
0.507
0.466
0.416
0.389
0.349
MSE Estimator Nearest neighbor 4
MAE
Kernel
0.486
0.576
0.407
0.328
0.255
0.534
0.556
0.484
0.445
0.400
Spline
0.713
0.596
0.342
0.284
0.258
0.631
0.578
0.438
0.412
0.408
FGA QTEa
0.711
0.398
0.229
0.167
0.125
0.666
0.505
0.385
0.328
0.287
FGA QTE R × 2
0.707
0.333
0.185
0.142
0.100
0.671
0.457
0.345
0.304
0.257
FGA QTEb
1.218
0.837
0.560
0.417
0.345
0.979
0.824
0.678
0.588
0.544
FGA QTE R × 2
1.771
1.176
0.809
0.589
0.473
1.269
1.029
0.860
0.736
0.664
Firpo 18
0.754
0.717
0.444
0.462
0.340
0.651
0.625
0.479
0.461
0.401
Bitler et al. 7
0.754
0.717
0.444
0.462
0.340
0.651
0.625
0.479
0.461
0.401
0.898
0.930
0.674
0.789
0.631
0.699
0.690
0.566
0.562
0.510
τ 0.90 a
b
PS Matching Nearest neighbor 1 Nearest neighbor 2
0.424
0.370
0.231
0.287
0.241
0.506
0.449
0.352
0.338
0.310
Nearest neighbor 4
0.337
0.204
0.112
0.122
0.113
0.456
0.356
0.249
0.228
0.217
Kernel
0.697
0.708
0.581
0.394
0.131
0.613
0.591
0.504
0.395
0.267
Spline
0.432
0.287
0.105
0.078
0.058
0.501
0.411
0.256
0.228
0.209
δτ , b δτ . MSE: mean squared error. MAE: mean absolute error. Monte Carlo simulations based on 1000 replications of the baseline model. a
outperform the FGA QTE estimator. In particular the spline matching estimator shows an outstanding performance for τ 0.9. Overall nonparametric FGA estimators, where the propensity score is reestimated show the best performance. nonparametrically i.e., δ,
7. Empirical Application We apply the estimators proposed in the paper to a widely used job training dataset first analyzed by LaLonde 2, the “National Supported Work Program” NSW. The same database was used in other applications such as those of Heckman and Hotz 34, Dehejia and Wahba 35, 36, Abadie and Imbens 11, and Firpo 18, among others.
Journal of Probability and Statistics
15
The program was designated as a random experiment for applicants who if selected would had received work experience treatment in a wide range of possible activities, like learning to operate a restaurant, a child care, or a construction work, for a period not exceeding twelve months. Eligible participants were targeted from recipients of AFDC, former addicts, former offenders, and young school dropouts. Candidates eligible for the NSW were randomized into the program between March 1975 and July 1977. The NSW data set consists of information on earnings and employment in 1978 outcome variables, whether treated or not, information on earnings and employment in 1974 and 1975, and background characteristics such as education, ethnicity, marital status, and age. We use the database provided by Guido Imbens http://www.economics.harvard.edu/faculty/imbens/software imbens/, which consists of 455 individuals, 185 treated, and 260 control observations. This particular subset is the one constructed by Dehejia and Wahba 35 and described there in more detail. We will focus on the possible effect on participants’ earnings in 1978 if any; that is, we answer the following question: what is the effect of this particular training program on future earnings? Provided that earnings is a continuous variable, we would be able to apply quantile analysis. A main drawback of this variable is that those unemployed in 1978 report earnings of zero. In 1978, 92 control and 45 treated individuals were unemployed. The average standard deviation of earnings in 1978 is $5300 $6631, which breaks into $6349 $578 for treated and $4554 $340 for control individuals. Without considering covariates, the difference between treated and nontreated is $1794 $671, which in a two-sample ttest rejects the null hypothesis of equal values t-stat 2.67, P value 0.0079. We also observe differences in terms of the percentiles in the earnings distribution. The 10th percentile for the treated control is $0 $0; the 25th percentile $485 $0; the median is $4232 $3139; the 75th percentile $9643 $7292 ; and the 90th percentile is $14582 $11551. Therefore, assuming the rank invariance property discussed above, higher quantiles of the earnings distribution seems to be associated with larger treatment effects. The propensity score is estimated by a probit model, where the dependent variable is participation and the covariates used are the individual characteristics and employment and earnings in 1974 and 1975. Note that the propensity score is of no particular interest by itself, provided that participants were randomly selected in the experiment. In this case, no particular covariate is individually significant, and a likelihood ratio test of joint significance gets chi-squared 8 8.30, P value 0.4050. As we mention above, a common support in the propensity score domain is necessary to make meaningful comparisons among treated and nontreated individuals. The empirical relevance of this assumption was pointed out by Heckman et al. 37, and it was identified as one of the major sources of bias. In our case, this has special importance since consistent estimates of treatment effect requires that both the number of treated and control is eventually large enough to apply large sample theory. Moreover, if there are no treated controls in a given fractile group, no within fractile estimate can be obtained. We use two different trimming procedures. 
First, provided that we may assume that F1 p ≤ F0 p, we only consider propensity score values in the range p∗ min pi , Wi 1 ≤ p ≤ max pi , Wi 0 p∗ . p
p
7.1
By doing this we drop 8 observations, and we refer to this sample as Trim 1. We also trim 2.5% in each tail of the propensity score distribution Trim 2 dropping 23 observations.
16
Journal of Probability and Statistics Table 3: ATE estimators of the effect of training on earnings. Nontrimmed
Trim 1
Trim 2
Estimator
Coef. Average Std.error Coef. Average Std. error Coef. Average Std. error
FGA ATEa
1572.2
1589.0
634.0
1608.6
1620.2
645.3
1480.9
1565.2
665.8
FGA ATE R × 2
1537.1
1606.4
659.0
1520.2
1634.4
659.8
1208.6
1572.4
682.3
FGA ATEb
1584.1
1561.9
616.7
1672.5
1625.7
637.0
1676.1
1536.1
655.5
FGA ATE R × 2
1563.2
1511.4
606.6
1604.5
1576.1
627.3
1530.2
1483.4
644.5
Hirano et al. 12
1598.2
1612.2
631.3
1731.3
1691.9
661.0
1862.0
1589.1
682.6
a
b
PS matching Nearest neighbor 1 997.2
1393.8
736.4
1101.6
1389.2
721.3
869.8
1334.4
744.8
Nearest neighbor 2 156.2
1471.6
710.6
1262.0
1477.8
702.8
984.2
1424.2
729.9
Nearest neighbor 4 1471.8
1559.0
670.0
1552.7
1571.7
671.6
1346.5
1525.1
700.4
Kernel
1629.0
1639.6
631.1
1638.9
1636.3
634.3
1482.4
1590.8
657.3
Spline
1587.0
1616.6
641.4
1614.4
1613.8
638.8
1380.3
1565.6
660.9
δτ , b δτ . Averages and standard errors are obtained using 1000 bootstrapping random samples with replacement of the original database. a
Table 3 reports the propensity score estimates used in the Monte Carlo simulation, applied to LaLonde’s data set. The first column contains the ATE estimate, while the second and third contain the average and standard deviation of a bootstrapping experiment with 1000 random samples with replacement of the original database. The last column calculates the ATE estimator for the two different trimming procedures discussed above. Table 4 estimates the QTE for the same quantiles analyzed in Table 2. The results confirm a positive average impact of training on earnings. FGA ATE estimators get $1572 and $1537, which are of the same magnitude as the kernel and spline propensity score matching estimates and the Hirano et al. 12 estimates. However, nearest-neighbor estimates are below these estimates by $100. QTE estimates show considerable variability across quantiles see Table 4. For the 10th quantile, estimates are not statistically different from zero. The median quantile is almost two-thirds of the ATE estimates, reflecting the presence of outliers in the sample or different distributional properties. Finally for the 90th quantile, the estimates produce up to a $3000 impact, twice the ATE. In other words, those who benefit more are those with a high level of unobservables. Unfortunately, all the estimators show high bootstrap standard errors.
8. Conclusion FGA provides a simple methodology for constructing nonparametric estimators of average and quantile treatment effects, under the assumption of selection on observables. In this paper we develop estimators using the estimated propensity score and we prove its consistency. Moreover, FGA QTE estimators show a better performance than that of Firpo’s 18 QTE estimator, which constitutes the most relevant estimator in the literature using the propensity score. Similar estimators can be derived for FGA in more than one dimension see for instance the discussion in 33, although its computational burden is unknown. Moreover,
Journal of Probability and Statistics
17
Table 4: QTE estimators of the effect of training on earnings. Nontrimmed Estimator
Trim 1
Trim 2
Coef. Average Std. error Coef. Average Std. error Coef. Average Std. error τ 0.10 a
FGA QTE
0.0
149.4
188.2
0.0
146.8
190.9
0.0
145.5
183.0
FGA QTE R × 2a
95.7
382.5
325.2
78.0
371.3
311.2
101.9
392.4
340.2
FGA QTEb
0.0
150.9
190.5
0.0
148.6
190.9
0.0
145.0
185.2
FGA QTE R × 2b
95.7
369.7
317.7
78.0
364.6
312.0
240.1
385.1
335.9
Firpo 18
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
Bitler et al. 7
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
Nearest neighbor 1 0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
PS matching Nearest neighbor 2 0.0
0.7
15.2
0.0
0.7
15.2
0.0
1.1
21.0
Nearest neighbor 4 0.0
73.8
184.5
0.0
69.7
179.3
0.0
74.5
188.7
Kernel
0.0
292.5
356.3
0.0
281.0
351.0
0.0
298.9
365.8
Spline
0.0
295.3
355.7
0.0
278.5
349.5
0.0
291.2
361.0
τ 0.25 FGA QTE
409.6
653.5
480.2
376.4
686.1
475.9
623.2
701.7
502.8
FGA QTE R × 2
628.8
853.5
546.4
487.4
864.3
540.9
836.9
880.2
547.8
FGA QTEb
414.2
649.8
488.0
361.9
685.8
484.2
500.9
705.8
511.8
FGA QTE R × 2b
694.5
843.3
546.3
363.1
853.3
554.6
856.6
872.1
561.9
Firpo 18
0.0
295.8
341.3
0.0
291.3
340.4
289.8
276.8
343.2
Bitler et al. 7
0.0
295.8
341.3
0.0
291.3
340.4
289.8
276.8
343.2
Nearest neighbor 1 0.0
159.1
290.8
0.0
145.5
278.6
0.0
134.9
273.8
Nearest neighbor 2 1067.5
804.7
523.1
1254.6
783.1
527.8
1152.1
763.5
763.5
Nearest neighbor 4 1568.0
1455.8
738.6
1682.1
1454.2
733.0
803.4
1440.0
748.0
PS matching
Kernel
3068.7
2808.3
986.1
3180.7
2864.5
992.4
3074.4
2892.7
1036.1
Spline
2628.7
2363.8
1019.7
2708.2
2486.6
1023.5
2715.1
2470.7
1053.1
FGA QTE
1131.0
1379.9
851.0
1321.2
1404.8
870.5
1026.0
1376.2
869.2
FGA QTE R × 2
914.2
1466.4
887.3
965.5
1501.7
866.6
385.5
1488.7
873.5
FGA QTEb
1193.9
1403.5
871.6
1306.0
1432.7
891.8
1066.3
1384.7
883.4
FGA QTE R × 2
1078.8
1476.8
939.7
981.7
1509.1
919.9
262.5
1492.6
916.5
Firpo 18
1063.0
1178.5
901.6
1254.9
1260.9
938.1
763.8
1257.8
973.9
Bitler et al. 7
1063.0
1178.5
901.6
1254.9
1260.9
938.1
763.8
1257.8
973.9
910.5
1034.6
368.5
984.4
1042.5
77.5
990.4
1066.2
τ 0.50
b
PS matching Nearest neighbor 1 5.2 Nearest neighbor 2 388.1
1050.9
859.9
563.4
1093.0
877.5
284.4
1088.4
898.4
Nearest neighbor 4 695.5
1171.6
810.5
801.8
1210.6
814.5
616.7
1201.0
842.8
Kernel
1567.1
1351.2
752.2
1643.1
1365.3
759.6
1587.8
1357.7
783.7
Spline
846.6
1261.7
776.9
1843.3
1259.1
783.4
1578.1
1272.3
811.9
18
Journal of Probability and Statistics Table 4: Continued.
FGA QTE 2110.9 FGA QTE R × 2 1614.6 2240.9 FGA QTEb FGA QTE R × 2b 1351.3 Firpo 18 2274.1 Bitler et al. 7 2274.1 PS matching Nearest neighbor 1 1537.0 Nearest neighbor 2 585.5 Estimator Coef. Nearest neighbor 4 1280.1 Kernel 1399.3 Spline 1381.6 FGA QTE 3093.8 FGA QTE R × 2 3170.8 3093.8 FGA QTEb FGA QTE R × 2b 3142.1 Firpo 18 2713.4 Bitler et al. 7 2713.4 PS matching Nearest neighbor 1 1861.2 Nearest neighbor 2 1279.4 Nearest neighbor 4 2047.9 Kernel 175.0 Spline 571.5
1948.9 2162.8 1968.0 2222.8 2060.9 2060.9
1180.7 1376.1 1191.9 1447.2 919.6 919.6
1663.9 1111.7 1598.8 1106.4 Nontrimmed Average Std. error 1634.1 1046.5 1735.3 972.3 1963.1 1088.7
1960.5 1640.0 1815.5 1549.0 2258.0 2258.0 1921.1 877.6 Coef. 1243.8 1399.5 1320.6
3345.7 3301.2 3493.2 3343.0 2854.3 2854.3
2315.2 2157.9 2421.2 2221.1 1890.4 1890.4
3426.5 4126.5 3518.0 4819.4 2150.6 2150.6
2425.7 2388.4 2180.9 691.5 881.2
2150.1 1915.6 1716.8 1389.5 1520.4
1392.3 1278.0 2153.7 208.0 261.8
τ 0.75 1986.9 2215.2 2010.4 2253.0 2071.3 2071.3
1190.8 1369.5 1221.5 1440.5 942.1 942.1
1699.6 1081.8 1640.8 1097.4 Trim 1 Average Std. error 1653.2 1047.7 1754.3 971.5 1954.4 1087.3 τ 0.90 3401.9 2364.0 3317.0 2174.2 3512.3 2454.9 3351.2 2213.4 2797.8 1889.2 2797.8 1889.2 2372.7 2336.6 2181.3 770.6 899.4
2111.8 1900.4 1743.9 1431.7 1477.9
1710.1 742.6 1706.4 789.2 2214.5 2214.5 1263.0 393.7 Coef. 1239.6 1180.6 1108.5
1950.1 2131.1 1954.1 2159.8 2034.2 2034.2
1233.8 1424.7 1264.7 1449.7 979.1 979.1
1637.5 1116.7 1589.5 1115.2 Trim 2 Average Std. error 1595.4 1084.2 1710.6 984.8 1870.9 1101.3
2789.1 4161.9 3189.5 3998.6 2126.2 2126.2
3233.7 3118.0 3189.9 3154.9 2715.5 2715.5
2459.3 2096.5 2442.7 2133.0 1895.0 1895.0
444.7 45.4 2153.7 327.6 327.6
2249.4 2172.0 2030.9 566.2 755.2
2130.0 1892.5 1758.1 1415.4 1458.8
δτ , b δτ . Averages and standard errors are obtained using 1000 bootstrapping random samples with replacement of the original database.
a
more efficient estimators may be obtained by applying smoothing techniques within or between fractiles 22.
Appendices A. Proof of Theorems A.1. Proof of Theorem 4.1 Proof. Let N → ∞, R and r be fixed. Then, N Wi Yi 1 pi ∈ Irp j ∈ Irp i1 1 − Wi Yi 1 p p lim − N N N →∞ r r W 1 p ∈ I ∈ I − W p 1 1 i i i i i1 i1 p p N
p lim δr N →∞
i1
by Law of Large Numbers and Assumptions 2.2 and 3.1
Journal of Probability and Statistics E W × Y | Irp E 1 − W × Y | Irp − P W | Irp P 1 − W | Irp E E W × Y | p | Irp E E 1 − W × Y | p | Irp − P W | Irp P 1 − W | Irp by Law of Iterated Expectations E E W | p × E Y1 | p | Irp E E 1 − W | p × E Y0 | p | Irp − P W | Irp P 1 − W | Irp by Assumption 1.2 .
19
A.1 Let EWp p, P WIrp ≡ pr , EY1 p g1 p, EY0 | p g0 p, δr EY1 − Y0 | Irp . Then r r E p − pr g1 p | Ir g p p | I E p − 0 p p p lim δr − δr N →∞ r r p 1−p COV p; g1 p | Irp COV p; g0 p | Irp r r p 1−p ≤
A.2
VAR p | Irp × Cr ,
where ⎛ ⎜ Cr ⎝
VAR g1 p Irp pr
⎞ VAR g0 p Irp ⎟ ⎠. 1 − pr
A.3
Now let R, N → ∞, R/N → 0. Then p lim δ − δ N →∞
RN 1 r r δ −δ p lim N → ∞ RN r1
RN 1 VAR p | Irp Cr N → ∞ RN r1 ≤ lim maxr VAR p | Irp Cr .
≤ lim
N →∞
A.4
20
Journal of Probability and Statistics
By assumptions Cr is bounded and VARp | Irp ≤ supp∈Irp p − infp∈Irp p ≤ 1/R, for all r. Then δ − δ Op 1/R. The consistency of δ can be easily proved by noting that, within each fractile group, the estimator is equivalent to that of Hirano et al. 12.
A.2. Proof of Theorem 5.1 Proof. Note that as N → ∞, R and r are fixed, by convergence of sample quantiles p −1 r −→ Fr,W1 Q τ, τ1
A.5
where E W1 Y ≤ q | Irp , Fr,W1 q ≡ P Y ≤ q | Irp , W 1 E 1 Y ≤ q | Irp , W 1 pr
A.6
and pr P W 1|Irp is defined in the proof of Theorem 4.1. Therefore, ' & ' p W r r r r τ E r 1 Y ≤ Q E 1 Y1 ≤ Q τ1 | Ip E τ1 | p | Ip . p pr &
A.7
However, in general, r | p | Irp E 1 Y1 ≤ Q r | Irp . τ/ E E 1 Y1 ≤ Q τ1 τ1
A.8
This divergence can be expressed as r | Irp τ −Fr,1 Q r | p | Irp ≤ 1 × K r , r 1 COV p, E 1 Y1 ≤ Q τ −E 1 Y1 ≤ Q τ1 τ1 τ1 1 R pr A.9 r r | Irp /pr is bounded by where Fr,1 q P Y1 ≤ q | Irp and K1 VAR1Y0 ≤ Q τ1 assumptions see Theorem 4.1. r and Qr ? By Taylor’s theorem, How does this translate into the divergence of Q τ1 τ1
r Q τ1
−
r Qτ1
r − τ Fr,1 Q τ1 1 r 1 K1 , Op op r R R fr,1 Qτ1
A.10
Journal of Probability and Statistics
21
Consider now the case that N, R → ∞, R/N → 0, 1 RN r 1 , Q − E Qτ1 p Op RN r1 τ1 R
A.11
where E1Y1 ≤ Qτ1 p | p τ for all p ∈ p, p. The same argument can be applied to show τ0 . the consistency of Q Therefore, δτ
RN 1 r r Qτ1 − Qτ0 δτ op 1. RN r1
A.12
The consistency of δτ can be easily proved by noting that, within each fractile group, the estimator is equivalent to that of Firpo 18.
B. Other ATE and QTE Estimators Hirano et al.’s 12 semiparametric efficient ATE estimator is N W i Yi
pi
i1
1 − Wi Yi − , 1 − pi
B.1
where p is a semiparametric series estimator of the propensity score. Bitler et al. 7 QTE estimator is obtained by finding the empirical quantiles of the weighted empirical distributions: 1 − Wi 1 Yi ≤ q / 1 − pi , N i i1 1 − Wi / 1 − p N pi i1 Wi 1 Yi ≤ q / F1 q , N pi i1 Wi /
F0 q
N i1
B.2
that is, F0−1 τ and F−1 τ. Firpo 18 obtains the same results by minimizing weighted convex check functions: F0−1 τ argmin q
N 1 − Wi i1
F1−1 τ argmin q
Yi − q τ − 1 Y i ≤ q ,
1 − pi N Wi i1
pi
Yi − q τ − 1 Y i ≤ q .
B.3
22
Journal of Probability and Statistics
Acknowledgment The author is grateful to Anil Bera, Antonio Galvao, and Todd Elder for helpful comments.
References 1 O. Ashenfelter, “Estimating the effect of training programs on earnings,” Review of Economics and Statistics, vol. 60, no. 1, pp. 47–57, 1978. 2 R. LaLonde, “Evaluating the econometric evaluations of training programs with experimental data,” American Economic Review, vol. 76, no. 4, pp. 604–620, 1986. 3 J. D. Angrist and A. Krueger, “Empirical strategies in labor economics,” in Handbook of Labor Economics, O. Ashenfelter and D. Card, Eds., vol. ume 1, pp. 1277–1366, Elsevier, New York, NY, USA, 1st edition, 1999. 4 G. W. Imbens, “Nonparametric estimation of average treatment effects under exogeneity: a review,” Review of Economics and Statistics, vol. 86, no. 1, pp. 4–29, 2004. 5 E. L. Lehmann, Nonparametrics: Statistical Methods Based on Ranks, Holde-Bay, San Francisco, Calif, USA, 1st edition, 2006. 6 K. Doksum, “Empirical probability plots and statistical inference for nonlinear models in the twosample case,” The Annals of Statistics, vol. 2, pp. 267–277, 1974. 7 M. P. Bitler, J. B. Gelbach, and H. W. Hoynes, “What mean impacts miss: distributional effects of welfare reform experiments,” American Economic Review, vol. 96, no. 4, pp. 988–1012, 2006. 8 J. Heckman and S. Navarro-Lozano, “Using matching, instrumental variables, and control functions to estimate economic choice models,” Review of Economics and Statistics, vol. 86, no. 1, pp. 30–57, 2004. 9 P. R. Rosenbaum and D. B. Rubin, “The central role of the propensity score in observational studies for causal effects,” Biometrika, vol. 70, no. 1, pp. 41–55, 1983. 10 P. R. Rosenbaum and D. B. Rubin, “Reducing bias in observational studies using subclassification on the propensity score,” Journal of the American Statistical Society, vol. 79, no. 387, pp. 516–524, 1984. 11 A. Abadie and G. W. Imbens, “Simple and bias-corrected matching estimators for average treatment effects,” NBER Technical Working Paper 0283, National Bureau of Economic Research, Cambridge, Mass, USA, 2002. 12 K. Hirano, G. W. Imbens, and G. Ridder, “Efficient estimation of average treatment effects using the estimated propensity score,” Econometrica, vol. 71, no. 4, pp. 1161–1189, 2003. 13 J. Hahn, “On the role of the propensity score in efficient semiparametric estimation of average treatment effects,” Econometrica, vol. 66, no. 2, pp. 315–331, 1998. 14 R. Koenker and G. Bassett,, “Regression quantiles,” Econometrica, vol. 46, no. 1, pp. 33–50, 1978. 15 A. Abadie, J. Angrist, and G. Imbens, “Instrumental variables estimates of the effect of subsidized training on the quantiles of trainee earnings,” Econometrica, vol. 70, no. 1, pp. 91–117, 2002. 16 V. Chernozhukov and C. Hansen, “The effects of 401k participation on the wealth distribution: an instrumental quantile regression analysis,” Review of Economics and Statistics, vol. 86, no. 3, pp. 735– 751, 2004. 17 V. Chernozhukov and C. Hansen, “An IV model of quantile treatment effects,” Econometrica, vol. 73, no. 1, pp. 245–261, 2005. 18 S. Firpo, “Efficient semiparametric estimation of quantile treatment effects,” Econometrica, vol. 75, no. 1, pp. 259–276, 2007. 19 A. Diamond, Reliable estimation of average and quantile causal effects in non-experimental settings, Harvard University, Cambridge, Mass, USA, 2005. 20 P. C. Mahalanobis, “A method fractile graphical analysis,” Econometrica, vol. 28, no. 2, pp. 325–351, 1960. 21 A. K. Bera and A. Gosh, Fractile Regression and Its Applications, University of Illinois at UrbanaChampaign, Champaign, Ill, USA, 2006. 22 B. 
Sen, “Estimation and comparison of fractile graphs using kernel smoothing techniques,” Sankhya. The Indian Journal of Statistics, vol. 67, no. 2, pp. 305–334, 2005. 23 R. Koenker and K. F. Hallock, “Quantile regression,” Journal of Economic Perspectives, vol. 15, no. 4, pp. 143–156, 2001. 24 R. Fisher, The Design of Experiments, Boyd, London, UK, 1935. 25 J. Splawa-Neyman, “On the application of probability theory to agricultural experiments. Essay on principles. Section 9,” Statistical Science, vol. 5, no. 4, pp. 465–480, 1990.
Journal of Probability and Statistics
23
26 D. B. Rubin, “Estimating causal effects of treatments in randomized and nonrandomized studies,” Journal of Educational Psychology, vol. 66, no. 5, pp. 688–701, 1974. 27 D. B. Rubin, “Assignment to a treatment group on the basis of a covariate,” Journal of Educational Statistics, vol. 2, no. 1, pp. 1–26, 1977. 28 D. B. Rubin, “Bayesian inference for causal effects: the role of randomization,” The Annals of Statistics, vol. 6, no. 1, pp. 34–58, 1978. 29 J. J. Heckman, H. Ichimura, and P. E. Todd, “Matching as an econometric evaluation estimator: evidence from evaluating a job training programme,” Review of Economic Studies, vol. 64, no. 4, pp. 605–654, 1997. 30 M. Lechner, “Earnings and employment effects of continuous off-the-job training in East Germany after unification,” Journal of Business and Economic Statistics, vol. 17, no. 1, pp. 74–90, 1999. 31 M. Lechner, “An evaluation of public-sector-sponsored continuous vocational training programs in East Germany,” Journal of Human Resources, vol. 35, no. 2, pp. 347–375, 2000. 32 P. K. Bhattacharya and H.-G. Muller, “Asymptotics for nonparametric regression,” Sankhya. The Indian ¨ Journal of Statistics. Series A, vol. 55, no. 3, pp. 420–441, 1993. 33 P. K. Bhattacharya, “On an analog of regression analysis,” Annals of Mathematical Statistics, vol. 34, pp. 1459–1473, 1963. 34 J. J. Heckman and V. Hotz, “Choosing among alternative nonexperimental methods for estimating the impact of social programs: the case of manpower training,” Journal of the American Statistical Association, vol. 84, no. 408, pp. 862–874, 1989. 35 R. H. Dehejia and S. Wahba, “Causal effects in nonexperimental studies: reevaluating the evaluation of training programs,” Journal of the American Statistical Association, vol. 94, no. 448, pp. 1053–1062, 1999. 36 R. H. Dehejia and S. Wahba, “Propensity score-matching methods for nonexperimental causal studies,” Review of Economics and Statistics, vol. 84, no. 1, pp. 151–161, 2002. 37 J. J. Heckman, H. Ichimura, and P. Todd, “Matching as an econometric evaluation estimator,” Review of Economic Studies, vol. 65, no. 2, pp. 261–294, 1998.
Hindawi Publishing Corporation Journal of Probability and Statistics Volume 2011, Article ID 718647, 39 pages doi:10.1155/2011/718647
Research Article Estimation and Properties of a Time-Varying GQARCH(1,1)-M Model Sofia Anyfantaki and Antonis Demos Athens University of Economics and Business, Department of International and European Economic Studies, 10434 Athens, Greece Correspondence should be addressed to Antonis Demos,
[email protected] Received 16 May 2011; Accepted 14 July 2011 Academic Editor: Mike Tsionas Copyright q 2011 S. Anyfantaki and A. Demos. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Time-varying GARCH-M models are commonly used in econometrics and financial economics. Yet the recursive nature of the conditional variance makes exact likelihood analysis of these models computationally infeasible. This paper outlines the issues and suggests to employ a Markov chain Monte Carlo algorithm which allows the calculation of a classical estimator via the simulated EM algorithm or a simulated Bayesian solution in only OT computational operations, where T is the sample size. Furthermore, the theoretical dynamic properties of a time-varying GQARCH1,1-M are derived. We discuss them and apply the suggested Bayesian estimation to three major stock markets.
1. Introduction Time series data, emerging from diverse fields appear to possess time-varying second conditional moments. Furthermore, theoretical results seem to postulate quite often, specific relationships between the second and the first conditional moment. For instance, in the stock market context, the first conditional moment of stock market’s excess returns, given some information set, is a possibly time-varying, linear function of volatility see, e.g., Merton 1, Glosten et al. 2. These have led to modifications and extensions of the initial ARCH model of Engle 3 and its generalization by Bollerslev 4, giving rise to a plethora of dynamic heteroscedasticity models. These models have been employed extensively to capture the time variation in the conditional variance of economic series, in general, and of financial time series, in particular see Bollerslev et al. 5 for a survey. Although the vast majority of the research in conditional heteroscedasticity is being processed aiming at the stylized facts of financial stock returns and of economic time series
2
Journal of Probability and Statistics
in general, Arvanitis and Demos 6 have shown that a family of time-varying GARCH-M models can in fact be consistent with the sample characteristics of time series describing the temporal evolution of velocity changes of turbulent fluid and gas molecules. Despite the fact that the latter statistical characteristics match in a considerable degree their financial analogues e.g., leptokurtosis, volatility clustering, and quasi long-range dependence in the squares are common, there are also significant differences in the behavior of the before mentioned physical systems as opposed to financial markets examples are the anticorrelation effect and asymmetry of velocity changes in contrast to zero autocorrelation and the leverage effect of financial returns see Barndorff-Nielsen and Shephard 7 as well as Mantegna and Stanley 8, 9. It was shown that the above-mentioned family of models can even create anticorrelation in the means as far as an AR1 time-varying parameter is introduced. It is clear that from an econometric viewpoint it is important to study how to efficiently estimate models with partially unobserved GARCH processes. In this context, our main contribution is to show how to employ the method proposed in Fiorentini et al. 10 to achieve MCMC likelihood-based estimation of a time-varying GARCH-M model by means of feasible OT algorithms, where T is the sample size. The crucial idea is to transform the GARCH model in a first-order Markov’s model. However, in our model, the error term enters the inmean equation multiplicatively and not additively as it does in the latent factor models of Fiorentini et al. 10. Thus, we show that their method applies to more complicated models, as well. We prefer to employ a GQARCH specification for the conditional variance Engle 3 and Sentana 11 since it encompasses all the existing restricted quadratic variance functions e.g., augmented ARCH model, its properties are very similar to those of GARCH models e.g., stationarity conditions but avoids some of their criticisms e.g., very easy to generalize to multivariate models. Moreover, many theories in finance involve an explicit tradeoff between the risk and the expected returns. For that matter, we use an in-mean model which is ideally suited to handling such questions in a time series context where the conditional variance may be time varying. However, a number of studies question the existence of a positive mean/variance ratio directly challenging the mean/variance paradigm. In Glosten et al. 2 when they explicitly include the nominal risk free rate in the conditioning information set, they obtain a negative ARCH-M parameter. For the above, we allow the conditional variance to affect the mean with a possibly time varying coefficient which we assume for simplicity that it follows an AR1 process. Thus, our model is a time-varying GQARCH-MAR1 model. As we shall see in Section 2.1, this model is able to capture the, so-called, stylized facts of excess stock returns. 
These are i the sample mean is positive and much smaller than the standard deviation, that is, high coefficient of variation, ii the autocorrelation of excess returns is insignificant with a possible exception of the 1st one, iii the distribution of returns is nonnormal mainly due to excess kurtosis and may be asymmetry negative, iv there is strong volatility clustering, that is, significant positive autocorrelation of squared returns even for high lags, and v the so-called leverage effect; that is, negative errors increase future volatility more than positive ones of the same size. The structure of the paper is as follows. In Section 2, we present the model and derive the theoretical properties the GQARCH1,1-M-AR1 model. Next, we review Bayesian and classical likelihood approaches to inference for the time-varying GQARCHM model. We show that the key task in both cases is to be able to produce consistent simulators and that the estimation problem arises from the existence of two unobserved
Journal of Probability and Statistics
3
processes, causing exact likelihood-based estimations computationally infeasible. Hence, we demonstrate that the method proposed by Fiorentini et al. 10 is needed to achieve a first-order Markov’s transformation of the model and thus, reducing the computations from OT 2 to OT. A comparison of the efficient OT calculations and the inefficient OT 2 ones simulator is also given. An illustrative empirical application on weekly returns from three major stock markets is presented in Section 4, and we conclude in Section 5.
2. GQARCH(1,1)-M-AR(1) Model The definition of our model is as follows. Definition 2.1. The time-varying parameter GQARCH1,1-M-AR1 model:
rt δt ht εt ,
εt zt h1/2 t ,
2.1
where δt 1 − ϕ δ ϕδt−1 ϕu ut ,
2.2
2 ht ω α εt−1 − γ βht−1 ,
2.3
zt Ú i.i.d. N0, 1, ut Ú i.i.d. N0, 1, ut , zt are independent for all t s, and where {rt }Tt1 are the observed excess returns, T is the sample size, {δt }Tt1 is an unobserved AR1 process independent with δ0 δ of {εt }Tt1 , and {ht }Tt1 is the conditional variance with h0 equal to the unconditional variance and ε0 0 which is supposed to follow a GQARCH1,1. It is obvious that δt is the market price of risk see, e.g., Merton 1 Glosten at al. 2. Let us call Ft−1 the sequence of natural filtrations generated by the past values of {εt } and {rt }. Modelling the theoretical properties of this model has been a quite important issue. Specifically, it would be interesting to investigate whether this model can accommodate the main stylized facts of the financial markets. On other hand, the estimation of the model requires its transformation into a first-order Markov’s model to implement the method of Fiorentini et al. 10. Let us start with the theoretical properties.
2.1. Theoretical Properties Let us consider first the moments of the conditional variance ht , needed for the moments of rt . The proof of the following lemma is based on raising ht to the appropriate power, in 2.3, and taking into account that Ez4t 3, Ez6t 15 and Ez8t 105.
4
Journal of Probability and Statistics
Lemma 2.2. If 105α4 60α3 β 18α2 β2 4αβ3 β4 < 1, the first four moments of the conditional variance of 2.3 are given by ω αγ 2 , 1− α β ω αγ 2 1 α β 4α2 γ 2 2 2 , E ht ω αγ 1 − 3α2 − 2αβ − β2 1 − α − β Eht
ω αγ 2 3 3ω αγ 2 ω αγ 2 α β 4α2 γ 2 Eht E h3t 1 − β3 − 15α3 − 9α2 β − 3αβ2 3 ω αγ 2 3α2 β2 2βα 4α2 γ 2 3α β E h2t , 1 − β3 − 15α3 − 9α2 β − 3αβ2 2 ω αγ 2 2 4ω αγ 2 α β 6α2 γ 2 Eht 4 2 E ht ω αγ 1 − 105α4 − 60α3 β − 18α2 β2 − 4αβ3 − β4 2 ω αγ 2 3α2 β2 2βα 8 ω αγ 2 3α β α2 γ 2 α2 γ 2 2 6 E ht 1 − 105α4 − 60α3 β − 18α2 β2 − 4αβ3 − β4 ω αγ 2 15α3 9α2 β 3αβ2 β3 6α2 γ 2 15α2 6αβ β2 3 4 E ht . 1 − 105α4 − 60α3 β − 18α2 β2 − 4αβ3 − β4
2.4
Consequently, the moments of rt are given in the following theorem taken from Arvanitis and Demos 6. Theorem 2.3. The first two moments of the model in 2.1, 2.2, and 2.3 are given by
Ert δEht ,
E rt2
ϕ2u δ 1 − ϕ2 2
E h2t Eht ,
2.5
whereas the skewness and kurtosis coefficients are Skrt
Srt 1.5
Var rt
,
kurtrt
κ Var2
rt ,
where ϕ2u 3 E h Srt δ δ 3 3δE h2 2δ3 E3 ht t 1 − ϕ2
ϕ2u 2 2 − 3δ δ E ht Eht Eht , 1 − ϕ2
2
2.6
Journal of Probability and Statistics 2
ϕ4u 4 2 ϕu 4 2 2 E3 ht E h 3δ 2 − δ 3 Eh κ δ 6δ t t 2 1 − ϕ2 1 − ϕ2
3ϕ2u 3 2 2 E ht − 4δ δ E h3t Eht 2 1−ϕ ϕ2u 2 δ Eh − 2 Eh 3 E h2t , 1 − ϕ2
ϕ2u 6 δ 1 − ϕ2 2
6δ2
5
2.7 and Eht , Eh2t , Eh3t , and Eh4t are given in Lemma 2.2. In terms of stationarity, the process {rt } is 4th-order stationary if and only if ϕ < 1,
105α4 60α3 β 18α2 β2 4αβ3 β4 < 1.
2.8
These conditions are the same as in Arvanitis and Demos 6, indicating that the presence of the asymmetry parameter, γ , does not affect the stationarity conditions see also Bollerslev 4 and Sentana 11. Furthermore, the 4th-order stationarity is needed as we would like to measure the autocorrelation of the squared rt ’s volatility clustering, as well as the correlation of rt2 and rt−1 leverage effect. The dynamic moments of the conditional variance and those between the conditional variance ht and the error εt are given in the following two lemmas for a proof see Appendix B. Lemma 2.4. Under the assumption of Lemma 2.2, one has that k Covht , ht−k α β V ht ,
α βk − Ak 2 k 3 2 BV ht , Cov ht , ht−k A E ht − E ht Eht α β −A
k Cov ht , h2t−k α β E h3t−1 − E h2t−1 Eht−1 , k
α β − Ak 3 2 2 k 2 Cov ht , ht−k A V ht B E ht − E h2t Eht , α β −A where A 3α2 β2 2αβ and B 22α2 γ 2 ω αγ 2 α β. Lemma 2.5. k−1 Covht , εt−k Eht εt−k −2αγ α β Eht , k−1 2 E ht−1 , Covht ht−k , εt−k Eht ht−k εt−k −2αγ α β k k−1 2 2 α β V ht 2α α β E ht , Cov ht , εt−k
2.9
6
Journal of Probability and Statistics Cov h2t , εt−k E h2t εt−k −4αγ 3α β Ak−1 E h2t − 4αγ
k−1 α β k − Ak α β − Ak−1 2 2 2α γ Eht , ω αγ α β −A α β −A 2
k
α β − Ak 3 2 A E ht −E ht Eht B V ht 4α 3α β Ak−1 E h3t Cov α β −A
k−1 α β − Ak−1 2 2 2 k−1 E h2t , A 4 2α γ ω αγ 2αB α β −A Cov h2t ht−k , εt−k E h2t ht−k εt−k −4αγ Ak−1 3α β E h3t ω αγ 2 E h2t
2 h2t , εt−k
k
k−1 α β − Ak−1 2 E ht , − 2αγ B α β −A 2.10 where A and B are given in Lemma 2.4. Furthermore, from Arvanitis and Demos 6 we know that the following results hold. Theorem 2.6. The autocovariance of returns for the model in 2.1–2.3 is given by γk Covrt , rt−k δ2 Covht , ht−k δEht εt−k ϕk
ϕ2u Eht ht−k , 1 − ϕ2
2.11
and the covariance of squares’ levels and the autocovariance of squares are Cov rt2 , rt−k E δt2 δt−k Cov h2t , ht−k Cov δt2 , δt−k E h2t Eht−k Eδt−k Covht , ht−k E δt2 E h2t εt−k Eht εt−k , 2 2 Cov rt2 , rt−k E2 δt2 Cov h2t , h2t−k Cov δt2 , δt−k E h2t h2t−k
2.12
2 E δt2 Cov h2t , εt−k 2E δt2 δt−k E h2t ht−k εt−k 2 2δEht ht−k εt−k , E δt2 Cov ht , h2t−k Cov ht , εt−k where all needed covariances and expectations of the right-hand sizes are given in Lemmas 2.4 and 2.5, ϕ2u 2 , 2ϕk δ Cov δt2 , δt−k Cov δt , δt−k 1 − ϕ2 Cov
2 δt2 , δt−k
ϕ2u ϕ4u 2k 4ϕ δ 2ϕ 2 . 1 − ϕ2 1 − ϕ2 k 2
2.13
Journal of Probability and Statistics
7
From the above theorems and lemmas it is obvious that our model can accommodate all stylized facts. For example, negative asymmetry is possible; volatility clustering and the leverage effect negative Covht , εt−k can be accommodated, and so forth. Furthermore, the model can accommodate negative autocorrelations, γk , something that is not possible for the GARCH-M model see Fiorentini and Sentana 12. Finally, another interesting issue is the diffusion limit of the time-varying GQARCH-M process. As already presented by Arvanitis 13, the weak convergence of the Time-varying GQARCH1,1-M coincides with the general conclusions presented elsewhere in the literature. These are that weak limits of the endogenous volatility models are exogenous stochastic volatility continuous-time processes. Moreover, Arvanitis 13 suggests that there is a distributional relation between the GQARCH model and the continuous-time Ornstein-Uhlenbeck models with respect to appropriate nonnegative Levy’s processes. Let us turn our attention to the estimation of our model. We will show that estimating our model is a hard task and the use of well-known methods such as the EM-algorithm cannot handle the problem due to the huge computational load that such methods require.
3. Likelihood-Inference: EM and Bayesian Approaches The purpose of this section is the estimation of the time-varying GQARCH1,1-M model. Since our model involves two unobserved components one from the time-varying in-mean parameter and one from the error term, the estimation method required is an EM and more specifically a simulated EM SEM, as the expectation terms at the E step cannot be computed. The main modern way of carrying out likelihood inference in such situations is via a Markov chain Monte Carlo MCMC algorithm see Chib 14 for an extensive review. This simulation procedure can be used either to carry out Bayesian inference or to classically estimate the parameters by means of a simulated EM algorithm. The idea behind the MCMC methods is that in order to sample a given probability distribution, which is referred to as the target distribution, a suitable Markov chain is constructed using a Metropolis-Hasting M-H algorithm or a Gibbs sampling method with the property that its limiting, invariant distribution is the target distribution. In most problems, the target distribution is absolutely continuous, and as a result the theory of MCMC methods is based on that of the Markov chains on continuous state spaces 15. This means that by simulating the Markov chain a large number of times and recording its values a sample of correlated draws from the target distribution can be obtained. It should be noted that the Markov chain samplers are invariant by construction, and, therefore, the existence of the invariant distribution does not have to be checked in any particular application of MCMC method. The Metropolis-Hasting algorithm M-H is a general MCMC method to produce sample variates from a given multivariate distribution. It is based on a candidate generating density that is used to supply a proposal value that is accepted with probability given as the ratio of the target density times the ratio of the proposal density. There are a number of choices of the proposal density e.g., random walk M-H chain, independence M-H chain, tailored MH chain and the components may be revised either in one block or in several blocks. Another MCMC method, which is special case of the multiple block M-H method with acceptance rate always equal to one, is called the Gibbs sampling method and was brought into statistical prominence by Gelfand and Smith 16. In this algorithm, the parameters are grouped into blocks, and each block is sampled according to the full conditional distribution denoted as
8
Journal of Probability and Statistics
πφt /φ/t . By Bayes’ theorem, we have πφt /φ/t ∝ πφt φ/t , the joint distribution of all blocks, and so full conditional distributions are usually quite simply derived. One cycle of p the Gibbs sampling algorithm is completed by simulating {φt }t1 , where p is the number of blocks, from the full conditional distributions, recursively updating the conditioning variables as one moves through each distribution. Under some general conditions, it is verified that the Markov chain generated by the M-H or the Gibbs sampling algorithm converges to the target density as the number of iterations becomes large. Within the Bayesian framework, MCMC methods have proved very popular, and the posterior distribution of the parameters is the target density see 17. Another application of the MCMC is the analysis of hidden Markov’s models where the approach relies on augmenting the parameter space to include the unobserved states and simulate the target distribution via the conditional distributions this procedure is called data augmentation and was pioneered by Tanner and Wong 18. Kim et al. 19 discuss an MCMC algorithm of the stochastic volatility SV model which is an example of a state space model in which the state variable ht log-volatility appears non-linearly in the observation equation. The idea is to approximate the model by a conditionally Gaussian state space model with the introduction of multinomial random variables that follow a seven-point discrete distribution. The analysis of a time-varying GQARCH-M model becomes substantially complicated since the log-likelihood of the observed variables can no longer be written in closed form. In this paper, we focus on both the Bayesian and the classical estimation of the model. Unfortunately, the non-Markovian nature of the GARCH process implies that each time we simulate one error we implicitly change all future conditional variances. As pointed out by Shephard 20, a regrettable consequence of this path dependence in volatility is that standard MCMC algorithms will evolve in OT 2 computational load see 21. Since this cost has to be borne for each parameter value, such procedures are generally infeasible for large financial datasets that we see in practice.
3.1. Estimation Problem: Simulated EM Algorithm As mentioned already, the estimation problem arises because of the fact that we have two unobserved processes. More specifically, we cannot write down the likelihood function in closed form since we do not observe both εt and δt . On the other hand, the conditional loglikelihood function of our model assuming that δt were observed would be the following: r, δ | φ, F0 ln p r | δ, φ, F0 ln p δ | φ, F0 −T ln 2π −
T T 1 1 εt 2 ln ht − 2 t1 2 t1 ht
3.1
2 T δt − δ 1 − ϕ − ϕδt−1 T 2 1 − ln ϕu − , 2 2 t1 ϕ2u where r r1 , . . . , rT , δ δ1 , . . . , δT , and h h1 , . . . , hT . However, the δt ’s are unobserved, and, thus, to classically estimate the model, we have to rely on an EM algorithm 22 to obtain estimates as close to the optimum as desired. At each iteration, the EM algorithm obtains φn 1 , where φ is the parameter vector, by maximizing
Journal of Probability and Statistics
9
the expectation of the log-likelihood conditional on the data and the current parameter values, that is, E · | r, φn , F0 with respect to φ keeping φn fixed. The E step, thus, requires the expectation of the complete log-likelihood. For our model, this is given by: T T 1 E ln ht | r, φn , F0 E · | r, φn , F0 −T ln 2π − ln ϕ2u − 2 2 t1 T 1 εt 2 n − E | r, φ , F0 2 t1 ht T 1 − E 2 t1
2 δt − δ 1 − ϕ − ϕδt−1 ϕ2u
3.2
| r, φ
n
, F0 .
It is obvious that we cannot compute such quantities. For that matter, we may rely on a simulated EM where the expectation terms are replaced by averages over simulations, and so we will have an SEM or a simulated score. The SEM log-likelihood is: 2 i 2 M M T T εt T 1 1 1 1 T 1 − ϕ δ2 i 2 SEM −T ln 2π − ln ϕu − ln ht − − 2 2 M i1 t1 2 M i1 t1 hi 2 ϕ2u t M M M T T T 1−ϕ δ 1 ϕ 1 1 1 1 i 2 i i i δ − δ δt δt−1 2 ϕ2u M i1 t1 t M i1 t1 t ϕ2u ϕ2u M i1 t1
M M T T 1 − ϕ ϕδ 1 ϕ2 1 i i 2 δt−1 . δt−1 − 2 2 M i1 t1 ϕu 2ϕu M i1 t1
3.3
T i Consequently, we need to obtain the following quantities: 1/M M i1 t1 ln ht , 1/M 2 M T i M T i M T i i M T i i i1 t1 εt /ht , 1/M i1 t1 δt , 1/M i1 t1 δt−1 , 1/M i1 t1 δt δt−1 and M T M T 2i i 2 1/M i1 t1 δt , 1/M i1 t1 δt−1 , where M is the number of simulations. Thus, to classically estimate our model by using an SEM algorithm, the basic problem is to sample from h | φ, r, F0 where φ is the vector of the unknown parameters and also sample from δ | φ, r, F0 . In terms of identification, the model is not, up to second moment, identified see Corollary 1 in Sentana and Fiorentini 23. The reason is that we can transfer unconditional variance from the error, εt , to the price of risk, δt , and vice versa. One possible solution is to fix ω such that Eht is 1 or to set ϕu to a specific value. In fact in an earlier version of the paper, we fixed ϕu to be 1 see Anyfantaki and Demos 24. Nevertheless, from a Bayesian viewpoint, the lack of identification is not too much of a problem, as the parameters are identified through their proper priors see Poirier 25. Next, we will exploit the Bayesian estimation of the model, and, since we need to resort to simulations, we will show that the key task is again to simulate from δ | φ, r, F0 .
10
Journal of Probability and Statistics
3.2. Simulation-Based Bayesian Inference In our problem, the key issue is that the likelihood function of the sample pr | φ, F0 is intractable which precludes the direct analysis of the posterior density pφ | r, F0 . This problem may be overcome by focusing instead on the posterior density of the model using Bayes’ rule: p φ, δ | r ∝ p φ, δ p r | φ, δ ∝ p δ | φ p φ p r | φ, δ ,
3.4
φ δ, ϕ, ϕu , α, β, γ, ω .
3.5
where
Now, 2 T T δt − δ 1 − ϕ − ϕδt−1 1 δt 2 , ϕ, ϕu p δ|φ p exp − . δ 2ϕ2u t1 t1 2πϕ2
3.6
u
On the other hand, 2 T T ε rt 1 p r | φ, δ p exp − t , δ, φ √ 2h {r } t−1 t 2πht t1 t1
3.7
is the full-information likelihood. Once we have the posterior density, we get the parameters’ marginal posterior density by integrating the posterior density. MCMC is one way of numerical integration. The Hammersley-Clifford theorem see Clifford 26 says that a joint distribution can be characterized by its complete conditional distribution. Hence, given initial values {δt }0 , φ0 , we draw {δt }1 from p{δt }1 | r, φ0 and then φ1 from pφ1 | {δt }1 , r. M
Iterating these steps, we finally get {δt }i , φi i1 , and under mild conditions it is shown that the distribution of the sequence converges to the joint posterior distribution pφ, δ | r. The above simulation procedure may be carried out by first dividing the parameters into two blocks: φ1 δ, ϕ, ϕ2u , φ2 α, β, γ, ω ,
3.8
Then the algorithm is described as follows. 1 Initialize φ. 2 Draw from pδt | δ / t , r, φ. 3 Draw from pφ | δ, r in the following blocks: i draw from pφ1 | δ, r using the Gibbs sampling. This is updated in one block;
Journal of Probability and Statistics
11
ii draw from pφ2 | r by M-H. This is updated in a second block. 4 Go to 2. We review the implementation of each step.
3.2.1. Gibbs Sampling The task of simulating from an AR model has been already discussed. Here, we will follow the approach of Chib 27, but we do not have any MA terms which makes inference simpler. Suppose that the prior distribution of δ, ϕ2u , ϕ is given by: p δ, ϕ2u , ϕ p δ | ϕ2u p ϕ2u p ϕ ,
3.9
which means that δ, ϕ2u is a priory independent of ϕ. Also the following holds for the prior distributions of the parameter subvector φ1 : p δ | ϕ2u ∼ N δpr , ϕ2u σδ2pr , v0 d0 , p ∼ IG , 2 2 p ϕ ∼ N ϕ0 , σϕ20 Iϕ ,
ϕ2u
3.10
where Iϕ ensures that ϕ lies outside the unit circle, IG is the inverted gamma distribution, and the hyperparameters v0 , d0 , δpr , σδ2pr , ϕ0 , σϕ20 have to be defined. Now, the joint posterior is proportional to
p
δ, ϕ, ϕ2u
| r, δ ∝
T t1
1
exp − 2πϕ2u
2 δt − 1 − ϕ δ − ϕδt−1 2ϕ2u
v0 d0 , × N ϕ0 , σϕ20 Iϕ . × N δpr , ϕ2u σδ2pr × IG 2 2
3.11
From a Bayesian viewpoint, the right-hand side of the above equation is equal to the “augmented” prior, that is, the prior augmented by the latent δ We would like to thank the associate editor for bringing this to our attention. We proceed to the generation of these parameters.
Generation of δ First we see how to generate δ. Following again Chib 27, we may write δt∗ δt − ϕδt−1 ,
δt∗ | Ft−1 ∼ N
1 − ϕ δ, ϕ2u ,
3.12
12
Journal of Probability and Statistics
or, otherwise, δt∗ 1 − ϕ δ vt ,
vt ∼ N 0, ϕ2u .
3.13
Under the above and using Chib’s 1993 notation, we have that the proposal distribution is the following Gaussian distribution see Chib 27 for a proof. Proposition 3.1. The proposal distribution of δ is ϕ2 σ 2 , δ | δ, φ, ϕ2u ∼ N δ, u δ
3.14
where ⎛ δ
σδ2 ⎝
δpr σδ2pr
⎛ σδ2 ⎝
⎞ T ∗⎠ 1−ϕ δt , t1
⎞−1
3.15
2 1 1−ϕ ⎠ . 2 σδpr
Hence, the generation of δ is completed, and we may turn on the generation of the other parameters.
Generation of ϕ2u For the generation of ϕ2u and using 27 notation, we have the following. Proposition 3.2. The proposal distribution of ϕ2u is T − v0 d0 Q d , , ϕ2u | δ, φ, δ ∼ IG 2 2
3.16
where 2 Q δ − δpr σδ−2pr , d
T 2 δt∗ − δ 1 − ϕ . t2
Finally, we turn on the generation of ϕ.
3.17
Journal of Probability and Statistics
13
Generation of ϕ For the generation of ϕ, we follow again Chib 27 and write δt 1 − ϕ δ − ϕδt−1 vt ,
vt ∼ N 0, ϕ2u .
3.18
We may now state the following proposition see Chib 27 for a proof. Proposition 3.3. The proposal distribution of ϕ is ϕ2 | δ, δ, ϕ2u ∼ N ϕ, σϕ2 ,
3.19
where ϕ
σ ϕ2
σϕ−20 ϕ0
ϕ−2 u
T
δt−1 − δδt − δ ,
t1
σϕ−2
σϕ−20
ϕ−2 u
T δt−1 − δ2 .
3.20
t1
The Gibbs sampling scheme has been completed, and the next step of the algorithm requires the generation of the conditional variance parameters via an M-H algorithm which is now presented.
3.2.2. Metropolis-Hasting Step 3-ii is the task of simulating from the posterior of the parameters of a GQARCHM process. This has been already addressed by Kim et al. 19, Bauwens and Lubrano 28, Nakatsuma 29, Ardia 30 and others. First, we need to decide on the priors. For the parameters α, β, γ, ω, we use normal densities as priors: pα ∼ N μα , Σα Iα , p β ∼ N μβ , σβ2 Iβ
3.21
p γ ∼ N μγ , σγ2 ,
3.22
and similarly
where α ω, α , Iα , Iβ are the indicators ensuring the constraints α > 0, α β < 1 and β > 0, α β < 1, respectively. μ. , σ.2 are the hyperparameters.
14
Journal of Probability and Statistics
We form the joint prior by assuming prior independence between α, β, γ , and the joint posterior is then obtained by combining the joint prior and likelihood function by Bayes’ rule: 2 T ε 1 × N μα , Σα Iα × N μβ , σβ2 Iβ × N μγ , σγ2 . exp − t p α, β, γ | r ∝ 2ht 2πht t1
3.23
For the M-H algorithm, we use the following approximated GARCH model as in Nakatsuma 29 which is derived by the well-known property of GARCH models 4: 2 2 εt2 ω α εt−1 − γ βεt−1 wt − βwt−1 ,
3.24
where wt εt2 − ht with wt ∼ N0, 2h2t . Then the corresponding approximated likelihood is written as
p ε2 | δ, r, φ2
T t1
2 ⎤ 2 2 2 − ω − α ε − γ − βε βw ε t−1 t−1 t t−1 1 ⎥, ⎢ √ exp⎣− ⎦ 2ht π 4h2t ⎡
3.25
and the generation of α, β, γ is based on the above likelihood where we update {ht } each time after the corresponding parameters are updated. The generation of the four variance parameters is given.
Generation of α For the generation of α, we first note that wt in 3.32, below, can be written as a linear function of α: wt ε2t − ζt α,
3.26
ιt , ε#t2 with where ζt 2 ε2t εt2 − β εt−1 , 2 εt2 εt2 β εt−1 ,
2 2 ε#t2 εt−1 − γ β# εt−1 ,
3.27
ιt 1 βιt−1 . Now, let the two following vectors be
Yα ε21 , . . . , ε2T , Xα ζ1 , . . . , ζT .
3.28
Journal of Probability and Statistics
15
Then the likelihood function of the approximated model is rewritten as
p ε2 | r, δ, φ2
T t1
2 ⎤ 2 ε − ζ α t t 1 ⎥ ⎢ 2 ⎦. exp⎣− 2 2ht 2π 2h2t ⎡
3.29
Using this we have the following proposal distribution of α see Nakatsuma 29 or Ardia 30 for a proof. Proposition 3.4. The proposal distribution of α is # α Iα , α | Y, X, Σ, φ2−α ∼ N μ #α , Σ
3.30
2 2 −1 −1 −1 # α Xα Λ−1 Yα Σ−1 # where μ #α Σ α μα , Σα Xα Λ Xα Σα , and Λ diag2h1 , . . . , 2hT . Iα imposes the restriction that α > 0 and α β < 1. Hence a candidate α is sampled from this proposal density and accepted with probability:
, β, γ, δ, r p α , β, γ | δ, r q α∗ | α min ∗ , 1 , p α , β, γ | δ, r q α | α∗ , β, γ, δ, r
3.31
where α∗ is the previous draw. Similar procedure is used for the generation of β and γ .
Generation of β Following Nakatsuma 29, we linearize wt by the first-order Taylor expansion wt β ≈ wt β∗ ξt β∗ β − β∗ ,
3.32
where ξt is the first-order derivative of wt β evaluated at β∗ the previous draw of the M-H sampler. Define as r t wt β ∗ gt β ∗ β ∗ ,
3.33
where gt β∗ −ξt β∗ which is computed by the recursion: 2 gt εt−1 − wt−1 β∗ gt−1 ,
3.34
wt β ≈ rt − gt β β.
3.35
ξt 0 for t ≤ 0 30. Then,
16
Journal of Probability and Statistics Let the following two vectors be Yβ r1 , . . . , rT , Xβ g1 , . . . , gT .
3.36
Then, the likelihood function of the approximated model is rewritten as
p ε | δ, r, φ2 2
T t1
$ ∗ %2 wt β ξt β∗ β − β∗ 1 . exp − 2 2h2t 2π 2h2t
3.37
We have the following proposal distribution for β for a proof see Nakatsuma 29 or Ardia 30. Proposition 3.5. The proposal distribution for β is β | Y, X, σβ2 , φ2−β ∼ N μ # β , σ#β2 Iβ ,
3.38
−1
where μ#β σ#β2 Xβ Λ−1 Yβ μβ /σβ2 , σ#β2 Xβ Λ−1 Xβ 1/σβ2 , and Λ diag2h41 , . . . , 2h4T . Iβ imposes the restriction that β > 0 and α β < 1. Hence, a candidate β is sampled from this proposal density and accepted with probability: ⎫ ⎧ α, γ | δ, r q β∗ | β, α, γ, δ, r ⎨ p β, ⎬ min , 1 ⎩ p β∗ , α, γ | δ, r q β | β∗ , α, γ, δ, r ⎭
3.39
Finally, we explain the generation of γ .
Generation of γ As with β, we linearize wt by a first-order Taylor, expansion at a point γ ∗ the previous draw in the M-H sampler. In this case, r t wt γ ∗ − g t γ ∗ γ ∗ ,
3.40
where gt γ ∗ −ξt γ ∗ which is computed by the recursion: gt −2α εt−1 − γ ∗ βgt−1 ,
3.41
wt γ ≈ rt − gt γ.
3.42
and gt 0 for t ≤ 0. Then
Journal of Probability and Statistics
17
Let again Yγ r1 , . . . , rT , Xγ g1 , . . . , gT
3.43
and the likelihood function of the approximated model is rewritten as
p ε | δ, r, φ2 2
T t1
$ ∗ %2 wt γ − g t γ ∗ γ ∗ 1 exp − . 2 2h2t 2π 2h2t
3.44
Thus, we have the following proposal distribution for γ for proof see Nakatsuma 29 and Ardia 30. Proposition 3.6. The proposal distribution of γ is #γ , σ # γ2 , γ | Y, X, σγ2 , φ2−γ ∼ N μ
3.45
−1
where μ#γ σ#γ2 Xγ Λ−1 Yγ μγ /σγ2 , σ#γ2 Xγ Λ−1 Xγ 1/σγ2 , and Λ diag2h41 , . . . , 2h4T . A candidate γ is sampled from this proposal density and accepted with probability: p γ , α, β | δ, r q γ ∗ | γ , α, β, δ, r min ∗ , 1 . p γ , α, β | δ, r q γ | γ ∗ , α, β, δ, r
3.46
The algorithm described above is a special case of a MCMC algorithm, which converges as it iterates, to draws from the required density pφ, δ | r. Posterior moments and marginal densities can be estimated simulation consistently by averaging the relevant function of interest over the sample variates. The posterior mean of φ is simply estimated by the sample mean of the simulated φ values. These estimated values can be made arbitrarily accurate by increasing the simulation sample size. However, it should be remembered that sample variates from an MCMC algorithm are a high dimensional correlated sample from the target density, and sometimes the serial correlation can be quite high for badly behaved algorithms. All that remains, therefore, is step 2. Thus, from the above, it is seen that the main task is again as with the classical estimation of the model, to simulate from δ | φ, r, F0 .
3.2.3. MCMC Simulation of ε | φ, r, F0 For a given set of parameter values and initial conditions, it is generally simpler to simulate {εt } for t 1, . . . , T and then compute {δt }Tt1 than to simulate {δt }Tt1 directly. For that matter, we concentrate on simulators of εt given r and φ. We set the mean and the variance of ε0 equal to their unconditional values, and, given that ht is a sufficient statistic for Ft−1 and the unconditional variance is a deterministic function of φ, F0 can be eliminated from the information set without any information loss.
18
Journal of Probability and Statistics
Now sampling from pε | r, φ ∝ pr | ε, φpε | φ is feasible by using an M-H algorithm where we update each time only one εt leaving all the other unchanged 20. In particular, let us write the nth iteration of a Markov chain as εtn . Then we generate a potential n , r, φ new value of the Markov chain εtnew by proposing from some candidate density gεt | ε\t n n 1 n 1 n n where ε\t {ε1 , . . . , εt−1 , εt 1 , . . . , εT } which we accept with probability ⎤ n n p εtnew | ε\t , r, φ g εtnew | ε\t , r, φ ⎥ ⎢ ⎦. min⎣1, n n n n p εt | ε\t , r, φ g εt | ε\t , r, φ ⎡
3.47
If it is accepted then, we set εtn 1 εtnew and otherwise we keep εtn 1 εtn . Although the proposal is much better since it is only in a single dimension, each time we consider modifying a single error we have to compute: n p εtnew | ε\t p rt | εtnew , hnew,t , r, φ , φ p εtnew | hnew,t , φ p rt | hn,t t t t ,φ n new p εtn | ε\t , r, φ p rt | hnew,t , φ p rt | εtn , hn,t | hn,t t t , φ p εt t ,φ new,t T p rs | εsr , hs , φ p εsr | hnew,t , φ p rs | hn,t s s ,φ ∗ new,t n,t n , φ p rs | εsn , hn,t st 1 p rs | hs s , φ p εs | hs , φ 3.48
, φ p εtnew | hnew,t ,φ p rt | εtnew , hnew,t t t new p rt | εtn , hn,t | hn,t t , φ p εt t ,φ new,t T p rs | εsn , hs , φ p εsn | hnew,t ,φ s ∗ , n,t n,t n n st 1 p rs | εs , hs , φ p εs | hs , φ where for s t 1, . . . , T, n n n n 1 hnew,t V εs | εs−1 , εs−2 , . . . , εt 1 , εtnew , εt−1 , . . . , ε1n 1 , s hn,t s
V εs |
n n n n 1 εs−1 , εs−2 , . . . , εt 1 , εtn , εt−1 , . . . , ε1n 1
3.49
while hnew,t hn,t t t .
3.50
Nevertheless, each time we revise one εt , we have also to revise T − t conditional depend variances because of the recursive nature of the GARCH model which makes hnew,t s upon εtnew for s t 1, . . . , T. And since t 1, . . . , T, it is obvious that we need to calculate T 2 normal densities, and so this algorithm is OT 2 . And this should be done for every φ. To
Journal of Probability and Statistics
19
avoid this huge computational load, we show how to use the method proposed by Fiorentini et al. 10 and so do MCMC with only OT calculations. The method is described in the following subsection.
3.3. Estimation Method Proposed: Classical and Bayesian Estimation The method proposed by Fiorentini et al. 10 is to transform the GARCH model into a firstorder Markov’s model and so do MCMC with only OT calculations. Following their transformation, we augment the state vector with the variables ht 1 and then sample the joint Markov process {ht 1 , st } | r, φ ∈ Ft where st sign εt − γ ,
3.51
so that st ±1 with probability one. The mapping is one to one and has no singularities. More specifically, if we know {ht 1 } and ϕ, then we know the value of
εt − γ
2
ht 1 − ω − βht α
∀t ≥ 1.
3.52
Hence the additional knowledge of the signs of εt − γ would reveal the entire path of {εt } so long as h0 which equals the unconditional value in our case is known, and, thus, we may now reveal also the unobserved random variable {δt } | r, φ, {ht 1 }. Now we have to sample from T p {st , ht 1 } | r, φ ∝ p st | ht 1 , ht , φ p ht 1 | ht , φ p rt | st , ht , ht 1 , φ ,
3.53
t1
where the second and the third term come from the model, and the first comes from the fact that εt | Ft−1 ∼ N0, ht but εt | {ht 1 }, Ft−1 takes values εt γ ± dt ,
3.54
where , dt
ht 1 − ω − βht . α
3.55
From the above, it is seen that we should first simulate {ht 1 } | r, φ since we do not alter the volatility process when we flip from st −1 to st 1 implying that the signs do not cause the volatility process, but we do alter εt and then simulate {st } | {ht 1 }, r, φ. The second step is a Gibbs sampling scheme whose acceptance rate is always one and also conditional on {ht 1 }, r, φ the elements of {st } are independent which further simplifies the calculations. We prefer to review first the Gibbs sampling scheme and then the simulation of the conditional variance.
20
Journal of Probability and Statistics
3.3.1. Simulations of {st } | {ht 1 }, r, φ First, we see how to sample from {st } | {ht 1 }, r, φ. To obtain the required conditionally Bernoulli distribution, we establish first some notation. We have the following see Appendix A:
εt|rt ,ht
γ dt − εt|rt ,ht γ − dt − εt|rt ,ht 1 ct √ ϕ ϕ , √ √ vt|rt ,ht vt|rt ,ht vt|rt ,ht 3.56 1 − ϕ2 rt − δht ϕ2u h2t Eεt | rt , ht , vt|rt ,ht Varεt | rt , ht 2 . ϕ2u ht 1 − ϕ2 ϕu ht 1 − ϕ2
Using the above notation, we see that the probability of drawing st 1 conditional on {ht 1 } is equal to the probability of drawing εt γ dt conditional on ht 1 , ht , rt , φ, where dt is given by 3.70, which is given by p st 1 | {ht 1 }, r, φ p εt γ dt | ht 1 , ht , rt, φ γ dt − εt/rr ,ht 1 √ ϕ . √ ct vt/rt ,ht vt/rt ,ht
3.57
Similarly for the probability of drawing st −1. Both of these quantities are easy to compute; for example, 2 γ dt − εt/rr ,ht 1 γ dt − εt/rr ,ht 1 exp − ϕ √ 3.58 √ √ vt/rt ,ht 2 vt/rt ,ht 2π and so we may simulate {st } | {ht 1 }, r, φ using a Gibbs sampling scheme. Specifically, since conditional on {ht 1 }, r, φ the elements of {st } are independent, we actually draw from the marginal distribution, and the acceptance rate for this algorithm is always one. The Gibbs sampling algorithm for drawing {st } | {ht 1 }, r, φ may be described as below. 0
0
1 Specify an initial value s0 s1 , . . . , sT . 2 Repeat for k 1, . . . , M. a Repeat for t 0, . . . , T − 1. i Draw sk 1 with probability, γ dt − εt/rr ,ht 1 , ϕ √ √ ct vt/rt ,ht vt/rt ,ht
3.59
and sk −1 with probability, 1−
γ dt − εt/rr ,ht 1 . ϕ √ √ ct vt/rt ,ht vt/rt ,ht
3 Return the values {s1 , . . . , sM }.
3.60
Journal of Probability and Statistics
21
3.3.2. Simulations of {ht 1 } | r, φ (Single Move Samplers) On the other hand, the first step involves simulating from {ht 1 } | r, φ. To avoid large dependence in the chain, we use an M-H algorithm where we simulate one ht 1 at a time leaving the others unchanged 20, 31. So if ht 1 n is the current value of the nth iteration of a Markov chain, then we draw a candidate value of the Markov chain hnew t 1 by proposing it from a candidate density proposal density ght 1 | hn/t 1 , r, φ where hn/t 1 n 1 n 1 n 1 n n {hn 1 ht 1 new with acceptance probability 1 , h2 , . . . , ht , ht 2 , . . . , hT 1 }. We set ht 1
n n n p hnew t 1 | h/t 1 , r, φ g ht 1 | h/t 1 , r, φ min 1, n , n p ht 1 | hn/t 1 , r, φ g hnew t 1 | h/t 1 , r, φ
3.61
where we have used the fact that p h | r, φ p h/t | r, φ p ht | h/t , r, φ .
3.62
However, we may simplify further the acceptance rate. More specifically, we have that p ht 1 | h/t 1 , r, φ ∝ p ht 2 | ht 1 , φ p ht 1 | ht , φ p rt 1 | ht 2 , ht 1 , φ p rt | ht 1 , ht , φ . 3.63 Now, since the following should hold: ht 1 ≥ ω βht
3.64
ht 1 ≤ β−1 ht 2 − ω,
3.65
and similarly
we have the support of the conditional distribution of ht 1 given that ht is bounded from below by ω βht , and the same applies to the distribution of ht 2 given ht 1 lower limit corresponds to dt 0 and the upper limit to dt 1 0. This means that the range of values of ht 1 compatible with ht and ht 2 in the GQARCH case is bounded from above and below; that is:
ht 1 ∈ ω βht , β−1 ht 2 − ω .
3.66
From the above, we understand that it makes sense to make the proposal to obey the support of the density, and so it is seen that we can simplify the acceptance rate by setting g ht 1 | h/t 1 , r, φ p ht 1 | ht , φ
3.67
appropriately truncated from above since the truncation from below will automatically be satisfied. But the above proposal density ignores the information contained in rt 1 , and so
22
Journal of Probability and Statistics
according to Fiorentini et al. 10 we can achieve a substantially higher acceptance rate if we propose from g ht 1 | h/t 1 , r, φ p ht 1 | rt , ht , φ .
3.68
A numerically efficient way to simulate ht 1 from pht 1 | rt , ht , φ is to sample an underlying Gaussian random variable doubly truncated by using an inverse transform method. More specifically, we may draw εt | rt , ht , φ ∼ N
1 − ϕ2 rt − δht ϕ2u ht 1 − ϕ2
,
ϕ2u h2t
ϕ2u ht 1 − ϕ2
3.69
doubly truncated so that it remains within the following bounds: εtnew ∈ γ − lt , γ lt ,
3.70
where , lt
ht 2 − ω − βω − β2 ht , βα
3.71
using an inverse transform method and then compute new 2 hnew − γ βht , t 1 ω α εt
3.72
new which in turn implies a real value for dt 1 ht 2 − ω − βhnew t 1 /α and so guarantees that lies within the acceptance bounds. hnew t 1 The inverse transform method to draw the doubly truncated Gaussian random variable first draws a uniform random number u ∼ U0, 1
3.73
and then computes the following: ⎞ 2 2 2 ⎜ γ − lt − 1 − ϕ rt − δht / ϕu ht 1 − ϕ ⎟ u 1 − uΦ⎝ ⎠ ϕ2u h2t / ϕ2u ht 1 − ϕ2 ⎛
⎞ 2 2 2 ⎜ γ lt − 1 − ϕ rt − δht / ϕu ht 1 − ϕ ⎟ uΦ⎝ ⎠. ϕ2u h2t / ϕ2u ht 1 − ϕ2 ⎛
3.74
Journal of Probability and Statistics
23
A draw is then given by εtnew Φ−1 u.
3.75
However, if the bounds are close to each other the degree of truncation is small the extra computations involved make this method unnecessarily slow, and so we prefer to use the accept-reject method where we draw
εtnew
| rt , ht , φ ∼ N
1 − ϕ2 rt − δht ϕ2u ht 1 − ϕ2
,
ϕ2u h2t
ϕ2u ht 1 − ϕ2
3.76
and accept the draw if γ − lt ≤ εtnew ≤ γ lt , and otherwise we repeat the drawing this method is inefficient if the truncation lies in the tails of the distribution. It may be worth assessing the degree of truncation first, and, depending on its tightness, choose one simulation method or the other. The conditional density of εtnew will be given according to the definition of a truncated normal distribution: new ε − εt/rr ,ht 1 ϕ t√ p εtnew | εtnew − γ ≤ lt , rt , ht , φ √ vt/rt ,ht vt/rt ,ht 0 / γ − lt − εt/rr ,ht −1 γ lt − εt/rr ,ht −Φ , × Φ √ √ vt/rt ,ht vt/rt ,ht
3.77
where Φ· is the cdf of the standard normal. By using the change of variable formula, we have that the density of hnew t 1 will be
new −1 , r p hnew | h ∈ ω βh , β − ω , h , φ h t t 2 t t t 1 t 1 / 0 γ lt − εt/rr ,ht γ − lt − εt/rr ,ht −1 × Φ − Φ . √ √ 2αdnew vt/rt ,ht vt/rt ,ht t ctnew
3.78
Using Bayes theorem we have that the acceptance probability will be
new p ht 2 | hnew t 1 , rt 1 , φ p rt 1 | ht 1 , φ min 1, . p rt 1 | hnt 1 , φ p ht 2 | hnt 1 , rt 1 , φ
3.79
Since the degree of truncation is the same for old and new, the acceptance probability will be
new n p rt 1 | hnew c dt 1 t 1 min 1, , t 1 n new n p rt 1 | ht 1 ct 1 dt 1
3.80
24
Journal of Probability and Statistics
where prt 1 | ht 1 is a mixture of two univariate normal densities so rt 1 | ht 1 ∼ N δht 1 ,
ϕ2u ht 1 1 ht 1 . 1 − ϕ2
3.81
Hence,
p rt 1 |
hnt 1
2 rt 1 − δhnt 1 exp − 2 , 2 ϕu / 1 − ϕ2 hnt 1 1 hnt 1
1
2π ϕ2u / 1 − ϕ2 hnt 1 1 hnt 1
3.82 and the acceptance probability becomes ⎡ ⎢ min⎣1,
hnt 1 hnew t 1
⎤ 3/2 n ht 2 − ω − βhnt 1 κ hnew ⎥ t 1 ⎦, n new κ hn ht 2 − ω − βht 1 t 1
3.83
where ⎡
⎛⎛
⎞2 ⎞⎤
⎜⎜ ⎟⎥ ⎢ ϕ2u i rt 1 − δhit 1 ⎟ ⎟ ⎟⎥ ⎜ ⎢ 2 ht 1 1 ⎜ i γ − ⎟ ⎜ ⎜ ⎟⎥ ⎢ 2 rt 1 − δht 1 1−ϕ ⎠ ⎟⎥ ⎜⎝ ⎢ ϕ2u i κ hit 1 exp⎢− − ⎜ ⎟⎥ h 1 ⎟⎥ ⎢ 2 t 1 ϕ2u i 2 ⎜ ϕ2u i 1 − ϕ ⎜ ⎟⎥ ⎢ 2 i 2 h h 1 h ⎝ ⎠⎦ ⎣ r i t 1 ht 2 − ω − βht 1 1 − ϕ2 t 1 1 − ϕ2 t 1 α ⎡ ⎞ ⎛ ⎤⎤ ⎡ ϕ2u i , h 1 ⎢ n i ⎥⎥ ⎜ ⎢ 1 − ϕ2 t 1 rt 1 − δhit 1 ⎟ ⎢ ⎟ ht 2 − ω − βht 1 ⎥ ⎥ ⎜ ⎢ 1 exp⎢ 2 γ − ⎥⎥ ⎟ ⎜ ⎢ ⎢ ⎠ ⎦⎥ ⎣ α ϕ2u i 2 ⎝ ϕ2u i ⎥ ⎢ h h 1 ⎥ ⎢ t 1 t 1 2 2 ⎥ ⎢ 1 − ϕ 1 − ϕ ×⎢ ⎤ ⎥ ⎞ ⎛ ⎡ ⎥. ⎢ 2 ϕu i ⎥ ⎢ , ⎥ ⎢ i n i ⎥ ⎟ ⎢ 1 − ϕ2 ht 1 1 ⎜ ⎢ rt 1 − δht 1 ⎟ ht 2 − ω − βht 1 ⎥ ⎥ ⎜ ⎢ exp⎢ ⎥ ⎥ ⎟ ⎜γ − ⎢ ⎢ ⎠ ⎦ ⎥ ⎣ ϕ2u i 2 ⎝ α ϕ2u i ⎦ ⎣ h h 1 1 − ϕ2 t 1 1 − ϕ2 t 1 3.84 Overall the MCMC of {ht 1 } | r, φ includes the following steps. 1 Specify an initial value {h0 }. 2 Repeat for n 1, . . . , M. a Repeat for t 0, . . . , T − 1.
Journal of Probability and Statistics
25
i Use an inverse transform method to simulate εtnew
| rt , ht , φ ∼ N
1 − ϕ2 rt − δht ϕ2u ht 1 − ϕ2
,
ϕ2u h2t ϕ2u ht 1 − ϕ2
3.85
doubly truncated. ii Calculate c 2 hnew βht . t 1 ω α εt − γ
3.86
Steps 2ai and 2aii are equivalent to draw ht 1 new ∼ p hnew t 1 | rt , ht , φ
3.87
appropriately truncated. iii Calculate new n p rt 1 | hnew c dt 1 t 1 αr min 1, t 1 n new . n c d p rt 1 | ht 1 t 1 t 1
3.88
iv Set
ht 1 n 1
⎧ ⎨ht 1 new
if
⎩h
otherwise
n t 1
Unif0, 1 ≤ αr
3.89
Remark 3.7. Every time we change ht 1 , we calculate only one normal density since the transformation is Markovian, and, since t 0, . . . , T − 1, we need OT calculations. new Notice that if we retain hnew is retained and we will not need to simulate st t 1 , then εt at a later stage. In fact we only need to simulate st at t T since we need to know εT . The final step involves computing
n
δt 1
n
rt 1 − εt 1 n
ht 1
,
t 0, . . . , T − 1, n 1, . . . , M.
3.90
Using all the above simulated values, we may now take average of simulations and compute the quantities needed for the SEM algortihm. As for the Bayesian inference, having completed Step 2 we may now proceed to the Gibbs sampling and M-H steps to obtain draws from the required posterior density. Thus, the first-order Markov transformation of the model made feasible an MCMC algorithm which allows the calculation of a classical estimator via the simulated EM algorithm and a simulation-based Bayesian inference in OT computational operations.
26
Journal of Probability and Statistics
3.4. A Comparison of the Simulators In order to compare the performance of the inefficient and the efficient MCMC sampler introduced in the previous subsection, we have generated realizations of size T 240 for the simple GQARCH1,1-M-AR1 model with parameters δ 0.1, ϕ 0.85, α 0.084, β 0.688, γ 0.314 which are centered around typical values that we tend to see in the empirical literature. We first examine the increase in the variance of the sample mean of εt across 500,000 simulations due to the autocorrelation in the drawings relative to an ideal but infeasible independent sampler. We do so by recording the inefficient ratios for the observations t 80 and t 160 using standard spectral density techniques. In addition, we record the mean acceptance probabilities over all observations and the average CPU time needed to simulate one complete drawing. The behavior of the two simulators is summarized in Table 1 and is very much as one would expect. The computationally inefficient sampler shows high serial correlation for both t 80 and t 160 and a low acceptance rate for each individual t. Moreover, it is extremely time consuming to compute even though our sample size is fairly small. In fact, when we increase T from 240 to 2400, and 24000 the average CPU time increases by a factor of 100 and 10000, respectively, as opposed to 10 and 100 for the other one the efficient, which makes it impossible to implement in most cases of practical interest. On the other hand, the single-move efficient sampler produces results much faster, with a reasonably high acceptance rate but more autocorrelation in the drawings for t 160.
4. Empirical Application: Bayesian Estimation of Weekly Excess Returns from Three Major Stock Markets: Dow-Jones, FTSE, and Nikkei In this section we apply the procedures described above to weekly excess returns from three major stock markets: Dow-Jones, FTSE, and Nikkei for the period of the last week of 1979:8 to the second to the last week of 2008:5 1,500 observations. To guarantee 0 ≤ β ≤ 1 − α ≤ 1 and to ensure that ω > 0 we also used some accept-reject method for the Bayesian inference. This means that, when drawing from the posterior as well as from the prior, we had to ensure that α, β > 0, α β < 1 and ω > 0. In order to implement our proposed Bayesian approach, we first have to specify the hyperparameters that characterize the prior distributions of the parameters. In this respect, our aim is to employ informative priors that would be in accordance with the “received wisdom.” In particular, for all data sets, we set the prior mean for β equal to 0.7, and for a, ω, and γ we decided to set their prior means equal to 0.15, 0.4, and 0.0, respectively. We had also to decide on the prior mean of δ. We set its prior mean equal to 0.05, for all markets. These prior means imply an annual excess return of around 4%, which is a typical value for annualized stock excess returns. Finally, we set the prior mean of ϕ equal to 0.75, of ϕ2u equal to 0.01, and the hyperparameters ν0 and do equal to 1550 and 3, respectively, for all three datasets, something which is consistent with the “common wisdom” of high autocorrelation of the price of risk. We employ a rather vague prior and set its prior variance equal to 10,000 for all datasets. We ran a chain for 200,000 simulations for the three datasets and decided to use every tenth point, instead of all points, in the sample path to avoid strong serial correlation. The posterior statistics for the Dow-Jones, FTSE, and Nikkey are reported in Table 2. Inefficiency
Journal of Probability and Statistics
27
Table 1: Comparison between the efficient and inefficient MCMC simulator. MAP
Time CPU effort
IR t 80
t 160
T 240
T 2400
T 24000
Inefficient
0.16338
45.1748
47.3602
0.031250
4.046875
382.05532
Single-move
0.60585
8.74229
110.245
0.001000
0.015625
0.1083750
Note: MAP denotes mean acceptance probability over the whole sample, while IR refers to inefficiency ratio of the MCMC drawings at observations 80 and 160. Time refers to the total CPU time in seconds taken to simulate one complete drawing.
Table 2: Bayesian inference results. Dow-Jones
PM
PSD
φ0.5
φmin
φmax
IF
δ
0.052
0.041
0.054
−0.048
0.103
3.001
ϕ
0.812
0.082
0.854
−0.992
0.999
2.115
ϕ2u
0.010
0.034
0.013
0.002
0.018
1.509
ω
0.405
0.071
0.461
0.004
0.816
2.367
α
0.152
0.040
0.291
0.001
0.873
1.048
β
0.651
0.168
0.629
0.003
0.984
2.994
γ
0.392
0.112
0.393
−0.681
0.418
5.108
PM
PSD
φ0.5
φmin
φmax
IF
0.059
0.036
0.059
−0.051
0.395
3.111
FTSE δ ϕ
0.811
0.096
0.839
−0.809
0.999
2.154
ϕ2u
0.009
0.029
0.012
0.005
0.017
1.995
ω
0.205
0.087
0.398
0.003
0.995
1.457
α
0.140
0.055
0.187
0.001
0.931
3.458
β
0.682
0.153
0.701
0.001
0.988
2.721
γ
0.374
0.102
0.381
−0.615
0.401
1.254
PM
PSD
φ0.5
φmin
φmax
IF
0.068
0.051
0.068
−0.064
0.211
2.998
Nikkei δ ϕ
0.809
0.090
0.837
−0.831
0.999
1.211
ϕ2u
0.010
0.031
0.010
0.004
0.019
2.001
ω
0.195
0.079
0.228
0.004
0.501
2.789
α
0.149
0.052
0.197
0.001
0.893
3.974
β
0.634
0.119
0.645
0.006
0.989
1.988
γ
0.408
0.123
0.409
−0.587
0.487
4.007
Note: PM denotes posterior mean, PSD posterior standard deviation, φ0.5 posterior median, φmin posterior minimum, φmax posterior maximum, and IF inefficiency factor.
factors are calculated using a Parzen window equal to 0.1T where, recall, T is the number of observations and indicate that the M-H sampling algorithm has converged and well behaved This is also justified by the ACFs of the draws. However, they are not presented for space considerations and are available upon request. With the exception of the constants δ and the ϕ2u ’s, there is low uncertainty with the estimation of the parameters. The estimated persistence, α β, for all three markets is close to 0.8 with the highest being the one of
28
Journal of Probability and Statistics
FTSE 0.822, indicating that the half life of a shock is around 3.5. The estimated asymmetry parameters are round 0.4 with “t-statistics” higher than 3.2, indicating that the leverage effect is important in all three markets. In a nut shell, all estimated parameters have plausible values, which are in accordance with previous results in the literature. We have also performed a sensitivity analysis to our choice of priors. In particular, we have halved and doubled the dispersion of the prior distributions around their respective means. Figures 1, 2, and 3 show the kernel density estimates for all parameters for all datasets for the posterior distributions for the three cases: when the variances are 10,000 baseline posterior, when the variances are halved small variance posterior, and when the variances are doubled large variance posterior. We used a canonical Epanechnikov kernel, and the optimal bandwidth was determined automatically by the data. The results which are reported in Figures 1, 2, and 3 indicate that the choice of priors does not unduly influence our conclusions. Finally, treating the estimated posterior means as the “true” parameters, we can employ the formulae of Section 2.1 and compare the moments implied by the estimates and the sample ones. One fact is immediately obvious. All order autocorrelations of excess returns implied by the estimates are positive but small, with the 1st one being around 0.04, which is in accordance with the ii stylized fact see Section 1. However, for all the three markets, the sample skewness coefficients are negative, ranging from −0.89 FTSE to −0.12 Nikkey, whereas the implied ones are all positive, ranging from 0.036 FTSE to 0.042 Dow-Jones. Nevertheless, the model is matching all the other stylized facts satisfactorily, that is, the estimated parameter values accommodate high coefficient of variation, leptokurtosis as well the volatility clustering and leverage effect.
5. Conclusions In this paper, we derive exact likelihood-based estimators for our time-varying GQARCH1,1-M model. Since in general the expression for the likelihood function is unknown, we resort to simulation methods. In this context, we show that MCMC likelihoodbased estimation of such a model can in fact be handled by means of feasible OT algorithms. Our samplers involve two main steps. First we augment the state vector to achieve a firstorder Markovian process in an analogous manner to the way in which GARCH models are simulated in practice. Then, we discuss how to simulate first the conditional variance and then the sign given these simulated series so that the unobserved in mean process is revealed as a residual term. We also develop simulation-based Bayesian inference procedures by combining within a Gibbs sampler the MCMC simulators. Furthermore, we derive the theoretical properties of this model, as far as moments and dynamic moments are concerned. In order to investigate the practical performance of the proposed procedure, we estimate within a Bayesian context our time-varying GQARCH1,1-M-AR1 model for weekly excess stock returns from the Dow-Jones, Nikkei, and FTSE index. With the exception of the returns’ skewness, the suggested likelihood-based estimation method and our model is producing satisfactory results, as far as a comparison between sample and theoretical moments is concerned. Although we have developed the method within the context of an AR1 price of risk, it applies much more widely. For example, we could assume that the market price of risk is a Bernoulli process or a Markov’s switching process. A Bernoulli’s distributed price of risk would allow a negative third moment by appropriately choosing the two values of
Journal of Probability and Statistics
−4
0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 −2−0.1 0 α
29 3.5 3 2.5 2 1.5 1 0.5 0 2
4
0.5
−0.5 0
β
a
b
0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0
0.6 0.5 0.4 0.3 0.2 0.1 −10
0 0
10
−5
γ
−0.1 0 ω
c
d
−0.1
0.3
7
0.25
6
0.2
5
5
4
0.15
3
0.1
2
0.05 −20
1.5
1
1
0
0 −10 −0.05 δ
10
20
0 −1
e
0
0.5
1
1.5
ϕ f
45 40 35 30 25 20 15 10 5 0 −0.05 −5 0
0.05
0.1
ϕ2u Baseline posterior Large variance posterior Small variance posterior g
Figure 1: Dow-Jones: posterior density estimates and sensitivity analysis.
30
Journal of Probability and Statistics 2.5
6 5
2
4
1.5
3
1
2
0.5
1 −0.5
0 0 −1 −0.5
0 −1
0
0.5
1
α
β
a
b
4 3 2.5 2 1.5 1 0.5 0 −1 −0.5 0
1
2
−1 −0.5 0
γ c
4.5 4 3.5 3 2.5 2 1.5 1 0.5 0 −1−0.5 0 δ
2
3
4 3.5 3 2.5 2 1.5 1 0.5 0
3.5
−2
1
1
2
ω d
3 2.5 2 1.5 1 0.5 1
2
−1
0 −0.5
e
0
1
2
ϕ f
45 40 35 30 25 20 15 10 5 0 −0.05 −5 0
0.05
0.1
ϕ2u
Baseline posterior Large variance posterior Small variance posterior g
Figure 2: FTSE: posterior density estimates and sensitivity analysis.
Journal of Probability and Statistics
31
5
2
4
1.5
3
1
2 0.5
1 −1
0
0 0 −1
1
−1
2
0
1
α
β
a
−2
b
6
4.5 4 3.5 3 2.5 2 1.5 1 0.5 0 −0.5 0 γ
5 4 3 2 1 −1
2
0 −1
0
1.8 1.6 1.4 1.2 1 0.8 0.6 0.4 0.2 0 −1−0.2 0 δ
1
2
1
2
ω
c
−2
2
−0.5
d
1
4.5 4 3.5 3 2.5 2 1.5 1 0.5 0 −1 −0.5 0
2
ϕ
e
f
45 40 35 30 25 20 15 10 5 0
−0.05 −5 0
0.05
0.1
ϕ2u
Baseline posterior Large variance posterior Small variance posterior g
Figure 3: Nikkei: posterior density estimates and sensitivity analysis.
32
Journal of Probability and Statistics
the in-mean process. However, this would make all computations much more complicated. In an earlier version of the paper, we assumed that the market price of risk follows a normal distribution, and we applied both the classical and the Bayesian procedure to three stock markets where we decided to set the posterior means as initial values for the simulated EM algorithm. The results suggested that the Bayesian and the classical procedures are quite in agreement see Anyfantaki and Demos 24. Finally, it is known that e.g., 32, pages 84 and 85 the EM algorithm slows down significantly in the neighborhood of the optimum. As a result, after some initial EM iterations, it is tempting to switch to a derivative-based optimization routine, which is more likely to quickly converge to the maximum. EM-type arguments can be used to facilitate this switch by allowing the computation of the score. In particular, it is easy to see that E
∂ ln p δ | r, φ, F0 n | r, φ , F0 0, ∂φ
5.1
so it is clear that the score can be obtained as the expected value given r, φ, F0 of the sum of the unobservable scores corresponding to ln pr | δ, φ, F0 and ln pδ | φ, F0 . This could be very useful for the classical estimation procedure, not presented here, as even though our algorithm is an OT one, it is still rather slow. We leave these issues for further research.
Appendices A. Proof of 3.69 and 3.82 Proof of Equation 3.69. This is easily derived using the fact that
rt δt ht εt ,
A.1
where
rt | ht ∼ N δht ,
ϕ2u ht 1 ht , 1 − ϕ2
A.2
and, consequently,
εt rt
⎛ ⎜ | ht ∼ N ⎜ ⎝
0
δht
⎛
ht
ht
⎞⎞
⎟ ⎜ ⎟ ⎟⎟, ,⎜ ϕ2u ⎠⎠ ⎝ ht ht 1 ht 1 − ϕ2
A.3
Journal of Probability and Statistics
33
and, thus; from the definition of the bivariate normal,
Eεt | rt , ht Varεt | rt , ht
1 − ϕ2 rt − δht ϕ2u ht 1 − ϕ2 ϕ2u h2t ϕ2u ht 1 − ϕ2
, A.4
.
Consequently, εt | rt , ht , φ ∼ N
1 − ϕ2 rt − δht ϕ2u ht 1 − ϕ2
,
ϕ2u h2t ϕ2u ht 1 − ϕ2
.
A.5
Proof of Equation 3.82. We have that
p rt 1 |
hrt 1
2 rt 1 − δhrt 1 , exp − 2 2 ϕu / 1 − ϕ2 hrt 1 1 hrt 1
1
2π ϕ2u / 1 − ϕ2 hrt 1 1 hrt 1
A.6 and, thus, ϕ2u / 1 − ϕ2 hrt 1 1 hrt 1 p rt 1 | hnew t 1 new p rt 1 | hrt 1 ϕ2u / 1 − ϕ2 hnew t 1 1 ht 1 × exp
2 2 rt 1 − δhrt 1 rt 1 − δhnew t 1 − new . 2 ϕ2u / 1 − ϕ2 hrt 1 1 hrt 1 2 ϕ2u / 1 − ϕ2 hnew t 1 1 ht 1 A.7
Also, , r dt 1
hrt 2 − ω − βhrt 1 , α
A.8
and so r dt 1 new dt 1
hrt 2 − ω − βhrt 1 . hrt 2 − ω − βhnew t 1
A.9
34
Journal of Probability and Statistics
Moreover, ⎡
new ct 1 r ct 1
1 ⎞2 ⎞ ⎤ 2 r new 2h − ω − βh 3 t 2 t 1 ⎜ ⎟ ⎟ ⎟ ⎟ ⎥ γ ⎢ ϕ2 ⎜ ⎜ ⎟ ⎟ ⎥ ⎢ α ⎜ ⎟ ⎟ ⎜ ⎟ ⎥ ⎢ new ⎟ 2 ⎜ ⎟ 1 − ϕ rt 1 − δht 1 ⎟ ⎢ ⎜ ⎟ ⎟ ⎥ ⎢ ⎝− ⎠ ⎠ ⎥ ⎢ ⎥ 2 new 2 h 1 − ϕ ϕ ⎢ ⎥ u r new t 1 ht 1 ϕ2u ht 1 1 − ϕ2 ⎢ 1 ⎛ ⎞2 ⎞ ⎥ ⎛ 2 ⎢ ⎥ 2 hr − ω − βhnew ⎢ 3 t 2 t 1 ⎜ ⎟ ⎟⎥ ⎜ ⎜ ⎟ ⎟⎥ 2 new 2 ⎜ γ− ⎢ ⎜ ϕu ht 1 1 − ϕ ⎜ ⎟ ⎟⎥ ⎢ ⎜ α ⎟ ⎟ ⎜− ⎟⎥ ⎜ ⎢ exp⎜ new ⎟ 2 ⎜ ⎟ ⎟⎦ ⎜ 2 new2 1 − ϕ r − δh ⎣ t 1 2ϕu ht 1 ⎜ ⎟ ⎟ ⎜ t 1 ⎝ ⎠ ⎠ ⎝− 2 new ϕu ht 1 1 − ϕ2 1 ⎡ ⎞2 ⎞ ⎤ . ⎛ ⎛ 2 r r 2h 3 t 2 − ω − βht 1 ⎟ ⎟ ⎜ ⎜ ⎟ ⎟ ⎥ ⎜ 2 r 2 ⎜ γ ⎢ ⎟ ⎟ ⎥ ⎜ ϕu ht 1 1 − ϕ ⎜ ⎢ α ⎟ ⎟ ⎜ ⎟ ⎟ ⎥ ⎜ ⎜− ⎢ exp⎜ r 2 ⎟ ⎟ ⎥ ⎜ ⎜ r2 2 − δh 1 − ϕ r ⎢ t 1 ⎟ ⎜ ⎜ 2ϕ h t 1 ⎟ u t 1 ⎢ ⎠ ⎠ ⎥ ⎝− ⎝ ⎢ ⎥ 2 r 2 ϕu ht 1 1 − ϕ ⎢ ⎥ hnew ϕ2u hrt 1 1 − ϕ2 ⎢ 1 ⎛ ⎞2 ⎞ ⎥ ⎛ t 1 2 ⎢ ⎥ 2 hr − ω − βhr ⎢ 3 t 2 t 1 ⎟ ⎟ ⎥ ⎜ ⎜ ⎜ ⎟ ⎟⎥ 2 r 2 ⎜ γ− ⎢ ϕ h 1 − ϕ ⎜ ⎟ ⎟ ⎜ u t 1 ⎢ ⎟⎥ ⎜ α ⎟ ⎜− ⎟ ⎟⎥ ⎜ ⎢ exp⎜ r 2 ⎜ ⎟ ⎟⎦ ⎜ 2 r2 1 − ϕ r − δh ⎣ t 1 2ϕ h ⎜ ⎟ ⎜ t 1 ⎠ ⎟ u t 1 ⎝ ⎠ ⎝− 2 r 2 ϕu ht 1 1 − ϕ ⎛
⎜ ⎜ 2 new ⎜ ϕu ht 1 1 − ⎜ exp⎜− ⎜ ⎜ 2ϕ2u hnew2 t 1 ⎝
⎛
A.10
And the result comes straightforward.
B. Dynamic Moments of the Conditional Variance We have 2 , ht−k − 2αγ Covεt−1 , ht−k βCovht−1 , ht−k Covht , ht−k αCov εt−1 k α β Covht−1 , ht−k · · · α β V ht , 2 εt−k − 2αγ Eεt−1 εt−k βEht−1 εt−k Eht εt−k αE εt−1
B.1
k−1 Eht−k 1 εt−k α β Eht−1 εt−k · · · α β k−1 −2αγ α β Eht−1 , since 3 2 − 2αγ εt−1 βht−1 εt−1 Eht εt−1 E αεt−1 2 −2αγ E εt−1 −2αγ Eht−1 .
B.2
Journal of Probability and Statistics
35
Also, Cov h2t , ht−k ACov h2t−1 , ht−k BCovht−1 , ht−k A2 Cov h2t−2 , ht−k ABCovht−2 , ht−k BCovht−1 , ht−k · · · Ak−1 Cov h2t−k 1 , ht−k Ak−2 BCovht−k 1 , ht−k · · · ABCovht−2 , ht−k BCovht−1 , ht−k A
k−1
B.3
k−1 − Ak−1 α β 2 Cov ht−k 1 , ht−k α β B V ht α β −A
α βk − Ak 3 2 A E ht−1 − E ht−1 Eht−1 BV ht , α β −A k
as
Cov h2t , ht−1 A E h3t−1 − E h2t−1 Eht−1 BV ht−1 ,
B.4
where A 3α2 β2 2αβ and B 22α2 γ 2 ω αγ 2 α β. Furthermore, E h2t εt−k AE h2t−1 εt−k BEht−1 εt−k A2 E h2t−2 εt−k ABEht−2 εt−k BEht−1 εt−k k−3 k−2 α β · · · Ak−1 E h2t−k 1 εt−k − 2αγ B Ak−2 · · · A α β Eht−1 −4αγ 3α β Ak−1 E h2t−1 − 4αγ
k−1 α βk − Ak α β − Ak−1 2 2 2α γ Eht−1 , ω αγ α β −A α β −A 2
B.5 as E h2t εt−1 −4αγ 3α β E h2t−1 − 4 ω αγ 2 αγ Eht−1 .
B.6
Also we have that Cov ht , h2t−k α β Cov ht−1 , h2t−k
k α β E h3t−1 − E h2t−1 Eht−1 ,
B.7
36
Journal of Probability and Statistics
as
Cov ht , h2t−1 α β E h3t−1 − E h2t−1 Eht−1 .
B.8
Now, Cov h2t , h2t−k ACov h2t−1 , h2t−k BCov ht−1 , h2t−k A2 Cov h2t−2 , h2t−k ABCov ht−2 , h2t−k BCov ht−1 , h2t−k · · · Ak−1 Cov h2t−k 1 , h2t−k Ak−1 BCov ht−k 1 , h2t−k
· · · ABCov ht−2 , h2t−k BCov ht−1 , h2t−k A V k
h2t−1
B.9
k
α β − Ak 3 B E ht−1 − E h2t−1 Eht−1 , α β −A
as
Cov h2t , h2t−1 AV h2t−1 B E h3t−1 − E h2t−1 Eht−1 .
B.10
Further, 2 2 Cov ht , εt−k α β Cov ht−1 , εt−k k−1 2 α β Cov ht−k 1 , εt−k
B.11
k k−1 2 E ht , α β V ht−1 2α α β as
2 Cov ht , εt−1 α 2E h2t V ht−1 βV ht−1 α β V ht−1 2αE h2t . Further 2 2 2 ACov h2t−1 , εt−k BCov ht−1 , εt−k Cov h2t , εt−k 2 2 2 ACov h2t−2 , εt−k ABCov ht−2 , εt−k BCov ht−1 , εt−k 2 2 · · · Ak−1 Cov h2t−k 1 , εt−k Ak−2 BCov ht−k 1 , εt−k 2 2 · · · ABCov ht−2 , εt−k BCov ht−1 , εt−k
B.12
Journal of Probability and Statistics
37
k−1 α β − Ak−1 B α β V ht−1 A Cov α β −A k−1 α β − Ak−1 2 2αB E ht α β −A k
α β − Ak k 3 2 A E ht−1 − E ht−1 Eht−1 B V ht−1 α β −A 4α 3α β Ak−1 E h3t−1
k−1 α β − Ak−1 2 2 2 k−1 2αB E h2t , A 4 2α γ ω αγ α β −A k−1
2 h2t−k 1 , εt−k
B.13 as
2 Cov h2t , εt−1 A E h3t−1 − E h2t−1 Eht−1 4α 3α β E h3t−1 4 2α2 γ 2 ω αγ 2 E h2t BV ht .
B.14
Further, k−1 Eht ht−k εt−k α β Eht−1 ht−k εt−k · · · α β Eht−k 1 ht−k εt−k k−1 E h2t−1 , −2αγ α β
B.15
as Eht ht−1 εt−1 −2αγ E h2t−1 .
B.16
Finally, E h2t ht−k εt−k AE h2t−1 ht−k εt−k BEht−1 ht−k εt−k · · · Ak−1 E h2t−k 1 ht−k εt−k Ak−2 BEht−k 1 ht−k εt−k · · · ABEht−2 ht−k εt−k BEht−1 ht−k εt−k −4αγ Ak−1 3α β E h3t−1 ω αγ 2 E h2t−1
k−1 − Ak−1 2 α β E ht−1 , − 2αγ B α β −A
B.17
38
Journal of Probability and Statistics
as E h2t ht−1 εt−1 −4αγ 3α β E h3t−1 − 4 ω αγ 2 αγ E h2t−1 .
B.18
Acknowledgments The authors would like to thank S. Arvanitis, G. Fiorentini, and E. Sentana for their valuable comments. We wish also to thank the participants of the 7th Conference on Research on Economic Theory and Econometrics Naxos, July 2008, the 14th Spring Meeting of Young Economists Istanbul, April 2009, the Doctoral Meeting of Montpellier D.M.M. in Economics, Management and Finance Montpellier, May 2009, the 2nd Ph.D. Conference in Economics 2009, in Memory of Vassilis Patsatzis Athens, May 2009, the 3rd International Conference in Computational and Financial Econometrics Limassol, October 2009, the 4th International Conference in Computational and Financial Econometrics London, December 2010 and the seminar participants at the Athens University of Economics and Business for useful discussions. They acknowledge Landis Conrad’s help for the collection of the data. Finally, they greatly benefited by the comments of the Associate Editor and an anonymous referee.
References 1 R. C. Merton, “Stationarity and persistence in the GARCH 1,1 model,” Econometric Theory, vol. 6, no. 3, pp. 318–334, 1980. 2 L. R. Glosten, R. Jaganathan, and D. E. Runkle, “On the relation between the expected value and the volatility of the nominal excess return on stocks,” Journal of Finance, vol. 48, no. 5, pp. 1101–1801, 1993. 3 R. F. Engle, “Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation,” Econometrica, vol. 50, no. 4, pp. 987–1007, 1982. 4 T. Bollerslev, “Generalized autoregressive conditional heteroskedasticity,” Journal of Econometrics, vol. 31, no. 3, pp. 307–327, 1986. 5 T. Bollerslev, R. Y. Chou, and K. F. Kroner, “ARCH modeling in finance: a review of the theory and empirical evidence,” Journal of Econometrics, vol. 52, no. 1-2, pp. 5–59, 1992. 6 S. Arvanitis and A. Demos, “Time dependence and moments of a family of time-varying parameter GARCH in mean models,” Journal of Time Series Analysis, vol. 25, no. 1, pp. 1–25, 2004. 7 O. E. Barndorff-Nielsen and N. Shephard, Modelling by Levy Processes for Financial Econometrics, Nuffield College Economics Working Papers, Oxford University, London, UK, 2000, D.P. No 2000-W3. 8 R. N. Mantegna and H. E. Stanley, “Turbulence and financial markets,” Nature, vol. 383, no. 6601, pp. 587–588, 1996. 9 R. N. Mantegna and H. E. Stanley, An Introduction to Econophysics Correlations and Complexity in Finance, Cambridge University Press, Cambridge, UK, 2000. 10 G. Fiorentini, E. Sentana, and N. Shephard, “Likelihood-based estimation of latent generalized ARCH structures,” Econometrica, vol. 72, no. 5, pp. 1481–1517, 2004. 11 E. Sentana, “Quadratic ARCH models,” The Review of Economic Studies, vol. 62, no. 4, pp. 639–661, 1995. 12 G. Fiorentini and E. Sentana, “Conditional means of time series processes and time series processes for conditional means,” International Economic Review, vol. 39, no. 4, pp. 1101–1118, 1998. 13 S. Arvanitis, “The diffusion limit of a TVP-GQARCH-M1,1 model,” Econometric Theory, vol. 20, no. 1, pp. 161–175, 2004. 14 S. Chib, “Markov chain monte carlo methods: computation and Inference,” in Handbook of Econometrics, J. J. Heckman and E. Leamer, Eds., vol. 5, pp. 3569–3649, North-Holland, Amsterdam, The Netherlands, 2001. 15 S. P. Meyn and R. L. Tweedie, Markov Chains and Stochastic Stability, Springer, New York, NY, USA, 1993.
Journal of Probability and Statistics
39
16 A. E. Gelfand and A. F. M. Smith, “Sampling-based approaches to calculating marginal densities,” Journal of the American Statistical Association, vol. 85, no. 410, pp. 398–409, 1990. 17 L. Tierney, “Markov chains for exploring posterior distributions,” The Annals of Statistics, vol. 22, no. 4, pp. 1701–1728, 1994. 18 M. A. Tanner and W. H. Wong, “The calculation of posterior distributions by data augmentation,” Journal of the American Statistical Association, vol. 82, no. 398, pp. 528–540, 1987. 19 S. Kim, N. Shephard, and S. Chib, “Stochastic volatility: likelihood inference and comparison with ARCH models,” Review of Economic Studies, vol. 65, no. 3, pp. 361–393, 1998. 20 N. Shephard, “Statistical aspects of ARCH and stochastic volatility,” in Time Series Models in Econometrics, Finance and Other Fields, D. R. Cox, D. V. Hinkley, and O. E. B. Nielsen, Eds., pp. 1–67, Chapman & Hall, London, UK, 1996. 21 S. G. Giakoumatos, P. Dellaportas, and D. M. Politis, Bayesian Analysis of the Unobserved ARCH Model, vol. 15, Department of Statistics, Athens University of Economics and Business, Athens, Greece, 1999. 22 A. P. Dempster, N. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society B, vol. 39, no. 1, pp. 1–38, 1977. 23 E. Sentana and G. Fiorentini, “Identification, estimation and testing of conditionally heteroskedastic factor models,” Journal of Econometrics, vol. 102, no. 2, pp. 143–164, 2001. 24 S. Anyfantaki and A. Demos, “Estimation of a time-varying GQARCH1,1-M model,” Working paper, 2009, http://www.addegem-asso.fr/docs/PapersDMM2009/2.pdf. 25 D. J. Poirier, “Revising beliefs in nonidentified models,” Econometric Theory, vol. 14, no. 4, pp. 483–509, 1998. 26 P. Clifford, “Markov random fields in statistics,” in Disorder in Physical Systems: A Volume in Honor of John M. Hammersley, G. Grimmett and D. Welsh, Eds., pp. 19–32, Oxford University Press, 1990. 27 S. Chib, “Bayes regression with autoregressive errors,” Journal of Econometrics, vol. 58, no. 3, pp. 275– 294, 1993. 28 L. Bauwens and M. Lubrano, “Bayesian inference on Garch models using the Gibbs sampler,” Econometrics Journal, vol. 1, no. 1, pp. C23–C46, 1998. 29 T. Nakatsuma, “Bayesian analysis of ARMA-GARCH models: a Markov chain sampling approach,” Journal of Econometrics, vol. 95, no. 1, pp. 57–69, 2000. 30 D. Ardia, “Bayesian estimation of a Markov-switching threshold asymmetric GARCH model with student-t innovations,” The Econometrics Journal, vol. 12, no. 1, pp. 105–126, 2009. 31 S. X. Wei, “A censored-GARCH model of asset returns with price limits,” Journal of Empirical Finance, vol. 9, no. 2, pp. 197–223, 2002. 32 M. A. Tanner, Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Function, Springer, New York, NY, USA, 3rd edition, 1996.
Hindawi Publishing Corporation Journal of Probability and Statistics Volume 2011, Article ID 691058, 11 pages doi:10.1155/2011/691058
Review Article The CSS and The Two-Staged Methods for Parameter Estimation in SARFIMA Models Erol Egrioglu,1 Cagdas Hakan Aladag,2 and Cem Kadilar2 1 2
Department of Statistics, Ondokuz Mayis University, 55139 Samsun, Turkey Department of Statistics, Hacettepe University, Beytepe, 06800 Ankara, Turkey
Correspondence should be addressed to Erol Egrioglu,
[email protected] Received 19 May 2011; Accepted 1 July 2011 Academic Editor: Mike Tsionas Copyright q 2011 Erol Egrioglu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Seasonal Autoregressive Fractionally Integrated Moving Average SARFIMA models are used in the analysis of seasonal long memory-dependent time series. Two methods, which are conditional sum of squares CSS and two-staged methods introduced by Hosking 1984, are proposed to estimate the parameters of SARFIMA models. However, no simulation study has been conducted in the literature. Therefore, it is not known how these methods behave under different parameter settings and sample sizes in SARFIMA models. The aim of this study is to show the behavior of these methods by a simulation study. According to results of the simulation, advantages and disadvantages of both methods under different parameter settings and sample sizes are discussed by comparing the root mean square error RMSE obtained by the CSS and two-staged methods. As a result of the comparison, it is seen that CSS method produces better results than those obtained from the two-staged method.
1. Introduction In the recent years, there have been a lot of studies about Autoregressive Fractionally Integrated Moving Average ARFIMA models in the literature. However, most of time series in real life may have seasonality, in addition to long-term structure. Therefore, SARFIMA models have been introduced to model such time series. Generally, SARFIMA p, d, qP, D, Qs process is given in the following form: φBΦB1 − Bd 1 − Bs D Xt ΘBθBet ,
1.1
where Xt is a time series, B is the back shift operator, such as Bi Xt Xt−i , s is the seasonal lag, d and D represent the nonseasonal and seasonal fractionally differences; respectively, et is
2
Journal of Probability and Statistics
a white noise process and has normal distribution N0, σe2 , and φB, ΦB, θB, and ΘB are given by φB 1 − φ1 B − · · · − φp BP , θB 1 θ1 B · · · θq Bq , ΦB 1 − Φ1 Bs − · · · − Φp Bps , ΘB 1 Θ1 B · · · Θq Bqs ,
1.2
where p, q and P, Q are the orders of the nonseasonal and seasonal parameters, respectively. Baillie 1 and Hassler and Wolters 2 examined the basic characteristics of ARFIMA models, while some significant contributions to the SARFIMA models were presented by Giraitis and Leipus 3 , Arteche and Robinson 4 , Chung 5 , Velasco and Robinson 6 , Giraitis et al. 7 , and Haye 8 . When all parameters are different from zero in 1.1 and when some parameters such as p, q, P, Q are equal to zero, different parameter estimation methods are compared by performing simulation studies in the literature 9–11 . Seasonal long-term structure exists in time series in various study fields such as the cumulative money series in Porter-Hudak 12 , the IBM input series in Ray 13 , and the Nile River data in Montanari et al. 14 . Candelon and Gil-Alana 15 forecasted the industrial production index of countries in South America by employing the SARFIMA models. GilAlana 16 found that the GDP series in Germany, Italy, and Denmark had a structure which was suitable to use SARFIMA models. Brietzke et al. 17 utilized Durbin-Levinson algorithm for the p q P Q d 0 model. Ray 13 modified the method proposed by Hosking 18 and used this modified method for a special SARFIMA process having two different seasonal difference parameters. Darn´e et al. 19 adapted the method, proposed for ARFIMA by Chung and Baillie 20 , to SARFIMA models. However, the properties of the CSS method employed in Darn´e et al. 19
have not been examined by a simulation study yet. Arteche and Robinson 4 introduced a semiparametric method based on spectral density functions while estimating parameters for SARFIMA model in the case of d 0. GPH method used in ARFIMA is extended to be used in SARFIMA models for p q P Q d 0 by Porter-Hudak 12 , and GPH estimator has been modified by Ooms and Hassler 21 . Also, a simulation study for different values of d, D, s, and sample size has been conducted using GPH, Whittle and Exact Maximum likelihood EML by Reisen et al. 9, 10 and Palma and Chan 11 . In addition to these studies, many methods for determining seasonal longterm structure have been proposed by Hassler and Wolters 22 , Gil-Alana ˜ and Robinson 23 , Arteche 24 , and Gil-Alana 25, 26 . We examine the properties of the CSS and two staged estimation methods by a simulation study in which both methods are compared based on various parameter settings and sample sizes. In the simulation study, a specific form of the model given in 1.1 in which p, d, and q are equal to zero is examined by using the both CSS and two staged estimation methods. This model can also be expressed as SARFIMA P, D, Qs . After simulation study was conducted, the results obtained from the CSS and two staged estimation methods are compared, and it is observed that better results are obtained when the CSS method is employed.
Journal of Probability and Statistics
3
The outline of this study is as follows. Section 2 contains brief information related to SARFIMA models. The CSS method and two staged methods are explained in Sections 3 and 4, respectively. The outline of the simulation study and the results are given in Section 5. Finally, the results obtained from the simulation study are summarized in the last section.
2. SARFIMA Models When p, q, d, P , and Q are set to zero in model 1.1, this model is called as Seasonal Fractionally Integrated SFI model. The SFI model was firstly introduced by Arteche and Robinson 4 , and basic information about the model can be found in Baillie 1 . SFI model can be given by 2.1
1 − Bs D Xt et . Infinite moving average presentation of the model 2.1 is as follows: Xt ΨBs et
∞
2.2
ψk et−sk ,
k0
where ψk Γk D/ΓDΓk 1, ψk ∼ kD−1 /ΓD, for k → ∞. Infinite autoregressive presentation of the model 2.1 is as follows: ΠBs Xt
∞
πk Xt−sk et ,
2.3
k0
where πk Γk − D/Γ−DΓk 1, πk k−D−1 /Γ−D, for k → ∞. For model 2.1, autocovariance and autocorrelation functions can be, respectively, written as follows: γsk
−1k Γ1 − 2D σ2, Γk − D 1Γ1 − k − D e
ρsk
Γ1 − DΓk D , ΓDΓk − D 1
k 1, 2, . . . ,
k 1, 2, . . . ,
2.4
2.5
when k −→ ∞,
ρsk ∼
Γ1 − D 2D−1 k . ΓD
2.6
For model 2.1, spectral density function is as follows: fω
sω −2D σe2 2 sin , 2π 2
0 < ω ≤ π.
2.7
Note that the spectral density function is infinite at the frequencies 2πν/s, ν 1, . . . , s/2 .
4
Journal of Probability and Statistics
When p, q, d, D, P, and Q are all different from zero in model (1.1), a closed form for the autocovariances cannot be obtained. However, methods such as the splitting method presented by Bertelli and Caporin [27], which is employed to calculate the autocovariances of ARFIMA models, can also be used for SARFIMA models. Let $\gamma_1(\cdot)$ denote the autocovariance function of the SARFIMA$(p, d, q)(P, D, Q)_s$ model. In the splitting method, the autocovariances are calculated as
\[
\gamma_1(k) = \sum_{h=-m}^{m} \gamma_2(h)\, \gamma_3(k - h), \tag{2.8}
\]
where $\gamma_2(\cdot)$ and $\gamma_3(\cdot)$ are the autocovariance functions of the SARFIMA$(p, 0, q)(P, 0, Q)_s$ and SARFIMA$(0, d, 0)(0, D, 0)_s$ models, respectively. $\gamma_3(\cdot)$ is itself calculated by the splitting method through
\[
\gamma_3(k) = \sum_{h=-m}^{m} \gamma_4(h)\, \gamma_5(k - h), \tag{2.9}
\]
where $\gamma_4(\cdot)$ and $\gamma_5(\cdot)$ are the autocovariance functions of the SARFIMA$(0, 0, 0)(0, D, 0)_s$ and SARFIMA$(0, d, 0)(0, 0, 0)_s$ models, respectively. The closed form of $\gamma_4(\cdot)$ is given in (2.4). The autocovariances $\gamma_5(\cdot)$ are those of a fractionally integrated process, whose closed form is given by [28] as
\[
\gamma_5(k) = \frac{(-1)^k \Gamma(1 - 2d)}{\Gamma(k - d + 1)\,\Gamma(1 - k - d)} \, \sigma_e^2. \tag{2.10}
\]
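The splitting recursion (2.8)-(2.10) is essentially a truncated convolution of autocovariance sequences. The sketch below (Python; illustrative, not the authors' code) implements the inner step (2.9) for a SARFIMA(0, d, 0)(0, D, 0)_s model. The sequences γ4 and γ5 are evaluated through a positive-argument rewriting of (2.4) and (2.10) obtained with the reflection formula, the innovation variance is applied only once so that it is not double counted in the convolution (a convention we assume here), and the truncation point m is an illustrative choice. The outer step (2.8) would additionally require the SARMA autocovariances γ2.

```python
import numpy as np
from scipy.special import gammaln

def frac_noise_acov(max_lag, d, sigma2_e=1.0):
    """Autocovariances gamma_5(0..max_lag) of an ARFIMA(0, d, 0) process, eq. (2.10),
    evaluated in the equivalent all-positive-argument form
    sigma_e^2 * G(1-2d) G(k+d) / (G(d) G(1-d) G(k+1-d))."""
    k = np.arange(max_lag + 1, dtype=float)
    log_g = (gammaln(1 - 2 * d) + gammaln(k + d)
             - gammaln(d) - gammaln(1 - d) - gammaln(k + 1 - d))
    return sigma2_e * np.exp(log_g)

def seasonal_frac_noise_acov(max_lag, D, s):
    """Autocovariances gamma_4(0..max_lag) of a SARFIMA(0,0,0)(0,D,0)_s process, eq. (2.4).
    Innovation variance is set to one so that sigma_e^2 enters (2.9) only once;
    lags that are not multiples of s are zero."""
    gam = np.zeros(max_lag + 1)
    gam[::s] = frac_noise_acov(max_lag // s, D)
    return gam

def split_acov(g_a, g_b, max_lag):
    """Splitting step, cf. eqs. (2.8)-(2.9): gamma(k) = sum_{h=-m}^{m} g_a(h) g_b(k-h),
    with both sequences truncated at m = len - 1 and extended by symmetry."""
    a = np.concatenate([g_a[:0:-1], g_a])   # lags -m, ..., m
    b = np.concatenate([g_b[:0:-1], g_b])
    full = np.convolve(a, b)                # lags -2m, ..., 2m
    centre = len(full) // 2                 # position of lag 0
    return full[centre:centre + max_lag + 1]

if __name__ == "__main__":
    m, d, D, s = 400, 0.1, 0.3, 4           # m is an illustrative truncation point
    g5 = frac_noise_acov(m, d)              # ARFIMA(0, d, 0) part, eq. (2.10)
    g4 = seasonal_frac_noise_acov(m, D, s)  # seasonal part, eq. (2.4)
    print(split_acov(g4, g5, 10))           # gamma_3(0..10) via eq. (2.9)
```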
To generate series that are appropriate for SARFIMA$(P, D, Q)_s$ models, the following algorithm is applied.

Step 1. Generate a random vector $Z = (z_1, \ldots, z_n)^T$ from the standard normal distribution.

Step 2. Obtain the matrix $\Sigma_n = [\gamma(i - j)]$, $i, j = 1, \ldots, n$, by utilizing expression (2.4).

Step 3. Decompose the covariance matrix as $\Sigma_n = LL^T$, where $L$ is a lower triangular matrix. This is the Cholesky decomposition, which exists for positive definite and symmetric matrices; note that $\Sigma_n$ is positive definite and symmetric.

Step 4. Obtain the series $X = (X_1, \ldots, X_n)^T$ from the formula $X = LZ$. The series $X$ has a structure suitable for the SARFIMA$(0, D, 0)_s$ model.

Step 5. Generate a series according to the SARMA$(P, Q)_s$ model by taking $X = (X_1, \ldots, X_n)^T$ as the error series. In this way, the newly generated series has the structure of a SARFIMA$(P, D, Q)_s$ process. The algorithm is easily extended to the SARFIMA$(p, d, q)(P, D, Q)_s$ model.
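A minimal sketch of Steps 1-5 in Python (NumPy/SciPy) is given next; it is not part of the original paper and all names are ours. The autocovariances in Step 2 are evaluated through a positive-argument form of (2.4) obtained with the reflection formula, lags that are not multiples of s are set to zero, and the seasonal ARMA filtering in Step 5 uses scipy.signal.lfilter with zero initial conditions; the signs of the seasonal polynomials should be matched to the convention used in model (1.1), and in practice one may generate a longer series and discard an initial stretch to reduce start-up effects.

```python
import numpy as np
from scipy.signal import lfilter
from scipy.special import gammaln

def sfi_covariance_matrix(n, D, s, sigma2_e=1.0):
    """Step 2: Sigma_n = [gamma(|i - j|)] for a SARFIMA(0, D, 0)_s process.
    gamma is eq. (2.4) in a positive-argument form; non-seasonal lags are zero."""
    k = np.arange((n - 1) // s + 1, dtype=float)
    gam_s = sigma2_e * np.exp(gammaln(1 - 2 * D) + gammaln(k + D)
                              - gammaln(D) - gammaln(1 - D) - gammaln(k + 1 - D))
    gam = np.zeros(n)
    gam[::s] = gam_s
    idx = np.arange(n)
    return gam[np.abs(idx[:, None] - idx[None, :])]

def simulate_sarfima(n, D, s, sar=(), sma=(), seed=None):
    """Steps 1-5: simulate a SARFIMA(P, D, Q)_s series of length n.
    sar = (Phi_1, ..., Phi_P), sma = (Theta_1, ..., Theta_Q)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)                     # Step 1
    sigma_n = sfi_covariance_matrix(n, D, s)       # Step 2
    L = np.linalg.cholesky(sigma_n)                # Step 3
    x = L @ z                                      # Step 4: SARFIMA(0, D, 0)_s series
    # Step 5: seasonal ARMA filtering, Phi(B^s) Y_t = Theta(B^s) x_t.
    # Signs below assume Phi(B^s) = 1 - Phi_1 B^s - ... and Theta(B^s) = 1 + Theta_1 B^s + ...;
    # adjust to the convention of model (1.1) if it differs.
    ar = np.zeros(len(sar) * s + 1); ar[0] = 1.0
    ma = np.zeros(len(sma) * s + 1); ma[0] = 1.0
    for j, phi in enumerate(sar, start=1):
        ar[j * s] = -phi
    for j, theta in enumerate(sma, start=1):
        ma[j * s] = theta
    return lfilter(ma, ar, x)                      # zero initial conditions

if __name__ == "__main__":
    y = simulate_sarfima(n=240, D=0.2, s=4, sar=(0.7,), seed=1)  # SARFIMA(1, D, 0)_4
    print(y[:5])
```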
3. The Two-Staged Method

The two-staged method can be used to estimate the parameters of the SARFIMA$(P, D, Q)_s$ model. In the first phase of this method, it is assumed that the time series has a structure suitable for the SARFIMA$(0, D, 0)_s$ model, and the seasonal fractional difference parameter $D$ is estimated; for this estimation, the EML method given below can be employed. In the second phase, the remaining parameters are estimated. The theoretical autocovariance and autocorrelation functions of the SARFIMA$(0, D, 0)_s$ model are given in (2.4) and (2.5), respectively. Let the time series $X_t$ have $n$ observations $x_1, \ldots, x_n$, and let $\Omega$ denote the autocorrelation matrix of $x_1, \ldots, x_n$. The likelihood function of $x_1, \ldots, x_n$ is then
\[
L(D) = (2\pi)^{-n/2} \, |\Omega|^{-1/2} \exp\!\left( -\tfrac{1}{2} X' \Omega^{-1} X \right). \tag{3.1}
\]
In the calculation of the likelihood function, the Cholesky decomposition is used to write the matrix $\Omega$ as the product of lower and upper triangular matrices. Instead of inverting the $n \times n$ matrix $\Omega$ directly, the inverses of the triangular factors are computed, which reduces the computational burden and the calculation time. The Cholesky decomposition of the matrix $\Omega$ is written as
\[
\Omega = LL'. \tag{3.2}
\]
Let $W = L^{-1} X$; then
\[
X' \Omega^{-1} X = X' (LL')^{-1} X = (L^{-1}X)'(L^{-1}X) = W'W, \qquad |\Omega|^{-1/2} = |LL'|^{-1/2} = |L|^{-1}. \tag{3.3}
\]
Thus, (3.1) can be rewritten as
\[
L(D \mid X) = (2\pi)^{-n/2} \, |L|^{-1} \exp\!\left( -\tfrac{1}{2} W'W \right). \tag{3.4}
\]
The likelihood function given in (3.4) is maximized with respect to the seasonal fractional difference parameter by using an optimization algorithm. After the seasonal fractional difference parameter has been estimated by EML, the remaining parameters of the SARMA$(P, Q)_s$ model are estimated in the second phase by classical methods; in particular, the order of the seasonal model can be determined using the Box-Jenkins approach. The two-staged method can therefore be summarized as follows.

Phase 1. Estimate the parameter $D$ by assuming that the time series is suitable for a SARFIMA$(0, D, 0)_s$ model.

Phase 2. Estimate the seasonal autoregressive and moving average parameters by using the Box-Jenkins methodology.
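Phase 1 can be implemented directly from (2.5) and (3.1)-(3.4). The sketch below (Python with NumPy/SciPy; ours, not the authors' implementation) builds the autocorrelation matrix Ω of a SARFIMA(0, D, 0)_s sample, evaluates the log of (3.4) through the Cholesky factor, and maximizes it over D with a bounded scalar optimizer; the search interval (0.01, 0.49) is an illustrative choice. Phase 2 is only indicated as a comment, since it amounts to standard Box-Jenkins identification and estimation.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

def sfi_corr_matrix(n, D, s):
    """Autocorrelation matrix Omega of n consecutive observations of a
    SARFIMA(0, D, 0)_s process, from eq. (2.5); non-seasonal lags are zero."""
    k = np.arange((n - 1) // s + 1, dtype=float)
    rho = np.zeros(n)
    rho[::s] = np.exp(gammaln(1 - D) + gammaln(k + D)
                      - gammaln(D) - gammaln(k - D + 1))
    idx = np.arange(n)
    return rho[np.abs(idx[:, None] - idx[None, :])]

def neg_log_lik(D, x, s):
    """Negative log of eq. (3.4), dropping the (2*pi)^(-n/2) constant:
    sum(log L_ii) + 0.5 * W'W, with Omega = L L' and W = L^{-1} X (eq. (3.3))."""
    L = cholesky(sfi_corr_matrix(len(x), D, s), lower=True)
    w = solve_triangular(L, x, lower=True)
    return np.sum(np.log(np.diag(L))) + 0.5 * w @ w

def estimate_D_eml(x, s):
    """Phase 1: EML estimate of D; the search interval is an illustrative choice."""
    res = minimize_scalar(neg_log_lik, bounds=(0.01, 0.49), args=(x, s), method="bounded")
    return res.x

# Phase 2 (sketch): apply (1 - B^s)^D_hat to the series and identify/estimate a
# SARMA(P, Q)_s model for the filtered series with the usual Box-Jenkins tools.
```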
4. The CSS Method

Chung and Baillie [20] proposed a method based on the minimization of the conditional sum of squares (CSS). This method can be used for SARFIMA$(p, d, q)(P, D, Q)_s$ models. The conditional sum of squares function for the SARFIMA model is
\[
S = \frac{n}{2} \log \sigma_e^2 + \frac{1}{2\sigma_e^2} \sum_{t=1}^{n} \left[ \theta^{-1}(B)\, \Theta^{-1}(B^s)\, \phi(B)\, \Phi(B^s)\, (1 - B)^d (1 - B^s)^D X_t \right]^2. \tag{4.1}
\]
In the CSS method, the seasonal fractional differencing operation is first applied to $X_t$; the nonseasonal fractional differencing operation is then applied to $(1 - B^s)^D X_t$; and finally SARMA filtering is applied to $(1 - B)^d (1 - B^s)^D X_t$. The conditional sum of squares is obtained, for fixed values of $\sigma_e^2$ and $D$, by computing the sum of squares of the resulting series $\theta^{-1}(B)\Theta^{-1}(B^s)\phi(B)\Phi(B^s)(1 - B)^d (1 - B^s)^D X_t$. Chung and Baillie [20] also emphasize that the parameter estimates obtained by the CSS method have less bias when the mean of the series is known. The CSS method is easy to apply because it does not require the calculation of autocovariances. In the literature, the CSS method has been used for the SARFIMA$(P, D, Q)_s$ model only by Darné et al. [19].
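As an illustration of the CSS procedure, the following sketch (Python, again ours rather than the authors' code) estimates a SARFIMA(1, D, 0)_s model: it applies a truncated expansion of $(1 - B^s)^D$, filters the result with the seasonal AR polynomial, and minimizes the conditional sum of squares (4.1) with $\sigma_e^2$ concentrated out as SSR/n, which is a common simplification; the starting values and parameter bounds are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def frac_diff_weights(D, m):
    """Coefficients c_0..c_{m-1} of (1 - B)^D = sum_k c_k B^k,
    via c_0 = 1, c_k = c_{k-1} * (k - 1 - D) / k."""
    c = np.empty(m)
    c[0] = 1.0
    for k in range(1, m):
        c[k] = c[k - 1] * (k - 1 - D) / k
    return c

def seasonal_frac_diff(x, D, s):
    """Apply (1 - B^s)^D to x, truncating the expansion at the available sample."""
    n = len(x)
    c = frac_diff_weights(D, n // s + 1)
    out = np.zeros(n)
    for t in range(n):
        k = np.arange(t // s + 1)
        out[t] = c[k] @ x[t - k * s]
    return out

def css_objective(params, x, s):
    """Conditional sum of squares for a SARFIMA(1, D, 0)_s model, cf. eq. (4.1);
    sigma_e^2 is concentrated out as SSR/n, a standard simplification."""
    D, phi = params
    if not (0.0 < D < 0.5 and abs(phi) < 1.0):
        return np.inf
    u = seasonal_frac_diff(np.asarray(x, dtype=float), D, s)
    e = u.copy()
    e[s:] -= phi * u[:-s]        # conditional residuals of the Phi(B^s) filter
    n = len(x)
    ssr = e @ e
    return 0.5 * n * np.log(ssr / n) + 0.5 * n

def estimate_css(x, s, start=(0.2, 0.3)):
    """Minimize the CSS criterion over (D, Phi); starting values are illustrative."""
    res = minimize(css_objective, x0=np.asarray(start), args=(x, s), method="Nelder-Mead")
    return res.x
```

Applying such an estimator to series produced by the generation sketch of Section 2 reproduces the kind of experiment summarized in Tables 1 and 2, although the exact figures naturally depend on implementation details.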
5. Simulation Study

In this section, the parameters of SARFIMA$(P, D, Q)_s$ models are estimated by using the CSS and the two-staged methods separately under different parameter settings and sample sizes, and the advantages and disadvantages of both methods are discussed. The algorithm whose steps are given in Section 2 is used to generate series from various SARFIMA$(P, D, Q)_s$ models; the simulation study concentrates on the SARFIMA$(1, D, 0)_s$ and SARFIMA$(0, D, 1)_s$ models. For the SARFIMA$(1, D, 0)_s$ model, 36 cases are examined, combining the seasonal fractional difference parameter $D = 0.1, 0.2, 0.3$, the seasonal autoregressive parameter $\Phi = 0.3, 0.7, -0.3, -0.7$, the sample sizes $n = 120, 240, 360$, and the period $s = 4$. The same settings are used for the SARFIMA$(0, D, 1)_s$ model with $\Theta = 0.3, 0.7, -0.3, -0.7$. For each case, 1000 time series are generated, so 72,000 time series are generated in total. The parameters of the generated series are estimated by both the CSS and the two-staged methods, and the results are summarized in Tables 1 and 2. For each set of 1000 series, the mean, standard deviation, and root mean square error (RMSE) of the estimated parameters are reported in these tables. RMSE values are computed as
\[
\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{1000} \left( \hat{\beta}_i - \beta_i \right)^2}{1000}}, \tag{5.1}
\]
where $\beta_i$ and $\hat{\beta}_i$ denote the true and estimated values of the parameter, respectively.
In Table 1, the simulation results obtained by the CSS and the two-staged methods for the SARFIMA$(1, D, 0)_s$ model are shown for different values of $\Phi$ and different sample sizes $n$. From this table, we observe that, for the CSS method, the RMSE values of the estimated seasonal fractional difference and seasonal autoregressive parameters decrease sharply as the sample size increases.
Table 1: Simulation results of the CSS and two-staged methods for the SARFIMA(1, D, 0)_4 model.

  D    n      Φ    Mean D        Mean Φ         Std. dev. D   Std. dev. Φ   RMSE D        RMSE Φ
                   CSS    TSM    CSS    TSM     CSS    TSM    CSS    TSM    CSS    TSM    CSS    TSM
 0.1  120   0.3    0.06   0.14   0.28   0.28    0.06   0.09   0.12   0.09   0.08   0.10   0.12   0.09
 0.1  120   0.7    0.08   0.15   0.68   0.48    0.07   0.08   0.09   0.07   0.08   0.09   0.10   0.23
 0.1  120  -0.3    0.05   0.02  -0.30  -0.26    0.06   0.05   0.11   0.09   0.08   0.09   0.11   0.10
 0.1  120  -0.7    0.04   0.00  -0.69  -0.46    0.06   0.01   0.08   0.07   0.08   0.10   0.08   0.25
 0.1  240   0.3    0.05   0.15   0.29   0.29    0.04   0.07   0.07   0.06   0.07   0.08   0.07   0.06
 0.1  240   0.7    0.06   0.16   0.69   0.48    0.05   0.06   0.05   0.05   0.06   0.08   0.05   0.23
 0.1  240  -0.3    0.04   0.01  -0.30  -0.26    0.04   0.03   0.07   0.06   0.07   0.09   0.07   0.07
 0.1  240  -0.7    0.04   0.00  -0.69  -0.47    0.04   0.00   0.05   0.05   0.08   0.10   0.05   0.24
 0.1  360   0.3    0.04   0.15   0.30   0.29    0.04   0.06   0.06   0.05   0.07   0.08   0.06   0.05
 0.1  360   0.7    0.05   0.16   0.70   0.48    0.04   0.05   0.04   0.04   0.06   0.08   0.04   0.22
 0.1  360  -0.3    0.04   0.01  -0.30  -0.26    0.04   0.02   0.05   0.05   0.07   0.10   0.05   0.06
 0.1  360  -0.7    0.04   0.00  -0.69  -0.46    0.04   0.00   0.04   0.04   0.07   0.10   0.04   0.24
 0.2  120   0.3    0.12   0.21   0.29   0.33    0.08   0.07   0.12   0.09   0.11   0.07   0.12   0.09
 0.2  120   0.7    0.14   0.20   0.69   0.51    0.09   0.06   0.09   0.07   0.11   0.06   0.09   0.21
 0.2  120  -0.3    0.10   0.07  -0.29  -0.23    0.08   0.07   0.12   0.09   0.13   0.15   0.12   0.11
 0.2  120  -0.7    0.08   0.00  -0.69  -0.45    0.07   0.02   0.08   0.07   0.14   0.20   0.08   0.26
 0.2  240   0.3    0.11   0.21   0.30   0.33    0.06   0.05   0.07   0.06   0.11   0.05   0.07   0.07
 0.2  240   0.7    0.12   0.21   0.70   0.51    0.06   0.04   0.06   0.05   0.10   0.04   0.06   0.19
 0.2  240  -0.3    0.10   0.06  -0.30  -0.23    0.06   0.06   0.07   0.07   0.12   0.15   0.07   0.09
 0.2  240  -0.7    0.09   0.00  -0.69  -0.45    0.05   0.00   0.05   0.05   0.12   0.20   0.05   0.25
 0.2  360   0.3    0.11   0.22   0.30   0.33    0.05   0.04   0.06   0.05   0.10   0.04   0.06   0.06
 0.2  360   0.7    0.11   0.22   0.70   0.51    0.05   0.03   0.04   0.04   0.10   0.03   0.04   0.19
 0.2  360  -0.3    0.10   0.06  -0.29  -0.24    0.05   0.05   0.05   0.05   0.11   0.15   0.05   0.08
 0.2  360  -0.7    0.10   0.00  -0.69  -0.45    0.05   0.00   0.04   0.04   0.11   0.20   0.04   0.25
 0.3  120   0.3    0.21   0.27   0.31   0.42    0.09   0.04   0.11   0.11   0.13   0.05   0.11   0.16
 0.3  120   0.7    0.25   0.26   0.70   0.58    0.09   0.04   0.09   0.09   0.10   0.06   0.09   0.15
 0.3  120  -0.3    0.20   0.16  -0.28  -0.16    0.09   0.07   0.11   0.10   0.14   0.16   0.12   0.17
 0.3  120  -0.7    0.18   0.02  -0.68  -0.43    0.09   0.04   0.09   0.08   0.15   0.28   0.09   0.28
 0.3  240   0.3    0.22   0.27   0.31   0.43    0.06   0.03   0.07   0.08   0.11   0.04   0.07   0.15
 0.3  240   0.7    0.23   0.27   0.71   0.59    0.07   0.03   0.06   0.06   0.10   0.04   0.06   0.13
 0.3  240  -0.3    0.21   0.17  -0.29  -0.16    0.06   0.06   0.07   0.08   0.11   0.14   0.07   0.16
 0.3  240  -0.7    0.20   0.01  -0.69  -0.43    0.06   0.02   0.05   0.05   0.12   0.29   0.05   0.27
 0.3  360   0.3    0.21   0.27   0.31   0.43    0.05   0.02   0.06   0.07   0.10   0.04   0.06   0.14
 0.3  360   0.7    0.22   0.27   0.72   0.59    0.05   0.03   0.04   0.05   0.10   0.04   0.05   0.12
 0.3  360  -0.3    0.21   0.17  -0.29  -0.16    0.05   0.05   0.06   0.06   0.11   0.14   0.06   0.15
 0.3  360  -0.7    0.20   0.01  -0.69  -0.43    0.05   0.02   0.04   0.04   0.11   0.29   0.04   0.27
It is also seen that the RMSE values do not change much according to whether the sign of the seasonal autoregressive parameter is positive or negative. When the absolute value of the seasonal autoregressive parameter is larger, the RMSE values of the seasonal autoregressive parameter (RMSE Φ) are smaller. When D = 0.1 and D = 0.2 are compared, the values of RMSE D for D = 0.1 are smaller than those for D = 0.2, whereas the values of RMSE Φ for D = 0.1, D = 0.2, and D = 0.3 are close to each other. Note that the values of RMSE D for D = 0.3 are larger than those for D = 0.1.
According to Table 1, when the two-staged method is applied, the sample size does not significantly affect the RMSE values, especially RMSE Φ when Φ = -0.7. However, as the absolute value of the estimated seasonal autoregressive parameter increases, the values of RMSE Φ increase dramatically for D = 0.1 and D = 0.2. The values of RMSE D are affected by neither the sign nor the magnitude of the seasonal autoregressive parameter, especially for D = 0.1. It is worth pointing out that the values of RMSE D are considerably larger for negative values of the seasonal autoregressive parameter for both D = 0.2 and D = 0.3. A comparison of D = 0.1 and D = 0.2 shows that, for negative values of the seasonal autoregressive parameter, both RMSE D and RMSE Φ increase gradually as D increases. In particular, the values of RMSE Φ for D = 0.3 are the largest when the seasonal autoregressive parameter is negative. Therefore, for negative values of the seasonal autoregressive parameter, the estimation error grows as the order of seasonal fractional differencing increases.

In Table 2, the simulation results for the SARFIMA$(0, D, 1)_s$ model are shown for different values of the parameter Θ and different sample sizes n for the CSS and two-staged methods. From this table, we observe that, when the CSS method is applied, the RMSE values of the estimated seasonal fractional difference and seasonal moving average parameters decrease as the sample size increases. It is also seen that the RMSE values do not change much according to whether the sign of the seasonal moving average parameter is positive or negative. For larger negative values of the seasonal moving average parameter, the RMSE values of the seasonal fractional difference parameter (RMSE D) are smaller. When D = 0.1 and D = 0.2 are compared, the values of RMSE D for D = 0.1 are smaller than those for D = 0.2, whereas the values of RMSE Θ for D = 0.1, D = 0.2, and D = 0.3 are close to each other. Note that the values of RMSE D for D = 0.3 are larger than those for D = 0.1.

When Table 2 is examined further, it is observed that, for the two-staged method, the values of RMSE Θ decrease as the sample size increases. However, there is no systematic positive or negative relation between the value of the seasonal moving average parameter and the values of RMSE D and RMSE Θ for the two-staged method. We remark that RMSE Θ takes its smallest value in each sample size for Θ = 0.7, and that the values of RMSE D are considerably larger for negative values of the seasonal moving average parameter than for positive values when D = 0.2 and D = 0.3 in Table 2.
Table 2: Simulation results of the CSS and two-staged methods for the SARFIMA(0, D, 1)_4 model.

  D    n      Θ    Mean D        Mean Θ         Std. dev. D   Std. dev. Θ   RMSE D        RMSE Θ
                   CSS    TSM    CSS    TSM     CSS    TSM    CSS    TSM    CSS    TSM    CSS    TSM
 0.1  120   0.3    0.06   0.14   0.29   0.31    0.06   0.09   0.12   0.10   0.08   0.10   0.12   0.10
 0.1  120   0.7    0.06   0.15   0.61   0.64    0.07   0.08   0.09   0.08   0.08   0.09   0.13   0.10
 0.1  120  -0.3    0.05   0.02  -0.31  -0.28    0.06   0.05   0.11   0.10   0.08   0.09   0.12   0.10
 0.1  120  -0.7    0.03   0.00  -0.62  -0.62    0.05   0.01   0.09   0.08   0.09   0.10   0.12   0.11
 0.1  240   0.3    0.05   0.15   0.29   0.31    0.04   0.07   0.07   0.06   0.07   0.08   0.07   0.07
 0.1  240   0.7    0.05   0.16   0.66   0.67    0.05   0.06   0.06   0.05   0.07   0.08   0.07   0.06
 0.1  240  -0.3    0.04   0.01  -0.30  -0.28    0.04   0.03   0.07   0.06   0.07   0.09   0.07   0.07
 0.1  240  -0.7    0.03   0.00  -0.66  -0.66    0.04   0.00   0.06   0.05   0.08   0.10   0.07   0.07
 0.1  360   0.3    0.04   0.15   0.30   0.31    0.04   0.06   0.06   0.05   0.07   0.08   0.06   0.05
 0.1  360   0.7    0.05   0.16   0.68   0.69    0.04   0.05   0.04   0.04   0.07   0.08   0.05   0.04
 0.1  360  -0.3    0.04   0.01  -0.30  -0.28    0.04   0.02   0.06   0.05   0.07   0.10   0.06   0.05
 0.1  360  -0.7    0.03   0.00  -0.67  -0.67    0.03   0.00   0.04   0.04   0.08   0.10   0.05   0.05
 0.2  120   0.3    0.11   0.21   0.30   0.34    0.08   0.07   0.12   0.10   0.12   0.07   0.12   0.11
 0.2  120   0.7    0.13   0.21   0.62   0.65    0.08   0.06   0.09   0.08   0.11   0.06   0.12   0.09
 0.2  120  -0.3    0.09   0.07  -0.29  -0.24    0.08   0.07   0.12   0.10   0.13   0.15   0.12   0.12
 0.2  120  -0.7    0.06   0.00  -0.61  -0.60    0.07   0.01   0.09   0.09   0.16   0.20   0.13   0.13
 0.2  240   0.3    0.11   0.21   0.30   0.34    0.06   0.05   0.07   0.06   0.11   0.05   0.07   0.08
 0.2  240   0.7    0.11   0.21   0.67   0.68    0.06   0.04   0.05   0.05   0.10   0.04   0.06   0.05
 0.2  240  -0.3    0.10   0.06  -0.29  -0.24    0.06   0.06   0.07   0.07   0.12   0.15   0.07   0.09
 0.2  240  -0.7    0.08   0.00  -0.65  -0.63    0.06   0.01   0.06   0.05   0.14   0.20   0.07   0.09
 0.2  360   0.3    0.11   0.22   0.30   0.34    0.05   0.04   0.06   0.05   0.10   0.04   0.06   0.06
 0.2  360   0.7    0.11   0.22   0.68   0.70    0.05   0.03   0.04   0.04   0.10   0.03   0.05   0.04
 0.2  360  -0.3    0.10   0.06  -0.29  -0.24    0.05   0.05   0.05   0.05   0.11   0.15   0.06   0.08
 0.2  360  -0.7    0.08   0.00  -0.66  -0.64    0.05   0.00   0.04   0.05   0.13   0.20   0.06   0.08
 0.3  120   0.3    0.21   0.26   0.30   0.40    0.09   0.05   0.12   0.10   0.13   0.06   0.12   0.14
 0.3  120   0.7    0.23   0.26   0.62   0.68    0.09   0.04   0.09   0.07   0.11   0.06   0.12   0.07
 0.3  120  -0.3    0.19   0.16  -0.28  -0.15    0.10   0.07   0.12   0.11   0.15   0.16   0.12   0.18
 0.3  120  -0.7    0.14   0.02  -0.60  -0.55    0.09   0.04   0.10   0.10   0.19   0.28   0.14   0.18
 0.3  240   0.3    0.21   0.27   0.31   0.40    0.06   0.03   0.07   0.07   0.11   0.04   0.07   0.12
 0.3  240   0.7    0.22   0.26   0.67   0.71    0.06   0.03   0.05   0.05   0.10   0.05   0.06   0.05
 0.3  240  -0.3    0.20   0.17  -0.28  -0.15    0.07   0.06   0.07   0.08   0.12   0.15   0.07   0.17
 0.3  240  -0.7    0.16   0.01  -0.64  -0.55    0.06   0.02   0.06   0.07   0.15   0.29   0.09   0.16
 0.3  360   0.3    0.22   0.27   0.31   0.40    0.05   0.02   0.06   0.05   0.10   0.04   0.06   0.11
 0.3  360   0.7    0.22   0.27   0.68   0.73    0.05   0.03   0.04   0.04   0.10   0.05   0.05   0.05
 0.3  360  -0.3    0.20   0.17  -0.28  -0.16    0.05   0.05   0.06   0.06   0.11   0.14   0.06   0.16
 0.3  360  -0.7    0.18   0.01  -0.65  -0.56    0.05   0.02   0.04   0.06   0.13   0.29   0.07   0.15

6. Discussions

In the literature, the two-staged method is widely used to estimate the parameters of SARFIMA models. Although the CSS method is also available, its properties for SARFIMA models had not previously been examined. In this study, the CSS and the two-staged methods are both used to estimate the parameters of SARFIMA models in a simulation study, and in this way the properties of the two methods are examined under different parameter settings and sample sizes. From the results of the simulation, we deduce that the CSS method gives more accurate estimates as the sample size increases. Moreover, when the seasonal autoregressive parameter of the SARFIMA$(1, D, 0)_4$ model gets close to 1 or -1, the parameter estimates obtained by the CSS method have smaller errors. The CSS method produces quite good estimates of D when the seasonal autoregressive parameter of the SARFIMA$(1, D, 0)_4$ model and the seasonal moving average parameter of the SARFIMA$(0, D, 1)_4$ model are positive.
When the CSS method is compared with the two-staged method, the CSS method has lower RMSE values than the two-staged method under the different parameter settings and sample sizes, especially in the seasonal autoregressive models. The two-staged method produces misleading results when Φ is chosen near -1 (Φ = -0.7), whereas this is not the case for the CSS method. Based on the results obtained and on the simplicity of the method, it can be suggested that the CSS method be preferred to the two-staged method for parameter estimation in SARFIMA models in future studies.
References
[1] R. T. Baillie, "Long memory processes and fractional integration in econometrics," Journal of Econometrics, vol. 73, no. 1, pp. 5-59, 1996.
[2] U. Hassler and J. Wolters, "On the power of unit root tests against fractional alternatives," Economics Letters, vol. 45, no. 1, pp. 1-5, 1994.
[3] L. Giraitis and R. Leipus, "A generalized fractionally differencing approach in long-memory modeling," Lithuanian Mathematical Journal, vol. 35, no. 1, pp. 53-65, 1995.
[4] J. Arteche and P. M. Robinson, "Semiparametric inference in seasonal and cyclical long memory processes," Journal of Time Series Analysis, vol. 21, no. 1, pp. 1-25, 2000.
[5] C.-F. Chung, "A generalized fractionally integrated autoregressive moving-average process," Journal of Time Series Analysis, vol. 17, no. 2, pp. 111-140, 1996.
[6] C. Velasco and P. M. Robinson, "Whittle pseudo-maximum likelihood estimation for nonstationary time series," Journal of the American Statistical Association, vol. 95, no. 452, pp. 1229-1243, 2000.
[7] L. Giraitis, J. Hidalgo, and P. M. Robinson, "Gaussian estimation of parametric spectral density with unknown pole," The Annals of Statistics, vol. 29, no. 4, pp. 987-1023, 2001.
[8] M. O. Haye, "Asymptotic behavior of the empirical process for Gaussian data presenting seasonal long-memory," European Series in Applied and Industrial Mathematics, vol. 6, pp. 293-309, 2002.
[9] V. A. Reisen, A. L. Rodrigues, and W. Palma, "Estimating seasonal long-memory processes: a Monte Carlo study," Journal of Statistical Computation and Simulation, vol. 76, no. 4, pp. 305-316, 2006.
[10] V. A. Reisen, A. L. Rodrigues, and W. Palma, "Estimation of seasonal fractionally integrated processes," Computational Statistics & Data Analysis, vol. 50, no. 2, pp. 568-582, 2006.
[11] W. Palma and N. H. Chan, "Efficient estimation of seasonal long-range-dependent processes," Journal of Time Series Analysis, vol. 26, no. 6, pp. 863-892, 2005.
[12] S. Porter-Hudak, "An application of the seasonal fractionally differenced model to the monetary aggregates," Journal of the American Statistical Association, vol. 85, no. 410, pp. 338-344, 1990.
[13] B. K. Ray, "Long-range forecasting of IBM product revenues using a seasonal fractionally differenced ARMA model," International Journal of Forecasting, vol. 9, no. 2, pp. 255-269, 1993.
[14] A. Montanari, R. Rosso, and M. S. Taqqu, "A seasonal fractional ARIMA model applied to the Nile River monthly flows at Aswan," Water Resources Research, vol. 36, no. 5, pp. 1249-1259, 2000.
[15] B. Candelon and L. A. Gil-Alana, "Seasonal and long-run fractional integration in the industrial production indexes of some Latin American countries," Journal of Policy Modeling, vol. 26, no. 3, pp. 301-313, 2004.
[16] L. A. Gil-Alana, "Seasonal long memory in the aggregate output," Economics Letters, vol. 74, no. 3, pp. 333-337, 2002.
[17] E. H. M. Brietzke, S. R. C. Lopes, and C. Bisognin, "A closed formula for the Durbin-Levinson's algorithm in seasonal fractionally integrated processes," Mathematical and Computer Modelling, vol. 42, no. 11-12, pp. 1191-1206, 2005.
[18] J. R. M. Hosking, "Modeling persistence in hydrological time series using fractional differencing," Water Resources Research, vol. 20, no. 12, pp. 1898-1908, 1984.
[19] O. Darné, V. Guiraud, and M. Terraza, "Forecasts of the seasonal fractional integrated series," Journal of Forecasting, vol. 23, no. 1, pp. 1-17, 2004.
[20] C.-F. Chung and R. T. Baillie, "Small sample bias in conditional sum-of-squares estimators of fractionally integrated ARMA models," Empirical Economics, vol. 18, no. 4, pp. 791-806, 1993.
[21] M. Ooms and U. Hassler, "On the effect of seasonal adjustment on the log-periodogram regression," Economics Letters, vol. 56, no. 2, pp. 135-141, 1997.
[22] U. Hassler and J. Wolters, "Long memory in inflation rates: international evidence," Journal of Business & Economic Statistics, vol. 13, no. 1, pp. 37-45, 1995.
[23] L. A. Gil-Alana and P. M. Robinson, "Testing of seasonal fractional integration in UK and Japanese consumption and income," Journal of Applied Econometrics, vol. 16, no. 2, pp. 95-114, 2001.
[24] J. Arteche, "Semiparametric robust tests on seasonal or cyclical long memory time series," Journal of Time Series Analysis, vol. 23, no. 3, pp. 251-285, 2002.
[25] L. A. Gil-Alana, "Seasonal misspecification in the context of fractionally integrated univariate time series," Computational Economics, vol. 22, no. 1, pp. 65-74, 2003.
[26] L. A. Gil-Alana, "Modelling seasonality with fractionally integrated processes," Tech. Rep., 2003.
[27] S. Bertelli and M. Caporin, "A note on calculating autocovariances of long-memory processes," Journal of Time Series Analysis, vol. 23, no. 5, pp. 503-508, 2002.
[28] J. Beran, Statistics for Long-Memory Processes, vol. 61, Chapman & Hall/CRC, New York, NY, USA, 1994.