
Bayesian Clustering of Many Short Time Series

Sylvia Frühwirth-Schnatter∗ and Sylvia Kaufmann†‡

February 2002

Abstract In the present paper we propose a method to infer on the different clusters potentially present in a panel data set that is data driven in the sense that the classification of each subject into a specific group is estimated along with the model parameters. The general model allows additionally for time-varying parameters, whereby the timing of the structural changes is also part of the model estimation. The presence of two latent variables, the group- and the state-identifying indicators, calls for Bayesian Markov chain Monte Carlo techniques. An application to individual bank lending data of the US banking sector illustrates the methodology. We obtain results that are broadly consistent with the bank lending view. Moreover, we infer a significant asymmetric effect of monetary policy over time which favors the evidence for models of credit cycles. JEL classification: C11,C15,E44,E51 Key words: Panel data, clustering, Markov switching, Markov chain Monte Carlo, Monetary policy.

1 Introduction

Panel data consisting of many short time series occur frequently in various areas of applied econometrics such as macroeconomics, business or marketing. A dependent variable of interest, denoted by $y_{it}$, is observed for N subjects i = 1, . . . , N over time t = 1, . . . , T, where typically N is much larger than T, which gives many short time series. Additionally, a set of explanatory variables $X_{it}$ is observed and a regression model with subject specific regression coefficient $\beta_i$ is set up to explain $y_{it}$ (see e.g. Baltagi, 1995):

$$y_{it} = X_{it}\beta_i + \varepsilon_{it}. \tag{1}$$

A typical example that we are going to investigate in the present paper is a panel of loan growth rates observed for about N = 4400 banks over T = 36 quarters; another example would be growth rates of real GDP observed for N countries over T quarters.

Vienna University of Economics and Business Administration, Department of Statistics, e-mail [email protected] † Oesterreichische Nationalbank, Economic Studies Division, e-mail [email protected], and University of Vienna, Department of Economics ‡ The views expressed in the paper are those of the authors and do not necessarily reflect those of the OeNB.


If the time series are rather short (either absolutely or relatively compared to the dimension of βi in model (1)), then estimation of βi from the individual time series yi = {yi1 , . . . , yiT }, only, will exhibit large estimation errors. In such cases, often, overall pooling is applied which means that a joint parameter βi = β is estimated for all N time series in the panel (see Hoogstrate et al., 2000 for a recent review). One of the main advantages of overall pooling is to borrow strength from the other time series in the panel to estimate the coefficient of an individual time series. In the present paper we suggest an alternative pooling method which inherits this appealing property of borrowing strength, but improves overall pooling in various respects. First, overall pooling is known to introduce a bias for subject specific coefficients, if the time series yi were generated by models with βi not being identical for all time series. The results reported in Hoogstrate et al. (2000) suggest that only in those cases where the parameters are “similar”, the gain in reducing the estimation errors may be bigger than the loss due to the bias, leading in total to reduced mean squared estimation and forecasting errors. Second, overall pooling might be inappropriate, if there are structural changes during the observation period. The basic idea of our method to avoid the first drawback is the following: rather than performing overall pooling for all time series we cluster the time series into K groups and pooling takes place only within a group. Thereby, clustering is adaptive and is performed simultaneously with parameter estimation. To this aim, a latent group indicator Si is introduced for each time series yi and is estimated jointly with all parameters using a Bayesian Markov chain Monte Carlo (MCMC) approach. A distinctive feature of the paper is thus that the clusters (or groupings) are not determined prior to the estimation, as is usually done in the literature. 
Rather, they are determined simultaneously with parameter estimation. The method may also be used to combine overall pooling for certain coefficients with pooling within clusters for other coefficients. Furthermore, our method allows for explicit changes of coefficients over time to make pooling less sensitive to structural changes. This is achieved by introducing a discrete latent state indicator It that refers to the regime prevailing in period t. Again, rather than fixing the indicator prior to estimation, it is assumed unobservable and the estimation yields an inference on the regime prevailing in t. The outline of the paper is the following. In Section 2 we formulate the model we use for clustering of the time series. In Section 3 we discuss Bayesian estimation of the model and present in Section 4 a simulation study to demonstrate the usefulness of our idea of Bayesian clustering of time series. Finally, in Section 5 we apply the methods to a panel of bank data to capture asymmetric effects of monetary policy on bank lending.

2 The Model

2.1 Model formulation

Let $\{y_{it}\}$, t = 1, . . . , T, be a time series observed for N subjects i = 1, . . . , N. For each subject i the dependent variable $y_{it}$ is described by the following model:

$$y_{it} = X^1_{it}\alpha + X^2_{it}\beta^G_{S_i} + X^3_{it}\beta^R (I_t - 1) + \varepsilon_{it}. \tag{2}$$

$\varepsilon_{it}$ in (2) is the unexplained error term. We apply here the most common error model, which is based on the assumption that $\varepsilon_{it}$ is normally distributed:

$$\varepsilon_{it} \sim N(0, \sigma^2), \tag{3}$$

with $\sigma^2$ being independent of time and being the same for all subjects, and $\varepsilon_{it}$ being uncorrelated between subjects and over time. We will comment on more general econometric error models in Section 6.

$X^1_{it}$, $X^2_{it}$ and $X^3_{it}$ are explanatory variables which might be strictly exogenous as well as lagged values $y_{i,t-p}$ of $y_{it}$. The explanatory variables affect the dependent variable $y_{it}$ in three different ways. $X^1_{it}$ contains the effects which are modelled as being fixed, meaning that a certain increase/decrease in the explanatory variables $X^1_{it}$ affects the expected change of $y_{it}$ for all subjects in the same way. $\alpha$ quantifies the influence of the fixed effects on the expected mean of $y_{it}$. $X^2_{it}$ contains effects which are modelled as being subject specific, meaning that an increase/decrease in the explanatory variable $X^2_{it}$ affects the expected change of $y_{it}$ in a different way for the various subjects. For the sake of identifiability we have to assume that $X^1_{i,\cdot}$ and $X^2_{i,\cdot}$ have no common columns, meaning that an effect is either fixed or subject specific.

There exist various ways to model a subject specific effect, among them random-effects models, or the general heterogeneity model (Verbeke and Molenberghs 2000). Model (2) is based on the switching regression model (Quandt and Ramsey 1978). We assume that the subjects form K groups with different effects, whereby within a group the effect is fixed. Let $\beta^G_1, \ldots, \beta^G_K$ be K different parameters. For each subject i, i = 1, . . . , N, we introduce a group indicator $S_i$ taking a value k ∈ {1, . . . , K}. $S_i$ indicates which parameter $\beta^G_{S_i}$ among the K possible parameters $\beta^G_1, \ldots, \beta^G_K$ should be associated with subject i. Thus knowing $S_i$ is equivalent to knowing the subject specific coefficient. An important aspect of model (2) is that we do not assume a priori knowledge of which subject belongs to which group. For each subject the group indicator $S_i$ is estimated along with all parameters from the data.

A further important aspect of model (2) is the explicit modelling of asymmetries over time, by assuming that the various explanatory variables appearing in $X^1_{it}$ or $X^2_{it}$ may have a different effect on the expected value of $y_{it}$ depending on time t. We summarize all time-varying effects in $X^3_{it}$ (therefore all columns of $X^3_{i,\cdot}$ appear either in $X^1_{i,\cdot}$ or $X^2_{i,\cdot}$). Among the various models which capture time-varying effects we apply here a Markov switching model (Hamilton, 1989), which assumes the presence of two states in time, one of them being an "ordinary" state and the other one being an "extraordinary" one. We introduce an indicator $I_t$ referring to the state at time t, t = 1, . . . , T. We assume that the indicator $I_t$ takes the value 1 in the "ordinary" state and the value 0 otherwise. If $I_t = 1$, then the following model holds:

$$y_{it} = X^1_{it}\alpha + X^2_{it}\beta^G_{S_i} + \varepsilon_{it}. \tag{4}$$

Therefore $\alpha$ and $\beta^G_{S_i}$ quantify the influence of the corresponding effects in the "ordinary" state. If the economy is in the other state ($I_t = 0$), then the following model holds:

$$y_{it} = X^1_{it}\alpha + X^2_{it}\beta^G_{S_i} - X^3_{it}\beta^R + \varepsilon_{it}. \tag{5}$$

$\beta^R$ therefore quantifies how much the effect of $X^3_{it}$ on $y_{it}$ is reduced in the extraordinary state compared to the other state. An important aspect of the Markov switching model is that we do not assume a priori knowledge of which state is present at time t. For each time t the state indicator $I_t$ is estimated along with all parameters from the data.
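To make the role of the two indicators in model (2) concrete, the following is a minimal Python sketch of the conditional mean of $y_{it}$ for a single subject and period. The function name `conditional_mean` and all numerical values are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import numpy as np

def conditional_mean(X1, X2, X3, alpha, beta_G, beta_R, S_i, I_t):
    """Mean of y_it in model (2) for one subject i and one period t.

    X1, X2, X3 : 1-D regressor rows for (i, t)
    alpha, beta_R : coefficient vectors
    beta_G : array of shape (K, d2), one row per group
    S_i : group label in {1, ..., K}; I_t : state indicator in {0, 1}
    """
    mean = X1 @ alpha + X2 @ beta_G[S_i - 1]
    # (I_t - 1) is 0 in the ordinary state and -1 in the extraordinary one,
    # so the beta_R term only enters when I_t = 0, as in equation (5).
    return mean + (I_t - 1) * (X3 @ beta_R)

# Hypothetical numbers, for illustration only
X1 = np.array([1.0, 0.5])
alpha = np.array([3.5, 0.3])
X2 = np.array([2.0])
beta_G = np.array([[-0.4], [-1.0]])   # K = 2 groups
X3 = X2                               # the time-varying effect is the group effect
beta_R = np.array([0.25])

m_ordinary = conditional_mean(X1, X2, X3, alpha, beta_G, beta_R, S_i=2, I_t=1)
m_other = conditional_mean(X1, X2, X3, alpha, beta_G, beta_R, S_i=2, I_t=0)
```

With these inputs the two means differ by exactly $X^3_{it}\beta^R$, which is the reduction described by equation (5).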

2.2 Modelling the indicators

In order to complete the model specification, we have to formulate a probabilistic model for both indicators. These probabilistic models turn out to be prior distributions within the Bayesian approach we pursue in the present paper. The basic concept behind our prior for the group indicators $S_i$ is to assume complete a priori ignorance about the group membership of a certain subject. Therefore the prior probability of each subject i to belong to group k is equal to the relative size $\eta_k$ of that group:

$$Pr(S_i = k) = \eta_k. \tag{6}$$

The group sizes $\eta = (\eta_1, \ldots, \eta_K)$, which obviously sum to 1, are assumed to be unknown and are estimated from the data. Concerning the state indicator, we assume a priori persistence of remaining in the state over time. Persistence is captured by modelling the regime indicator $I_t$ as a Markov process with unknown transition matrix $\eta^{MS}$. The probability of being in state 1 or state 0 at time t depends on whether state 1 or state 0 occurred at time t − 1:

$$Pr(I_t = 1 \mid I_{t-1} = 1) = \eta^{MS}_{11}, \qquad Pr(I_t = 0 \mid I_{t-1} = 0) = \eta^{MS}_{00}. \tag{7}$$

From this obviously

$$Pr(I_t = 0 \mid I_{t-1} = 1) = \eta^{MS}_{10} = 1 - \eta^{MS}_{11}, \qquad Pr(I_t = 1 \mid I_{t-1} = 0) = \eta^{MS}_{01} = 1 - \eta^{MS}_{00}.$$

This is the Markov switching prior (Hamilton, 1989) that is commonly applied in this context, with the size of $\eta^{MS}_{11}$ and $\eta^{MS}_{00}$ determining persistence. The transition matrix $\eta^{MS} = (\eta^{MS}_{00}, \eta^{MS}_{01}, \eta^{MS}_{10}, \eta^{MS}_{11})$ of the Markov chain $I_t$ is assumed to be unknown and is estimated from the data.
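As an illustration of the Markov switching prior (7), the sketch below simulates a state path $I_0, \ldots, I_T$ for given persistence probabilities and counts the four transition types. The function names and the numerical values of $\eta^{MS}_{11}$ and $\eta^{MS}_{00}$ are assumptions made for this example only.

```python
import numpy as np

def simulate_states(T, eta11, eta00, I0=1, rng=None):
    """Draw I_0, ..., I_T from a two-state Markov chain as in (7)."""
    rng = np.random.default_rng() if rng is None else rng
    I = np.empty(T + 1, dtype=int)
    I[0] = I0
    for t in range(1, T + 1):
        # probability of remaining in the state occupied at t - 1
        stay = eta11 if I[t - 1] == 1 else eta00
        I[t] = I[t - 1] if rng.random() < stay else 1 - I[t - 1]
    return I

def transition_counts(I):
    """counts[a, b] = number of periods t with I_{t-1} = a and I_t = b."""
    counts = np.zeros((2, 2), dtype=int)
    for prev, cur in zip(I[:-1], I[1:]):
        counts[prev, cur] += 1
    return counts

# Hypothetical persistence values; T = 36 matches the length of the bank panel
I_path = simulate_states(36, eta11=0.9, eta00=0.7, rng=np.random.default_rng(0))
counts = transition_counts(I_path)
```

Values of $\eta^{MS}_{11}$ and $\eta^{MS}_{00}$ close to 1 produce long spells in each state, which is the persistence the prior is meant to express.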

2.3 Special Cases

In this section we discuss interesting special cases of model (2).

Overall pooling without time-varying effects. If we exclude the existence of extraordinary occurrences in time ($I_t = 1$ for all t) and assume that only one group is present (K = 1), then model (2) is equivalent to the standard method of overall pooling (see e.g. Baltagi, 1995):

$$y_{it} = \tilde{X}^1_{it}\tilde{\alpha} + \varepsilon_{it}, \tag{8}$$

where $\tilde{X}^1_{it} = [X^1_{it}\; X^2_{it}]$ and $\tilde{\alpha} = (\alpha, \beta^G_1)$.

Markov Switching overall pooling. If we allow for time-varying effects, but assume that only one group is present (K = 1), then model (2) simplifies to:

$$y_{it} = \tilde{X}^1_{it}\tilde{\alpha} + X^3_{it}\beta^R (I_t - 1) + \varepsilon_{it}. \tag{9}$$

Here overall pooling takes place; however, pooling is toward a different value, depending on the unobserved state indicator $I_t$. If all effects are time-varying ($\tilde{X}^1_{it} = X^3_{it}$), pooling is toward $\tilde{\alpha}$ for $I_t = 1$ and toward $\tilde{\alpha} - \beta^R$ for $I_t = 0$ (with obvious modifications if not all effects are time-varying). Kaufmann (2001) uses this model to assess asymmetries in bank lending for Austrian data. The Markov switching panel model of Asea and Blomberg (1998) may be cast into model (9). Their estimation relies on a version of the EM algorithm, however. Pooling under structural breaks is also discussed in Hoogstrate et al. (2000) for a panel of growth rates of real GDP of 18 OECD countries; however, the time of the structural breaks is selected prior to estimation. An important feature of our method is that the timing of structural changes is estimated simultaneously with the parameters.

Table 1: Pooling for model (2) with $X^2_{it} = X^3_{it}$

| | $I_t = 1$ | $I_t = 0$ |
|---|---|---|
| $S_i = 1$ | $\beta^G_1$ | $\beta^G_1 - \beta^R$ |
| $S_i = 2$ | $\beta^G_2$ | $\beta^G_2 - \beta^R$ |
| ⋮ | ⋮ | ⋮ |
| $S_i = K$ | $\beta^G_K$ | $\beta^G_K - \beta^R$ |

Pooling within clusters. If we assume the existence of K groups, but exclude time-varying as well as fixed effects, then model (2) simplifies to:

$$y_{it} = X^2_{it}\beta^G_{S_i} + \varepsilon_{it}. \tag{10}$$

Thus all subjects with the same group indicator ($S_i = k$) are pooled toward the same parameter $\beta^G_{S_i}$, with $\beta^G_\cdot$ being different in the different clusters. Clustering time series in a panel is not uncommon, but it is always carried out prior to pooling. An important feature of our method is that the clusters are not predefined, but are determined simultaneously with parameter estimation.

Note that with the general model (2) these special pooling techniques may be combined. For a model with $X^1_{it} = X^3_{it}$, Markov Switching overall pooling for $\alpha$ is combined with pooling within clusters for $\beta^G_\cdot$. The group specific effects are unaffected by structural changes; only the fixed effects are time-varying. For a model with $X^2_{it} = X^3_{it}$, overall pooling for $\alpha$ is combined with Markov Switching pooling within clusters for $\beta^G_{S_i}$. In this last case, $\alpha$ is fixed for all time periods, whereas the effects corresponding to $\beta^G_\cdot$ are pooled toward 2K different values depending both on the group indicator $S_i$ and on the state indicator $I_t$ (see Table 1 for details). This latter model will be used in Section 5 to analyze a panel of bank data.

3 Estimation

3.1 Bayesian Estimation using MCMC

The following unknown parameters appear in model (2) and need to be estimated from the data: the fixed effects $\alpha$, the group specific effects $\beta^G_1, \ldots, \beta^G_K$ and the parameter $\beta^R$, the variance $\sigma^2$ appearing in the error model (3), the weights $\eta = (\eta_1, \ldots, \eta_K)$ of the prior distribution (6) of the group indicators $S_i$ as well as the transition matrix $\eta^{MS}$ of the prior distribution (7) of the state indicator $I_t$. From now on we will use the notation $\theta$ to summarize these parameters: $\theta = (\alpha, \beta^G_1, \ldots, \beta^G_K, \beta^R, \sigma^2, \eta, \eta^{MS})$.

Concerning the information contained in the data, we will use the following notation: $y_i = \{y_{i1}, \ldots, y_{iT}\}$ denotes all observations of subject i. $y_i^{t-1}$ and $y_i^t$ denote observations up to t − 1 and t, respectively. Finally, we use $y^N = \{y_1, \ldots, y_N\}$ to denote all observations of all subjects. We discuss here estimation of the most general model (2), with obvious modifications for the more special models (8) to (10).

The presence of the unobservable group indicator $S_i$ as well as the presence of the unobservable regime indicator $I_t$ calls for the use of more elaborate estimation methods. Maximum likelihood estimation is feasible for the special cases (9) and (10), where just one of the indicators appears in the model. Classical maximum likelihood estimation is not feasible for the general model (2), as the marginal likelihood $L(y^N|\theta)$ cannot be derived in the case where both indicators are present. For this case we could, at least in principle, use the truncation filter described in Kim and Nelson (1999) to approximate the likelihood function. We prefer here to apply a Bayesian approach and to use Markov chain Monte Carlo (MCMC) methods. The present model could be viewed as a state space model with discrete state vector $I_t$ where switching (between the groups) is present. MCMC estimation of switching state space models with continuous state vector has been discussed earlier (Frühwirth-Schnatter, 2001a), and we will extend the methods discussed there to the present model, where the state vector is discrete.

For MCMC estimation of the model we start with data augmentation (Tanner and Wong, 1986). We introduce the sequences $S^N = (S_1, \ldots, S_N)$ and $I^T = (I_0, I_1, \ldots, I_T)$ as latent variables that are estimated along with the parameters. We denote the augmented parameter by $\psi$: $\psi = (S^N, I^T, \theta)$. MCMC estimation means sampling from the posterior distribution $\pi(\psi|y^N)$ of $\psi$ given all data $y^N$. $\pi(\psi|y^N)$ summarizes the information the data contain about $\psi$ and is derived from Bayes' theorem:

$$\pi(\psi|y^N) \propto \prod_{i=1}^{N} \prod_{t=p+1}^{T} f_N(y_{it} \mid S_i, I_t, \alpha, \beta^G_1, \ldots, \beta^G_K, \beta^R, \sigma^2, y_i^{t-1})\, \pi(\psi). \tag{11}$$

p is the maximum lag of $y_{it}$ appearing in the definition of $X^1_{it}$, $X^2_{it}$ or $X^3_{it}$. $f_N(y_{it}|\cdot)$ is the density of the normal distribution with the moments easily derived from (2):

$$E(y_{it} \mid S_i, I_t, \alpha, \beta^G_1, \ldots, \beta^G_K, \beta^R, \sigma^2, y_i^{t-1}) = X^1_{it}\alpha + X^2_{it}\beta^G_{S_i} + X^3_{it}\beta^R (I_t - 1),$$

$$V(y_{it} \mid S_i, I_t, \alpha, \beta^G_1, \ldots, \beta^G_K, \beta^R, \sigma^2, y_i^{t-1}) = \sigma^2.$$

The prior $\pi(\psi)$ is given by: $\pi(\psi) = \pi(I^T|\eta^{MS})\,\pi(S^N|\eta)\,\pi(\theta)$. The priors of $S^N$ and $I^T$ are derived from (6) and (7), respectively. Concerning $S^N$ we obtain:

$$\pi(S^N|\eta) = \prod_{i=1}^{N} \pi(S_i|\eta) = \prod_{k=1}^{K} \eta_k^{\#(S_i = k)},$$

whereas the prior of $I^T$ is given by:

$$\pi(I^T|\eta^{MS}) = (\eta^{MS}_{00})^{\#(I_t=0,\,I_{t-1}=0)}\,(1-\eta^{MS}_{00})^{\#(I_t=1,\,I_{t-1}=0)} \cdot (\eta^{MS}_{11})^{\#(I_t=1,\,I_{t-1}=1)}\,(1-\eta^{MS}_{11})^{\#(I_t=0,\,I_{t-1}=1)}.$$
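Both indicator priors depend on the latent sequences only through counts, which the following minimal sketch makes explicit by evaluating them on the log scale. The function names are hypothetical. As in the formula above, no separate term is included for the initial state $I_0$, and all probabilities are assumed to be strictly between 0 and 1.

```python
import numpy as np

def log_prior_groups(S, eta):
    """log pi(S^N | eta) = sum_k #(S_i = k) * log(eta_k); labels are 1, ..., K."""
    counts = np.bincount(np.asarray(S) - 1, minlength=len(eta))
    return float(np.sum(counts * np.log(eta)))

def log_prior_states(I, eta00, eta11):
    """log pi(I^T | eta^MS) computed from the four transition counts."""
    I = np.asarray(I)
    prev, cur = I[:-1], I[1:]
    n00 = np.sum((prev == 0) & (cur == 0))
    n01 = np.sum((prev == 0) & (cur == 1))   # leaving state 0
    n11 = np.sum((prev == 1) & (cur == 1))
    n10 = np.sum((prev == 1) & (cur == 0))   # leaving state 1
    return float(n00 * np.log(eta00) + n01 * np.log(1.0 - eta00)
                 + n11 * np.log(eta11) + n10 * np.log(1.0 - eta11))

# Small hypothetical example: three subjects, two groups, five time points
lp_groups = log_prior_groups([1, 1, 2], [0.5, 0.5])
lp_states = log_prior_states([1, 1, 0, 0, 1], eta00=0.5, eta11=0.5)
```

Because only the counts enter, the conditional distributions of $\eta$ and $\eta^{MS}$ given the indicators depend on $S^N$ and $I^T$ through these counts alone.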

Only the prior $\pi(\theta)$ of $\theta$ is user specific and we will comment on the choice of $\pi(\theta)$ in appendix A. For practical MCMC sampling from the posterior distribution $\pi(\psi|y^N)$, $\psi$ is divided into six different blocks and the various blocks of $\psi$ are sampled conditional on the current value of the other parameters:

1. sample $S^N$ from the conditional distribution $\pi(S^N|\theta, I^T, y^N)$;
2. sample the group probabilities $\eta$ from the conditional distribution $\pi(\eta|S^N)$;
3. sample the transition matrix $\eta^{MS}$ from the conditional distribution $\pi(\eta^{MS}|I^T)$;
4. sample all model parameters $\alpha, \beta^G_1, \ldots, \beta^G_K, \beta^R$ jointly from the conditional distribution $\pi(\alpha, \beta^G_1, \ldots, \beta^G_K, \beta^R|\sigma^2, S^N, I^T, y^N)$;
5. sample $\sigma^2$ from the conditional distribution $\pi(\sigma^2|\alpha, \beta^G_1, \ldots, \beta^G_K, \beta^R, S^N, I^T, y^N)$;
6. sample $I^T$ from the conditional distribution $\pi(I^T|\theta, S^N, y^N)$.

More details on this sampling scheme are found in appendix B. For this procedure we need a starting value for $I^T$ and $\theta$. A starting value for $I^T$ is available from Markov switching overall pooling (model (9)), for which MCMC estimation is possible from starting values for $\theta$ only. Another possibility is to set $I_t$ ad hoc by means of a variable of interest. For the application in Section 5, we set $I_t = 1$ for positive growth rates of GDP. As convergence proved to be not very sensitive with respect to starting values for $\theta$, these were set to "reasonable values" (see Section 5 for details).
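Steps 1 and 2 of the scheme above can be sketched as follows. The conditional for $S_i$ follows from the prior (6) and the normal likelihood in (11): given $\theta$ and $I^T$ the subjects are independent, and $Pr(S_i = k \mid \cdot)$ is proportional to $\eta_k$ times the product over t of the normal densities under group k. The Dirichlet update for $\eta$ assumes a symmetric Dirichlet prior with hypothetical parameter `e0`; the paper's actual prior is specified in appendix A, which is not reproduced here, so this choice is an assumption. All function names are illustrative.

```python
import numpy as np

def sample_group_indicators(resid, eta, sigma2, rng):
    """Step 1: draw S_i for all subjects.

    resid[i, k, t] = y_it minus its mean (2) computed under group k + 1,
    given the current theta and I^T. Returns labels in 1..K and the
    classification probabilities.
    """
    # log of eta_k * prod_t N(y_it; mean_k, sigma2), up to a constant
    log_w = np.log(eta)[None, :] - 0.5 * np.sum(resid ** 2, axis=2) / sigma2
    log_w -= log_w.max(axis=1, keepdims=True)      # numerical stability
    prob = np.exp(log_w)
    prob /= prob.sum(axis=1, keepdims=True)
    # inverse-CDF draw, one uniform per subject
    u = rng.random((prob.shape[0], 1))
    S = 1 + np.sum(u > np.cumsum(prob, axis=1), axis=1)
    return np.minimum(S, prob.shape[1]), prob      # guard against rounding

def sample_group_sizes(S, K, rng, e0=1.0):
    """Step 2: draw eta | S^N under an ASSUMED Dirichlet(e0, ..., e0) prior."""
    counts = np.bincount(S - 1, minlength=K)
    return rng.dirichlet(e0 + counts)

# Hypothetical toy input: two subjects, two groups, three periods
rng = np.random.default_rng(1)
resid = np.zeros((2, 2, 3))
resid[0, 1, :] = 5.0     # subject 1 fits group 1 perfectly, group 2 badly
resid[1, 0, :] = 5.0     # subject 2 fits group 2 perfectly, group 1 badly
S_draw, prob = sample_group_indicators(resid, np.array([0.5, 0.5]), 0.1, rng)
eta_draw = sample_group_sizes(S_draw, 2, rng)
```

In this toy input the residuals separate the two subjects so sharply that the classification probabilities are numerically 0 or 1; with real data they are typically less extreme, and the draws of $S_i$ vary across MCMC iterations.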

3.2 Identifiability

Model (2), like any model including discrete latent variables, is only identified up to permutations of the labelling of the groups and the states. To associate a certain parameter $\beta^G_k$ with a certain group, a unique labelling has to be selected. The most natural labelling for the groups is to associate group 1 with parameter 1, group 2 with parameter 2 and so on; note, however, that there are K! different ways of labelling. Relabelling could be done for the K groups as well as for the two states of the economy. Therefore the full unconstrained posterior of a model with K groups and two states is multimodal with at most 2 · K! modes. When applying MCMC methods to such a posterior, we have to be aware of the problem of label switching, which might render estimation of group specific quantities meaningless. To deal with the problem, in a first run we extend the random permutation sampler discussed in Frühwirth-Schnatter (2001b) to sample from the unconstrained posterior. The MCMC output of the random permutation sampler is explored in order to find suitable identifiability constraints. In a second run we use the permutation sampler to sample from the constrained posterior by imposing identifiability constraints.

The permutation sampler starts with generating $\tilde{\psi} = (\tilde{S}^N, \tilde{I}^T, \tilde{\alpha}, \tilde{\beta}^G_1, \ldots, \tilde{\beta}^G_K, \tilde{\beta}^R, \sigma^2, \tilde{\eta}, \tilde{\eta}^{MS})$ from the unconstrained posterior $\pi(\psi|y^N)$ by the sampling scheme described above. Then, due to the presence of two latent indicators, we need to perform two random permutations of the labelling. The first changes the labelling of the states of the economy randomly, the second changes the labelling of the groups randomly. We discuss here details only for the case where $X^2_{it} = X^3_{it}$, with obvious modifications for other cases.

The group specific parameters corresponding to $S_i = k$ are equal to $\tilde{\beta}^G_k$ if $\tilde{I}_t = 1$ and equal to $\tilde{\beta}^G_k - \tilde{\beta}^R$ if $\tilde{I}_t = 0$. With a probability of 0.5 the labelling of the states of the economy remains unchanged, $\psi = \tilde{\psi}$; with a probability of 0.5 the labelling of the states of the economy is changed:

$$I_t = 1 - \tilde{I}_t, \quad t = 0, \ldots, T,$$

$$\eta^{MS}_{ij} = \tilde{\eta}^{MS}_{1-i,1-j}, \quad i = 0, 1, \; j = 0, 1,$$

$$\beta^G_k = \tilde{\beta}^G_k - \tilde{\beta}^R, \quad k = 1, \ldots, K,$$

$$\beta^R = \beta^G_k - \tilde{\beta}^G_k = -\tilde{\beta}^R.$$

All other components of $\tilde{\psi}$ are unaffected by relabelling the states of the economy and remain unchanged in this first step. Subsequently we will use the notation $\tilde{\psi}$ to denote the parameter resulting from the first relabelling step.

Then we relabel the groups. There exist K! different ways to relabel the groups. We select randomly one permutation ρ(1), . . . , ρ(K) of the current labelling of the groups and define $\psi = \rho(\tilde{\psi})$ from the $\tilde{\psi}$ resulting from the first relabelling step by reordering the labelling through this permutation:

$$(\beta^G_1, \ldots, \beta^G_K) := (\tilde{\beta}^G_{\rho(1)}, \ldots, \tilde{\beta}^G_{\rho(K)}),$$

$$S^N = (S_1, \ldots, S_N) := (\rho(\tilde{S}_1), \ldots, \rho(\tilde{S}_N)), \tag{12}$$

$$\eta = (\eta_1, \ldots, \eta_K) := (\tilde{\eta}_{\rho(1)}, \ldots, \tilde{\eta}_{\rho(K)}).$$

All other components of $\tilde{\psi}$ are unaffected by relabelling the groups and remain unchanged in this step. We will demonstrate in Subsection 4.1 as well as in Section 5 how the MCMC output of the random permutation sampler may be explored in order to select the number of clusters. If the number of clearly separated simulated clusters is identical with the number of potential groups K, we are able to formulate identifiability constraints which separate the groups. By reordering the MCMC draws according to these constraints, we obtain MCMC draws from a model with unique labelling.
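The two relabelling steps can be sketched in a few lines of Python for the case $X^2_{it} = X^3_{it}$. The function names are hypothetical. One point is a convention of this sketch rather than a statement about the authors' implementation: the code sends old group label j to new label ρ(j) everywhere (in the indicators, the coefficients and the group sizes), so that each subject stays paired with its own coefficient after relabelling.

```python
import numpy as np

def relabel_states(beta_G, beta_R, I, eta_MS):
    """Swap the labels of the two states (case X2 = X3).

    eta_MS[a, b] = Pr(I_t = b | I_{t-1} = a); flipping both axes
    implements eta_ij = eta~_{1-i,1-j}.
    """
    return beta_G - beta_R, -beta_R, 1 - I, eta_MS[::-1, ::-1]

def relabel_groups(beta_G, eta, S, rho):
    """Relabel groups; rho[j - 1] is the new label of old group j (labels 1..K)."""
    beta_G, eta, rho = np.asarray(beta_G), np.asarray(eta), np.asarray(rho)
    new_beta = np.empty_like(beta_G)
    new_eta = np.empty_like(eta)
    new_beta[rho - 1] = beta_G          # old group j moves to slot rho(j)
    new_eta[rho - 1] = eta
    new_S = rho[np.asarray(S) - 1]
    return new_beta, new_eta, new_S

# Hypothetical draw with K = 3 groups and four subjects
rng = np.random.default_rng(0)
beta_G = np.array([[-0.4], [-1.0], [0.2]])
beta_R = np.array([0.25])
eta = np.array([0.6, 0.3, 0.1])
eta_MS = np.array([[0.7, 0.3], [0.1, 0.9]])
S = np.array([1, 3, 2, 1])
I = np.array([1, 0, 0, 1])

# first step: swap the states with probability 0.5
if rng.random() < 0.5:
    beta_G_1, beta_R_1, I_1, eta_MS_1 = relabel_states(beta_G, beta_R, I, eta_MS)
else:
    beta_G_1, beta_R_1, I_1, eta_MS_1 = beta_G, beta_R, I, eta_MS

# second step: apply a uniformly drawn permutation of the group labels
rho = rng.permutation(3) + 1
new_beta, new_eta, new_S = relabel_groups(beta_G_1, eta, S, rho)
```

Neither step changes the likelihood: the coefficient attached to each subject in each period is the same before and after relabelling, which is why the unconstrained posterior is invariant to these moves.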

4 Bayesian Clustering of Time Series at Work - Some Illustrations Using Synthetical Data

4.1 Selecting the Number of Clusters

When working with the method suggested in the present paper, two important practical issues arise. First, how should we decide whether, for a given data set, pooling within clusters is preferable to overall pooling? Second, if the answer to the first question is yes, how many clusters should we select? As overall pooling results as the degenerate case of pooling within clusters where the number K of clusters is equal to 1, both issues are related and could be viewed as a model selection problem concerning the number K of clusters.

Our experience with a wide range of synthetical and empirical data sets is that valuable empirical evidence concerning the number of clusters is available from an exploratory evaluation of the MCMC draws. As a general rule, we select the number of clusters for a given data set as the largest number of clearly separated simulation clusters to be found in the MCMC draws. This point will be illuminated for an empirical panel of bank balance data in Subsection 5.3, whereas in this subsection the issue will be discussed for synthetical data where we know the data generating mechanism.

We consider the following dynamic model where the reaction to an exogenous variable $x_t$ is subject specific:

$$y_{it} = c + \phi y_{i,t-1} + \beta_i x_t + \varepsilon_{it}, \tag{13}$$

$\varepsilon_{it} \sim N(0, \sigma^2)$, and the model for $\beta_i$ being the same as in Section 2.1: $\beta_i = \beta^G_k$ with probability $\eta_k$, k = 1, . . . , K. Each of our synthetical panels consists of N = 200 time series of length T = 20 assumed to be generated by model (13) with $\sigma^2 = 0.1$, $x_t \sim N(0, 0.1)$ and $\alpha = (c\;\phi)' = (3.5\;0.3)'$. We investigate six different settings of heterogeneity, the first one being a setting of homogeneity: K = 1, $\beta^G_1 = -0.45$. For the remaining five settings, we assume three hidden groups (K = 3) and combine a small group ($\eta_1 = 0.1$) with a large ($\eta_2 = 0.6$) and a medium size ($\eta_3 = 0.3$) one. In all settings the group specific parameters $(\beta^G_1, \beta^G_2, \beta^G_3)$ are chosen such that the overall mean $\alpha_\beta$,

$$\alpha_\beta = \sum_{k=1}^{K} \beta^G_k \eta_k, \tag{14}$$

is identical to −0.45; however, heterogeneity, measured by the variability Q of the group specific parameters around the overall mean,

$$Q = \sum_{k=1}^{K} (\beta^G_k - \alpha_\beta)^2 \eta_k, \tag{15}$$

increases and ranges from Q = 0.0039 to Q = 0.3 (see Table 3 for the other values of Q and Figure 1 to catch a glimpse of the increasing dissimilarity of the group specific parameters over the six settings). For all settings the parameter of the biggest group is close to the overall mean and the distribution of heterogeneity is asymmetric, having the smallest group further away from the overall mean than the medium size group.

How would we select the number of clusters for these six synthetic data sets, pretending not to know the true value? Assume that by chance we start with a model with K = 3 potential groups, which for all but the first setting would be the true number of hidden groups. For each setting model (13) is estimated using the MCMC methods described in Section 3.1. After a burn-in phase of 1000 draws we store M = 1000 MCMC draws. Below we will use the superscript (m) whenever we refer to the MCMC draws, e.g. $S_i^{(m)}$ for the draws of the group indicators $S_i$.
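The data generating mechanism (13) together with the summary measures (14) and (15) can be sketched as follows. Several details are assumptions of this example and are not stated in the text: the second argument of N(0, 0.1) is read as a variance, the recursion is started at the hypothetical value c/(1 − φ), and the group parameters `beta_G` are illustrative values chosen only so that the overall mean equals −0.45.

```python
import numpy as np

def overall_mean(beta_G, eta):
    """alpha_beta in (14)."""
    return float(np.dot(beta_G, eta))

def heterogeneity(beta_G, eta):
    """Q in (15): weighted variability around the overall mean."""
    dev = np.asarray(beta_G) - overall_mean(beta_G, eta)
    return float(np.dot(dev ** 2, eta))

def simulate_panel(beta_G, eta, N=200, T=20, c=3.5, phi=0.3,
                   sigma2=0.1, x_var=0.1, rng=None):
    """Simulate one panel from model (13); returns y (N x (T+1)), x, S."""
    rng = np.random.default_rng() if rng is None else rng
    beta_G = np.asarray(beta_G)
    S = rng.choice(len(beta_G), size=N, p=eta) + 1       # labels 1..K
    x = rng.normal(0.0, np.sqrt(x_var), size=T + 1)      # x[0] is unused
    y = np.empty((N, T + 1))
    y[:, 0] = c / (1.0 - phi)     # ASSUMED start value, not from the paper
    for t in range(1, T + 1):
        eps = rng.normal(0.0, np.sqrt(sigma2), size=N)
        y[:, t] = c + phi * y[:, t - 1] + beta_G[S - 1] * x[t] + eps
    return y, x, S

# Hypothetical group parameters: small group far from the mean,
# large group close to it, overall mean equal to -0.45
eta = np.array([0.1, 0.6, 0.3])
beta_G = np.array([-1.2, -0.4, -0.3])
alpha_beta = overall_mean(beta_G, eta)
Q = heterogeneity(beta_G, eta)
y, x, S = simulate_panel(beta_G, eta, rng=np.random.default_rng(0))
```

Scaling the deviations of `beta_G` from −0.45 by a common factor changes Q while leaving the overall mean fixed, which is one way to generate a sequence of settings with increasing heterogeneity.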

Figure 1: Scatter plots of the MCMC draws of $(\phi, \beta^G_k)$ for all groups for the various simulation settings for K = 3; (a) D = 0, (b) D = 0.05, (c) D = 0.2, (d) D = 0.4, (e) D = 0.6, (f) D = 0.8. [Figure not reproduced: six panels titled "Scatterplot of MCMC draws (D = ·, K = 3)", each plotting φ (vertical axis, 0.2 to 0.4) against $\beta^G$ (horizontal axis, −2 to 0.5).]

Figure 1 shows for each setting scatter plots of the MCMC draws $(\phi^{(m)}, \beta^{G,(m)}_\cdot)$ of the parameter combinations $(\phi, \beta^G_k)$ for all k = 1, . . . , 3. An interesting observation here is that the number of clusters to be seen in the MCMC simulations is not necessarily identical with the number of potential groups K = 3 allowed a priori, nor with the true number of hidden groups. Although we allowed for three groups, we see only one simulation cluster for the first two settings, indicating that the number of groups should be reduced to one (which is equal to the true number of clusters only for the first model). For the next two settings we find two not very clearly separated simulation clusters, indicating that the number of potential groups should be reduced to K = 2. The MCMC simulations for a model with K = 2 potential groups in Figure 2 show indeed two simulated clusters (which is in fact smaller than the true number of groups). For the third setting, however, discrimination between the simulated clusters is not very strong, leaving the selection of the appropriate number of clusters somewhat ambiguous.

Figure 2: Scatter plots of the MCMC draws of $(\phi, \beta^G_k)$ for all groups for various simulation settings for K = 2; (c) D = 0.2, (d) D = 0.4. [Figure not reproduced: two panels, each plotting φ (vertical axis, 0.2 to 0.4) against $\beta^G$ (horizontal axis, −2 to 0.5).]

Concerning the model with K = 3 potential groups, we observe from Figure 1 that only for the last two settings the number K of potential groups coincides with the number of simulation clusters, this time indicating for both models the true number of groups. For illustration, we increase for these last settings the number of potential groups from K = 3 to K = 4. Figure 3 clearly shows that the number of simulation clusters now is smaller than the potential number of groups.

As a general rule, we select the number of clusters for a given data set by increasing the number K of potential groups as long as the number of clearly separated simulation clusters to be found in the MCMC draws coincides with the number of potential groups. Having fewer simulation clusters than potential groups K in the model is a reliable hint to reduce the number of potential groups. Thus the largest K where the number of potential groups and simulation clusters match is our final choice.

To sum up, for the synthetical data we would decide for the first two settings to use overall pooling, for the next two settings to use pooling within two clusters and for the last two settings to use pooling within three clusters. We are going to demonstrate in the following subsection that this decision has been a sensible one in terms of standard econometric measures.

Figure 3: Scatter plots of the MCMC draws of $(\phi, \beta^G_k)$ for all groups for various simulation settings for K = 4; (e) D = 0.6, (f) D = 0.8. [Figure not reproduced: two panels, each plotting φ (vertical axis, 0.2 to 0.4) against $\beta^G$ (horizontal axis, −2 to 0.5).]

4.2 Is pooling within clusters worth the effort?

For the synthetical data discussed in the previous subsection we deduced from the MCMC draws for four settings that we should use pooling within clusters rather than overall pooling, whereas for two settings overall pooling should be used. In this subsection we are going to demonstrate that this decision has been a sensible one. We are going to compare overall pooling with pooling within clusters by evaluating various mean squared estimation and forecasting errors over 50 repetitions of the simulation experiment described in the previous subsection. In order to make model (13) comparable to the pooled model:

$$y_{it} = c + \phi y_{i,t-1} + \alpha_\beta x_t + \varepsilon_{it}, \tag{16}$$

we rewrite model (13) as

$$y_{it} = c + \phi y_{i,t-1} + \alpha_\beta x_t + \tilde{\varepsilon}_{it}, \tag{17}$$

where $\alpha_\beta$ is the overall mean defined in (14) and $\tilde{\varepsilon}_{it} = \varepsilon_{it} + x_t(\beta_i - \alpha_\beta)$ are heterogeneous errors. Whether for a given data set pooling within clusters is to be preferred to overall pooling or not depends on how much of the variance of $\tilde{\varepsilon}_{it}$ in (17) is caused by heterogeneity. The contribution of heterogeneity usually is measured by the coefficient of determination D, defined by the ratio of explained over total variance (see e.g. Gelfand et al., 1995), which yields for our case study:

$$D = \frac{Q\,T\,\overline{x^2}}{Q\,T\,\overline{x^2} + \sigma^2}, \tag{18}$$

where Q is defined in (15) and $\overline{x^2} = \frac{1}{T}\sum_{t=1}^{T} x_t^2$. D obviously ranges from 0 to 1. If D is close to 0, unobserved heterogeneity is not the cause of variability in $\tilde{\varepsilon}_{it}$ in (17). In this case little gain is expected by introducing the clusters and overall pooling is expected to be preferable. The more D moves away from 0, the more of the variance of $\tilde{\varepsilon}_{it}$ is caused by heterogeneity, and pooling within clusters is expected to become worth the effort.
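Equation (18) is straightforward to evaluate once Q, the regressor path and $\sigma^2$ are available. The sketch below is a direct transcription; the function name `heterogeneity_share` and the numerical inputs are hypothetical and serve only to illustrate the arithmetic.

```python
import numpy as np

def heterogeneity_share(Q, x, sigma2):
    """D in (18): share of the variance of the pooled-model error due to heterogeneity.

    Q : variability of the group parameters, equation (15)
    x : observed regressor values x_1, ..., x_T
    sigma2 : error variance of model (13)
    """
    x = np.asarray(x, dtype=float)
    T = x.size
    x2_bar = np.mean(x ** 2)          # (1/T) * sum_t x_t^2
    signal = Q * T * x2_bar
    return signal / (signal + sigma2)

# Hypothetical check: T = 4 and x_t = 0.5 give T * mean(x^2) = 1,
# so D = Q / (Q + sigma2)
D_zero = heterogeneity_share(0.0, [0.5, 0.5, 0.5, 0.5], 0.1)
D_half = heterogeneity_share(0.1, [0.5, 0.5, 0.5, 0.5], 0.1)
```

D is increasing in Q for a fixed regressor path, so larger dispersion of the group parameters translates into a larger share of heterogeneity.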

Table 2: Estimating D from the MCMC draws of a model with K = 3 groups; $\hat{D}$ is the average and s.e. is the standard error over 50 simulated panels

| D | 0 | 0.0500 | 0.2000 | 0.4000 | 0.6000 | 0.8 |
|---|---|---|---|---|---|---|
| $\hat{D}$ | 0.098 | 0.112 | 0.217 | 0.388 | 0.6 | 0.802 |
| s.e. | 0.0407 | 0.0511 | 0.0749 | 0.0617 | 0.0482 | 0.0239 |

Table 3: Comparing pooling within three clusters with overall pooling through the ratio of the expected mean squared estimation and forecasting errors defined in (19) to (21), a ratio bigger than 1 favoring overall pooling, a ratio smaller than 1 favoring pooling within three clusters

| Q | 0 | 0.00395 | 0.0188 | 0.05 | 0.113 | 0.3 |
|---|---|---|---|---|---|---|
| D | 0 | 0.0500 | 0.2000 | 0.4000 | 0.6000 | 0.8 |
| MSE$_\phi$ | 1.03 | 0.776 | 0.668 | 0.823 | 0.819 | 0.366 |
| MSE$_c$ | 1.08 | 0.752 | 0.687 | 0.828 | 0.816 | 0.376 |
| MSE$_{\beta_i}$ | 15.5 | 1.76 | 1.1 | 0.688 | 0.331 | 0.103 |
| MSFE$_1$ | 1 | 0.996 | 1 | 0.992 | 0.958 | 0.846 |
| MSFE$_2$ | 1 | 0.997 | 1 | 0.992 | 0.952 | 0.841 |
| MSFE$_3$ | 1 | 1 | 1 | 0.992 | 0.948 | 0.853 |
| MSFE$_4$ | 1 | 0.998 | 0.999 | 0.992 | 0.957 | 0.859 |

For our synthetical data D increases from 0 to 0.8 within the six settings (see Table 3 for the other values of D). Interestingly, in Figure 1 the number of simulation clusters for a model with 3 potential groups increases with D, indicating that by studying scatter plots of group specific MCMC draws we are actually learning something about the underlying heterogeneity. Note that for a given data set it is possible to estimate $\hat{D}$ from the MCMC draws, and this measure could serve as an additional tool to evaluate the usefulness of pooling within clusters. From Table 2 we find, after averaging over all 50 simulated panels for a model with K = 3 potential groups, that $\hat{D}$ is close to D; however, a bias toward overrating heterogeneity seems to be present for panels with small D.

Now, is pooling within clusters really worth the effort in traditional econometric terms? We compare the pooled model (16) and pooling within K clusters under the different settings of heterogeneity by means of various estimation and forecasting errors.
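First of all, we show the aggregated expected mean squared estimation error between the true individual coefficient $\beta_i = \beta^G_{S_i}$ and the posterior simulations $\beta_i^{(m)}$:

\[ \mathrm{MSE}_{\beta_i} = E\left( \frac{1}{N}\sum_{i=1}^{N} \frac{1}{M}\sum_{m=1}^{M} \left(\beta_i^{(m)} - \beta_i\right)^2 \right), \tag{19} \]

where $\beta_i^{(m)}$ is equal to $\beta_s^{G,(m)}$ with $s = S_i^{(m)}$ for model (13), whereas $\beta_i^{(m)} = \alpha_\beta^{(m)}$ for model (16). Second, we show the expected mean squared estimation error of $\phi$ and $c$:

\[ \mathrm{MSE}_{\phi} = E\left( \frac{1}{M}\sum_{m=1}^{M} \left(\phi^{(m)} - \phi\right)^2 \right), \qquad \mathrm{MSE}_{c} = E\left( \frac{1}{M}\sum_{m=1}^{M} \left(c^{(m)} - c\right)^2 \right). \tag{20} \]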

Finally, we consider for various forecasting horizons $h$ the expected mean squared forecasting error

\[ \mathrm{MSFE}_h = E\left( \frac{1}{N}\sum_{i=1}^{N} \frac{1}{M}\sum_{m=1}^{M} \left(y^{(m)}_{i,T+h} - y_{i,T+h}\right)^2 \right). \tag{21} \]

$y^{(m)}_{i,T+h}$ is a Bayesian forecast computed recursively from $y^{(m)}_{i,T+h} = c^{(m)} + \phi^{(m)} y^{(m)}_{i,T+h-1} + \beta^{(m)}_i x_{T+h} + \varepsilon^{(m)}_{i,T+h}$, $h = 1, 2, \ldots$, with $\varepsilon^{(m)}_{i,T+h} \sim N(0, \sigma^{2,(m)})$, $y^{(m)}_{iT} = y_{iT}$, and $\beta^{(m)}_i$ as defined after equation (19).

Table 4: Comparing pooling within two clusters with overall pooling for settings three and four of Table 3

    Q        0.0188   0.05
    D        0.2000   0.4000
    MSEβi    1.31     0.644
    MSFE1    1.04     1.01
    MSFE2    1.01     0.986
    MSFE3    1.01     0.992
    MSFE4    1.01     0.989

In Table 3, we compare pooling within three clusters with overall pooling through the ratio of the expected mean squared estimation and forecasting errors defined in (19) to (21). A ratio bigger than 1 favors overall pooling, a ratio smaller than 1 favors pooling within three clusters. The expectation of all mean squared errors defined by (19) to (21) is substituted by the average of the corresponding mean squared error over the 50 simulated panels.

Concerning the fixed parameters $\phi$ and $c$ we obtain what we would expect, namely an obvious gain in efficiency from using the right model (note that for $D = 0$ the overall pooling model is the data generating mechanism). Concerning the individual parameters $\beta_i$, we lose a lot of efficiency by introducing individual parameters in a case where none are present ($D = 0$). Interestingly, a loss of efficiency is also present for $D = 0.05$, which is in line with the result of Hoogstrate et al. (2000) that overall pooling is preferable to models allowing for individual parameters also in settings where the individual parameters, although different, are in fact rather similar. Furthermore, this gain in efficiency from overall pooling illustrates that the decision drawn in subsection 4.1 from the MCMC draws, to use overall pooling for this setting, has been a sensible one. There is a clear gain in using pooling within three clusters for the last two settings, again justifying our decision for a model with three clusters.

In Table 4 we evaluate our final choice of pooling within two clusters for the two remaining settings in comparison with overall pooling. Pooling within two clusters is clearly better than overall pooling for the fourth setting.
For the third setting, where the number of clusters present in the MCMC simulations has not been distinct, overall pooling would have been a better choice than pooling within two clusters. Finally, concerning forecasting errors, there is little difference between the two models, apart from cases with extreme heterogeneity. Interestingly, pooling within three clusters is more efficient than overall pooling exactly in those cases where the number of simulation clusters discovered in subsection 4.1 coincided with the true number of groups.
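The error measures (19) and (21) can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a scalar regressor, a single simulated panel (so the outer expectation over panels is omitted), and hypothetical function names (`mse_beta`, `forecast_path`, `msfe_single`) that do not come from the paper.

```python
import random


def mse_beta(true_beta, draws):
    """Inner average of (19): mean over subjects i and draws m of
    (beta_i^(m) - beta_i)^2.  true_beta has length N, draws is M x N."""
    n, m = len(true_beta), len(draws)
    total = sum((d[i] - true_beta[i]) ** 2 for d in draws for i in range(n))
    return total / (n * m)


def forecast_path(y_T, x_future, c, phi, beta_i, sigma2, rng):
    """One simulated path y_{i,T+1}, ..., y_{i,T+H} for a single MCMC draw,
    following the recursion stated after (21), started at the observed y_{iT}."""
    path, y_prev = [], y_T
    for x in x_future:
        y_prev = c + phi * y_prev + beta_i * x + rng.gauss(0.0, sigma2 ** 0.5)
        path.append(y_prev)
    return path


def msfe_single(y_future, y_T, x_future, draws, rng):
    """Squared forecast error per horizon h for one subject, averaged over
    draws; each draw is a tuple (c, phi, beta_i, sigma2)."""
    acc = [0.0] * len(x_future)
    for c, phi, beta_i, sigma2 in draws:
        path = forecast_path(y_T, x_future, c, phi, beta_i, sigma2, rng)
        for h, (f, y) in enumerate(zip(path, y_future)):
            acc[h] += (f - y) ** 2
    return [a / len(draws) for a in acc]
```

The ratios reported in Tables 3 and 4 would then be obtained by dividing the value for pooling within clusters by the value for overall pooling, each averaged over the simulated panels.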


5

Capturing Asymmetric Effects of Monetary Policy on Bank Lending

To exemplify the methodology, we will use a subsample of the data used by Kashyap and Stein (1995, 2000). The inference yields results that are broadly consistent with theirs, documenting that monetary policy has differential effects on the lending behavior of banks of different size. However, the endogenous grouping reveals that size is not an absolute criterion for classifying banks' lending reaction, i.e. not all small (large) banks react in the same way to monetary policy actions. Moreover, we find a significant asymmetric effect of monetary policy over time in the data, which favors the evidence for models of credit cycles as developed in e.g. Kiyotaki and Moore (1997) or in Azariadis and Smith (1998).

5.1

The data

The panel of quarterly individual bank balance sheet data stems from the Report of Condition and Income (Call Report) of the Federal Reserve Board.1 The original data set contains over 14,000 banks in 1976, a number that decreased steadily from the mid-eighties to less than 10,000 at the end of the nineties due to mergers, failures and fewer new charters (see also Rhoades, 2000). For the analysis we will work with a balanced data set and therefore retain in the sample those banks that are present over the whole observation period. These turn out to be 4,391 banks covering approximately 30% of total assets of the banking sector in the second quarter of 1989.

Table 5: Summary statistics as of 1989Q2, in million $.

                                                 absolute size              relative size
                                             above 1,002   below 498     above 95th   below 75th
                                   Total     asset total   asset total   percentile   percentile
    No of banks                    4,391     99            4,239         220          3,293
    Total assets1 / Market share   1,130,103 0.75          0.22          0.80         0.10
    Median size                    40.03     3,470.91      38.53         823.45       29.03
    Mean size                      257.43    8,531.51      59.19         4,114.01     33.03
    Mean liquidity2                45.31     33.85         45.72         34.48        47.19
    Total loans3 / Market share    659,818   0.76          0.20          0.82         0.08
    Mean loan share                0.51      0.61          0.50          0.61         0.49

    1 Total Assets: rcfd2170
    2 See footnote 2 for the definition of liquidity
    3 See footnote 3 for the definition of total loans

1 The data, information on mergers, and a description of how to form consistent time series are available on www.chicagofed.org.

Although the coverage of the banking sector seems rather low, the basic features of the population are well captured by the restricted sample. Table 5 reproduces some summary statistics of our data set. The banks in the top 5th percentile cover 80% of the asset total in the sample, while the smallest 75% of the banks account for 10% of the asset total. As documented in Kashyap and Stein (2000), the whole bank sample has nearly the same distribution in 1993Q2, with 76% and 11% of the asset total covered by the banks in the top 5th and the bottom 75th percentiles, respectively. Also, the liquidity shares2 are representative for both size categories in the banking sector, whereby the larger banks have a lower degree of liquidity (34%) than smaller ones (47%). For the whole banking sector in 1993Q2, the respective shares amount to 36% and 45%. Note finally that the credit market shares mirror the asset share distribution (see the last panel of Table 5). Again, the mean loan shares, which amount to 61% and 49% for the largest and the smaller banks, respectively, correspond to 60% and 53% for the respective categories in the whole banking sector in 1993Q2.

The model is estimated for the period starting in the first quarter of 1987 and ending in the fourth quarter of 1995. The time period is chosen so as to include the last recession that the economy experienced in the nineties. Moreover, the time dimension of nine years is set in order to cover a whole policy or interest rate cycle in the data.

For the analysis, we have to clean the data from outliers. The information given in the Merger Data file accompanying the data set is used to identify outliers that are due to mergers. Statistical outliers in the loans growth rate series of each bank3 are then identified in several steps. First, we identify outliers individually for each bank as those growth rates which lie outside +/- 5 times the interquartile range around the median. Then, the loans growth rate series of each bank containing outliers is inspected visually.
In 3 cases, this leads to the identification of 4, 3 and 1 additional outliers, respectively. Finally, we exclude 31 banks, the loans growth rate series of which prove to be extremely volatile or have missing values over nearly the entire observation period. Some of these banks also were involved in mergers in nearly every quarter of a sustained period. To retain a maximum number of banks in the sample, however, in general we treat outliers and mergers as missing values. The sampler described above is extended by one step that replaces a missing value in period t by an estimate of the observation given all information (see appendix C for a description and some examples).
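The first, purely mechanical step of the outlier screening (median +/- 5 times the interquartile range, with flagged values treated as missing) can be sketched as follows. The function name `flag_outliers` is hypothetical, and the quartile convention of Python's `statistics.quantiles` is an assumption, since the text does not say how the quartiles were computed. The visual inspection and the merger information are not covered by this sketch.

```python
import statistics


def flag_outliers(growth, k=5.0):
    """Return a copy of one bank's growth-rate series in which values lying
    outside median +/- k * IQR are replaced by None (treated as missing).
    Existing None entries are kept as missing."""
    observed = [g for g in growth if g is not None]
    q1, _, q3 = statistics.quantiles(observed, n=4)  # assumed quartile rule
    median = statistics.median(observed)
    iqr = q3 - q1
    lower, upper = median - k * iqr, median + k * iqr
    return [None if g is None or g < lower or g > upper else g for g in growth]
```

For example, `flag_outliers([1.0, 2.0, 1.5, 2.5, 1.8, 2.2, 60.0, 1.9])` marks only the value 60.0 as missing; the sampler step described in appendix C would then impute it.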

5.2

Results

We discuss the posterior inference we obtain for three groups, $K = 3$. Model specification issues are discussed in the next subsection, where we show how the explorative tools can be used to find the adequate number of groups and the number of lags of the explanatory variable to be included in the model.

2 The liquidity share is defined as the fraction of securities and cash to total assets. Over the estimation period, securities exclude assets in trading accounts: U.S. Treasury Securities (rcfd0400) + U.S. Government Agency and Corporation Obligations (rcfd0600) + Obligations of States & Political Subdivisions (rcfd0900) + All Other Bonds, Stocks and Securities (rcfd0380) - Trading Account Securities (rcfd1000) + Cash (rcfd0010).
3 Total Loans, Net of Allowance and Reserve rcfd2125.

To assess the asymmetric reaction of bank lending to monetary policy changes, we estimate a reduced form equation which, for bank $i$, writes explicitly (see Kashyap and Stein, 1995):

\[ dlo_{it} = \alpha_0 + \sum_{j=1}^{3} \alpha_j D_{jt} + \sum_{j=1}^{5} \alpha_{3+j}\, dlo_{i,t-j} + \alpha_9\, dy_t + \alpha_{10}\, dp_t + \sum_{j=1}^{5} \beta^G_{S_i,j}\, dir_{t-j} + \sum_{j=1}^{5} \beta^R_j\, dir_{t-j}\,(I_t - 1) + \varepsilon_{it}, \tag{22} \]

with $\varepsilon_{it} \sim N(0, \sigma^2)$, and where $dlo_{it}$, $dy_t$, and $dp_t$ stand for the growth rate of loans, the nominal GDP growth rate and the inflation rate, respectively, all in percentage terms, computed as 100 times the difference of the logarithmic level. The latter two variables are included to control for the overall demand situation in the economy and for the growth rate in the nominal loan level, respectively. $D_{jt}$, $j = 1, 2, 3$, is a set of quarterly dummy variables. All these variables are modelled as fixed effects with $\alpha = (\alpha_0, \ldots, \alpha_{10})$. $dir_t$ represents the first difference of the Federal Funds rate. We only include lagged values of the interest rate change in order to alleviate a potential simultaneity and/or endogeneity problem between GDP growth and interest rate movements. Moreover, the specification complies with the standard identification made in the related literature investigating monetary policy effects, where it is assumed that policy moves affect real variables only with a lag while policy itself may react contemporaneously to developments in real variables. The effect of interest rate changes is modelled as bank- and state-specific. As such the model corresponds to model (2) with $X^2_{it} = X^3_{it}$.

We obtain the posterior inference based on the last 6,000 of a total of 8,000 iterations of the random permutation sampler, whereby the parameters of the prior distributions were set in a rather uninformative manner (see appendix A for our general assumptions about the priors):
• $\eta \sim D(4, 4, 4)$.
• $\eta^{MS}_{0\cdot} \sim D(2, 1)$, $\eta^{MS}_{1\cdot} \sim D(1, 2)$.

• $\beta^G_k \sim N(0, \kappa \cdot I)$, where $\kappa = 4$ and $I$ is an appropriately dimensioned identity matrix. To ensure invariance with respect to state permutation, we set $\beta^R \mid \beta^G_k \sim N(0, 2\kappa \cdot I)$.
• $\alpha \sim N(0, \kappa \cdot I)$.
• $\sigma^2 \sim IG(1, 1)$.

The starting values were set to: $\eta = (1/3, 1/3, 1/3)$, $\eta^{MS}_{0\cdot} = (0.75\ 0.25)$, $\eta^{MS}_{1\cdot} = (0.25\ 0.75)$, $\beta^G_{kj} = 0.01$, $\beta^R_j = 0.02$, $\sigma^2 = 21.5$.

Figure 4 presents scatter plots of the elements of the switching parameter $\beta^R$ (panel (a)) and of the group parameters $\beta^G_k$, $k = 1, 2, 3$ (panel (b)). At first glance it is obvious that switching in time and three distinct groups are present in the data. The significant parameter values of $\beta^R$ document that switching in time is significant. The state identification can be based on $\beta^R_2$, as its unconstrained posterior distribution is clearly mirrored around zero. However, it is also obvious that whichever restrictions we use to identify the group- and state-specific parameters, they will lead to multimodal posterior distributions. Nevertheless, we will post-process the MCMC output in order to identify the state- and group-specific parameters according to restrictions that seem most appropriate for the present model specification. As will be seen shortly, this procedure yields valuable information on the kind of misspecification that leads to the multimodality. Here, in particular, the number of lagged interest rate changes included in the third group turns out to be too high.

Figure 4: Scatter plots of the sampled state- and group-specific parameter values, panel (a) and (b), respectively. $K = 3$, 5 lags of the interest rate change included in all groups. [Figure not reproduced; recoverable axis labels: $\beta^R_1$, $\beta^R_2$, $\beta^R_3$, $\beta^R_4$ in panel (a) and $\beta^G_{\cdot,1}$, $\beta^G_{\cdot,2}$, $\beta^G_{\cdot,3}$, $\beta^G_{\cdot,4}$ in panel (b), with clouds labelled group 1, group 2 and group 3.]

This conclusion is reached with the following set of identifying restrictions:
1. $\beta^R_2 > 0$. If the restriction is violated, then permute the simulated group- and state-specific parameter vectors accordingly to fulfill it: $\beta^G_k := \beta^G_k - \beta^R$, $k = 1, \ldots, K$, and $\beta^R := -\beta^R$.
2. $\beta^G_{1,2} < \min(\beta^G_{2,2}, \beta^G_{3,2})$ and $\beta^G_{2,1} < \beta^G_{3,1}$. With these restrictions, we identify the first group of banks as the one with the smallest coefficient on the interest rate change lagged twice. Groups two and three are identified by means of the effect of the interest rate change lagged once. If one of these restrictions is violated, then the vectors of group-specific parameters are permuted accordingly.
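The two identifying restrictions can be applied to each MCMC draw in a post-processing pass. The following is a minimal sketch for $K = 3$, with a hypothetical function name `identify_draw` and plain lists standing in for the sampled vectors. It only relabels $\beta^G_k$ and $\beta^R$; in a complete implementation the same permutations would presumably also have to be applied to the group indicators, the state path and the associated probabilities, which is an assumption not spelled out here.

```python
def identify_draw(beta_G, beta_R):
    """Relabel one MCMC draw so that it satisfies the two restrictions.

    beta_G: list of K group vectors (coefficients on lags 1..5),
    beta_R: switching vector of the same length.
    Returns relabelled copies (beta_G, beta_R)."""
    # Restriction 1: beta^R_2 > 0 (index 1); otherwise flip the state labels.
    if beta_R[1] <= 0:
        beta_G = [[g - r for g, r in zip(bk, beta_R)] for bk in beta_G]
        beta_R = [-r for r in beta_R]
    # Restriction 2: group 1 has the smallest lag-2 coefficient; the remaining
    # groups are ordered by their lag-1 coefficient.
    groups = range(len(beta_G))
    first = min(groups, key=lambda k: beta_G[k][1])
    rest = sorted((k for k in groups if k != first), key=lambda k: beta_G[k][0])
    order = [first] + rest
    return [list(beta_G[k]) for k in order], list(beta_R)
```

Applying the function to every stored draw yields the identified output on which marginal posterior densities such as those in Figure 5 can be based.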

Figure 5 depicts the posterior distributions of the group-specific parameters $\beta^G_k$, $k = 1, 2, 3$, and indeed, the multimodality shows up very clearly. As just mentioned, the multimodality might be a sign of misspecification, e.g. of overparametrization, whereby sampled values of some coefficients switch between two "different" distributions to adjust to sampled values of some other parameters that seem to be significant but in fact are not. This seems to be the case here, where the last panel in Figure 5 reveals that the coefficient on the fifth lag of the interest rate change for the third group, $\beta^G_{3,5}$, might in fact be restricted to zero. Therefore, rather than increasing the number of groups at this stage, we will first restrict $\beta^G_{3,5} = 0$.

We start the sampler for the restricted model at the mean parameter values estimated from the first MCMC output and iterate 6,000 times again, discarding the first 1,000 iterations to perform the posterior inference. As one parameter is restricted to zero, we just use the Gibbs sampler this time, i.e. the permutation step at the end of the sampler is left out. As the switching and the three groups emerge quite distinctly from the data, the state- and group-specific parameters are sampled from the respective posterior distributions.

Figure 5: Posterior distributions of the group-specific parameters, $\beta^G_i$, $i = 1, 2, 3$. [Figure not reproduced; five panels, one for each of $\beta^G_{\cdot,1}$ to $\beta^G_{\cdot,5}$, each showing the densities for group 1, group 2 and group 3.]

Figure 6: Scatter plots of the sampled state- and group-specific parameter values, panel (a) and (b), respectively. $K = 3$, 5 lags of the interest rate change included in group 1 and 2, 4 lags in group 3. [Figure not reproduced; recoverable axis labels: $\beta^R_1$, $\beta^R_2$, $\beta^R_3$, $\beta^R_4$ in panel (a) and $\beta^G_{\cdot,1}$, $\beta^G_{\cdot,2}$, $\beta^G_{\cdot,3}$, $\beta^G_{\cdot,4}$ in panel (b).]

Figure 7: Posterior distributions of the group-specific parameters in the restricted model, $\beta^G_i$, $i = 1, 2, 3$. [Figure not reproduced; five panels, one for each of $\beta^G_{\cdot,1}$ to $\beta^G_{\cdot,5}$, each showing the densities for group 1, group 2 and group 3.]

Switching between states and groups does not occur, as documented in the scatter plots of Figure 6. They reveal that the three groups can be discriminated as distinctively as before and that the multimodality has disappeared (see Figure 7). Thus, enforcing the zero restriction leads to well-identified state- and group-specific parameters.

Two elements form the interpretation of the results. The first question of interest is whether the state variable tracks specific economic time periods. Figure 8 depicts the mean posterior state probabilities (estimated by averaging over the sampled paths for $I_t$) along with a plot of (nominal) GDP growth and interest rate changes. Interestingly, the discrimination between the two states is very clear, the mean posterior probabilities of $I_t$ being near 1 in either case of $I_t = 1$ or $I_t = 0$. It is also worth mentioning that the estimate of $I_t$ does not differ from the one of the unrestricted model.4 It turns out that the state $I_t = 1$ is related to periods that record either a temporary (in 1989 and in 1995) or a major slowdown (1990/1991) in economic activity. In addition, $I_t = 1$ during the second half of 1993 through the first quarter of 1994 falls into a period of economic activity gaining strong momentum with marked inflationary pressures, as documented in the Monetary Policy Report to the Congress (1994). It happens that in all identified periods liquidity constraints were relevant in the credit market. In the first half of 1989, monetary policy was tightened further under the circumstances of strongly increasing producer and consumer prices. During the recession period in 1990 and its sluggish recovery during 1991, an accommodative monetary stance was intended to bolster lending incentives, as financial intermediaries were still hesitant about extending new credits and lending standards were becoming tighter. Finally, in 1993, the accelerating recovery was supported by strong borrowing demands of households and businesses.
In light of this assessment, the bank lending reaction to interest rate changes in periods where $I_t = 1$ should turn out stronger than during periods where $I_t = 0$.

4 In order to save space, we do not reproduce the estimates of the unrestricted model. They are available upon request.

Figure 8: Mean posterior state probabilities along with nominal GDP growth rates and interest rate changes. The estimates are obtained by averaging over the sampled paths for $I_t$. The shaded area in the bottom panel refers to the recession identified by the NBER, whereby the peak of the cycle was reached in July 1990 and the trough in March 1991. [Figure not reproduced; three panels over 1987 to 1995: $I_t = 0$, $I_t = 1$, and GDP (solid) with the interest rate (dashed).]

The group-specific parameter estimates in the two states are summarized in Table 6 with t-values in parentheses. The mean and standard error are estimated by averaging and taking the standard deviation over the sampled parameter values, respectively. Group 2 is the main group in which most of the banks are classified. When $I_t = 1$, the overall effect of interest rate changes (the sum over the coefficients) is negative, a basis point increase in the interest rate leads to a reduction in bank lending of 1.1%, whereby the maximum negative effect occurs after 3 quarters. Far fewer banks fall into group 1 and group 3. While the overall effect of interest rate changes is insignificant for banks of group 1, the banks in group 3 react more strongly than banks of group 2. The distinctive feature of these groups is that they include relatively more small and more liquid banks than group 2 (see the median size and the mean liquidity of each group, and Figure 9). The banks in group 3 might be those banks that are exposed to the problem of informational asymmetry more strongly than the banks of group 2, and therefore restrict lending more than banks of group 2 do. On the other hand, the overall insignificant effect of interest rate changes in group 1 might come from the fact that these banks draw down on their securities rather than restrict lending after interest rate increases.

Table 6: Group-specific parameters (with t-values). The estimates are obtained by averaging over the 6,000 simulated values.

                               It = 1                             It = 0
                    β1·        β2·        β3·        β1· − βR   β2· − βR   β3· − βR   βR
    dir(t−1)        3.63       1.06       4.81       3.21       0.63       4.39       0.42
                    (8.39)     (9.82)     (6.98)     (7.64)     (9.09)     (6.36)     (3.35)
    dir(t−2)        -7.72      0.83       -0.43      -8.71      -0.15      -1.41      0.98
                    (-16.93)   (8.26)     (-0.44)    (-18.67)   (-1.47)    (-1.43)    (8.31)
    dir(t−3)        2.66       -2.32      0.40       5.44       0.45       3.18       -2.78
                    (5.50)     (-10.87)   (0.43)     (11.74)    (6.03)     (3.55)     (-13.64)
    dir(t−4)        3.40       0.21       -7.51      2.98       -0.20      -7.93      0.42
                    (7.20)     (1.42)     (-7.55)    (6.41)     (-2.57)    (-8.24)    (2.76)
    dir(t−5)        -1.31      -0.89      0.00       -0.33      0.09       0.00       -0.98
                    (-2.92)    (-4.89)    –          (-0.79)    (1.39)     –          (-5.34)
    sum             0.64       -1.12      -2.73      2.58       0.82       -1.77
                    (1.59)     (-6.92)    (-2.80)    (6.84)     (13.81)    (-1.85)
    no. of banks    75         4243       41
    average size    37.41      205.56     1578.89
    median size     20.07      40.30      29.75
    average liq.    55.57      45.12      53.08

To summarize, these results are reconciled with the bank lending view of the credit channel in monetary policy transmission, according to which smaller banks are more exposed to credit market frictions. This applies to most banks of group 3. However, a strong balance sheet, reflected in a high degree of liquidity, alleviates these problems especially for small banks. This effect seems to be at work for banks of group 1. Overall, the state $I_t = 1$ captures periods of tight liquidity, where credit market imperfections, in particular informational asymmetries, lead to credit rationing after the implementation of restrictive monetary policy.

The overall bank lending reaction of groups 1 and 2 becomes positively related to interest rate changes when $I_t = 0$, whereby the effect for group 1 is more pronounced than for group 2. The reaction of group 3 remains negative on average, although with marginal significance. The positive reaction of lending during these periods corroborates the interpretation of $I_t$. Hence, $I_t = 0$ identifies periods in which liquidity constraints were broadly not binding in the credit market.
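The summaries used in this subsection, posterior means with standard errors for Table 6 and mean posterior state probabilities for Figure 8, are simple averages over the stored draws. A minimal sketch follows; the function names are hypothetical, and computing the t-value as the ratio of posterior mean to posterior standard deviation (with the sample standard deviation) is an assumption about how the reported values were formed.

```python
import statistics


def summarize(draws):
    """Posterior mean, standard deviation and their ratio ("t-value")
    for the sampled values of a single coefficient."""
    mean = statistics.fmean(draws)
    sd = statistics.stdev(draws)  # sample standard deviation (n - 1)
    return mean, sd, mean / sd


def state_probabilities(paths):
    """Estimate P(I_t = 1 | data) for every period t by averaging the
    sampled 0/1 state paths; paths is a list of M lists of length T."""
    m = len(paths)
    return [sum(p[t] for p in paths) / m for t in range(len(paths[0]))]
```

The probability of $I_t = 0$ in each period is one minus the corresponding entry of `state_probabilities`.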

5.3

Increasing the number of groups

We have performed model specification with the exploratory tools outlined in Subsection 4.1. That these are quite successful in the model choice procedure is illustrated in Figure 10, which depicts the scatter plots of the group-specific parameters for the 4 and 5 groups model, respectively. We can check whether the results would have improved if, based on the results of the unconstrained 3 groups model, we had pursued this strategy for the posterior inference. The two left panels show the scatter plots of the group-specific parameters for the 4 groups model. Similarly to the unrestricted 3 groups model, the group-specific identification can be based sequentially on $\beta^G_{\cdot,2}$ and on $\beta^G_{\cdot,1}$. However, here too there is no set of restrictions that will lead to unimodal posterior distributions for all parameters. One way to go from here would be to increase the number of groups to 5 to allow the fifth group to capture the multimodality in the upper left group. The two scatter plots in the right panel of Figure 10 show the result of the 5 groups model specification. Again, however, the group-specific parameters are not well identifiable. Even more importantly, the groups cannot be discriminated distinctively, as simulated parameter values seem to switch between groups.

Figure 9: Scatter plot of the liquidity share against the log of size of banks in group 1 and 3, as of 1989Q2. The lines represent the median of the liquidity share and of the log of size of banks in group 2. [Figure not reproduced; axes: log size (1989/2) and liquidity (1989/2).]

Figure 10: Scatter plots of the sampled group-specific parameter values for $K = 4$ and $K = 5$, panel (a) and (b), respectively. [Figure not reproduced; recoverable axis labels: $\beta^G_{\cdot,1}$, $\beta^G_{\cdot,2}$, $\beta^G_{\cdot,3}$, $\beta^G_{\cdot,4}$ in both panels.]

Finally, our model choice is corroborated by the fact that, in both extended specifications, the estimate of the state variable would lack an economic interpretation due to its very low state persistence, leading to an erratic picture of the posterior state probabilities.


6

Concluding Remarks

In the present paper we suggested a new method of modelling and forecasting a panel consisting of many short time series. The main idea of our method is to cluster the time series into $K$ groups with group membership being unknown a priori. Within each group, pooling toward a fixed group coefficient is carried out. Our method also allows for mixed pooling, where pooling within clusters for part of the coefficients is combined with overall pooling for the remaining parameters. Additionally, we introduced Markov switching pooling to account for potential time variation of the coefficients. Our main application came from analyzing panel data from the monetary sector, where we tried to capture asymmetric effects of monetary policy on bank lending. Other potential applications are in marketing, to capture unobserved heterogeneity under changing consumer preferences (see e.g. Frühwirth-Schnatter et al., 2001), or in finding convergence clubs in macro-economic panels (Canova, 1999).

For the sake of readability we decided to discuss a somewhat limited version of model (2) within the current paper. We want to conclude the paper by discussing potential extensions of this model. First, it is rather straightforward to relax the assumption of error model (3) to deal with contemporaneous correlation between the subjects at a given time and with autocorrelated errors. Second, model (2) assumes that the conditional distribution of the observations is normal. An interesting aspect of the model is that the marginal distribution $f(y_{it} \mid y^{t-1}, \theta)$ of $y_{it}$, where the unknown group indicator $S_i$ is integrated out, is a non-normal distribution. Nevertheless, the assumption of a normal distribution is of concern for computing pooled estimators within each group, which may be sensitive to outlying values. In our case study in Section 5 we reduced this sensitivity by treating outlying values as missing. Alternatively, it might be useful to extend the conditional distribution of $y_{it}$ to a $t_\nu$ distribution.
This is achieved by introducing error variance heterogeneity in the following way:

\[ \varepsilon_{it} \sim N(0, \sigma_i^2), \qquad \sigma_i^2 = \lambda_i^{-1} \sigma^2, \qquad \lambda_i \sim G(\nu/2, \nu/2). \]
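That this scale mixture of normals gives $t_\nu$-distributed errors marginally can be checked by simulation. The sketch below uses only the Python standard library and assumes that $G(a, b)$ denotes a Gamma distribution with shape $a$ and rate $b$; the function name `draw_errors` is hypothetical. For $\nu > 2$ the marginal variance should be close to $\sigma^2 \nu / (\nu - 2)$.

```python
import random
import statistics


def draw_errors(n_subjects, n_periods, nu, sigma2, rng):
    """Draw errors eps_it ~ N(0, sigma2 / lambda_i) with one
    lambda_i ~ Gamma(shape=nu/2, rate=nu/2) per subject."""
    panel = []
    for _ in range(n_subjects):
        # random.gammavariate takes shape and SCALE, so rate nu/2 -> scale 2/nu.
        lam = rng.gammavariate(nu / 2.0, 2.0 / nu)
        sd = (sigma2 / lam) ** 0.5
        panel.append([rng.gauss(0.0, sd) for _ in range(n_periods)])
    return panel


rng = random.Random(0)
eps = [row[0] for row in draw_errors(100_000, 1, nu=10.0, sigma2=1.0, rng=rng)]
sample_variance = statistics.pvariance(eps)
```

Because $\lambda_i$ is shared across time for subject $i$, the errors of one subject are uncorrelated but not independent once $\lambda_i$ is integrated out; only the marginal distribution of each $\varepsilon_{it}$ is $t_\nu$.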

Third, the structural part of model (2) is restricted to the assumption that within each group subject-specific coefficients are the same and that time-varying effects may be captured by a Markov switching model. The idea of clustering time series into groups where we assume the same model within each group, however, may be extended in a straightforward way to other model classes which are more general than the one applied in (2). To give an example, the group-specific model could be a random effects model and we could use a continuous parameter-varying model to capture structural changes:

\[ y_{it} = X^1_{it}\alpha + X^2_{it}\beta_i + X^3_{it}\gamma_t + \varepsilon_{it}, \tag{23} \]

where $\beta_i \sim N(\beta^G_{S_i}, Q^G_{S_i})$ and $\gamma_t \sim N(\gamma_{t-1}, V)$, with $\gamma_1 = 0$. For such a model, pooling within a cluster is substituted by the softer tool of shrinkage within a cluster, meaning that although individual coefficients are pulled toward the common group mean $\beta^G_{S_i}$, the presence of a priori variation of $\beta_i$ around $\beta^G_{S_i}$ allows for differences of the individual $\beta_i$ also within the group. The covariance matrix $Q^G_k$ influences the amount of shrinkage taking place, with the limiting case of pooling within the cluster if $Q^G_k = 0$. As in a Bayesian setting it is natural to estimate the covariance matrices $Q^G_1, \ldots, Q^G_K$ simultaneously with the remaining parameters and latent variables, the data will tell us what the best balance between individuality and conformity will be when estimating $\beta_i$. A similar model, without the time-varying extension, is estimated in Canova (1999).

Finally, we see potential extensions to models with conditional distributions that are non-normal by the nature of the observations $y_{it}$, an example being panels where $y_{it}$ is a binary indicator or a categorical variable. Clustering of non-normal time series could be carried out as outlined in this paper, with pooling of all time series within a group by means of a logit, probit or multinomial model with group-specific parameters. A potential application of such a model could be the analysis of a panel of credit risk data for many firms.
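The role of $Q^G_k$ in the shrinkage extension (23) can be illustrated in a stripped-down scalar case. The sketch assumes a single regressor, no fixed or time-varying effects, a known error variance and a known group mean, so that the standard normal-normal conjugate update applies; the function name `shrinkage_mean` and its arguments are hypothetical and not taken from the paper.

```python
def shrinkage_mean(x, y, sigma2, group_mean, q):
    """Posterior mean of beta_i in y_t = x_t * beta_i + eps_t,
    eps_t ~ N(0, sigma2), under the prior beta_i ~ N(group_mean, q).
    q = 0 reproduces pooling within the cluster."""
    if q == 0.0:
        return group_mean
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    precision = 1.0 / q + sxx / sigma2
    return (group_mean / q + sxy / sigma2) / precision
```

With `x = [1.0, 1.0]`, `y = [2.0, 4.0]`, `sigma2 = 1.0` and `group_mean = 0.0`, the individual least squares estimate is 3; a prior variance of `q = 1.0` pulls it to 2, `q = 0.0` returns the group mean, and a very large `q` leaves the individual estimate essentially unchanged.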

Acknowledgements The work of the first author was partly supported by the Austrian Science Foundation (FWF) under grant SFB 010 (’Adaptive Information Systems and Modeling in Economics and Management Science’). We want to thank Melanie Groschan for assembling the bank panel data set.

References

Asea, P. K. and S. Brock Blomberg (1998). Lending cycles. Journal of Econometrics 83, 89–128.
Azariadis, C. and B. Smith (1998). Financial intermediation and regime switching in business cycles. The American Economic Review 88, 516–536.
Baltagi, B. H. (1995). Econometric Analysis of Panel Data. New York: Wiley.
Canova, F. (1999, June). Testing for convergence clubs in income per-capita: A predictive density approach. mimeo Universitat Pompeu Fabra.
De Bondt, G. (1999). Banks and monetary transmission in Europe: Empirical evidence. BNL Quarterly Review 209, 149–168.
Frühwirth-Schnatter, S. (2001a). Fully Bayesian analysis of switching Gaussian state space models. Annals of the Institute of Statistical Mathematics 53, 31–49.
Frühwirth-Schnatter, S. (2001b). MCMC estimation of classical and dynamic switching and mixture models. Journal of the American Statistical Association 96, 194–209.
Frühwirth-Schnatter, S., T. Otter, and R. Tüchler (2001, November). Unobserved preference changes in metric conjoint analysis. Invited paper at the Bayesian Applications and Methods in Marketing Conference, Ohio State University, Columbus.
Gelfand, A., S. Sahu, and B. Carlin (1995). Efficient parametrisations for normal linear mixed models. Biometrika 82, 479–488.
Hamilton, J. D. (1989). A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica 57, 357–384.
Hoogstrate, A. J., F. C. Palm, and G. A. Pfann (2000). Pooling in dynamic panel-data models: An application to forecasting GDP growth rates. Journal of Business & Economic Statistics 18, 274–283.
Kashyap, A. and J. Stein (1995). The impact of monetary policy on bank balance sheets. Carnegie-Rochester Conference Series on Public Policy 42, 151–195.
Kashyap, A. K. and J. C. Stein (2000). What do a million bank observations have to say about the transmission of monetary policy? The American Economic Review 90, 407–428.
Kaufmann, S. (2001). Asymmetries in bank lending behaviour. Austria during the 90s. Working Paper 97, European Central Bank.
Kiyotaki, N. and J. Moore (1997). Credit cycles. Journal of Political Economy 105, 211–248.
Monetary Policy Report to the Congress (1994). Federal Reserve Board Bulletin, March 1994.
Quandt, R. and J. Ramsey (1978). Estimating mixtures of normal distributions and switching regression. Journal of the American Statistical Association 73, 730–738.
Rhoades, S. A. (2000). Bank mergers and banking structure in the United States, 1980-98. Staff Study 174, Board of Governors of the Federal Reserve System.
Stein, J. (1998). An adverse-selection model of bank asset and liability management with implications for the transmission of monetary policy. RAND Journal of Economics 29, 466–486.
Verbeke, G. and G. Molenberghs (2000). Linear Mixed Models for Longitudinal Data. New York/Berlin/Heidelberg: Springer.

A

Choice of the Prior

For Bayesian estimation we have to select the prior $\pi(\theta)$ of $\theta$. Assuming independence between the various components, this prior is given by:

$$\pi(\theta) = \pi(\eta)\,\pi(\eta^{MS})\,\pi(\alpha, \beta_1^G, \ldots, \beta_K^G, \beta^R)\,\pi(\sigma^2).$$

In this paper the focus lies on Bayesian estimation in situations where we lack strong prior information. From a theoretical point of view, being fully non-informative about $\theta$ is possible only for the fixed effects $\alpha$ and the variance parameter $\sigma^2$. Being non-informative about $\beta_1^G, \ldots, \beta_K^G$, $\beta^R$, $\eta$, and $\eta^{MS}$ is not possible, as improper priors on these parameters may result in improper posteriors (Diebolt and Robert, 1994; Roeder and Wasserman, 1997).

A "natural" prior distribution $\pi(\eta)$ for the relative group sizes $\eta$ is a Dirichlet prior $D(e_{1,0}, \ldots, e_{K,0})$, which is the conjugate prior in the complete data setting, where $S^N$ is assumed to be known. A common choice is $e_{k,0} = 1$, $k = 1, \ldots, K$, which leads to a uniform prior on the unit simplex. We may select $e_{k,0}$ bigger than 1 to exclude empty groups a priori.

The two conditional transition distributions $\eta^{MS}_{0\cdot}$ and $\eta^{MS}_{1\cdot}$ are independent a priori and each follows a Dirichlet distribution $D(f_{i0,0}, f_{i1,0})$, $i = 0, 1$:

$$\pi(\eta^{MS}_{i0}, \eta^{MS}_{i1}) \propto (\eta^{MS}_{i0})^{f_{i0,0}-1}\,(\eta^{MS}_{i1})^{f_{i1,0}-1}.$$

For the fixed parameters $\alpha$ we use a normal prior $N(c_0, C_0)$. Concerning the group specific parameters $\beta_1^G, \ldots, \beta_K^G$, we assume that they are independent a priori. In the context of mixture modelling it is now common practice to use hierarchical priors for being weakly informative about group specific parameters (see e.g. Richardson and Green, 1997; Roeder and Wasserman, 1997; Stephens, 1997):

$$\pi(\beta_k^G) \sim N(b_0, B_0). \tag{24}$$

This allows different parameters for the various groups, however with a slight restriction expressed by the prior. Furthermore, this prior is invariant to relabeling the groups. The prior on $\beta^R$ is selected in such a way that the resulting prior is invariant to relabeling the states of $I_t$. We give details for the case where $X_{it}^2 = X_{it}^3$. Then the group specific parameters are equal to $\beta_k^G$ for $I_t = 1$ and $\beta_k^G - \beta^R$ for $I_t = 0$. Therefore, to obtain invariance with respect to relabeling $I_t$, the prior must be identical in both states of $I_t$. This implies the following prior on $(\beta_k^G, \beta^R)$:

$$\pi(\beta_k^G, \beta^R) \sim N(r_0, R_0), \tag{25}$$

where

$$r_0 = \begin{pmatrix} b_0 \\ 0 \end{pmatrix}, \qquad R_0^{-1} = \begin{pmatrix} 2B_0^{-1} & -B_0^{-1} \\ -B_0^{-1} & B_0^{-1} \end{pmatrix}.$$
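The invariance claimed for prior (25) can be checked numerically. The following is a minimal sketch, assuming NumPy and purely illustrative values for $b_0$ and $B_0$ (the function name and the numbers are not from the paper). It builds $r_0$ and $R_0$ and confirms that the implied prior of $\beta_k^G$ (state $I_t = 1$) and of $\beta_k^G - \beta^R$ (state $I_t = 0$) are both $N(b_0, B_0)$.

```python
import numpy as np

def relabeling_invariant_prior(b0, B0):
    """Build (r0, R0) of prior (25) from the group prior N(b0, B0)."""
    b0 = np.asarray(b0, dtype=float)
    B0 = np.asarray(B0, dtype=float)
    d = b0.size
    B0_inv = np.linalg.inv(B0)
    # Prior precision R0^{-1} = [[2 B0^{-1}, -B0^{-1}], [-B0^{-1}, B0^{-1}]]
    R0_inv = np.block([[2.0 * B0_inv, -B0_inv], [-B0_inv, B0_inv]])
    r0 = np.concatenate([b0, np.zeros(d)])
    return r0, np.linalg.inv(R0_inv)

# Illustrative values (assumed, not taken from the paper)
b0 = np.array([0.5, -1.0])
B0 = np.array([[2.0, 0.3], [0.3, 1.0]])
r0, R0 = relabeling_invariant_prior(b0, B0)

d = b0.size
# Linear map (beta_k^G, beta^R) -> beta_k^G - beta^R, the state-0 coefficient
L = np.hstack([np.eye(d), -np.eye(d)])
mean_state1, cov_state1 = r0[:d], R0[:d, :d]
mean_state0, cov_state0 = L @ r0, L @ R0 @ L.T
```

Both marginal priors coincide because $R_0 = \begin{pmatrix} B_0 & B_0 \\ B_0 & 2B_0 \end{pmatrix}$, so the variance of $\beta_k^G - \beta^R$ is $B_0 - 2B_0 + 2B_0 = B_0$.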

Finally, a "natural" prior for the variance parameter $\sigma^2$ is an inverted gamma prior, $\sigma^2 \sim IG(\nu_{\varepsilon,0}, G_{\varepsilon,0})$, which is the conjugate prior in the complete data setting, where $S^N$ and $I^T$ are assumed to be known.

B

MCMC Sampling

For practical MCMC sampling from the posterior distribution $\pi(\psi|y^N)$, $\psi$ is divided into six different blocks and the various blocks of $\psi$ are sampled conditional on the current value of the other parameters.

1. Sampling $S^N$ from the conditional distribution $\pi(S^N|\theta, I^T, y^N)$. From Bayes' theorem we obtain:

$$\pi(S_1, \ldots, S_N|\theta, I^T, y^N) \propto \prod_{i=1}^{N} \left[ \prod_{t=p+1}^{T} f(y_{it}|\alpha, \beta^G_{S_i}, \beta^R, \sigma^2, I_t, y_i^{t-1}) \right] \pi(S_i|\eta),$$

where $p$ is the lag of $y_{it}$. Therefore $S_1, \ldots, S_N$ are conditionally independent given $\theta$, $I^T$, $y^N$, and $S_i$ may be sampled from the discrete distribution $\pi(S_i = k|y_i, \theta, I^T)$, $k = 1, \ldots, K$:

$$\pi(S_i = k|y_i, \theta, I^T) \propto \prod_{t=p+1}^{T} f(y_{it}|\alpha, \beta_k^G, \beta^R, \sigma^2, I_t, y_i^{t-1}) \cdot \eta_k, \tag{26}$$

where $f(y_{it}|\alpha, \beta_k^G, \beta^R, \sigma^2, I_t, y_i^{t-1})$ is the density of a normal distribution with mean $X_{it}^1 \alpha + X_{it}^2 \beta_k^G + X_{it}^3 \beta^R (I_t - 1)$ and variance $\sigma^2$.
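The classification step (26) can be sketched as follows. This is a minimal illustration, assuming NumPy; the function name, the array layout and the toy data are hypothetical, and groups are coded 0, ..., K-1 rather than 1, ..., K. The conditional means $X_{it}^1 \alpha + X_{it}^2 \beta_k^G + X_{it}^3 \beta^R (I_t - 1)$ are assumed to be precomputed for every candidate group. Working with log-likelihoods avoids numerical underflow when the product over $t$ is long.

```python
import numpy as np

def sample_group_indicators(y, mu, sigma2, eta, rng):
    """Draw S_i from (26) for all subjects at once.

    y   : (N, T_eff) observations for t = p+1, ..., T
    mu  : (K, N, T_eff) conditional means under each candidate group k
    eta : (K,) current group probabilities
    """
    K = mu.shape[0]
    # Log-likelihood per group; terms common to all k cancel after normalising
    loglik = -0.5 * ((y[None, :, :] - mu) ** 2).sum(axis=2) / sigma2  # (K, N)
    logpost = loglik.T + np.log(eta)                                   # (N, K)
    logpost -= logpost.max(axis=1, keepdims=True)  # guard against underflow
    probs = np.exp(logpost)
    probs /= probs.sum(axis=1, keepdims=True)
    # Inverse-CDF draw from each row's discrete distribution
    u = rng.random(y.shape[0])
    S = (probs.cumsum(axis=1) < u[:, None]).sum(axis=1)
    return np.minimum(S, K - 1), probs

# Toy data (assumed): two well-separated groups with constant means 0 and 3
rng = np.random.default_rng(0)
N, T_eff = 6, 5
mu = np.stack([np.zeros((N, T_eff)), np.full((N, T_eff), 3.0)])
y = np.vstack([np.full((3, T_eff), 0.1), np.full((3, T_eff), 2.9)])
S, probs = sample_group_indicators(
    y, mu, sigma2=0.25, eta=np.array([0.5, 0.5]), rng=rng
)
```

Because the $S_i$ are conditionally independent, the draw is vectorised over subjects, which matters when $N$ is in the thousands.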

2. Sampling the group probabilities $\eta$ from the conditional distribution $\pi(\eta|S^N)$. The conditional distribution $\pi(\eta|S^N)$ is a Dirichlet distribution $D(e_{1,N}, \ldots, e_{K,N})$, where

$$e_{k,N} = e_{k,0} + \#(S_i = k), \qquad k = 1, \ldots, K.$$
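As a small illustration of this conjugate update, the sketch below counts group memberships and draws $\eta$ from the resulting Dirichlet distribution. It assumes NumPy, groups coded 0, ..., K-1, and a hypothetical function name.

```python
import numpy as np

def sample_group_sizes(S, K, e0, rng):
    """Draw eta from D(e_{1,N}, ..., e_{K,N}) given the classifications S."""
    counts = np.bincount(S, minlength=K)         # #(S_i = k) for each group
    e_N = np.asarray(e0, dtype=float) + counts   # posterior Dirichlet parameters
    return rng.dirichlet(e_N), e_N

# Toy classification (assumed) with K = 3 groups and the uniform prior e_{k,0} = 1
rng = np.random.default_rng(1)
S = np.array([0, 0, 1, 2, 2, 2])
eta, e_N = sample_group_sizes(S, K=3, e0=np.ones(3), rng=rng)
```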

3. Sampling the transition matrix $\eta^{MS}$ from the conditional distribution $\pi(\eta^{MS}|I^T)$. The two conditional transition distributions $\eta^{MS}_{0\cdot}$ and $\eta^{MS}_{1\cdot}$ are independent a posteriori and each follows a Dirichlet distribution $D(f_{i0,T}, f_{i1,T})$, $i = 0, 1$:

$$\pi(\eta^{MS}_{i0}, \eta^{MS}_{i1}|I^T) \propto (\eta^{MS}_{i0})^{f_{i0,T}-1}\,(\eta^{MS}_{i1})^{f_{i1,T}-1},$$

$$f_{ij,T} = f_{ij,0} + \#(I_t = j, I_{t-1} = i), \qquad i = 0, 1, \quad j = 0, 1.$$

4. Sampling of all model parameters $\alpha, \beta_1^G, \ldots, \beta_K^G, \beta^R$ jointly from the conditional distribution $\pi(\alpha, \beta_1^G, \ldots, \beta_K^G, \beta^R|\sigma^2, S^N, I^T, y^N)$. Conditional on $S^N$ and $I^T$, model (2) is a classical regression model:

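$$y_{it} = Z_{it}\alpha^* + \varepsilon_{it}, \qquad \varepsilon_{it} \sim N(0, \sigma^2),$$

with parameter $\alpha^* = (\alpha, \beta_1^G, \ldots, \beta_K^G, \beta^R)$ and

$$Z_{it} = \left( X_{it}^1 \quad X_{it}^2 D_i^{(1)} \quad \cdots \quad X_{it}^2 D_i^{(K)} \quad X_{it}^3 (I_t - 1) \right),$$

where we used the coding $D_i^{(k)} = 1$ iff $S_i = k$, for $k = 1, \ldots, K$. The posterior of $\alpha^* = (\alpha, \beta_1^G, \ldots, \beta_K^G, \beta^R)$ is given by $\pi(\alpha^*|\sigma^2, S^N, I^T, y^N) \sim N(a_N, A_N)$, where

$$A_N = \left( \sum_{i=1}^{N} \sum_{t=p+1}^{T} Z_{it}' Z_{it}/\sigma^2 + A_0^{-1} \right)^{-1}, \tag{27}$$

$$a_N = A_N \left( \sum_{i=1}^{N} \sum_{t=p+1}^{T} Z_{it}' y_{it}/\sigma^2 + A_0^{-1} a_0 \right).$$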

The joint normal prior $N(a_0, A_0)$ for $\alpha^*$ is constructed in an obvious way from the normal priors of the group specific parameters $\beta_1^G, \ldots, \beta_K^G, \beta^R$ and the fixed effects $\alpha$, respectively.

5. Sampling $\sigma^2$ from the conditional distribution $\pi(\sigma^2|\alpha, \beta_1^G, \ldots, \beta_K^G, \beta^R, S^N, I^T, y^N)$. The posterior is given by the following inverted gamma distribution:

$$\sigma^2|\alpha, \beta_1^G, \ldots, \beta_K^G, \beta^R, S^N, I^T, y^N \sim IG(\nu_{\varepsilon,N}, G_{\varepsilon,N}),$$

$$\nu_{\varepsilon,N} = \nu_{\varepsilon,0} + N(T - p)/2,$$

$$G_{\varepsilon,N} = G_{\varepsilon,0} + \frac{1}{2} \left( \sum_{i=1}^{N} \sum_{t=p+1}^{T} \left( y_{it} - X_{it}^1 \alpha - X_{it}^2 \beta^G_{S_i} - X_{it}^3 (I_t - 1) \beta^R \right)^2 \right).$$
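Steps 4 and 5 are standard conjugate updates and can be sketched as below. This is a minimal illustration, assuming NumPy, with all rows $Z_{it}$ stacked into one matrix `Z` and the matching $y_{it}$ into one vector `y`; function names and toy data are hypothetical. The code also assumes that $IG(\nu, G)$ denotes the inverted gamma distribution with shape $\nu$ and scale $G$, which is consistent with the updates above but is not stated explicitly in the text.

```python
import numpy as np

def sample_regression_block(Z, y, sigma2, a0, A0_inv, rng):
    """Draw alpha* ~ N(a_N, A_N) with A_N and a_N as in (27)."""
    A_N = np.linalg.inv(Z.T @ Z / sigma2 + A0_inv)
    A_N = 0.5 * (A_N + A_N.T)                  # remove rounding asymmetry
    a_N = A_N @ (Z.T @ y / sigma2 + A0_inv @ a0)
    return rng.multivariate_normal(a_N, A_N), a_N, A_N

def sample_sigma2(Z, y, alpha_star, nu0, G0, rng):
    """Draw sigma^2 ~ IG(nu_N, G_N), assuming a shape/scale parameterisation."""
    resid = y - Z @ alpha_star
    nu_N = nu0 + 0.5 * resid.size              # resid.size equals N (T - p)
    G_N = G0 + 0.5 * resid @ resid
    # If X ~ Gamma(shape=nu_N, rate=G_N), then 1/X ~ IG(nu_N, G_N)
    return 1.0 / rng.gamma(shape=nu_N, scale=1.0 / G_N), nu_N, G_N

# Toy regression (assumed data, weak prior)
rng = np.random.default_rng(2)
n, d = 200, 3
Z = rng.normal(size=(n, d))
alpha_true = np.array([1.0, -0.5, 2.0])
y = Z @ alpha_true + 0.3 * rng.normal(size=n)
a0, A0_inv = np.zeros(d), 1e-6 * np.eye(d)
alpha_star, a_N, A_N = sample_regression_block(Z, y, 0.09, a0, A0_inv, rng)
sigma2, nu_N, G_N = sample_sigma2(Z, y, alpha_star, nu0=1.0, G0=1.0, rng=rng)
```

With a nearly flat prior the posterior mean $a_N$ is numerically close to the ordinary least squares estimate, which gives a quick sanity check for an implementation.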

6. Sampling $I^T$ from the conditional distribution $\pi(I^T|\theta, S^N, y^N)$. This step is carried out in a multimove manner as in Chib (1996). First we run a (forward) filter to compute $\pi(I_t|\theta, S^N, y^{N,t})$, starting at $t = 1$ from the prior distribution $\pi(I_0)$. Here $y^{N,t}$ contains the observations from all banks up to $t$. This step is straightforward as the observation densities $f(y_{it}|y^{N,t-1}, I^T, \theta, S^N)$ of all observations at time $t$ depend on the past $I^T$ through $I_t$ only:

$$\pi(I_t|\theta, S^N, y^{N,t}) \propto \prod_{i=1}^{N} f(y_{it}|y_i^{t-1}, \theta, S_i, I_t)\,\pi(I_t|\theta, S^N, y^{N,t-1}), \tag{28}$$

where $f(y_{it}|y_i^{t-1}, \theta, S_i, I_t)$ is the density of a normal distribution with mean $X_{it}^1 \alpha + X_{it}^2 \beta^G_{S_i} + X_{it}^3 \beta^R (I_t - 1)$ and variance $\sigma^2$. $\pi(I_t|\theta, S^N, y^{N,t-1})$ is given by extrapolation:

$$\pi(I_t|\theta, S^N, y^{N,t-1}) = \sum_{I_{t-1}=0}^{1} \pi(I_{t-1}|\theta, S^N, y^{N,t-1})\,\eta^{MS}_{I_{t-1},I_t}.$$

Given the filter probabilities we run a backward sampler, starting at $t = T$ with sampling $I_T$ from $\pi(I_T|\theta, S^N, y^{N,T})$. For $t = T-1, \ldots, 0$ we sample from $\pi(I_t|I_{t+1}, \ldots, I_T, \theta, S^N, y^{N,T})$, which is given by:

$$\pi(I_t|I_{t+1}, \ldots, I_T, \theta, S^N, y^{N,T}) = \pi(I_t|I_{t+1}, \theta, S^N, y^{N,t}) \propto \pi(I_t|\theta, S^N, y^{N,t})\,\eta^{MS}_{I_t,I_{t+1}}.$$
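The forward filter and backward sampler of step 6 can be sketched as follows, together with the transition counts $\#(I_t = j, I_{t-1} = i)$ needed in step 3. This is a minimal illustration, assuming NumPy; the function names and the toy inputs are hypothetical. The pooled log-likelihoods $\sum_i \log f(y_{it}|\cdot, I_t = j)$ are assumed to be precomputed for each period and state, with rows of zeros for periods that carry no likelihood information (for example $t \le p$).

```python
import numpy as np

def ffbs(loglik, eta_ms, p0, rng):
    """Forward filtering, backward sampling for a two-state path I_0, ..., I_T.

    loglik : (T, 2), loglik[t-1, j] = sum_i log f(y_it | ..., I_t = j)
    eta_ms : (2, 2), eta_ms[i, j] = P(I_t = j | I_{t-1} = i)
    p0     : (2,) prior distribution of I_0
    """
    T = loglik.shape[0]
    filt = np.empty((T + 1, 2))
    filt[0] = p0
    for t in range(1, T + 1):
        pred = filt[t - 1] @ eta_ms                       # extrapolation step
        lik = np.exp(loglik[t - 1] - loglik[t - 1].max()) # rescaled for stability
        filt[t] = pred * lik / (pred * lik).sum()         # filter step (28)
    path = np.empty(T + 1, dtype=int)
    path[T] = rng.choice(2, p=filt[T])
    for t in range(T - 1, -1, -1):
        # pi(I_t | I_{t+1}, y^{N,t}) is proportional to filt[t] * eta_ms[I_t, I_{t+1}]
        w = filt[t] * eta_ms[:, path[t + 1]]
        path[t] = rng.choice(2, p=w / w.sum())
    return path, filt

def transition_counts(path):
    """counts[i, j] = number of transitions from state i to state j."""
    counts = np.zeros((2, 2), dtype=int)
    np.add.at(counts, (path[:-1], path[1:]), 1)
    return counts

# Toy input (assumed): data strongly favour state 0 for t = 1..4, state 1 for t = 5..8
rng = np.random.default_rng(3)
loglik = np.zeros((8, 2))
loglik[:4, 1] = -50.0
loglik[4:, 0] = -50.0
eta_ms = np.array([[0.9, 0.1], [0.1, 0.9]])
path, filt = ffbs(loglik, eta_ms, p0=np.array([0.5, 0.5]), rng=rng)
counts = transition_counts(path[1:])
```

Sampling the whole path in one sweep, rather than one $I_t$ at a time, is what "multimove" refers to; it typically mixes better when the states are persistent.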

C

How to Deal with Mergers and Missing Values

If a merger occurred for bank $i$ at time $t$, or if an extremely outlying value is present, we treat $y_{it}$ as missing and estimate it along with $\psi$ from the data using MCMC methods. It is possible to consider more than one missing value for each bank. Let $\tilde{y}_i$ summarize all missing values for time series $i$, and let $y_i^*$ denote the remaining observations. For each bank with missing values we use the median of the non-missing values as starting values. Steps one to six of the MCMC sampling scheme described in Appendix B are carried out conditional on a given value for all missing observations. The scheme is then concluded by an additional step sampling the missing values $\tilde{y}_i$ jointly for all banks from the conditional posterior $\pi(\tilde{y}_1, \ldots, \tilde{y}_N|\psi, y_1^*, \ldots, y_N^*)$.

The presence of lagged values of $y_{it}$ as explanatory variables for future observations leads to somewhat tedious algebra to compute the posterior distribution of the missing values given the remaining observations and the parameter $\psi$. A second problem with the presence of lagged values (lag of $p$ periods) arises with missing values at the very beginning of the time series ($t = 1, \ldots, p$). These missing observations appear only as right-hand-side variables in our model, and MCMC estimation of these variables sometimes turned out to be unstable. These numerical problems, however, could be avoided by using a slightly informative prior. We assume a priori independence of all missing values, with the mean given by the median of the non-missing values and a diagonal covariance matrix that depends on the interquartile range of the non-missing values:

$$\pi(\tilde{y}_1, \ldots, \tilde{y}_N) = \prod_{i=1}^{N} \pi(\tilde{y}_i), \qquad \pi(\tilde{y}_i) \sim N(m_{0i}, C_{0i}),$$

where $m_{0i}$ is the median of the non-missing values $y_i^*$ and $C_{0i}$ depends on the interquartile range $\mathrm{IQR}_i$ of the non-missing values through:

$$C_{0i} = \left( \frac{5}{1.34}\,\mathrm{IQR}_i \right)^2.$$

The posterior distribution of the missing values. For any time series, the missing values $\tilde{y}_i$ are independent of the missing values of the other time series given $\psi$. Therefore we sample each $\tilde{y}_i$ separately from $\pi(\tilde{y}_i|\psi, y_i^*)$. Within a given time series, the missing values would be independent only if the time between the missing values were bigger than the lag $p$. This, however, was not the case for all time series within our panel. Therefore we derived the joint posterior of all missing values $\tilde{y}_i$ for each time series. To this aim we rewrite the general model (2) as



$$y_{it} = \Phi_{it} \begin{pmatrix} y_{i,t-p} \\ \vdots \\ y_{i,t-1} \end{pmatrix} + c_{it} + \varepsilon_{it}, \qquad t = p + 1, \ldots, T, \tag{29}$$

where

$$\Phi_{it} = \left( \beta_{it}^{[p]} \quad \cdots \quad \beta_{it}^{[1]} \right),$$

with $\beta_{it}^{[l]}$ being the parameter belonging to lag $l$, and $c_{it} = X_{it}^* \beta_{it}^*$, with $X_{it}^*$ consisting of those columns of $[X_{it}^1 \; X_{it}^2 \; X_{it}^3 (I_t - 1)]$ which do not contain lagged values of the dependent variable. $\beta_{it}^*$ consists of the corresponding parameters. Equation (29) is equivalent to the following model:

$$0 = B_i y_i + c_i + \varepsilon_i, \tag{30}$$

where $y_i = (y_{i1}, \ldots, y_{iT})'$, $B_i \in$
