Computation of Restricted Maximum-penalized-likelihood Estimates in Hidden Markov Models

Tero Aittokallio, Olli Nevalainen

Department of Mathematical Sciences, University of Turku

Jussi Tolvi

Department of Economics, University of Turku

Kalle Lertola, Esa Uusipaikka

Department of Statistics, University of Turku

Turku Centre for Computer Science
TUCS Technical Report No 380
November 2000
ISBN 952-12-0761-2
ISSN 1239-1891

Abstract

The maximum-penalized-likelihood estimation for hidden Markov models with general observation densities is described. All statistical inference, including the model estimation, testing, and selection, is based on the restricted optimization of the penalized likelihood function with respect to the chosen model family. The method is used in an economic application, where stock market index returns are modeled with hidden Markov models. Special emphasis is placed on modeling isolated outliers in the data, which has usually been ignored in previous research. The chosen model fits the data well, and is capable of modeling the outliers as well as a structural change in the series.

Keywords: hidden Markov model, likelihood inference, parameter estimation, model selection, stock markets, volatility, outliers

TUCS Research Group

Algorithmics and Biomathematical Research Groups

1 Introduction

Hidden Markov models (HMMs), or Markov switching models, are by now a popular class of models in time series econometrics; see, for example, Hamilton (1994) for an introduction. They have been applied in economics in the study of, among several other topics, exchange rates (Engel & Hamilton, 1990), business cycles (Hamilton, 1989) and stock market returns (Rydén, Teräsvirta, & Åsbrink, 1998). Models like these are capable of producing time series realizations with a wide variety of characteristics. Rydén et al. (1998) and Timmermann (2000) provide some illustrations and theoretical results on this.

In a HMM, the probability distribution of an observation is based on the value of an underlying state variable. The underlying state is assumed to follow a Markov chain, which introduces serial dependence into the data. The probability distributions associated with the hidden states are all assumed to be of the same form: only the mean and dispersion parameters are allowed to differ across the states. Moreover, in applications with continuous distributions, the observations are often assumed to be normally distributed. The first idea of the present paper is that we extend the model to accommodate different probability distributions for different states. The theory is general in the sense that any distribution could be used, including explanatory variables (e.g. a regression model) for the observed values. Our second idea is to use generalized Lagrange multipliers to restrict parameters in the model estimation.

We will study the model using daily stock market index returns. The statistical properties of stock market returns have been the subject of numerous research papers. For an introduction, see Campbell, Lo, and MacKinlay (1997), and for examples using hidden Markov models, see Rydén et al. (1998) and Kim, Nelson, and Startz (1998). Our observed data is the daily closing price of the Dow Jones Industrial Average (DJIA) index. We examine the daily returns, that is, the logarithmic differences of the index. The data spans from February 12, 1992 to January 11, 2000; a total of 2000 observations. We wish to test whether our model type is applicable for this heteroskedastic process. Can one expect large fluctuations to have the same size on both the positive and negative sides? The variability of the fluctuations of different signs is also of interest. HMMs can perhaps be seen as a supplementary method to the approach using autoregressive conditionally heteroskedastic (ARCH) models.

Some descriptive statistics for our data, along with the five largest and five smallest observations, and some quantiles for the data are given in Table 1. The main cause of the negative sample skewness of the data can be seen

clearly from the order statistics and the quantiles. The two smallest observations account for a large part of the skewness, and if these two observations are omitted, the sample skewness decreases to less than -0.04. Excess kurtosis is also notable, as is usual for data of this kind. The observed sample kurtosis is also largely due to only a few observations. If the five smallest and five largest observations are omitted, the sample kurtosis decreases to 4.23. A few of the observations are therefore highly influential. In earlier work of this kind, the usual procedure has been to exclude these observations from the data, or to set all observations which are larger than, say, four standard deviations in absolute value equal to (plus or minus) four standard deviations.

  Summary statistics           Quantiles and order statistics
  Mean       0.00063277        a_1 (min)  -0.074549    a_2000 (max)  0.048605
  Median     0.00066348        a_2        -0.065782    a_1999        0.046008
  Variance   0.000076752       a_3        -0.042831    a_1998        0.040647
  StdDev     0.0087608         a_4        -0.034672    a_1997        0.037535
  Skewness   -0.54597          a_5        -0.032234    a_1996        0.033206
  Kurtosis   9.2993            0.005Q     -0.028891    0.995Q        0.025139
                               0.01Q      -0.023584    0.99Q         0.022139

Table 1: Descriptive statistics for the data (T = 2000). a_i is the ith order statistic and αQ = a_{αT} denotes the α-quantile.

Some stylized facts for data sets such as ours are discussed in Rydén et al. (1998). The most important aspects are no linear temporal dependence between observations, but strong dependence in the volatility (conditional variance) of the data. As mentioned earlier, extreme observations have in previous work usually been ignored, but here we will place special emphasis on them. The third idea of the paper is in modeling the outliers explicitly with the HMM.

In our application, the HMMs have varying numbers of states, and also different probability distributions for different states. The states describe different aspects of the data: the first state of our models corresponds to normal times, when the price movements are small. This is modeled with a normal distribution. The second state is for more volatile times, and it is described with a normal distribution with a higher variance. Since these two states do not seem to describe all aspects of the data, we will add more states to the models. In the three and four-state models, the additional states are intended to model the outlier observations. Furthermore, since there seems to be a structural break in our sample, we also model our data with a two-regime, eight-state HMM.
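To make the data handling concrete, the following C sketch shows how the return series and the summary statistics of Table 1 can be computed from a vector of closing prices. The function name and the use of T (rather than T-1) as the divisor are our own choices, not taken from the paper, so the figures may differ marginally from the table.

```c
#include <math.h>
#include <stdlib.h>

/* Convert daily closing prices p[0..n-1] into log returns
 * y[t] = log(p[t+1]) - log(p[t]) and compute the sample mean, variance,
 * skewness and (non-excess) kurtosis reported in Table 1. */
static void return_stats(const double *p, int n,
                         double *mean, double *var,
                         double *skew, double *kurt)
{
    int T = n - 1;
    double *y = malloc(T * sizeof(double));
    double m = 0.0, m2 = 0.0, m3 = 0.0, m4 = 0.0;

    for (int t = 0; t < T; t++) {
        y[t] = log(p[t + 1]) - log(p[t]);   /* logarithmic difference */
        m += y[t];
    }
    m /= T;
    for (int t = 0; t < T; t++) {
        double d = y[t] - m;
        m2 += d * d; m3 += d * d * d; m4 += d * d * d * d;
    }
    m2 /= T; m3 /= T; m4 /= T;            /* central moments with divisor T */

    *mean = m;
    *var  = m2;
    *skew = m3 / pow(m2, 1.5);
    *kurt = m4 / (m2 * m2);
    free(y);
}
```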

2 Model characterization

A HMM is a two-level stochastic process with the following properties. First, the underlying discrete-time process $\{q_t\}_{t=1}^{\infty}$ with a finite state space $S = \{1, 2, \ldots, N\}$ is a Markov chain governed by the $N \times N$ stochastic matrix $A = (a_{ij})_{i,j \in S}$. The transition probabilities from state $i$ to state $j$,

    a_{ij} = P(q_t = j \mid q_{t-1} = i), \quad t = 2, 3, \ldots, \quad i, j \in S,    (1)

are assumed to be time-homogeneous, and at time $t = 1$ the state $q_1$ has the initial distribution $\pi = (\pi_i)_{i \in S}$, where $\pi_i = P(q_1 = i)$. Both $A$ and $\pi$ satisfy the stochastic constraints

    a_{ij} \geq 0, \quad \sum_{j \in S} a_{ij} = 1, \quad \pi_i \geq 0, \quad \sum_{i \in S} \pi_i = 1, \quad i, j \in S.    (2)

In our modeling idea, the states correspond to different periods of the time series as follows. The first two states describe the normal and high variance periods that are well known to occur and alternate in financial data. Outliers in the data are modeled with the help of special outlier states which have a small probability of occurring and the property that they occur as isolated observations only. In other words, two outliers of the same state can not appear consecutively. This is achieved by setting the self-transition probability (1) equal to zero in the outlier states. However, we leave open the possibility that outliers produced by different states may occur one after the other.

Secondly, we define the real-valued observation process $\{Y_t\}_{t=1}^{\infty}$ as a sequence of random variables where $Y_t$ depends on the state sequence through the current state only. The conditional density of $Y_t \mid q_t = i$ is denoted by $p(y_t; \theta_i)$, where the vector $\theta = (\theta_i)_{i \in S}$ contains the parameters of the observation densities. In our application, the probabilistic function $p(y_t; \theta_i)$ of the Markov chain can have two different forms: the normal density

    p(y_t; \theta_i) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left( -\frac{(y_t - \mu_i)^2}{2\sigma_i^2} \right),    (3)

where $\theta_i = (\mu_i, \sigma_i^2)$, or the lognormal density for the observations $y_t > K_i$,

    p(y_t; \theta_i) = \frac{1}{\sqrt{2\pi}\,\sigma_i (y_t - K_i)} \exp\left( -\frac{[\log(y_t - K_i) - \mu_i]^2}{2\sigma_i^2} \right),    (4)

where $\theta_i = (\mu_i, \sigma_i^2)$. The situation $y_t < K_i$ is reduced to the preceding one by replacing $y_t$ by $-y_t$ and $K_i$ by $-K_i$ in (4). The vector $\theta$ is subject to restrictions and testing in our HMMs.
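As an illustration of how the two observation densities (3) and (4) can be evaluated in practice, the following C sketch computes $p(y_t; \theta_i)$ for both forms. The struct layout and function name are our own conventions; in particular, a negative outlier state is handled by the sign change described above.

```c
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Parameters of one observation density: theta_i = (mu_i, sigma_i^2),
 * plus the location K_i and a sign flag for the (shifted) lognormal states. */
typedef struct {
    int    lognormal;   /* 0: normal density (3), 1: lognormal density (4) */
    double mu, sigma2;
    double K;           /* location parameter, used only when lognormal */
    int    negative;    /* 1: state models outliers below K (y_t < K_i)  */
} ObsDensity;

/* Evaluate p(y_t; theta_i) according to equations (3) and (4). */
double obs_density(const ObsDensity *d, double y)
{
    double s = sqrt(d->sigma2);
    if (!d->lognormal) {
        double z = y - d->mu;
        return exp(-z * z / (2.0 * d->sigma2)) / (sqrt(2.0 * M_PI) * s);
    }
    /* For a negative outlier state, y_t < K_i is mapped to the y_t > K_i
     * case by replacing y_t with -y_t and K_i with -K_i. */
    double x = d->negative ? (-y) - (-d->K) : y - d->K;
    if (x <= 0.0)
        return 0.0;   /* outside the support of the shifted lognormal */
    double z = log(x) - d->mu;
    return exp(-z * z / (2.0 * d->sigma2)) / (sqrt(2.0 * M_PI) * s * x);
}
```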

Finally, for any given sequence of $T$ states $s = i_1, i_2, \ldots, i_T$ and observed realizations $y = y_1, y_2, \ldots, y_T$ we assume that the conditional density of $y$ can be decomposed as

    p(y \mid s; \theta) = \prod_{t=1}^{T} p(y_t; \theta_{i_t}).    (5)

Equation (5) states that the observations are conditionally independent given the state sequence. A HMM is identified by the values of the transition matrix $A$, initial distribution $\pi$, and parameter vector $\theta$, that is, the HMM is compactly denoted by the three-tuple $\lambda = (A, \pi, \theta)$. Due to the three assumptions the likelihood of the model $\lambda$, i.e. the joint density of the observed data $y$, is given by

    L_\lambda(y) = p(y; \lambda) = \sum_{s} \pi_{i_1} p(y_1; \theta_{i_1}) \prod_{t=2}^{T} a_{i_{t-1} i_t} p(y_t; \theta_{i_t}),    (6)

where the summation takes into account all possible $N^T$ state sequences $s$. A HMM is characterized by the fact that the state process is not observable. Therefore, one can only make probabilistic predictions on the underlying state, and all the statistical inference about the process must be made through the observation sequence.

3 Likelihood based statistical inference

This section gives a short overview of likelihood theory in the HMM context. A detailed presentation of the basic inference procedures, including the evaluation and estimation problems in standard situations, can be found in the tutorial of Rabiner (1989). We will use the notation of Rabiner, and extend the inference to the case of general observation densities with restricted parameter values. Moreover, hypothesis testing and model selection are briefly discussed. The section does not aim to be a complete theoretical survey, but describes as examples the methods that will be applied to our models.

3.1 Computation of likelihood

Evaluation of the likelihood straightforwardly by equation (6) involves on the order of $T N^T$ operations. Fortunately, we can efficiently compute the likelihood of a HMM without the exponential growth in complexity by using the so-called forward-backward method introduced by Baum, Petrie, Soules, and Weiss (1970). The method starts by decomposing the likelihood by states in $S$ as

    L_\lambda(y) = \sum_{i \in S} \alpha(t, i)\, \beta(t, i)    (7)

for any $t = 1, 2, \ldots, T$. The forward and backward partial likelihoods, $\alpha(t, i) = p(y_1, \ldots, y_t, q_t = i; \lambda)$ and $\beta(t, i) = p(y_{t+1}, \ldots, y_T \mid q_t = i; \lambda)$ respectively, can be computed recursively from

    \alpha(t, i) = \left[ \sum_{j \in S} \alpha(t-1, j)\, a_{ji} \right] p(y_t; \theta_i), \quad t = 2, 3, \ldots, T,    (8)

and

    \beta(t, i) = \sum_{j \in S} a_{ij}\, p(y_{t+1}; \theta_j)\, \beta(t+1, j), \quad t = T-1, \ldots, 1.    (9)

The recursions are initialized with $\alpha(1, i) = \pi_i\, p(y_1; \theta_i)$ and $\beta(T, i) = 1$ for all $i \in S$. Thus, setting $t = T$ in (7) gives

    L_\lambda(y) = \sum_{i=1}^{N} \alpha(T, i).    (10)

From equation (10) it is seen that the recursive computation of the likelihood requires only on the order of $N^2 T$ operations. Moreover, one only needs the forward partial likelihoods to solve the evaluation problem, but the backward partial likelihoods will be used in the estimation problem in the next section. For that purpose, we also define the variables $\gamma(t, i, j) = p(q_t = i, q_{t+1} = j \mid y; \lambda)$ and $\gamma(t, i) = p(q_t = i \mid y; \lambda)$. These can be expressed in terms of the forward and backward likelihoods as

    \gamma(t, i, j) = \frac{\alpha(t, i)\, a_{ij}\, p(y_{t+1}; \theta_j)\, \beta(t+1, j)}{\sum_{i,j \in S} \alpha(t, i)\, a_{ij}\, p(y_{t+1}; \theta_j)\, \beta(t+1, j)}, \quad t = 1, \ldots, T-1,    (11)

and

    \gamma(t, i) = \sum_{j \in S} \gamma(t, i, j) = \frac{\alpha(t, i)\, \beta(t, i)}{\sum_{i \in S} \alpha(t, i)\, \beta(t, i)}, \quad t = 1, 2, \ldots, T,    (12)

for all $i, j \in S$.
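For concreteness, the following C sketch implements the forward recursion (8) together with (10). Because $\alpha(t, i)$ underflows for series as long as ours, the sketch already applies the per-step scaling discussed in Section 3.5 and therefore returns $\log L_\lambda(y)$. The HMM struct, the callback for $p(y_t; \theta_i)$, and the assumption that the observation sequence is held inside the density parameters are our own conventions, not those of the authors' C implementation.

```c
#include <math.h>
#include <stdlib.h>

/* A minimal HMM: N states, transition matrix a (row-major, a[i*N+j] = a_ij),
 * initial distribution pi, and a user-supplied observation density
 * p(y_t; theta_i) evaluated by obs(t, i, theta); the observation sequence
 * is assumed to be accessible through theta. */
typedef double (*ObsFn)(int t, int i, const void *theta);

typedef struct {
    int N;
    const double *a;    /* N*N transition probabilities */
    const double *pi;   /* N initial probabilities      */
    const void *theta;  /* opaque density parameters    */
    ObsFn obs;          /* p(y_t; theta_i)              */
} HMM;

/* Scaled forward recursion: returns log L_lambda(y) for y_1,...,y_T. */
double hmm_loglik(const HMM *m, int T)
{
    int N = m->N;
    double *alpha = malloc(N * sizeof(double));
    double *next  = malloc(N * sizeof(double));
    double loglik = 0.0, c;

    /* Initialization: alpha(1, i) = pi_i p(y_1; theta_i). */
    c = 0.0;
    for (int i = 0; i < N; i++) {
        alpha[i] = m->pi[i] * m->obs(0, i, m->theta);
        c += alpha[i];
    }
    loglik += log(c);
    for (int i = 0; i < N; i++) alpha[i] /= c;    /* scale by sum_i alpha(t, i) */

    /* Induction (8): alpha(t, i) = [sum_j alpha(t-1, j) a_ji] p(y_t; theta_i). */
    for (int t = 1; t < T; t++) {
        c = 0.0;
        for (int i = 0; i < N; i++) {
            double s = 0.0;
            for (int j = 0; j < N; j++)
                s += alpha[j] * m->a[j * N + i];
            next[i] = s * m->obs(t, i, m->theta);
            c += next[i];
        }
        loglik += log(c);                          /* accumulate log scale factor */
        for (int i = 0; i < N; i++) alpha[i] = next[i] / c;
    }
    free(alpha);
    free(next);
    return loglik;   /* by (10), log L equals the sum of the log scale factors */
}
```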

3.2 Restricted maximum-likelihood estimation

Estimation of the parameters of model $\lambda$ for given observations $y$ can be considered as an optimization problem. Several optimization criteria have been proposed, for example the segmental $K$-means algorithm of Juang and Rabiner (1990), but we focus on maximizing the log-likelihood function $\log L_\lambda(y)$ with respect to $\lambda$. In order to obtain the maximum-likelihood estimate (MLE) of $\lambda$, we use a simple iterative algorithm in which the model parameters are improved at each iteration in the sense of increasing log-likelihood. The procedure is analogous to the Baum-Welch algorithm (Baum et al., 1970) and is based on the same ideas as the EM algorithm of Dempster, Laird, and Rubin (1977). The idea of the EM algorithm is to maximize the expectation of the log-likelihood with respect to a new model $\bar{M}$ given the unobserved data distributed according to the old model $M$. Wu (1983) showed that each step $M \mapsto \bar{M}$ guarantees an increase in likelihood. In the HMM context, the expectation step consists of determining the auxiliary function $Q(\lambda, \bar{\lambda})$, which can be decomposed both by states and by the parameters $\bar{\pi}$, $\bar{A}$, and $\bar{\theta}$ of $\bar{\lambda}$ in the form

    Q(\lambda, \bar{\lambda}) = \sum_{i=1}^{N} \gamma(1, i) \log \bar{\pi}_i + \sum_{i,j=1}^{N} \sum_{t=1}^{T-1} \gamma(t, i, j) \log \bar{a}_{ij} + \sum_{i=1}^{N} \sum_{t=1}^{T} \gamma(t, i) \log p(y_t; \bar{\theta}_i).    (13)

Therefore, the maximization step of $Q(\lambda, \bar{\lambda})$ can be performed separately for the parameters $\bar{\pi}_i$, $\bar{a}_{ij}$, and $\bar{\theta}_i$ of the new model $\bar{\lambda}$. The weights $\gamma(t, i, j)$ and $\gamma(t, i)$ in (13) are computed from the old model $\lambda$ using the equations (11) and (12), respectively. Rabiner (1989) handled the stochastic constraints (2) by means of Lagrange multipliers. However, we also introduce additional constraints on the model parameters, including their upper and lower bounds, and parameter ties between states. Therefore, we apply generalized Lagrange multipliers, the so-called Karush-Kuhn-Tucker (KKT) multipliers $u$.

Let us first consider the re-estimation of the mean parameter $\bar{\mu}_i$ of a Gaussian observation density (3) subject to the constraint $M_{i,1} \leq \bar{\mu}_i \leq M_{i,2}$. We recall that $\bar{\lambda}$ is a KKT point of $Q(\lambda, \bar{\lambda})$ if

    \sum_{t=1}^{T} \gamma(t, i)(\bar{\mu}_i - y_t)/\bar{\sigma}_i^2 - u_1 + u_2 = 0,    (14)

and

    u_1(M_{i,1} - \bar{\mu}_i) = 0, \quad u_2(\bar{\mu}_i - M_{i,2}) = 0    (15)

for some pair $u_1 \geq 0$, $u_2 \geq 0$. The original Baum-Welch formula is obtained from equation (14) when $u_1 = u_2 = 0$, i.e.

    \tilde{\mu}_i = \frac{\sum_{t=1}^{T} \gamma(t, i)\, y_t}{\sum_{t=1}^{T} \gamma(t, i)}.    (16)

Notice that the above equation is simply a weighted MLE of $\mu_i$ with respect to the weights $\gamma(t, i)$. The KKT optimality conditions (15) give the following simple algorithm for the re-estimation of $\bar{\mu}_i$: choose $\bar{\mu}_i = M_{i,j}$ if $\tilde{\mu}_i$ violates the constraint corresponding to $M_{i,j}$ (i.e. $u_j > 0$) for $j = 1, 2$, and choose $\bar{\mu}_i = \tilde{\mu}_i$ if both constraints are satisfied by $\tilde{\mu}_i$. Conditions analogous to (14) and (15) can be derived from the auxiliary function (13) for the other parameters of the model as well. For example, the unconstrained formula for the variance of a Gaussian density is

    \tilde{\sigma}_i^2 = \frac{\sum_{t=1}^{T} \gamma(t, i)(y_t - \tilde{\mu}_i)^2}{\sum_{t=1}^{T} \gamma(t, i)},    (17)

and the selection of $\bar{\sigma}_i^2$ proceeds as above, depending on the constraints placed on it.

As a second case we consider the estimation of the parameters of a lognormal density (4). The location parameter $K_i$ must be fixed a priori, since the MLE of the parameter would be $\min\{y_t \mid \gamma(t, i) > 0,\ t = 1, \ldots, T\}$, which yields a singularity, i.e. $L_\lambda(y) = \infty$; cf. Johnson and Kotz (1970) in the context of the unweighted MLE. With known $K_i$ the estimation of the parameters of a lognormal distribution can be reduced to that of a normal distribution. If the observation $Y_t$ has a lognormal distribution, then the random variable

    Z_t = \log(Y_t - K_i)    (18)

is normally distributed. Therefore, the MLEs of $\mu_i$ and $\sigma_i^2$ are obtained by using the formulae (16) and (17), respectively, with $y_t$ transformed to $z_t$ via the function (18). The observations $y_t \leq K_i$ have no effect on the estimates since they produce weights $\gamma(t, i) = 0$. The situation where $y_t < K_i$ can be handled by replacing $Y_t - K_i$ by $K_i - Y_t$ in the transformation (18), and the constraints as in the case of a normal distribution.

For a general observation density the transition matrix $A$ and the initial state distribution $\pi$ have re-estimation formulae of the form

    \tilde{a}_{ij} = \frac{\sum_{t=1}^{T-1} \gamma(t, i, j)}{\sum_{t=1}^{T-1} \gamma(t, i)},    (19)

and $\tilde{\pi}_i = \gamma(1, i)$ for all $i, j \in S$. Note that in the selection of $\bar{a}_{ij}$ and $\bar{\pi}_i$ the parameters that violate constraints are again corrected, but also the other parameters are scaled in order to fulfill the stochastic constraints (2).

In summary, starting from an initial model, the new model $\bar{\lambda}$ is used in place of the old model $\lambda$ and the expectation and maximization steps are repeated. At each iteration, the re-estimated model $\bar{\lambda}$ satisfies all the constraints and is a KKT point of $Q(\lambda, \bar{\lambda})$. Moreover, $\log L_{\bar{\lambda}}(y) \geq \log L_\lambda(y)$, with equality if $\lambda$ is a KKT point of $\log L_\lambda(y)$, in which case also $\bar{\lambda} = \lambda$. Therefore, a fixed point $\hat{\lambda}$ produced by the re-estimation algorithm originates from a KKT point (a local maximum or possibly a saddle point) of the log-likelihood surface.
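A minimal C sketch of the constrained Gaussian update described above is given below: it evaluates the unconstrained weighted MLEs (16) and (17) and then applies the KKT box constraints of (14)-(15) by clipping to the violated bound. The names and the Bounds struct are illustrative only.

```c
/* Constrained Baum-Welch update for the mean and variance of a Gaussian
 * state i, equations (16)-(17), with the KKT box constraints of (14)-(15):
 * if the unconstrained weighted MLE violates a bound, the estimate is set
 * to that bound. gamma[t] holds gamma(t, i) from equation (12). */
typedef struct { double lo, hi; } Bounds;

static double clip(double x, Bounds b)
{
    if (x < b.lo) return b.lo;   /* u_1 > 0: lower bound active */
    if (x > b.hi) return b.hi;   /* u_2 > 0: upper bound active */
    return x;
}

void reestimate_gaussian(const double *y, const double *gamma, int T,
                         Bounds mu_bounds, Bounds var_bounds,
                         double *mu_new, double *var_new)
{
    double w = 0.0, wy = 0.0;
    for (int t = 0; t < T; t++) { w += gamma[t]; wy += gamma[t] * y[t]; }

    double mu_tilde = wy / w;                      /* equation (16) */
    *mu_new = clip(mu_tilde, mu_bounds);

    double wss = 0.0;
    for (int t = 0; t < T; t++) {
        double d = y[t] - mu_tilde;
        wss += gamma[t] * d * d;
    }
    double var_tilde = wss / w;                    /* equation (17) */
    *var_new = clip(var_tilde, var_bounds);        /* e.g. lower bound eps_D */
}
```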

3.3 Hypothesis testing

Testing hypotheses concerning the parameters of a HMM is divided into two parts. The inference procedures for the model selection are discussed in the next section, but till then we will assume that the model has been specified correctly. Therefore, the standard asymptotic theory of the likelihood ratio test (LRT) (Cox & Hinkley, 1974) can be used, and the hypothesis testing reduces to determining the MLE of the model under the null hypothesis. In this paper, we will not be concerned with the problem of parameters lying on the boundary of the parameter space. This problem has been discussed in a general context by Self and Liang (1987). Under mild regularity conditions, the MLE $\hat{\lambda}$ of a HMM has been proven to be asymptotically normally distributed by Bickel, Ritov, and Rydén (1998). Given the asymptotic normality of the estimator, the likelihood ratio (LR) statistic has the customary chi-squared distribution $\chi^2$ under the null hypothesis. Formally, let $\lambda_1$ and $\lambda_2$ be two nested models with parameter spaces $\Omega_1$ and $\Omega_2$, respectively, and $\Omega_1 \subset \Omega_2$. Then under the null hypothesis that the true parameter vector is an interior point of $\Omega_1$, we have asymptotically

    LR(\lambda_1, \lambda_2) = 2\,[\log L_{\hat{\lambda}_2}(y) - \log L_{\hat{\lambda}_1}(y)] \sim \chi^2_r,    (20)

where $r = \dim \Omega_2 - \dim \Omega_1$ is the decrease in degrees of freedom. Again, we give only the explicit formulae for the LRTs that will be used in this paper; Michalek and Timmer (1999) have presented the likelihood equations for a general parameterization of a HMM. For instance, consider the null hypothesis $\mu_i = \mu_j$ for some pair $i, j \in S$. One can derive from the auxiliary function (13) the KKT conditions similar to (14) and (15) for the re-estimate $\bar{\mu}_i = \bar{\mu}_j$. It is easy to show that

    \tilde{\mu}_i = \tilde{\mu}_j = \frac{\sum_{t=1}^{T} [\gamma(t, i)/\tilde{\sigma}_i^2 + \gamma(t, j)/\tilde{\sigma}_j^2]\, y_t}{\sum_{t=1}^{T} [\gamma(t, i)/\tilde{\sigma}_i^2 + \gamma(t, j)/\tilde{\sigma}_j^2]}    (21)

is the unconstrained formula corresponding to (16). The upper and lower bounds can be considered similarly. To test the simple hypothesis $\sigma_i^2 = \sigma_j^2$ for some pair $i, j \in S$, one needs the formula

    \tilde{\sigma}_i^2 = \tilde{\sigma}_j^2 = \frac{\sum_{t=1}^{T} [\gamma(t, i)(y_t - \tilde{\mu}_i)^2 + \gamma(t, j)(y_t - \tilde{\mu}_j)^2]}{\sum_{t=1}^{T} [\gamma(t, i) + \gamma(t, j)]}.    (22)

In these examples, the chi-squared distribution in (20) has $r = 1$ degrees of freedom.
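As a small usage sketch, the LR statistic (20) can be computed directly from the two maximized log-likelihoods; for the single-restriction tests used later (r = 1) it is compared with the 5% critical value 3.841 of the chi-squared distribution with one degree of freedom. The numbers in main are placeholders, not values from the paper.

```c
#include <stdio.h>

/* Likelihood ratio statistic (20) for two nested fitted models, given their
 * maximized log-likelihoods. */
double lr_statistic(double loglik_restricted, double loglik_full)
{
    return 2.0 * (loglik_full - loglik_restricted);
}

int main(void)
{
    /* Illustrative placeholder values only. */
    double ll_full = 1234.56, ll_restricted = 1232.65;
    double lr = lr_statistic(ll_restricted, ll_full);
    double crit_chi2_1_5pct = 3.841;   /* 5% critical value, r = 1 */

    printf("LR = %.3f, %s at the 5%% level\n", lr,
           lr > crit_chi2_1_5pct ? "reject H0" : "do not reject H0");
    return 0;
}
```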

3.4 Model selection

So far, the number of states of the Markov process as well as the form of the observation densities in each state was assumed to be known. Model selection methods attempt to determine them by techniques similar to the LRT. However, hypotheses concerning the size of the model or the form of the densities cannot be tested using the conventional LRT since the regularity conditions fail to hold. For instance, assume that one tries to fit an $N$-state model when the true process has $N - 1$ states. Then, under the null hypothesis, the parameters that describe the $N$th state are unidentified, and hence the information matrix is singular. To overcome the identification problem, several alternatives to the LRT have been proposed, including tests that bound the exact LR (Hansen, 1992, 1996) or approximate it by parametric bootstrap (Rydén et al., 1998). Also, the use of penalized likelihood methods (Leroux & Puterman, 1992; Rydén, 1995), the Lagrange multiplier principle (Hamilton, 1996), Bayesian inference (Robert, Rydén, & Titterington, 2000), and the covariance function of a HMM (Zhang & Stine, 2000) have been suggested for estimating the order of the Markov process. Penalized likelihood criteria, such as the Akaike (1974) information criterion (AIC) and the Bayesian information criterion (BIC) of Schwarz (1978), seem to be the easiest ones to use in practice after the computation of the likelihood function. We apply the AIC in determining both the number of states and the observation densities of the model $\lambda$ with parameter space $\Omega$ as follows. First, we compute the MLE $\hat{\lambda}$ and its log-likelihood value $\log L_{\hat{\lambda}}(y)$, and evaluate the criterion

    AIC(\lambda) = \log L_{\hat{\lambda}}(y) - \dim \Omega.    (23)

The model with the largest AIC value is then chosen. In order to get the proper number of states, we compute the criterion (23) for HMMs with an increasing number of states $N$. The test enables heuristic determination of the smallest number of states that can describe the observations $y$. Naturally, we can not test all the possible density functions, but at least one can make an intelligent guess among the candidates under consideration.

Building a time series model is an iterative procedure. After parameter estimation one has to assess model adequacy by checking whether the assumptions of the model are satisfied. HMMs have been evaluated, for example, by testing the Markov condition (Timmer & Klein, 1997) or the spectral properties (Rydén et al., 1998) of the process. Evaluation of most time series models, however, can be accomplished through analysis of their residuals with respect to the model assumptions. Although residuals of a HMM are not readily constructable, we will investigate this approach in the following. We define the standardized residuals of the model $\lambda$ as

    r(t) = \frac{y_t - E(Y_t \mid q_t = i_t^*; \lambda)}{\sqrt{Var(Y_t \mid q_t = i_t^*; \lambda)}}, \quad t = 1, 2, \ldots, T.    (24)

Before computation of the residuals, one needs to predict a state sequence $i_1^*, \ldots, i_T^*$ corresponding to the model $\lambda$ (and observations $y$). This so-called decoding problem is usually solved using the Viterbi algorithm (Rabiner, 1989), which produces the most likely state sequence. Here, for simplicity, we take another approach and choose the states $i_t^*$ which are individually most likely. Using the weights (12), we can solve the individually most likely state at time $t$ as

    i_t^* = \arg\max_{i \in S} \hat{\gamma}(t, i).    (25)

The sequence $i_1^*, \ldots, i_T^*$ maximizes the expected number of correct states. In the case that the observation density of state $i_t^*$ is normal, i.e. $\theta_{i_t^*} = (\mu_{i_t^*}, \sigma_{i_t^*}^2)$, the residual (24) takes the form $r(t) = (y_t - \hat{\mu}_{i_t^*})/\hat{\sigma}_{i_t^*}$. If the density of state $i_t^*$ is lognormal, we use again the transformation (18), and get $r(t) = [\log(y_t - K_{i_t^*}) - \hat{\mu}_{i_t^*}]/\hat{\sigma}_{i_t^*}$ for the observations $y_t > K_{i_t^*}$. Heuristically we can state that if the model is adequate, the residual process $\{r(t)\}_{t=1}^{T}$ should be white noise. In addition to the sample skewness and kurtosis of these residuals, we compute a normality test by Doornik and Hansen (1994), which has, under the null hypothesis of normality, an approximate $\chi^2$ distribution with two degrees of freedom. To check the covariance structure of the residuals, we also calculate the ARCH test (Engle, 1982) with 5 lags. This test has a $\chi^2$ distribution with five degrees of freedom. Although we lack a theoretical justification for the residuals (24), they can at least be used to detect model misspecification.
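The decoding rule (25) and the residuals (24) translate into a few lines of C. The sketch below assumes, for brevity, that all states have normal observation densities; the lognormal outlier states would additionally apply transformation (18). The array layout and function name are ours.

```c
/* Decoding by individually most likely states (25) and standardized
 * residuals (24) for states with normal observation densities. gamma is a
 * T x N array of smoothed probabilities gamma(t, i) from equation (12);
 * mu[i] and sigma[i] are the estimated state means and standard deviations. */
void hmm_residuals(const double *y, const double *gamma, int T, int N,
                   const double *mu, const double *sigma,
                   int *state, double *r)
{
    for (int t = 0; t < T; t++) {
        int best = 0;
        for (int i = 1; i < N; i++)
            if (gamma[t * N + i] > gamma[t * N + best])
                best = i;                         /* i_t^* = argmax_i gamma(t, i) */
        state[t] = best;
        r[t] = (y[t] - mu[best]) / sigma[best];   /* equation (24) */
    }
}
```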

3.5 Implementation issues

While the previous sections primarily considered theoretical aspects of HMMs, this section focuses briefly on practical implementation issues, including scaling, initial parameter values, global optimization, the unbounded likelihood function, and model identification problems.

In theory, the forward-backward method (7)-(10) provides the formulae required to compute the likelihood of a HMM $\lambda$. In practice, however, the method is numerically unstable because the forward partial likelihood $\alpha(t, i)$ converges to zero or diverges to infinity as $t$ increases. Therefore, for sufficiently large $T$ it is impossible to compute $\alpha(T, i)$ using any machine precision. The same is true for the backward partial likelihood $\beta(t, i)$ as well. However, by scaling both $\alpha(t, i)$ and $\beta(t, i)$ by the sum $\sum_{i \in S} \alpha(t, i)$, their dynamic range can be kept near to one. Moreover, the re-estimation formulae remain exactly the same provided that the weights (11) and (12) are computed using the scaled partial likelihoods. The only real change is that only $\log L_\lambda(y)$ can be computed, but not $L_\lambda(y)$, since it would be out of the dynamic range of any computer in any case.

Under mild regularity conditions, Leroux (1992) proved that the MLE is asymptotically a consistent estimator of the parameters of a HMM. Again, in practice, the likelihood surface of a HMM is generally multimodal, and therefore the iterative re-estimation algorithm may converge towards a local maximum or even a saddle point. The fixed point which the algorithm converges to depends on the initial values of the model parameters. We use the re-estimation algorithm with several, say $n$, starting points and then choose the fixed point $\hat{\lambda}$ with the largest log-likelihood value, that is, $\hat{\lambda} = \arg\max\{\log L_{\hat{\lambda}_1}(y), \ldots, \log L_{\hat{\lambda}_n}(y)\}$. This simple global optimization technique is known as the Multistart method (Törn & Žilinskas, 1989). To evaluate as many fixed points as possible, and hence increase the chance of finding the global maximum, one needs a good coverage of the parameter space. Here, we consider random starting values subject to the stochastic constraints (2) and other constraints. Uniform random values of $\pi$ and of the rows of $A$ are obtained by sampling from a Dirichlet distribution with unity parameters, that is, we get equal density for any vector $\pi$ (or row of $A$) satisfying $\sum_{i=1}^{k} \pi_i = 1$, where $k$ is the number of nonzero components. This type of Dirichlet random vector can be generated by sampling first $k$ values from an exponential distribution with mean 2, and then scaling these values using the sum of the generated numbers; a sketch of this scheme is given below. Random values of $\theta_i$ are obtained by sampling uniformly over the segment of the real line determined by the constraints on $\theta_i$. Uniform and exponential random numbers, in turn, are computed using the Tausworthe random number generator of Tezuka and L'Ecuyer (1991).
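The following C sketch illustrates the generation of one such starting row of A (or the vector pi): the exponential variates are drawn and normalized by their sum, and entries fixed to zero a priori are skipped so that they remain zero during re-estimation. The standard library rand() is used here merely as a stand-in for the Tausworthe generator mentioned above.

```c
#include <math.h>
#include <stdlib.h>

/* Draw one row of A (or the vector pi) uniformly from the probability
 * simplex, i.e. from a Dirichlet distribution with unity parameters, by
 * sampling exponential variates and normalizing by their sum. Entries
 * flagged in fixed_zero (e.g. self-transitions of outlier states) are
 * skipped, so they stay zero throughout re-estimation. */
void random_simplex_row(double *row, const int *fixed_zero, int N)
{
    double sum = 0.0;
    for (int i = 0; i < N; i++) {
        if (fixed_zero[i]) { row[i] = 0.0; continue; }
        double u = (rand() + 1.0) / (RAND_MAX + 2.0);  /* u in (0, 1)        */
        row[i] = -2.0 * log(u);                        /* exponential, mean 2 */
        sum += row[i];
    }
    for (int i = 0; i < N; i++)
        row[i] /= sum;                                 /* row now sums to one */
}
```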

Simulation studies of Rabiner, Juang, Levinson, and Sondhi (1985) have shown that HMMs are most sensitive to the initial values of the parameters $\theta$ of the observation densities, whereas the initial estimates of $A$ and $\pi$ have only a minor effect on the fixed points.

In HMMs, the likelihood function may in some situations be unbounded. In particular, for Gaussian observation densities the log-likelihood is trivially and globally maximized by letting one of the densities be degenerate at one of the observations. The re-estimation algorithm may or may not produce fixed points that converge to one of these trivial solutions, depending on the initial values. To avoid overflow, we introduce constraints on the dispersion parameters to ensure that they do not fall below a prespecified level $\varepsilon_D$. Nádas (1983) proposed another way of handling the problem of singularities of the likelihood function by applying certain additional a priori information, such as assuming that every density is sampled at least twice or not allowing the means to be equal to any observation.

Unfortunately, even in the case of a bounded likelihood and infinitely many observations, the parameters of a HMM are not strictly identifiable. For instance, the states in $S$ can be renamed in $N!$ different ways without changing the statistical properties of the observations. In fact, the consistency of the MLE was stated by Leroux (1992) in terms of convergence of the equivalence class of the MLE only, where the equivalence relation includes, for instance, the permutation of the states. Therefore, we consider two fixed points, say $\hat{\lambda}_1$ and $\hat{\lambda}_2$, to be identical if $\min_i \|\hat{\lambda}_1 - \hat{\lambda}_2(i)\| < \varepsilon_I$, where $i$ scans over all possible permutations of the states of $\hat{\lambda}_2$, and $\|\cdot\|$ is the Euclidean distance. One can decrease the number of possible permutations by fixing certain parameters of a model before the estimation. In addition to different permutations of the states, there may be many parameter sets each of which defines the same distribution for the observations. To overcome the identifiability problems, more sophisticated probabilistic distance measures, such as the distance suggested by Juang and Rabiner (1985), are required that can compare different types of models as well.

The above global re-estimation algorithm was implemented in C. The computation of a fixed point was terminated as soon as an iteration step $\lambda \mapsto \bar{\lambda}$ was met where the relative change of the log-likelihood and of each parameter value was less than a given threshold level $\varepsilon_S$. In summary, given the specification of the model (i.e. the topology of the transition matrix, the number of states, and the form of the observation densities), we start the algorithm from $n$ initial points, and then arrange the resulting different (with regard to $\varepsilon_I$) fixed points in decreasing order with respect to their log-likelihood values. Finally, the fixed point $\hat{\lambda}$ with the largest value among the non-degenerate (with regard to $\varepsilon_D$) fixed points is chosen.

4 Results

4.1 Model estimation

We turn now to our empirical application, and start with a description of the models to be estimated. Models $\lambda_1$, $\lambda_2$ and $\lambda_3$ are one, two and three-state HMMs where a normal distribution is chosen for all states. The main restriction in these models is that we set $a_{33} = 0$ in $\lambda_3$, meaning that the process is not allowed to stay in the outlier state for longer than one period at a time. The same restrictions ($a_{33} = a_{44} = 0$) are made for the outlier states in our four-state models $\lambda_4$ and $\lambda_5$. In model $\lambda_4$ all of the states still have a normal distribution, but now the means $\mu_i$ of each state are also restricted, such that they must be between -0.01 and 0.01 for the first two states, and smaller than -0.01 and larger than 0.01 for the third and fourth state, respectively. Finally, in model $\lambda_5$ the last two states have a lognormal distribution with restricted location parameters, namely $K_3 = -0.01$ and $K_4 = 0.01$.

Besides the restrictions of the outlier states described above, the observation densities have the following constraints. In state $i$ with a normal density, the mean must be in the range of the observations, that is, $-0.0746 \leq \mu_i \leq 0.0487$ (see Table 1). On the other hand, transformation (18) gives the corresponding bounds for a lognormal state $i$ as $-9.84 \leq \mu_i \leq -3.25$ for $y_t > 0.01$ and $-10.7 \leq \mu_i \leq -2.74$ for $y_t < 0.01$. However, none of these constraints nor the upper bounds of the dispersion parameters, $\sigma_i^2 \leq 10^2$ for both the normal and the lognormal states, $i \in S$, were active during the estimation process. Whereas the above restrictions were applied only in the random selection of starting points, a lower bound for the dispersion parameters was introduced to prevent overflows while evaluating the log-likelihood function. The double precision value $\varepsilon_D = 10^{-200}$ was used as the lower bound in each state regardless of the form of the observation density. The restricted MLEs for the parameters $\theta$ of the observation densities were computed using the formulae of Section 3. The zero constraints for the transition probabilities can be handled more easily: the equality $a_{ii} = 0$ can be guaranteed during re-estimation by introducing the constraint into the starting values. This is due to the fact that any parameter in $A$ (or $\pi$) set to zero initially will remain at zero throughout the re-estimation procedure.

Table 2 presents some information on the estimation of the models $\lambda_1$ to $\lambda_5$. One can assume that the more complicated the model, the more difficult the estimation becomes. This seems to hold especially for the models $\lambda_3$ and $\lambda_4$, where several local maxima (or possibly saddle points) are identified, many iterations are required, and the frequency of finding the largest local maximum is quite small.

  Model  Model  Fixed   Deg.    No. of   Freq. of    Log-
  id.    dim.   points  points  iterat.  best point  likelihood
  1      2      1       0       2        1000        6636.56
  2      7      6       2       139      954         6819.18
  3      13     30      7       591      245         6847.91
  4      21     40      6       512      501         6863.67
  5      21     18      1       253      790         6862.54

Table 2: Results of model estimation using Multistart optimization (n = 1000 starting points). The columns are the total number of parameters in the three-tuple $\lambda = (A, \pi, \theta)$, the number of different fixed points identified during the estimation ($\varepsilon_I = 10^{-3}$), the number of times that the algorithm converged to a degenerate point ($\varepsilon_D = 10^{-200}$), the average number of iterations required to satisfy the stopping criterion ($\varepsilon_S = 10^{-6}$), the number of starting points that converged to the best fixed point $\hat{\lambda}$, and the log-likelihood $\log L_{\hat{\lambda}}$ of the best fixed point.

On the other hand, model $\lambda_5$ was somewhat easier to estimate. The three-state model $\lambda_3$ had the smallest probability of finding the candidate $\hat{\lambda}$ of the global maximum when the estimation algorithm was started at random models, and it also required the largest number of iterations for the algorithm to converge. Notice, however, that the time complexity of model estimation depends on the product of the model dimension and the number of iterations. This product is the largest for the four-state model $\lambda_4$. Moreover, the likelihood surface of $\lambda_4$ seemed to be the most complex with respect to the number of fixed points.

The estimates for the mean and dispersion parameters for models $\lambda_2$ and $\lambda_4$ are given in Table 3. Note that we do not present detailed information for all of the models. Model $\lambda_2$ is in a sense the benchmark (two-state) model, and model $\lambda_4$ is our preferred model (see the next section). The estimated transition probabilities are shown in Figure 1. The figure also gives a visual description of the model topology. Solid arrows indicate potential transitions (i.e. those that are not restricted to have a probability equal to zero a priori), and dashed arrows indicate the transitions for which the probabilities were estimated to be equal to zero. Figure 2 plots the conditional observation densities for the models $\lambda_2$ and $\lambda_4$. As can be seen, the outlier states appear clearly isolated from the first two states in model $\lambda_4$. Figure 3 shows the data, the estimated probabilities of each state, and the estimated residuals, for models $\lambda_2$ and $\lambda_4$. From the top part of the figure we can see how the two-state model separates the series into the two states.

  Model  Param.             State 1    State 2    State 3    State 4
  2      $\hat{\mu}_i$      0.00075    0.00044
         $\hat{\sigma}_i$   0.0060     0.012
  4      $\hat{\mu}_i$      0.00078    0.00045    -0.070     0.043
         $\hat{\sigma}_i$   0.0056     0.011      0.0044     0.0047

Table 3: Estimated parameters of the observation densities.

Figure 1: Model topology and the estimated transition probabilities for models $\lambda_2$ (left graph) and $\lambda_4$ (right graph). Solid edges indicate positive transition probabilities, dashed edges indicate transition probabilities estimated to zero, and no edge indicates a transition set to zero initially.

Figure 2: Estimated observation densities for models $\lambda_2$ and $\lambda_4$.

Figure 3: Top panel: daily DJIA return series $\{y_t\}_{t=1}^{2000}$. Second panel: probability of the market being in the first state, $\{\gamma_{\lambda_2}(t, 1)\}_{t=1}^{2000}$, computed from the model $\lambda_2$ by equation (12). Third panel: probability for the second state. Fourth panel: individually most likely state sequence $\{i_t^*\}_{t=1}^{2000}$ defined in (25). Fifth panel: standardized residual process $\{r_{\lambda_2}(t)\}_{t=1}^{2000}$ computed from equation (24). Panels of the next page: the same series for the model $\lambda_4$.

The process seems to spend most of the time up to about observation 1200 in the first state, and after that point in the second state, although there are also brief switches to the other state. The residual series of this model contains large observations. For model $\lambda_4$ we can first note that there are more switches between the first two states in the first part of our sample. In the last roughly 800 observations the process is mostly in the second state, apart from the isolated visits to the outlier states. As intended, states three and four model the extreme observations, which are hence excluded from the residual series. We find some common features in all of the estimated models.


For instance, the diagonal elements of the transition matrices are very close to one, unless restricted to be equal to zero, as in the outlier states. This means that the process tends to switch from one state to another only infrequently. The means of the states are, apart from the outlier states, always close to zero. Also, the models are clearly able to discriminate between periods of normal volatility (small variance) and high volatility (larger variance). The transitions to the outlier states occur more frequently from the second state (the more volatile one). Similarly, from the outlier states the transition is either to the other outlier state or back to the more volatile of the two non-outlier states. In model $\lambda_4$, for example, the process only enters the outlier states from state two, and always leaves the negative outlier state to the positive outlier state, and similarly it always leaves the positive outlier state to state two (see Figure 1).

  Model  AIC      Skewness  Kurtosis  Normality       ARCH
  1      6634.56  -0.55     9.29      918.58 [0.00]   42.53 [0.00]
  2      6812.18  -0.10     4.91      192.44 [0.00]   88.02 [0.00]
  3      6834.91  -0.064    3.76      40.82 [0.00]    30.05 [0.00]
  4      6842.67  -0.14     3.41      16.76 [0.00]    0.96 [0.44]
  5      6841.54  -0.078    3.15      3.69 [0.16]     1.03 [0.40]

Table 4: Model selection and residual diagnostics. The normality and ARCH test statistics have their p values in brackets.

We find that the outlier states in models $\lambda_4$ and $\lambda_5$ do indeed have estimated means away from the overall mean of the series. In addition, the estimated variances of the outlier states are relatively small, usually smaller than the variance of the first state. This can be interpreted as an indication of the presence of a small number of observations which clearly do not belong to the same process as the majority of the observations.

4.2 Model evaluation and hypothesis testing

We turn next to the questions of model selection and evaluation. Table 4 gives the values of the AIC criterion, sample skewness and kurtosis measures for the residuals, and results for the diagnostic tests on the residuals, for models $\lambda_1$ to $\lambda_5$. As can be seen, the normality and ARCH tests indicate serious deficiencies in the first three models, whereas the introduction of a fourth state into the model reduces the ARCH test statistics to clearly insignificant values. The normality test still rejects the null hypothesis of normality in the four-state model with normally distributed outlier states. On the other hand, for model $\lambda_5$ these diagnostics do not indicate any departures from i.i.d. normality. Based on these results, the first four models ($\lambda_1$ to $\lambda_4$) are not adequate, whereas $\lambda_5$ is. By considering the AIC values in (23), we note that among these models the largest value is found for model $\lambda_4$ (the four-state model with normally distributed outlier states), followed closely by model $\lambda_5$ (the four-state model with lognormal outlier states). As an aside, we note that if the more stringent (with respect to additional parameters) BIC criterion were used, the three-state model $\lambda_3$ would have been the favored model. This is due to the fact that in a HMM the number of parameters increases considerably with each additional state (see Table 2). We will take model $\lambda_4$ as our favored model based on the AIC values, despite the fact that the normality test still rejects the null hypothesis for this model.

We also did some hypothesis testing on our preferred model $\lambda_4$. The first hypothesis tests the equality of the absolute values of the means of the outlier states, that is, the null hypothesis is that $\mu_3 = -\mu_4$. The LR test statistic from equation (20) is 3.81, which is nearly, but not quite, significant at the 5% level (with a p value of 0.051 when $r = 1$ is used). We can not therefore convincingly reject this null hypothesis. The next test is that of equality of the variances of the two outlier states, or $\sigma_3^2 = \sigma_4^2$. For this test we get a test statistic of 0.0091 ($p = 0.924$, $r = 1$), and can not reject the hypothesis. Together these two tests indicate that the groups of positive and negative outliers are quite similar, with possibly somewhat different absolute values for their means. The two other tests concern the means of the first two states, with null hypotheses of $\mu_1 = \mu_2$ and $\mu_1 = \mu_2 = 0$. For these we get test statistics of 12.82 ($p < 0.001$, $r = 1$) and 18.94 ($p < 0.0001$, $r = 2$), respectively, and can clearly reject both hypotheses. Thus we find that in this sample the returns are smaller in volatile times than in normal times.

4.3 Eight-state models

Finally, we experiment with two more models. Based on a visual inspection of the data and the results presented in the previous section, there is quite clearly a structural change in the series. At around observation $t = 1200$ the overall variance of the series increases noticeably. Therefore we also consider two models with eight states, where the transition matrix is restricted as follows. First, we hypothesize two distinct regimes, each with four states (normal times, volatile times and two outlier states), as in models $\lambda_4$ and $\lambda_5$. The transition between these two regimes can only occur once, from either of the first two states in regime one to either of the first two states in regime two. These eight-state models will be denoted models $\lambda_6$ and $\lambda_7$, where the outlier states have a normal and a lognormal distribution, respectively. Although not reported here, we also estimated our four-state models of Section 4.1 for the two subsamples (of 1210 and 790 observations) separately. The corresponding parameter estimates were nearly identical to those obtained in our eight-state models.

Turning now to our eight-state models, results for which can be found in Tables 5 and 6, and Figures 4, 5 and 6, we can make the following observations. In these models, the means and variances of the outlier states in the second regime coincide almost exactly with those obtained in the four-state models, apart from the variance of the positive outlier state. This is hardly surprising, considering that the large outliers during the second regime are the most difficult observations to model. Based on Table 6, we prefer the model with lognormal outlier states (model $\lambda_7$) of these two. The diagnostics do not point to any misspecification for either of the models.

  State               1        2        3      4      5        6        7      8
  $\hat{\mu}_i$       0.00079  0.00021                0.00097  -0.0030
  $\hat{\sigma}_i$    0.0050   0.0072                 0.010    0.016
  $\hat{\mu}_i$                         -4.5   -5.8                     -2.8   -3.4
  $\hat{\sigma}_i$                      0.36   0.67                     0.073  0.15

Table 5: Estimated parameters of the observation densities in the eight-state model $\lambda_7$.


Figure 4: Model topology and the estimated transition probabilities for model $\lambda_7$ (see the legend of Figure 1 for explanation of the arrows).

Figure 4 gives the transition probabilities for model $\lambda_7$. In both regimes, the process never moves straight from the high volatility state to the low volatility state. In other words, for the process to get from volatile times to normal times, one or two outliers have to occur first. Similarly, in regime two the outlier states are only entered from the high volatility state. From Figure 5 we can see that during the first regime the process switches quite frequently between the different states, and several outliers occur. On the other hand, during the second regime the process stays mostly in the first state, with a few occasional switches to the second state and the outlier states.


Figure 5: Evaluation time series for model $\lambda_7$ (see the legend of Fig. 3 for explanation of the seven panels). The solid lines indicate the first regime (states 1-4) and the dashed lines the second (states 5-8).

As to the timing of the structural change, in model $\lambda_7$ the expected time to move for the first time to the second regime is 1218.9. This is quite close to our initial visual estimate. A lag of a few periods between the actual change and the timing estimated in the HMM is to be expected.

Figure 6: Estimated observation densities for model $\lambda_7$. The solid plots indicate the first regime and the dashed plots the second regime. When compared with Fig. 2, note the different scale on the y-axis.

In Figure 6, the observation densities for model $\lambda_7$ indicate that the outlier states of the first regime are closer to the majority of the observations than those in the second regime. The outlier states are also more asymmetric in the first regime. Compared to the two, three and four-state models discussed earlier, we find that the eight-state models are preferred by the AIC criterion (see Tables 4 and 6). Although their AIC values are again very close to each other, model $\lambda_7$ is the more favored model of these two, and also of all our models. This is rather remarkable, considering the large number of additional parameters in the eight-state models. It seems therefore that the structural change pointed out in the previous section is indeed important for this series, and it should be taken into account in the analysis. Note also that the dimension of the eight-state models is 43, and, not surprisingly, they would not have been selected if the BIC criterion were used.

  Model  AIC      Skewness  Kurtosis  Normality    ARCH
  6      6869.65  -0.071    3.01      1.71 [0.43]  0.75 [0.58]
  7      6870.14  -0.039    3.09      1.36 [0.51]  0.44 [0.82]

Table 6: Model selection and residual diagnostics for the eight-state models. The normality and ARCH test statistics have their p values in brackets.

5 Conclusions

In this paper we have described, first, an extension to HMMs, allowing the use of different probability distributions for the hidden states, and, second, restricted estimation of the model using generalized Lagrange multipliers. This extension should be useful in several empirical applications. Our example used the extension to model outliers in stock market return data. We experimented with several alternative model specifications, some of which result in adequate models for the data. Three states were not enough to remove ARCH effects from the data, nor to make the residuals white noise. Four-state models, where two of the states correspond to isolated outliers, did achieve this, depending on the selected distribution for the outlier states. This shows that the presence of only a few outliers in even a long series can cause serious problems for hidden Markov modeling. The presence of a structural change in our series was also convincingly demonstrated, since our eight-state models improve the fit further. Summing up, our HMMs were capable of modeling two problematic features of the data, namely the outliers and the major regime change. Our initial modeling idea seems therefore appropriate for data of this kind.

Finally, we note that Kim et al. (1998) use a simple three-state HMM for monthly stock market index returns, and find that most of the non-normality, and all of the ARCH effects, disappear from their data once the series is standardized with the estimated variances of the three hidden states. It seems therefore that HMMs are capable of whitening stock market returns to a considerable extent, and for a wide range of data frequencies. This information could prove useful in further applications in empirical finance, for example in risk management.

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, AC-19, 716-723.

Baum, L. E., Petrie, T., Soules, G., & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41, 164-171.

Bickel, P. J., Ritov, Y., & Rydén, T. (1998). Asymptotic normality of the maximum-likelihood estimator for general hidden Markov models. The Annals of Statistics, 26, 1614-1635.

Campbell, J. Y., Lo, A. W., & MacKinlay, A. C. (1997). The econometrics of financial markets. Princeton, New Jersey: Princeton University Press.

Cox, D. R., & Hinkley, D. V. (1974). Theoretical statistics. New York: Chapman and Hall.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood estimation from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1-38.

Doornik, J. A., & Hansen, H. (1994). An omnibus test for univariate and multivariate normality (Working paper). Nuffield College, Oxford.

Engel, C., & Hamilton, J. D. (1990). Long swings in the Dollar: Are they in the data and do markets know it? The American Economic Review, 80, 689-713.

Engle, R. F. (1982). Autoregressive conditional heteroskedasticity with estimates of the variance of United Kingdom inflation. Econometrica, 50, 987-1006.

Hamilton, J. D. (1989). A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica, 57, 357-384.

Hamilton, J. D. (1994). Time series analysis. Princeton, New Jersey: Princeton University Press.

Hamilton, J. D. (1996). Specification testing in Markov-switching time-series models. Journal of Econometrics, 70, 127-157.

Hansen, B. E. (1992). The likelihood ratio test under nonstandard conditions: testing the Markov switching model of GNP. Journal of Applied Econometrics, 7, S61-S82.

Hansen, B. E. (1996). Erratum: The likelihood ratio test under nonstandard conditions: testing the Markov switching model of GNP. Journal of Applied Econometrics, 11, 195-198.

Johnson, N. L., & Kotz, S. (1970). Continuous univariate distributions-1. Boston: Houghton Mifflin Company.

Juang, B.-H., & Rabiner, L. R. (1985). A probabilistic distance measure for hidden Markov models. AT&T Technical Journal, 64, 391-408.

Juang, B.-H., & Rabiner, L. R. (1990). The segmental K-means algorithm for estimating parameters of hidden Markov models. IEEE Transactions on Acoustics, Speech, and Signal Processing, 38, 1639-1641.

Kim, C.-J., Nelson, C. R., & Startz, R. (1998). Testing for mean reversion in heteroskedastic data based on Gibbs-sampling-augmented randomization. Journal of Empirical Finance, 5, 131-154.

Leroux, B. G. (1992). Maximum-likelihood estimation for hidden Markov models. Stochastic Processes and their Applications, 40, 127-143.

Leroux, B. G., & Puterman, M. L. (1992). Maximum-penalized-likelihood estimation for independent and Markov-dependent mixture models. Biometrics, 48, 545-558.

Michalek, S., & Timmer, J. (1999). Estimating rate constants in hidden Markov models by the EM algorithm. IEEE Transactions on Signal Processing, 47, 226-228.

Nádas, A. (1983). Hidden Markov chains, the forward-backward algorithm, and initial statistics. IEEE Transactions on Acoustics, Speech, and Signal Processing, 31, 504-506.

Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77, 257-286.

Rabiner, L. R., Juang, B.-H., Levinson, S. E., & Sondhi, M. M. (1985). Some properties of continuous hidden Markov model representation. AT&T Technical Journal, 64, 1251-1270.

Robert, C. P., Rydén, T., & Titterington, D. M. (2000). Bayesian inference in hidden Markov models through the reversible jump Markov chain Monte Carlo method. Journal of the Royal Statistical Society B, 62, 57-75.

Rydén, T. (1995). Estimating the order of hidden Markov models. Statistics, 26, 345-354.

Rydén, T., Teräsvirta, T., & Åsbrink, S. (1998). Stylized facts of daily return series and the hidden Markov model. Journal of Applied Econometrics, 13, 217-244.

Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461-464.

Self, S. G., & Liang, K.-Y. (1987). Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. Journal of the American Statistical Association, 82, 605-610.

Tezuka, S., & L'Ecuyer, P. (1991). Efficient and portable combined Tausworthe random number generators. ACM Transactions on Modeling and Computer Simulation, 1, 99-112.

Timmer, J., & Klein, S. (1997). Testing the Markov condition in ion channel recordings. Physical Review E, 55, 3306-3311.

Timmermann, A. (2000). Moments of Markov switching models. Journal of Econometrics, 96, 75-111.

Törn, A., & Žilinskas, A. (1989). Global optimization. Berlin: Springer-Verlag.

Wu, C. F. (1983). On the convergence properties of the EM algorithm. The Annals of Statistics, 11, 95-103.

Zhang, J., & Stine, R. A. (2000). Autocovariance structure of Markov regime switching models and model selection. Journal of Time Series Analysis. (Forthcoming)


Turku Centre for Computer Science
Lemminkäisenkatu 14
FIN-20520 Turku
Finland
http://www.tucs.abo.

University of Turku
- Department of Mathematical Sciences

Åbo Akademi University
- Department of Computer Science
- Institute for Advanced Management Systems Research

Turku School of Economics and Business Administration
- Institute of Information Systems Science
