On Binary and Categorical Time Series Models with ...

1 downloads 0 Views 817KB Size Report
Motivated by the work of Russell and Engle (1998, 2005), who proposed a ... for the analysis of categorical data have been studied; see the books by Joe (1997) ...
On Binary and Categorical Time Series Models with Feedback Theodoros Moysiadis Konstantinos Fokianos

Department of Mathematics & Statistics, University of Cyprus e-mail:{moysiadis.theodoros, fokianos}@ucy.ac.cy

Submitted: July 2013 1st Revision: February 2014 2nd Revision: June 2014

Abstract We study the problem of ergodicity, stationarity and maximum likelihood estimation for multinomial logistic models that include a latent process. Our work includes various models that have been proposed for the analysis of binary and, more general, categorical time series. We give verifiable ergodicity and stationarity conditions for the analysis of such time series data. In addition, we study maximum likelihood estimation and prove that, under mild conditions, the estimator is asymptotically normally distributed. These results are applied to real and simulated data.

Keywords: autocorrelation; categorical data; hidden Markov models; latent process; logistic regression; multinomial regression; nominal data; prediction; weak dependence. AMS subject classifications: primary 62M10, secondary 62F12, 62J12.

1 Introduction Motivated by the work of Russell and Engle (1998, 2005), who proposed a new approach to model financial transactions data, we study the problem of ergodicity, stationarity and maximum likelihood estimation for the so called autoregressive conditional multinomial (ACM) model with the logistic link, and its generalizations. More specifically,

1

Russell and Engle (1998, 2005) develop a model for the joint distribution of discrete price changes and durations conditional on the observed history. This joint distribution of both processes is decomposed into the product of the conditional density of the mark and the marginal density of the arrival times, both conditioned on the past filtration of the joint information set. Modeling of the conditional density of the mark is accomplished by the ACM model of order (l1 ,l2 ), which is a general class, contained in the class introduced by model (8), when l1 = l2 = 1. The authors examine theoretical properties of the model. However, the issue of obtaining sufficient conditions for the transaction price process to be stationary and ergodic, has not been investigated thoroughly in the literature, to the best of our knowledge. We address the problem of stationarity and ergodicity for the general case of model (8); see Theorem 1. The goal of this contribution is to provide the necessary conditions, under which the MLE estimator is asymptotically normally distributed; see Lemma 1 and Theorem 2. One version of the ACM model is based on multinomial regression; a method with generalizes naturally the standard logistic regression. An alternative approach to the ACM class of models is based on the probit link function. Such autoregressive models have been considered by Zeger and Qaqish (1988), Rydberg and Shephard (2003), Kauppi and Saikkonen (2008), among others. The work by de Jong and Woutersen (2011) provides asymptotic results for the case of the dynamic probit model (4). Our approach is based on the theory of generalized linear models, see McCullagh and Nelder (1989), and in particular our focus is on the multinomial distribution. It is an elementary exercise to show that the multinomial distribution belongs to the multivariate exponential family and as such, the theory of generalized linear models can be applied for modeling various types of categorical data; nominal, interval and scale. In this contribution we will be working with nominal data and therefore the multinomial logistic model is the natural candidate for model fitting; see Kaufmann (1987), Fahrmeir and Tutz (2001), Kedem and Fokianos (2002, Ch. 3) and Fokianos and Kedem (2003) for further discussion on modeling issues regarding categorical data. We note that Markov chains provide a simple but important example of categorical time series where lagged values of the response are important in determining its future states. Markov modeling in the context of categorical time series however, can be problematic for two reasons. First, as the order of the Markov chain increases so does the number of free parameters; in fact, the number of free parameters increases exponentially fast. In addition, the assumption of Markov property requires the specification of the joint dynamics of the response and any possible covariates observed jointly; such a specification might not be possible, in general. We will be studying models for binary and, more generally, categorical time series, which are driven by a latent process or a feedback mechanism. This type of models is quite analogous to GARCH models -see Bollerslev (1986)but they are defined in terms of conditional log-odds instead of conditional variances. Such feedback models make 2

possible low dimensional parametrization, yet they can accommodate quite complicated data structures. Paradigms of feedback models, in the context of count time series for example, have been studied recently by Fokianos et al. (2009), Franke (2010), Fokianos and Tjøstheim (2011), Neumann (2011), Fokianos and Tjøstheim (2012) and Doukhan et al. (2012). In particular, we note that this contribution is closer to the modeling approach suggested by Fokianos and Tjøstheim (2011), because the main idea is essentially to use the canonical link process to model the observed data. Several other models for the analysis of categorical data have been studied; see the books by Joe (1997) and MacDonald and Zucchini (1997) and the recent articles by Biswas and Song (2009) and Weiß (2011). The outline of the paper is as follows. Section 2 puts forward the main model that we consider and discusses some of its properties. Section 3 develops results regarding its probabilistic properties. Section 4 discusses maximum likelihood estimation. In Section 5, we verify all the theoretical results via a simulation study for two special cases of model (8) that are of specific interest that is model (3) and model (7). Finally, in Section 6 we perform a real data analysis, where we give additional motivation for financial applications. An Appendix contains the proofs of all theoretical results.

2 Dynamic Modeling of Binary and Categorical Time Series We will be interested on a categorical time series, say {Y˜t , t = 1, . . . , N }, where N denotes the sample size. Let m be the number of possible categories. This means that for each t, the possible values of Y˜t are 1, 2, . . . , m − 1, m, corresponding to the first, second category and so on. In general, and especially for nominal data, the aforementioned assignment of integer values to the categories is rather arbitrary. It is usually made as a matter of convenience, and it should be clear that such an assignment is not unique. However, it is useful to note, that regardless of any assignment, the tth observation of a categorical time series, can be expressed by the vector Yt = (Y1t , Y2t , . . . , Yqt )T of length q = m − 1 with the following elements

Yjt

  1, if the jth category is observed at time t, =  0, otherwise,

for t = 1, 2, . . . , N and j = 1, 2, . . . , q. Consider an increasing sequence of σ-fields, say {Ft }t≥1 , which will be specified in detail later. Denote further by pt = (p1t , p2t , . . . , pqt )T the vector of conditional success probabilities given Ft−1 , that is pjt = P (Yjt = 1|Ft−1 ) = E(Yjt |Ft−1 ), t = 1, 2, . . . , N, j = 1, 2, . . . , q.

3

It is clear that the last category is recovered by the correspondence Ymt = 1 −

q ∑

Yjt

and

pmt = 1 −

q ∑

pjt .

j=1

j=1

In what follows d, A, B are generally unknown parameters. In fact, d is a real vector of dimension q and A, B are q × q real matrices. Even though these symbols will be employed for defining distinct models, their meaning will be clear from the context. Our focus is on developing and studying models for categorical time series, which include a feedback mechanism or an unobserved hidden process. For instance, one can consider the following linear model pt = d + Apt−1 + BYt−1 ,

t ∈ Z,

(1)

which can be viewed as a simple generalized linear model with identity link for categorical data. Such model was suggested by Russell and Engle (1998) and Qaqish (2003). However model (1) cannot be applied easily to data, since its structure implies complicated restrictions on the parameters d, A and B. This is so, because each element of the vector of probabilities pt should lie between zero and one. In fact, model (1) imposes more complicated restrictions on d, A, B, when covariates are under consideration. An alternative model, which falls within the framework of generalized linear models, is the one that corresponds to the canonical link function of a multinomial random variable; see Kedem and Fokianos (2002, Ch.3). It is given by λt = d + Aλt−1 + BYt−1 , where λt =

t ∈ Z,

(2)

( )T p1t p2t pqt log , log , . . . , log . pmt pmt pmt

Note that λt is a function defined on {0, 1}q × Rq , that takes values in Rq . Comparing (1) with (2) we observe that there is no need for complicated restrictions on the parameter space. Model (2) allows for the inclusion of covariates in a straightforward manner, without any restriction. This is so, because of the properties of the logistic link. A simple example of (2), is given in the case of a binary time series; that is when the number of categories m is equal to two. In this case, we obtain a type of an autologistic model with a feedback. In the case of a binary time series, set pt = P (Yt = 1|Ft−1 ) and let λt = log

pt . 1 − pt

Then (2) becomes λt = d + a1 λt−1 + b1 Yt−1 , 4

t ∈ Z,

(3)

for some unknown parameters d, a1 , b1 . By recalling that the logistic link corresponds to the canonical link of the Bernoulli distribution, we note that model (3) is similar to the log-linear model proposed by Fokianos and Tjøstheim (2011) for the analysis of count time series. Note that the parameters d, a1 , b1 appearing in (3), generally belong to the set of real numbers; further constraints will be given below to ensure ergodicity and stationarity of the joint process (Yt , λt ). In addition, covariates can be easily included in the right hand side of equation (3) because the log–odds process, λt , can take values in R; see also Kedem and Fokianos (2002, Ch.2) and Tjøstheim (2012, p.471). An alternative modeling approach is based on extensions of the static probit model, which is defined through the specification pt = Φ(πt ),

πt = d + β T xt−1 ,

where Φ(·) denotes the c.d.f. of a standard normal random variable, d is a constant term, xt−1 contains explanatory variables and β is the vector consisting of other parameters. The dynamic model -which is an immediate extensionis obtained by adding a lagged term of the response process. It has been considered by Cox (1981), and Zeger and Qaqish (1988), among others, and given by πt = d + δ1 Yt−1 + β T xt−1 .

(4)

Kauppi and Saikkonen (2008) consider the model πt = d + a1 πt−1 + δ1 Yt−1 + β T xt−1 ,

(5)

and assume that |a1 | < 1. They refer to this model as a dynamic autoregressive model. Nyberg (2010) studied a special case of (5) for δ1 = 0, that is πt = d + a1 πt−1 + β T xt−1 . In addition, Nyberg (2010, Ch.2,p.43) considers an extended version of the following form πt = d + a1 πt−1 + β T xt−1 + γ T zt−1 Yt−1 , where the interaction term γ T zt−1 Yt−1 is added and zt−1 includes predictors, that may be different from those included in xt−1 . Rydberg and Shephard (2003) introduced the model πt = d + β T xt−1 + gt , where gt = a1 gt−1 + δ1 Yt−1 . Kauppi (2008) proposed the following alternative πt = d + δ1 Yt−1 + gt , where gt = a1 gt−1 + β T xt−1 . 5

Suppose we include no covariates in (5), that is β = 0. Then we have πt = d + a1 πt−1 + δ1 Yt−1 .

(6)

Models (3) and (6) are similar in the sense that they are defined by means of a latent process (λt and πt respectively), which is a transformation of pt . Model (6) is defined by means of the probit transformation, whereas model (3) can be expressed as pt = FL (λt ),

λt = d + a1 λt−1 + b1 Yt−1 ,

where FL (·) denotes the standard logistic c.d.f. In de Jong and Woutersen (2011) it is shown that, under some regularity conditions, the standard asymptotic theory of ML estimation applies to model (4). However, a formal proof concerning the asymptotic distribution of MLE for the autoregressive model (5) still remains to be investigated; see also Nyberg (2010, p.43). This contribution fills this gap for a non-linear multinomial logistic model. Our results can be extended in the case of probit regression, without any major obstacles. Example 1. An example of model (2) with m = 3 categories is given by the following           λ1t d1 a11 a12 λ1(t−1) b11 b12 Y1(t−1) = +  +  , λt =  λ2t d2 a21 a22 λ2(t−1) b21 b22 Y2(t−1)

(7)

with λ1t = log

p1t , p3t

λ2t = log

p2t , p3t

where di , aij , bij , i, j = 1, 2, are all real valued parameters, which satisfy conditions to be determined in the sequel. Figure 1(a) displays a simulated time series from this model with length equal to N = 150. The true value of the parameters, displayed below, satisfy the condition given later on in Example 2      0.2 0.2 0.15 0.15 , A =   , and B =  d= −0.3 0.1 0.35 0.30

 0.25

.

0.20

The generating process starts with arbitrary values for Y1 , λ1 , p1 in order to get λ2 , p2 and afterwards Y2 is generated through p2 and so on. In Figure 1 (b), (c), (d) the respective transition probabilities of the three categories are displayed. Model (2) implies that there exists a feedback mechanism which determines the evolution of the observed process. Hence, we expect that such model will be more parsimonious than a model which does not include λt−1 . Furthermore, by recursion, we can write λt =

( t−1 ∑

) k

A

d + At λ0 +

k=0

t ∑ k=1

6

At−k BYk−1 ,

(b)

0

50

100

0.42 0.45

p1t

2.0 1.0

Yt

3.0

(a)

150

0

50

150

100

150

0.28

p3t

0.27

0.34

(d)

0.23

p2t

(c)

100

0

50

100

150

0

50

Figure 1: Simulated realization of a 3-state time series generated through model (7) and of the corresponding transition probabilities; see Example 1 for parameter specification. (a) Yt , (b) p1t , (c) p2t , (d) p3t . where A0 = Iq and Iq is the q × q identity matrix. Hence, λt is determined by the values of the observed process Yt and the initial value λ0 and therefore model (2) is an observation driven model; see Cox (1981). In addition, model (2) takes into account strong correlation through the feedback mechanism. This is a similar phenomenon for count time series models, see Fokianos (2012). Recall that q = m − 1. Model (2) has in general (m − 1) (2m − 1) unrestricted parameters, since d contains q parameters and both matrices A, B contain q 2 unrestricted parameters. However, imposing restrictions on A and B will, in general, reduce the number of parameters. A special case is when A and B in (2) are diagonal. Then the ∑q processes λit = log(pit /(1 − i=1 pit )), i = 1, . . . , q, depend on their own past and the respective past values of Yt . Similar interpretations can be given for other parametrizations.

7

3

Main Results

In this section we study conditions for ergodicity and stationarity of a slightly more general version of model (2). The main tool is the notion of weak dependence, which was introduced by Doukhan and Louhichi (1999); see also Dedecker et al. (2007). If {ξt } is an i.i.d. sequence, which satisfies some conditions, then Doukhan and Wintenberger (2008) prove the existence of a weakly dependent strictly stationary solution of Xt = F (Xt−1 , Xt−2 , Xt−3 , . . . ; ξt ) that is a chain with infinite memory. Doukhan et al. (2012) assume a contraction type condition (see also Fokianos et al. (2009), Neumann (2011), Fokianos and Tjøstheim (2012)) to prove stationarity, ergodicity and infer regarding the asymptotic theory of a Poisson based model for the analysis of count time series. We shall consider the general case of model (2). Recall that the vector of responses is denoted by Yt = (Y1t , . . . , Yqt )T . Let Ft denote the σ-field generated by Ys , for s ≤ t and λ0 , which denotes some initial value. We assume that Yt , given the σ-field Ft−1 , is multinomially distributed with probability vector pt . Then model (2) can be further generalized as follows:    λ1t f1 (Yt−1 , λt−1 )        λ2t   f2 (Yt−1 , λt−1 )   λt =  ..  ..  =   .   .    λqt

     = f (Yt−1 , λt−1 ),   

t ∈ Z,

(8)

fq (Yt−1 , λt−1 )

where fi , i = 1, . . . , q, are functions defined on {0, 1}q × Rq , that take values in Rq . Model (8) allows for a variety of non-linear models for the analysis of binary and categorical time series. The non-linear specification, introduced by the components of the function f, is made possible by the fact that the logistic function assumes values in R. Note that (8) reduces to model (2) when f (·) is a linear function. In the following results we have shown that in the general case of model (8) there exists a unique solution, which is stationary and ergodic and has finite moments of any order. To state the theorem, we recall that the L1 norm of an ∑m m × n matrix M = (mij ) is given as ∥M ∥ = max1≤j≤n i=1 |mij |. In addition, we denote by the same symbol the L1 norm of a vector and their meaning will be clear upon the context. Theorem 1. Consider model (8) and assume that for any X = (Y, λ) , X ′ = (Y ′ , λ′ ) in {0, 1}q × Rq , there exist q × q real matrices M, N , such that ∥f (Y, λ) − f (Y ′ , λ′ )∥ ≤ ∥M ∥ · ∥λ − λ′ ∥ + ∥N ∥ · ∥Y − Y ′ ∥.

(9)

Suppose further that [1 + 3(q − 1)]∥N ∥/4 + ∥M ∥ < 1. Then there exists a causal solution of (8), {(Yt , λt ), t ∈ Z}, which is stationary, ergodic and satisfies E||(Y0 , λ0 )||s < ∞, ∀s ∈ N. 8

We give below some examples to bring across the essence of the above theorem. The first example refers to the case of the multinomial logistic model (2), while the second example discusses the linear model (1). Example 2. Recall that in the case of model (2), the function f (·) appearing in (8) equals to the linear function. In this case, we have that ′ ′ ∥f (Yt−1 , λt−1 ) − f (Yt−1 , λ′t−1 )∥ ≤ ∥A∥ · ∥λt−1 − λ′t−1 ∥ + ∥B∥ · ∥Yt−1 − Yt−1 ∥.

Hence, we need to assume [1 + 3(q − 1)]∥B∥/4 + ∥A∥ < 1, to obtain a stationary and ergodic solution of (2). In particular, when m = 3, the required condition becomes ∥A∥ + ∥B∥ < 1. Example 3. We note that the proof of Theorem 1, accommodates the case of model (1). Recall that pt = f (Yt−1 , pt−1 ) with pt given by (1) and therefore we have that ′ ′ ∥f (Yt−1 , pt−1 ) − f (Yt−1 , p′t−1 )∥ ≤ ∥A∥ · ∥pt−1 − p′t−1 ∥ + ∥B∥ · ∥Yt−1 − Yt−1 ∥.

It turns out that it is sufficient to assume that [1 + 2(q − 1)]∥B∥ + ∥A∥ < 1, in addition to any constraints implied by model (1), which force {pt } to stay between zero and one. This short discussion shows that the application of model (1) is quite challenging for any real application. Hence, it will not be considered any further. We give below conditions for the special case of a binary time series. Corollary 1. Consider model (8) for q = 1 and assume there exist β1 , β2 > 0, such that |f (Y, λ) − f (Y ′ , λ′ )| ≤ β1 · |λ − λ′ | + β2 · |Y − Y ′ |. Suppose that a = β1 + β2 < 1. Then there exists a causal solution {(Yt , λt ), t ∈ Z}, which is stationary, ergodic and satisfies E||(Y0 , λ0 )||s < ∞, ∀s ∈ N. Remark 1. For the specific case of model (3), it is enough to assume 4|a1 | + |b1 | < 4 and the result of Theorem 1 is true. We note that in this case, the values of a1 and b1 are allowed to belong to a larger set of values than the typical stationarity region of ordinary ARMA models. Remark 2. For the specific case of model (6), it is enough to assume

√ √ 2π|a1 |+|δ1 | < 2π and the result of Theorem

1 is also true. Note that in this case, δ1 is restricted to a smaller set as compared to the set obtained for model (3). Figure 2 displays the range of the parameters a1 and b1 (δ1 ) to ensure stationarity for models (3) and (6) respectively. Equipped with the aforementioned results, we now turn our attention to the problem of maximum likelihood estimation. The following section gives sufficient conditions for the MLE to be consistent and asymptotically normally distributed. 9

Figure 2: Stationarity region for models (3) (solid line) and (6) (dashed line). The horizontal axis corresponds to the range of the values of a1 and the vertical axis corresponds to the range of the values of b1 (δ1 ).

4

Inference

Recall model (8) and assume that the function f (·) depends on some finite dimensional parameter θ, with s = dim(θ). For instance when model (2) holds, then θ = (dT , vec(A)T , vec(B)T )T denotes the vector of unknown parameters of model (2). Recall that the vectorization of an m × n matrix A, which is denoted as vec(A), is defined as the mn × 1 column vector obtained by stacking the columns of the matrix A on top of one another, that is vec(A) = (a11 , a21 , . . . , am1 , a12 , . . . , a1n , . . . , amn )T . In what follows, we discuss the estimation of θ, based on the conditional likelihood function for model (8). The likelihood function is given by

LN (θ) =

N ∏

P (Yt = yt |Ft−1 ) =

t=1

N ∏ m ∏

P (Yjt = 1|Ft−1 )Yjt =

t=1 j=1

N ∏ m ∏

Y

pjtjt (θ).

(10)

t=1 j=1

The conditional log-likelihood is equal to lN (θ) =

N ∑ t=1

lt (θ) =

N ∑ m ∑

Yjt log pjt (θ).

(11)

t=1 j=1

ˆ we maximize (10). We denote by θ ˆ = In order to compute the conditional maximum likelihood estimator (MLE) θ arg max(LN (θ)), where Θ ⊆ Rs denotes the parameter space. This in turn implies that if there exists a solution to θ∈Θ

10

the system of equations ∇lN (θ) = 0, this would be a local conditional MLE. The score function is equal to SN (θ) =

N ∑

N ∑ ∂lt (θ)

∇lt (θ) =

t=1

∂θ

t=1

=

N ∑ ∂λt (θ) t=1

∂θ

(Yt − pt (θ)).

(12)

In particular, for model (2) we obtain that ) ( ∂λt (θ) ∂λt (θ) ∂λt (θ) ∂λt (θ) , , = ∂dT ∂vec(A)T ∂vec(B)T ∂θ T ( ) ∂λt−1 (θ) ∂λt−1 (θ) ∂λt−1 (θ) T T = Iq + A , λ (θ) ⊗ I + A , λ (θ) ⊗ I + , q q t−1 t−1 ∂dT ∂vec(A)T ∂vec(B)T where Iq denotes the identity matrix of order q and ⊗ denotes the Kronecker’s product. The conditional information matrix is given by ] ∑ N ∂λt (θ) ∂λt (θ) ∂λt (θ) GN (θ) = Cov (Yt − pt (θ)) Ft−1 = Σt (θ) , ∂θ ∂θ ∂θ T t=1 t=1 N ∑

[

where Σt (θ) is the conditional covariance matrix of Yt with generic elements   p (θ)(1 − p (θ)), if i = j, it it (ij) σt (θ) =  −p (θ)p (θ), if i ̸= j, it

jt

for i, j = 1, 2, . . . .q. We also define the matrix [ G(θ)

=

E

] ∂λt (θ) ∂λt (θ) Σt (θ) , ∂θ ∂θ T

(13)

where expectation is taken in respect to the stationary distribution. By further differentiation of the score equations we obtain the Hessian matrix as follows HN (θ)

=



N ∑ ∂ 2 lt (θ) t=1

=

∂θ∂θ T

N [ ∑ ∂λt (θ)

Σt (θ)

∂λt (θ) ∂θ T

]

∂θ ] q [ 2 N ∑ ∑ ∂ λrt (θ) (Yrt − prt (θ)) − ∂θ∂θ T t=1 r=1 t=1

= GN (θ) − RN (θ).

(14)

For model (2), we have the simplification HN (θ) = GN (θ). The following lemmas contain the main results used for ˆ But first we state the main assumptions, used to prove Lemma 1, Lemma A-2 and proving asymptotic normality of θ. Theorem 2. 11

Assumption 1

The parameter θ belongs to a compact set Θ and the true value, θ 0 , belongs to the interior of Θ.

Assumption 2

The components of ∂fr /∂θ, r = 1, . . . , q, are assumed to be linearly independent.

Assumption 3

The functions fr (·), r = 1, . . . , q are four times differentiable with respect to θ and λ. In addition if



T

T T

x = (θ , λ ) = (θ1 , . . . , θs , λ1 , . . . , λq )T , then ∂fr (Y, λ; θ) ∂fr (Y ′ , λ′ ; θ) ≤ br1i ||Y − Y ′ || + br2i ||λ − λ′ ||, i = 1, . . . , s + q, − ∂x∗i ∂x∗i ∂ 2 f (Y, λ; θ) ∂ 2 f (Y ′ , λ′ ; θ) r r − ≤ br1ij ||Y − Y ′ || + br2ij ||λ − λ′ ||, i, j = 1, . . . , s + q, ∂x∗i ∂x∗j ∂x∗i ∂x∗j ∂ 3 f (Y, λ; θ) ∂ 3 f (Y ′ , λ′ ; θ) r r − ≤ br1ijk ||Y − Y ′ || + br2ijk ||λ − λ′ ||, i, j, k = 1, . . . , s + q. ∂x∗i ∂x∗j ∂x∗k ∂x∗i ∂x∗j ∂x∗k We further assume that ∀i, j, k ∈ {1, . . . , s + q} the following hold ∑s+q ∑s+q ∑s+q (br1i + br2i ) < ∞, i,j,k (br1ijk + br2ijk ) < ∞, i,j (br1ij + br2ij ) < ∞, i E|∂fr (0, 0; θ)/∂x∗i | < ∞,

E|∂ 2 fr (0, 0; θ)/∂x∗i ∂x∗j | < ∞,

E|∂ 3 fr (0, 0; θ)/∂x∗i ∂x∗j ∂x∗k | < ∞.

Assumptions 1-3 are standard regularity conditions for proving asymptotic normality of the conditional MLE. Assumption 1 rules out the possibility of the true parameter to belong to the boundary of the parameter space. Assumption 2 states that the elements of the matrix ∂f /∂θ are linearly independent. Therefore the conditional information matrix is positive definite and its inverse would exist. Assumption 3 implies that the function f is sufficiently smooth, so that higher order derivatives of the conditional log-likelihood function exist and are finite. Lemma 1. Under the Assumptions of Theorem 1 and Assumptions 1-3, we have the following results, as N → ∞. (i) The score function defined in (12) satisfies 1 D √ SN (θ 0 ) −→ N (0, G(θ 0 )), N where G(θ) is a positive definite matrix, defined in (13). (ii) The Hessian matrix defined in (14) satisfies 1 p HN (θ 0 ) −→ G(θ 0 ). N { √ } (iii) Within the neighborhood of the true value, ON (θ 0 ) = θ : ||θ − θ 0 || ≤ r/ N , r > 0, N 1 ∑ ∂ 3 lt (θ) max sup ≤ Kn , i,j,k θ∈O(θ 0 ) N ∂θi ∂θj ∂θk t=1

12

p

such that Kn −→ K, where K is a constant. Lemma A-2 in the Appendix shows that the initial values used for likelihood estimation are irrelevant when con˜ 0 , different starting values for model (8), then the sidering asymptotics. In particular, we show that given λ0 and λ difference between the corresponding log-likelihood functions tends to zero. Theorem 2. For model (8) and under the Assumptions of Theorem 1 and Assumptions 1-3, there exists an open neighborhood of the true value θ 0 , namely O = ON (θ 0 ), such that a locally unique conditional MLE exists with ˆ which is probability converging to one, as N → ∞. Furthermore, there exists a sequence of conditional MLEs, θ, consistent and asymptotically normal √ D ˆ − θ 0 ) −→ N (θ N (0, G−1 (θ 0 )), where G is defined in (13). Global uniqueness of the MLE cannot be guaranteed, in general, unless further conditions on f are imposed. We conjecture that for model (2), there will exist a unique MLE under some conditions (see Albert and Anderson (1984) and Kaufmann (1989), for instance). This issue, however, deserves further investigation. Remark 3. It is obvious that Theorem 2 holds true for model (3).

5 Simulations In what follows, we present a limited simulation study. All computations have been implemented in R (R Core Team (2013)). The results are based on 1000 simulations for both cases. We discard the first 500 observations after generating data, to ensure that the stationarity region has been reached.

5.1

Binary Case

For the binary time series model (3), the data are generated using as initial value p0 = 0.5, which gives λ0 = 0. For the process of derivatives we set ∂λ0 (θ)/∂θ = (1, 1, 1)T . Maximum likelihood estimators are calculated by maximizing the log-likelihood function given in (11) for m = 2. To obtain initial values for the parameter vector, we employ the function glm built in R. Recall that parameters a1 and b1 have to satisfy the condition 4|a1 | + |b1 | < 4, (see Remark 1). Hence, we apply constrained optimization by employing the function constrOptim of R. The results of the simulation are displayed in Table 1. The estimators are in general consistent. Moreover, the existence of small 13

deviations from the true value, when the sample size is small (N = 500), especially for the parameter a1 , is alleviated when the sample size increases to N = 1000. The estimation is more accurate when the sample size increases further to N = 2000. Further increase of the sample size to N = 2500 does not provide significant improvement. The reason for requiring rather large sample sizes to improve upon estimation is the form of the data which takes only two values; zero or one. To save space, we do not report histograms and QQ-plots of the estimators but we point out that the asserted normality is achieved quite satisfactorily (results are available from the authors). Sample size N =500

N =1000

N =2000

N =2500



0.445

(0.253)

0.483

(0.196)

0.492

(0.135)

0.493

(0.121)

a ˆ1

-0.411

(0.340)

-0.472

(0.236)

-0.492

(0.148)

-0.491

(0.146)

ˆb1

0.519

(0.201)

0.506

(0.136)

0.506

(0.098)

0.503

(0.085)



0.515

(0.290)

0.495

(0.205)

0.493

(0.136)

0.503

(0.126)

a ˆ1

-0.313

(0.225)

-0.295

(0.159)

-0.300

(0.109)

-0.301

(0.096)

ˆb1

0.998

(0.220)

0.997

(0.156)

1.004

(0.109)

1.000

(0.099)



-0.491

(0.188)

-0.495

(0.132)

-0.499

(0.094)

-0.498

(0.082)

a ˆ1

-0.295

(0.091)

-0.295

(0.067)

-0.298

(0.046)

-0.299

(0.040)

ˆb1

-2.047

(0.317)

-2.019

(0.221)

-2.003

(0.153)

-2.016

(0.140)

Table 1: Maximum likelihood estimators and their standard errors (in parentheses) for model (3). True parameters are (d, a1 , b1 ) = (0.5, −0.5, 0.5) (upper panel), (d, a1 , b1 ) = (0.5, −0.3, 1) (middle panel) and (d, a1 , b1 ) = (−0.5, −0.3, −2) (lower panel) for different sample sizes. Results are based on 1000 runs.

5.2

Categorical Case

For the multinomial time series model with m = 3 categories, the data are generated through equation (7) with initial value λ0 = (0, 0)T . The starting value for the derivative process is a 10 × 2 matrix of zero’s. In order to maximize the log-likelihood function (11) for m = 3, we use constrained optimization. The starting vector for the parameter values, which initiates the optimization algorithm, is taken by a simple linear modeling with response λt and predictors λt−1 and Yt−1 . To obtain λt , we apply glm and optim in each row of the 2 × N data matrix, calculate λ1t and λ2t iteratively through model (3) (based on the parameter estimators obtained by glm and optim) and combine them into

14

λt . The condition that must be satisfied by the parameters in this case is ||A|| + ||B|| < 1 (see Example 2). The results are displayed in Table 2. We observe that the estimators are consistent. In addition, the estimation improves when the sample size increases. Figure 3, which demonstrates the QQ-plots and the histograms of the estimators, confirms the claim that the estimators are normally distributed, with the exception of small deviations in some cases. Sample size N =1000

N =1500

N =2000

dˆ1

-0.474

(0.197)

-0.485

(0.174)

-0.481

(0.155)

dˆ2

0.356

(0.178)

0.360

(0.154)

0.360

(0.144)

a ˆ11

-0.183

(0.310)

-0.203

(0.286)

-0.201

(0.285)

a ˆ21

0.006

(0.295)

0.009

(0.271)

0.008

(0.262)

a ˆ12

0.009

(0.300)

0.024

(0.288)

0.042

(0.272)

a ˆ22

-0.165

(0.296)

-0.183

(0.270)

-0.180

(0.275)

ˆb11

-0.236

(0.188)

-0.253

(0.173)

-0.265

(0.161)

ˆb21

0.093

(0.170)

0.100

(0.142)

0.100

(0.135)

ˆb12

0.097

(0.162)

0.097

(0.146)

0.098

(0.126)

ˆb22

-0.250

(0.151)

-0.258

(0.127)

-0.255

(0.116)

Table 2: Maximum likelihood estimators and their standard errors (in parentheses) for model (7). True parameters are (d1 , d2 , a11 , a21 , a12 , a22 , b11 , b21 , b12 , b22 )T = (−0.5, 0.4, −0.23, 0.1, 0.08, −0.2, −0.3, 0.1, 0.1, −0.25)T , for different sample sizes. Results are based on 1000 runs.

6

Data Analysis

A multinomial logit model may be applied in many cases where strong correlation is present. Geophysical or environmental time series are frequently autocorrelated because of the usual presence of inertia in the basic characteristics of a physical system. The same phenomenon is typical for financial data. We will see how the proposed methodology applies to real data.

15

d2 = 0.4

−3

−1

1

2

3

3 1 0

Sample Quantiles

−2 −3

−1 0

1

2

3

−3

0.0

−3

0.00

0.05

−2

0.1

−1

0.2

1 0

Sample Quantiles

0.20 0.15 0.10

−1

0.25

0.3

2

3 2

0.30

d2 = 0.4

0.4

d1 = −0.5

0.35

d1 = −0.5

−3

−1

1

2

3

−3

Theoretical Quantiles

a21=0.1

a12=0.08

3

0.3 0.2

0.3

0.3

1

2

−2

0

1

2

3

0

1

2

3

0

1

2

3

2

3

3 Sample Quantiles

2

3 1 0

−3

−2

Sample Quantiles −1

−1

a22=−0.2

−2 −3

−3

−1

0

1

2

3

−3

−1

0

1

Theoretical Quantiles

Theoretical Quantiles

b11=−0.3

b21=0.1

b12=0.1

b22=−0.25

1

2

3

2

3

0.4 0.2 0.1 1

2

3

−2

−1

0

1

2

3

2

4

3 Sample Quantiles

2

3 1

−3

−2

Sample Quantiles

−2 −3 −3

0

b22=−0.25

2

3 1 0

Sample Quantiles

−1 −2 −3

1

Theoretical Quantiles

−1 0

b12=0.1

2

3 2 1 0 −1 −2

0

0.0 −3

1

−1 0

b21=0.1

−3

−1

0.3

0.3 0.2 0.1 −3

0

3

−1

2

0

1

−1

−1 0

0.0

0.0

0.1

0.1

0.2

0.2

0.3

0.3

0.4

0.4

0.4

Theoretical Quantiles

b11=−0.3

−3

−3

Theoretical Quantiles

0.0 −3

2

−3

−2 −3 −1

1

2

3 2 1 0

Sample Quantiles

1 0 −1 −2 −3 −3

0

a12=0.08

−1

2

3

a21=0.1

−1

1

0

0.0

0.1 0.0 −3 −2 −1

0

3

−1

2

−1

1

0.1

0.2

0.2 0.1 0.0 −1 0

a11=−0.23

Sample Quantiles

2

a22=−0.2

0.3 0.2 0.1 0.0 −3

Sample Quantiles

1

0.4

0.4

0.4

a11=−0.23

−1 0

Theoretical Quantiles

−3

Theoretical Quantiles

−1

0

1

2

Theoretical Quantiles

3

−3

−1

0

1

2

3

Theoretical Quantiles

Figure 3: QQ-plots and histograms of the standardized MLE obtained by model (7) with the true parameter vector θ = (d1 , d2 , a11 , a21 , a12 , a22 , b11 , b21 , b12 , b22 )T = (−0.5, 0.4, −0.23, 0.1, 0.08, −0.2, −0.3, 0.1, 0.1, −0.25)T and for sample size equal to 1500. Results are based on 1000 runs.

16

6.1

A Binary time series example

Some related work on binary time series in the context of financial applications is reviewed first. Christoffersen and Diebold (2006), and Christoffersen et al. (2007), study the dependence in asset return signs (of the change in price) and volatility. They show that the predictability of stock return volatility implies predictability of the sign of the returns. Even more, it is possible to have sign dependence without conditional mean dependence. It is also stated in Breen et al. (1989), among others, that only the direction of excess stock returns is predictable. Nyberg (2010) states that in spite of the potentiality to predict the sign of stock returns, in very few studies binary time series models have been considered for this purpose. Instead, the binary nature of the sign is incorporated into other kind of models. Leung et al. (2000) compare the prediction power of continuous dependent models and discrete models (including probit models) and find that discrete models outperform standard continuous models regarding the number of correct sign forecasts and investment returns. Nyberg (2010) applies different probit specifications to predict the sign of U.S. stock returns and finds that their in-sample predictive power is statistically significant for the signs of excess stock returns. Finally Butler and Malaikah (1992) perform runs tests to compare the number of runs of consecutive returns of either positive or negative sign or zero to the expected number of runs, assuming serial independence. Motivated by the above discussion, we consider the binary model (3) and apply the proposed modeling approach to the logarithmic returns of the weekly closing prices of the stock ”Johnson & Johnson” for the period from 2/1/1970 to 28/5/2013. These data have been obtained from http://finance.yahoo.com. We consider the binary time series Yt , which equals to 1 when the logarithmic return at time t is positive (equivalently the price change is positive) and 0 otherwise (equivalently the price change is negative). The size of this time series is 2264. Based on those data, the estimators for model (3) are d = 0.204 (0.078), a1 = −0.624 (0.379), b1 = −0.098 (0.076), with the standard errors given inside the parentheses. We compare model (3) with two other classes of models. The first one is the class of the logit models of the form λt = d +

l ∑

bi Yt−i ,

(15)

i=1

where d, bi ∈ R and l represents the lag of the respective model. The second class consists of the Bernoulli-Hidden Markov models (HMMs) of order k (k = 2, 3, 4); see Zucchini and MacDonald (2009, Ch.10), for more.

17

Models model (3)

l=1

l=2

l=3

l=4

B-HMM2

B-HMM3

B-HMM4

dim(θ)

3

2

3

4

5

4

9

16

−lN (θ)

1565.345

1565.726

1564.831

1563.727

1561.651

1565.597

1562.181

1558.164

AIC

3136.690

3135.453

3135.661

3135.454

3133.302

3139.194

3142.362

3148.328

AICc

3136.701

3135.458

3135.672

3135.472

3133.328

3139.314

3142.911

3150.023

BIC

3153.865

3146.902

3152.836

3158.354

3161.926

3154.486

3176.769

3209.497

Table 3: The number of the parameters dim(θ) and the values of the negative log-likelihood (−lN (θ)), AIC, AICc and BIC criterions are displayed respectively for model (3), models (15) for l = 1, 2, 3, 4 and the Bernoulli-HMMs of order k = 2, 3, 4 for the binary time series derived from the logarithmic returns of the weekly closing prices of the stock ”Johnson & Johnson” for the period from 2/1/1970 to 28/5/2013.

Table 3 reports the results of the data analysis. Note that, we compare all models in terms of the negative loglikelihood function, the Akaike information criterion (AIC), the corrected Akaike information criterion (AICc) and the Bayesian information criterion (BIC). The results show that there are no significant differences among the various models. This fact is due to the sample interval of the data and has been noted by Christoffersen and Diebold (2006), who state that sign dependence is not likely to be found in very high-frequency (e.g. daily) or very low-frequency (e.g. annual) returns; instead, it is more likely to be found at intermediate return horizons. Following this line of argument, we consider subsets of the initial dataset, representing monthly, bimonthly and trimonthly closing prices of the stock ”Johnson & Johnson” for the same period. The performance of model (3) improves considerably and it outperforms most of other models in the case of intermediate time horizons. In the case of trimonthly closing prices, the estimators for model (3) are d = 0.152 (0.137), a1 = 0.913 (0.116), b1 = −0.185 (0.169), with the standard errors given inside the parentheses. The results regarding the trimonthly case are shown in Table 4 (N = 187). In addition, Table 4 shows that there is little difference between a model which includes feedback and a model which does not include a feedback. This fact should be expected, since the number of observations is rather small.

18

Models model (3)

l=1

l=2

l=3

l=4

B-HMM2

B-HMM3

B-HMM4

dim(θ)

3

2

3

4

5

4

9

16

−lN (θ)

122.335

123.374

122.915

122.293

121.439

124.332

122.720

120.302

AIC

250.670

250.747

251.830

252.586

252.878

256.664

263.434

272.604

AICc

250.802

250.813

251.961

252.806

253.210

256.884

264.457

275.804

BIC

260.364

257.210

261.523

265.511

269.033

269.589

292.520

324.302

Table 4: The number of the parameters dim(θ) and the values of the negative log-likelihood (−lN (θ)), AIC, AICc and BIC criterions are displayed respectively for model (3), models (15) for l = 1, 2, 3, 4 and the Bernoulli-HMMs of order k = 2, 3, 4 for the binary time series derived from the logarithmic returns of the trimonthly closing prices of the stock ”Johnson & Johnson” for the period from 2/1/1970 to 28/5/2013.

A different type of example is based on a dataset reported by Zucchini and MacDonald (2009). We examine a similar model comparison for a binary time series, which represents the observed feeding status of a caterpillar. The two categories of the time series are mealtaking, an activity characterized by feeding interspersed by brief pauses, and inter-meal intervals, in which the animal mainly rests but might also feed for brief periods. In this example, the hidden states of the HMMs are the so-called motivational states (e.g. hungry). The complete set of data refers to an experiment in which eight caterpillars were observed at one-minute intervals for almost 19 hours, and classified as feeding or not feeding. We choose the second caterpillar and apply the previous methodology. In this case, the estimators for model (3) are d = −2.292 (0.212), a1 = 0.116 (0.076), b1 = 3.028 (0.240), with the standard errors given inside the parentheses. The results of the comparison are shown in Table 5 (N = 1132).

19

Models model (3)

l=1

l=2

l=3

l=4

B-HMM2

B-HMM3

B-HMM4

dim(θ)

3

2

3

4

5

4

9

16

−lN (θ)

367.272

368.376

367.179

367.082

366.739

366.716

365.330

360.134

AIC

740.544

740.753

740.358

742.165

743.478

741.432

748.661

752.268

AICc

740.565

740.764

740.379

742.200

743.531

741.467

748.822

752.756

BIC

755.639

750.817

755.453

762.292

768.637

761.559

793.947

832.776

Table 5: The number of the parameters dim(θ) and the values of the negative log-likelihood (−lN (θ)), AIC, AICc and BIC criterions are displayed respectively for model (3), models (15) for l = 1, 2, 3, 4 and the Bernoulli-HMMs of order k = 2, 3, 4 for the caterpillar data.

We observe from Table 5 that model (3) outperforms any other model in terms of AIC (except model (15) with l = 2). In addition, the simple logistic regression based models are performing better than the Bernoulli-HMMs of orders 3 and 4. These particular models contain by far more unknown parameters to be estimated (9 and 16 respectively). In addition, it will be difficult for both models to give a physical interpretation to their order for this specific example. However, the BIC points out a model of the form (15) with l = 1. Clearly the choice of model will depend upon the specific application.

6.2

A Categorical time series example

Russell and Engle (1998) consider the recorded transactions for IBM from 11/1990 to 1/1991 and in particular the consecutive changes in the transaction price. They observe that most of them are zero and that there is a symmetry. They consider a 5−state logistic model, corresponding to the states more than 1 tick down, 1 tick down, zero price move, 1 tick up and more than 1 tick up. The parameters of this model are subject to symmetry constraints. A similar classification has been followed by Hausman et al. (1992), who propose an ordered probit analysis for the transaction prices of the stocks, where a discrete random variable is related to a continuous dependent variable, partitioning the state space into a finite number of distinct regions. Motivated by the application of Russell and Engle (1998), we consider the weekly closing prices of ”Microsoft Corporation” for the period from 18/2/2003 to 22/4/2013. We took the differences between successive weekly closing prices and we observed that most of them are very close to zero (mean= 0.004) and in addition there is a

20

symmetry around this value. Consider now the data as a categorical time series, Y˜t , with three categories as follows

Y˜t =

   1, the difference is less than -0.5,   2, the difference is between -0.5 and 0.5,     3, the difference is more than 0.5.

This produces a categorical time series of size N = 531. We apply the multinomial logit model (7) with m = 3. The main difference in our example is that we do not impose any symmetry restrictions to the parameters as discussed ˆ is estimated following the in Russell and Engle (1998), besides the restrictions implied by Theorem 1. The vector θ methodology of Section 4. The estimated parameters are summarized in Table 6 with their standard errors inside the parentheses. dˆ1

dˆ2

a ˆ11

a ˆ21

a ˆ12

a ˆ22

ˆb11

ˆb21

ˆb12

ˆb22

-0.092

-0.049

0.006

-0.918

0.000

0.923

0.076

0.000

-0.077

0.000

(0.413)

(0.410)

(1.680)

(2.307)

(0.499)

(0.496)

(0.305)

(0.294)

(0.294)

(0.252)

Table 6: Maximum likelihood estimators of θ with their standard errors for the categorical time series with three categories derived from the weekly closing prices of ”Microsoft Corporation” for the period from 18/2/2003 to 22/4/2013.

A similar comparison as in Section 6.1 is performed among model (7) and two classes of models. The first one is the class of logit models of the form λt = d +

l ∑

Bi Yt−i ,

(16)

i=1

where in this case d is a real vector, Bi are real matrices and l represents the lag of the respective model. The second class are the Multinomial-Hidden Markov models (HMMs) of order k (k = 2, 3, 4). The maximization of the loglikelihood for the logit models (16) is performed through multinom, built in R, while for the Multinomial-HMMs we follow Zucchini and MacDonald (2009) and apply the unconstrained optimizer nlm.

21

Models model (7)

l=1

l=2

l=3

l=4

M-HMM2

M-HMM3

M-HMM4

dim(θ)

10

6

10

14

18

6

12

20

−lN (θ)

537.096

540.618

538.196

534.585

532.962

539.763

534.483

532.544

AIC

1094.191

1093.236

1096.393

1097.170

1101.923

1091.526

1092.965

1105.088

AICc

1094.614

1093.396

1096.816

1097.984

1103.259

1091.687

1093.567

1106.735

BIC

1136.939

1118.884

1139.140

1157.017

1178.869

1117.175

1144.262

1190.583

Table 7: The number of the parameters dim(θ) and the values of the negative log-likelihood (−lN (θ)), AIC, AICc and BIC criterions are displayed respectively for model (7), models (16) for l = 1, 2, 3, 4 and the Multinomial-HMMs of order k = 2, 3, 4 for the categorical time series with three categories derived from the weekly closing prices of ”Microsoft Corporation” for the period from 18/2/2003 to 22/4/2013.

It is seen from Table 7 that model (7) performs better in terms of the minimization of the negative log-likelihood than the respective model of the form (16) with lags = 1 or 2. When the lag in (16) is greater or equal to 3 however, the models without the feedback mechanism yield a smaller value, as expected, but the number of unknown parameters increases. This has as consequence to obtain moderately higher values of the AIC and BIC criterion. In comparison to the Hidden Markov Models, (7) performs better in the minimization of the negative log-likelihood for the two state model. The HMMs with 3 and 4 hidden states give smaller values, but in this case the high number of parameters affects the values of the AIC and especially of the BIC criterion.

Acknowledgements This work has been carried out while the first author was visiting the Department of Mathematics and Statistics, University of Cyprus. He would like to thank all the members of the Department for their warm hospitality. We would also like to thank two reviewers and the Associate Editor for their comments that led to a substantial improvement of this manuscript.

22

Appendix Proof of Theorem 1

The first step is to show that there exists a weakly dependent strictly stationary process

{(Yt , λt ), t ∈ Z}, which belongs to L1 . We need to verify condition 3.1 of Doukhan and Wintenberger (2008). Condition 3.2 in the same paper is assumed, while condition 3.3 trivially holds for this case. We can write Yt as follows 

Yt

=

Y1t

  Y  2t  .  .  .    Yit   .  ..  Yqt





1(Ut ≤ p1t )



    1(p1t ≤ Ut ≤ p1t + p2t )     ..     . =     1(p1t + . . . + p(i−1)t ≤ Ut ≤ p1t + . . . + p(i−1)t + pit )     ..   .   1(p1t + . . . + p(q−1)t ≤ Ut ≤ p1t + . . . + p(q−1)t + pqt )

      ,      

where Ut is an i.i.d. sequence of U (0, 1). Observe that from the definition of model (8), we obtain that pit =

1+

exp(f (Y , λ )) ∑q i t−1 t−1 , l=1 exp(fl (Yt−1 , λt−1 ))

λit = fi (Yt−1 , λt−1 ),

i = 1, 2, . . . , q.

Define 

 Y1t

Xt

   Y2t = (Yt , λt ) =   ..  .  Yqt 

λ1t

  λ2t      λqt



1(Ut ≤ p1t )

f1 (Yt−1 , λt−1 )

1(p1t ≤ Ut ≤ p1t + p2t ) .. .

f2 (Yt−1 , λt−1 ) .. .

1(p1t + . . . + p(q−1)t ≤ Ut ≤ p1t + . . . + p(q−1)t + pqt )

fq (Yt−1 , λt−1 )

   =    

      

= F (Yt−1 , λt−1 ; Ut ) = F (Xt−1 ; Ut ), The existence of the sequence of uniform variables Ut is guaranteed from Kallenberg (2002, p.89). It holds that for most constructions we only need a single randomization variable, i.e. a U (0, 1) random variable that is independent of 23

all other previously introduced random elements and σ-fields. This means that the basic probability space is assumed to be rich enough to support any randomization variables we may need and involves no serious loss of generality, since we can always get the condition fulfilled by a simple extension of the space. In what follows, we set X = Xt−1 , since it will be much more convenient in the presentation of the calculations. Yit

= 1(p1t + . . . + p(i−1)t ≤ Ut ≤ p1t + . . . + pit ) ) ( exp(f1 (X)) + . . . + exp(fi (X)) exp(f1 (X)) + . . . + exp(fi−1 (X)) ∑q ∑q ≤ Ut ≤ . = 1 1 + l=1 exp(fl (X)) 1 + l=1 exp(fl (X))

For a q × 2 matrix X = (Y, λ) with Y = (Y1 , . . . , Yq )T , λ = (λ1 , . . . , λq )T , Yi ∈ {0, 1}, λi ∈ R, i = 1, 2, . . . , q, we consider the norm ∥X∥ϵ = ∥Y ∥ + ϵ∥λ∥ =

q ∑

|Yi | + ϵ

i=1

q ∑

|λi |,

i=1

for some ϵ > 0. Note that, we use this notation to avoid cumbersome expressions that complicate our presentation. It is important to note that the vectors Y and λ defined before are identical to the vectors Yt and λt we consider. Then, ′ , condition 3.1 of Doukhan and Wintenberger (2008) becomes1 with X = Xt−1 and X′ = Xt−1 ( ) ∥F (X; Ut ) − F (X′ ; Ut )∥Φ =E ∥F (X; Ut ) − F (X′ ; Ut )∥ϵ [ =E |1(Ut ≤ p1t ) − 1(Ut ≤ p′1t )| + |1(p1t ≤ Ut ≤ p1t + p2t ) − 1(p′1t ≤ Ut ≤ p′1t + p′2t )| + . . . +|1(p1t + . . . + p(i−1)t ≤ Ut ≤ p1t + . . . + pit ) − 1(p′1t + . . . + p′(i−1)t ≤ Ut ≤ p′1t + . . . + p′it )| + . . . ] + |1(p1t + . . . + p(q−1)t ≤ Ut ≤ p1t + . . . + pqt ) − 1(p′1t + . . . + p′(q−1)t ≤ Ut ≤ p′1t + . . . + p′qt )| +ϵ

q ∑

|fi (X) − fi (X′ )|

i=1

=E[|A1t | + |A2t | + . . . + |Aqt |] + ϵ∥f (X) − f (X′ )∥,

where Ait = 1(p1t + . . . + p(i−1)t ≤ Ut ≤ p1t + . . . + pit ) − 1(p′1t + . . . + p′(i−1)t ≤ Ut ≤ p′1t + . . . + p′it ), and ∥f (X) − f (X′ )∥ =

q ∑

|fi (X) − fi (X′ )|.

i=1

Observe that E|A1t | = |p1t −

p′1t |.

For i > 1, define Lit = p1t + p2t + . . . + p(i−1)t . Then, for i = 1, 2, . . . q E|Ait | ≤ 2|Lit − L′it | + |pit − p′it |.

1 For

the Orlicz function Φ(x) = x.

24

(A-1)

Define the function gi (f (X)) = gi (f1 (X), . . . , fq (X)) =

exp(f (X)) ∑q i , 1 + l=1 exp(fl (X))

i = 1, 2, . . . , q.

Then we have that |gi (f (X)) − gi (f (X′ ))| ≤ Observe that

|∇gi (f (Y))| · |f (X) − f (X′ )| =

sup Y∈{0,1}q ×Rq

∑i−1 l=1 exp(fl (X)) ∑ , Lit = q 1 + l=1 exp(fl (X))

1 ||f (X) − f (X′ )||. 4

(A-2)

i = 2, 3, . . . , q.

Similarly we have that |Lit (f (X)) − Lit (f (X′ ))| ≤

sup Y∈{0,1}q ×Rq

|∇Lit (f (Y))| · |f (X) − f (X′ )| =

1 ||f (X) − f (X′ )||. 4

(A-3)

Then from (A-1), (A-2), (A-3) and (9) we have ∥F (X; Ut ) − F (X′ ; Ut )∥Φ

≤ |p1t − p′1t | +

q ∑ (2|Lit − L′it | + |pit − p′it |) + ϵ||f (X) − f (X′ )|| i=2

(

) + ϵ ||f (X) − f (X′ )|| ) + ϵ (∥M ∥ · ∥λ − λ′ ∥ + ∥N ∥ · ∥Y − Y ′ ∥) ) ( ) ∥M ∥ +ϵ ϵ · ∥λ − λ′ ∥ + ∥N ∥ · ∥Y − Y ′ ∥ ϵ ) ( ) ∥M ∥ + ϵ max , ∥N ∥ (∥Y − Y ′ ∥ + ϵ · ∥λ − λ′ ∥) . ϵ ( ) We choose ϵ = ∥M ∥/∥N ∥ and the right hand of the equation above becomes [1 + 3(q − 1)]∥N ∥ + 4∥M ∥ ∥X − 1 + 3(q − 1) 4 ( 1 + 3(q − 1) ≤ 4 ( 1 + 3(q − 1) = 4 ( 1 + 3(q − 1) ≤ 4 =

X′ ∥ϵ /4. From the assumption of Theorem 1 we have that the coefficient of ∥X − X′ ∥ϵ is less than 1. Hence, the first step of the proof is complete. The second step is to show that the weakly dependent strictly stationary process {(Yt , λt ), t ∈ Z}, belongs to Ls . To obtain this, we use induction. Consider the norm [ ∥X∥s =

q ∑

|Yk | + s

k=1

q ∑ k=1

where Yk ∈ {0, 1}, λk ∈ R as before.

25

]1/s |λk |

s

,

Note that from condition (9) we have that ||λt ||

= ||f (X)|| = ||f (Xt−1 )|| = ||f (Xt−1 ) − f (0) + f (0)|| ≤ ||f (Yt−1 , λt−1 ) − f (0)|| + ||f (0)|| ≤ ||M || · ||λt−1 || + ||N || · ||Yt−1 || + ||f (0)|| ≤ ||M || · ||λt−1 || + ||N || + ||f (0)|| ≤ ||M || · (||M || · ||λt−2 || + ||N || + ||f (0)||) + ||N || + ||f (0)|| ≤ ||M ||2 · (||M || · ||λt−3 || + ||N || + ||f (0)||) + (||N || + ||f (0)||)(1 + ||M ||) ≤ ... ≤ ||M ||t ||λ0 || + (||N || + ||f (0)||)

1 − ||M ||t = Ct . 1 − ||M ||

(A-4)

Since ||M ||, ||N || < 1 and ||f (0)|| is in general a finite constant, there exists a sequence of constants {Ct }t∈N , Ct ∈ R+ , such that ||λt || ≤ Ct . Therefore, ||λt ||s ≤ Cts . Consider the function

[( Λs (X) =

q ∑

)s |Yk |

( +

q ∑

)s ]1/s |λk |

.

k=1

k=1

Then form (A-4) we have that EΛss (Xt ) =

E (||Yt ||s + ||λt ||s )

≤ 1 + E(||λt ||s ) ≤ 1 + Cts < ∞, (A-5) which is always finite since Ct → (||N || + ||f (0)||)/(1 − ||M ||), as t → ∞. Observe that E∥Xt ∥ss ≤ EΛss (Xt ) and the second part of the proof is complete. In addition, from Doukhan and Wintenberger (2008), Xt = F (Xt−1 ; Ut ) can be represented as causal Bernoulli shift, Xt = H(Ut , Ut−1 , . . .), where H is a measurable function. Since Ut is an i.i.d. sequence of uniform random variables, Xt is an ergodic and stationary sequence and the proof is complete. Proof of Lemma 1

Consider model (8) as λt (θ) = f (Yt−1 , λt−1 ; θ), for t = 1, 2, . . . , N . The dimension of vector

θ of the parameters to be estimated is s. (i) In order to apply the CLT for martingales, we show that {∂lt (θ)/∂θ}t∈N , where ∂lt (θ) ∂λt (θ) = (Yt − pt (θ)), ∂θ ∂θ 26

is a sequence of square integrable martingale differences. At the true value θ = θ 0 , we have E (Yt − pt (θ)|Ft−1 ) = 0 and Var(Yt −pt (θ)|Ft−1 ) = Σt (θ). We need to show that E|∂lt (θ)/∂θ| < ∞ or equivalently E|∂λt (θ)/∂θ| < ∞. We can write ∂λrt (θ) ∂fr (Yt−1 , λt−1 (θ); θ) ∂λt−1 (θ) ∂fr (Yt−1 , λt−1 (θ); θ) = + , ∂θi ∂θi ∂θi ∂λTt−1 (θ)

i = 1, 2, . . . , p,

r = 1, 2, . . . , q.

We use the following notation T At−1 + Cr(t−1) , Art = Br(t−1)

t ≥ 1,

r = 1, 2, . . . , q.

Substituting repeatedly we obtain T Art = Br(t−1)

(t−2 ∏

) Bi

T A0 + Br(t−1)

i=0

t−2 ∑ k=0

(

t−2 ∏

) Bi

Ck + Cr(t−1) ,

i=k+1

where  At

  =   

Bt

  =   

Ct

  =  

 A1t .. . Aqt T B1;t .. . T Bq;t

C1;t .. . Cq;t



    =   



    =   



    =  

 ∂λ1t (θ)/∂θi .. .

  , 

∂λqt (θ)/∂θi ∂f1 (Yt , λt (θ); θ)/∂λTt (θ) .. . ∂fq (Yt , λt (θ); θ)/∂λTt (θ)  ∂f1 (Yt , λt (θ); θ)/∂θi  ..  . .  ∂fq (Yt , λt (θ); θ)/∂θi

   , 

k−1 ∏

Bi = I,

i=k

Condition (9) of Theorem 1 can be rewritten as |f1 (X) − f1 (X′ )| + |f2 (X) − f2 (X′ )| + . . . + |fq (X) − fq (X′ )| ≤ ∥M ∥ · ∥λ − λ′ ∥ + ∥N ∥ · ∥Y − Y ′ ∥. Note that from the assumption [1 + 3(q − 1)]∥N ∥/4 + ∥M ∥ < 1 in Theorem (1), it follows that ∥N ∥, ∥M ∥ < 1, ∑q ∑q since q ≥ 2. Hence, we can assume there exist αi , βi ∈ [0, 1), such that i=1 αi = ||M ||, i=1 βi = ||N || and |fr (X) − fr (X′ )| ≤ αr · ∥λ − λ′ ∥ + βr · ∥Y − Y ′ ∥,

27

r = 1, 2, . . . , q.

Then we have that |fr (Y, (λ1 , λ2 , . . . , λq )) − fr (Y, (λ′1 , λ2 , . . . , λq ))| ≤ ar |λ1 − λ′1 | ⇒ |∂fr (Y, λ)/∂λ1 | ≤ ar ,

r = 1, 2, . . . , q.

In the same manner |∂fr (Y, λ)/∂λi | ≤ ar ,

r = 1, 2, . . . , q.

Using this fact and Assumption 3 we find (t−2 ( t−2 ) ) t−2 ∏ ∑ ∏ ∂λrt (θ) T ≤ |B T |Bi | |Ck | + |Cr(t−1) | |Bi | |A0 | + |Br(t−1) | r(t−1) | ∂θi i=0 k=0 i=k+1   |∂λ10 (θ)/∂θi | q t−2 ∑   ∑ ..   + a ≤ (ar , . . . , ar )  (br1i ||Yk || + br2i ||λk ||) + C.  r .   k=0 r=1 |∂λq0 (θ)/∂θi | Since Theorem 1 gives E||Yk || < ∞, E||λk || < ∞, we derive that E|∂λt (θ)/∂θ| < ∞. Applying the CLT for martingales we prove that as N → ∞ 1 D √ SN (θ 0 ) −→ N (0, G(θ 0 )), N where the matrix G is defined as the limit of ] [ N 1∑ ∂λt (θ) T ∂λt (θ) (Yt − pt (θ))(Yt − pt (θ)) Ft−1 , E n t=1 ∂θ ∂θ T which is given by (13). The matrix G is positive definite from Assumption 2. The conditional Lindeberg’s condition holds because of Theorem 1 which guarantees existence of moments. (ii) We just need to show that the matrix RN (θ) =

q [ 2 N ∑ ∑ ∂ λrt (θ) t=1 r=1

∂θ∂θ T

] (Yrt − prt (θ))

converges in probability to zero. But this is true, provided that E ∂ 2 λrt (θ)/∂θ∂θ T < ∞, r = 1, 2, . . . , q.

28

Indeed, ∂λ2(t−1) ∂λ1(t−1) ∂λq(t−1) ∂λ1(t−1) ∂ 2 fr ∂λ1(t−1) ∂λ1(t−1) ∂ 2 fr ∂ 2 fr ∂ 2 λrt = 2 + + ... + ∂θi ∂θj ∂λ1(t−1) ∂θj ∂θi ∂λ1(t−1) ∂λ2(t−1) ∂θj ∂θi ∂λ1(t−1) ∂λq(t−1) ∂θj ∂θi +

∂λ1(t−1) ∂fr ∂ 2 λ1(t−1) + + ... ∂λ1(t−1) ∂θj ∂θi ∂λ1(t−1) ∂θi ∂θj

+

∂λ1(t−1) ∂λq(t−1) ∂λ2(t−1) ∂λq(t−1) ∂ 2 fr ∂ 2 fr ∂ 2 fr ∂λq(t−1) ∂λq(t−1) + + ... + ∂λq(t−1) ∂λ1(t−1) ∂θj ∂θi ∂λq(t−1) ∂λ2(t−1) ∂θj ∂θi ∂λ2q(t−1) ∂θj ∂θi

+

∂λq(t−1) ∂fr ∂ 2 λq(t−1) + ∂λq(t−1) ∂θj ∂θi ∂λq(t−1) ∂θi ∂θj

+

∂λ1(t−1) ∂λ2(t−1) ∂λq(t−1) ∂ 2 fr ∂ 2 fr ∂ 2 fr ∂ 2 fr + + ... + + , ∂θi ∂λ1(t−1) ∂θj ∂θi ∂λ2(t−1) ∂θj ∂θi ∂λq(t−1) ∂θj ∂θi ∂θj

∂ 2 fr

∂ 2 fr

r = 1, 2, . . . , q,

which is a sum with q(q + 2) + (q + 1) summands. By Assumption 3, we bound the absolute value of each one of the above summands by a linear function of λt and Yt , plus a term of the form |∂fr (0, 0, θ)/∂x∗i | or |∂ 2 fr (0, 0, θ)/∂x∗i ∂x∗j |, whose expected value is finite by the same assumption. Moreover, from Theorem 1, E||λt || < ∞ and E||Yt || < ∞. The desired result follows. Note further that in the case of model (2), RN (θ) = 0. (iii) We need to show that E ∂ 3 λrt (θ)/∂θi ∂θj ∂θk < ∞. Following the same reasoning as in (ii), we can write ∂ 3 λrt (θ)/∂θi ∂θj ∂θk as a sum, with summands bounded in absolute value by terms with finite expected values. Lemma A-2. If Assumptions 1 and 2 hold and the conditions of Theorem 1 are satisfied, then 1 1˜ sup lN (θ) − lN (θ) → 0, a.s., as N → ∞, N θ∈Θ N ˜0. where ˜lN (θ) denotes (11) evaluated at some starting value λ Proof of Lemma A-2

We need to show that ( 1 lim sup N →∞ θ∈Θ N

N ) N ∑ ∑ ˜ lt (θ) − lt (θ) = 0, t=1

a.s.,

t=1

˜0. where ˜lt (θ) is the tth log-likelihood component obtained by setting the starting value to λ

29

(A-6)

We have that

N N N N ∑ ∑ ∑ ∑ ˜ ˜ l (θ) − l (θ) = (l (θ) − l (θ)) |lt (θ) − ˜ lt (θ)| ≤ t t t t t=1 t=1 t=1 t=1 ( ) m−1 ( ) m−1 m−1 N m−1 ∑ ∑ ∑ ∑ ∑ ˜ jt (θ) + log 1 + ˜ it (θ)) = Yjt λjt (θ) − log 1 + exp(λit (θ)) − Yjt λ exp(λ t=1 j=1 i=1 j=1 i=1   ( ) ( ) m−1 m−1 m−1 N ∑ ∑ ∑ ∑ ˜ it (θ))  ˜ jt (θ)| + log 1 +  ≤ exp(λit (θ)) − log 1 + exp(λ Yjt |λjt (θ) − λ t=1 i=1 i=1 j=1   m−1 N ∑ ∑ ˜ jt (θ)| + |Γt | .  = Yjt |λjt (θ) − λ (A-7) t=1

j=1

But we have already seen in the proof of Lemma 1 that from equation (9) and assumption [1 + 3(q − 1)]∥N ∥/4 + ∥M ∥ < 1 in Theorem 1, it follows that ∥N ∥, ∥M ∥ < 1, and therefore there always exist αi , βi ∈ [0, 1), such that ∑q ∑q i=1 αi = ||M ||, i=1 βi = ||N || and |fr (X) − fr (X′ )| ≤ αr · ∥λ − λ′ ∥ + βr · ∥Y − Y ′ ∥,

r = 1, 2, . . . , q,

where X = (Y, λ) = Xt−1 = (Yt−1 , λt−1 ). Then, applying recursively equation (9) we have that ˜ jt (θ)| = |fj (Yt−1 , λt−1 ) − fj (Yt−1 , λ ˜ t−1 )| |λjt (θ) − λ ˜ t−1 ∥ + βj · ∥Yt−1 − Yt−1 ∥ ≤ αj · ∥λt−1 − λ ˜0∥ ≤ αj · ∥M ∥t−1 · ∥λ0 − λ ≤ Kjt · K, ˜ 0 ∥ is a finite positive constant. where Kjt = αj · ∥M ∥t−1 goes to zero exponentially fast, as t → ∞ and K = ∥λ0 − λ Hence, from the compactness of θ, there exist positive constants Kjt , j = 1, 2, ..., q, such that ˜ jt (θ)| ≤ Kjt K. sup |λjt (θ) − λ θ∈Θ

For the second term of (A-7) consider at first the case where

30

∑m−1 i=1

exp(λit (θ)) >

∑m−1 i=1

˜ it (θ)). Note exp(λ

that

( ) ( ) ∑m−1 m−1 m−1 ∑ ∑ 1 + exp(λ (θ)) it i=1 ˜ it (θ)) = log exp(λit (θ)) − log 1 + exp(λ log 1 + ∑m−1 ˜ 1 + exp( λ (θ)) it i=1 i=1 i=1 ∑m−1 1 + i=1 exp(λit (θ)) =log ∑ ˜ 1 + m−1 i=1 exp(λit (θ)) ( ) ∑m−1 ∑m−1 ˜ i=1 exp(λit (θ)) − i=1 exp(λit (θ)) =log 1 + ∑ ˜ 1 + m−1 i=1 exp(λit (θ)) ∑m−1 ∑m−1 ˜ it (θ)) exp(λit (θ)) − i=1 exp(λ ≤ i=1 ∑m−1 ˜ it (θ)) 1 + i=1 exp(λ ∑ ( ) m−1 ˜ it (θ)) i=1 exp(λit (θ)) − exp(λ = ∑ ˜ 1 + m−1 i=1 exp(λit (θ)) ∑m−1 ˜ ˜ i=1 exp(λit (θ)) exp(λit (θ) − λit (θ)) − 1 ≤ ∑m−1 ˜ it (θ)) 1 + i=1 exp(λ ∑m−1 ˜ it (θ))|λit (θ) − λ ˜ it (θ)| exp |λit (θ) − λ ˜ it (θ)| exp(λ ≤ i=1 ∑m−1 ˜ 1 + i=1 exp(λit (θ)) ≤

m−1 ∑

˜ it (θ)| exp |λit (θ) − λ ˜ it (θ)|, |λit (θ) − λ

i=1

where the fourth line is deduced by the inequality log x ≤ x − 1, ∀x > 0, and the seventh from the inequality ∑m−1 ∑m−1 ˜ it (θ)), we obtain the same | exp(z) − 1| ≤ |z| exp |z|, ∀z ∈ R. In the case where i=1 exp(λit (θ)) < i=1 exp(λ result by similar arguments. Collecting all the above results, we obtain that   N m−1 ∑ ∑ 1 1 ˜ jt (θ)| + |λjt (θ) − λ ˜ jt (θ)| exp |λjt (θ) − λ ˜ jt (θ)| .  Yjt |λjt (θ) − λ |lN (θ) − ˜lN (θ)| ≤ N N t=1 j=1 Hence

  N m−1 ∑ ∑ K 1  sup |lN (θ) − ˜lN (θ)| ≤ Yjt Kjt + Kjt exp (Kjt K) . N θ∈Θ N t=1 j=1

Applying Markov’s inequality ∀ϵ > 0, we obtain that (∑ )s   m−1 m−1 ∞ E ∑ ∑ j=1 YjN KjN + KjN exp (KjN K) P  < ∞, YjN KjN + KjN exp (KjN K) > ϵ ≤ ϵs j=1 N =1 N =1 ∞ ∑

since E(YjN ) < 1. Hence, supθ∈Θ lN (θ)/N − ˜lN (θ)/N → 0, almost surely, as N → ∞.

31

Proof of Theorem 2 Let ON (θ 0 ) =

{ √ } θ : ||θ − θ 0 || ≤ r/ N be a compact neighborhood of the true value, for

any r > 0. A Taylor expansion shows that 1 lN (θ) = lN (θ 0 ) + (θ − θ 0 )T SN (θ 0 ) − (θ − θ 0 )T HN (θ ⋆ )(θ − θ 0 ), 2

(A-8)

where θ ⋆ lies on the line between θ and θ 0 . Then, we note that (A-8) is rewritten as lN (θ) − lN (θ 0 )

( ) 1 1 (θ − θ 0 )T SN (θ 0 ) − (θ − θ 0 )T HN (θ ⋆ ) − HN (θ 0 ) (θ − θ 0 ) − (θ − θ 0 )T HN (θ 0 )(θ − θ 0 ) 2 2 = (I) + (II) + (III). (A-9)

=

√ But (I) ≤ ||SN (θ 0 )||r/ N and (III) ≤ −λmin (HN (θ 0 )) r2 /2n and (II) → 0, because of the continuity of the Hessian matrix. Hence, for all η > 0

[ ] √ 4E ||SN (θ 0 )/ N ||2

η P (lN (θ) − lN (θ 0 ) < 0, ∀θ ∈ ∂ON (θ 0 )) ≥ 1 − − . 2 r 2 [ √ 2] But E ||SN (θ 0 )/ N || < ∞ from Lemma 1(i) and hence the second term can become small. Equivalently, there ˆ such that exists a sequence of conditional MLEs, θ, ( ) ˆ ∈ ON (θ 0 ) ≥ 1 − η, P θ

∀N ≥ N1 .

Following Lemma 1(ii), we also have that HN (θ) is positive definite with probability tending to one, for θ ∈ ON (θ 0 ). ˆ to the score equations in the interior of ON (θ 0 ). This root of the score Hence, there exists exactly one solution θ { equations is consistent, because if r1 < r, then there will exist a root of the score equations in the set θ : ||θ − θ 0 || ≤ √ } r1 / N ⊆ ON (θ 0 ). Because of the uniqueness we obtain the consistency of the solution. The asymptotic normality follows from the proof of Lemma 1. Proof of Remark 1

Following the steps in the proof of Theorem 1, we define for a vector x = (y, λ) ∈ {0, 1} × R

the norm ||x||ϵ = |y| + ϵ|λ|, ∀ϵ > 0. Define Xt = F (x, Ut ) = F (Yt−1 , λt−1 , Ut ). Then ( ) 1 ∥F (x, Ut ) − F (x′ , Ut )∥Φ ≤ + ϵ |f (x) − f (x′ )| 4 ( ) 1 ′ ≤ + ϵ (|a1 | · |λt−1 − λ′t−1 | + |b1 | · |Yt−1 − Yt−1 |) 4 ( )( ) 1 |a1 | ′ ≤ +ϵ ϵ · |λt−1 − λ′t−1 | + |b1 | · |Yt−1 − Yt−1 | 4 ϵ ( ) ( ) 1 |a1 | ≤ + ϵ max , |b1 | ||x − x′||ϵ 4 ϵ ( ) 4|a1 | + |b1 | = · ||x − x′||ϵ , 4 32

where we chose ϵ = |a1 |/|b1 | and condition 3.1 is satisfied. Regarding the moments, after repeated substitution in (3), we have that λt = d

t−1 ∑

ak1 + at1 λ0 + b1

k=0

t ∑

ak−1 Yt−k . 1

k=1

Since Yt−k ∈ {0, 1} and |a1 | < 1 (since 4|a1 | + |b1 | < 4), then E|λt | < |d|

1 1 + |a1 |t |λ0 | + |b1 | = c < ∞ (|a1 | ̸= 1). 1 − |a1 | 1 − |a1 |

Then E||Xt ||ϵ = E|Yt | + ϵE|λt | < ∞. Similarly E||Xt ||ss < ∞. Proof of Remark 2

The result follows similarly as in the proof of Remark 1.

33

References Albert, A. and J. A. Anderson (1984). On the existence of maximum likelihood estimates in logistic regression models. Biometrika 71, 1–10. Biswas, A. and P. X.-K. Song (2009). Discrete-valued ARMA processes. Statistics & Probability Letters 79, 1884– 1889. Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity. J. Econometrics 31, 307–327. Breen, W., L. R. Glosten, and R. Jagannathan (1989). Economic significance of predictable variations in stock index returns. The Journal of Finance 44, 1177–1189. Butler, K. C. and S. Malaikah (1992). Efficiency and inefficiency in thinly traded stock markets: Kuwait and Saudi Arabia. Journal of Banking & Finance 16, 197–210. Christoffersen, P., F. X. Diebold, R. S. Mariano, A. Tay, and Y. K. Tse (2007). Direction-of-change forecasts for Asian equity markets based on conditional variance, skewness and kurtosis dynamics: Evidence from Hong Kong and Singapore. Journal of Financial Forecasting, 1–22. Christoffersen, P. F. and F. X. Diebold (2006). Financial asset returns, direction-of-change forecasting, and volatility dynamics. Management Science 52, 1273–1287. Cox, D. R. (1981). Statistical analysis of time series: Some recent developments. Scand. J. Statist. 8, 93–115. de Jong, R. M. and T. Woutersen (2011). Dynamic time series binary choice. Econometric Theory 27, 673–702. Dedecker, J., P. Doukhan, G. Lang, J. R. Le´on R., S. Louhichi, and C. Prieur (2007). Weak Dependence: With Examples and Applications, Volume 190 of Lecture Notes in Statistics. New York: Springer. Doukhan, P., K. Fokianos, and D. Tjøstheim (2012). On weak dependence conditions for Poisson autoregressions. Statist. Probab. Lett. 82, 942–948. Doukhan, P. and S. Louhichi (1999). A new weak dependence condition and applications to moment inequalities. Stochastic Process. Appl. 84, 313–342. Doukhan, P. and O. Wintenberger (2008). Weakly dependent chains with infinite memory. Stochastic Process. Appl. 118, 1997–2013. 34

Fahrmeir, L. and G. Tutz (2001). Multivariate Statistical Modelling based on Generalized Linear Models (Second ed.). Springer Series in Statistics. New York: Springer-Verlag. With contributions by Wolfgang Hennevogl. Fokianos, K. (2012). Count time series. In T. S. Rao, S. S. Rao, and C. R. Rao (Eds.), Handbook of Statistics: Time Series Analysis–Methods and Applications, Volume 30, pp. 315–347. Amsterdam: Elsevier B. V. Fokianos, K. and B. Kedem (2003). Regression theory for categorical time series. Statist. Sci. 18, 357–376. Fokianos, K., A. Rahbek, and D. Tjøstheim (2009). Poisson autoregression. J. Amer. Statist. Assoc. 104, 1430–1439. With electronic supplementary materials available online. Fokianos, K. and D. Tjøstheim (2011). Log-linear Poisson autoregression. J. Multivariate Anal. 102, 563–578. Fokianos, K. and D. Tjøstheim (2012). Nonlinear Poisson autoregression. Ann. Inst. Statist. Math. 64, 1205–1225. Franke, J. (2010). Weak dependence of functional INGARCH processes. Report in Wirtschaftsmathematik 126, University of Kaiserslautern. Hausman, J. A., A. W. Lo, and A. MacKinlay (1992). An ordered probit analysis of transaction stock prices. Journal of Financial Economics 31, 319 – 379. Joe, H. (1997). Multivariate Models and Dependence Concepts. London: Chapman & Hall. Kallenberg, O. (2002). Foundations of Modern Probability. Springer. Kaufmann, H. (1987). Regression models for nonstationary categorical time series: asymptotic estimation theory. Annals of Statistics 15, 79–98. Kaufmann, H. (1989). On existence and uniqueness of maximum likelihood estimates in quantal and ordinal response models. Metrika 13, 291–313. Kauppi, H. (2008). Yield-curve based probit models for forecasting US recessions: stability and dynamics. Technical report, Aboa Centre for Economics. Kauppi, H. and P. Saikkonen (2008). Predicting US recessions with dynamic binary response models. The Review of Economics and Statistics 90, 777–791. Kedem, B. and K. Fokianos (2002). Regression Models for Time Series analysis. Hoboken, NJ.

35

Leung, M. T., H. Daouk, and A.-S. Chen (2000). Forecasting stock indices: a comparison of classification and level estimation models. International Journal of Forecasting 16, 173–190. MacDonald, I. L. and W. Zucchini (1997). Hidden Markov and Other Models for Discrete–valued Time Series. London: Chapman & Hall. McCullagh, P. and J. A. Nelder (1989). Generalized Linear Models (2nd ed.). London: Chapman & Hall. Neumann, M. (2011). Absolute regularity and ergodicity of Poisson count processes. Bernoulli 17, 1268–1284. Nyberg, H. (2010). Studies on binary time series models with applications to empirical macroeconomics and finance. Ph. D. thesis, Universiy of Helsinki. Qaqish, B. F. (2003). A family of multivariate binary distributions for simulating correlated binary variables with specified marginal means and correlations. Biometrika 90, 455–463. R Core Team (2013). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. http://www.R-project.org. Russell, J. R. and R. F. Engle (1998). Econometric analysis of discrete-valued irregularly-spaced financial transactions data using a new autoregressive conditional multinomial model. SSRN eLibrary. Russell, J. R. and R. F. Engle (2005). A discrete-state continuous-time model of financial transactions prices and times. Journal of Business and Economic Statistics 23, 166–180. Rydberg, T. H. and N. Shephard (2003). Dynamics of trade-by-trade price movements: decomposition and models. Journal of Financial Econometrics 1, 2–25. Tjøstheim, D. (2012). Rejoinder on: Some recent theory for autoregressive count time series. TEST 21, 469–476. Weiß, C. (2011). Generalized choice models for categorical time series. J. Statist. Plann. Inference 141, 2849–2862. Zeger, S. L. and B. Qaqish (1988). Markov regression models for time series: a quasi-likelihood approach. Biometrics 44, 1019–1031. Zucchini, W. and I. L. MacDonald (2009). Hidden Markov models for time series: an introduction using R. CRC Press.

36

Suggest Documents