Extremes (2007) 10:1–19 DOI 10.1007/s10687-007-0032-4

Vector generalized linear and additive extreme value models

Thomas W. Yee · Alec G. Stephenson

Received: 24 January 2006 / Revised: 30 January 2007 / Accepted: 20 February 2007 / Published online: 8 June 2007 © Springer Science + Business Media, LLC 2007

Abstract Over recent years parametric and nonparametric regression has slowly been adopted into extreme value data analysis. Its introduction has been characterized by piecemeal additions and embellishments, which has had a negative effect on software development and usage. The purpose of this article is to convey the classes of vector generalized linear and additive models (VGLMs and VGAMs) as offering significant advantages for extreme value data analysis, providing flexible smoothing within a unifying framework. In particular, VGLMs and VGAMs allow all parameters of extreme value distributions to be modelled as linear or smooth functions of covariates. We implement new auxiliary methodology by incorporating a quasi-Newton update for the working weight matrices within an iteratively reweighted least squares (IRLS) algorithm. A software implementation by the first author, called the vgam package for R, is used to illustrate the potential of VGLMs and VGAMs.

Keywords Fisher scoring · Iteratively reweighted least squares · Maximum likelihood estimation · Penalized likelihood · Smoothing · Extreme value modelling · Vector splines

AMS 2000 Subject Classification 62P99

T. W. Yee (B)
Department of Statistics, University of Auckland, Private Bag 92019, Auckland, New Zealand
e-mail: [email protected]

A. G. Stephenson
Department of Statistics & Applied Probability, National University of Singapore, Singapore 117546, Singapore
e-mail: [email protected]


1 Introduction

Over the last decade, nonparametric regression or smoothing has had a large and powerful influence in many areas of applied statistics. Unfortunately, its application to extreme value data analysis has been rather slow and piecemeal. Understandably, parameters of distributions associated with extreme value theory were originally modelled as scalars due to the theoretical and computational limitations of the time. Parameters were subsequently modelled as linear functions of covariates; e.g., Smith (1986) extends Weissman (1978) and allows the location parameter of the generalized extreme value (GEV) distribution to be a function of time with linear and sinusoidal components, and Tawn (1988) uses a linear trend for a marginal location parameter of a bivariate extreme value model. More recently, parameters have been modelled as fully smooth functions of covariates; e.g., Rosen and Cohen (1996) extend Smith (1986) by using cubic splines to model both parameters of the Gumbel distribution, Pauli and Coles (2001) use a spline smoother for the GEV location parameter, and Chavez-Demoulin and Davison (2005) apply splines to the parameters of a peaks-over-threshold model.

In general statistical data analysis, smoothing allows for a data-driven approach rather than a model-driven approach, so that the data can speak for themselves, and this is an invaluable asset. With extreme value data, care must be taken when allowing a data-driven approach because many extreme value models are based on asymptotics that require extrapolation and thus are fundamentally model-driven. Nevertheless, when justifiable, a data-driven approach nested within this model-driven framework can be very useful. A modern text illustrating the wide applicability and power of smoothing is Schimek (2000).

In recent years, several pieces of software have appeared for extreme value data analysis: see Stephenson and Gilleland (2006) for a comprehensive review. These have given extreme value analysts tools for exploring their data and fitting various statistical models. Each software package provides useful utilities, but these are generally constrained within a narrow modelling context. It is hoped that the reader will appreciate that a unified theory should result in more streamlined software that makes life for the practitioner much easier. Similarly, the extreme value modelling literature may be assimilated more successfully if seen in the context of an overarching structure. Ultimately, we believe the lack of a unified theoretical framework among extreme value models has had negative consequences, and this paper seeks to redress this issue.

The purpose of this article is to convey the advantage obtained by considering the classes of vector generalized linear models (VGLMs; Yee and Hastie 2003) and vector generalized additive models (VGAMs; Yee and Wild 1996) in the context of extreme value analysis. We have developed new methodology in order to incorporate extreme value models within the VGAM framework, namely, a quasi-Newton update within the iteratively reweighted least squares (IRLS) algorithm, the details of which are given in Section 2.2. The VGLM/VGAM classes of models are very large; they have been successfully applied to categorical data analysis, LMS quantile regression, and scores of standard and nonstandard univariate and continuous distributions. VGLMs/VGAMs provide a substantial and unifying framework for many statistical methods applicable to extreme value models. They also provide


a seamless transition between parametric and nonparametric analyses, allowing parameters to be modelled as linear or smooth functions of covariates.

The VGLM/VGAM classes are implemented in the vgam package (Yee 2007) for the R statistical computing environment (Ihaka and Gentleman 1996). This package provides a number of additional benefits above what is available elsewhere, and the modular nature of vgam also makes it easier for programmers to add new methodology through an object-oriented programming framework. The vgam package currently works only in R and not S-PLUS; however, the latter may be supported in the future.

An outline of this article is as follows. Section 2 reviews VGLMs and VGAMs, and Section 3 looks at extreme value models within this context. Section 4 demonstrates the vgam package with analyses of sea-level and rainfall data, and a simulation study. The paper concludes with a discussion in Section 5. Appendix 1 provides some technical implementation details, and Appendix 2 gives sample R code for the extreme value analyses presented in Section 4.

2 Vector Generalized Linear and Additive Models

In this section a skeleton of necessary details is presented regarding VGLMs and VGAMs so that the purposes of this paper can be achieved. Fuller details can be found in Yee and Hastie (2003) and Yee and Wild (1996), respectively.

Suppose the observed response y is a q-dimensional vector. VGLMs are defined as a model for which the conditional distribution of Y given explanatory x is of the form

$$f(y \mid x; B) = h(y, \eta_1, \ldots, \eta_M) \qquad (2.1)$$

for some known function h(·), where B = (β_1 β_2 ⋯ β_M) is a p × M matrix of unknown regression coefficients, and the jth linear predictor is

$$\eta_j = \eta_j(x) = \beta_j^T x = \sum_{k=1}^{p} \beta_{(j)k}\, x_k, \qquad j = 1, \ldots, M, \qquad (2.2)$$

where x = (x_1, ..., x_p)^T with x_1 = 1 if there is an intercept. VGLMs are thus like GLMs but allow for multiple linear predictors, and they encompass models outside the limited confines of the exponential family. Indeed, the breadth of the class covers a wide range of multivariate response types and models including univariate and multivariate distributions, categorical data analysis, time series, survival analysis, generalized estimating equations, correlated binary data, bioassay data and nonlinear least-squares problems.

The η_j of VGLMs may be applied directly to parameters of a distribution rather than just to means as for GLMs. A simple example is a univariate distribution with a location parameter μ and a scale parameter σ > 0, where we may take η_1 = μ and η_2 = log σ. In general, η_j = g_j(θ_j) for some parameter link function g_j and parameter θ_j. In vgam, there are currently over a dozen links to choose from, of which any can be assigned to any parameter, ensuring maximum flexibility.
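As a concrete, hedged illustration of multiple linear predictors and link choice (our sketch, not from the paper; it assumes the gumbel() family of Table 1 with its default links η_1 = μ and η_2 = log σ, and that rgumbel() is supplied by the package with location/scale arguments):

library(VGAM)
set.seed(123)
n <- 200
x2 <- runif(n)
# Gumbel responses whose location depends on x2 (rgumbel assumed from VGAM)
y <- rgumbel(n, location = 1 + 2 * x2, scale = exp(0.5))
# zero = NULL: no parameter is forced to be intercept-only (see Appendix 1)
fit <- vglm(y ~ x2, family = gumbel(zero = NULL))
coef(fit, matrix = TRUE)   # one column of coefficients per eta_j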


There is no relationship between q and M in general: it depends specifically on the model or distribution to be fitted. For simple extreme value models M is usually the number of parameters to be estimated and q may be any positive integer. For the stationary GEV and GP models defined in Section 3 we have q = 1, with M = 3 and M = 2 respectively.

VGLMs are estimated using IRLS. Most models that can be fitted have a log-likelihood $\ell = \sum_{i=1}^{n} w_i\, \ell_i$ and this will be assumed here. The w_i are known positive prior weights. Let x_i denote the explanatory vector for the ith observation, for i = 1, ..., n. Then one can write

$$\eta_i = \eta(x_i) = \begin{pmatrix} \eta_1(x_i) \\ \vdots \\ \eta_M(x_i) \end{pmatrix} = B^T x_i = \begin{pmatrix} \beta_1^T x_i \\ \vdots \\ \beta_M^T x_i \end{pmatrix}.$$

In IRLS, an adjusted dependent vector $z_i = \eta_i + W_i^{-1} d_i$ is regressed upon a large model matrix, with $d_i = w_i\, \partial \ell_i / \partial \eta_i$. The working weights W_i are usually taken as $W_i = -w_i\, \partial^2 \ell_i / (\partial \eta_i\, \partial \eta_i^T)$ or $W_i = -w_i\, E[\partial^2 \ell_i / (\partial \eta_i\, \partial \eta_i^T)]$, giving rise to the Newton-Raphson and Fisher-scoring algorithms respectively. Fisher-scoring is to be preferred for numerical stability because the W_i are positive-definite over a larger region of parameter space. The price to pay for this stability is typically a slower convergence rate and less accurate standard errors (Efron and Hinkley 1978).

VGAMs provide additive-model extensions to VGLMs, that is, Eq. 2.2 is generalized to

$$\eta_j(x) = \beta_{(j)1} + \sum_{k=2}^{p} f_{(j)k}(x_k), \qquad j = 1, \ldots, M,$$

a sum of smooth functions of the individual covariates, just as with ordinary GAMs (Hastie and Tibshirani 1990). The $f_k = (f_{(1)k}(x_k), \ldots, f_{(M)k}(x_k))^T$ are centered for uniqueness, and are estimated simultaneously using vector smoothers. VGAMs are thus a visual data-driven method that is well suited to exploring data, and they retain the simplicity of interpretation that GAMs possess.

In practice we may wish to constrain the effect of a covariate to be the same for some of the η_j and to have no effect for others. For example, for VGAMs we may wish to take

$$\eta_1 = \beta_{(1)1} + f_{(1)2}(x_2) + f_{(1)3}(x_3), \qquad \eta_2 = \beta_{(2)1} + f_{(1)2}(x_2),$$

so that f_(1)2 ≡ f_(2)2 and f_(2)3 ≡ 0. For VGAMs, we can represent these models using

$$\eta(x) = \beta_{(1)} + \sum_{k=2}^{p} f_k(x_k) = H_1 \beta_{(1)}^* + \sum_{k=2}^{p} H_k f_k^*(x_k)$$

where H_1, H_2, ..., H_p are known full-column rank constraint matrices, f*_k is a vector containing a possibly reduced set of component functions and β*_(1) is a vector of unknown intercepts. With no constraints at all, H_1 = H_2 = ⋯ = H_p = I_M and


β*_(1) = β_(1). Like the f_k, the f*_k are centered for uniqueness. For VGLMs, the f_k are linear so that

$$B^T = \left( H_1 \beta_{(1)}^* \;\; H_2 \beta_{(2)}^* \;\; \cdots \;\; H_p \beta_{(p)}^* \right).$$

VGAMs can be estimated by applying a modified vector backfitting algorithm (Buja et al. 1989) to the z_i (this is called the local scoring algorithm). This entails decomposing $f_{(j)k}(x_k) = \beta_{(j)k} x_k + g_{(j)k}(x_k)$ and simultaneously fitting all the linear parts first, followed by ordinary backfitting to estimate the nonlinear parts $g_{(j)k}(x_k)$. Modified backfitting improves on ordinary backfitting in terms of the convergence rate and numerical stability.

2.1 Vector Splines and Penalized Likelihood

Just as GAMs employ smoothers in their estimation, so VGAMs employ vector smoothers in their estimation. For scatterplot data $(x_i, y_i, \Sigma_i)$, i = 1, ..., n, vector smoothers fit a vector of smooth functions $f(x) = (f_1(x), \ldots, f_M(x))^T$ to the vector measurement model

$$y_i = f(x_i) + \varepsilon_i, \qquad i = 1, \ldots, n, \qquad (2.3)$$

where the ε_i are independent, E(ε_i) = 0 and Var(ε_i) = Σ_i. Here the Σ_i are known symmetric and positive-definite error covariances. The W_i = Σ_i^{-1} are known functions of the parameters within the Newton-Raphson or Fisher-scoring algorithms. Vector splines minimize the quantity

$$\sum_{i=1}^{n} \left\{ y_i - f(x_i) \right\}^T \Sigma_i^{-1} \left\{ y_i - f(x_i) \right\} \;+\; \sum_{j=1}^{M} \lambda_j \int_a^b \{ f_j''(t) \}^2 \, dt \qquad (2.4)$$

where a < min(x_1, ..., x_n) and b > max(x_1, ..., x_n). The first term measures lack of fit while the second term is a smoothing penalty. Equation 2.4 simplifies to an ordinary cubic smoothing spline when M = 1 (see, e.g., Green and Silverman 1994). Each component function f_j has a non-negative smoothing parameter λ_j which operates as with an ordinary cubic spline. Fessler (1991) proposed an O(nM³) algorithm to fit vector splines based essentially on the Reinsch algorithm (Reinsch 1967); however, Yee (2000) proposed an alternative algorithm based on B-splines (de Boor 1978) that has advantages such as improved numerical stability and ease of prediction of the functions.

Now consider the penalized likelihood

$$\ell \;-\; \frac{1}{2} \sum_{k=1}^{p} \sum_{j=1}^{M} \lambda_{(j)k} \int_{a_k}^{b_k} \{ f_{(j)k}''(x_k) \}^2 \, dx_k. \qquad (2.5)$$

Special cases of this quantity have appeared in a wide range of situations in the statistical literature to introduce cubic spline smoothing into a model’s parameter estimation (for a basic overview of this roughness penalty approach see Green and Silverman 1994). For the framework of VGAMs, Yee (1993) has shown that the VGAM local scoring algorithm can be justified via maximizing the penalized likelihood (2.5) with vector spline smoothing.
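Within vgam, each component function's smoothing parameter λ_(j)k is specified indirectly through an equivalent degrees-of-freedom argument to the s() terms, as discussed next. A minimal sketch (dat, y, x2 and x3 are hypothetical names):

# each s() term carries its own smoothness level, set here through df
fit <- vgam(y ~ s(x2, df = 3) + s(x3, df = 2),
            family = gumbel(), data = dat)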


From a practitioner's viewpoint the smoothness of a smoothing spline can be more conveniently controlled by specifying the degrees of freedom of the smooth rather than λ_j. Given a value of the degrees of freedom, the software can search for the λ_j producing this value. In general, the higher the degrees of freedom, the more wiggly the curve. In the examples below, 1 degree of freedom corresponds to a linear fit.

One null hypothesis of interest with VGAMs is to test whether a component function f_(j)k(x_k) is linear in x_k; the alternative hypothesis is that it is a nonlinear function. If H_0 cannot be rejected then it is more parsimonious to model it as a linear function. The generic function summary() applied to a vgam() object conducts a very approximate score test for H_0, along the lines of the details given in Section 7.4.5 of Chambers and Hastie (1993).

With VGAMs approximate inference via a deviance test is available. This is described in Section 6.8 of Hastie and Tibshirani (1990); see also Hastie and Tibshirani (1987). Briefly, for full-likelihood models, the difference in deviance between two models (one nested within the other) is approximately χ² with degrees of freedom equal to the difference in degrees of freedom between the models.

2.2 Quasi-Newton IRLS

In the VGAM context we typically prefer Fisher scoring for the numerical stability it provides. Unfortunately, Fisher scoring is not as practical in the extreme value domain because of the general difficulty of obtaining the expected information matrix. For example, Oakes and Manatunga (1992) and Shi (1995) each give the expected information matrix for specific extreme value models, but these typically involve numerical evaluation of integrals and are tedious to compute. The alternative Newton-Raphson approach can also be problematic for extreme value models because the individual working weight matrices may not be positive-definite even at the solution. For some extreme value models we have therefore adopted a quasi-Newton algorithm within the IRLS procedure. To our knowledge, the application of a quasi-Newton method to update the working weight matrices within IRLS is novel.

We use a rank-2 BFGS (Broyden 1970) update for each W_i matrix. The basic details are as follows. The idea is to use a rank-2 update of the Hessian matrix at each iteration step, thus building it up so that it becomes closer to the true value at the solution as the iterations converge. Importantly, only first-order derivative information is used. At the first iteration we use a steepest descent step $W_i^{(1)} = w_i I_M$. In vgam we apply the BFGS update to each weight matrix W_i. We have, for each i = 1, ..., n and at iteration a,

$$W_i^{(a+1)} = W_i^{(a)} + \frac{q_i^{(a)} q_i^{(a)T}}{s_i^{(a)T} q_i^{(a)}} - \frac{W_i^{(a)} s_i^{(a)} s_i^{(a)T} W_i^{(a)}}{s_i^{(a)T} W_i^{(a)} s_i^{(a)}} \qquad (2.6)$$

where

$$q_i^{(a)} = d_i^{(a)} - d_i^{(a+1)}, \qquad s_i^{(a)} = \eta_i^{(a+1)} - \eta_i^{(a)}. \qquad (2.7)$$

For this method, symmetry and positive-definiteness are assured provided $s_i^{(a)T} q_i^{(a)} > 0$. The update of W_i is skipped if $s_i^{(a)T} q_i^{(a)} \le 0$ to ensure that the weight matrices


remain positive-definite. In practice, we have found that this method has given good results.
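The following sketch shows the update for a single working weight matrix (our illustration of Eqs. 2.6–2.7 in plain R, not the package's internal code):

# rank-2 BFGS update of one working weight matrix W_i (Eq. 2.6)
bfgs.update <- function(W, d.old, d.new, eta.old, eta.new) {
  q <- d.old - d.new          # q_i^(a) of Eq. 2.7
  s <- eta.new - eta.old      # s_i^(a) of Eq. 2.7
  if (sum(s * q) <= 0)        # skip the update: keeps W positive-definite
    return(W)
  Ws <- drop(W %*% s)
  W + tcrossprod(q) / sum(s * q) - tcrossprod(Ws) / sum(s * Ws)
}
# e.g.: bfgs.update(diag(2), c(1, 0), c(0.5, -0.1), c(0, 0), c(0.2, 0.1))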

3 Extreme Value Models

In this section we give a brief outline of the extreme value models and methodology included in the vgam package. Table 1 lists some of the central vgam family and plotting functions at the heart of this paper. A vgam family function specifies the type of model to be used, and is assigned to the family argument of vglm() and vgam(). The vgam package contains many other family functions related to extreme value analyses, such as the Weibull, bivariate logistic, Fréchet, and loggamma distributions. We will focus here on models based on the generalized extreme value (GEV) and generalized Pareto (GP) distributions. For a more detailed review of these models see, e.g., Coles (2001) or Beirlant et al. (2004).

Let Z_1, ..., Z_m be a random sample from a distribution F, and let Y_m = max{Z_1, ..., Z_m}. Extreme value theory specifies that as m → ∞ the class of nondegenerate limiting distributions for Y_m, under linear normalization, is the GEV distribution

$$G(y) = \exp\left[ -\left\{ 1 + \xi (y - \mu)/\psi \right\}_{+}^{-1/\xi} \right], \qquad (3.1)$$

where μ, ψ > 0 and ξ are the location, scale and shape parameters respectively, and h_+ = max(h, 0). Within the VGLM/VGAM framework we implement non-stationary models where the ith maximum is assumed to be distributed as GEV with parameters (μ_i, ψ_i, ξ_i) for i = 1, ..., n.

The above result also leads to an approximation to the conditional exceedance probability Pr(Y > y | Y > u) for all y greater than some predetermined large threshold u. Specifically, the generalized Pareto (GP) distribution (Pickands 1975) can be obtained as an approximation of the conditional distribution of exceedances. Given n data points y_1, ..., y_n, a non-stationary extension of this model leads to the likelihood function

$$\prod_{i=1}^{n} \left[ \frac{1}{\sigma_i} \left\{ 1 + \frac{\xi_i (y_i - u_i)}{\sigma_i} \right\}^{-1/\xi_i - 1} \right]^{\delta_i} (1 - P_i)^{1 - \delta_i} P_i^{\,\delta_i} \qquad (3.2)$$

Table 1 Some central R functions for extreme value data currently available in the vgam package. The first group are vgam family (modelling) functions and the second are plotting functions.

  gev       GEV distribution (matrix responses allowed)
  gpd       Generalized Pareto distribution
  gumbel    Gumbel (matrix responses allowed)
  egumbel   Gumbel (univariate response)
  egev      GEV (univariate response)

  guplot    Gumbel plot
  meplot    Mean excess (mean residual life) plot
  qtplot    Quantile plot
  rlplot    Return level plot


where δi is an indicator for the exceedance of the ith data point over the threshold ui and Pi = Pr(Yi > ui ). The likelihood can be separated into two parts. The first term approximates the conditional exceedance probability for the ith response as GP with parameters (σi , ξi ) for i = 1, . . . , n. This term can be easily modelled under the VGLM/VGAM framework. The second term, which only involves the probability Pi , is the likelihood for a binomial GLM/GAM where δi is the response. A binomial GLM/GAM can therefore be used to model Pi separately. If n is large an alternative approach is to divide the observation period into subintervals, average the covariates within each interval, and use a Poisson GLM/GAM where the number of exceedances within each interval is the response. Offsets can be used to account for subintervals containing different numbers of observations. The models outlined above assume the observations are independent. For GEV models this assumption is reasonable if the observed maxima are taken over a relatively large observational period. For GP models this assumption may not be true because extreme events may form clusters, and observed series of exceedances will then show varying levels of short-range dependence. In practice this problem can be solved by identifying independent clusters and working with the cluster maxima. See, e.g., Leadbetter et al. (1983) for the theoretical basis of this approach and Ferro and Segers (2003) for a discussion of cluster identification.
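Returning to the two-part likelihood above, a hedged sketch of how the two parts might be fitted separately (our illustration; dat, y, x and the threshold value are hypothetical names):

library(VGAM)
u <- 30                         # a predetermined high threshold (assumed)
exceed <- dat$y > u             # exceedance indicator delta_i
# first part: GP model for the sizes of the exceedances, fitted with vglm()
gp.fit <- vglm(y ~ x, family = gpd(u), data = dat, subset = exceed)
# second part: binomial GLM for P_i = Pr(Y_i > u_i)
p.fit <- glm(exceed ~ x, family = binomial, data = dat)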

4 Examples

To illustrate the flavour of vgam modelling, the following analyses were performed (R 2.4.1 was used). R is one of two implementations of the S language (the other being S-PLUS), and the modelling ideas of Chambers and Hastie (1993) are prominent in this section. The core of the code used here is given in Appendix 2.

4.1 Fremantle Data Example

Coles (2001) describes a data set consisting of the annual maximum sea-levels recorded at Fremantle, near Perth in Australia. There are 86 annual measurements from 1897 to 1989 (some years are missing), and there are two variables, Year and SOI, which give the year and the annual mean values of the Southern Oscillation Index, a proxy for meteorological volatility. An initial GEV VGAM model was fitted using both variables and giving a nominal three degrees of freedom to each smooth. The shape parameter ξ was fitted as an intercept-only term for three reasons. Firstly, data typically provide little information on this parameter. Secondly, estimation is numerically fraught when this parameter is allowed to be too flexible. Thirdly, the data set here is small. (The first two reasons are why the software default is intercept-only for ξ.) We chose additive predictors

$$\begin{aligned} \mu(x) &= \eta_1 = \beta_{(1)1} + f_{(1)2}(x_2) + f_{(1)3}(x_3), \\ \log \psi(x) &= \eta_2 = \beta_{(2)1} + f_{(2)2}(x_2) + f_{(2)3}(x_3), \\ \xi &= \eta_3 = \beta_{(3)1}, \end{aligned} \qquad (4.1)$$


where x_2 = Year and x_3 = SOI. Note that ξ > −1/2 corresponds to the regular case where the classical asymptotic theory of maximum likelihood estimators is applicable (Smith 1985) and the working weight matrices are positive-definite.

Figure 1 shows the fitted functions. It appears that all of the functions except f̂_(2)2(x_2) are linear, and an approximate score test for linearity supported this view. These results agree with Coles (2001), who concluded that μ is linear in Year; however, our method of seeing this feature directly from a plot that allows nonlinearity is more convenient and reassuring. The fitted functions of Fig. 1 are plotted with pointwise ±2 standard error bands about them. In these types of plots it is common for the standard error bands to be widest at the boundaries, because there is less data there. The estimate of ξ was −0.27 with standard error 0.06, which indicates a Weibull-type GEV. The function f̂_(2)2(x_2) is "L"-shaped due to the smaller scale around the year 1945. These results suggest the model behind Fig. 2, where the three functions are linear and the smooth term f_(2)2(x_2) is modelled using a piecewise-linear regression spline, i.e., a broken-stick model. An informal deviance test between the two models gives a p-value of $0.43 \approx P[\chi^2_{250-242.91} > 2(63.33 - 59.80)]$, indicating that this simplified model suffices.

In his analysis of these data, Coles (2001) also concluded that the effect of SOI on annual maximum sea-levels at Fremantle is influential, even after allowance for time-variation, i.e., both Year and SOI need to be in the model for the μ parameter. Our analysis tends to agree with this conclusion. Furthermore, our second model agrees with an estimated time trend in the location parameter of around 2 mm per year.


Fig. 1 Initial VGAM fitted to the Fremantle data (called fit1 in Appendix 2). Each function has three degrees of freedom and the dashed lines are ±2 SE bands. From top left going clockwise, the fitted functions are f̂_(1)2(x_2), f̂_(2)2(x_2), f̂_(2)3(x_3) and f̂_(1)3(x_3) in Eq. 4.1



Fig. 2 A more parsimonious VGAM fitted to the Fremantle data (called fit2 in Appendix 2). It is a parametric version of Fig. 1. The pointwise ±2 standard error bands intersect at the sample mean of each covariate that has a linear function fitted to it

The SOI can be interpreted as a measure of meteorological volatility due to ocean-atmosphere phenomena such as the El Niño effect, and the analysis shows that, accounting for the time trend, such effects explain some of the observed variation of extreme sea-levels occurring on the Western Australian coastline (Coles 2001).

4.2 Rainfall Data Example

In this example we illustrate the GP model for exceedances using a large dataset of daily rainfall records, recorded throughout the period 1935–1970 at a site in Southwest England. There are some days with missing values, and these were omitted from the analysis. We incorporated spatial information through the records gathered at four neighbouring sites, labelled A–D. Sites A, B, C and D respectively lie to the North-East, East, South and West of the central site of interest. The observation times are also included in the model, giving five covariates in total. For pairwise bivariate Bayesian analyses of yearly site maxima, see Smith and Walshaw (2003).

One of the difficulties of non-stationary modelling under the GP model is the issue of threshold choice. For stationary models the threshold is typically chosen to be constant across time, with the constant threshold specified on the basis of standard graphical diagnostics (Beirlant et al. 2004). With non-stationarity it may not be reasonable to choose a constant threshold, given the expectation that there may be more exceedances in some periods than others; yet there are no published procedures (known to the authors) that assist in the determination of a non-constant threshold. For our analysis we used a fairly ad hoc two-stage procedure. At the first stage, we ignored the non-stationarity and chose a constant threshold using standard


methods. Approximately 4.6% of the observations at our site of interest exceeded this threshold. At the second stage we chose a non-constant threshold across time using quantile regression, implementing a cubic regression spline with time as the covariate, and estimating the quantile corresponding to an upper tail probability of 4.6% (a code sketch of this step is given at the end of this subsection). There were 436 exceedances of the resulting non-constant threshold.

An additional modelling difficulty is the dependence that may be exhibited by the data series. Environmental processes typically exhibit little or no long-range dependence, but often exhibit short-range dependence in the form of extreme clusters. For our data, the intervals estimator (Ferro and Segers 2003) of the extremal index (Leadbetter et al. 1983) was approximately 0.83, giving an estimated limiting mean cluster size of about 1.2, implying that there is a slight tendency to cluster at high levels of daily rainfall. We therefore decided to de-cluster the data by implementing the runs method (Coles 2001). Using the non-constant threshold derived earlier and a run length of 3 days, the method identified 326 extreme clusters from our 436 exceedances, and the 326 cluster maxima were used for our GP model. The results shown here are for the model

$$\log \sigma(x) = \eta_1 = \beta_{(1)} + f_A(x_A) + f_B(x_B) + f_C(x_C) + f_D(x_D) + \beta_T x_T, \qquad \xi(x) = \eta_2 = \beta_{(2)},$$

where x_T is the time covariate and, e.g., x_A is the rainfall covariate for site A. In order to illustrate the different features of the vgam package we model the functions as cubic regression splines, rather than the smoothing splines used in the previous example. Each spline is implemented using a standard B-spline basis, and hence this model lies within the VGLM framework. Given the large number of cluster maxima we also considered a time trend for the shape parameter, but this was subsequently found to be statistically insignificant.

The fitted regression splines are given in Fig. 3. They generally depict a positive linear relationship, though there is some suggestion of a decreasing scale with increasing rainfall at site B over the [0, 200] mm interval. The estimated values for the log-scale and shape intercepts were about 4.4 and −0.4 respectively, and there was a small but statistically significant decreasing trend in the log-scale, decreasing by about 0.1 every 8 years. When the site covariates are omitted from the analysis the estimated log-scale is larger than the intercept of the VGLM model, and consequently the estimated shape is also larger, taking a value close to zero. When we regress on the site values, the information reduces the extremal variation, and hence the tail of the rainfall distribution at the central site becomes lighter under the conditioning.

For prediction purposes, this example is merely illustrative; clearly the rainfall at the neighbouring sites cannot be known in advance. However, it does show the potential for incorporating other geographical and meteorological processes into models of the form presented here. If such processes can be predicted with accuracy, the flexible regression tools provided by vgam offer the potential for considerable improvements in the estimation of future return levels.
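As flagged above, here is a hedged sketch of the second-stage threshold choice (our illustration using the quantreg package; dat, with columns rain and time, is a hypothetical data frame):

library(quantreg)   # rq(): linear quantile regression
library(splines)    # bs(): B-spline basis, cubic by default
# time-varying threshold: the quantile with upper tail probability 4.6%,
# modelled as a cubic regression spline in time
qfit <- rq(rain ~ bs(time, df = 4), tau = 1 - 0.046, data = dat)
dat$thresh <- fitted(qfit)            # the non-constant threshold
exceed <- dat$rain > dat$thresh       # exceedances of that threshold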

Fig. 3 A VGAM fitted to the rainfall data (called rainfit in Appendix 2)

4.3 Simulation Example

In this example we illustrate the GP model for exceedances using simulated data. We simulate 500 exceedances of the threshold u = 0 from a GP model with a covariate x_2 generated as an equally spaced sequence from 0 to 1, and x_3 ~ Uniform(0, 1) independently, where

$$\log \sigma(x) = \tfrac12 \sin(2\pi x_2) + 0.1 x_3 \quad \text{and} \quad \xi = -0.1$$

denote the true model. This involves a constant shape parameter and a scale parameter with a periodic component. Our experience with the GP shape parameter has mirrored that of others, who have found it is often most parsimonious to fit this parameter as an intercept-only term. For this reason, this is the software default. A reasonable first fitted model can be written

$$\log \sigma(x) = \eta_1 = \beta_{(1)1} + f_{(1)2}(x_2) + f_{(1)3}(x_3), \qquad (4.2)$$

$$\log\left(\xi + \tfrac12\right) = \eta_2 = \beta_{(2)1} \qquad (4.3)$$

for cubic smoothing spline functions f_(1)2 and f_(1)3. Note that if ξ ≈ 0 then η_2 ≈ −log 2 + 2ξ in Eq. 4.3.


Fig. 4 VGAM model fitted to some simulated GPD data with n = 500. The functions are estimates of f_(1)2(x_2) and f_(1)3(x_3) in Eq. 4.2, and each has four degrees of freedom

A typical fit of Eqs. 4.2 and 4.3 is illustrated in Fig. 4. It is immediately apparent that the VGAM picks up both the sine curve and the linearity straightaway. The ±2 SE bands in the latter plot are wide, and upon this result the user would typically be parsimonious and constrain f_(1)3 to be linear in x_3 (or drop x_3 altogether). Of interest then are the estimates of: (a) the shape parameter ξ = e^{η_2} − 1/2, and (b) β_(1)3, where

$$\log \sigma(x) = \eta_1 = \beta_{(1)1} + f_{(1)2}(x_2) + \beta_{(1)3} x_3. \qquad (4.4)$$

In order to illustrate the small-sample properties of our estimators in this particular case, we sample exceedances from the true model described above for sample sizes n = 100, 500, and 1,000. For each sample size we simulate 500 realizations and fit the model (4.3) and (4.4). Then over these simulations, the bias and standard errors of ξ̂ and β̂_(1)3 can be estimated by computing the sample means and standard deviations of the estimates. The results are summarized in Table 2. Histograms of the estimates (not given) show that the distributions of ξ̂ and β̂_(1)3 are skewed away from zero for n = 100, a little skewed for n = 500, and very normal-like for n = 1,000. The results of Table 2 tend to agree with what is to be expected: as n increases, both the bias and the standard errors decrease. The mean square error (MSE) for ξ̂ is 0.0367, 0.00237 and 0.00105 (for n = 100, 500, 1,000 respectively), and for the slope β̂_(1)3 the values are 0.116, 0.0216 and 0.0104. Thus the estimates appear to be consistent as n → ∞.

Table 2 Simulation results of the bias and standard deviation of the estimates from a sample size of n GPD observations. The results are from 500 simulations each.

                 n = 100     n = 500     n = 1000
  Bias(ξ̂)        −0.136      −0.0231     0.0310
  Bias(β̂_(1)3)   −0.0117     −0.0136     0.0143
  SD(ξ̂)          0.135       0.0429      0.0310
  SD(β̂_(1)3)     0.341       0.146       0.101
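The bias, SD and MSE values quoted here follow the usual definitions; a self-contained sketch (our illustration, with xi.hat standing in for the 500 fitted shape estimates at a given n):

xi.true <- -0.1
xi.hat <- rnorm(500, mean = xi.true, sd = 0.04)  # placeholder estimates
bias <- mean(xi.hat) - xi.true    # sample mean of the estimation errors
sdev <- sd(xi.hat)                # sample standard deviation
mse <- bias^2 + sdev^2            # MSE decomposition used in the text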


5 Discussion

One of the major breakthroughs of the 1970s in statistics was the proposal of the class of GLMs (Nelder and Wedderburn 1972). It unified several important and different regression models of that time under a common framework, and has had a major impact on statistics. Nowadays, courses on GLMs are taught by many statistics departments, and most major statistical software packages offer routines for fitting and analysing GLMs. In the 1980s GAMs added smoothing capabilities to GLMs. This paper has described the class of VGLMs and VGAMs, which extend GLMs and GAMs to handle multiple predictors, multivariate responses and many statistical models and distributions. We have attempted to convey the many advantages that VGLMs/VGAMs can offer for extreme value analyses; for example, they provide a broad and attractive family for representing non-stationarity in extreme value datasets.

Although we have illustrated only GEV and GP modelling, the versatility of VGLMs and VGAMs for extreme value analysis is far broader. The q-dimensional multivariate extreme value distribution, obtained as the class of limiting distributions of linearly normalized componentwise maxima, can also be modelled within the VGLM/VGAM framework. The censored likelihood approach for modelling multivariate threshold exceedances (Ledford and Tawn 1996) can similarly be employed. There are no additional difficulties in implementing these larger classes of extreme value models under VGLM/VGAM, other than those which exist under any modelling framework, such as model choice and increasing computational burden. The implementation of these models in the vgam package is an area of further work.

The virtues of the VGLM/VGAM framework are extensive, but VGLMs/VGAMs are not without some weaknesses. For such a wide range of models, reliability and numerical stability are important issues. Complex parameter constraints and nonregular likelihood conditions can be problematic. Expected information matrices are often difficult to derive or too expensive to evaluate, and this often necessitates the use of different algorithms for different types of models, such as those described in Section 2.2. Furthermore, issues of model choice and diagnostic tools require more specific methodologies, and different implementations are required to accommodate them. The addition of modelling diagnostics and improved handling of parameter constraints are areas of further work. Despite these issues, we feel that the VGAM framework offers many compelling advantages to theoreticians and practitioners of extreme value analysis.

Acknowledgements The first author wishes to thank Stuart Coles for encouragement to start this work during a period of sabbatical leave at the Department of Mathematics, University of Bristol. This work was completed while the first author was on another sabbatical, at the Department of Statistics & Applied Probability at the National University of Singapore. He thanks both departments for their hospitality. We thank the three reviewers and Editor for their many detailed comments, which improved the paper markedly.

Appendix 1: Implementation Details

This section briefly describes some implementation details used by the vgam package, especially those relating to numerical analysis. Online documentation is provided


with the software, and additional documentation is provided at the first author's website. The package is under continual development, therefore improvements and changes to the details and usage given here are possible.

As all extreme value models in the vgam package have a log-likelihood function, half-stepsizing is useful because it ensures that an improvement is made at each iteration. This extension takes a half step if the next iteration 'overshoots' the solution because the quadratic approximation to the log-likelihood function is not accurate. If necessary an even smaller step is taken (e.g., 1/4 or 1/8, etc.), and an improvement is guaranteed since the step direction is ascending. Half-stepsizing is particularly useful at early stages of the iteration if the initial values are poor. Furthermore, a very large negative value of the log-likelihood can be returned to prevent an iteration from falling outside the parameter space, e.g., 1 + ξ_i(y_i − μ_i)/ψ_i > 0 is needed for the GEV distribution.

Initial values are often obtained using moment estimation. For example, for the Gumbel distribution the mean and variance are given by μ + σγ and π²σ²/6 respectively, where γ ≈ 0.58 is Euler's constant, and hence moment estimation leads to the starting values $\hat\sigma_i^{(0)} = \sqrt{6}\, s/\pi$ and $\hat\mu_i^{(0)} = m - \hat\sigma_i^{(0)} \gamma$, where m and s are the sample mean and standard deviation. All the vgam family functions listed in Table 1 evaluate initial values internally, i.e., they are self-starting. This compares with the functions fext() and forder() in the R package evd (Stephenson 2002), which are not. Self-starting family functions are useful for novices who may not know how to obtain a reasonable initial value. The family functions of Table 1 also allow the user to input initial values if desired. The user can therefore employ other forms of moment estimation such as probability-weighted moment estimation (Hosking et al. 1985) or L-moment estimation (Hosking and Wallis 1997). For further details on moment estimation for threshold models, see Hosking and Wallis (1987).

One technicality of extreme value models is the presence of singularities in the derivatives that need to be programmed as special cases. For example, for the GP distribution,

$$\frac{\partial \ell_i}{\partial \xi_i} = \frac{1}{\xi_i^2} \log\left(1 + \frac{\xi_i y_i}{\sigma_i}\right) - \frac{(1 + \xi_i^{-1})\, y_i}{\sigma_i \left(1 + \xi_i y_i / \sigma_i\right)},$$

and there is a singularity at ξ_i = 0. Upon series expansion, one obtains the limit

$$\lim_{\xi_i \to 0} \frac{\partial \ell_i}{\partial \xi_i} = \frac{y_i}{\sigma_i} \left( \frac{y_i}{2\sigma_i} - 1 \right).$$
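A sketch of how such a singularity might be handled in code (our illustration, not the package's implementation): switch to the series limit when ξ is numerically close to zero.

# derivative of the GP log-likelihood term with respect to xi,
# using the xi -> 0 limit near the singularity
dl.dxi <- function(y, sigma, xi, eps = 1e-6) {
  if (abs(xi) < eps)
    return((y / sigma) * (y / (2 * sigma) - 1))   # the series limit above
  log1p(xi * y / sigma) / xi^2 - (1 + 1 / xi) * y / (sigma + xi * y)
}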

Another example is the expected second derivatives of the GEV log-likelihood. For example, Prescott and Walden (1980) derived 

$$E\left[ \frac{\partial^2 \ell_i}{\partial \mu_i\, \partial \psi_i} \right] = \frac{1}{\psi_i^2\, \xi_i} \left\{ (1 + \xi_i)^2\, \Gamma(1 + 2\xi_i) - \Gamma(2 + \xi_i) \right\},$$


and using L'Hospital's rule, it can be shown that

$$\lim_{\xi_i \to 0} E\left[ \frac{\partial^2 \ell_i}{\partial \mu_i\, \partial \psi_i} \right] = \frac{2(1 - \gamma) - \psi^*(2)}{\psi_i^2} \approx \frac{0.42278}{\psi_i^2},$$

where ψ*(·) is the digamma function.

Parameter link functions are used to increase the chances of successful convergence by avoiding numerical problems, e.g., by using a log link for positive quantities. Additionally this makes a quadratic approximation to the log-likelihood function more realistic, and consequently, standard errors are more accurate. Users can very easily choose from a range of parameter link functions. Note that if $\hat\eta_j = g_j(\hat\theta_j)$ then $se(\hat\eta_j) \approx |g_j'(\hat\theta_j)| \cdot se(\hat\theta_j)$. Suitable parameter link functions also tend to speed up the convergence rate.

The zero argument in a vgam family function is used to model parameters as intercept-only. This is useful for parameters that require large amounts of data to be estimated accurately, since for small datasets the practitioner should model these parameters as simply as possible. This often occurs in extreme value models, since small datasets contain little information about the shape parameter ξ.

A particularly noteworthy innovation incorporated into vgam is smart prediction (written by T. Yee and T. Hastie), whereby the parameters of parameter-dependent functions are saved on the object. This means that functions such as scale(), poly(), bs() and ns() will always predict correctly. In contrast, in ordinary R, terms such as I(poly()) and bs(scale()) will not predict properly, while in S-PLUS even terms such as poly() and scale() will not predict properly.

Now a note about the software defaults. At present, the vgam family functions for the GEV and GP distributions have

$$\log\left(\xi + \tfrac12\right) = \eta_3 = \beta_{(3)1} \qquad (5.1)$$

as the default for the shape parameter ξ, i.e., a log link with an offset of 1/2. There are a number of reasons for this. Firstly, standard ML inference applies when ξ > −1/2, so that generic functions such as summary() and vcov() applied to the fit should give valid results provided the estimate satisfies this constraint. Secondly, if the estimate actually conflicts with this constraint then this will give numerical problems during estimation, which should alert the user to think more deeply about the ramifications of allowing ξ ≤ −1/2. Thirdly, the default is easily overridden, e.g., by simply setting lshape="identity"; however, output from summary() and vcov() could then be very misleading. Another software default for the GEV and GP distributions is zero=3 and zero=2 respectively. This argument causes the shape parameter to be modelled as intercept-only (e.g., as in Eq. 5.1). This choice minimizes the chances of numerical problems arising from allowing the shape parameter to be a linear/additive function of the covariates; and as before, the default can very easily be overridden (e.g., by setting zero=NULL); a brief code sketch is given below.

For some distributions vgam uses functions such as deriv3() to obtain derivatives, meaning laborious derivatives can be handled. However, only observed second-order derivatives can be evaluated with this method. Advances in symbolic computing such as mathStatica (Rose and Smith 2002) may result in improvements to the present code.
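As mentioned, a brief sketch of the defaults and of overriding them (our illustration; the simulated y and x are only for demonstration):

library(VGAM)
set.seed(1)
x <- runif(200)
y <- rgpd(200, loc = 0, scale = exp(1 + x), shape = 0.1)
# default: log(xi + 1/2) link for the shape, with zero = 2 (intercept-only)
fit.default <- vglm(y ~ x, family = gpd(0))
# overridden: identity link for xi, and xi allowed to depend on x
fit.free <- vglm(y ~ x, family = gpd(0, lshape = "identity", zero = NULL))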


In contrast to evd, the unifying theoretical framework advocated here naturally makes use of object-oriented software. vgam is written in Version 4 of the S language (Chambers 1998) and uses the modelling syntax of Chambers and Hastie (1993), including formula and data frames, as well as object-oriented programming. This means other users can add functionality by writing their own methods functions. The unified theory of Section 2 also means that inference and model checking is more streamlined, e.g., resid() can return three common forms of residuals for VGAMs, called working, Pearson and response residuals. These can be plotted to help detect outliers and lack of fit. Appendix 2: Code for Extreme Value Analysis A copy of this code is available from http://www.stat.auckland.ac.nz/~yee/updates, as well as the rainfall data. Example 1: Fremantle Data library(ismev); data(fremantle); fremantle = fremantle; detach() library(VGAM) fit1

= vgam(SeaLevel ~ s(Year, df=3) + s(SOI, df=3), family=gev(lshape="identity", zero=3), data=fremantle, maxit=50) par(mfrow=c(2,2), las=1) plot(fit1, se=TRUE, lcol="blue", scol="darkgreen") cm = list("(Intercept)"=diag(3), "Year"=rbind(1,0,0), "bs(Year, degree = 1, knot = 1945)"=rbind(0,1,0), "SOI"=diag(3)[,1:2]) fit2 = vglm(SeaLevel ~ Year + bs(Year, degree=1, knot=1945)+ SOI, family=gev(lshape="identity", zero=3), data=fremantle, constraints=cm) par(mfrow=c(2,2), las=1) plotvgam(fit2, se=TRUE, lcol="blue", scol="darkgreen") coef(fit2, matrix=TRUE) 1 - pchisq(2*(logLik(fit1)-logLik(fit2)), df=df.residual(fit2)-df.residual(fit1))# Deviance test

Example 2: Rainfall Data

threshold = ysrain$Thresh[ysrain$CluMax]   # non-constant threshold at the cluster maxima
clmax = ysrain$CluMax
rainfit = vglm(CenSite ~ bs(SiteA) + bs(SiteB) + bs(SiteC) +
                         bs(SiteD) + Year,
               family = gpd(threshold, lshape="identity", zero=2),
               data = ysrain, subset = clmax, maxit = 50)
par(mfrow=c(3,2), las=1)
plotvgam(rainfit, se=TRUE, lcol="blue", scol="darkgreen",
         ylim=c(-0.5,0.8))

Example 3: Simulation Example

n = 500
threshold = 0
x2 = seq(0, 1, len=n)
x3 = runif(n)
# simulate GP exceedances with a periodic log-scale and shape -0.1
y = rgpd(n, loc=threshold, scale=exp(0.5*sin(2*pi*x2) + 0.1*x3),
         shape = -0.1)
fit = vgam(y ~ s(x2) + s(x3),
           family=gpd(threshold, lshape="logoff", zero=2))
par(mfrow=c(1,2), las=1)
plot(fit, se=TRUE, lcol="blue", scol="darkgreen", rug=FALSE)

References

Beirlant, J., Goegebeur, Y., Segers, J., Teugels, J.: Statistics of Extremes. Wiley, Chichester, England (2004)
Broyden, C.G.: The convergence of a class of double-rank minimization algorithms. J. Inst. Math. Appl. 6, 76–90 (1970)
Buja, A., Hastie, T., Tibshirani, R.: Linear smoothers and additive models (with discussion). Ann. Stat. 17, 453–510 (1989)
Chambers, J.M.: Programming with Data: A Guide to the S Language. Springer, New York (1998)
Chambers, J.M., Hastie, T.J. (eds.): Statistical Models in S. Chapman & Hall, New York (1993)
Chavez-Demoulin, V., Davison, A.C.: Generalized additive modelling of sample extremes. Appl. Stat. 54, 207–222 (2005)
Coles, S.: An Introduction to Statistical Modeling of Extreme Values. Springer-Verlag, London (2001)
de Boor, C.: A Practical Guide to Splines. Springer, Berlin (1978)
Efron, B., Hinkley, D.V.: Assessing the accuracy of the maximum likelihood estimator: observed versus expected Fisher information. Biometrika 65, 457–481 (1978)
Ferro, C.A.T., Segers, J.: Inference for clusters of extreme values. J. R. Stat. Soc., B 65, 545–556 (2003)
Fessler, J.A.: Nonparametric fixed-interval smoothing with vector splines. IEEE Trans. Acoust. Speech Signal Process. 39, 852–859 (1991)
Green, P.J., Silverman, B.W.: Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. Chapman & Hall, London (1994)
Hastie, T., Tibshirani, R.: Generalized additive models: some applications. J. Am. Stat. Assoc. 82, 371–386 (1987)
Hastie, T.J., Tibshirani, R.J.: Generalized Additive Models. Chapman & Hall, London (1990)
Hosking, J.R.M., Wallis, J.R.: Parameter and quantile estimation for the generalized Pareto distribution. Technometrics 29, 339–349 (1987)
Hosking, J.R.M., Wallis, J.R.: Regional Frequency Analysis. Cambridge University Press, Cambridge, UK (1997)
Hosking, J.R.M., Wallis, J.R., Wood, E.F.: Estimation of the generalized extreme-value distribution by the method of probability-weighted moments. Technometrics 27, 251–261 (1985)
Ihaka, R., Gentleman, R.: R: a language for data analysis and graphics. J. Comput. Graph. Stat. 5, 299–314 (1996)
Leadbetter, M.R., Lindgren, G., Rootzén, H.: Extremes and Related Properties of Random Sequences and Series. Springer-Verlag, New York (1983)


Ledford, A.W., Tawn, J.A.: Statistics for near independence in multivariate extreme values. Biometrika 83, 169–187 (1996)
Nelder, J.A., Wedderburn, R.W.M.: Generalized linear models. J. R. Stat. Soc., A 135, 370–384 (1972)
Oakes, D., Manatunga, A.K.: Fisher information for a bivariate extreme value distribution. Biometrika 79, 827–832 (1992)
Pauli, F., Coles, S.: Penalized likelihood inference in extreme value analyses. J. Appl. Stat. 28, 547–560 (2001)
Pickands, J.: Statistical inference using extreme order statistics. Ann. Stat. 3, 119–131 (1975)
Prescott, P., Walden, A.T.: Maximum likelihood estimation of the parameters of the generalized extreme-value distribution. Biometrika 67, 723–724 (1980)
Reinsch, C.: Smoothing by spline functions. Numer. Math. 10, 177–183 (1967)
Rose, C., Smith, M.D.: Mathematical Statistics with Mathematica. Springer, New York (2002)
Rosen, O., Cohen, A.: Extreme percentile regression. In: Härdle, W., Schimek, M.G. (eds.) Statistical Theory and Computational Aspects of Smoothing: Proceedings of the COMPSTAT '94 Satellite Meeting held in Semmering, Austria, August 1994, pp. 27–28. Physica-Verlag, Heidelberg (1996)
Schimek, M.G. (ed.): Smoothing and Regression: Approaches, Computation, and Application. Wiley, New York (2000)
Shi, D.: Fisher information for a multivariate extreme value distribution. Biometrika 82, 644–649 (1995)
Smith, E.L., Walshaw, D.: Modelling bivariate extremes in a region. In: Bernardo, J.M., Bayarri, M.J., Berger, J.O., Dawid, A.P., Heckerman, D., Smith, A.F.M., West, M. (eds.) Bayesian Statistics 7. Oxford University Press, Oxford, UK (2003)
Smith, R.L.: Maximum likelihood estimation in a class of nonregular cases. Biometrika 72, 67–90 (1985)
Smith, R.L.: Extreme value theory based on the r largest annual events. J. Hydrol. 86, 27–43 (1986)
Stephenson, A.G.: evd: extreme value distributions. R News 2(2), 31–32 (2002)
Stephenson, A.G., Gilleland, E.: Software for the analysis of extreme events: the current state and future directions. Extremes 8, 87–109 (2006)
Tawn, J.A.: Bivariate extreme value theory: models and estimation. Biometrika 75, 397–415 (1988)
Weissman, I.: Estimation of parameters and large quantiles based on the k largest observations. J. Am. Stat. Assoc. 73, 812–815 (1978)
Yee, T.W.: The Analysis of Binary Data in Quantitative Plant Ecology. Ph.D. thesis, Department of Mathematics & Statistics, University of Auckland, New Zealand (1993)
Yee, T.W.: Vector splines and other vector smoothers. In: Bethlehem, J.G., van der Heijden, P.G.M. (eds.) Proceedings in Computational Statistics COMPSTAT 2000. Physica-Verlag, Heidelberg (2000)
Yee, T.W.: A User's Guide to the vgam Package. http://www.stat.auckland.ac.nz/~yee/VGAM (2007)
Yee, T.W., Hastie, T.J.: Reduced-rank vector generalized linear models. Stat. Modelling 3, 15–41 (2003)
Yee, T.W., Wild, C.J.: Vector generalized additive models. J. R. Stat. Soc., B 58, 481–493 (1996)