Test (2010) 19: 146–165  DOI 10.1007/s11749-009-0146-x  ORIGINAL PAPER
Canonical transformations of skew-normal variates Nicola Loperfido
Received: 10 August 2007 / Accepted: 17 February 2009 / Published online: 12 March 2009 © Sociedad de Estadística e Investigación Operativa 2009
Abstract Conditions are given for linear functions of skew-normal random vectors to maximize skewness and kurtosis. As a direct implication, several measures of their multivariate skewness and kurtosis are shown to be equivalent. An estimator of the shape parameter with good statistical properties is also considered. These results are strictly related to canonical forms of skew-normal distributions and linear transformations to normality.

Keywords Independent component analysis · Normalizing transformations · Principal components analysis · Skewness · Kurtosis

Mathematics Subject Classification (2000) 62F03 · 62E17 · 62P25

1 Introduction

There is a quickly growing literature on classes of multivariate distributions which include the normal ones as special cases. Many proposals are discussed in the book edited by Genton (2004), together with their applications in finance, econometrics, environmetrics, and biology. Survey papers by Azzalini (2005, 2006) and Kotz and Vicari (2005) give a critical overview of research in this area, pointing out the related inferential issues. Azzalini and Dalla Valle (1996) inspired many researchers by their seminal paper introducing the multivariate skew-normal distribution (SN, hereafter), whose pdf is

f(z; Ω, α) = 2φ_p(z; Ω)Φ(α^T z),  (1)

This research was supported by the Ministero dell'Università, dell'Istruzione e della Ricerca with grant PRIN No. 2006132978.

N. Loperfido, Facoltà di Economia, Università di Urbino "Carlo Bo", Via Saffi 42, 61029 Urbino (PU), Italy. e-mail: [email protected]
where z, α ∈ R^p, φ_p(·; Ω) is the pdf of N_p(0_p, Ω), Φ(·) is the cdf of N(0, 1), and Ω is a correlation matrix. We shall write z ∼ SN_p(Ω, α) to denote the above density. The matrix Ω and the vector α are commonly referred to as the scale parameter and shape parameter, respectively. When the shape parameter equals the null vector, the above density is a normal one. Skewness easily follows from f(z; Ω, α) not being necessarily equal to f(−z; Ω, α). Despite skewness, SN distributions show a remarkable degree of tractability, which is preserved when a location parameter is added and the scale parameter is chosen to be any valid covariance matrix. Linear transformations of SN random vectors have attracted a lot of attention, because of their relevance and implications. Azzalini and Dalla Valle (1996) proved that the SN class is closed with respect to marginalization and sign changes. Azzalini and Capitanio (1999) give results for wider classes of linear transformations. Gupta and Huang (2002) characterize SN random vectors through linear transformations, thus mirroring a fundamental property of the normal class. They also give conditions for a linear and a quadratic function of the same SN vector to be independent of each other. Interest often focuses on finding mutually independent linear transformations, rather than on independence between the observed variates. Azzalini and Capitanio (1999) show that there exist linear transformations whose distribution is canonical SN: the scale parameter is an identity matrix, components are mutually independent, and all components but one are normally distributed. In the following, we shall refer to this linear transformation and to the corresponding matrix as the canonical transformation and the canonical matrix, respectively. The paper deals with canonical transformations, their estimates, and their connections with shape parameters.
More precisely, it shows that one row of the canonical matrix is proportional to the shape parameter, which identifies the direction maximizing both skewness and kurtosis. It also proposes an estimate of the canonical transformation with good statistical properties. The paper is structured as follows. Section 2 contains main results and their applications to several multivariate statistical techniques. Section 3 proposes an estimation method based on results in the previous section, whose sampling properties are explored through simulations in Sect. 4. Section 5 presents a numerical example. Section 6 discusses results in the paper and some related research issues. The Appendix contains all proofs.
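The density in (1) is convenient to simulate via a standard selection (sign-flip) construction: if z₀ ∼ N_p(0, Ω) and w ∼ N(0, 1) are independent, then keeping z₀ when w < α^T z₀ and flipping its sign otherwise yields a draw with density (1). A minimal Python sketch; the values of Ω and α below are illustrative, not from the paper.

```python
import numpy as np

# Simulate z ~ SN_p(Omega, alpha) with density (1) via the sign-flip
# (selection) construction: draw z0 ~ N_p(0, Omega) and w ~ N(0, 1)
# independently, and flip the sign of z0 whenever w >= alpha' z0.
# The values of Omega and alpha are illustrative, not from the paper.
rng = np.random.default_rng(0)

def rsn(n, Omega, alpha):
    z0 = rng.standard_normal((n, len(alpha))) @ np.linalg.cholesky(Omega).T
    w = rng.standard_normal(n)
    sign = np.where(w < z0 @ alpha, 1.0, -1.0)
    return z0 * sign[:, None]

Omega = np.array([[1.0, 0.5], [0.5, 1.0]])
alpha = np.array([2.0, 1.0])
z = rsn(200_000, Omega, alpha)

# Known mean of SN_p(Omega, alpha): sqrt(2/pi) * Omega alpha / sqrt(1 + alpha' Omega alpha).
ez = np.sqrt(2 / np.pi) * Omega @ alpha / np.sqrt(1 + alpha @ Omega @ alpha)
```

The sample mean of the simulated draws can be checked against the known expectation E(z) = √(2/π) Ωα/√(1 + α^T Ωα).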
2 Main results

This section examines projections of SN random vectors which maximize several nonnormality features, and their connection with canonical transformations. Sections 2.1 and 2.2 relate the shape parameter to canonical transformations and to directions maximizing measures of shape, discussing implications for principal component analysis and independent component analysis, respectively. Generalizations to the more interesting case with a location parameter and a scale parameter chosen among valid covariance matrices will be considered in the following section.
2.1 Canonical transformations

Finding transformed vectors whose components are independent may be a goal by itself, as well as an intermediate step in the statistical analysis. For normal random vectors, this is usually achieved either by standardization or by transformation into principal components. Both methods may be inappropriate for SN random vectors, whose components may be dependent even when uncorrelated. Azzalini and Capitanio (1999) overcame the problem by showing that there is a linear transform y = Wz of a random vector z ∼ SN_p(Ω, α) satisfying y ∼ SN_p(I_p, α*), where W is a p × p real matrix, and at most one component of α* is nonzero. The pdf of y is

f(y) = 2Φ(y_j √(α^T Ωα)) ∏_{i=1}^p φ_1(y_i; 1),  1 ≤ j ≤ p.  (2)

They also proposed √(α^T Ωα) as a measure of nonnormality which possesses several appealing features: it is a nonnegative, scale-invariant quantity and a one-to-one function of Mardia's indices of multivariate skewness and kurtosis (Mardia 1970). Without loss of generality we can assume that the index j in (2) takes the value one. Propositions 5 and 6 in Azzalini and Capitanio (1999) imply that the rows w_1^T, ..., w_p^T of W satisfy

w_1^T z = α^T z/√(α^T Ωα) ∼ SN_1(1, √(α^T Ωα)),  w_i^T Ωw_j = 0, i ≠ j,  w_i^T Ωw_i = 1,  (3)

for i, j = 1, ..., p. The following proposition gives a condition for principal components of SN random vectors to be mutually independent.

Proposition 2.1 Let z ∼ SN_p(Ω, α), and let Γ^T be a p × p matrix whose columns γ_1, ..., γ_p are the normalized eigenvectors of Ω corresponding to the eigenvalues λ_1, ..., λ_p, with γ_1 ∝ α. Then the principal components of z are independent, proportional to the canonical variates, and the variance of z can be represented as follows:

Γ diag((πλ_1 + α^T α(π − 2)λ_1²)/(π(1 + λ_1 α^T α)), λ_2, ..., λ_p) Γ^T.  (4)
The restriction of α being proportional to γ_1 can be relaxed into α being proportional to any eigenvector of Ω. Statistical applications of the above proposition include Common Principal Components, a data analysis technique created by Bernhard Flury that allows two or more matrices to be compared in a hierarchical fashion (Flury 1988). As an example, consider independent data x_1, ..., x_n satisfying x_i ∼ SN_p(Ω_i, α_i) for i = 1, ..., n, where Ω_1, ..., Ω_n share the same eigenvectors, and α_1, ..., α_n are proportional to some of them. The same proposition implies that
the dependence structure of x1 , . . . , xn can be easily described via the Common Principal Components model despite being nonnormal, hence simplifying inferential procedures. The connection between principal components and canonical variates becomes apparent for exchangeable SN random vectors, that is, random vectors whose distribution is multivariate skew-normal and invariant under permutation of the vectors’ components. As an example, consider
z ∼ SN_2((1 ω; ω 1), ψ(1, 1)^T),  (5)

where |ω| < 1 and ψ ∈ R. Two eigenvectors of the scale matrix are γ_1 = (1, 1)^T and γ_2 = (−1, 1)^T. The corresponding eigenvalues are 1 + ω and 1 − ω. The variance Σ of z can be easily computed using results in Azzalini and Capitanio (2003) and a little algebra:

Σ = (1 − ω)/2 γ_2 γ_2^T + (1 + ω)/2 [1 − 4ψ²(1 + ω)/(π(1 + 2ψ²(1 + ω)))] γ_1 γ_1^T.  (6)

It follows that γ_1 and γ_2 are also eigenvectors of Σ. Moreover, the shape parameter is an eigenvector of both Ω and Σ. Transformations into principal components and canonical variates are characterized by the matrices

Γ = (1/√2)(1 1; −1 1)  and  W = (√(1+ω) 0; 0 √(1−ω))^{−1} (1/√2)(1 1; −1 1),  (7)

respectively, so that Γ = diag(√(1+ω), √(1−ω)) W. The independence of the first and second components easily follows.
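The spectral representation (6) can be verified numerically. In the sketch below the values ω = 0.6 and ψ = 1.5 are illustrative, and Σ is built from the general SN variance formula used throughout the paper, Σ = Ω − (2/π)Ωαα^T Ω/(1 + α^T Ωα).

```python
import numpy as np

# Numerical check of (6): with illustrative values omega = 0.6, psi = 1.5,
# build Sigma from the SN variance formula
#   Sigma = Omega - (2/pi) * (Omega a)(Omega a)' / (1 + a' Omega a),  a = psi*(1,1)',
# and verify that gamma_1 = (1,1)' and gamma_2 = (-1,1)' are eigenvectors of
# both Omega and Sigma, with the eigenvalue along gamma_1 implied by (6).
omega, psi = 0.6, 1.5
Omega = np.array([[1.0, omega], [omega, 1.0]])
a = np.array([psi, psi])

Sigma = Omega - (2 / np.pi) * np.outer(Omega @ a, Omega @ a) / (1 + a @ Omega @ a)

g1 = np.array([1.0, 1.0])
g2 = np.array([-1.0, 1.0])
# Eigenvalue of Sigma along gamma_1 per (6) (gamma_1' gamma_1 = 2 absorbs the 1/2):
lam1 = (1 + omega) * (1 - 4 * psi**2 * (1 + omega) / (np.pi * (1 + 2 * psi**2 * (1 + omega))))
```

The eigenvalue of Σ along γ_2 stays at 1 − ω, since the skewing term acts only in the γ_1 direction.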
2.2 Measures of shape

In the univariate case, the two most popular measures of skewness and kurtosis of a random variable Z are

β_1(Z) = {E[((Z − μ)/σ)³]}²  and  β_2(Z) = E[((Z − μ)/σ)⁴],  (8)

where μ = E(Z), σ² = V(Z), and moments of appropriate order are finite. Mardia (1970) generalized these measures to multivariate distributions. For a p-dimensional random vector z with mean μ and variance Σ, they are defined as follows:

β^M_{1,p}(z) = E[{(z − μ)^T Σ^{−1}(w − μ)}³],  (9)

β^M_{2,p}(z) = E[{(z − μ)^T Σ^{−1}(z − μ)}²],  (10)
where z and w are independent and identically distributed random vectors, and moments of appropriate order are finite. When z ∼ SN_p(Ω, α), Mardia's indices have a simple analytical form (Azzalini and Capitanio 1999):

β^M_{1,p}(z) = 2(4 − π)² [α^T Ωα/(π + (π − 2)α^T Ωα)]³,  (11)

β^M_{2,p}(z) = p(p + 2) + 8(π − 3) [α^T Ωα/(π + (π − 2)α^T Ωα)]²,  (12)

which are one-to-one functions of √(α^T Ωα), thus motivating its use as a measure of nonnormality. Malkovich and Afifi (1973) proposed other measures of multivariate skewness and kurtosis for a p-dimensional random vector z:

β^D_{1,p}(z) = max_{c ∈ R^p_0} β_1(c^T z),  β^D_{2,p}(z) = max_{c ∈ R^p_0} β_2(c^T z),  (13)

where the superscript "D" recalls their directional nature, and R^p_0 is the set of all p-dimensional nonnull vectors. The measures β^D_{1,p}(z) and β^D_{2,p}(z) equal 0 and 3 when z is normally distributed (it follows from any linear combination of a normal random vector being univariate normal). Their sample analogues are
b^D_{1,p}(X) = max_{c ∈ R^p_0} [(1/n) Σ_{i=1}^n ((c^T x_i − c^T x̄)/√(c^T Sc))³]²  (14)

and

b^D_{2,p}(X) = max_{c ∈ R^p_0} (1/n) Σ_{i=1}^n ((c^T x_i − c^T x̄)/√(c^T Sc))⁴,  (15)
where x̄, S, and X denote the sample mean, the sample variance, and the data matrix whose rows are the vectors x_1^T, ..., x_n^T. Baringhaus and Henze (1991) considered properties of β^D_{1,p}(z), β^D_{2,p}(z) and of their sample analogues for elliptically distributed random vectors. Machado (1983) approximated percentage points of b^D_{1,p}(X) and b^D_{2,p}(X) using appropriate transformations of the asymptotic distribution for normal vectors of dimensions 2, 3, and 4. Kuriki and Takemura (2001) provided approximations and bounds for the upper tail probabilities of b^D_{1,p}(X) and b^D_{2,p}(X) for observations of arbitrary dimension.

When z ∼ SN_p(Ω, α), the measures β^M_{1,p}(z), β^M_{2,p}(z), β^D_{1,p}(z), β^D_{2,p}(z), and √(α^T Ωα) are functionally related, and vectors maximizing β^D_{1,p}(z), β^D_{2,p}(z) are proportional to α. More formally:

Proposition 2.2 Let c be a nonzero real number and z ∼ SN_p(Ω, α). Then β^D_{1,p}(z) = β^M_{1,p}(z) = β_1(cα^T z) and β^D_{2,p}(z) = β^M_{2,p}(z) − p(p + 2) + 3 = β_2(cα^T z).

Proposition 2.2 gives an intuitive interpretation of the shape parameter α and provides the first example, to the best of the author's knowledge, of a skewed multivariate distribution for which the measures β^D_{1,p}(z) and β^D_{2,p}(z) have a simple analytical form. Directions maximizing kurtosis were introduced as tools for testing multivariate normality. However, they may be interesting in their own right, as in independent
component analysis (ICA). The basic ICA model is y = As, where y is the observed vector, s is a vector with independent components, some of them nonnormal, and A is an invertible matrix. The nonnormal components are often assumed to be leptokurtic. Theory, applications, and algorithms of ICA are thoroughly discussed by Hyvarinen and Oja (2000), Hyvarinen et al. (2001), and Stone (2005). Skew-normal random vectors y ∼ SN_p(Ω, α) satisfy these assumptions, with A and s being the inverse of the canonical matrix and the vector of canonical variates, respectively, where the only nonnormal component is skew-normal with shape parameter √(α^T Ωα) (and hence leptokurtic, as any univariate nonnormal SN distribution). ICA aims to recover information about nonnormal components by finding directions maximizing a nonnormal feature, often chosen to be kurtosis. For y ∼ SN_p(Ω, α) with α ≠ 0_p, the choice is motivated by cα^T y with c ≠ 0 having maximal kurtosis and being proportional to the only nonnormal canonical variate. Directions maximizing kurtosis also occur in Cluster Analysis. Peña and Prieto (2001) propose to identify clusters using projections onto directions minimizing and maximizing kurtosis of the projected data. Results on canonical variates in Azzalini and Capitanio (1999), together with Propositions 2.1 and 3.1 in the next section, imply that these directions tend to be orthogonal to Ωα and similar to the direction of the shape vector α, respectively, when the data are i.i.d. from SN_p(Ω, α) and the sample size is large enough.
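In low dimension, the directional statistics (14)-(15) can be approximated by brute force. The sketch below scans unit directions on a grid for p = 2; the data are simulated with skewness injected along the first axis, and the grid is a crude stand-in for exact maximization, not the algorithm used later in the paper.

```python
import numpy as np

# Brute-force sketch of the directional sample skewness b^D_{1,2}(X) in (14):
# scan unit directions c = (cos a, sin a) on a grid and square the largest
# absolute standardized third sample moment. The data are simulated, with
# skewness injected along the first axis (illustrative, not from the paper).
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 2))
X[:, 0] = np.abs(X[:, 0])                      # half-normal, hence skewed

xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)

def dir_skew(c):
    u = (X - xbar) @ c / np.sqrt(c @ S @ c)    # standardized projection on c
    return np.mean(u**3)

angles = np.linspace(0.0, np.pi, 1800, endpoint=False)  # c and -c coincide after squaring
skews = np.array([dir_skew(np.array([np.cos(a), np.sin(a)])) for a in angles])
b1_D = np.max(np.abs(skews)) ** 2
best_c = np.array([np.cos(angles[np.argmax(np.abs(skews))]),
                   np.sin(angles[np.argmax(np.abs(skews))])])
```

Here `best_c` should point roughly along the skewed first axis, and `b1_D` is at least as large as the squared skewness of any fixed projection.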
3 An estimation method

In order to fit real data, Azzalini and Dalla Valle (1996) include a scale and a location parameter, obtaining
f(z; ξ, Ω, α) = 2φ_p(z − ξ; Ω)Φ(α^T ω^{−1/2}(z − ξ)),  (16)

in the notation of Azzalini and Capitanio (1999), where Ω = {ω_ij} is a symmetric, positive definite p × p matrix, and ω = diag(ω_11, ..., ω_pp). The vector ξ, the matrix Ω, and the vector α will be referred to as the location, the scale, and the shape parameters of SN_p(ξ, Ω, α), respectively. Azzalini and Capitanio (1999) also introduced the parameter η = ω^{−1/2}α in order to simplify maximization of the likelihood function. Propositions 2.1 and 2.2 still hold true, with minor changes, when α and SN_p(Ω, α) are replaced with η and SN^D_p(ξ, Ω, η), respectively, where the latter denotes the distribution corresponding to the above pdf with ω^{−1/2}α replaced by η. Consistently with the above argument, we shall refer to η as the directional parameter, a reminder that it characterizes the direction of maximal nonnormality. Similarly, we shall refer to the triple (ξ, Ω, η) as the directional parameterization, and SN^D_p(ξ, Ω, η) shall denote a p-dimensional SN distribution whose pdf is parameterized via (ξ, Ω, η). In statistical practice, the canonical matrix needs to be estimated. Intuitively, its estimate should directly follow from estimates of the location, scale, and shape parameters, maybe obtained using either the method of moments or maximum likelihood. However, estimates of the shape parameter might be complex, when using the former
method, and infinite, when using the latter. These problems are well known in the univariate case (Azzalini and Capitanio 1999; Pewsey 2000). Similar problems, or worse, are likely to appear in the multivariate setting. The estimate of the canonical matrix presented in this section circumvents the problem, being based on the direction maximizing sample skewness. The following proposition shows that the direction maximizing sample skewness (kurtosis) converges almost surely to the direction maximizing population skewness (kurtosis), when the latter is unique, as happens in the skew-normal case.

Proposition 3.1 Let c ∈ R^p be the unique vector, up to a multiplicative constant, maximizing skewness (kurtosis) of c^T z, where z ∈ R^p is a random vector with finite third (fourth) moment. Moreover, let X_n be an n × p data matrix whose rows are independent and identically distributed as z. Then the linear combination of variates maximizing sample skewness (kurtosis) converges almost surely to c.

Crucial steps in the estimation process are motivated by the following identities.

Lemma 1 Let Σ and Σ^{−1} be the variance and concentration matrix of the random vector z ∼ SN^D_p(ξ, Ω, η). Then

η^T Ση = η^T Ωη{π + (π − 2)η^T Ωη}/(π(1 + η^T Ωη)),  (17)

Σ^{−1} = Ω^{−1} + 2ηη^T/(π + (π − 2)η^T Ωη).  (18)
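Both identities of Lemma 1 can be checked numerically by constructing Σ from the variance formula Σ = Ω − (2/π)(Ωη)(Ωη)^T/(1 + η^T Ωη), the directional-parameterization analogue of the expression used in the Appendix. The values of Ω and η below are illustrative, not from the paper.

```python
import numpy as np

# Numerical check of Lemma 1 with illustrative Omega and eta: build Sigma
# from Sigma = Omega - (2/pi)(Omega eta)(Omega eta)'/(1 + eta' Omega eta)
# and verify identities (17) and (18).
Omega = np.array([[2.0, 0.7, 0.3],
                  [0.7, 1.5, 0.4],
                  [0.3, 0.4, 1.0]])
eta = np.array([1.0, -0.5, 2.0])

q = eta @ Omega @ eta
Sigma = Omega - (2 / np.pi) * np.outer(Omega @ eta, Omega @ eta) / (1 + q)

# Identity (17): eta' Sigma eta in closed form.
lhs17 = eta @ Sigma @ eta
rhs17 = q * (np.pi + (np.pi - 2) * q) / (np.pi * (1 + q))

# Identity (18): the concentration matrix as a rank-one update of Omega^{-1}.
lhs18 = np.linalg.inv(Sigma)
rhs18 = np.linalg.inv(Omega) + 2 * np.outer(eta, eta) / (np.pi + (np.pi - 2) * q)
```

Identity (18) is the Sherman-Morrison inverse of the rank-one update defining Σ, which is why the check holds exactly up to floating-point error.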
Let x_1, ..., x_n be a random sample from SN^D_p(ξ, Ω, η) whose mean and variance are x̄ and S, respectively. The estimation method can be operationally described as follows.

Step 1: Find the normalized vector λ̂ maximizing the skewness of a linear combination of the data:

λ̂ = arg max_{c ∈ R^p_0; c^T c = 1} Σ_{i=1}^n ((c^T x_i − c^T x̄)/√(c^T Sc))³.  (19)

Step 2: Evaluate the sample version of Mardia's skewness

b^M_{1,p} = (1/n²) Σ_{i=1}^n Σ_{j=1}^n {(x_i − x̄)^T S^{−1}(x_j − x̄)}³,  (20)

and let b*_{1,p} = min(b^M_{1,p}, 0.99). We use b*_{1,p} rather than b^M_{1,p} because the latter can take any positive real value, while b*_{1,p} is always smaller than 1, exactly like β^M_{1,p}. Moreover, using b^M_{1,p} rather than b*_{1,p} may lead to an estimated scale matrix which is not positive definite.
Then substitute β^M_{1,p}(z) and η^T Ωη with b*_{1,p} and η̂^T Ω̂η̂, respectively, in the equation representing Mardia's skewness of z ∼ SN^D_p(ξ, Ω, η):

β^M_{1,p}(z) = 2(4 − π)² [η^T Ωη/(π + (π − 2)η^T Ωη)]³.  (21)

Then solve, for η̂^T Ω̂η̂,

η̂^T Ω̂η̂ = wπ/(1 − w(π − 2)),  w = (b*_{1,p}/(2(4 − π)²))^{1/3}.  (22)

Step 3: Substitute Σ, λ = η/√(η^T η), η, η^T Ωη with S, λ̂, η̂, η̂^T Ω̂η̂, respectively, in the equation

λ^T Σλ = η^T Ωη{π + (π − 2)η^T Ωη}/(πη^T η(1 + η^T Ωη)),  (23)

which is a direct consequence of the first equation in Lemma 1 and of the definition of λ. Then solve, for η̂^T η̂,

η̂^T η̂ = η̂^T Ω̂η̂{π + (π − 2)η̂^T Ω̂η̂}/(π λ̂^T Sλ̂ (1 + η̂^T Ω̂η̂)).  (24)

Step 4: Substitute Σ^{−1}, Ω^{−1}, η, η^T Ωη with S^{−1}, Ω̂^{−1}, η̂, η̂^T Ω̂η̂, respectively, in the equation representing the concentration matrix of z ∼ SN^D_p(ξ, Ω, η):

Σ^{−1} = Ω^{−1} + 2ηη^T/(π + (π − 2)η^T Ωη).  (25)

Then solve the resulting equation for Ω̂:

Ω̂ = [S^{−1} − 2η̂η̂^T/(π + (π − 2)η̂^T Ω̂η̂)]^{−1}.  (26)
Step 5: Find p − 1 vectors b_1, ..., b_{p−1} which, together with the vector η̂, satisfy

b_i^T Ω̂b_j = 0, i ≠ j,  b_i^T Ω̂η̂ = 0, i = 1, ..., p − 1,  (27)

and let Ĉ be the matrix whose rows are

η̂^T/√(η̂^T Ω̂η̂),  b_1^T/√(b_1^T Ω̂b_1),  ...,  b_{p−1}^T/√(b_{p−1}^T Ω̂b_{p−1}).  (28)
This method is not meant to replace either the method of moments or maximum likelihood for estimating the parameters of SN distributions: its purpose is limited to estimation of the canonical matrix. For this reason, there is no need to be concerned if some steps of the method are not guided by deep inferential principles, as long as
it produces valid estimates, i.e., neither complex nor infinite. For example, the matrix Ω̂ may not be positive definite. This would be a problem if the algorithm were aimed at estimating Ω. However, Ω̂ is needed only for finding Ω̂η̂. Moreover, as a direct consequence of Proposition 3.1, it follows that the estimates of Ω and α described in this section converge almost surely to Ω and α, respectively.
4 A simulation study

Simulations in this section compare the efficiency of estimators of the shape parameter based on the method of moments and on the proposed method. They are based on 5000 samples of sizes n = 100, 150, 200, 250, 300 from SN_p(0_p, I_p, νp^{−1/2}1_p), where 0_p, I_p, and 1_p denote the zero vector, the identity matrix, and the unit vector of size p = 2, 3, while ν = 0, 1, 2, 3 denotes the nonnormality index introduced by Azzalini and Capitanio (1999). For each sample and each estimate a, the quantity (α − a)^T(α − a) was computed in order to estimate the corresponding mean square error. Simulation results are reported in Table 1. They hint that the proposed estimation method is more efficient than the method of moments and that estimation of the shape parameter becomes more difficult as its dimension increases. The simulation results regarding ML estimation are not reported, due to a nonnegligible percentage of samples (about 10%) for which the algorithm failed to converge. Numerical instability might be due to the presence of boundary (i.e., infinite) estimates of the shape parameter. The problem is well known in the univariate case (Azzalini and Capitanio 1999; Pewsey 2000). It has not been studied in the multivariate case, either theoretically or through simulations. Hence it is not possible, at present, to tell which samples lead to boundary estimates of the shape parameter. Other simulations (not reported here) are consistent with the following statements. Directions maximizing the population's skewness and kurtosis (i.e., vectors proportional to the shape parameter) are better estimated using vectors maximizing sample skewness than vectors maximizing sample kurtosis. Moreover, directions identifying transformations to normality (i.e., vectors orthogonal to the direction of Ωα) are estimated quite efficiently using vectors orthogonal to the estimated direction of Ωα obtained with the algorithm described in the previous section.
5 A numerical example

This section applies results in the previous ones to measurements of chest and shoulders taken from 118 Italian women, intended for the design of bridal dresses for the women themselves. Measurements were taken by professional tailors, reported in centimeters, and rounded to the nearest integer. Table 2 reports some descriptive statistics for the two variables. Mardia's index of multivariate skewness is 0.5927, and the p-value of the corresponding normality test is 0.0201. Hence there is enough evidence to reject the hypothesis of normality. However, nonnormality seems to be moderate, and the distribution underlying the data is unimodal, as hinted by the scatterplot (Fig. 1), the
Table 1 Simulated mean square errors of estimates for the shape parameter α obtained using the method of moments and the proposed method, based on 5000 replicates of samples of sizes n = 100, 150, 200, 250, 300 from SN_p(0_p, I_p, νp^{−1/2}1_p), p = 2, 3, and ν = 0, 1, 2, 3

Variates  Nonnormality index  Sample size  Method of moments  Maximum skewness
2         0                   100          11.5618            2.3735
2         0                   150          11.4375            1.7134
2         0                   200          11.3631            1.4195
2         0                   250          11.3230            1.2434
2         0                   300          11.3006            1.1539
2         1                   100          30.0287            3.3612
2         1                   150          29.9166            3.0251
2         1                   200          29.8135            2.8165
2         1                   250          29.7555            2.6526
2         1                   300          29.7954            2.6876
2         2                   100          52.7612            11.9962
2         2                   150          52.4068            11.3451
2         2                   200          52.0700            10.8706
2         2                   250          51.8726            10.5846
2         2                   300          51.7892            10.4740
2         3                   100          73.6125            26.1261
2         3                   150          72.9016            26.8267
2         3                   200          72.7886            26.7801
2         3                   250          72.6827            27.4153
2         3                   300          72.4437            27.2869
3         0                   100          8.1421             8.7099
3         0                   150          8.0642             4.0802
3         0                   200          8.0062             2.8382
3         0                   250          7.9780             2.2347
3         0                   300          7.9661             1.9475
3         1                   100          20.6886            12.1755
3         1                   150          20.4568            6.3234
3         1                   200          20.3678            4.4623
3         1                   250          20.3055            3.8828
3         1                   300          20.2677            3.5496
3         2                   100          35.1636            25.5612
3         2                   150          34.8632            19.2609
3         2                   200          34.7286            14.3754
3         2                   250          34.5878            11.6398
3         2                   300          34.5588            11.1589
3         3                   100          49.9517            45.1658
3         3                   150          49.6677            38.5079
3         3                   200          49.4415            33.8463
3         3                   250          49.4054            31.2639
3         3                   300          49.3100            30.2276
Table 2 Summary statistics of the original variables

           Mean     Variance  Skewness  Kurtosis
Shoulders  39.6780  2.9132    0.0725    3.3165
Chest      91.3814  35.4902   0.5044    3.3809
Fig. 1 Scatterplot of shoulders and chest data
Fig. 2 Healy’s plot of shoulders and chest data
Healy's plot (Fig. 2), and the histograms (Figs. 3 and 4). The above features make the bivariate skew-normal distribution SN_2(ξ, Ω, α) the natural candidate for data fitting and motivate the application of the estimation method described in Sect. 3.
Fig. 3 Histogram of shoulders data
Fig. 4 Histogram of chest data
The method presented in Sect. 3 estimates the scale matrix and the shape parameter with

Ω̂ = (3.3231 8.4925; 8.4925 88.1222)  and  η̂ = (−0.1522, 0.4228)^T.  (29)

An estimate ξ̂ of the location parameter ξ of z ∼ SN^D_p(ξ, Ω, η) can be obtained from the equation

E(z) = ξ + √(2/π) Ωη/√(1 + η^T Ωη)  (30)
Fig. 5 Histogram of projection maximizing skewness
Fig. 6 Histogram of projection Xc
(Azzalini and Capitanio 1999) by replacing E(z), ξ, Ω, η with x̄, ξ̂, Ω̂, η̂ and then solving for ξ̂. With the data at hand, this leads to ξ̂ = (39.0575, 84.1475)^T. The directions maximizing skewness and kurtosis are characterized by the vectors a = (0.3387, −0.9409)^T and b = (0.5827, −0.8127)^T, respectively. They are very similar to each other, since the cosine of the angle between a and b is 0.9620 in absolute value, i.e., very close to its maximum value 1. Furthermore, there is little difference between the skewness (kurtosis) associated with a, which is 0.5229 (3.4473), and the skewness (kurtosis) associated with b, which is 0.5032 (3.4732). This is consistent with the skew-normal model, for which the directions maximizing skewness and kurtosis coincide (Proposition 2.2).
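This computation can be reproduced from the published numbers alone: the sample means in Table 2, together with Ω̂ and η̂ from (29), recover ξ̂ via equation (30).

```python
import numpy as np

# Reproduce xi_hat from the published numbers: Omega_hat and eta_hat from (29),
# sample means from Table 2 (shoulders 39.6780, chest 91.3814), and equation (30).
Omega_hat = np.array([[3.3231, 8.4925], [8.4925, 88.1222]])
eta_hat = np.array([-0.1522, 0.4228])
xbar = np.array([39.6780, 91.3814])

v = Omega_hat @ eta_hat                      # estimated direction Omega_hat eta_hat
t = eta_hat @ v                              # eta_hat' Omega_hat eta_hat
xi_hat = xbar - np.sqrt(2 / np.pi) * v / np.sqrt(1 + t)
```

The vector `v` also matches the direction (3.0849, 35.9655)^T quoted in the next paragraph, to which the projection Xc is orthogonal.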
Fig. 7 PP-plot for Xc
Fig. 8 PP-plot for Xa
We shall now consider the linear transformations Xa and Xc, where X denotes the data matrix, and c = (−0.9963, 0.0855)^T is orthogonal to the estimated direction of Ωη, which is identified by Ω̂η̂ = (3.0849, 35.9655)^T. The skewnesses of Xa and Xc are 0.5229 and 0.0685, respectively. The kurtoses of Xa and Xc are 3.4473 and 3.1742, respectively. The transformed variable Xa is definitely skewed and leptokurtic (Fig. 8). Its histogram clearly resembles a skew-normal density (Fig. 5). On the contrary, the skewness and kurtosis of Xc are consistent with the normality hypothesis: the corresponding p-values are 0.2444 and 0.6993, respectively. The histogram (Fig. 6) and the PP-plot (Fig. 7) of Xc definitely suggest a normal behavior, too. All these empirical findings are consistent with theoretical results in the paper.
6 Discussion

The paper shows that the shape parameter of an SN distribution characterizes the direction maximizing skewness and kurtosis, in the sense of Malkovich and Afifi (1973). Maximal skewness and kurtosis are functionally related to Mardia's indices of multivariate skewness and kurtosis and to the nonnormality index proposed by Azzalini and Capitanio (1999), in the SN case. Directions maximizing (minimizing) skewness and kurtosis are shown to be related to canonical transformations of SN random vectors and to several statistical multivariate techniques. The paper mostly deals with point estimation, but results in Sect. 2 could be applied to goodness-of-fit tests, too. As remarked by Arnold and Beaver (2002), "only the surface of the goodness of fit has been scratched, and only in one dimension". A possible strategy for testing skew-normality in the multivariate case could be the following: first project the data onto the direction of the shape parameter in order to emphasize nonnormal features of the data themselves. Then apply a test for skew-normality devised for the univariate case by Gupta and Chen (2001), Dalla Valle (2007), Mateu-Figueras et al. (2007), or Meintanis (2007). Theoretical results in the paper only deal with SN distributions, but useful generalizations can be expected for more flexible models, such as the skew-t distribution (Branco and Dey 2001). Azzalini and Capitanio (2003) showed that the skew-t distribution gave a very good fit to bivariate body measurements of 202 Australian athletes. The directions maximizing sample skewness and kurtosis of these data are very similar to each other, i.e., d_1 = (0.9950, −0.1004)^T and d_2 = (0.9980, −0.0628)^T, respectively, suggesting that Proposition 2.2 can be generalized to the skew-t distribution. Results in the paper also apply in the presence of covariates.
For example, consider the linear mixed model proposed by Lin and Lee (2007), that is, y = Xβ + Zδ + e, where e follows an ordinary normal distribution and δ follows a skew-normal distribution. Then the distribution of the response vector y is multivariate skew-normal, with its shape parameter influenced by the distribution of δ. Hence the projection α^T y maximizing the skewness (kurtosis) also highlights some features of the mixing effect δ. Generalizations of results in this paper appear to be more problematic for the class of skewed distributions introduced by Sahu et al. (2003), where the presence of several sources of truncation might allow for different directions maximizing skewness (kurtosis). No theoretical results for this class of distributions are available at present, and the problem motivates further research on this topic.

Acknowledgements The author would like to thank an associate editor, three anonymous referees, Adelchi Azzalini, and Antonella Capitanio for reading previous drafts of the paper and for their comments.
Appendix

Proof of Proposition 2.1 The variance of z ∼ SN_p(Ω, α) is

V(z) = Ω − (2/π) Ωαα^T Ω/(1 + α^T Ωα)  (31)
(Azzalini and Capitanio 1999). By assumption γ_1 = α/√(α^T α). Ordinary properties of eigenvectors imply

V(z) = Σ_{i=1}^p λ_i γ_i γ_i^T − (2/π) λ_1²/(1 + λ_1 α^T α) αα^T.  (32)

A little algebra leads to

V(z) = (πλ_1 + α^T α(π − 2)λ_1²)/(πα^T α(1 + λ_1 α^T α)) αα^T + Σ_{i=2}^p λ_i γ_i γ_i^T,  (33)

which can be simplified to

V(z) = Γ diag((πλ_1 + α^T α(π − 2)λ_1²)/(π(1 + λ_1 α^T α)), λ_2, ..., λ_p) Γ^T.  (34)

Hence Γ is the matrix corresponding to the transformation of z into its principal components. Its rows satisfy

γ_1 = α/√(α^T α),  γ_i^T Ωγ_j = 0 (i ≠ j)  for i, j = 1, ..., p.  (35)
Let W be the canonical matrix whose rows w_1^T, ..., w_p^T satisfy

w_1 = α/√(α^T Ωα),  w_i^T Ωw_j = 0, i ≠ j,  w_i^T Ωw_i = 1,  i = 2, ..., p.  (36)
The above constraints imply that

w_1 = √(α^T α/(α^T Ωα)) γ_1,  w_i = γ_i/√λ_i  (37)
for i = 2, ..., p. It follows that the principal components γ_1^T z, ..., γ_p^T z are proportional to the canonical variates w_1^T z, ..., w_p^T z and therefore are mutually independent.

Proof of Proposition 2.2 Azzalini and Capitanio (1999) showed that

y = Az ∼ SN_k(AΩA^T, (AΩA^T)^{−1} AΩα/√(1 + α^T Ωα − α^T ΩA^T (AΩA^T)^{−1} AΩα)),  (38)

where A ∈ R^{k×p} and AΩA^T is nonsingular. In particular, if A = c^T, the shape parameter of y is

α_y = (c^T Ωc)^{−1} c^T Ωα/√(1 + α^T Ωα − α^T Ωc(c^T Ωc)^{−1} c^T Ωα).  (39)
By assumption c^T Ωc = 1, so that

y ∼ SN_1(1, c^T Ωα/√((1 + α^T Ωα) − (c^T Ωα)²)).  (40)

Azzalini (1985) showed that the skewness and kurtosis of X ∼ SN(1, λ) are

β_1(X) = 2λ⁶(4 − π)²/{(π − 2)λ² + π}³,  β_2(X) = 3 + 8λ⁴(π − 3)/{(π − 2)λ² + π}².  (41)

Both skewness and kurtosis are increasing functions of |λ|. Hence the vector maximizing β_1(c^T z) and β_2(c^T z) also maximizes the shape parameter of c^T z, which is increasing in c^T Ωα. By assumption, Ω is a positive definite symmetric matrix, and so there exists a positive definite symmetric matrix Ω^{1/2} satisfying Ω^{1/2}Ω^{1/2} = Ω. Apply now the Cauchy–Schwarz inequality:

(c^T Ω^{1/2} Ω^{1/2} α)² ≤ (c^T Ω^{1/2} Ω^{1/2} c)(α^T Ω^{1/2} Ω^{1/2} α).  (42)

The constraints Ω^{1/2}Ω^{1/2} = Ω and c^T Ωc = 1 imply that c^T Ωα ≤ √(α^T Ωα). The equality is achieved only when c is proportional to α, that is, c = α/√(α^T Ωα). It follows that the shape parameter of c^T z achieves its maximum value √(α^T Ωα) when c = α/√(α^T Ωα):

α/√(α^T Ωα) = arg max_{c^T Ωc=1} c^T Ωα/√((1 + α^T Ωα) − (c^T Ωα)²),  (43)

√(α^T Ωα) = max_{c^T Ωc=1} c^T Ωα/√((1 + α^T Ωα) − (c^T Ωα)²).  (44)
Since α^T z/√(α^T Ωα) ∼ SN(0, 1, √(α^T Ωα)), we can write

β^D_{1,p}(z) = max_{c ∈ R^p_0} β_1(c^T z) = 2(4 − π)² [α^T Ωα/(π + (π − 2)α^T Ωα)]³,  (45)

β^D_{2,p}(z) = max_{c ∈ R^p_0} β_2(c^T z) = 3 + 8(π − 3) [α^T Ωα/(π + (π − 2)α^T Ωα)]².  (46)

In order to complete the proof, it suffices to recall the values of β^M_{1,p}(z) and β^M_{2,p}(z) when z ∼ SN_p(Ω, α).
Proof of Lemma 1 In order to simplify the notation, and without loss of generality, we shall prove the lemma for z ∼ SN_p(Ω, α), whose variance can be represented as

Σ = Ω − (2/π) Ωαα^T Ω/(1 + α^T Ωα)  (47)

(Azzalini and Dalla Valle 1996). In order to prove the first equation, premultiply and postmultiply both sides of the above equation by α^T and α, respectively. Then apply simple algebra to obtain

α^T Σα = α^T Ωα{π + (π − 2)α^T Ωα}/(π(1 + α^T Ωα)).  (48)
We shall now prove the second equation of Lemma 1. The concentration matrix Σ⁻¹ can be represented as the inverse of the sum of a matrix and a matrix product. We can then apply the formula

$$(A+BCD)^{-1}=A^{-1}-A^{-1}B\bigl(C^{-1}+DA^{-1}B\bigr)^{-1}DA^{-1} \tag{49}$$

(Mardia et al. 1979, p. 459) by letting A = Ω, B = Ωα, C = {−(π/2)(1 + αᵀΩα)}⁻¹, and D = αᵀΩ, and applying simple linear algebra to obtain

$$\Sigma^{-1}=\Omega^{-1}+\frac{2\alpha\alpha^T}{\pi+(\pi-2)\alpha^T\Omega\alpha}. \tag{50}$$
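The algebra behind (47), (48), and (50) is easy to sanity-check numerically. The sketch below (my own illustration; Ω and α are arbitrary choices, and NumPy is assumed) verifies that the stated Σ⁻¹ really inverts the stated Σ.

```python
import numpy as np

# Illustrative positive definite scale matrix and shape vector (arbitrary choices).
rng = np.random.default_rng(1)
p = 4
A = rng.standard_normal((p, p))
Omega = A @ A.T + p * np.eye(p)   # shift guarantees positive definiteness
alpha = rng.standard_normal(p)
aOa = alpha @ Omega @ alpha

# Variance of z ~ SN_p(Omega, alpha), eq. (47).
Sigma = Omega - (2 / np.pi) * np.outer(Omega @ alpha, Omega @ alpha) / (1 + aOa)

# Quadratic form alpha^T Sigma alpha, eq. (48).
assert np.isclose(alpha @ Sigma @ alpha,
                  aOa * (np.pi + (np.pi - 2) * aOa) / (np.pi * (1 + aOa)))

# Concentration matrix, eq. (50): it must invert Sigma.
Sigma_inv = (np.linalg.inv(Omega)
             + 2 * np.outer(alpha, alpha) / (np.pi + (np.pi - 2) * aOa))
assert np.allclose(Sigma @ Sigma_inv, np.eye(p))
```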
Proof of Proposition 3.1 We shall prove the result for sample skewness only, since the proof for sample kurtosis is very similar. When all necessary moments exist finite, the first, second, and third moments of z are μ₁ = E(z), μ₂ = E(zzᵀ), and μ₃ = E(z ⊗ zᵀ ⊗ zᵀ), where ⊗ denotes the Kronecker (tensor) product (Kollo and von Rosen 2005, p. 177). The corresponding sample analogues are

$$m_{1,n}=\frac{1}{n}X_n^T 1_n,\qquad m_{2,n}=\frac{1}{n}X_n^TX_n,\qquad m_{3,n}=\frac{1}{n}\sum_{i=1}^n x_i\otimes x_i^T\otimes x_i^T, \tag{51}$$

respectively, where xᵢ is the transpose of the ith row of Xₙ. Ordinary asymptotic properties of sample moments imply that

$$P\Bigl(\lim_{n\to\infty} m_{i,n}=\mu_i\Bigr)=1,\qquad i=1,2,3. \tag{52}$$

The statistic b(Xₙa) is a continuous function of m₁,ₙ, m₂,ₙ, and m₃,ₙ for any nonnull vector a ∈ ℝᵖ. It follows that

$$P\Bigl(\lim_{n\to\infty} b(X_na)=\beta\bigl(z^Ta\bigr)\Bigr)=1, \tag{53}$$
where b(Xₙa) and β(zᵀa) denote the skewness of a linear combination of the variables in Xₙ and the skewness of a linear combination of the components of z, respectively. Simple but tedious application of standard results in asymptotic theory and matrix differentiation leads to the following statements: b(Xₙa) and β(zᵀa) are twice differentiable with respect to a; maximum values of b(Xₙa) and β(zᵀa) can be found via ordinary differentiation techniques; projections of Xₙ corresponding to local maxima of b(Xₙa) converge almost surely to projections of z corresponding to local maxima of β(zᵀa); the number of local maxima of b(Xₙa) and β(zᵀa) is finite. These statements, together with almost sure convergence of b(Xₙa) to β(zᵀa) for any nonnull vector a ∈ ℝᵖ, imply that

$$P\Bigl(\lim_{n\to\infty}\max_{a\in\mathbb{R}_0^p} b(X_na)=\max_{a\in\mathbb{R}_0^p}\beta\bigl(z^Ta\bigr)\Bigr)=1. \tag{54}$$
By assumption, c is the unique vector, up to a multiplicative constant, with the above maximizing property. Hence

$$P\Bigl(\lim_{n\to\infty} a_n=c\Bigr)=1,\qquad\text{where } a_n=\arg\max_{a\in\mathbb{R}_0^p} b(X_na). \tag{55}$$
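Proposition 3.1 rests on sample skewness converging almost surely to its population counterpart. A minimal Monte Carlo sketch of this convergence for a single fixed projection (my own illustration, assuming SciPy's `skewnorm` sampler and `skew` statistic; λ, the sample size, and the tolerance are arbitrary choices):

```python
import numpy as np
from scipy.stats import skewnorm, skew  # SciPy assumed available

# Population skewness of a univariate SN(1, lambda) variate.
lam = 3.0
pop_skew = float(skewnorm(lam).stats(moments="s"))

# Sample skewness from a large simulated sample.
x = skewnorm.rvs(lam, size=200_000, random_state=12345)
sample_skew = skew(x)

# Consistency: the sample value should be close to the population value.
assert abs(sample_skew - pop_skew) < 0.05
```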
References

Arnold BC, Beaver RJ (2002) Skewed multivariate models related to hidden truncation and/or selective reporting (with discussion). Test 11:7–54
Azzalini A (1985) A class of distributions which includes the normal ones. Scand J Stat 12:171–178
Azzalini A (2005) The skew-normal distribution and related multivariate families (with discussion). Scand J Stat 32:159–188
Azzalini A (2006) Some recent developments in the theory of distributions and their applications. Atti della XLIII Riunione Scientifica della Società Italiana di Statistica, pp 51–64
Azzalini A, Capitanio A (1999) Statistical applications of the multivariate skew-normal distribution. J R Stat Soc B 61:579–602
Azzalini A, Capitanio A (2003) Distributions generated by perturbation of symmetry with emphasis on a multivariate skew t distribution. J R Stat Soc B 65:367–389
Azzalini A, Dalla Valle A (1996) The multivariate skew-normal distribution. Biometrika 83:715–726
Baringhaus L, Henze N (1991) Limit distributions for measures of multivariate skewness and kurtosis based on projections. J Multivar Anal 38:51–69
Branco MD, Dey DK (2001) A general class of skew-elliptical distributions. J Multivar Anal 79:99–113
Dalla Valle A (2007) A test for the hypothesis of skew-normality in a population. J Stat Comput Simul 77:63–77
Flury B (1988) Common principal components and related multivariate methods. Wiley, New York
Genton MG (ed) (2004) Skew-elliptical distributions and their applications: a journey beyond normality. Chapman & Hall/CRC, Boca Raton
Gupta AK, Chen T (2001) Goodness-of-fit tests for the skew-normal distribution. Commun Stat Simul Comput 30:907–930
Gupta AK, Huang WJ (2002) Quadratic forms in skew-normal variates. J Math Anal Appl 273:558–564
Hyvarinen A, Oja E (2000) Independent component analysis: algorithms and applications. Neural Netw 13:411–430
Hyvarinen A, Karhunen J, Oja E (2001) Independent component analysis. Wiley, New York
Kollo T, von Rosen D (2005) Advanced multivariate statistics with matrices. Springer, Dordrecht
Kotz S, Vicari D (2005) Survey of developments in the theory of continuous skewed distributions. Metron LXIII:225–261
Kuriki S, Takemura A (2001) Tail probabilities of the maxima of multilinear forms and their applications. Ann Stat 29:328–371
Lin TC, Lee JC (2007) Estimation and prediction in linear mixed models with skew normal random effects for longitudinal data. Stat Med (online version)
Machado SG (1983) Two statistics for testing for multivariate normality. Biometrika 70:713–718
Malkovich JF, Afifi AA (1973) On tests for multivariate normality. J Am Stat Assoc 68:176–179
Mardia KV (1970) Measures of multivariate skewness and kurtosis with applications. Biometrika 57:519–530
Mardia KV, Kent JT, Bibby JM (1979) Multivariate analysis. Academic Press, London
Mateu-Figueras G, Puig P, Pewsey A (2007) Goodness-of-fit tests for the skew-normal distribution when the parameters are estimated from the data. Commun Stat Theory Methods 36:1735–1755
Meintanis SG (2007) A Kolmogorov–Smirnov type test for skew normal distribution based on the empirical moment generating function. J Stat Plan Inference 137:2681–2688
Peña D, Prieto FJ (2001) Cluster identification using projections. J Am Stat Assoc 96:1433–1445
Pewsey A (2000) Problems of inference for Azzalini's skew-normal distribution. J Appl Stat 27:859–870
Sahu S, Dey D, Branco M (2003) A new class of distributions with applications to Bayesian regression models. Can J Stat 31:129–150
Stone JV (2005) Independent component analysis. In: Everitt BS, Howell DC (eds) Encyclopedia of statistics in behavioral sciences. Wiley, Chichester