Skewness-Based Projection Pursuit: a Computational Approach

Nicola Loperfido

Dipartimento di Economia, Società e Politica, Università degli Studi di Urbino "Carlo Bo", Via Saffi 42, Urbino (PU), ITALY

Abstract

Projection pursuit is a multivariate statistical technique aimed at finding interesting low-dimensional data projections by maximizing a measure of interestingness commonly known as a projection index. The widespread use of projection pursuit has been hampered by the computational difficulties inherent to the maximization of the projection index. The problem is addressed within the framework of skewness-based projection pursuit, focused on data projections with the highest third standardized cumulants. First, we motivate the use of the right dominant singular vector of the third multivariate, standardized moment to start the maximization procedure. Second, we propose an iterative algorithm for skewness maximization which relies on the analytically tractable maximization of a third-order polynomial in two variables. Both visual inspection and formal testing based on simulated data clearly suggest that the asymptotic distribution of the maximal skewness achievable by a linear projection of normal data might be skew-normal. The potential of skewness-based projection pursuit for uncovering data structures is illustrated with Olympic decathlon data.

Keywords: Higher-order power method, Projection pursuit, Singular value decomposition, Skewness, Tensor eigenvector.

1. Introduction


Projection pursuit is a multivariate statistical technique aimed at finding interesting low-dimensional data projections. It is particularly useful when data are high-dimensional, data features are unclear and the approach is exploratory. Interesting projections often uncover unanticipated structures. However, Cook et al. (1993) clearly remarked that interesting projections might well

Email address: [email protected] (Nicola Loperfido)

1. Additional figures, theorems' proofs and Matlab codes are available as supplementary material in the electronic version of the paper.


be those which uncover a previously chosen data structure. Projection pursuit was introduced by Kruskal (1969), but was given its present name by Friedman & Tukey (1974). Huber (1985), Jones & Sibson (1987), Sun (2006), Daszykowski (2007) and Caussinus & Ruiz-Gazen (2009) give excellent reviews of the topic.

A projection index is a function which associates a data projection with a real value measuring its interestingness: the higher the index, the more interesting the projection. Consequently, projection pursuit looks for the data projection which maximizes the projection index. There are many such indices, but the most popular ones belong to three distinct classes, which we shall refer to as cumulant indices (Malkovich & Afifi, 1973; Jones & Sibson, 1987; Nason, 1995), entropy indices (Friedman & Tukey, 1974; Huber, 1985) and distance indices (Friedman, 1987; Hall, 1989; Cook et al., 1993), with an obvious meaning. Despite being conceptually distinct, they are connected to each other in several ways. Different indices might be either associated with the same projection, or associated with different projections detecting the same data feature. Cook et al. (1993) highlighted some previously unnoticed connections between cumulant-based and distance-based indices. For example, some terms in the expansions of distance-based indices are very useful for detecting skewness.

As remarked by Berro et al. (2010), the optimization algorithm plays a crucial role in projection pursuit. Friedman (1987) clearly states that the usefulness of projection pursuit depends on the numerical optimizer. Posse (1995) stresses the importance of powerful optimization routines. According to Sun (2006), the two basic elements of projection pursuit are the projection index and the optimization routine. The computational efficiency of the latter is strictly related to the analytical properties of the former. A nondifferentiable index, e.g. the Hampel index in Pan et al. (2000), poses more computational difficulties than a differentiable one, e.g. the cumulant index in Jones & Sibson (1987). An index with more local maxima, e.g. kurtosis, poses more computational difficulties than an index with fewer local maxima, e.g. skewness (Paajarvi & Leblanc, 2004). Projection pursuit methods tend to be computationally intensive (Tyler et al., 2009), especially when the algorithm does not scale well to the cardinality and the dimensionality of the data, due to the exponentially increasing amount of time or memory which it requires. In recent years, the computational difficulties of projection pursuit motivated alternative approaches to the search for interesting projections (Tyler et al., 2009; Hui & Lindsay, 2010; Peña et al., 2010; Serfling & Mazumder, 2013).

This paper addresses the computational difficulties of projection pursuit within the framework of skewness maximization. In the general case, the projection which maximizes skewness must be computed by a numerical, iterative procedure. Unfortunately, difficulties often arise due to the multimodal nature of the objective function (Paajarvi & Leblanc, 2004), and we address them in a twofold way. First, we motivate the use of the right dominant singular vector of the third multivariate, standardized moment to start the maximization procedure. Second, we propose an iterative algorithm for skewness maximization which relies on the analytically tractable maximization of a third-order polynomial in two variables.
The rest of the paper is structured as follows. Section 2 motivates skewness-based projection pursuit using both simulated data and theoretical arguments. Section 3 investigates a starting value for skewness maximization originally proposed in numerical multilinear algebra and describes a new iterative algorithm for skewness maximization. Sections 4 and 5 assess the practical relevance of the proposed method with simulated and real data, respectively. Section 6 contains some concluding remarks.

2. Motivation

This section describes the merits of skewness as a projection index, using arguments from exploratory data analysis, multivariate normality testing, information-based approximations, independent component analysis and nonnormal multivariate modelling.

In the first place, skewness is a valid projection pursuit index, according to the criteria stated in Huber (1985). Moreover, it is a mathematical concept with a simple graphical interpretation, and projection pursuit aims at making the most of visual perception as a tool for discovering data patterns (Huber, 1985). We shall illustrate the point with the random vector w = (W1, W2, W3)^T, whose joint density is

f(w1, w2, w3; α) = 2 φ(w1) φ(w2) φ(w3) Φ(α w1 w2 w3),

where φ(·) is the pdf of a standard normal distribution, Φ(·) is the cdf of a standard normal distribution and α is a nonnull, real value. The distribution of w belongs to a class of nonnormal random vectors with normal marginals introduced by Arnold et al. (2002, 2007). The vector w is not multivariate skew-normal, as defined by Azzalini & Dalla Valle (1996), but the conditional distributions w1|w2, w3, w2|w1, w3 and w3|w1, w2 are univariate skew-normal. All bivariate marginals of f(w1, w2, w3; α) are standard normal distributions:

(W1, W2)^T ∼ (W1, W3)^T ∼ (W2, W3)^T ∼ N2( (0, 0)^T, I2 ).   (1)

As a first implication, the random vector w = (W1, W2, W3)^T is standardized. As a second implication, any random vector (aWi + bWj, cWi + dWj)^T has a bivariate normal distribution, for i ≠ j and i, j = 1, 2, 3. As a third implication, neither histograms nor scatterplots of data from f(w1, w2, w3; α) are likely to detect its nonnormal features. As a fourth implication, E(Wi^4) = 3 and E(Wi^2 Wj^2) = 1, for i, j = 1, 2, 3 and i ≠ j. Also, elementary integration techniques imply that

E(Wi^2 Wj Wh) = ∫_{R^3} wi^2 wj wh · 2 φ(wi) φ(wj) φ(wh) Φ(α wi wj wh) dwi dwj dwh = 0,   (2)

where i, j, h = 1, 2, 3 and i ≠ j ≠ h. We conclude that w and a trivariate, standard normal random vector have the same fourth-order moments. Hence kurtosis-based projection pursuit is unable to uncover the nonnormal features of f(w1, w2, w3; α).

The directional skewness γ^D_{1,d}(x) of a d-dimensional random vector x is the maximum value attainable by γ1(c^T x), where c is a nonnull, d-dimensional real vector and γ1(Y) is the third standardized moment of the random variable Y. More formally, let µ and Σ be the mean and the variance of x, and let S^{d−1} be the unit sphere in R^d. The directional skewness of x is then

γ^D_{1,d}(x) = max_{c ∈ S^{d−1}} γ1(c^T x) = max_{c ∈ S^{d−1}} E[(c^T x − c^T µ)^3] / (c^T Σ c)^{3/2}.   (3)

The name directional skewness is a reminder of γ^D_{1,d}(x) being the maximum skewness attainable by a projection of the random vector x onto a direction. The squared third standardized cumulant of a linear combination c^T w, where c = (c1, c2, c3)^T ≠ (0, 0, 0)^T, is

γ1²(c^T w) = E²[(c1 W1 + c2 W2 + c3 W3)^3] / E³[(c1 W1 + c2 W2 + c3 W3)^2] = 36 γ² c1² c2² c3² / (c1² + c2² + c3²)^3,   (4)

where γ = E(W1 W2 W3) and α have the same sign. Basic properties of the geometric and the arithmetic mean of c1², c2², c3² imply that the above ratio is maximized only if c1, c2, c3 are equal to each other in absolute value, as for example in c = 1_3 = (1, 1, 1)^T. It follows that γ^D_{1,d}(w) = 2γ/√3. All linear combinations maximizing absolute skewness have the same shape, up to sign changes, and hence give the same information about the nonnormality of f(w1, w2, w3; α). The distribution of w is simple to the point of being artificial. The above example is only meant to give some insight into skewness-based projection pursuit.

Sample directional skewness is similarly defined. Let m, S and X denote the sample mean, the sample variance and the data matrix whose rows are the vectors x1^T, ..., xn^T. Then the directional skewness of X is the statistic

g^D_{1,d}(X) = max_{c ∈ S^{d−1}} (1/n) Σ_{i=1}^n [ (c^T xi − c^T m) / √(c^T S c) ]^3.   (5)
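For concreteness, (5) might be evaluated by direct search as in the following sketch (base R; the function names are ours and are not taken from the paper's supplementary code). The unit vector c is parameterized as u/‖u‖, the projected skewness is maximized with optim(), and random restarts guard against local maxima; divisor conventions (n versus n − 1) are immaterial for the resulting direction.

```r
# projected skewness (5) for a direction u (rescaled to the unit sphere)
proj_skew <- function(u, X, S) {
  v <- u / sqrt(sum(u^2))
  y <- drop(X %*% v)                              # projected data c'x_i
  mean((y - mean(y))^3) / drop(t(v) %*% S %*% v)^1.5
}

# crude direct search for the sample directional skewness
dir_skew <- function(X, restarts = 20) {
  S <- cov(X)
  best <- -Inf
  for (r in seq_len(restarts)) {
    opt <- optim(rnorm(ncol(X)), proj_skew, X = X, S = S,
                 control = list(fnscale = -1))    # fnscale = -1: maximize
    best <- max(best, opt$value)
  }
  best
}
```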

In order to gain better insight into the nonnormal features of f(w1, w2, w3; α), we simulated 100000 samples from it, with α = 10. The histogram of the chosen projection maximizing absolute skewness (Figure 1) clearly shows some nonnormality, thus illustrating the advantages of projection pursuit. Following the advice in Cook et al. (1993), we seek further insight by looking at the scatterplot of two projections either maximizing or minimizing absolute skewness, under the constraint of mutual orthogonality (Figure 2). It has a rather unusual three-leaf clover shape. Both the histogram and the scatterplot illustrate the potential of skewness-based projection pursuit for displaying previously undetected data features.


Figure 1: Histogram of a projection maximizing absolute skewness.


Figure 2: Scatterplot of the projection maximizing absolute skewness and of the projection minimizing absolute skewness, orthogonal to each other.
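The simulation just described can be reproduced with the following sampler sketch, which is our own construction (not the paper's supplementary Matlab code). It exploits the fact that flipping the sign of a trivariate standard normal draw w with probability 1 − Φ(α w1 w2 w3) yields exactly the density 2 φ(w1) φ(w2) φ(w3) Φ(α w1 w2 w3).

```r
# sign-flip sampler for f(w1, w2, w3; alpha)
rskewtri <- function(n, alpha) {
  W <- matrix(rnorm(3 * n), n, 3)
  flip <- runif(n) > pnorm(alpha * W[, 1] * W[, 2] * W[, 3])
  W[flip, ] <- -W[flip, ]                  # kept with prob. Phi(alpha*w1*w2*w3)
  W
}
W <- rskewtri(100000, alpha = 10)          # the design used for Figures 1 and 2
```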


Malkovich & Afifi (1973) introduced linear projections maximizing skewness as direct applications of Roy's union-intersection approach to multivariate normality testing. Within this approach, the test rejects the multivariate normality hypothesis whenever the chosen test statistic, calculated from any linear combination of the variables, exceeds a preassigned value. The test statistic, which might be regarded as a projection index, should possess some desirable properties when testing for univariate normality, as for example skewness does. Ferguson (1961) showed an optimal property of sample skewness when testing for the presence of outliers in a normal sample. A simulation study performed by Mendell et al. (1993) hinted that skewness-based tests for univariate normality might well be the most powerful ones when the alternative hypothesis is a two-component, normal, homoscedastic mixture with one mixture weight much smaller than the other. Their empirical findings are supported by the theoretical results in Takemura et al. (2006). Sample skewness also provides the locally most powerful test for normality among the location and scale invariant ones, when the underlying distribution is assumed to be univariate skew-normal (Salvan, 1986). The same result has later been generalized to a wider class of skewed distributions by Franceschini & Loperfido (2014). Cassart et al. (2011) motivated the use of skewness for testing symmetry of nonnormal distributions.

In the above framework, the test statistic takes the role of the projection index. However, projection indices are meant to detect interesting data features, that is, those absent from uninteresting distributions. All distributions which are deemed uninteresting in the projection pursuit literature are symmetric (Diaconis & Freedman, 1984; Jones & Sibson, 1987; Naito, 1997; Nason, 2001; Diaconis & Salzman, 2008). The Kullback-Leibler divergence between a multivariate symmetric distribution and a skewed distribution with the same mean and variance is well approximated by the sum of squared third-order moments of the latter, under mild conditions (Lin et al., 1999). In particular, if both distributions are univariate and standardized, the same divergence is well approximated by the squared skewness of the asymmetric distribution. Hence the projection which maximizes skewness also approximates the most interesting projection, that is, the one which diverges the most from the uninteresting distribution in the Kullback-Leibler sense. The argument is admittedly heuristic, but it motivates skewness as a projection index for all the uninteresting distributions proposed in the projection pursuit literature.

Skewness maximization plays a role in Independent Component Analysis (ICA), a multivariate statistical technique aimed at recovering independent random variables from their linear combinations. The connection between ICA and projection pursuit has been thoroughly investigated by Stone & Porrill (1998). ICA is usually based on kurtosis maximization (see, for example, Hyvarinen et al., 2001), but Paajarvi & Leblanc (2004) showed that skewness maximization poses fewer computational problems. De Lathauwer et al. (1996) highlighted other advantages, which are also instrumental for carrying out kurtosis-based ICA. Skewness-based ICA has been successfully applied, for example, to neuroimaging data (Stone et al., 2002) and to gene expression data (Kim et al., 2008).

Once skewness is detected, it is important to understand its nature and to model it. This problem motivated researchers to look for parametric interpretations of projections with maximal skewness.

This was first done for the multivariate skew-normal distribution by Loperfido (2010) and Balakrishnan & Scarpa (2012). Their results were later generalized to the extended skew-normal distribution (Franceschini & Loperfido, 2014), the skew-t distribution (Arevalillo & Navarro, 2015) and scale mixtures of skew-normal distributions (Kim & Kim, 2017). For mixtures of two symmetric multivariate distributions with equal covariances, projections with maximal skewness coincide, up to an affine transformation, with the best linear discriminant function, when the mixture weights differ from each other (Loperfido, 2013). Loperfido et al. (2017) showed that the same projections have simple analytical structures for multivariate aggregate claim models.

3. Computation

This section describes an algorithm for skewness maximization. Projections maximizing skewness might be conveniently expressed using third moment matrices (third moments, henceforth). Let x = (X1, ..., Xd)^T be a d-dimensional random vector satisfying E(Xi^3) < +∞, for i = 1, ..., d. The third moment of x is the d² × d matrix M_{3,x} = E(x ⊗ x^T ⊗ x), where "⊗" denotes the Kronecker product. For example, the third moment of the random vector (X1, X2, X3)^T is the 9 × 3 matrix

            | m111 m112 m113 |
            | m112 m122 m123 |
            | m113 m123 m133 |
            | m112 m122 m123 |
E(x⊗x^T⊗x) = | m122 m222 m223 |,   (6)
            | m123 m223 m233 |
            | m113 m123 m133 |
            | m123 m223 m233 |
            | m133 m233 m333 |

where m_{ijk} = E(Xi Xj Xk). As a second example, the third moment of the random vector w described in the previous section is

         | 0 0 0 |
         | 0 0 γ |
         | 0 γ 0 |
         | 0 0 γ |
M_{3,w} = | 0 0 0 |.   (7)
         | γ 0 0 |
         | 0 γ 0 |
         | γ 0 0 |
         | 0 0 0 |

Intuitively, when α is positive, any two components of w tend to be positively correlated when an outcome of the third component is higher than its mean,


and negatively correlated in the opposite case. Nonnormality of w becomes apparent when looking at its third moment M_{3,w}, which is not a null matrix. In the following, when referring to the third moment of a random vector, we shall implicitly assume that all appropriate moments exist. The third moment of a linear transformation is evaluated using matrix multiplication, transposition and the tensor product: M_{3,Ax} = (A ⊗ A) M_{3,x} A^T, where A is a k × d real matrix (Christiansen & Loperfido, 2014). The third central moment of x, also known as its third cumulant and denoted by K_{3,x}, is the third moment of x − µ: K_{3,x} = M_{3,x−µ}. We can then represent the directional skewness of a random vector x with its second and third cumulants:

γ^D_{1,d}(x) = max_{c ∈ S^{d−1}} (c^T ⊗ c^T) K_{3,x} c / (c^T Σ c)^{3/2}.   (8)

Let z = Σ^{−1/2}(x − µ) be a standardized random vector, where Σ^{−1/2} is the (unique) symmetric and positive definite square root matrix of the inverse of the covariance matrix Σ, which is assumed to be nonsingular. The third standardized cumulant of x is the third cumulant (moment) of z. It is often denoted either by M_{3,z} or K_{3,z}. The third standardized moment (cumulant) is instrumental in representing directional skewness as the maximum of a homogeneous, third-order polynomial in several variables:

γ^D_{1,d}(x) = max_{c ∈ S^{d−1}} (c^T ⊗ c^T) M_{3,z} c.   (9)

As a direct implication of results in Qi (2006) on tensor eigenvectors, the d-dimensional real vector of unit length v satisfies γ1(v^T x) = γ^D_{1,d}(x) if and only if it also satisfies M^T_{3,z}(v ⊗ v) = λv for the largest real value λ. For example, 1_3 = (1, 1, 1)^T is a tensor eigenvector of w, and it is associated with the tensor eigenvalue 2γ:

| 0 0 0 0 0 γ 0 γ 0 |
| 0 0 γ 0 0 0 γ 0 0 | (1, 1, 1, 1, 1, 1, 1, 1, 1)^T = 2γ (1, 1, 1)^T.   (10)
| 0 γ 0 γ 0 0 0 0 0 |

The dominance of 1_3 and 2γ follows from γ^D_{1,d}(w) being equal to 2γ/√3, as remarked in the previous section.
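The identity (10) can be checked numerically. The following sketch uses a hypothetical value of γ, labeled as such, since γ = E(W1 W2 W3) depends on α.

```r
# numerical check of (10) for an illustrative value of gamma = E(W1 W2 W3)
gam <- 0.2                                    # hypothetical value, for illustration
M3w <- matrix(0, 9, 3)                        # the third moment (7) of w
M3w[cbind(c(2, 3, 4, 6, 7, 8), c(3, 2, 3, 1, 2, 1))] <- gam
v <- c(1, 1, 1)
drop(t(M3w) %*% kronecker(v, v))              # returns 2 * gam * v, as claimed
```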

In the general case, directional skewness must be computed numerically, by an iterative procedure. Unfortunately, third-order polynomials in several variables have several local maxima (Paajarvi & Leblanc, 2004), thus making the choice of the starting value of paramount importance. De Lathauwer et al. (1995) and De Lathauwer et al. (2000) suggest the use of the first right singular vector of the standardized third moment. Their proposal might be described as follows, for an n × d data matrix X and the corresponding sample covariance matrix S, which is assumed to be of full rank. First, standardize the data, obtaining the matrix Z = H_n X S^{−1/2}, where H_n is the n × n centring matrix and S^{−1/2} is the symmetric, positive definite square root of S^{−1}. Second, obtain the third standardized moment M_{3,Z} = (z1 ⊗ z1^T ⊗ z1 + ... + zn ⊗ zn^T ⊗ zn)/n, where zi^T is the i-th row of Z. Third, compute Zg, where g is the dominant eigenvector of M^T_{3,Z} M_{3,Z}, that is, the dominant right singular vector of M_{3,Z}.

De Lathauwer et al. (1995) and De Lathauwer et al. (2000) observed that the above starting value very often lies in the basin of attraction of a globally optimal solution, without giving a theoretical motivation for this empirical finding. The following proposition shows that their proposal leads to projections with maximal skewness when the left dominant singular vector of the third standardized moment is a vectorized, rank-one matrix.

Proposition 1. Let µ and Σ be the mean and the variance of the random vector x. Also, let z = Σ^{−1/2}(x − µ) be the standardized version of x. Finally, let u and v be the left and right dominant singular vectors of the third cumulant of z. Then the random variable v^T z achieves maximal skewness among all linear combinations of x if u is a vectorized matrix of rank one.

The rank-one assumption holds for the two-component, homoscedastic normal mixture (Loperfido, 2013), for the extended skew-normal distribution (Franceschini & Loperfido, 2014) and for the independent components model, if the components' skewnesses differ from each other (Loperfido, 2015b). However, the projection Zg might be a good starting value for an iterative skewness maximization algorithm when the squared dominant eigenvalue of the matrix A accounts for most of its Euclidean norm, where vec(A) is proportional to the dominant left singular vector of M_{3,Z}. This follows from the eigenvalues and associated eigenvectors of real matrices being continuous functions of the matrix elements (see, for example, Ortega, 1987, page 41). We therefore suggest an intuitively appealing method for assessing the quality of the projection Zg as a starting value for the iterative procedure. First, compute the dominant eigenvector v of M_{3,Z} M^T_{3,Z}, that is, the dominant left singular vector of M_{3,Z}. Second, find the dominant eigenvalue ρ of the matrix A satisfying vec(A) = v. Third, compute the ratio ρ²/tr(A²), which is bounded between zero and one. Higher ratios suggest better starting projections.
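For concreteness, the starting value and the quality ratio just described might be sketched in base R as follows. The function name is ours, and the covariance divisor (n versus n − 1) is immaterial for the direction of the resulting projection.

```r
# starting projection Zg of De Lathauwer et al., with the quality ratio rho^2/tr(A^2)
start_value <- function(X) {
  n <- nrow(X); d <- ncol(X)
  e <- eigen(cov(X), symmetric = TRUE)
  S_inv_half <- e$vectors %*% diag(1 / sqrt(e$values), d) %*% t(e$vectors)
  Z <- scale(X, scale = FALSE) %*% S_inv_half          # Z = Hn X S^(-1/2)
  M3 <- matrix(0, d^2, d)                              # third standardized moment
  for (i in seq_len(n)) {
    z  <- Z[i, ]
    M3 <- M3 + kronecker(tcrossprod(z), matrix(z, ncol = 1))  # z (x) z' (x) z
  }
  M3 <- M3 / n
  sv <- svd(M3)
  A  <- matrix(sv$u[, 1], d, d)                        # vec(A) = left singular vector
  ev <- eigen((A + t(A)) / 2, symmetric = TRUE)$values
  quality <- max(ev^2) / sum(ev^2)                     # rho^2 / tr(A^2), in [0, 1]
  list(projection = drop(Z %*% sv$v[, 1]),             # Zg, g = right singular vector
       g = sv$v[, 1], quality = quality)
}
```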

Branco & Dey (2001) defined the skew-t distribution as follows. Let y be a d-dimensional skew-normal random vector with null location vector, scatter matrix Ω and shape parameter α: y ∼ SN_d(0_d, Ω, α). Also, let V be a chi-squared random variable, independent of y and with υ degrees of freedom: V ∼ χ²(υ). Then the random vector x = ξ + √(υ/V) y has a skew-t distribution with location parameter ξ, scatter matrix Ω, shape parameter α and υ degrees of freedom, denoted by x ∼ St_d(ξ, Ω, α, υ). Kim & Mallick (2003) derived the third and fourth moment matrices of x. Azzalini & Genton (2008) investigated the modelling and inferential properties of the skew-t distribution. Arevalillo & Navarro (2015) showed that the projection α^T x achieves maximal skewness. The following proposition shows that the projection u^T z achieves maximal skewness, when z is a standardized skew-t random vector and u is the right dominant singular vector of its third moment.

Proposition 2. Let µ and Σ be the mean and the variance of a d-dimensional, skew-t random vector x. Also, let z = Σ^{−1/2}(x − µ) be the standardized version of x. Then the random variable u^T z achieves maximal skewness among all linear combinations of x, where u is the dominant right singular vector of the third cumulant of z.

We end this section with a cautionary note. Both theoretical and empirical results support dominant right singular vectors of the third standardized moment as starting projections. In some situations, however, their performance may be very poor, as exemplified by the trivariate random vector w = (W1, W2, W3)^T with density 2φ(w1)φ(w2)φ(w3)Φ(αw1w2w3) and third cumulant K_{3,w}. Simple matrix algebra shows that K^T_{3,w} K_{3,w} = 2γ² I_3 is proportional to a 3 × 3 identity matrix, without a simple dominant eigenvalue. As a direct consequence, the right singular vectors of K_{3,w} do not give any hint of 1_3^T w being the projection with maximal skewness. We therefore recommend a closer look at the singular value decomposition of K_{3,w} before using its right dominant singular vector to start the skewness-maximizing routine.

We shall now describe the iteration step of the algorithm for skewness maximization. We shall focus on random vectors, but the algorithm might be trivially modified if a data matrix is of interest. A common choice for the iteration step is the higher-order power method (HOPM), which may be described as follows, for a symmetric, third-order tensor (De Lathauwer et al., 1995; De Lathauwer et al., 2000). Let v0^T x be our initial guess for the projection of x achieving maximal skewness, where v0 is a d-dimensional real vector. Then compute v_{i+1} = x_i/‖x_i‖, where x_i = M^T_{3,z}(v_i ⊗ v_i) and i is an integer ranging from zero to the required number of iterations. Alternatively, the iteration step might be repeated until the distance ‖v_i − v_{i+1}‖ between v_i and v_{i+1} becomes negligible. Unfortunately, the HOPM might not converge to the global maximum, that is, the direction maximizing skewness.
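In the notation above, the HOPM iteration step might be sketched as follows, where M3 denotes the d² × d third standardized moment (the function name is ours).

```r
# higher-order power method: v_{i+1} proportional to M3' (v_i (x) v_i)
hopm <- function(M3, v0, iters = 100, tol = 1e-12) {
  v <- v0 / sqrt(sum(v0^2))
  for (i in seq_len(iters)) {
    x <- drop(t(M3) %*% kronecker(v, v))   # x_i = M3' (v_i (x) v_i)
    v_new <- x / sqrt(sum(x^2))
    if (sum((v - v_new)^2) < tol) return(v_new)
    v <- v_new
  }
  v
}
```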


As an example, consider the 10 × 3 data matrix

    | 9160 1759 2691 |
    |   12 7218 7655 |
    | 4624 4735 1887 |
    | 4243 1527 2875 |
X = | 4609 3411  911 |.   (11)
    | 7702 6074 5762 |
    | 3225 1917 6834 |
    | 7847 7384 5466 |
    | 4714 2428 4257 |
    |  358 9174 6444 |

We used a direct search method for finding the linear projection Xc which maximizes skewness, where c = (0.816, 0.446, 0.367)^T. Its skewness is 0.938. The skewness of the initial linear projection, obtained from the first right singular vector of the third sample standardized moment, is 0.787. The skewnesses of the projections corresponding to the first, second and third HOPM iterations are 0.677, 0.498 and 0.173, respectively. Hence each iteration of the HOPM decreases the skewness, leading to projections far away from the optimal one.

In order to overcome this problem, we propose an iteration step which builds upon convenient analytical properties of skewness maximization for bivariate random vectors. We look for a linear combination a1 X1 + a2 X2 of the random vector x = (X1, X2)^T with maximal skewness, as measured by its third standardized moment γ1(a1 X1 + a2 X2). We shall follow the approach outlined in De Lathauwer et al. (2000) for the best rank-one approximation of third-order, symmetric and binary tensors. Our presentation will be more detailed and less technical, in order to illustrate the method without requiring any background in multilinear algebra.

Skewness maximization depends neither on location nor on scale. Hence, without loss of generality, we shall focus on the standardized random vector z = (Z1, Z2)^T = Σ^{−1/2}(x − µ). We shall obtain a simple analytical form for the projection v1 Z1 + v2 Z2 with maximal skewness, where v = (v1, v2)^T is a real, bivariate vector of unit norm: v ∈ R², v^T v = v1² + v2² = 1. Clearly, the vector a = (a1, a2)^T is a simple linear function of v: a = Σ^{−1/2} v.

The assumptions E(Z1) = E(Z2) = E(Z1 Z2) = 0 and v^T v = E(Z1²) = E(Z2²) = 1 imply that γ1(v^T z) = α30 v1³ + 3 α21 v1² v2 + 3 α12 v1 v2² + α03 v2³, where αij = E(Z1^i Z2^j), for i, j = 0, 1, 2, 3 and i + j = 3. Equivalently, we might write

γ1(v1 Z1 + v2 Z2) = (v^T ⊗ v^T) K_{3,z} v,   where K_{3,z} = | α30 α21 |.   (12)
                                                            | α21 α12 |
                                                            | α21 α12 |
                                                            | α12 α03 |

The skewness of v^T z attains its maximum if v is a solution of the equation (∂/∂v)[γ1(v^T z) − λ(v^T v − 1)] = (0, 0)^T corresponding to the maximum value of λ. Standard differentiation techniques lead to the system of equations

α30 v1² + 2 α21 v1 v2 + α12 v2² = (2/3) λ v1,
α21 v1² + 2 α12 v1 v2 + α03 v2² = (2/3) λ v2.   (13)

The vector v is proportional to (1, 0)^T if and only if α21 equals zero. In order to rule out this case, we shall assume that both α21 and v2 differ from zero. First, eliminate λ by subtracting the first equation, multiplied by v2, from the second equation, multiplied by v1:

α21 v1³ + (2 α12 − α30) v1² v2 + (α03 − 2 α21) v1 v2² − α12 v2³ = 0.   (14)

Next, divide each side of the above equation by v2³ and let t = v1/v2:

α21 t³ + (2 α12 − α30) t² + (α03 − 2 α21) t − α12 = 0.   (15)

The above polynomial has three roots in the complex plane, all of them with a simple analytical form. At least one of them is real and corresponds to maximum absolute skewness. This follows from the absolute skewness of a linear combination being never greater than the first singular value of K_{3,z} (Loperfido, 2015a). The identities t = v1/v2 and v1² + v2² = 1 imply that v equals (t0, 1)^T/√(1 + t0²), where t0 is an appropriate real root of the above polynomial.

The result obtained for the bivariate case is useful when addressing the problem of skewness maximization for random vectors of any dimension. Let Y0 = a0^T x be our initial guess for the projection of x achieving maximal skewness, where a0 is a d-dimensional real vector. Replace the first component of a0 with zero, name the resulting vector a1 and let Y1 = a1^T x be another projection of x. Since (X1, Y1)^T is a bivariate random vector, the scalar

c1 = arg max_{h ∈ R} γ1(h X1 + Y1)   (16)

admits an analytical and easily computable representation. Now repeat the two following steps for i = 2, ..., d. First, replace the i-th and the (i − 1)-th components of a_{i−1} with zero and c_{i−1}, respectively, to obtain the vector a_i and the projection Y_i = a_i^T x. Then compute the scalar

c_i = arg max_{h ∈ R} γ1(h X_i + Y_i).   (17)

By construction, γ1(c1 X1 + Y1) ≤ ... ≤ γ1(cd Xd + Yd). Next, replace the initial guess a0 by ad, where the final component has been replaced by cd, and begin the procedure again, repeating for the required number of times or until a chosen condition is met.
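The analytically tractable inner step can be sketched as follows for a standardized pair (Z1, Z2), solving the cubic (15) with polyroot() and picking the real root with the largest absolute projected skewness. The function name is ours; it assumes α21 ≠ 0, as in the derivation above, and in the full sweep each pair (Xi, Yi) would first be standardized.

```r
# closed-form bivariate step: skewness-maximizing unit vector from the cubic (15)
max_skew_2d <- function(a30, a21, a12, a03) {
  # polyroot() takes coefficients in increasing order of degree
  r  <- polyroot(c(-a12, a03 - 2 * a21, 2 * a12 - a30, a21))
  t0 <- Re(r[abs(Im(r)) < 1e-8])                        # keep the real roots
  V  <- rbind(t0, 1) / rep(sqrt(1 + t0^2), each = 2)    # v = (t, 1)'/sqrt(1 + t^2)
  g1 <- a30 * V[1, ]^3 + 3 * a21 * V[1, ]^2 * V[2, ] +
        3 * a12 * V[1, ] * V[2, ]^2 + a03 * V[2, ]^3
  V[, which.max(abs(g1))]                               # root with maximal |gamma1|
}
```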

We shall now apply the proposed method to the data matrix X. The skewness of the initial linear projection, obtained from the first right singular vector of the third sample standardized moment, is 0.787. The skewnesses of the projections corresponding to the first, second and third iterations are 0.910, 0.938 and 0.938, respectively. Hence no iteration of our method decreases the skewness, which achieves its maximum value after two iterations only.

We shall now investigate the use of several directions in skewness-based projection pursuit. As we did before, we shall assume without loss of generality that z is a d-dimensional, standardized random vector, so that the skewness of a linear combination v^T z of z is γ1(v^T z) = E[(v^T z)³](v^T v)^{−1.5}, where v is a d-dimensional, real vector. There are at least two approaches, which we shall refer to as blockwise and sequential.

The blockwise approach looks for k ≤ d mutually orthogonal, real vectors of unit length v1, ..., vk maximizing the sum of the skewnesses E[(v1^T z)³], ..., E[(vk^T z)³]. Equivalently, the vectors v1, ..., vk minimize the function

‖ M_{3,z} − Σ_{i=1}^k λi · ui ⊗ ui^T ⊗ ui ‖²   (18)

among all d-dimensional vectors u1, ..., uk satisfying ui^T ui = 1 and ui^T uj = 0, where i, j = 1, ..., k and i ≠ j, for an appropriate choice of the nonnegative constants λ1, ..., λk. From a multilinear point of view, the above optimization problem is equivalent to searching for the best approximation of a third-order, symmetric tensor with another tensor of symmetric, orthogonal rank k (Comon et al., 2008).

The sequential approach might be operationally described as follows. First, find the projection v^T z achieving maximal skewness. Second, compute the residuals of the linear regression of z on v^T z, together with their covariance Σ. Third, compute the principal components of the residuals associated with positive eigenvalues of Σ. Fourth, replace the original variables with the principal components and repeat the first three steps until the required number of projections is obtained. Principal components are instrumental in obtaining linear projections which are uncorrelated with the previously found skewed components.
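The four steps above might be sketched as follows, assuming a routine max_skew_dir() (a placeholder name of ours) that returns the skewness-maximizing unit vector of standardized data, for instance the starting value above refined by the proposed iteration.

```r
# sequential (deflation) approach to several skewed components
sequential_skew <- function(Z, k, max_skew_dir) {
  P <- matrix(NA_real_, nrow(Z), k)                # skewed components
  for (j in seq_len(k)) {
    v <- max_skew_dir(Z)
    p <- drop(Z %*% v)
    P[, j] <- p
    R  <- Z - p %*% (crossprod(p, Z) / sum(p^2))   # residuals of Z regressed on p
    pc <- prcomp(R)
    Z  <- pc$x[, pc$sdev > 1e-8, drop = FALSE]     # PCs with positive variance
    Z  <- scale(Z)                                 # restandardize before repeating
  }
  P
}
```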

The blockwise and the sequential approaches lead to the same results when applied to principal components, since subtracting from a symmetric matrix its best rank-one approximation decreases the matrix rank. Unfortunately, the same does not hold for higher-order tensors (Kofidis & Regalia, 2002). Hence blockwise and sequential skewness-based projection pursuit may lead to different results. On top of that, best lower-rank approximation is plagued by computational difficulties, which are well reviewed by De Lathauwer et al. (2000) and Comon et al. (2008). We therefore chose to follow the sequential approach.

The sequential search for projections maximizing skewness might be carried out using deflation-based FastICA with the non-linearity function g(x) = x² (Hyvarinen & Oja, 1997; Hyvarinen et al., 2001). The FastICA method is a gradient-based algorithm with constant step size (Zarzoso & Comon, 2007). It is fairly popular, especially in signal processing, and it is implemented in several packages, as for example the R package fICA (Miettinen et al., 2017). Unfortunately, FastICA converges to local maxima, which may not be the global ones (Miettinen et al., 2014). Moreover, the convergence of FastICA slows down or even fails in the presence of saddle points (Tichavsky et al., 2006). We computed the projection with maximal skewness of the Iris dataset using both the proposed method and FastICA. The maximal skewness attained with the former method (i.e. 0.968) is about 13% greater than the skewness obtained with the latter method (i.e. 0.852). We obtained similar results with other datasets. These findings do not imply that the proposed method is better than FastICA. The former is specifically designed for skewness maximization, while the latter is a general purpose method. Also, skewness maximization by FastICA could benefit from the starting value proposed in this paper, and the proposed iteration step might be used to improve the performance of FastICA in the presence of local maxima and saddle points.

The proposed method is implemented in the R package MaxSkew, by Cinzia Franceschini and Nicola Loperfido. The package finds mutually orthogonal data projections which maximize skewness. More precisely, the first data projection (i.e. the first skewed component) maximizes skewness among all data projections. The second projection (i.e. the second skewed component) maximizes skewness among all linear projections orthogonal to the first skewed component, and so on. The package is meant for those interested in multivariate statistics, and particularly in independent component analysis, projection pursuit, clustering and outlier detection. It is freely available at https://CRAN.R-project.org/package=MaxSkew. A bootstrap test for symmetry based on skewness maximization is implemented in the R package MultiSkew (Franceschini & Loperfido, 2017b), which calls the R package MaxSkew.
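A hypothetical usage sketch on the Iris data mentioned above follows; the argument names are assumptions on our part and should be checked against the package documentation before use.

```r
# hypothetical call to the MaxSkew package; verify argument names with help(MaxSkew)
# install.packages("MaxSkew")
library(MaxSkew)
MaxSkew(iris[, 1:4], iterations = 5, components = 2, plot = TRUE)
```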

4. Simulations

In this section we use simulations to assess the performance of the proposed method and to investigate the sampling distribution of directional skewness. The first simulation study compares the method described in the previous section with the higher-order power method (HOPM) in their ability to compute the maximal skewness attainable by a linear data projection. We simulated 10000 samples of n = 25, 50, 75 and 100 observations with d = 4, 6, 8 and 10 variables from the mixture π1 N_d(0_d, I_d) + (1 − π1) N_d(1_d, I_d), for π1 = 0.1, 0.2, 0.3, 0.4. The symbols 0_d, 1_d and I_d denote the d-dimensional vector of zeros, the d-dimensional vector of ones and the d-dimensional identity matrix, respectively. For each simulated sample, we computed the ratio of the skewness obtained with the HOPM to the skewness obtained with the proposed method, using one and two iterations. Table 1 reports the integer part of the average ratios, multiplied by 100.
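A single sample from the mixture may be generated as follows (a minimal sketch; the function name is ours).

```r
# one sample from pi1 * N_d(0, I) + (1 - pi1) * N_d(1, I)
rmix <- function(n, d, pi1) {
  shifted <- rbinom(n, 1, 1 - pi1)        # rows drawn from the N_d(1, I) component
  matrix(rnorm(n * d), n, d) + shifted    # adds 1 to every entry of shifted rows
}
X <- rmix(n = 100, d = 4, pi1 = 0.3)
```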


The simulations' results might be summarized as follows. Our method always outperforms the HOPM. The difference tends to decrease when more variables are present and when more iterations are performed. Also, the difference is very little affected by the sample size, that is, the number of sampled units. Finally, the difference slightly increases as skewness decreases, that is, when the difference between the components' weights decreases. However, the mean is notoriously sensitive to outlying observations, which in the simulation study under consideration are samples where the HOPM fails completely. Hence we also computed the medians of the same ratios (not reported here for the sake of brevity), and found them to be closer to one, although still smaller than one. When using more iterations the median ratios approached one. These findings suggest that, in most cases, the HOPM and the proposed algorithm attain convergence to the global maximum, with the former method requiring just a few more iterations (two or three, say), but there are cases where the HOPM fails completely.


Table 1: Integer part of the average ratios, multiplied by 100, of the skewnesses obtained with the HOPM to the skewnesses obtained with our method, for π1 = 0.1, 0.2, 0.3, 0.4, d = 4, 6, 8, 10 and n = 25, 50, 75, 100.

                       One iteration           Two iterations
Units  Variables   0.1   0.2   0.3   0.4    0.1   0.2   0.3   0.4
  25       4        77    77    77    76     80    80    80    79
  25       6        79    79    79    78     86    86    85    85
  25       8        83    83    83    82     93    92    92    92
  25      10        87    87    88    88     96    96    97    96
  50       4        77    77    76    75     80    80    79    78
  50       6        76    75    73    73     83    81    80    79
  50       8        78    77    76    76     86    85    84    84
  50      10        80    79    79    79     90    89    89    89
  75       4        77    77    75    75     81    80    79    77
  75       6        76    74    72    71     82    79    78    76
  75       8        76    73    72    73     84    82    80    79
  75      10        78    75    75    74     87    85    84    83
 100       4        78    78    76    75     82    81    79    77
 100       6        77    75    72    71     82    79    77    74
 100       8        78    73    70    70     84    81    77    77
 100      10        78    73    71    71     86    82    80    79

The computational method presented in this section also addresses some inferential issues which arise in projection pursuit. Formal testing allows us to decide whether the apparent structure detected by projection pursuit is real or just the effect of noise (Sun, 2006). Under normality, sample directional skewness has an asymptotic distribution (Machado, 1983), but its analytical form is currently unknown. Friedman (1987) proposes to circumvent the problem by assessing the significance of results through a comparison of the observed value of the projection index with the sampling distribution of the same index, obtained by simulating many random samples from a Gaussian distribution of the same dimension and cardinality as the data. His approach has been criticised by Posse (1995), on the grounds of being too computationally expensive for moderate to large data sets, since a complex optimization must be performed for each generated sample. The method described in this section eases the computational problems in Friedman's proposal and also gives some insights into the sampling distribution of directional skewness. More precisely, we shall use the univariate skew-normal distribution to model it.

Its definition easily follows from the multivariate case, but we shall follow a slightly different parameterization for notational ease. A random variable X ∈ R has a univariate skew-normal distribution with location parameter ξ, scale parameter ψ and shape parameter λ if its probability density function is

f(x; ξ, ψ, λ) = (2/ψ) φ((x − ξ)/ψ) Φ(λ (x − ξ)/ψ),   (19)

where λ, ξ, x ∈ R, ψ ∈ R+ and φ(·) and Φ(·) denote the probability density function and the cumulative distribution function of a standard normal distribution, respectively (see, for example, Christiansen & Loperfido, 2014). When f(x; ξ, ψ, λ) is the pdf of X we shall write X ∼ SN(ξ, ψ, λ). The parameters ξ, ψ, λ do not in general equal the expectation µ, the standard deviation σ, the skewness γ1 or the kurtosis β2. However, µ, σ, γ1 and β2 are simple functions of ξ, ψ and λ. The skew-normal distribution has been used to approximate the asymptotic distribution of the sample mean of count data (Chang et al., 2008), dependent data (Bartoletti & Loperfido, 2010) and multivariate data (Gupta & Kollo, 2003; Christiansen & Loperfido, 2014) when the parent distribution is skewed. However, to the best of our knowledge, the skew-normal distribution has never received any attention as an approximation to the sampling distribution of directional skewness.

We shall first graphically motivate this novel use of the skew-normal distribution by simulating 10000 samples of size 100 from a bivariate normal distribution. For each sample we computed the maximal skewness attainable by a linear combination of its variables. Then we fitted a skew-normal distribution to these skewnesses. Figure 3 shows their histogram, with the fitted skew-normal density superimposed. Figure 4 shows their PP-plot with respect to the cdf of the fitted skew-normal distribution. Both graphs clearly hint that the skew-normal provides a very good fit to sample directional skewnesses, at least in the bivariate case.

We also simulated 10000 samples of size n = 50, 100, 150, 200, 250, 300 from a multivariate normal distribution with d = 2, 4, 6, 8, 10 components. For each sample of given size and cardinality, we computed the directional skewness and fitted the skew-normal distribution SN(ξ̃, ψ̃, λ̃), where ξ̃, ψ̃ and λ̃ are the method of moments estimates of the location, scale and shape parameters. Then we computed the values q_i = (s_i − ξ̃)²/ψ̃², where s_i is the directional skewness of the i-th sample, for i = 1, ..., 10000. Azzalini (1985) showed that (X − ξ)²/ψ² ∼ χ²₁ when X ∼ SN(ξ, ψ, λ), so that the sampling distribution of the q_i's should be approximately chi-squared with one degree of freedom. Finally, we obtained the relative frequency of q_i values smaller than or equal to the α-th quantile of a chi-squared distribution with one degree of freedom, where α = 0.25, 0.75, 0.90, 0.95, 0.99.
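The method of moments fit used here admits a simple closed form, obtained by inverting the skewness equation of SN(ξ, ψ, λ); the following sketch (our own function name) assumes the sample skewness lies below the skew-normal bound of about 0.9953.

```r
# method-of-moments fit of the univariate skew-normal SN(xi, psi, lambda)
sn_mom <- function(x) {
  m  <- mean(x); s <- sd(x)
  g1 <- mean((x - m)^3) / s^3
  b  <- (2 * abs(g1) / (4 - pi))^(2 / 3)
  u  <- sign(g1) * sqrt(b / (1 + b))      # u = delta * sqrt(2 / pi)
  delta <- u * sqrt(pi / 2)
  psi   <- s / sqrt(1 - u^2)
  c(xi = m - psi * u, psi = psi, lambda = delta / sqrt(1 - delta^2))
}
# q-values as in the text, with `skewnesses` a hypothetical vector of the s_i:
# fit <- sn_mom(skewnesses); q <- ((skewnesses - fit["xi"]) / fit["psi"])^2
# mean(q <= qchisq(0.95, 1))
```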


Table 2: Number of q_i values smaller than or equal to the α-th quantile of a chi-squared distribution with one degree of freedom, where α = 0.25, 0.75, 0.90, 0.95, 0.99, n = 50, 100, 150, 200, 250, 300 and d = 4, 6, 8, 10. Each column's header reports the expected number, were the distribution chi-squared with one degree of freedom.

Units  Variables   2500   7500   9000   9500   9900
  50       4       2697   7704   9034   9470   9837
  50       6       2593   7618   9050   9508   9862
  50       8       2472   7610   9063   9516   9864
  50      10       2341   7658   9054   9533   9868
 100       4       2597   7655   9047   9493   9846
 100       6       2530   7666   9052   9481   9867
 100       8       2358   7576   9090   9515   9882
 100      10       2299   7540   9086   9540   9894
 150       4       2555   7554   9041   9509   9872
 150       6       2464   7630   9069   9517   9871
 150       8       2310   7592   9093   9522   9880
 150      10       2204   7587   9144   9555   9883
 200       4       2497   7566   9035   9480   9886
 200       6       2461   7503   9041   9524   9895
 200       8       2290   7511   9077   9563   9902
 200      10       2228   7544   9122   9557   9887
 250       4       2536   7506   9000   9515   9894
 250       6       2375   7558   9063   9523   9894
 250       8       2418   7521   9068   9537   9882
 250      10       2334   7569   9067   9516   9880
 300       4       2553   7500   9017   9500   9891
 300       6       2445   7532   9045   9516   9891
 300       8       2402   7531   9083   9535   9896
 300      10       2459   7532   9050   9515   9888

Figure 3: Histogram of the directional skewnesses obtained from 10000 simulated samples of size 100 from a bivariate normal distribution, with superimposed the fitted skew-normal density.


Table 2 reports the simulation's results. The headers of the third, fourth, fifth, sixth and seventh columns are the expected numbers of q_i's smaller than the first quartile, third quartile, ninth decile, last ventile and last centile of the chi-squared distribution with one degree of freedom, under the assumption that the q_i's are drawn from this distribution. For nearly every quantile under consideration, the number of observed q_i's not exceeding it differs by less than 10% from the corresponding expected value. Moreover, the difference is most of the time below 5% and sometimes below 1%. The only minor exceptions occur with the first quartiles of samples of size 150 and 200 drawn from ten-variate distributions, where the differences are 11.84% and 10.88%, respectively. The fit seems to improve in the higher quantiles and does not seem to be significantly affected by either the number of observations or the number of variables. We conclude that the skew-normal distribution satisfactorily fits the sampling distribution of directional skewness, especially in the tails, even for moderate sample sizes. Regrettably, we lack a theoretical understanding of these empirical findings. Our current research investigates the connection between the skew-normal distribution and the maximum of a Gaussian process, the latter being related to the asymptotic distribution of directional skewness (Baringhaus & Henze, 1991). The paper by Naito (1997) contains many interesting results and insightful comments, which could be the starting point for understanding the asymptotic, null distribution of the directional sample skewness.


Figure 4: PP-plot of the directional skewnesses obtained from 10000 simulated samples of size 100 from a bivariate normal distribution, with respect to the cdf of the fitted skew-normal distribution.


The last simulation study assesses the proposed initialization method by comparing the corresponding skewness with the maximal skewness achievable by a linear combination obtained from the same data. To this purpose, we simulated 1000 samples of size n = 50, 100, 150, 200 and dimension d = 5, 10, 15, 20 from the skew-t distribution with 10 degrees of freedom, St_d(0_d, I_d, α 1_d, 10). The shape parameter α = 1, 10 controls skewness: the greater the parameter, the greater the skewness. For each simulated sample we computed the ratio of the skewness achieved by the proposed initialization method to the maximal skewness achievable by a linear combination. For each sample size, dimension and shape parameter we computed the mean, the median, the skewness and the kurtosis (i.e. the fourth standardized moment) of the ratios.

Table 3 reports the simulation's results, which may be summarized as follows. The simulations show some merit of the proposed initialization method, since the mean and median ratios are never smaller than 0.77 and 0.86, respectively. Its performance gets better as either the shape parameter or the sample size increases, but gets worse as the number of variables increases. As a practical implication, fewer iteration steps, or no iterations at all, are needed when the sample size is large, the data are markedly skewed and the number of variables is moderate. All the ratios are heavily and negatively skewed, as well as being very leptokurtic, hinting at the presence of small, outlying observations. As a direct consequence, the mean ratio is always smaller than the median ratio, the latter being more robust to outliers than the former. These facts suggest that the performance of the initialization method is very often good, but might be hampered by outlying observations coming from the heavy tails of the skew-t distribution.
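The skew-t samples used above may be generated through the stochastic representation recalled in Section 3; the following is a sketch for the case Ω = I_d (the function name is ours, and the skew-normal part uses the standard convolution representation).

```r
# sampling from St_d(0, I, alpha * 1_d, nu): x = sqrt(nu / V) * y,
# with y ~ SN_d(0, I, alpha * 1_d) and V ~ chi^2_nu independent of y
rst <- function(n, d, alpha, nu) {
  a     <- rep(alpha, d)
  delta <- a / sqrt(1 + sum(a^2))                  # for Omega = I_d
  L     <- chol(diag(d) - tcrossprod(delta))
  Y     <- abs(rnorm(n)) %*% t(delta) + matrix(rnorm(n * d), n, d) %*% L
  Y * sqrt(nu / rchisq(n, nu))                     # row-wise chi-square scaling
}
X <- rst(n = 100, d = 5, alpha = 10, nu = 10)
```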

Table 3: Summary statistics of the ratios, for each combination of sample size, number of variables and shape parameter, between the skewness obtained with the proposed initialization method and the maximal skewness achievable by a linear combination. The columns mean and med contain the integer part of the mean and median ratios multiplied by 100. The columns skew and kurt contain the skewnesses (i.e. the third standardized moments) and the kurtoses (i.e. the fourth standardized moments) of the same ratios.

                       α = 1                          α = 10
 d     n    mean  med   skew    kurt      mean  med   skew    kurt
 5    50     83    94  -1.82    5.49        85    95  -2.01    6.19
 5   100     85    95  -2.00    6.18        93    98  -3.57   16.47
 5   150     90    97  -2.68    9.93        95    99  -4.97   30.62
 5   200     91    98  -3.25   14.15        97    99  -8.22   98.38
10    50     78    89  -1.38    3.83        78    90  -1.45    4.08
10   100     81    92  -1.72    5.14        84    93  -2.10    6.66
10   150     85    94  -2.16    7.43        87    95  -2.58    9.75
10   200     87    94  -2.50    9.43        90    96  -3.03   12.49
15    50     77    87  -1.40    4.17        77    88  -1.37    3.83
15   100     81    91  -1.71    5.12        81    91  -1.64    5.02
15   150     80    90  -1.63    4.83        82    92  -1.82    5.48
15   200     84    93  -1.99    6.44        84    93  -2.05    6.84
20    50     77    86  -1.45    4.38        78    88  -1.50    4.38
20   100     80    90  -1.65    4.95        80    89  -1.62    4.85
20   150     81    90  -1.78    5.51        82    91  -1.90    6.00
20   200     82    91  -1.95    6.25        82    92  -1.87    5.79

5. Application

465

This section illustrates skewness-based projection pursuit using the results of the athletes competing in the decathlon at the Games of the XXXI Olympiad (Rio de Janeiro, Brazil, year 2016). The dataset contains ten variables (the points scored in each event) and 23 cases (the decathletes who scored points in each event). It is freely available at www.iaaf.org, the official website of the IAAF (International Association of Athletics Federations). Confirmatory data analysis is very difficult due to the limited number of cases (especially when compared to the number of variables) and their nonrandom nature (each

21

470

475

480

485

The decathlon is a combined event in athletics consisting of ten events: one hundred meters (100M), long jump (LONGJ), shot put (SHOTP), high jump (HIGHJ), four hundred meters (400M), one hundred and ten meters hurdles (110H), discus throw (DISCT), pole vault (POLEV), javelin throw (JAVET) and one thousand and five hundred meters (1500M). Each athlete had his performances recorded in seconds for track events, meters for throwing events and centimeters for jumping events. All performances are converted into decathlon points (simply points, henceforth) according to the IAAF scoring tables. We chose to use points rather than performances because points are what ultimately matter for ranking decathletes.

The bar chart of the total points scored by each athlete (Figure 5) does not strike the eye with any particular feature. Ashton Eaton, the winner of the decathlon, does not stand far from his competitors, despite having achieved the second best performance of all time (and being regarded as one of the best decathletes ever). Skewness, as measured by the third standardized moment, ranges from −0.988 to 0.579. Kurtosis, as measured by the fourth standardized moment, ranges from 2.001 to 4.500. Multiple scatterplots suffer from the curse of dimensionality, as is apparent from Figure 6. On top of that, its panels give contradictory results.

Figure 5: Bar chart of the total points scored by each decathlete.


We shall now use principal component analysis to get better insight into the data. Figure 7 contains the scatterplot associated with the first two principal components, which account for about 55% of the total variation.


Figure 6: Multiple scatterplot of the points scored by the decathletes in the ten events.


Data appear to be uniformly scattered in the first, third and fourth quadrants of the graph. In the second quadrant there are only three athletes, and two of them appear to be very different from each other. One is Saluri, who ranked last and obtained the lowest score on the first principal component. The other is Nakamura, who ranked second to last and obtained the highest score on the second principal component. However, as remarked by several authors (see, for example, Bouveyron & Brunet-Saumard, 2014), principal component analysis might not detect interesting data features.

We shall address this limitation with skewness-based projection pursuit. Not surprisingly, the skewnesses of the first and second most skewed projections are way higher than the skewnesses of the original variables and their principal components, being 2.647 and 2.417, respectively. More surprisingly, the kurtoses of the first and second most skewed projections are way higher than the kurtoses of the original variables and their principal components, being 10.814 and 10.948, respectively.



Figure 7: Scatterplot of the first principal component (horizontal axis) and the second principal component (vertical axis).



Figure 8: Scatterplot of the most skewed projection (horizontal axis) and the most skewed projection among those orthogonal to it (vertical axis).


Their high values are mostly due to two outlying athletes (Saluri and Taiwo), who are far apart from the bulk of the data and from each other, as is apparent from the scatterplot in Figure 8. Once they are removed, the skewness and kurtosis of the first (second) most skewed projection drop to 0.801 and 3.350 (−0.548 and 2.600), respectively. The first projection hints that Saluri is an outlier, while the second skewed component hints that Taiwo is. The former athlete had already been detected as a potential outlier via principal components. The latter athlete, however, went unnoticed in previous analyses. Their performances are very different. Saluri is a natural candidate as an outlier: he is the lowest-ranked decathlete and scored worst, or nearly so, in several events. On the other hand, Taiwo is more of a surprise as an outlier, since he ranked eleventh and obtained about average scores in nearly all events.

According to Caussinus & Ruiz-Gazen (2009), widespread use of projection pursuit would greatly benefit from real data examples showing its advantages over well-established multivariate techniques. The exploratory data analysis in this section is meant to be a step in this direction. Analysis of the original variables was unable to detect sources of nonnormality. Principal component analysis detected one outlier, but suffered from both masking and swamping effects.


Skewness-based projection pursuit clearly showed the existence of two outlying athletes. However, Cook et al. (1993) caution against putting too much emphasis on structures found in sparse, high-dimensional data. Bayesian analysis might appropriately address the problem, given the limited number of cases and the great amount of informal, but relevant, knowledge about the decathlon. Unfortunately, the inferential approach falls outside the scope of the present section.

6. Conclusion


Skewness maximization plays a prominent role within the framework of projection pursuit: skewness is a valid projection index, symmetry is deemed uninteresting by many authors, and the Kullback-Leibler divergence of a symmetric distribution from an asymmetric one may be approximated by the skewness of the latter. Skewness-based projection pursuit also appears in other fields of multivariate analysis, including multivariate normality testing, independent component analysis and model-based clustering. Its widespread use, however, has been hampered by computational difficulties. For example, the HOPM might fail to maximize the skewness of a linear projection.

The contribution of the present paper is twofold. First, it motivates the right dominant singular vector of the third standardized moment as a starting value of the iterative maximization procedure. Second, it proposes an iterative method which monotonically converges to the global maximum, thus avoiding the drawbacks of the HOPM. The practical relevance of the contribution is illustrated with both real and simulated data. More precisely, in the Decathlon dataset, skewness-based projection pursuit detects a potential outlier which is undetected by PCA.

Skewness-based projection pursuit might benefit from this paper in several ways. First, in the exploratory step, meaningful plots are quickly obtained using the R package MaxSkew (Franceschini & Loperfido, 2017a), which implements the proposed algorithm. Second, in the inferential step, the function SkewBoot in the R package MultiSkew (Franceschini & Loperfido, 2017b), which calls MaxSkew, computes the bootstrap distribution and the corresponding p-value. Third, skewness could be removed by retaining only projections onto a linear subspace orthogonal to the one spanned by the most skewed components, provided the latter account for all skewness in the data. These projections might be obtained using the function MinSkew in MultiSkew.
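The following base-R sketch illustrates the strategy just summarized, under two stated assumptions: (i) the third standardized moment is arranged as a d-squared by d matrix M3 with rows indexed by pairs of variables, which is one common matricization and may differ from the paper's convention; (ii) the refinement step uses the generic optimizer optim() rather than the paper's monotone iteration. It is a rough illustration of the idea, not the proposed algorithm itself.

proj_skew <- function(v, Z) {
  y <- as.vector(Z %*% (v / sqrt(sum(v^2))))  # projection on a unit-norm direction
  mean(y^3)                                   # third moment of the standardized data
}
set.seed(2)
X <- matrix(rnorm(200 * 5), 200, 5)           # placeholder data
E <- eigen(cov(X), symmetric = TRUE)
Sinv2 <- E$vectors %*% diag(1 / sqrt(E$values)) %*% t(E$vectors)  # S^(-1/2)
Z <- scale(X, scale = FALSE) %*% Sinv2        # standardized data
K <- t(apply(Z, 1, function(z) kronecker(z, z)))                  # n x d^2
M3 <- crossprod(K, Z) / nrow(Z)               # d^2 x d third standardized moment
v0 <- svd(M3)$v[, 1]                          # right dominant singular vector
fit <- optim(v0, proj_skew, Z = Z, control = list(fnscale = -1))  # maximize skewness
vmax <- fit$par / sqrt(sum(fit$par^2))        # direction with (locally) maximal skewness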

Skewness-based projection pursuit, albeit very useful, suffers from several limitations. First, it cannot be applied when third moments do not exist, as for skew-t distributions with three degrees of freedom or less. Second, the sampled distribution might be asymmetric and yet have a null matrix as its third cumulant, as happens for some location mixtures of two multivariate skew-normal distributions. Third, as remarked by an anonymous referee, the sampled distribution might be nonnormal and symmetric, as for example the mixture 0.5 · Nd (µ1 , Σ) + 0.5 · Nd (µ2 , Σ). For this distribution, the most interesting

projection is the one which best separates the two clusters, and it coincides with the projection which minimizes kurtosis (Peña & Prieto, 2000). These limitations suggest the following research directions. The first and second might be addressed with projection indices based on measures of skewness other than the third standardized moment; examples are Bowley's measure of skewness and the number of observations greater than the mean. The third might be addressed with projection indices depending on both skewness and kurtosis, such as the one proposed by Jones & Sibson (1987). The same index appears in independent component analysis, as an approximation to the Kullback-Leibler divergence (Hyvarinen et al., 2001, page 115), and in econometrics, where it is commonly known as the Jarque-Bera statistic. The iterative approach described in this paper might be easily adapted to these indices, but it is not clear how the corresponding starting values should be chosen.
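A hedged sketch of such an index follows: squared skewness plus squared excess kurtosis of the projection, with the Jarque-Bera weights 1/6 and 1/24. The random starting value is a deliberate placeholder, precisely because a principled choice of starting value for this index remains an open question.

js_index <- function(v, X) {
  y <- as.vector(X %*% (v / sqrt(sum(v^2))))
  y <- (y - mean(y)) / sd(y)                  # standardized projection
  mean(y^3)^2 / 6 + (mean(y^4) - 3)^2 / 24    # Jarque-Bera-type index
}
set.seed(3)
X <- matrix(rnorm(100 * 4), 100, 4)           # placeholder data
fit <- optim(rnorm(4), js_index, X = X, control = list(fnscale = -1))
fit$par / sqrt(sum(fit$par^2))                # direction maximizing the index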

Acknowledgements

The author would like to thank the Editor, an anonymous Associate Editor and two anonymous Referees for their comments, which greatly helped in improving the quality of the present paper. The author would also like to thank Cinzia Franceschini for her help in developing the R packages MaxSkew and MultiSkew.

7. References

Arevalillo, J., & Navarro, H. (2015). A note on the direction maximizing skewness in multivariate skew-t vectors. Statistics & Probability Letters, 96, 328–332.
Arnold, B., Castillo, E., & Sarabia, J. (2002). Conditionally specified multivariate skewed distributions. Sankhya A, 64, 1–121.
Arnold, B., Castillo, E., & Sarabia, J. (2007). Distributions with generalized skewed conditionals and mixtures of such distributions. Communications in Statistics - Theory and Methods, 36, 1493–1503.
Azzalini, A. (1985). A class of distributions which includes the normal ones. Scandinavian Journal of Statistics, 12, 171–178.
Azzalini, A., & Dalla Valle, A. (1996). The multivariate skew-normal distribution. Biometrika, 83, 715–726.
Azzalini, A., & Genton, M. (2008). Robust likelihood methods based on the skew-t and related distributions. International Statistical Review, 76, 106–129.
Balakrishnan, N., & Scarpa, B. (2012). Multivariate measures of skewness for the skew-normal distribution. Journal of Multivariate Analysis, 104, 73–87.

Baringhaus, L., & Henze, N. (1991). Limit distributions for measures of multivariate skewness and kurtosis based on projections. Journal of Multivariate Analysis, 38, 51–69.
Bartoletti, S., & Loperfido, N. (2010). Modelling air pollution data by the skew-normal distribution. Stochastic Environmental Research and Risk Assessment, 24, 513–517.
Berro, A., Marie-Sainte, S., & Ruiz-Gazen, A. (2010). Genetic algorithms and particle swarm optimization for exploratory projection pursuit. Annals of Mathematics and Artificial Intelligence, 60, 153–178.

Bouveyron, C., & Brunet-Saumard, C. (2014). Model-based clustering of high-dimensional data: A review. Computational Statistics and Data Analysis, 71, 52–78.
Branco, M., & Dey, D. (2001). A general class of skew-elliptical distributions. Journal of Multivariate Analysis, 79, 99–113.
Cassart, D., Hallin, M., & Paindaveine, D. (2011). A class of optimal tests for symmetry based on local Edgeworth approximations. Bernoulli, 17, 1063–1094.
Caussinus, H., & Ruiz-Gazen, A. (2009). Exploratory projection pursuit. In G. Govaert (Ed.), Data Analysis (pp. 76–92). Wiley.
Chang, C., Lin, J., Pal, N., & Chiang, M. (2008). A note on improved approximation of the binomial distribution by the skew-normal distribution. The American Statistician, 62, 167–170.
Christiansen, M., & Loperfido, N. (2014). Improved approximation of the sum of random vectors by the skew-normal distribution. Journal of Applied Probability, 51, 466–482.

Comon, P., Golub, G., Lim, L., & Mourrain, B. (2008). Symmetric tensors and symmetric tensor rank. SIAM Journal on Matrix Analysis and Applications, 30, 1254–1279.
Cook, D., Buja, A., & Cabrera, J. (1993). Projection pursuit indexes based on orthonormal function expansions. Journal of Computational and Graphical Statistics, 2, 225–250.
Daszykowski, M. (2007). From projection pursuit to other unsupervised chemometric techniques. Journal of Chemometrics, 21, 270–279.
De Lathauwer, L., Comon, P., De Moor, B., & Vandewalle, J. (1995). Higher-order power method: Application in independent component analysis. In Proceedings of the International Symposium on Nonlinear Theory and its Applications (NOLTA '95) (pp. 91–96). Las Vegas, NV.

De Lathauwer, L., De Moor, B., & Vandewalle, J. (1996). Blind source separation by simultaneous third-order tensor diagonalization. In Proceedings of the 8th European Signal Processing Conference.
De Lathauwer, L., De Moor, B., & Vandewalle, J. (2000). On the best rank-1 and rank-(R1, R2, ..., RN) approximation of higher-order tensors. SIAM Journal on Matrix Analysis and Applications, 21, 1324–1342.
Diaconis, P., & Freedman, D. (1984). Asymptotics of graphical projection pursuit. The Annals of Statistics, 12, 793–815.

Diaconis, P., & Salzman, J. (2008). Projection pursuit for discrete data. IMS Collections, Probability and Statistics: Essays in Honor of David A. Freedman, 2, 265–288.
Ferguson, T. (1961). On the rejection of outliers. In Proc. Fourth Berkeley Symp. on Math. Statist. and Prob. (Vol. 1, pp. 253–287). Univ. of Calif. Press.
Franceschini, C., & Loperfido, N. (2014). Testing for normality when the sampled distribution is extended skew-normal. In M. Corazza & C. Pizzi (Eds.), Mathematical and Statistical Methods for Actuarial Sciences and Finance (pp. 159–170). Springer.
Franceschini, C., & Loperfido, N. (2017a). MaxSkew: Skewness-Based Projection Pursuit. R package version 1.1. URL: https://CRAN.R-project.org/package=MaxSkew.
Franceschini, C., & Loperfido, N. (2017b). MultiSkew: Measures, Tests and Removes Multivariate Skewness. R package version 1.1.1. URL: https://CRAN.R-project.org/package=MultiSkew.
Friedman, J. (1987). Exploratory projection pursuit. Journal of the American Statistical Association, 82, 249–266.

Friedman, J., & Tukey, J. (1974). A projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computers, C-23, 881–889.
Gupta, A., & Kollo, T. (2003). Density expansions based on the multivariate skew-normal distribution. Sankhya, 65, 821–835.
Hall, P. (1989). On polynomial-based projection indices for exploratory projection pursuit. The Annals of Statistics, 17, 589–605.
Huber, P. (1985). Projection pursuit (with discussion). Annals of Statistics, 13, 435–475.
Hui, G., & Lindsay, B. (2010). Projection pursuit via white noise matrices. Sankhya B, 72, 123–153.

Hyvarinen, A., Karhunen, J., & Oja, E. (2001). Independent Component Analysis. New York, NY: John Wiley & Sons.
Hyvarinen, A., & Oja, E. (1997). A fast fixed-point algorithm for independent component analysis. Neural Computation, 9, 1483–1492.
Jones, M. C., & Sibson, R. (1987). What is projection pursuit? (with discussion). Journal of the Royal Statistical Society A, 150, 1–38.
Kim, H., & Kim, C. (2017). Moments of scale mixtures of skew-normal distributions and their quadratic forms. Communications in Statistics - Theory and Methods, 46, 1117–1126.
Kim, H., & Mallick, B. (2003). Moments of random vectors with skew-t distribution and their quadratic forms. Statistics and Probability Letters, 63, 417–423.
Kim, S., Kim, J., & Choi, S. (2008). Independent arrays or independent time courses for gene expression time series analysis. Neurocomputing, 71, 2377–2387.

Kofidis, E., & Regalia, P. (2002). On the best rank-1 approximation of higher-order supersymmetric tensors. SIAM Journal on Matrix Analysis and Applications, 23, 863–884.
Kruskal, J. (1969). Toward a practical method which helps uncover the structure of a set of multivariate observations by finding the linear transformation which optimizes a new index of condensation. In Statistical Computation (pp. 427–440).
Lin, J., Saito, N., & Levine, R. (1999). Edgeworth approximation of the Kullback-Leibler distance towards problems in image analysis. Technical report, University of California, Davis. Available at http://www.math.ucdavis.edu/saito.
Loperfido, N. (2010). Canonical transformations of skew-normal variates. TEST, 19, 146–165.
Loperfido, N. (2013). Skewness and the linear discriminant function. Statistics & Probability Letters, 83, 93–99.
Loperfido, N. (2015a). Vector-valued skewness for model-based clustering. Statistics & Probability Letters, 99, 230–237.

Loperfido, N. (2015b). Singular value decomposition of the third multivariate moment. Linear Algebra and its Applications, 473, 202–216.
Loperfido, N., Mazur, S., & Podgórski, K. (2017). Third cumulant for multivariate aggregate claim models. Scandinavian Actuarial Journal, to appear.
Machado, S. (1983). Two statistics for testing for multivariate normality. Biometrika, 70, 713–718.
Malkovich, J., & Afifi, A. (1973). On tests for multivariate normality. Journal of the American Statistical Association, 68, 176–179.
Mendell, N., Finch, S., & Thode, H. (1993). Where is the likelihood ratio test powerful for detecting two component normal mixtures? Biometrics, 49, 907–915.

Miettinen, J., Nordhausen, K., Oja, H., & Taskinen, S. (2014). Deflation-based FastICA with adaptive choices of nonlinearities. IEEE Transactions on Signal Processing, 62, 5716–5724.
Miettinen, J., Nordhausen, K., Oja, H., & Taskinen, S. (2017). fICA: Classical, Reloaded and Adaptive FastICA Algorithms. R package version 1.1-0. URL: https://CRAN.R-project.org/package=fICA.
Naito, K. (1997). A generalized projection pursuit procedure and its significance level. Hiroshima Mathematical Journal, 27, 513–554.
Nason, G. (1995). Three-dimensional projection pursuit. Applied Statistics, 44, 411–430.
Nason, G. (2001). Robust projection indices. Journal of the Royal Statistical Society B, 63, 551–567.
Ortega, J. (1987). Matrix Theory: A Second Course. New York, NY: Plenum Press.

Paajarvi, P., & Leblanc, J. (2004). Skewness maximization for impulsive sources in blind deconvolution. In Proceedings of the 6th Nordic Signal Processing Symposium - NORSIG. Espoo, Finland.
Pan, J., Fung, W., & Fang, K. (2000). Multiple outlier detection in multivariate data using projection pursuit techniques. Journal of Statistical Planning and Inference, 83, 153–167.
Peña, D., & Prieto, F. (2000). The kurtosis coefficient and the linear discriminant function. Statistics & Probability Letters, 49, 257–261.
Peña, D., Prieto, F., & Viladomat, J. (2010). Eigenvectors of a kurtosis matrix as interesting directions to reveal cluster structure. Journal of Multivariate Analysis, 101, 1995–2007.
Posse, C. (1995). Projection pursuit exploratory data analysis. Computational Statistics and Data Analysis, 20, 669–687.
Qi, L. (2006). Rank and eigenvalues of a supersymmetric tensor, the multivariate homogeneous polynomial and the algebraic hypersurface it defines. Journal of Symbolic Computation, 41, 1309–1327.

Salvan, A. (1986). Test localmente più potenti tra gli invarianti per la verifica dell'ipotesi di normalità [Locally most powerful invariant tests for assessing normality]. In Atti della XXXIII Riunione Scientifica della Società Italiana di Statistica (Vol. II). Cacucci.
Serfling, R., & Mazumder, S. (2013). Computationally easy outlier detection via projection pursuit with finitely many directions. Journal of Nonparametric Statistics, 25, 447–461.

Stone, J., & Porrill, J. (1998). Independent component analysis and projection pursuit: A tutorial introduction. Technical report, available at http://www.shef.ac.uk/~pc1jvs/index.html.
Stone, J., Porrill, J., Porter, N., & Wilkinson, I. (2002). Spatiotemporal independent component analysis of event-related fMRI data using skewed probability density functions. NeuroImage, 15, 447–461.
Sun, J. (2006). Projection pursuit. In Encyclopedia of Statistical Sciences (Vol. 10). New York: Wiley.
Takemura, A., Matsui, M., & Kuriki, S. (2006). Skewness and kurtosis as locally best invariant tests of normality. Technical Report METR 06-47, available at http://www.stat.t.u-tokyo.ac.jp/~takemura/papers/metr200647.pdf.
Tichavsky, P., Koldovsky, Z., & Oja, E. (2006). Performance analysis of the FastICA algorithm and Cramer-Rao bounds for linear independent component analysis. IEEE Transactions on Signal Processing, 54, 1189–1203.
Tyler, D., Critchley, F., Dümbgen, L., & Oja, H. (2009). Invariant co-ordinate selection (with discussion). Journal of the Royal Statistical Society B, 71, 1–27.
Zarzoso, V., & Comon, P. (2007). Comparative speed analysis of FastICA. In M. E. Davies, C. J. James, S. A. Abdallah, & M. D. Plumbley (Eds.), Independent Component Analysis and Signal Separation (ICA 2007), Lecture Notes in Computer Science 4666. Berlin, Heidelberg: Springer.
