Journal of Archaeological Science (1999) 26, 117–124 Article No. jasc.1998.0368, available online at http://www.idealibrary.com on

On the Multivariate Normality of Data Arising from Lead Isotope Fields

M. J. Baxter†
Department of Mathematics, Statistics and Operational Research, The Nottingham Trent University, Nottingham NG11 8NS, U.K.

(Received 9 January 1997, revised manuscript accepted 22 October 1998)

There has been recent and extensive debate about the analysis and interpretation of lead isotope ratio data in archaeology. This paper addresses the specific technical issue of whether data arising from lead isotope fields can be reasonably modelled by trivariate normal distributions. This assumption underpins much of the statistical analysis of such data. It is argued that the univariate and coordinate dependent approaches that have been used to test for normality need to be complemented with truly multivariate tests. Several such tests are described and applied to seven recently published data sets. The results suggest that non-normality may be the rule rather than the exception. Some of the consequences of this are discussed.

© 1999 Academic Press

Keywords: LEAD ISOTOPE DATA, MULTIVARIATE NORMALITY, KERNEL DENSITY ESTIMATION, PROBABILITY PLOTS, MULTI-MODAL DATA.

Introduction

The paper by Sayre et al. (1992), on the statistical analysis of lead isotope data, initiated a sometimes heated debate that is still continuing (e.g., Tite, 1996; Stos-Gale et al., 1997). There are many issues on which the protagonists in this debate disagree; this note addresses the purely technical issue of whether or not data arising from lead isotope fields can be reasonably modelled by a trivariate normal distribution. The statistical procedures advocated by Sayre et al. (1992) and others depend, for their validity, on the assumption of normality. Claims that this assumption is usually valid have been disputed (Scaife et al., 1996) but published evidence supporting either position is limited (see next section). A limitation of the approaches that have been used to assess normality is that they are essentially univariate, whether the data are used as recorded, or transformed to principal components before testing (e.g., Sayre et al., 1992, who rely on graphical analysis). Baxter & Gale (1997) have shown that univariate tests of normality applied to the principal components for pairs of lead isotope ratios can be indicative of non-normality, but this approach is not guaranteed to work even if data are non-normal. It is argued in this paper that such approaches need to be supplemented with truly multivariate methods that do not depend on the coordinate system used (i.e., the original ratios, or their principal components), and a number of such tests are reviewed in the “Tests of Multivariate Normality” section. In the section entitled “Applications”, several tests of multivariate normality are applied to seven of the data sets published in Stos-Gale, Gale & Annetts (1996). The results suggest that non-normality may be the rule rather than the exception. The form of non-normality and possible consequences are discussed in the final two sections.

Normality and Lead Isotope Ratio Analysis

Conventionally in archaeology, lead isotope analyses are presented in the form of three ratios 208Pb/206Pb, 207Pb/206Pb and 206Pb/204Pb. For a single ore body, measurements on n samples may be visualized as a cloud of points in three-dimensional space that may be used to estimate the lead isotope field for the ore body.

That lead isotope data are usually normally distributed is asserted in Gale & Stos-Gale (1993) and Sayre et al. (1992). In the former case the evidence is based on statistical tests of univariate normality of the three ratios; in the latter case graphical inspection of the principal component scores for the three ratios is the main approach. Sample sizes are often quite small so that these approaches, which are a sensible first step in any analysis, might be expected to have difficulty in detecting non-normality. In addition to the sample size problem, there are two further difficulties with the approaches that have been used. The first is that

† Email: [email protected]


non-normality of the individual ratios implies trivariate non-normality but the converse is not the case. The second difficulty is that both approaches are coordinate dependent — on the system defined by the original ratios in the former case, and on the system defined by the principal components in the latter case. The assumption that a field has a trivariate normal distribution has been used in various ways. One common approach is to calculate confidence ellipsoids for bivariate pairs of ratios in order to demonstrate that fields for different ore bodies are distinct. Sayre et al. (1992) and the subsequent discussion in Archaeometry, particularly the contribution of Leese (1992), give technical details. The main point, as far as the present paper is concerned, is that the assumption of trivariate normality of fields is necessary for the procedures to be valid. The use of lead isotope analyses for provenancing purposes in this way has been questioned (Budd et al., 1993, 1995, 1996), in particular the normality assumption (Scaife et al., 1996).
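The first of the two difficulties noted above, that normal marginals do not guarantee joint normality, can be made concrete with a standard textbook construction. It is offered only as an illustrative sketch; the symbols $Z$, $U$ and $Y$ are introduced here for the purpose and have no connection with any lead isotope data set.

```latex
Let $Z \sim N(0,1)$ and let $U$ be independent of $Z$ with
$P(U = 1) = P(U = -1) = \tfrac{1}{2}$. Put $Y = UZ$.
By symmetry $Y \sim N(0,1)$, so both marginal distributions are normal, but
\[
  P(Z + Y = 0) = P(U = -1) = \tfrac{1}{2},
\]
so the linear combination $Z + Y$ is not normally distributed and the pair
$(Z, Y)$ cannot be bivariate normal.
```

Univariate tests applied separately to $Z$ and to $Y$ would, in this example, have no power at all, whereas a test that examines other directions or the joint distribution could detect the departure.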

Tests of Multivariate Normality

Notation

This section focuses on formal tests of multivariate normality; less formal, though important, graphical approaches will be illustrated later. Looney (1995) notes that over 50 such tests have been proposed. No single test can be expected to be optimal (D’Agostino, 1986: 413), so that the use of several tests is advisable.

Let $X$ be an $n \times p$ data matrix with the $i$th row $x_i$, a $p \times 1$ vector. Denote the mean of the $x_i$ by $\bar{x}$. The estimated covariance matrix is

$$S = \left\{\frac{n-1}{n}\right\}^{a} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})' \big/ (n-1) \qquad (1)$$

where $a = 0$ gives the unbiased sample estimate of the population covariance matrix for the field, and $a = 1$ gives the maximum likelihood estimate. For two rows, $x_i$ and $x_j$, define

$$r_{ij} = (x_i - \bar{x})' S^{-1} (x_j - \bar{x}), \qquad d_{ij} = (x_i - x_j)' S^{-1} (x_i - x_j)$$

and scaled “residuals” $s_i = S^{-1/2}(x_i - \bar{x})$.

Tests of multivariate skewness and kurtosis

Two early tests of multivariate normality proposed by Mardia (1970, 1974, 1975), and subsequently extensively studied, are multivariate skewness and kurtosis statistics defined as

$$b_{1,p} = \frac{1}{n^2} \sum_{i,j=1}^{n} r_{ij}^{3} \quad \text{and} \quad b_{2,p} = \frac{1}{n} \sum_{i=1}^{n} r_{ii}^{2} \qquad (2)$$

respectively, where $a = 1$ in equation (1). For small samples, and for $p = 2$ or $p = 3$, tables given in Kres (1983) can be used to see if the skewness and kurtosis depart significantly from what is expected under multivariate normality. Asymptotically, $T_3 = n b_{1,p}/6$ is distributed approximately as chi-square with ${}^{(p+2)}C_3$ degrees of freedom. Let

$$T_4 = n\{b^{*}_{2,p} - 6 b_{2,p} + 3p(p+2)\}/24 \qquad (3)$$

where

$$b^{*}_{2,p} = \frac{1}{n^2} \sum_{i,j=1}^{n} r_{ij}^{4}$$

then $T_4$ has an asymptotic chi-squared distribution with ${}^{(p+3)}C_4$ degrees of freedom.

An omnibus test combining skewness and kurtosis statistics

The foregoing statistics are intended to detect specific departures from normality. Mardia & Kent (1991) combine them in the form

$$T = T_3 + T_4 \qquad (4)$$

which is asymptotically chi-square with ${}^{(p+2)}C_3 + {}^{(p+3)}C_4$ degrees of freedom under the normality hypothesis. This is an omnibus test, in contrast to the previous two, and is identical to the smooth test proposed by Koziol (1986, 1987), though differently motivated.
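The statistics in equations (2) to (4) are straightforward to compute once the quantities r_ij are available. The following is a minimal sketch, not the code used for the results reported here; it uses only the Python standard library, and the function names `ml_covariance`, `invert` and `mardia_statistics` are hypothetical names chosen for illustration. It assumes the maximum likelihood covariance estimate (a = 1 in equation (1)) and returns the statistics together with the degrees of freedom of their asymptotic chi-squared reference distributions; it does not compute significance levels.

```python
import math

def ml_covariance(X):
    """Return (S, mean), with S the covariance using divisor n (a = 1 in equation (1))."""
    n, p = len(X), len(X[0])
    mean = [sum(row[k] for row in X) / n for k in range(p)]
    S = [[sum((row[a] - mean[a]) * (row[b] - mean[b]) for row in X) / n
          for b in range(p)] for a in range(p)]
    return S, mean

def invert(A):
    """Invert a small square matrix by Gauss-Jordan elimination with row pivoting."""
    p = len(A)
    M = [list(A[i]) + [float(i == j) for j in range(p)] for i in range(p)]
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(M[r][c]))
        if M[pivot][c] == 0.0:
            raise ValueError("covariance matrix is singular")
        M[c], M[pivot] = M[pivot], M[c]
        d = M[c][c]
        M[c] = [v / d for v in M[c]]
        for r in range(p):
            if r != c:
                f = M[r][c]
                M[r] = [v - f * w for v, w in zip(M[r], M[c])]
    return [row[p:] for row in M]

def mardia_statistics(X):
    """Compute b_{1,p}, b_{2,p}, T3, T4 and T = T3 + T4 for an n x p list of rows."""
    n, p = len(X), len(X[0])
    S, mean = ml_covariance(X)
    S_inv = invert(S)
    centred = [[x - m for x, m in zip(row, mean)] for row in X]
    # weighted[j] = S^{-1}(x_j - mean), so that r_ij is a plain dot product
    weighted = [[sum(S_inv[a][b] * c[b] for b in range(p)) for a in range(p)]
                for c in centred]
    r = [[sum(u * v for u, v in zip(ci, wj)) for wj in weighted] for ci in centred]
    b1 = sum(r[i][j] ** 3 for i in range(n) for j in range(n)) / n ** 2
    b2 = sum(r[i][i] ** 2 for i in range(n)) / n
    b2_star = sum(r[i][j] ** 4 for i in range(n) for j in range(n)) / n ** 2
    T3 = n * b1 / 6
    T4 = n * (b2_star - 6 * b2 + 3 * p * (p + 2)) / 24
    return {"b1": b1, "b2": b2, "T3": T3, "T4": T4, "T": T3 + T4,
            "df3": math.comb(p + 2, 3), "df4": math.comb(p + 3, 4)}
```

For trivariate data (p = 3) the returned degrees of freedom are 10 for T3 and 15 for T4. Because the r_ij are unchanged by any non-singular linear transformation of the data, the statistics are the same whether computed from the original ratios or from their principal components, which is the coordinate independence argued for in the Introduction.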

The multivariate Shapiro-Wilk test

Mardia (1980) noted, on the basis of studies completed at that time, that the performance of any new test of multivariate normality needed to be compared with the tests based on (2) and (3), and with Malkovich & Afifi’s (1973) generalization of the Shapiro-Wilk test, which is now described. The univariate Shapiro-Wilk test is defined as

$$W = \left(\sum a_i z_i\right)^{2} \Big/ \sum (z_i - \bar{z})^{2} \qquad (5)$$

where the $z_i$ are an ordered sample of $n$ observations. The coefficients, $a_i$, depend on the covariance matrix of the order statistics of a sample of standard normal random variables and can be approximated, without the need for table reference, using results given in Royston (1992). Now let $z$ be a linear combination of the variables that define $X$; then Malkovich & Afifi (1973) propose, as a statistic, the minimum of $W$ over all possible linear combinations. Finding this minimum is, in general, a non-trivial problem but can be approximated using “brute-force” methods for small $p$. Essentially, the test looks for a direction from which to view the data that maximizes non-normality as defined by the Shapiro-Wilk statistic. A selection of critical values is given in Romeu & Ozturk (1993).
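One way a brute-force search of the kind just mentioned might be organised for p = 3 is sketched below. This is an assumption about a reasonable implementation, not a description of the method used in the paper. The univariate statistic is passed in as the argument `stat`, a hypothetical callable that accepts an ordered sample and returns a number; in practice it would be a Shapiro-Wilk routine using the Royston (1992) coefficients, which is not reproduced here. The names `unit_directions` and `min_over_directions`, and the grid resolution `steps`, are likewise illustrative.

```python
import math

def unit_directions(steps):
    """Grid of unit vectors on the upper half of the unit sphere in three dimensions.

    A direction and its negative describe the same view of the data,
    so the lower half of the sphere is not searched.
    """
    directions = []
    for a in range(steps + 1):
        theta = (math.pi / 2) * a / steps        # polar angle in [0, pi/2]
        n_phi = 1 if a == 0 else 4 * steps       # a single direction at the pole
        for b in range(n_phi):
            phi = 2 * math.pi * b / n_phi        # azimuth in [0, 2*pi)
            directions.append((math.sin(theta) * math.cos(phi),
                               math.sin(theta) * math.sin(phi),
                               math.cos(theta)))
    return directions

def min_over_directions(X, stat, steps=30):
    """Return (smallest value of stat, direction attaining it) over the grid."""
    best_value, best_direction = None, None
    for c in unit_directions(steps):
        # project every row onto c and order the projected sample
        z = sorted(sum(ck * xk for ck, xk in zip(c, row)) for row in X)
        value = stat(z)
        if best_value is None or value < best_value:
            best_value, best_direction = value, c
    return best_value, best_direction
```

A grid search only approximates the true minimum, so the value returned can never be smaller than the exact statistic; a finer grid, or a local optimisation started from the best grid point, would tighten the approximation at greater computational cost.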

Table 1. Tests of univariate normality for individual ratios for three fields using the Shapiro-Wilk statistic. Entries are levels of significance

                        Isotope ratios
Field       n     208/206   207/206   206/204
Kea         62    0·401     0·726     0·972
Lavrion     59    0·011     0·085     0·029
Seriphos    36    0·995     0·469     0·554

Tests based on kernel density estimates

Some more recently developed tests are based on a multivariate kernel density estimate for the scaled residuals (Bowman & Foster, 1993). Let $\hat{f}(s)$ be the multivariate kernel density estimate based on the $s_i$ using a normal kernel with a single smoothing parameter $h$. Its expected value under the hypothesis of normality is $g(s)$, the probability density of the $p$-variate normal distribution with covariance matrix $\sigma^2 I$, where $\sigma^2 = 1 + h^2$ and $I$ is the $p \times p$ identity matrix. The test statistic is simply the integrated squared error or “distance” between the estimated and expected distributions given by

$$\int \{g(s) - \hat{f}(s)\}^{2}\, ds.$$

Bowman & Foster (1993) evaluate this to give the form

$$N(0, 2(1+h^2)I) - 2\sum_{i} N(s_i, (1+2h^2)I)/n + N(0, 2h^2 I)/n + 2\sum_{i<j} N(s_i - s_j, 2h^2 I)/n^2 \qquad (6)$$

Here $N(a, \Sigma)$ is read as the density of the $p$-variate normal distribution with zero mean and covariance matrix $\Sigma$, evaluated at the point $a$; this is the reading under which (6) equals the integral above.
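Equation (6) can be evaluated directly, as the following minimal sketch shows. It is an illustration under stated assumptions rather than the implementation of Bowman & Foster (1993): the scaled residuals are taken as given, the smoothing parameter h is supplied by the caller, the second sum in (6) is taken over pairs i < j, and the function names `normal_density_at` and `ise_statistic` are hypothetical. Only the Python standard library is used.

```python
import math

def normal_density_at(sq_dist, variance, p):
    """Density of the p-variate N(0, variance * I) at a point of squared length sq_dist."""
    return (2 * math.pi * variance) ** (-p / 2) * math.exp(-sq_dist / (2 * variance))

def ise_statistic(s, h):
    """Integrated squared error of equation (6) for scaled residuals s.

    s is a list of n vectors, each of length p; h is the smoothing parameter.
    """
    n, p = len(s), len(s[0])
    sq_norm = [sum(v * v for v in si) for si in s]
    term_g = normal_density_at(0.0, 2 * (1 + h * h), p)
    term_cross = 2 * sum(normal_density_at(q, 1 + 2 * h * h, p) for q in sq_norm) / n
    term_diag = normal_density_at(0.0, 2 * h * h, p) / n
    pair_sum = 0.0
    for i in range(n):
        for j in range(i + 1, n):                # pairs with i < j only
            d = sum((a - b) ** 2 for a, b in zip(s[i], s[j]))
            pair_sum += normal_density_at(d, 2 * h * h, p)
    return term_g - term_cross + term_diag + 2 * pair_sum / n ** 2
```

Every term depends on the scaled residuals only through their squared lengths and the squared distances between them. When the residuals are formed with the same S, these are the quantities r_ii and d_ij, so the statistic can be computed without forming the matrix square root of S.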

difficulties may, in the past, have deterred the use of some of the statistics used here. Another practical problem is the need to refer, for levels of significance, to specially constructed tables of critical values which are not always easily accessible and not always reliable.

Applications

Recently Stos-Gale, Gale & Annetts (1996) have published lead isotope ratio data for a number of fields in the Aegean. The seven fields for which there are 15 or more observations are used here.
