A Statistical Test for the Time Constancy of Scaling Exponents

Darryl N. VEITCH(1) and Patrice ABRY(2)

(1) SERC, Level 3, 110 Victoria Street, Carlton, Victoria 3053, Australia
Tel: (+61) (0)3 9925 4014 - Fax: (+61) (0)3 9925 4094
Email: [email protected] - Web: http://www.serc.rmit.edu.au/darryl

(2) CNRS URA 1325 - ENS Lyon, 46, allee d'Italie, 69364 Lyon Cedex 07, France
Tel: (+33) (0)4 72 72 84 93 - Fax: (+33) (0)4 72 72 80 80
Email: [email protected] - Web: http://www.physique.ens-lyon.fr/ts
Abstract
A wavelet-based statistical test is described for distinguishing true time variation of the scaling exponent describing scaling behaviour from the statistical fluctuations of estimates of a constant exponent made across time. The test is applicable in a uniform framework to diverse scaling phenomena, including long-range dependence and exactly self-similar processes, without prior knowledge of the type in question. It is based on the special properties of wavelet-based estimates of the scaling exponent over adjacent blocks of data, which strongly motivate an idealised inference problem: the equality or otherwise of the means of independent Gaussian variables with known variances. A uniformly most powerful invariant (UMPI) test exists for this problem and is described. A separate UMPI test is also described for the case where the scaling exponent undergoes a level change. The power functions of the two tests are given explicitly and compared. Using simulation, the effect in practice of deviations from the idealisations made of the statistical properties of the wavelet detail coefficients is analysed and found to be small. The tests inherit the significant robustness and computational advantages of the underlying wavelet-based estimator. The use of the test in practice is illustrated on the Bellcore Ethernet data sets.
Keywords: self-similarity, long-range dependence, scaling exponent, wavelets, hypothesis testing, time constancy, telecommunications networks.

The support of Ericsson Australia is gratefully acknowledged.
1 Introduction

Stochastic processes exhibiting scaling behaviour, such as exact self-similarity or long-range dependence, have been recognised in many fields as relevant models of time series data with scale invariance features. A prominent new example of the latter is telecommunications network traffic, the extraordinary scaling properties of which have stimulated much new work in the areas of data modelling using such scaling models, the generation of synthetic time series, and the study of their sample path properties, as well as their impact on queueing performance. The reader is referred to [23] for a recent comprehensive bibliography of these areas.

An interest in modelling data necessarily leads to issues of measurement and statistical estimation. One of the difficulties in the analysis of data with scaling features is the poor and non-standard performance of many statistical tools, which typically rely on stationarity of the model, on a short-range correlation structure, or both. It has recently been shown, however, how estimation approaches based on wavelet analysis [3, 4, 21, 6] can overcome many of these disadvantages. Notably, unbiased and efficient semi-parametric estimates of the scaling exponent, the key parameter describing scaling, are possible. Furthermore, this is achieved at very low computational cost: $O(n)$, where $n$ is the length of the data, essential in the analysis of very long series such as those arising in telecommunications.

There is another key difficulty, however, which, although raised from time to time [9], has as yet received little attention and no satisfactory solution. It is the fact that the very high variability inherent in scaling processes is very easily confused with non-stationarity, both in a `judgement by eye' sense, and in the sense of poor robustness and performance of many standard statistical tools. This difficulty is very significant, as it affects the fundamental issue of the class of models which one might select a priori to model data, as well as the reliable estimation of model parameters in the presence of polluting non-stationarities. It is entirely possible, for example, that by using an inappropriate statistical tool to detect self-similarity and measure its Hurst parameter $H$, results will be obtained which seem to indicate scaling behaviour, even when in fact the data is not scaling but is non-stationary in a non-scaling sense. Conversely, non-stationarity can be erroneously taken as evidence of scaling behaviour.

There is thus a strong and clear need for procedures and formal statistical tests allowing the variability due to scaling behaviour to be clearly distinguished from that due to non-scaling non-stationarities. The main contribution of this paper is the description of a simple yet optimal statistical test for the central problem of determining the constancy or otherwise in time of the scaling exponent. It is based on a wavelet-based estimator
described in [3, 21, 6] which enables many of the statistical difficulties due to scaling and non-stationarity to be avoided in a natural way. Under well justified idealisations the test is Uniformly Most Powerful Invariant. In section 2, scaling is described and both stationary and non-stationary classes of scaling are defined. In section 3 wavelets are introduced and the wavelet-based estimator of the scaling exponent presented. The properties of the estimator inspire a simple test for the constancy of the scaling exponent. In section 4 the idea of the test is idealised as a model inference problem. An optimal (Uniformly Most Powerful Invariant) test for this problem exists and is explained, and its properties illustrated. An optimal test for a class of level shift problems is also given. In section 5 the constancy of scaling test is defined in the wavelet domain and applied to simulated data, its properties measured empirically and compared against those expected from the idealised problem. Practical issues are addressed and an analysis procedure given. Finally, in section 6 the test is used on Ethernet data, where it is shown how pitfalls in the interpretation of single estimates can be avoided. Conclusions are given in section 7.
2 Scaling phenomena

Under the term scaling we gather several different phenomena which share a common intuitive property: that of not exhibiting any characteristic time-scale. Rather, the behaviour of scaling processes is characterised by the fact that all scales within a scaling range are equally important, and are related via some renormalisation or rescaling operation. A vital point to note is that both stationary and non-stationary classes of scaling processes exist. They differ in the aspect of the process where the scale invariance is to be found, and/or the scaling range.

We first consider Long-Range Dependence (LRD), a long-memory property of second-order stationary stochastic processes. Its simplest definition is given by the power-law divergence at the origin of the spectrum:

$$f_X(\nu) \sim c_f\, |\nu|^{-\alpha}, \qquad |\nu| \to 0. \qquad (1)$$

This asymptotic definition involves two parameters: the dimensionless scaling exponent $\alpha$, and the `power parameter' $c_f$, which has the dimensions of variance (see [21] for a detailed discussion of the role of $c_f$). Short-range dependence implies trivial scaling at large scales with $\alpha = 0$.

An important class of scaling processes which are non-stationary are the exactly self-similar processes, characterised by the famous scaling exponent $H$, the Hurst parameter. A process $X = \{X(t),\; t \in \mathbb{R}\}$ is self-similar with parameter $H > 0$ ($H$-ss) if $X(0) = 0$
and $\{X(at),\; t \in \mathbb{R}\}$ and $\{a^H X(t),\; t \in \mathbb{R}\}$ have the same finite-dimensional distributions. One aspect of such processes which shows their non-stationarity very clearly is their monotonically increasing variance: $\mathrm{Var}[X(t)] = \mathrm{Var}[X(1)]\,|t|^{2H}$. For $H < 1$ such processes may have stationary increments, and if so the increment processes are LRD processes, with $\alpha = 2H - 1$. The value $H = 1/2$, such as that of Brownian Motion, is somewhat special, as in that case the increments are not only stationary but also mutually independent (white noise).

There are other kinds of scaling properties, such as fractality, which refers to a scaling `roughness' at small scales, and multi-fractality, a more complex description of local fractal-like irregularity where a single exponent is insufficient; however, we will not enter into a broader discussion of these here (see [6]). Again, the central point is that there are different classes of scaling phenomena, covering both the stationary and non-stationary worlds, each of which has a scaling exponent as a key defining parameter. For simplicity, in what follows let $\alpha$ denote generically the second-order scaling exponent of the process $x(t)$ under study, regardless of the kind of scaling it describes. Extensions to statistics other than second order can also be considered; however, we will not do so here.

In relation to the modelling of network traffic, self-similar processes with stationary increments can serve as models of the cumulative input, say the number of bytes $x(t)$ arriving in the interval $[0, t]$, arising from a stationary LRD process which describes the
fluctuating input traffic rate [17].

In relation to the possible confusion between time constancy of scaling and non-stationarity, the following basic points should be made clear. If a test concludes that the scaling parameter varies over time, then the data is not scaling, nor can it be stationary. One cannot conclude, however, that the test serves as a test for stationarity; indeed such a test is impossible (outside of a strict parametric context), as there are an infinite number of things to check. In fact, a conclusion that scaling is constant can also be used to conclude non-stationarity, since a scaling process can be non-stationary in its own right; it cannot, however, be used to conclude stationarity, as extraneous non-stationary trends may be present, superposed on the scaling data, even if the scaling is not itself of non-stationary type. In conclusion, non-stationarity may sometimes be concluded from the result of a test of the constancy of scaling, but never stationarity, and so the test of constancy is in no way a de facto test for stationarity.
3 The Wavelet Approach to Scaling Analysis

3.1 Wavelet analysis and scaling exponent estimation

The discrete wavelet transform. The coefficients $d_x(j,k)$ of the discrete wavelet transform (DWT) [10, 16] result from the comparison, by means of inner products, of the process to be analysed $x$ and a family of functions $\{\psi_{j,k}\}$, called the wavelet basis:

$$d_x(j,k) = \langle x, \psi_{j,k} \rangle.$$

The wavelet basis $\{\psi_{j,k}\}$ consists of shifted and dilated templates of a single reference pattern $\psi_0$, usually called the mother wavelet. The mother wavelet is a basic pattern whose time support and frequency support are both strongly concentrated: it therefore acts as an elementary atom of information. The time-shift operator, whose action on $\psi_0$ yields

$$\psi_{0,k}(t) = \psi_0(t - k),$$

allows the selection of a particular time instant $t$ around which the analysis is to be performed. The dilation (or change of scale) operator, whose action on $\psi_0$ yields

$$\psi_{j,0}(t) = \frac{1}{\sqrt{2^j}}\, \psi_0\!\left(\frac{t}{2^j}\right),$$

varies the time support of the wavelet and therefore allows the analysis of the process to be concentrated about a given scale (or equivalently, frequency). Acting together, the operators generate the full two-parameter set of basis functions:

$$\psi_{j,k}(t) = \frac{1}{\sqrt{2^j}}\, \psi_0\!\left(\frac{t - 2^j k}{2^j}\right),$$

centred on a sparse set of points in the time-scale plane known as the dyadic grid, that is, the points $\{(\mathrm{scale} = 2^j,\; t = 2^j k);\; j, k \in \mathbb{Z}\}$. The fact that the wavelet transform is a joint time and scale representation of the process $x$ plays a key role in the problem addressed here: the scale variable captures the scaling phenomena that exist in the data, while the time variable enables their constancy along time to be investigated.

The mother wavelet is moreover characterised by an integer $N$, called the number of vanishing moments, defined by:

$$\int t^k\, \psi_0(t)\, dt \equiv 0, \qquad k = 0, 1, 2, \ldots, N-1.$$
The number of vanishing moments controls details of the statistical performance of the estimator of the scaling exponent presented below, and lends it robustness and versatility (see [3, 4, 21, 6] for a more thorough exploration of these questions).

Wavelet analysis of scaling processes. Let $x$ be either an $H$-ss process with stationary increments of some order (see [6]), or an LRD process, or any other scaling process as mentioned in the previous section, and denote generically the second-order scaling exponent by $\alpha$. The wavelet coefficients of $x$ satisfy the following two key properties in the scaling range [13, 19, 3, 4, 21]:
P1: Provided $N \ge (\alpha - 1)/2$, the sequences $\{d_x(j,k),\; k \in \mathbb{Z}\}$ are stationary, and for each fixed octave $j$ in the scaling range their variances reproduce precisely ($H$-ss) or to a very high accuracy (LRD) the power law underlying the scale invariance of the process:

$$\mathbb{E}\, d_x(j,k)^2 \simeq c\, 2^{j\alpha}.$$

P2: Any two wavelet coefficients exhibit a correlation that is asymptotically controlled by $N$, such that the larger $N$, the weaker the correlation:

$$\mathbb{E}\, d_x(j,k)\, d_x(j',k') \approx |2^j k - 2^{j'} k'|^{-1-2N}, \qquad |2^j k - 2^{j'} k'| \to +\infty.$$

More precisely, the non-stationarity of $H$-ss processes or the long-range correlations of LRD processes are transformed to stationarity and short-range dependence in the wavelet domain, provided $N$ is chosen such that $N \ge \alpha/2$. Henceforth the following idealisation of this result will be used:

ID1: The $d_x(j,k)$ are strictly uncorrelated.
Definition of the estimator. From P1, one can think of estimating the scaling exponent from a linear fit in a $\log_2 \mathbb{E}\, d_x(j,k)^2$ vs $\log_2(2^j) = j$ plot. Intuitively, P1 and P2 indicate that the simple time average $\frac{1}{n_j} \sum_k d_x(j,k)^2$ (where $n_j$ is the number of wavelet coefficients at octave $j$) will be an efficient estimator of $\mathbb{E}\, d_x(j,k)^2$. The wavelet-based estimator therefore reads:

$$y_j = \log_2\!\Big(\frac{1}{n_j} \sum_k d_x(j,k)^2\Big) - g_j, \qquad \hat{\alpha} = \sum_j w_j\, y_j, \qquad (2)$$

where the sum is over $j \in (j_1, j_2)$, the range of octaves over which the scaling phenomenon is observed and the linear regression performed. The $g_j$ are deterministic quantities that account for the fact that $\log_2 \mathbb{E}\, d_x(j,k)^2 \ne \mathbb{E} \log_2 d_x(j,k)^2$; see [21, 6] for details and complete expressions. The weights $w_j$ follow the standard formulae for weighted linear regression: $w_j = \frac{S_0\, j - S_1}{a_j (S_0 S_2 - S_1^2)}$, where $S_m = \sum_j j^m / a_j$ ($m = 0, 1, 2$) and the $a_j$ are arbitrary numbers, set here to the variances of the respective $y_j$, as this choice yields minimal variance for the purely regression part of the estimation problem.

Statistical performance of $\hat{\alpha}$. It has been shown [13, 3, 2, 4, 21] that the wavelet-based estimator defined above exhibits excellent statistical properties. Before briefly listing them below, it is worthwhile emphasising that the definition and statistical performance of $\hat{\alpha}$ also apply to the analysis of $1/f$-type processes, and to some extent to that of fractal and multifractal processes [6].
1. The estimation procedure and its statistical performance are the same regardless of the precise nature of the scaling existing in $x$.

2. It is strictly unbiased (even for finite data) provided the $g_j$ can be calculated exactly (for example, for Gaussian processes closed-form expressions are available [21]). If the $g_j$ can only be approximated (for example, set to zero), it is asymptotically unbiased and has negligible bias in practice, even for short data.

3. It can be shown that for Gaussian processes the variance of the estimator reaches the Cramer-Rao lower bound of the corresponding estimation problem [3, 2, 21, 22]. Moreover, an analytical closed-form expression of this variance can be given [3, 2, 4]: $\sigma^2 = \sigma^2(j_1, j_2, \{n_j\})$ is a known function of the amount of data and the scaling range only, and in particular is independent of the unknown $\alpha$.

4. The estimator exhibits robustness in two respects. i) It does not require a priori knowledge of a full parametric model of the data: it is semi-parametric in nature [4, 21]. ii) It is insensitive to important classes of deterministic non-stationarities possibly superimposed on the scaling phenomenon to be analysed. It is, for instance, blind to smooth deterministic trends added to the process $x$ (understood as drifts of the mean) [4, 6], and to smooth time evolutions of the variance of $x$ itself [18].

The above properties have been thoroughly studied elsewhere and will not be further detailed here (a sketch of the estimator in code form is given just below). The test for the constancy of the scaling parameter developed below will however require that two additional properties be addressed in more detail, namely the Gaussianity of $\hat{\alpha}$, and the independence of estimates obtained from adjacent non-overlapping blocks. These will be discussed in the next two subsections.
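To make the above concrete, here is a minimal sketch of the estimator of equation (2) in Python, assuming the PyWavelets package (pywt) is available. The expressions used for the $g_j$ and the variances $a_j$ are the Gaussian forms referred to in [21]; the function name and default parameters are illustrative, not taken from the original references.

```python
# A minimal sketch of the estimator of equation (2), assuming PyWavelets is
# available. g_j and a_j below are the Gaussian expressions referred to in [21].
import numpy as np
import pywt
from scipy.special import digamma, zeta

def alpha_hat(x, j1=3, j2=None, wavelet="db3"):
    """Estimate the scaling exponent alpha over octaves (j1, j2)."""
    max_j = pywt.dwt_max_level(len(x), pywt.Wavelet(wavelet).dec_len)
    j2 = max_j if j2 is None else j2
    coeffs = pywt.wavedec(np.asarray(x, float), wavelet, level=max_j)
    details = coeffs[:0:-1]                  # details[j-1] = d_x(j, .), j = 1..max_j
    js = np.arange(j1, j2 + 1)
    nj = np.array([details[j - 1].size for j in js], dtype=float)
    mu = np.array([np.mean(details[j - 1] ** 2) for j in js])
    g = digamma(nj / 2) / np.log(2) - np.log2(nj / 2)   # bias corrections g_j
    a = zeta(2, nj / 2) / np.log(2) ** 2                # variances a_j of the y_j
    y = np.log2(mu) - g
    # Weighted regression of y_j on j: slope = sum_j w_j y_j, as in the text.
    S0, S1, S2 = (np.sum(js ** p / a) for p in (0, 1, 2))
    w = (S0 * js - S1) / (a * (S0 * S2 - S1 ** 2))
    return np.sum(w * y), np.sum(w ** 2 * a)  # (estimate, its known variance)
```

Note that the returned variance depends only on $(j_1, j_2)$ and the $n_j$, as stated in point 3 above.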
3.2 Gaussianity of the estimator

The exact form of the probability distribution of $\hat{\alpha}$ depends on the statistical details of the analysed process $x$. However, the theoretical as well as the numerical arguments developed below show that it is asymptotically normally distributed.

Theoretical arguments. If we assume exact decorrelation among the wavelet coefficients (ID1), we have exact decorrelation of the $y_j$ that enter the definition of $\hat{\alpha}$ in equation (2). For all analysed processes $x$ such that the $y_j$ have finite variance, and this covers a very large class of processes including many with infinite variance [6, 8], one can therefore apply a generalised central limit argument (see e.g. [12], theorem 3, section VIII.4): in the limit $J \to +\infty$, where $J = j_2 - j_1 + 1$ is the width of the scaling range, $\hat{\alpha}$ is normally distributed. In many scaling contexts, including LRD and $H$-ss processes, increasing $J$ is equivalent to increasing the number of available samples $n$, so we say that, under ID1, $\hat{\alpha}$ is asymptotically normally distributed.

Numerical simulations. The idealisation ID1 is not strictly true. To test the effect in practice of residual correlation between wavelet coefficients on the asymptotic Gaussianity of $\hat{\alpha}$, the following numerical simulations were performed. We synthesised, by the so-called spectral synthesis method, $K$ realisations of fractional Gaussian noise (fGn) for different values of $\alpha$ and of various lengths $n$, and recorded an estimate of $\alpha$ for each (for uniformity we use $\alpha$ here as the scaling parameter of fGn, rather than the more common choice of the Hurst parameter $H$, where $\alpha = 2H - 1$). For each $(\alpha, n)$ pair, the empirical probability distribution function of $\hat{\alpha}$ given by the $K$ independent estimates was compared with that of a Gaussian random variable by the standard technique of quantile-quantile plots. For the results shown in figure 1 we have $K = 10000$, $\alpha = \{-0.5, 0.5\}$, $n = \{2^9, 2^{17}\}$, $j_1 = 3$, and Daubechies3 ($N = 3$) wavelets were used. The results clearly show that the distribution of $\hat{\alpha}$ corresponds closely to that of a Gaussian random variable. The agreement is excellent not only for long data sets, being close out to around $\pm 3$ standard deviations for $n = 2^{17}$, but also for much shorter data sets: to within $\pm 2$ standard deviations for $n = 2^9$. Note that 10000 independent random samples from a Gaussian variable would not exhibit more convincing quantile-quantile plots than these! This numerical study reveals that the asymptotic Gaussianity of $\hat{\alpha}$ remains valid under departures from ID1, and that it holds even for very small numbers of analysed samples.
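For readers wishing to reproduce such experiments, the following rough sketch indicates the flavour of spectral synthesis: it imposes only the target power-law spectrum $|\nu|^{-\alpha}$ on white Gaussian noise, so it is an approximation to fGn adequate for illustration, not the exact method used above; the function name is ours.

```python
# A crude spectral synthesis sketch: shapes Gaussian noise in the frequency
# domain to the power-law spectrum |nu|^(-alpha); approximate fGn only.
import numpy as np

def synth_scaling_noise(n, alpha, rng):
    freqs = np.fft.rfftfreq(n)                       # nu = 0, 1/n, ..., 1/2
    amp = np.zeros_like(freqs)
    amp[1:] = freqs[1:] ** (-alpha / 2.0)            # sqrt of target spectrum
    z = rng.standard_normal(freqs.size) + 1j * rng.standard_normal(freqs.size)
    x = np.fft.irfft(amp * z, n)                     # real series of length n
    return x / x.std()

rng = np.random.default_rng(0)
x = synth_scaling_noise(2 ** 17, 0.5, rng)
# Sanity check: the periodogram slope in log-log coordinates should be ~ -alpha.
f = np.fft.rfftfreq(x.size)[1:]
P = np.abs(np.fft.rfft(x))[1:] ** 2
print(np.polyfit(np.log2(f), np.log2(P), 1)[0])      # roughly -0.5 expected
```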
Figure 1: Distributional comparison of $\hat{\alpha}$ against Gaussian. Quantile-quantile plots of $\hat{\alpha}$ against a standard Gaussian variable, obtained from 10000 realisations of fGn, with (first row) $n = 2^9$ and (second row) $n = 2^{17}$, (left column) $\alpha = -0.5$, (right column) $\alpha = 0.5$. In each case the empirical distribution function of $\hat{\alpha}$ is very close to Gaussian despite the residual correlations among wavelet coefficients, even for small $n$.
3.3 Correlation between estimates from adjacent blocks

Returning to the idea of splitting the data into non-overlapping blocks, we now investigate the correlation between estimates obtained from adjacent blocks.

Theoretical arguments. If estimates on adjacent blocks are computed using a time-domain estimator, these estimates will be strongly dependent because of the non-stationarity or long-range dependence of the original process. Exact decorrelation of the wavelet coefficients (ID1), on the contrary, implies that wavelet-based estimates on different blocks are mutually uncorrelated, another key feature for the constancy test developed below. Again, ID1 is not strictly satisfied, and it is necessary to examine whether the correlations between estimates from different blocks, induced by residual correlations between the wavelet coefficients, are significant or not. This is done through numerical simulations using Fisher's z-statistic [11].

Fisher's z-statistic. Let $\{w_1(k), w_2(k)\}$, $k = 1, \ldots, K$ be $K$ independent 2-dimensional random variables with a joint Gaussian distribution and correlation coefficient $r$. Let $\hat{\rho}$ denote the empirical correlation coefficient

$$\hat{\rho} = \frac{\overline{w_1 w_2} - \overline{w_1}\; \overline{w_2}}{\sigma_{w_1} \sigma_{w_2}},$$

with $\overline{w_i} = \sum_{k=1}^K w_i(k)/K$ and $\sigma_{w_i}^2 = \sum_{k=1}^K w_i^2(k)/K - \overline{w_i}^2$ $(i = 1, 2)$. Let $z$ be defined as:

$$z = \frac{1}{2} \ln \frac{1 + \hat{\rho}}{1 - \hat{\rho}}.$$

It is known (see e.g. [11]) that $z$ is approximately normally distributed, with mean $\frac{1}{2} \ln \frac{1+r}{1-r} + \frac{r}{2(K-1)}$ and variance approximately equal to $\frac{1}{K-3}$. This result allows the use of a standard hypothesis test with a null hypothesis of $r \equiv 0$ to test for correlation between $w_1$ and $w_2$ at a given confidence level.

Numerical simulations. Again using the spectral synthesis technique, $K$ realisations of fGn were synthesised for each of various lengths $n$, with a common value of $\alpha$ corresponding to strong LRD: $\alpha = 0.6$ ($H = 0.8$). Each series was split in half and an estimation of $\alpha$ performed independently on each half. Let $\{\hat{\alpha}_1^n(k), \hat{\alpha}_2^n(k)\}$, $k = 1, \ldots, K$ denote these series of estimates. For each $n$ the Fisher z-statistic was computed, and the above test applied to examine the null hypothesis of complete decorrelation. In the numerical simulations performed: $K = 2000$, $n = \{2^{12}, 2^{13}, \ldots, 2^{18}\}$, $j_1 = 3$, and Daubechies3 ($N = 3$) wavelets were used. Figure 2 shows, as a function of the length $n$ of the original series, that the $z$ values all fall well within the 95% confidence interval (dashed lines) corresponding to zero correlation. This clearly indicates that we have no reason to reject the $r \equiv 0$ hypothesis, and therefore strongly justifies the idealisation of exact decorrelation between adjacent estimates to be used presently.
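As an illustration, here is a small sketch of the decorrelation check via Fisher's z-statistic. The function name is ours, and for brevity the small mean correction $r/(2(K-1))$ is ignored, since it vanishes under the null $r \equiv 0$.

```python
# A sketch of the Fisher z-based check for r = 0 between two estimate series.
import numpy as np
from scipy.stats import norm

def fisher_z_test(w1, w2, level=0.05):
    """Return (z, half-width of the null confidence interval, accept r=0?)."""
    K = len(w1)
    rho = np.corrcoef(w1, w2)[0, 1]                  # empirical correlation
    z = 0.5 * np.log((1 + rho) / (1 - rho))
    # Under r = 0, z is approximately N(0, 1/(K-3)):
    half = norm.ppf(1 - level / 2) / np.sqrt(K - 3)
    return z, half, abs(z) <= half

rng = np.random.default_rng(1)
w1, w2 = rng.standard_normal((2, 2000))              # independent stand-ins
print(fisher_z_test(w1, w2))                         # expect acceptance of r = 0
```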
Figure 2: Testing for correlation between adjacent estimates. The circles show the Fisher z-statistics computed from estimates from pairs of adjacent blocks of fGn (2000 realisations), as a function of $\log_2(n)$ (block length equals $n/2$). For each $n$ the $z$ values fall within the 95% confidence interval (dashed lines), indicating that there is no reason to reject the exact decorrelation idealisation.
3.4 Testing the time variation of the scaling exponent

The idea of the test. Let us first summarise the properties of the wavelet-based estimator $\hat{\alpha}$ of the scaling exponent which, as will be seen, are central to the design of the test.

1. The estimation procedure and its statistical performance are the same regardless of the precise nature of the scaling, and the possibility of varying the number of vanishing moments $N$ gives the estimator robustness against important classes of deterministic non-stationarities.

2. To an excellent approximation $\hat{\alpha}$ has a Gaussian distribution, with a bias which is negligible in practice, and a variance that is a known function of $j_1$, $j_2$ and the $n_j$ only.

3. Estimates taken over adjacent blocks are close to uncorrelated.

The last property is the crucial one, and it inspires the following simple approach to testing the time constancy of the exponent $\alpha$. The data is split into $m$ adjacent non-overlapping blocks, and wavelet-based estimates $\hat{\alpha}_i$, $i = 1, \ldots, m$, are obtained separately for each over a common scale range $(j_1, j_2)$. Testing whether the scaling parameter is constant or not therefore amounts to testing whether the uncorrelated Gaussian random variables $\hat{\alpha}_i$, with known variances, have identical means or not. The simplicity of properties 1 to 3, together with the fact that $\alpha$ can take any real value as the kind of scaling is not known a priori, offers the possibility of constructing such a test with optimal, known properties, as described in the next section.

Choice of H0. We choose the null hypothesis $H_0$ to correspond to the $\alpha_i$ being equal (though still unknown). The alternative hypothesis $H_1$ is therefore that the $\alpha_i$ are not all equal. Although this seems a natural, intuitive choice, it should be remembered that it actually amounts to putting the priority on a low error of type I, since this is directly selected via the significance level, with no direct control over the error of type II, that is, without regard to the power of the test. This means that the $H_0$ chosen in fact puts a priority on high confidence in decisions to reject, rather than accept, the proposition that scaling is constant. The justification of this choice is that, if non-constant $\alpha$ is concluded, then one is faced with dealing with data which is both non-scaling and non-stationary. This renders analysis and modelling far more complex, as many analysis and modelling techniques are thereby in themselves invalidated. It is therefore of great practical importance not to conclude that data is non-scaling without good reason. The choice of $H_0$ taken therefore allows us to accept a simple constant model whenever this is reasonable, that is, unless one is very sure that it is not, in keeping with the principle of parsimonious modelling.
4 The Idealised Inference Problem

In this section we discuss certain inference problems inspired by the attractive properties, discussed above, of wavelet-based estimates made over $m$ adjacent blocks. Specifically, the near independence of these estimates and their approximate Gaussianity are idealised to define the following parametric problem. Consider $m$ independent Gaussian variables $\{X_i\}$ with unknown means $\{\mu_i\}$ taking real values, and known common variance $\{\sigma_i^2 = \sigma^2\}$, $i = 1, 2, \ldots, m$. We wish to test the null hypothesis $H_0$, that the means share a common (unknown) value, against the alternative hypothesis $H_1$, that they are not all the same. Note that both the null and alternative hypotheses are composite, that is, they are sets in parameter space with more than a single element. The extension to the case where the variances $\sigma_i^2$ vary, for example because blocks of unequal size are chosen, is straightforward, and the solution is given below. For simplicity, however, we will present in detail only the case of equal variance. Note that
tests for this problem could be obtained by other methods. Notably, likelihood ratio tests [20] can be used to rapidly derive the tests presented below. By deriving them via invariance methods, however, we gain the important advantage of knowledge of their optimality, as well as considerable insight into the problem.
4.1 Invariant tests

In choosing a test from the unlimited alternatives available, it is naturally desirable to select one which is optimal in some way. Typically it would be satisfying to maximise the power of the test, that is, the probability of accepting $H_1$ given that it is true, for each fixed level of significance. Recall that a test which, for each level of significance and regardless of the unknown value of the parameters, has equal or higher power than any other test is known as a uniformly most powerful (UMP) test. A uniformly most powerful test does not exist for the above problem; however, a UMP invariant test does, and is given in Lehmann ([14], p.377). Before detailing the test we quickly review the nature and meaning of UMP invariant tests, and then discuss the invariance relevant to the present problem. A very readable introduction to UMP tests is given in [20], whereas details of invariant tests and their (possible) optimality are given in [14].

Consider a parametric inference problem consisting of a vector random variable $X \in \mathcal{X}$, having distribution function $F(x; \theta)$ depending on the fixed but unknown parameter vector $\theta \in \Theta$. The null hypothesis $H_0$ is that $\theta$ lies in the subset $\omega \subset \Theta$, and the alternative hypothesis $H_1$ is that it lies in the complement: $\theta \in \Theta - \omega$. We say that the distribution of $X$ has an invariance property if there exists a group $G$ of transformations mapping $\mathcal{X}$ to itself such that for each $g \in G$ the variable $X' = gX$ is in the same family, i.e. has distribution $F(x'; \theta')$ for some $\theta' \in \Theta$. In such a case the transformation $g: X \mapsto gX$ has induced a related transformation $\bar{g}: \theta \mapsto \theta'$ in the parameter space. Provided that appropriate closure properties are satisfied, these transformations $\{\bar{g}\}$ form a group $\bar{G}$ acting in $\Theta$ which is isomorphic, that is, equivalent in terms of group structure, to $G$. (Note that the invariance property of $F(\cdot\,;\cdot)$ is intrinsically dependent on both its space and parameter dependencies. The transformation of one must be able to be `countered' by a corresponding transformation of the other, to result in a net invariance of form.) In order for the inference problem itself to be invariant, we require in addition that $H_0$ be invariant, that is, $\bar{g}\omega = \omega$.

The principle of an invariant test is that, given that under the joint action of $G$ and $\bar{G}$ the inference problem is invariant, the test should be also. If this were not the case, then under transformation we would have a problem of exactly the same type, but with a test which had changed. The interpretation would be that the test depends in a quite arbitrary way on a choice of coordinate system. Formally, a test $\phi$ is invariant under $G$ if $\phi(gx) = \phi(x)$ for all $x \in \mathcal{X}$, $g \in G$ [14].

The group $G$ defines a set of equivalence classes on $\mathcal{X}$ called orbits. The orbit to which a point $x$ belongs is just the set $\{gx\}$, $g \in G$. An invariant function $I(x)$ under a group $G$ is one which is constant on each orbit, and a maximal invariant is an invariant function $M_I(x)$ which takes a different value on each orbit. Since all maximal invariants group together points which are equivalent under $G$ in essentially the same way, it is not surprising that a test invariant under $G$ must be a function of a (any) maximal invariant ([14], Theorem 1, p.285). In other words, as an invariant test by definition only notices which orbit the sample data is on and ignores other details, it should depend only on a maximal invariant, which is precisely a function which simply labels orbits. Furthermore, it can be shown ([14], Theorem 3, p.292) that any invariant function $I(x)$ under $G$ has a distribution which is a function only of a (any) maximal invariant $\bar{M}_I(\theta)$ of $\bar{G}$. Putting these two results together, it follows that any test invariant with respect to $G$ is a function only of $M_I(x)$, with distribution a function only of $\bar{M}_I(\theta)$. The presence of invariance therefore results not only in a reduction in the number of allowable tests, but also in a considerable simplification of the original problem, as the dimensionality of $M_I(x)$ and $\bar{M}_I(\theta)$ is typically lower than that of $\mathcal{X}$ and $\Theta$ respectively. It remains to determine whether for this simpler problem a UMP test exists or can be found. If so, it is called a UMP invariant (UMPI) test for the original problem.
4.2 A UMPI test for the constancy of means

The test, and its critical region. We now apply these ideas to the present problem, where both $\mathcal{X}$ and $\Theta$ are $\mathbb{R}^m$, and $\theta = \{\mu_1, \mu_2, \ldots, \mu_m\}$. It is not difficult to see that the problem is invariant with respect to two groups of transformations: 1) the addition of the same constant to each component of $X$, and 2) proper orthogonal transformations of $X$ which preserve the `diagonal', i.e. the line in $\mathbb{R}^m$ with all components equal. These two groups together generate the larger group $G$. Orbits of $G$ (and of $\bar{G}$, as in this case the two are not only isomorphic but in fact strictly identical) are $(m-1)$-dimensional cylinders of infinite length centred on the diagonal. A maximal invariant is evidently given by the radius of the cylinder, measured perpendicularly from the diagonal, as this uniquely identifies an orbit. The square of this radius, measured in units of variance, is given by

$$V = \sum_i (X_i - \bar{X})^2 / \sigma^2,$$

where $\bar{X}$ denotes the mean of the components of $X$. The problem is therefore reduced to one where the scalar positive random variable $V$ depends only on a positive scalar parameter $\theta$, and $H_0$, which is now simple, corresponds to $\theta = 0$:
$$\theta = \sum_i (\mu_i - \bar{\mu})^2 / \sigma^2, \qquad H_0:\; \theta = 0.$$

It is not difficult to show ([20], p.103) that a UMP test exists for such a simple one-sided problem, the critical region being given by $V > C$. Furthermore, $V$ can be expressed as $V = m s^2/\sigma^2$, where $s^2$ is the biased form of the sample variance. It therefore follows from a standard result on $s^2$ that the distribution of $V$ under $H_0$ is just that of a Chi-squared variable with $m - 1$ degrees of freedom [20], and is thus independent of the common mean $\mu$. The constant $C$ is therefore determined from the significance level $\gamma$ via $\int_C^{+\infty} f_{m-1}(v)\, dv = \gamma$, where $f_{m-1}(v)$ is the density function of the Chi-squared variable. This test is of size $\gamma$ and similar [20]. In the case of different variances $\sigma_i^2$, the test readily generalises ([14], p.377) to $V > C$ with

$$V = \sum_i \frac{1}{\sigma_i^2}\left(X_i - \frac{\sum_j X_j/\sigma_j^2}{\sum_j 1/\sigma_j^2}\right)^{\!2}, \qquad \theta = \sum_i \frac{1}{\sigma_i^2}\left(\mu_i - \frac{\sum_j \mu_j/\sigma_j^2}{\sum_j 1/\sigma_j^2}\right)^{\!2}.$$

Under $H_0$ we still have $\theta = 0$, with $C$ determined exactly as before.

The power of the test. Under $H_1$ the distribution of $V$ becomes that of a non-central Chi-squared variable with $m - 1$ degrees of freedom and non-centrality parameter equal to $\theta$. The power is therefore given by $\int_C^{+\infty} f_{m-1,\theta}(v)\, dv$, where $f_{m-1,\theta}(v)$ is the non-central Chi-squared density. This integral can readily be evaluated in practice as a sum of central Chi-squared distributions ([1], equation 26.4.25, p.942). Exactly the same facts hold in the case of general $\sigma_i^2$, provided the generalised definition of $\theta$ above is used.
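The following is a minimal numerical companion to the test just described, under the stated idealisations: the statistic $V$, the critical value $C$ at significance $\gamma$, and the power at non-centrality $\theta$, using scipy's central and non-central Chi-squared distributions. Names and example values are ours.

```python
# The idealised UMPI test: V, C and the power function, via scipy.
import numpy as np
from scipy.stats import chi2, ncx2

def umpi_equal_means(X, sigma2, gamma=0.05):
    """Test H0: equal means, for independent Gaussians with known variances."""
    X, sigma2 = np.asarray(X, float), np.asarray(sigma2, float)
    mu_w = np.sum(X / sigma2) / np.sum(1.0 / sigma2)   # weighted mean
    V = np.sum((X - mu_w) ** 2 / sigma2)
    C = chi2.ppf(1 - gamma, df=X.size - 1)
    return V, C, V > C                                  # True means reject H0

def power(theta, m, gamma=0.05):
    C = chi2.ppf(1 - gamma, df=m - 1)
    return ncx2.sf(C, df=m - 1, nc=theta)               # P(V > C | theta)

# Example: a two-level mean vector with m = 10 and unit variances.
mu = np.array([0.0] * 5 + [1.5] * 5)
X = np.random.default_rng(2).normal(mu, 1.0)
print(umpi_equal_means(X, np.ones(10)))
print(power(np.sum((mu - mu.mean()) ** 2), m=10))
```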
Although the above test is optimal with respect to the power of competing tests, actual values of the power of course depend on the closeness of the unknown parameters to $H_0$, and can be very poor, in fact arbitrarily close to $\gamma$. This can be seen explicitly by noting that the integrals given above for the significance level and the power tend to each other as $\theta$ tends to zero. To gain some idea of the power of the test under different conditions, consider first figure 3, where power functions are given for several values of $m$. The curves are given as functions of $\sqrt{\theta}$ rather than of $\theta$ as this is physically more meaningful: dimensionally $\theta$ is the square of the radius, i.e. the square of the distance of the means from $H_0$, whereas $\sqrt{\theta}$ corresponds to the distance itself. These curves essentially contain all the information about the power of the test. It is not immediately obvious, however, what a given value of $\theta$ means in practical terms.
Figure 3: Power functions. Power functions for different $m$ values ($m = 2, 4, 8, 16$) as a function of $\sqrt{\theta}$, corresponding to the distance of $\mu$ from $H_0$.

To gain a feel for what real data may look like at given values of power, we provide figure 4, in which examples of $\mu$ vectors with $m = 10$ (dashed lines) and corresponding data samples $X$ (asterisks) are given, for each of the power values 0.8 (top row), 0.5 (middle) and 0.2 (bottom row). The variance is taken to be $\sigma^2 = 1$. From figure 3, these powers correspond to $\sqrt{\theta} = [3.96, 2.97, 1.88]$ respectively. Recall that each value of $\theta$ corresponds to an entire equivalence class of $\mu$ values. The figures in each row give two different examples of $\mu$ vectors with the same $\theta$ value. The difference between the columns is that on the left there is no particular pattern to the elements of $\mu$, as they were chosen randomly. In contrast, in the right-hand column they were chosen to conform to a level-change scenario where the means take only two values, with a change point in the middle. Such a pattern may occur in real data; however, it should be emphasised that the powers given in figure 4 correspond to the general test as described above, where no assumptions are made on the structure of the $\mu_i$. The plots illustrate the fact that as the elements of $\mu$ become more uniform, the ability to correctly judge that they are in fact different decreases, and thus the power is lower. Although the purpose of figure 4 is to `train the eye', it in fact shows that when observing the random data $X$ it is far from obvious whether the means share the same value or not. This emphasises the need in practice for a well-defined test to assist in such decisions.
Figure 4: Power scenarios. In each plot, $\mu$ vectors are given for $m = 10$ (dashed lines) together with corresponding data samples $X$ (asterisks), with $\sigma = 1$. Each row gives two examples of $\mu$ and $X$ at a fixed power and corresponding $\sqrt{\theta}$, from top to bottom: $(\mathrm{Power}, \sqrt{\theta}) = (0.8, 3.96)$, $(0.5, 2.97)$ and $(0.2, 1.88)$. In the left column $\mu$ is chosen arbitrarily; on the right it is of the two-level form $\mu = \{\mu_1, \mu_1, \mu_1, \mu_1, \mu_1, \mu_2, \mu_2, \mu_2, \mu_2, \mu_2\}$.

The choice of m. It should be noted that each different value of $m$ is a separate
inference problem, and that it makes no a priori sense to compare the power functions for different $m$. For example, although the curves for larger $m$ lie below those for smaller $m$ in figure 3, it does not necessarily follow that `increasing $m$ results in a lower power'. Indeed, for each $m$ the parameter $\theta$ is really a separate quantity $\theta_m$: they are identified in the plot purely for ease of display. Comparisons across $m$ values can only be made in a context where further details or assumptions of the experimental situation allow the specification of the dependence of $\theta_m$ on $m$: $\theta_m = \theta(m, \mu_i(m), \sigma_i(m))$, including how the different $\theta_m$ can be meaningfully compared. As power is not defined under $H_0$, there is no question of optimising its value with respect to $m$, but only of choosing $C(\gamma) = C_m(\gamma)$ to fix the significance level. Since $\theta = \theta_m$ is zero under $H_0$, independently of $m$, it follows that $C_m(\gamma)$ is a function of $m$ only and can always be found.

Consider the following example under $H_1$. An experimenter may have the opportunity to perform additional experiments of the same type, resulting in new observations $X_{m+1}, \ldots, X_{m'}$, and the benefits of this in terms of increased power could be of interest. In such a case it may be reasonable to assume that $m$ has increased to $m'$ at constant $\sigma^2$, and at constant average $\mu_i$. It follows that $\theta$ increases in proportion to the increase in $m$, and therefore, to make a fair comparison for different $m$, the power functions should be plotted as functions of $\theta/m$. Replotting the curves of figure 3 in this way reveals that power increases with $m$ at fixed $\theta/m$. This is a simple reflection of the fact that having more data allows more accurate discriminations to be made.

Another example arises as an important element of the `choice of $m$' problem in the wavelet context, as discussed in the next section. Begin with $m = m_0$, with $m_0$ corresponding to a given $H_1$, that is, a given vector $\mu$ and variances $\sigma_i(m_0)$. Imagine that the physical nature of the problem is such that each variable $X_i$ can be split in a simple way, that is, written as the average of $l$ i.i.d. components: $X_i = (\sum_j Y_{i,j})/l$, each distributed as $Y_{i,j} \sim N(\mu_i,\, l\,\sigma_i^2(m_0))$. This leads to a new problem of the same type but with $m$ increased to $m = l m_0$, and variances (in $m_0$ groups of $l$) all increased by the factor $l$. In this case it is easy to check that $\theta_m = \theta(m, \mu_i(m), \sigma_i) = \theta_{m_0}$, so that the power functions can be compared against a common variable, and are in fact exactly as shown in figure 3. Now we can conclude that increasing $m$ decreases power. Hence, for a problem of this type, where increasing $m$ does not imply an increase in the amount of data, it is best not to `split' the original problem but to keep $m$ as small as possible.
4.3 A UMPI test for the two-level problem

An important special case of non-constant scaling is that of a `level change', that is, an abrupt change at a given time instant separating constant scaling regimes on either side. If there are physical reasons to expect behaviour of this type in data, or if there is compelling empirical evidence, then it may be desirable to test directly for it, rather than to use a test such as that above which does not exploit the additional assumed structure. Another reason is that in many contexts a level change is a popular first-order model of time-varying behaviour. We therefore give an optimal test for this situation. A further advantage of treating this problem here is that it enables an interesting quantitative comparison of powers to be made between the general and two-level tests.

The test, and its critical region. In our idealised framework the problem is modelled as follows. Let the first $m_1$ variables, denoted $\{X_i\}$, $i = 1, \ldots, m_1$, each have mean $\mu_1$, and the remaining $m_2 = m - m_1$ variables $\{Y_i\}$, $i = 1, \ldots, m_2$, have common mean $\mu_2$. The composite null hypothesis is that $\mu_1 = \mu_2$, against the composite alternative hypothesis $\mu_1 \ne \mu_2$. This `jump' problem can be seen as a restricted form of $H_1$ in the general framework described above, or alternatively recognised as a simple form (known variance) of the classic problem of the equality of means between two groups of Gaussian variables with common variance ([20], p.110). A maximal invariant is given by $V_2 = |\bar{X} - \bar{Y}|/\sigma$ in the state space $\mathcal{X} = \mathbb{R}^m$, and $\theta_2 = |\mu_1 - \mu_2|/\sigma$ in the parameter space $\Theta = \mathbb{R}^2$. It follows that a UMP invariant test is given by $V_2 > C$, where $C$ is chosen from the significance level using the fact that $(\bar{X} - \bar{Y})/\sigma$ is normally distributed with mean $(\mu_1 - \mu_2)/\sigma$ and variance $m/(m_1 m_2)$. Under $H_0$ the mean vanishes.

Note that, although we have introduced this inference problem as testing for a single jump in mean, where the $\mu_1$ values occur first followed by the $\mu_2$, in fact the independence of the variables implies that their order plays no role, even though the index $i$ corresponds to `time'. Only the numbers $m_1$ and $m_2$ of variables in the two groups are important. Thus exactly the same analysis also serves for a variety of other scenarios, for example, with $m_1 = 1$, $m_2 = m - 1$, that of a single perturbed value in arbitrary position in otherwise stationary scaling data. Another example would be the case where the two values alternate with the index $i$, corresponding in some sense to the `least stationary' possibility involving only two levels.
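A sketch of this two-level test and its power function follows, under the same idealisations; the function names and example values are ours.

```python
# The two-level UMPI test: V2 against C from the N(0, m/(m1*m2)) null law.
import numpy as np
from scipy.stats import norm

def two_level_test(X, Y, sigma, gamma=0.05):
    m1, m2 = len(X), len(Y)
    sd = np.sqrt((m1 + m2) / (m1 * m2))     # std of (Xbar - Ybar)/sigma under H0
    V2 = abs(np.mean(X) - np.mean(Y)) / sigma
    C = norm.ppf(1 - gamma / 2) * sd        # two-sided critical value
    return V2, C, V2 > C                    # True means reject H0

def two_level_power(theta2, m1, m2, gamma=0.05):
    sd = np.sqrt((m1 + m2) / (m1 * m2))
    C = norm.ppf(1 - gamma / 2) * sd
    # Under H1, (Xbar - Ybar)/sigma ~ N(theta2, sd^2) up to sign:
    return norm.sf(C, loc=theta2, scale=sd) + norm.cdf(-C, loc=theta2, scale=sd)

rng = np.random.default_rng(3)
X, Y = rng.normal(0.0, 1.0, 5), rng.normal(2.0, 1.0, 5)
print(two_level_test(X, Y, sigma=1.0))
print(two_level_power(theta2=2.0, m1=5, m2=5))   # cf. figure 5, m = 10, m1 = 5
```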
The power of the test. In figure 5, power functions of the two-level test are given as functions of the absolute scaled mean difference $\theta_2 = |\mu_1 - \mu_2|/\sigma$. In the left-hand plot the different curves correspond to different values of $m_1$, with fixed $m = 10$.

Figure 5: Two-level power and comparison with general test. Left: Power functions with $m = 10$ blocks, for $m_1 = 1$ (bottom) up to $m_1 = 5$ (top). The power drops to $\gamma = 0.05$ under $H_0$ ($\theta_2 = 0$) in each case, and exceeds 0.8 for each $m_1$ when $\theta_2 > 3$. Right: Comparison between power functions of the two-level (solid curves) and general tests (dashed curves) for $m = [10, 4, 2]$ from top to bottom. For $m = 2$ the tests are identical and the curves coincide. The two-level test has a significantly higher power, the more so for higher $m$.

It can be seen that the power only becomes uniformly greater than $1/2$ for mean differences in excess of two standard deviations, that is, for $\theta_2 > 2$. Note that the power functions are uniformly and monotonically increasing in $m_1$. In the right-hand plot, curves for three different values of $m$ are given, from top to bottom $m = [10, 4, 2]$ (solid lines), each with $m_1 = m/2$. For comparison purposes the power functions of figure 3 with the same $m$ values are also reproduced on the figure (dashed lines). Comparisons between curves with the same $m$ show that the two-level test has the higher power in each case, except when $m = 2$, where the general and two-level problems are strictly identical. (Comparisons across different $m$ are not meaningful.) This is to be expected, as the two-level test makes explicit use of the two-level structure and its known change point. Note that if the two-level structure of $\mu$ is assumed, $\theta$ reduces to $m(\theta_2/2)^2$, so in replotting the curves from the general test into figure 5 the transformation $\theta \mapsto 2\sqrt{\theta/m} = \theta_2$ has been applied. It can be seen that the larger $m$ is, the greater the increase in power in using the two-level test over the general test. The decision of which test can validly be used will naturally depend on the specific situation.
4.4 Tests in simple hypothesis cases

It may be required to perform tests against specific values of the $\mu_i$. The main case where this arises is when prior studies have shown typical values of $\alpha$ for certain kinds of data, and it is desired to test whether a new set of data is consistent with this finding (for example, studies of Ethernet telecommunications traffic data have repeatedly found values of $\alpha$ close to 0.6 [15]). Optimal tests in the idealised framework for this situation, where $H_0$ is simple, and for other related situations, are simpler than those described above and can be derived by the same techniques, and so will not be described in detail here.
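For completeness, a hedged sketch of the simplest such case: testing a single estimate, with known variance, against a reference value $\alpha_0$ (the value 0.6 below is only the illustrative Ethernet figure quoted above; the function name and numbers are ours).

```python
# Simple-H0 sketch: is alpha_hat consistent with a reference value alpha0?
import numpy as np
from scipy.stats import norm

def test_against_alpha0(alpha_est, var_est, alpha0=0.6, gamma=0.05):
    z = (alpha_est - alpha0) / np.sqrt(var_est)
    return abs(z) <= norm.ppf(1 - gamma / 2)   # True: consistent with alpha0

print(test_against_alpha0(0.65, 0.03 ** 2))
```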
5 A Wavelet Test for the Constancy of Scaling

In this section we use the theoretical results from the previous section to define, in the wavelet framework, a hypothesis test for the constancy of the scaling parameter $\alpha$.

Definition. Let $x$ denote the series to be analysed, of length $n$. Compute, according to the definitions given in section 3 and using a common scaling range $(j_1, j_2)$, the estimates $\{\hat{\alpha}_1, \ldots, \hat{\alpha}_i, \ldots, \hat{\alpha}_m\}$ obtained from $m$ adjacent blocks with, possibly, unequal lengths $n_i$. From section 3, we know that the $\{\hat{\alpha}_1, \ldots, \hat{\alpha}_m\}$ can be considered as independent Gaussian variables with unknown means and known, but possibly different, variances, that is,

$$\hat{\alpha}_i \sim N(\alpha_i, \sigma_i^2),$$

where, to an excellent approximation,

$$\sigma_i^2 = \frac{2^{j_1 - 1}\, (1 - 2^{-J})}{\ln^2\!2\, \left( 1 - 2^{-(J+1)}(J^2 + 4) + 2^{-2J} \right) n_i},$$

with $J = j_2 - j_1 + 1$ the width of the scaling range (see [2, 21] for exact expressions). We wish to test the null hypothesis $H_0$: the means are identical, against $H_1$: the means differ. From the previous section, we know that a UMPI test can be defined by forming the statistic:
$$V = \sum_i \frac{1}{\sigma_i^2}\left(\hat{\alpha}_i - \frac{\sum_j \hat{\alpha}_j/\sigma_j^2}{\sum_j 1/\sigma_j^2}\right)^{\!2}. \qquad (3)$$

It was shown that the distribution of $V$ is a function only of a single parameter:

$$\theta = \sum_i \frac{1}{\sigma_i^2}\left(\alpha_i - \frac{\sum_j \alpha_j/\sigma_j^2}{\sum_j 1/\sigma_j^2}\right)^{\!2}. \qquad (4)$$
Under $H_0$ we have $\theta \equiv 0$, and $V$ is distributed as a Chi-squared variable with $m - 1$ degrees of freedom, with density $f_{m-1}$. Under $H_1$, $\theta > 0$ and $V$ follows a non-central Chi-squared distribution with $m - 1$ degrees of freedom and non-centrality parameter $\theta$, with density $f_{m-1,\theta}$. Let $\gamma$ denote the chosen significance level of the test and define the critical region boundary $C = C(\gamma)$ via

$$\int_C^{+\infty} f_{m-1}(x)\, dx = \gamma.$$

The test reads:

If $V > C$: reject $H_0$ (conclusion: $\alpha$ is not constant);
If $V \le C$: accept $H_0$ (conclusion: no evidence that $\alpha$ is not constant).

The power function of the test, that is, the probability, as a function of the particular $H_1$, of accepting $H_1$ when it is true, reads:

$$P(\theta) = \int_C^{+\infty} f_{m-1,\theta}(x)\, dx.$$
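A sketch of the resulting wavelet-domain test follows, assuming per-block estimates $\hat{\alpha}_i$ are already available (for example from an estimator such as that sketched in section 3). The variance formula is the closed-form approximation quoted above, and the function names and example values are ours.

```python
# The constancy test of equations (3)-(4), given block estimates and lengths.
import numpy as np
from scipy.stats import chi2

def block_variance(j1, j2, ni):
    """Approximate known variance of a block estimate over octaves (j1, j2)."""
    J = j2 - j1 + 1
    num = 2.0 ** (j1 - 1) * (1 - 2.0 ** -J)
    den = np.log(2) ** 2 * (1 - 2.0 ** -(J + 1) * (J ** 2 + 4)
                            + 2.0 ** (-2 * J)) * ni
    return num / den

def constancy_test(alphas, sigma2, gamma=0.05):
    alphas, sigma2 = np.asarray(alphas, float), np.asarray(sigma2, float)
    mean_w = np.sum(alphas / sigma2) / np.sum(1.0 / sigma2)
    V = np.sum((alphas - mean_w) ** 2 / sigma2)        # equation (3)
    C = chi2.ppf(1 - gamma, df=alphas.size - 1)
    return "reject H0" if V > C else "accept H0"

# Example: m = 4 equal blocks of n_i = 2**11 points, octaves (3, 8).
s2 = np.full(4, block_variance(3, 8, 2 ** 11))
print(constancy_test([0.61, 0.58, 0.64, 0.60], s2))
```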
Statistical properties. The statistical properties and performance of the wavelet hypothesis test (significance and power) would be exactly those described in the previous section if the Gaussianity and independence of the $\hat{\alpha}_i$ held strictly. We have already established, in section 3, that these are a priori justified idealisations. It is important, however, to check a posteriori the effect of residual correlations among the wavelet coefficients on the statistical performance of the test. To do so we performed numerical simulations.

To study the type I error (reject $H_0$ when true), we synthesised $K$ sample paths of fGn of length $n$ and constant parameter $\alpha$. We then split each of the samples into $m$ blocks of equal length and applied the above test to each, counting the number of times that $V$ fell within the critical region $V > C(\gamma)$, with significance level $\gamma = 0.05$ (confidence level $100(1-\gamma)\% = 95\%$). In the results presented in figure 6 (left plot), we have $K = 2000$, $n = 2^{13}$, $m = \{2, 4, 8, 16\}$, $\alpha = 0.6$ ($H = 0.8$), and Daubechies3 wavelets were used. The plot shows that the wavelet test closely reproduces the theoretical 5% rejection rate, so we conclude that residual correlations only slightly affect the type I error probability.

To study the type II error (accept $H_0$ when false), we synthesised $K$ sample paths of fGn of length $n$, with the scaling parameter abruptly changing from $\alpha$ to $\alpha + \Delta$ at sample $n/2 + 1$. The samples were then split into $m$ blocks of equal length and the above test applied as described above, again with significance level $\gamma = 0.05$. In the results presented in figure 6 (right plot), we had $K = 200$, $n = 2^{14}$, $m = 10$, $\alpha = 0$ ($H = 0.5$), and $\Delta$ was set to three different values corresponding to three evenly spread values of power.
Again Daubechies3 wavelets were used. Figure 6 shows that the resulting values of the power (circles) for the three different values of $\Delta$ are extremely close to those derived theoretically (solid line). This study reveals that, despite the fact that Gaussianity and independence do not strictly hold, the statistical properties and performance of the wavelet-based hypothesis test are very close to those of the idealised problem, for which exact theoretical results are available. The test can therefore be regarded as UMPI in practice.
Figure 6: Statistical properties of the wavelet-based constancy test. Left plot, type I error: the percentage of rejections of $H_0$ (when true) in simulations using fGn ($N = 3$, $n = 2^{13}$, 2000 realisations), as a function of the number of blocks $m$, is very close to the chosen target of 5%, independently of $m$. Right plot, type II error: for three different values of $\Delta$, the actual power of the wavelet test (circles) is compared against that of the idealised test (solid line), as a function of $\sqrt{\theta}$. The plots confirm that the statistical performance of the wavelet test is very close to that of the idealised test, despite residual correlations between the wavelet coefficients that imply slight departures from strict Gaussianity and independence of the estimates over different blocks.
Choosing m. When applying the test to a set of data of length $n$, one has to select the number $m$ and the sizes $\{n_i\}$ of the blocks into which the data is split. How can these be chosen optimally? What does optimal mean? For simplicity we will only discuss the case where the blocks are of equal size, leading to the question: how is $m$ to be chosen? Aspects of this question have already been addressed at the end of section 4.2; however, that was in the context of the idealised problem, which constitutes only a portion of the practical issue at hand. The essential difference in the wavelet context is that, in order to use the test for a fixed $m$, it is necessary to assume that scaling is constant over each of the blocks separately, whereas in the idealised case this was given. Such an assumption must of course be tested, and this can only be done by increasing $m$ to examine the data at a higher time resolution. Only if $m$ is high enough that scaling is effectively constant over each block (and this may never be the case) can the test be applied in essentially the same way as in the idealised context. These issues will be illustrated in a detailed example below.
To begin, first note that for a given $m$ the common variance $\sigma_m^2$ of the estimates $\hat{\alpha}_i$ is roughly inversely proportional to the length of the blocks. Hence $\sigma_m^2$ is roughly proportional to $m$: $\sigma_m^2 \simeq m \sigma^2$, where $\sigma^2 = \sigma_1^2$ is the variance of the estimate over the full series. Note, moreover, that increasing $m$ means that the length $n/m$ of the blocks decreases. Each block must however remain large enough that the analysed scaling phenomenon remains visible and measurable over a sufficiently wide range of scales. This important practical consideration puts a limit on the maximum size of $m$; however, it will be ignored for the moment.

Consider first that $H_0$ is true. In this case the assumption of scaling over each block is clearly satisfied, and so the discussion at the end of section 4.2 applies: there is no preferred value of $m$, as power is not defined and the significance level can be freely chosen for any $m$.

Assume now that $H_1$ is true. To understand the role of $m$, consider the following simple toy problem: the data to be analysed consists of 4 concatenated subseries, each exhibiting scaling, with scaling parameters $\{\alpha_A, \alpha_B, \alpha_A, \alpha_B\}$ over a shared scaling range. If one chooses $m = 2$, scaling is not constant over each block, violating the assumptions of the test and yielding estimates $\{\hat{\alpha}_1, \hat{\alpha}_2\}$ which do not in fact correspond to anything meaningful. Nonetheless, the two estimates are statistically identical, and via the test procedure one is therefore very likely to accept $H_0$, as it is not possible from looking at $m = 2$ alone to determine whether scaling is constant over each block or not. Essentially, the number of blocks is not large enough to see or follow precisely enough the variation in time of $\alpha$. If one chooses $m = 4$, one obtains the estimates $\{\hat{\alpha}_1, \hat{\alpha}_2, \hat{\alpha}_3, \hat{\alpha}_4\}$, which are meaningful, and the assumptions of the test are satisfied, as the scaling is constant over each block with exponents $\{\alpha_A, \alpha_B, \alpha_A, \alpha_B\}$. The variances are $\sigma_4^2 = 4\sigma^2$, yielding $\theta_4 = (\alpha_A - \alpha_B)^2/(4\sigma^2)$. In this case, the power of the test (the probability of accepting $H_1$ when true) is $P_4(\theta_4) = \int_{C_4}^{+\infty} f_{3,\theta_4}(x)\, dx$ and can be read off the $m = 4$ curve of figure 3. If one now chooses $m = 8$, the estimates $\{\hat{\alpha}_1, \hat{\alpha}_2, \ldots, \hat{\alpha}_8\}$ are obtained, which again are meaningful, as the scaling is constant over each block with exponents $\{\alpha_A, \alpha_A, \alpha_B, \alpha_B, \alpha_A, \alpha_A, \alpha_B, \alpha_B\}$. The variances are $\sigma_8^2 = 8\sigma^2$, yielding a $\theta_8$ which from equation (4) can be shown to be $\theta_8 = 2\theta_4/2 = \theta_4$. In this case the power of the test is $P_8(\theta_8) = \int_{C_8}^{+\infty} f_{7,\theta_8}(x)\, dx$ and can be read off the $m = 8$ curve of figure 3. Since $\theta_8 = \theta_4$, it is valid to compare the two power functions using a common coordinate, and so from figure 3 one sees that for any fixed value of $\theta > 0$, the power of the test decreases when $m$ is increased from 4 to 8. This is a direct consequence of the fact that the estimates for $m = 8$ have double the variance of those for $m = 4$, yielding more uncertainty in the decision-making process than can be countered by the increase in $m$. Note that the last two cases fall precisely under the second example at the end of section 4.2, with $m_0 = 4$ and $l = 2$, and the same conclusion is reached: there is no benefit in increasing $m$ beyond $m_0$, provided that the assumptions of the test are valid at $m = m_0$.
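The toy comparison can be checked numerically: with $\theta_8 = \theta_4$, the $m = 8$ test (7 degrees of freedom) is indeed the less powerful. The numbers below are illustrative assumptions only.

```python
# Power comparison for the toy problem: theta_4 = theta_8, but P_4 > P_8.
import numpy as np
from scipy.stats import chi2, ncx2

def power(theta, m, gamma=0.05):
    C = chi2.ppf(1 - gamma, df=m - 1)
    return ncx2.sf(C, df=m - 1, nc=theta)

delta, sigma2 = 1.0, 0.05                 # alpha_A - alpha_B, full-series variance
theta = delta ** 2 / (4 * sigma2)         # common value theta_4 = theta_8
print(power(theta, m=4), power(theta, m=8))   # expect the first to be larger
```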
the decision making process than can be countered by the increase in m. Note that the last two cases fall precisely under the second example at the end of section 4.2 with m0 = 4 and l = 2, and the same conclusion is reached: there is no bene t in increasing m beyond m0 , provided that the assumptions of the test are valid at m = m0 . From this example it can be inferred that the choice of the optimal m is subject to the following trade-o: on the one hand, m has to be large enough to see or track variations in time of the scaling parameter; on the other hand, m has to be as small as possible to avoid degrading the power of the test due to an increase in variance of the estimates. The optimal choice is therefore that m be such that the data be split into the largest possible blocks within which the scaling parameter is not varying. This optimal choice can of course not be made in practice since it depends on the speci c unknown H1 under study. It implies therefore an experimental methodology where m is varied, and observations interpreted according to the above heuristic understanding of the role of m. For instance, if the decision is accept H0 for all m, then the nal decision is accept H0. Under H1, the recommendation of the test may be accept H0 for small m (because the blocks are too wide to see the variations of ), then reject H0 for a given set of median values of m, then again accept H0 for large values of m because the statistical
The above argument overlooks the fact that varying m implies varying the range of octaves of analysis $(j_{1,m}, j_{2,m})$, which imposes a maximum m in practice in order to preserve a minimum number of scales per block for a reasonable estimate. This is another reason why m should be chosen as small as possible; it becomes problematic, however, when it is not possible to choose m large enough to follow the time variation of the exponent in sufficient detail, i.e. to ensure a constant exponent over each block.
The practical test procedure
To analyse the existence of scaling in data, and in particular to test for the constancy of scaling, one can proceed as follows:
Initial steps
1. Choose a mother wavelet and perform the discrete wavelet transform of the data. Because the wavelet transform is a time-scale representation of the data, well suited to capturing scaling, the outcome of the test should not depend on the wavelet. However, choosing the number of vanishing moments N of the wavelet so that $N \ge \alpha/2$ is necessary to ensure stationarity and short-range dependence in the details, and imparts robustness to irrelevant superimposed non-stationarities. These issues must first be addressed by an appropriate choice of N [4, 6]. (A sketch of steps 1 and 2 together follows step 2.)
2. Perform a global analysis of the data as proposed in [4, 6]. This involves computing the Logscale Diagram (the $y_j$ versus j plot) of the full data set. If a range of scales $(j_1, j_2)$ can be found over which scaling is observed, then estimate $\alpha$ according to equation (2). The question is then: is the observed $\hat\alpha$ meaningful, that is, constant over the whole data set over the same range of scales? If no scaling range can be found, the question becomes: can the data be split into sub-blocks over some or all of which scaling exists? The purpose of step 2 is to look for initial evidence of scaling, and possibly to gain an initial idea of the scaling range over which to perform estimation on the blocks. Whether scaling is observed or not, the data should be split into blocks in any case and the following procedure applied, since if scaling is not constant, the Logscale Diagram of the full data set is not useful and can be very misleading, as illustrated in the next section.
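A minimal Python sketch of steps 1 and 2, using the PyWavelets package, is given below. It computes the details, the Logscale Diagram ordinates $y_j$, and a weighted least-squares slope; for simplicity it omits the small bias-correction terms of the full estimator of [4, 6] and uses the idealised variance $\mathrm{var}(y_j) \approx 2/(n_j \ln^2 2)$, so it illustrates the structure of the estimator rather than serving as a complete implementation.

import numpy as np
import pywt

def logscale_diagram(x, N=3, J=12):
    """Details of the DWT and the LD ordinates y_j = log2(mean of d(j,.)^2)."""
    coeffs = pywt.wavedec(x, f'db{N}', level=J)  # Daubechies, N vanishing moments
    js = np.arange(1, J + 1)
    yj = np.array([np.log2(np.mean(coeffs[-j] ** 2)) for j in js])  # octave 1 finest
    nj = np.array([len(coeffs[-j]) for j in js])  # number of details per octave
    return js, yj, nj

def estimate_alpha(js, yj, nj, j1, j2):
    """Weighted linear fit of y_j against j over octaves j1..j2 (cf. equation (2))."""
    sel = (js >= j1) & (js <= j2)
    x, y = js[sel], yj[sel]
    w = nj[sel] * np.log(2) ** 2 / 2.0  # weights 1/var(y_j), idealised form
    S, Sx, Sy = w.sum(), (w * x).sum(), (w * y).sum()
    Sxx, Sxy = (w * x * x).sum(), (w * x * y).sum()
    alpha_hat = (S * Sxy - Sx * Sy) / (S * Sxx - Sx ** 2)
    var_alpha = S / (S * Sxx - Sx ** 2)  # the known variance of the slope
    return alpha_hat, var_alpha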
Procedure on blocks
1. Choose a significance level.
2. Choose an m > 1, but not so large as to exclude the scales of interest from each block.
3. Examine the Logscale Diagrams for each block, and select a range of scales $(j_{1,m}, j_{2,m})$ common to each where scaling is observed. If no common range can be chosen, this clearly reveals that the scaling is not constant, and the test at this m is not defined. Go to point 5. (Note that if m is so large that the largest scale available is smaller than the upper cutoff of the scaling phenomenon present, then $j_{2,m}$ will be constrained to be the largest scale available in the block, and will decrease as m increases.)
4. Compute the threshold $C_m$ and the test statistic $V_m$. Compare $V_m$ and $C_m$ and record the test outcome at this m (see the sketch following this list).
5. If valid m values remain, go to point 2.
6. Analyse the set of m-dependent test outcomes to draw the final conclusion.
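The following Python sketch implements points 1 to 4 for a single m, under the idealisation described earlier: under H0 the block estimates $\hat\alpha_i$ are independent Gaussians with known variances, so that $V_m$ is chi-squared with m - 1 degrees of freedom. The particular form of $V_m$ used here, the weighted sum of squared deviations from the variance-weighted mean, is that of the idealised problem and is an assumption of this sketch.

import numpy as np
from scipy.stats import chi2

def constancy_test(alpha_hats, variances, significance=0.05):
    """Test equality of the block exponents at the given significance level."""
    a = np.asarray(alpha_hats, dtype=float)
    v = np.asarray(variances, dtype=float)
    m = len(a)
    weighted_mean = np.sum(a / v) / np.sum(1.0 / v)  # best estimate under H0
    V = np.sum((a - weighted_mean) ** 2 / v)         # test statistic V_m
    C = chi2.ppf(1.0 - significance, m - 1)          # threshold C_m
    return ('reject H0' if V > C else 'accept H0'), V, C

For each candidate m, the $\hat\alpha_i$ and their variances would come from applying the estimator of the previous sketch to each block, over the common range $(j_{1,m}, j_{2,m})$.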
6 Application to Ethernet Data
To illustrate the use of the test on actual data, we apply it to some of the celebrated Bellcore Ethernet data sets. Recall briefly that these consist of lists of arrival times and
Ethernet frame lengths recorded on a Local Area computer communications Network. For a thorough description the reader is referred to [15] (see also [4]). From each of the data sets "pAug", "pOct", "OctExt" and "OctExt4" we have extracted an aggregated rate process of arriving work, that is, a discrete time series corresponding to the number of bytes transmitted during contiguous constant-length time intervals, here of length 12, 10, 1000, and 10 milliseconds respectively (a sketch of this construction is given at the end of this passage). These time series are plotted in the top right plots of figures 7, 8, 9, and 10 respectively. For each time series the Logscale Diagram (LD), shown in the plots on the left of each figure, is first computed using Daubechies wavelets with N = 3. Recall that an LD which exhibits good alignment over a range of scales including the largest in the data, and which yields an estimated slope (scaling exponent) $\hat\alpha$ in the range $0 < \hat\alpha < 1$, is evidence that the data is long-range dependent with LRD parameter $\alpha$ (for a more complete description of the reading of LDs, see [15, 6, 4]). For the pAug, pOct and OctExt time series this is indeed what is observed, suggesting that they are LRD. For the OctExt4 time series, however, there is no clear evidence of scaling, although an estimate of the slope on the larger scales was nonetheless made, as shown.
In order to investigate the time constancy of scaling in these series, it is necessary to split them into blocks. For pAug, pOct, and OctExt we are interested in checking that the evidence for LRD observed over the whole series is confirmed as being valid and constant in time; for OctExt4 we wish to see if clear evidence of scaling appears over subsets of the series. Each time series was split into m = 12 blocks. The m estimates, made in each case with (j1, j2) = (7, max possible), are shown in the bottom right plots of the figures, together with their confidence intervals and the outcome of the test with a significance level of 95%. We restate that, without knowledge of the true nature of scaling in the data, it is not possible to choose an optimal m, and that the relevant experimental strategy is to vary it. The results presented below are robust to changes of m, the choice of m = 12 being an arbitrary one for purposes of illustration.
For each of pAug and pOct it was observed (not shown) that there is acceptable evidence that scaling is present in the same scaling range over each block, so that the assumptions of the test are satisfied and it can be applied. In both cases the test then strongly indicated that there is no reason to suspect a change in the scaling exponent, and so the hypothesis of constant scaling is accepted. One can therefore return with confidence to the full series to estimate its value. For the time series OctExt and OctExt4 it was also observed, albeit less convincingly, that scaling is present in the same scaling range over each block. The test can therefore be applied to judge whether the large observed variations in the $\hat\alpha_i$ are consistent with statistical fluctuations of a constant underlying exponent, or not.
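For completeness, here is a minimal Python sketch of the aggregation step described at the start of this section; the array names arrival_times (in seconds) and frame_bytes, and the helper name itself, are hypothetical.

import numpy as np

def aggregated_byte_rate(arrival_times, frame_bytes, delta):
    """Bytes transmitted per contiguous interval of length delta seconds."""
    edges = np.arange(0.0, arrival_times.max() + delta, delta)
    counts, _ = np.histogram(arrival_times, bins=edges, weights=frame_bytes)
    return counts

# e.g. the pAug series: bytes per 12 ms interval
# x = aggregated_byte_rate(arrival_times, frame_bytes, 0.012)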
Figure 7: pAug. Left: the LD of the whole time series (j1 = 7, j2 = 15), with N = 3. Top right: the time series of bytes per 12 ms interval. Bottom right: the estimates from 12 adjacent blocks with (j1, j2) = (7, 12), and the test outcome: Accept H0.
In both cases the test strongly indicated that they were not, and that H0 should be rejected.
It is instructive to examine the cases OctExt and OctExt4 a little further. Since for both H0 was rejected, we must consider in hindsight that the LDs of the entire series, shown in figures 9 and 10 respectively, are meaningless from the point of view of measuring a scaling exponent. In both cases, what is seen in the LD is in fact a kind of average of different phenomena taking place in each block. The case of OctExt is of particular interest because the LD exhibits an apparently clear alignment over scales (6, 14), and certainly over (10, 14) (given the confidence intervals of the $y_j$), convincingly suggesting that scaling is present. This alignment however is merely an undesirable artifact resulting from the `averaging' of the non-stationarity across the series. That the scaling is in reality not constant across the series is graphically illustrated by the extreme variability of the $\hat\alpha_i$ in the bottom right plot of figure 9. These examples show the importance in practice of examining the constancy of scaling using Logscale Diagrams over blocks, and of using a well-founded hypothesis test to allow objective judgements to be made on the constancy of the scaling parameter.
Figure 8: pOct. Left: the LD of the whole time series (j1 = 7, j2 = 14), with N = 3. Top right: the time series of bytes per 10 ms interval. Bottom right: the estimates from 12 adjacent blocks with (j1, j2) = (7, 11), and the test outcome: Accept H0.
Figure 9: OctExt. Left: the LD of the whole time series (j1 = 7, j2 = 15), with N = 3. Top right: the time series of bytes per 1000 ms interval. Bottom right: the estimates from 12 adjacent blocks with (j1, j2) = (7, 12), and the test outcome: Reject H0.
Figure 10: OctExt4. Left: the LD of the whole time series (j1 = 7, j2 = 15), with N = 3. Top right: the time series of bytes per 10 ms interval. Bottom right: the estimates from 12 adjacent blocks with (j1, j2) = (7, 12), and the test outcome: Reject H0.
7 Conclusion
In this paper a statistical test was provided to investigate formally the `constancy of scaling', that is, whether variations in estimates of scaling exponents observed across a data set are consistent with a constant underlying exponent, or whether instead the scaling behaviour varies in time. Along with the test, a methodology was developed to examine the prior question of whether scaling exists at all. A key outcome is the ability to determine whether an apparent scaling observed across a time series is in fact meaningful, and the corresponding estimate reliable. The methodology and test rest on the wavelet domain's natural ability to analyse scaling: the wavelet-based analysis applies to many different kinds of scaling phenomena in a uniform framework, it exhibits statistical robustness, and the resulting estimator of the scaling exponent has excellent statistical properties, including negligible bias, low variance of a known form, and asymptotic Gaussianity. It has the additional key feature that estimates obtained from adjacent non-overlapping blocks of data are close to uncorrelated. Testing the constancy of the scaling exponent can therefore be idealised to simply testing whether uncorrelated Gaussian random variables with known (possibly different) variances have identical means or not. Simulation studies showed that the statistical properties of the test procedure in the wavelet domain are close to those of the idealised problem, for which the test is known to be uniformly most powerful invariant, with explicitly known power functions. The methodology and test were applied to Ethernet data, where their utility in avoiding erroneous interpretations of measured exponents was demonstrated. An extension to the work here, using the methodology and test as the building block of a change-point detection procedure for the scaling exponent $\alpha$, is under investigation.
References
[1] M. Abramowitz, I. A. Stegun, Handbook of Mathematical Functions, Dover Publications Inc., New York, 1970.
[2] P. Abry, Ondelettes et Turbulences - Multirésolutions, algorithmes de décompositions, invariance d'échelle et signaux de pressions, Diderot, Éditeur des Sciences et des Arts, Paris, 1997.
[3] P. Abry, P. Goncalves and P. Flandrin, "Wavelets, spectrum estimation and 1/f processes", in A. Antoniadis and G. Oppenheim, eds, Wavelets and Statistics, Lecture Notes in Statistics 103, pp. 15-30, Springer-Verlag, New York, 1995.
[4] P. Abry, D. Veitch, "Wavelet analysis of long-range dependent traffic", IEEE Trans. on Info. Theory, 44(1), pp. 2-15, 1998.
[5] P. Abry, D. Veitch, and P. Flandrin, "Long-range dependence: revisiting aggregation with wavelets", Journal of Time Series Analysis, 19(3), pp. 253-266, 1998.
[6] P. Abry, P. Flandrin, M. S. Taqqu, D. Veitch, "Wavelets for the analysis, estimation and synthesis of scaling data", submitted as a chapter to Self-Similar Network Traffic Analysis and Performance Evaluation, K. Park and W. Willinger, Eds., 1999.
[7] P. Abry, M. S. Taqqu, D. Veitch, "On the automatic selection of scaling range in the semi-parametric estimation of scaling exponents", in preparation.
[8] P. Abry, L. Delbeke, P. Flandrin, "Wavelet-based estimator for the self-similarity parameter of alpha-stable processes", submitted to ICASSP99.
[9] J. Beran, N. Terrin, "Estimation of the long-memory parameter, based on a multivariate central limit theorem", Journal of Time Series Analysis, 15(3), pp. 269-277, 1994.
[10] I. Daubechies, Ten Lectures on Wavelets, SIAM, Philadelphia (PA), 1992.
[11] W. H. Press, B. P. Flannery, S. A. Teukolsky, W. T. Vetterling, Numerical Recipes in C: The Art of Scientific Computing, second edition, Cambridge University Press.
[12] W. Feller, An Introduction to Probability Theory and its Applications, Wiley, second edition, 1971.
[13] P. Flandrin, "Wavelet analysis and synthesis of fractional Brownian motion", IEEE Trans. on Info. Theory, 38, pp. 910-917, 1992.
[14] E. L. Lehmann, Testing Statistical Hypotheses, Wiley, second edition, 1986.
[15] W. E. Leland, M. S. Taqqu, W. Willinger, and D. V. Wilson, "On the self-similar nature of Ethernet traffic (Extended version)", IEEE/ACM Trans. on Networking, 2, pp. 1-15, 1994.
[16] S. Mallat, A Wavelet Tour of Signal Processing, Academic Press, 1998.
[17] I. Norros, "A storage model with self-similar input", Queueing Systems, 16, pp. 387-396, 1994.
[18] M. Roughan, D. Veitch, "Measuring long-range dependence under changing traffic conditions", accepted to IEEE Infocom99, New York, NY, March 1999.
[19] A. H. Tewfik, M. Kim, "Correlation structure of the discrete wavelet coefficients of fractional Brownian motion", IEEE Trans. on Info. Theory, 38, pp. 904-909, 1992.
[20] S. D. Silvey, Statistical Inference, Chapman & Hall, 1975.
[21] D. Veitch, P. Abry, "A wavelet-based joint estimator of the parameters of long-range dependence", to appear in IEEE Trans. on Info. Theory, Special Issue on Multiscale Statistical Signal Analysis and its Applications, 1998.
[22] G. W. Wornell, A. V. Oppenheim, "Estimation of fractal signals from noisy measurements using wavelets", IEEE Trans. on Signal Proc., 40(3), pp. 611-623, 1992.
[23] W. Willinger, M. S. Taqqu, A. Erramilli, "A bibliographical guide to self-similar traffic and performance modeling for high-speed networks", in Stochastic Networks: Theory and Applications, F. P. Kelly, S. Zachary, I. Ziedins, editors, Royal Statistical Society Lecture Notes Series, Vol. 4, pp. 339-366, Clarendon Press, Oxford, UK, 1996.