ROBUST CHANGEPOINT DETECTION BASED ON MULTIVARIATE RANK STATISTICS

Alexandre Lung-Yut-Fong, Céline Lévy-Leduc, Olivier Cappé
Institut Telecom & CNRS, LTCI, Telecom ParisTech

ABSTRACT

We introduce a novel statistical test for unsupervised detection of changepoints in multidimensional sequences of temporal observations. The test statistic is based on a multivariate generalization of the Mann-Whitney-Wilcoxon two-sample test. The proposed test performs nonparametric changepoint localization and returns a quantifiable measure of significance in the form of a p-value. The approach is also parameter-free and extends easily to cases where the data is partly censored or has missing values. The performance of the method is illustrated through experiments on a publicly available econometric dataset.

Index Terms— changepoint detection, multivariate data, rank test

1. INTRODUCTION

Detection of change- (or break-) points in time series is a challenge that arises in many different contexts, such as monitoring and surveillance [1], computer security [2, 3], audio and image processing [4], financial and econometric modelling [5], or bioinformatics [6]. In many of these applications, the ultimate goal is both to segment the signal into homogeneous segments and to identify these segments using a reduced set of labels. This challenging task can usually only be addressed by parametric modelling of the signals of interest, often using strong prior information and sometimes supervised learning data. Typical examples of models used for this purpose include Hidden Markov Models and more general latent variable models [6].
In many contexts, however, especially when dealing with high-dimensional signals for which ground-truth labelling is not available, the joint segmentation and labelling task is daunting, and there is a strong interest in developing methods that solve the segmentation problem alone, without relying on a precise probabilistic model of the input data. Recently, several works have addressed this task using kernel-based methods [4, 7]. An additional requirement of importance in many applications is to provide a reliable and, if possible, interpretable measure of significance for individual changepoint detections. The work of [8] introduced a variant of [7] with computable p-values (false-alarm rates under the null hypothesis that no change is present) associated with the detection decisions. Another important distinction is that many of the aforementioned works actually rely on homogeneity testing. To turn such a test into a changepoint detection algorithm, one typically applies the test repeatedly in strongly overlapping sliding time windows, testing homogeneity between the left and right parts of each window. While this is a perfectly sensible way of proceeding, which can locate changepoints very accurately when the window overlap is sufficient, the reliability of the overall detection decision is hard to establish (this is an instance of multiple dependent tests). In contrast, in works such as [3, 9], each window is searched for the optimal changepoint location and the p-value takes this optimization process into account. We refer to such methods as changepoint detectors; in this work, we focus on batch changepoint detectors. One obvious limitation of this class of methods is that they test for the presence of a single changepoint in the considered observation window; they therefore require some knowledge of the typical separation between changepoints and may fail when changepoints are too close to one another. However, this setting makes it possible to develop methods that can both cope with a large variety of multidimensional signals and produce quantifiable detection decisions. The main contribution of this work is a nonparametric changepoint detector that can cope with multidimensional signals, together with its asymptotic distribution under the null hypothesis, which allows us to return a p-value associated with each changepoint detection. The method is thus an alternative to [9] that does not rely on the specification of a kernel function. It generalizes the test considered in [3], which only dealt with one-dimensional signals. Although this is not the focus of this paper, the proposed method also shares with the test of [3] the advantage of being applicable to multidimensional data with partially censored or even missing values. The proposed approach is based on a homogeneity test inspired by the work of [10].

The paper is organized as follows. The changepoint detection test is described in Section 2. In Section 3, we report the results of numerical experiments carried out on a real dataset considered in previous works.

2. CHANGEPOINT DETECTION

2.1. Test statistic

Let X_1, ..., X_n be a sequence of independent K-dimensional observations, i.e. X_t = (X_{t,1}, ..., X_{t,K})', t = 1, ..., n.
The changepoint detection test can be stated as deciding between the following hypotheses:

(H0): "(X_t)_{1 \le t \le n} are i.i.d. random vectors"

against

(H1): "there exists an unknown index r such that (X_1, ..., X_r) are distributed under P_1 and (X_{r+1}, ..., X_n) under P_2, with P_1 \neq P_2."

The objective here is both to estimate the most likely changepoint position (localization) and to quantify its significance (detection). To this aim, define, for 1 \le n_1 \le n-1, V_n(n_1) = (V_{n,1}(n_1), ..., V_{n,K}(n_1))' by

    V_{n,k}(n_1) = \frac{1}{n^{3/2}} \sum_{i=1}^{n_1} \sum_{j=n_1+1}^{n} \left( \mathbf{1}_{\{X_{i,k} \le X_{j,k}\}} - \mathbf{1}_{\{X_{j,k} \le X_{i,k}\}} \right),    (1)

where k = 1, ..., K, and

    S_n(n_1) = V_n(n_1)' \hat{\Sigma}^{-1} V_n(n_1),    (2)

where \hat{\Sigma} is the K \times K matrix whose (k, k')-th element is defined by

    \hat{\Sigma}_{n,kk'} = \frac{4}{n} \sum_{i=1}^{n} \{\hat{F}_{n,k}(X_{i,k}) - 1/2\}\{\hat{F}_{n,k'}(X_{i,k'}) - 1/2\}.    (3)

In (3), \hat{F}_{n,k}(t) = n^{-1} \sum_{j=1}^{n} \mathbf{1}(X_{j,k} \le t) denotes the empirical cumulative distribution function (c.d.f.) of the k-th coordinate X_{1,k}. The matrix \hat{\Sigma}_n is an empirical estimate of the covariance matrix \Sigma with general term

    \Sigma_{kk'} = 4 \, \mathrm{Cov}\left( F_k(X_{1,k}), F_{k'}(X_{1,k'}) \right), \quad 1 \le k, k' \le K,

F_k denoting the c.d.f. of X_{1,k}. V_{n,k} is the classical Mann-Whitney-Wilcoxon rank statistic. The changepoint detector that we propose hereafter consists in maximizing S_n, which can be seen as a homogeneity testing statistic inspired by the approach proposed in [10], over all possible changepoint positions. The following theorem defines the corresponding test statistic W_n as well as its p-value under the null hypothesis that no change in distribution occurs within the analyzed window.

Theorem 1. Assume that (X_i)_{1 \le i \le n} are R^K-valued i.i.d. random vectors such that, for all k, the c.d.f. F_k of X_{1,k} is a continuous function. Let

    W_n = \max_{1 \le n_1 \le n-1} |S_n(n_1)|,    (4)

where S_n(n_1) is defined by (2). Then,

    W_n \xrightarrow{d} \sup_{0 < t < 1} \sum_{k=1}^{K} B_k^2(t),    (5)

where (B_k)_{1 \le k \le K} are independent Brownian bridges. The p-value associated with an observed value of W_n is thus the tail probability of the right-hand side of (5), whose quantiles can be derived from the results of [11].

2.2. Implementation Issues

Computing S_n requires inverting \hat{\Sigma}, which may be singular or severely ill-conditioned when some coordinates of the observations are strongly dependent. In that case, \hat{\Sigma}^{-1} is replaced by the pseudo-inverse \Sigma^{\dagger}_n obtained by zeroing the smallest singular values of \hat{\Sigma}: denoting by s_1, ..., s_K its singular values, the pseudo-inverse is built from s^{\dagger}_i = s_i^{-1} \mathbf{1}(s_i > \epsilon), where \epsilon is a fixed positive threshold. The test statistic is then computed as

    \hat{S}_n(n_1) = V_n(n_1)' \Sigma^{\dagger}_n V_n(n_1),    (6)

and W_n is compared, under the null hypothesis, to the quantiles of the supremum of the sum of the squares of K' independent Brownian bridges, where K' is the number of non-null values of s^{\dagger}_i. This way of proceeding was found to be very effective for signals whose components can be highly dependent (typically with \epsilon of the order of the relative machine precision).

2.3. Extensions

The statement of Theorem 1 requires the continuity of the c.d.f. F_k of each coordinate; it is therefore not directly applicable, for instance, to discrete variables. In such cases, however, it can be shown that Theorem 1 remains valid upon redefining \Sigma as

    \Sigma_{kk'} = E\left[ \{F_k(X_{1,k}^-) + F_k(X_{1,k}) - 1\} \{F_{k'}(X_{1,k'}^-) + F_{k'}(X_{1,k'}) - 1\} \right],    (7)

where F_k(x^-) denotes the left limit of the c.d.f. at x. Another useful extension concerns censored or missing data, which can be handled in great generality by introducing lower and upper censoring values \underline{X}_{i,k} and \overline{X}_{i,k} such that

    \underline{X}_{i,k} \le X_{i,k} \le \overline{X}_{i,k},

where a strict inequality indicates censoring (for missing values, simply set \underline{X}_{i,k} = -\infty and \overline{X}_{i,k} = +\infty). In this case, Theorem 1 may also be extended upon redefining the marginal rank statistics by

    V_{n,k}(n_1) = \frac{1}{n^{3/2}} \sum_{i=1}^{n_1} \sum_{j=n_1+1}^{n} \left( \mathbf{1}_{\{\overline{X}_{i,k} \le \underline{X}_{j,k}\}} - \mathbf{1}_{\{\overline{X}_{j,k} \le \underline{X}_{i,k}\}} \right)    (8)

and the covariance matrix by

    \Sigma_{kk'} = E\left[ \{\overline{F}_k(\underline{X}_{1,k}) + \underline{F}_k(\overline{X}_{1,k}^-) - 1\} \{\overline{F}_{k'}(\underline{X}_{1,k'}) + \underline{F}_{k'}(\overline{X}_{1,k'}^-) - 1\} \right],    (9)

where \underline{F}_k and \overline{F}_k denote the c.d.f.'s of \underline{X}_{1,k} and \overline{X}_{1,k}, respectively. In either of the two cases above (discrete variables or censoring), \Sigma in (7) or (9), respectively, must be replaced by its empirical estimate \hat{\Sigma} to obtain a computable test statistic.

Finally, in results not shown here, we also checked using overlapping windows that the segmentation is robust with respect to the exact placement of the windows; this is indeed the case, except for changes that occur close to a window boundary (say, less than thirty samples away), for which the test power is greatly reduced. Hence, we recommend using slightly overlapping windows (shifted by one third or one half of their length) to avoid this potential loss of efficiency.
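For concreteness, the statistic of Eqs. (1)-(4), together with the pseudo-inverse of Section 2.2, can be sketched in a few lines of Python. This is an illustrative reimplementation under the continuous-marginals assumption, not the authors' code; the function name `multirank` and the use of NumPy's relative `rcond` cutoff in place of the absolute threshold ε are our choices:

```python
import numpy as np

def multirank(X, eps=1e-10):
    """Sketch of the MultiRank statistic, Eqs. (1)-(4).

    X : (n, K) array with continuous marginals (no ties).
    Returns (W_n, n1_hat): the test statistic and the maximizing split.
    """
    n, K = X.shape
    # Empirical marginal c.d.f.s evaluated at the sample points:
    # F_hat[i, k] = rank of X[i, k] within column k, divided by n.
    ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1.0
    F_hat = ranks / n
    # Eq. (3): covariance estimate of the centered rank transforms.
    C = F_hat - 0.5
    Sigma_hat = 4.0 / n * (C.T @ C)
    # Section 2.2: pseudo-inverse, discarding singular values below a
    # (here relative, not absolute) threshold eps.
    Sigma_dag = np.linalg.pinv(Sigma_hat, rcond=eps)
    W_n, n1_hat = -np.inf, None
    for n1 in range(1, n):
        # Eq. (1): for continuous data,
        # 1{X_i <= X_j} - 1{X_j <= X_i} = sign(X_j - X_i).
        diff = np.sign(X[n1:, None, :] - X[None, :n1, :])
        V = diff.sum(axis=(0, 1)) / n ** 1.5
        S = float(V @ Sigma_dag @ V)        # Eq. (2)
        if S > W_n:                         # Eq. (4)
            W_n, n1_hat = S, n1
    return W_n, n1_hat
```

On a simulated multivariate Gaussian sequence with a mean shift midway through the sample, the maximizer `n1_hat` is recovered at (or very near) the true break, and `W_n` is far larger than under a homogeneous sequence of the same length.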

¹ The data is available at http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html
² The 1930 change, which appears to be visible even by eye, is not properly detected because it is too close to the beginning of the record; see the comment on window placement at the end of Section 2.

3. NUMERICAL EXPERIMENTS

In this section, we consider the "industry portfolios" dataset, which consists of five- to thirty-dimensional econometric series that have been analyzed, in the context of changepoint detection, by [5, 12]. The data consists of monthly returns of 5, 17 and 30 industry-sector portfolios, which average the returns of coherent groups of stocks taken from the NYSE, NASDAQ and AMEX over the 1926-2004 period.¹ Both [5] and [12] applied a pre-processing transformation to cope with the heavy-tailed nature of the data; this was not necessary in our case, since our method is invariant under monotonic transformations of the data. No ground-truth segmentation of these datasets exists, and [5] and [12], who considered the five- and thirty-portfolio subsets respectively, found very different results. The original data comes in two slightly different variants. When using the Average Value Weighted Returns (AVWR) variant, as in [5, 12], our method found only one location with p-value smaller than 10^{-2}. The Average Equal Weighted Returns (AEWR) series, however, gave very different results, which are presented in Figure 1. We used a window size of n = 200 points, corresponding to a little more than sixteen years. Figure 1 only shows locations identified with significant p-values (represented by the thickness of the vertical change locator). One can observe that the changepoint decisions are generally consistent across the three portfolio sizes, which was to be expected, since these correspond to aggregations of the same original data at various levels of detail. In particular, the detection remains consistent for the higher-dimensional (30-portfolio) signal, showing that the method adapts to the dimension of the data and that the p-values are indeed well calibrated.

Although no ground truth is available for this data, one can observe that most detected changes, and especially those for which the p-value is small (shown by thicker lines), are synchronized with years in which US real GDP growth was negative (1930-1933, 1938, 1945-1947, 1954, 1958, 1974-1975, 1980, 1983 and 1991 over this period).²
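The p-values reported in this section come from the limiting distribution in Theorem 1, the supremum of a sum of K squared independent Brownian bridges. When tabulated quantiles are not at hand, this tail probability can be approximated by Monte Carlo simulation, as in the following sketch (our illustration, not the authors' implementation; the grid size and replication count are arbitrary choices):

```python
import numpy as np

def bridge_sup_pvalue(w, K, n_grid=256, n_rep=2000, seed=1):
    """Monte-Carlo estimate of P(sup_{0<t<1} sum_k B_k(t)^2 > w),
    the asymptotic p-value of the MultiRank statistic (Theorem 1)."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_grid
    # K independent Brownian motions per replication, on a regular grid.
    steps = rng.normal(0.0, np.sqrt(dt), size=(n_rep, K, n_grid))
    Wt = np.cumsum(steps, axis=-1)
    # Brownian bridge: B(t) = W(t) - t * W(1).
    t = np.arange(1, n_grid + 1) * dt
    B = Wt - t * Wt[..., -1:]
    # Supremum over the grid of the sum of squared bridges.
    sup_stat = (B ** 2).sum(axis=1).max(axis=-1)
    return float((sup_stat > w).mean())
```

As a sanity check, for K = 1 the quantity sup_t B^2(t) is the square of the Kolmogorov-Smirnov limit sup_t |B(t)|, whose 95% point is about 1.358, so `bridge_sup_pvalue(1.358**2, 1)` should return roughly 0.05 up to Monte-Carlo and discretization error.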

Fig. 1. Top: time series of monthly average equal weighted returns for the 5-portfolio dataset. Middle: changepoints found by the MultiRank method for different thresholds of the probability of false alarm: 0.05 (dashed lines), 0.01 (solid lines) and 0.001 (thick lines), for the 5-, 17- and 30-industry portfolios. Bottom: changepoints found by the likelihood-ratio test method.

In Figure 1, the MultiRank approach is compared to a baseline parametric likelihood-ratio changepoint detection test [13]. The latter is specifically designed to detect changes in the mean of i.i.d. multivariate Gaussian vectors and is based on Hotelling's T^2 statistic. To improve the performance of this approach on this dataset, we followed the suggestion of [5] and applied a preliminary log transform. The locations of the changepoints found using this approach, displayed in the bottom plot of Figure 1, are in good agreement with the results of the MultiRank detector (middle plot of Figure 1). However, the p-values of the Gaussian detector are generally observed to be significantly lower, especially for the higher-dimensional data (17 and 30 portfolios). With the Gaussian detector, there are also more inconsistencies between the locations of the changepoints detected from the three portfolio datasets, an effect that is clearly visible at the beginning of the data record (before 1940). Both observations can be related to the fact that the data is non-Gaussian (even after the log transform). For this dataset, the MultiRank detector thus appears to be more robust than the parametric Gaussian detector, although the baseline performance of the latter is already appreciable.

Fig. 2. Top: time series in a 200-sample window (April 1972-January 1989) of the average equal weighted returns for the 5-portfolio dataset. Middle: marginal statistics (|V_{n,k}(t)|)_{1 \le t \le n}, k = 1, ..., 5, for the signal in that window. Bottom: MultiRank test statistic (S_n(t))_{1 \le t \le n}.

In Figure 2, a particular 200-sample window of the AEWR 5-portfolio data, covering April 1972-January 1989, is examined. We compare the absolute values of the marginal statistics |V_{n,k}| (middle of Figure 2) with the MultiRank statistic (bottom). One can observe that methods based on the marginal rank statistics alone would, in this case, lead to an incorrect localization of the changepoint. Indeed, the MultiRank statistic W_n takes into account not only the marginal values of the rank statistics but also their possible correlations. For this dataset, very strong correlations exist between the different dimensions, as shown in Figure 3 (the smallest correlation equals 0.91). The MultiRank statistic takes the correlation (between ranks) into account both for locating the changepoint, as is obvious from Figure 2, and for computing the p-value. Additional studies on simulated data have shown that changes in correlated coordinates consistently lead to less significant p-values than changes of the same magnitude in uncorrelated coordinates. We stress that the correlations displayed in Figure 3 pertain to the rank statistics, and hence are highly sensitive to nonlinear dependencies between the coordinates (in contrast to standard coordinate correlations).

Fig. 3. Estimate of the correlation matrix of \hat{F}_{n,k}(X_{i,k}), i = 1, ..., 200, k = 1, ..., 5, where the X_{i,k} are the data shown in Figure 2.

4. CONCLUSION

We proposed and analyzed a nonparametric method for changepoint detection in multivariate data. Compared to alternatives, the MultiRank test statistic can be computed very efficiently, and its asymptotic distribution under the null hypothesis (i.e., in the absence of change) is known, which makes it possible to compute p-values and thus design a detector with guaranteed false-alarm rates. The method is invariant under monotonic marginal transformations of the data, avoiding ad hoc preprocessing steps. It appears to be robust and can handle both high-dimensional data and strong, possibly nonlinear, correlations between the coordinates of the observations.

5. REFERENCES

[1] M. Basseville and I. V. Nikiforov, Detection of Abrupt Changes: Theory and Applications, Prentice-Hall, 1993.
[2] A. G. Tartakovsky, B. L. Rozovskii, R. B. Blazek, and H. Kim, "A novel approach to detection of intrusions in computer networks via adaptive sequential and batch-sequential change-point detection methods," IEEE Trans. Signal Process., vol. 54, no. 9, 2006.
[3] C. Lévy-Leduc and F. Roueff, "Detection and localization of change-points in high-dimensional network traffic data," Annals of Applied Statistics, vol. 3, no. 2, pp. 637-662, 2009.
[4] F. Désobry, M. Davy, and C. Doncarli, "An online kernel change detection algorithm," IEEE Trans. Signal Process., vol. 53, no. 8, pp. 2961-2974, August 2005.
[5] M. Talih and N. Hengartner, "Structural learning with time-varying components: tracking the cross-section of the financial time series," J. R. Stat. Soc. Ser. B Stat. Methodol., vol. 67, no. 3, pp. 321-341, 2005.
[6] F. Picard, S. Robin, E. Lebarbier, and J.-J. Daudin, "A segmentation/clustering model for the analysis of array CGH data," Biometrics, vol. 63, pp. 758-766, 2007.
[7] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. J. Smola, "A kernel method for the two-sample-problem," in NIPS, 2006, pp. 513-520.
[8] Z. Harchaoui, F. Bach, and E. Moulines, "Testing for homogeneity with kernel Fisher discriminant analysis," in Advances in Neural Information Processing Systems, 2007.
[9] Z. Harchaoui, F. Bach, and E. Moulines, "Kernel change-point analysis," in Advances in Neural Information Processing Systems, 2008.
[10] L. J. Wei and J. M. Lachin, "Two-sample asymptotically distribution-free tests for incomplete multivariate observations," J. Amer. Statist. Assoc., vol. 79, no. 387, pp. 653-661, 1984.
[11] J. Kiefer, "K-sample analogues of the Kolmogorov-Smirnov and Cramér-von Mises tests," Ann. Math. Statist., vol. 30, pp. 420-447, 1959.
[12] X. Xuan and K. Murphy, "Modeling changing dependency structure in multivariate time series," in ICML '07: Proceedings of the 24th International Conference on Machine Learning, New York, NY, USA, 2007, pp. 1055-1062, ACM.
[13] M. S. Srivastava and K. J. Worsley, "Likelihood ratio tests for a change in the multivariate normal mean," J. Amer. Statist. Assoc., vol. 81, no. 393, pp. 199-204, 1986.
