Biometrika (2015), xx, x, pp. 1–23

A Fast Approach for Multiple Change-point Detection in Two-dimensional Data

BY VINCENT BRAULT, CÉLINE LÉVY-LEDUC
UMR MIA-Paris, AgroParisTech, INRA, Université Paris-Saclay, 75005, Paris, France
[email protected] [email protected]

AND JULIEN CHIQUET
LaMME - UMR CNRS 8071, Université d'Évry-Val-d'Essonne, 23 boulevard de France, 91037 Évry, France
[email protected]

SUMMARY
We propose a novel approach for estimating the location of block boundaries (change-points) in a random matrix consisting of a blockwise constant matrix observed in white noise. Our method consists in rephrasing this task as a variable selection issue. We use a penalized least-squares criterion with an ℓ1-type penalty for dealing with this problem. We first provide some theoretical results ensuring the consistency of our change-point estimators. Then, we explain how to implement our method in a very efficient way. Finally, we provide some empirical evidence to support our claims and apply our approach to HiC data coming from molecular biology, which are used for a better understanding of the chromatin structure.
Some key words: change-points, lasso, two-dimensional data, HiC experiments.
1. INTRODUCTION
Automatically detecting the block boundaries in a blockwise constant matrix corrupted with noise is an important issue which may have several applications. One of the main situations in which this problem occurs is the detection of chromosomal regions having close spatial location in the nucleus. Detecting such regions will improve our understanding of the influence of the chromosomal conformation on the functioning of cells. The data provided by the most recent technology, called HiC, consist of a list of pairs of locations along the chromosome, which are often summarized as a square matrix such that each entry corresponds to the number of interactions between two positions along the chromosome, see Dixon et al. (2012). Since this matrix can be modeled as a blockwise constant matrix corrupted by some additional noise, it is of particular interest to design an efficient and fully automated method to find the block boundaries of large matrices, which may typically have several thousands of rows and columns, in order to identify the interacting chromosomal regions.
A large literature is dedicated to the change-point detection issue for one-dimensional data. This problem can be addressed from a sequential (on-line) point of view, see Tartakovsky et al. (2014), or from a retrospective (off-line) point of view, see Brodsky & Darkhovsky (2000). Many off-line approaches are based on the dynamic programming algorithm, which retrieves K change-points within n observations of a one-dimensional signal with a complexity of O(Kn²) in time, see Kay (1993).
Such a complexity is however prohibitive for dealing with very large data sets. In this situation, Harchaoui & Lévy-Leduc (2010) proposed to rephrase the change-point estimation issue as a variable selection problem. This approach has also been extended by Vert & Bleakley (2010) to find shared change-points between several signals. To the best of our knowledge, no method has been proposed for addressing the case of two-dimensional data where the number of rows and columns may be very large (n × n ≈ 5000 × 5000, namely about 2.5 × 10⁷ observations). The only statistical approach proposed for retrieving the change-point positions in the two-dimensional framework is the one devised by Lévy-Leduc et al. (2014), but it is limited to the case where the matrix is assumed to be blockwise constant on the diagonal blocks and constant outside the diagonal blocks. It has first to be noticed that the classical dynamic programming algorithm cannot be applied in such a framework, since the Markov property does not hold anymore. Moreover, the group LARS approach of Vert & Bleakley (2010) cannot be used in this framework, since it would only provide change-points in columns and not in rows. As for the generalized Lasso recently devised by Tibshirani & Taylor (2011) or the two-dimensional fused Lasso of Hoefling (2010), they are very helpful for image denoising but do not give access to the change-point positions.
The paper is organized as follows. In Section 2, we first describe how to rephrase the problem of two-dimensional change-point estimation as a high-dimensional sparse linear model and give some theoretical results which prove the consistency of our change-point estimators. In Section 3, we describe how to implement our method efficiently. In Section 4, we provide experimental evidence of the relevance of our approach on synthetic and on real data coming from molecular biology, called HiC data.
2. STATISTICAL FRAMEWORK
2·1. Statistical modeling
In this section, we explain how the two-dimensional retrospective change-point estimation issue can be seen as a variable selection problem. Our goal is to estimate t⋆_1 = (t⋆_{1,1}, ..., t⋆_{1,K⋆_1}) and t⋆_2 = (t⋆_{2,1}, ..., t⋆_{2,K⋆_2}) from the random matrix Y = (Y_{i,j})_{1≤i,j≤n} defined by

Y = U + E,  (1)

where U = (U_{i,j}) is a blockwise constant matrix such that

U_{i,j} = µ⋆_{k,ℓ}  if t⋆_{1,k−1} ≤ i ≤ t⋆_{1,k} − 1 and t⋆_{2,ℓ−1} ≤ j ≤ t⋆_{2,ℓ} − 1,

with the convention t⋆_{1,0} = t⋆_{2,0} = 1 and t⋆_{1,K⋆_1+1} = t⋆_{2,K⋆_2+1} = n + 1. An example of such a matrix U is displayed in Figure 1. The entries E_{i,j} of the matrix E = (E_{i,j})_{1≤i,j≤n} are iid zero-mean random variables. With such a definition, the Y_{i,j} are assumed to be independent random variables with a blockwise constant mean.
Let T be the n × n lower triangular matrix with all nonzero elements equal to one and B a sparse matrix containing null entries except for the B_{i,j} such that (i,j) ∈ {t⋆_{1,0}, ..., t⋆_{1,K⋆_1}} × {t⋆_{2,0}, ..., t⋆_{2,K⋆_2}}. Then, (1) can be rewritten as follows:

Y = T B T^⊤ + E,  (2)

where T^⊤ denotes the transpose of the matrix T. For an example of a matrix B, see Figure 1.
Let Vec(X) denote the vectorization of the matrix X formed by stacking the columns of X into a single column vector; then Vec(Y) = Vec(TBT^⊤) + Vec(E). Hence, by using that Vec(AXC) = (C^⊤ ⊗ A)Vec(X), where ⊗ denotes the Kronecker product, (2) can be rewritten
Fig. 1. Left: An example of a matrix U with n = 16, K⋆_1 = 3 and K⋆_2 = 4. Right: The matrix B associated with this matrix U, whose nonzero entries lie on the grid {t⋆_{1,0}, ..., t⋆_{1,K⋆_1}} × {t⋆_{2,0}, ..., t⋆_{2,K⋆_2}}.
as

Y = X B + E,  (3)

where Y = Vec(Y), X = T ⊗ T, B = Vec(B) and E = Vec(E). Thanks to these transformations, Model (1) has thus been rephrased as a sparse high-dimensional linear model where Y and E are n² × 1 column vectors, X is an n² × n² matrix and B is an n² × 1 sparse column vector.
The multiple change-point estimation problem (1) can thus be addressed as a variable selection problem:

B̂(λ_n) = Argmin_{B ∈ ℝ^{n²}} { ‖Y − X B‖₂² + λ_n ‖B‖₁ },  (4)

where ‖u‖₂² and ‖u‖₁ are defined for a vector u in ℝ^N by ‖u‖₂² = Σ_{i=1}^{N} u_i² and ‖u‖₁ = Σ_{i=1}^{N} |u_i|. Criterion (4) is related to the popular Least Absolute Shrinkage and Selection Operator (LASSO) in least-squares regression. Thanks to the sparsity-enforcing property of the ℓ1 norm, the estimator B̂ of B is expected to be sparse and to have non-zero elements matching those of B. Hence, retrieving the positions of the non-zero elements of B̂ provides estimators of (t⋆_{1,k})_{1≤k≤K⋆_1} and of (t⋆_{2,k})_{1≤k≤K⋆_2}. More precisely, let us define the set of active variables by

Â(λ_n) = { j ∈ {1, ..., n²} : B̂_j(λ_n) ≠ 0 }.

For each j in Â(λ_n), consider the Euclidean division of (j − 1) by n, namely (j − 1) = nq_j + r_j; then

t̂_1 = (t̂_{1,k})_{1≤k≤|Â_1(λ_n)|} ∈ {r_j + 1 : j ∈ Â(λ_n)},  t̂_2 = (t̂_{2,ℓ})_{1≤ℓ≤|Â_2(λ_n)|} ∈ {q_j + 1 : j ∈ Â(λ_n)},  (5)

where t̂_{1,1} < t̂_{1,2} < ··· < t̂_{1,|Â_1(λ_n)|} and t̂_{2,1} < t̂_{2,2} < ··· < t̂_{2,|Â_2(λ_n)|}.
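To make the correspondence in (5) concrete, here is a minimal C++/armadillo sketch (the helper name is ours, not part of blockseg) recovering the candidate row and column change-points from the support of a fitted coefficient vector of length n², via the Euclidean division of j − 1 by n:

#include <armadillo>
#include <set>
#include <utility>

// Recover the candidate change-points of (5) from the support of B_hat,
// using the 1-based convention (j - 1) = n * q_j + r_j of the paper.
std::pair<std::set<arma::uword>, std::set<arma::uword>>
change_points_from_support(const arma::vec& B_hat, arma::uword n) {
  std::set<arma::uword> t1, t2;            // sorted, duplicates discarded
  for (arma::uword j = 1; j <= n * n; ++j) {
    if (B_hat(j - 1) != 0.0) {
      t1.insert((j - 1) % n + 1);          // remainder r_j -> row change-point
      t2.insert((j - 1) / n + 1);          // quotient q_j -> column change-point
    }
  }
  return {t1, t2};
}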
In (5), |Â_1(λ_n)| and |Â_2(λ_n)| correspond to the number of distinct elements in {r_j : j ∈ Â(λ_n)} and {q_j : j ∈ Â(λ_n)}, respectively. As far as we know, neither a thorough practical implementation nor a theoretical grounding has been given so far to support such an approach for change-point estimation in the two-dimensional case. In the following section, we give theoretical results supporting the use of such an approach.

2·2. Theoretical results
In order to establish the consistency of the estimators t̂_1 and t̂_2 defined in (5), we shall use assumptions (A1)–(A4). These assumptions involve the two following quantities:

I⋆_min = min_{0≤k≤K⋆_1} |t⋆_{1,k+1} − t⋆_{1,k}| ∧ min_{0≤k≤K⋆_2} |t⋆_{2,k+1} − t⋆_{2,k}|,

J⋆_min = min_{1≤k≤K⋆_1, 1≤ℓ≤K⋆_2+1} |µ⋆_{k+1,ℓ} − µ⋆_{k,ℓ}| ∧ min_{1≤k≤K⋆_1+1, 1≤ℓ≤K⋆_2} |µ⋆_{k,ℓ+1} − µ⋆_{k,ℓ}|,
which correspond to the smallest length between two consecutive change-points and to the smallest jump size between two consecutive blocks, respectively.
(A1) The random variables (E_{i,j})_{1≤i,j≤n} are iid zero-mean random variables such that there exists a positive constant β such that, for all ν in ℝ, E[exp(νE_{1,1})] ≤ exp(βν²).

(A2) The sequence (λ_n) appearing in (4) is such that (nδ_n J⋆_min)⁻¹ λ_n → 0, as n tends to infinity.

(A3) The sequence (δ_n) is a non-increasing and positive sequence tending to zero such that nδ_n (J⋆_min)² / log(n) → ∞, as n tends to infinity.

(A4) I⋆_min ≥ nδ_n.
PROPOSITION 1. Let (Y_{i,j})_{1≤i,j≤n} be defined by (1) and t̂_{1,k}, t̂_{2,k} be defined by (5). Assume that |Â_1(λ_n)| = K⋆_1 and that |Â_2(λ_n)| = K⋆_2, with probability tending to one. Then,

P( { max_{1≤k≤K⋆_1} |t̂_{1,k} − t⋆_{1,k}| ≤ nδ_n } ∩ { max_{1≤k≤K⋆_2} |t̂_{2,k} − t⋆_{2,k}| ≤ nδ_n } ) → 1, as n → ∞.  (6)
The proof of Proposition 1 is based on the two following lemmas. The first one comes from the Karush-Kuhn-Tucker conditions of the optimization problem stated in (4). The second one allows us to control the supremum of the empirical mean of the noise.

LEMMA 1. Let (Y_{i,j})_{1≤i,j≤n} be defined by (1). Then the n × n matrix Û defined by Vec(Û) = X B̂, where X and B̂ are defined in (3) and (4) respectively, is such that

Σ_{k=r_j+1}^{n} Σ_{ℓ=q_j+1}^{n} Y_{k,ℓ} − Σ_{k=r_j+1}^{n} Σ_{ℓ=q_j+1}^{n} Û_{k,ℓ} = (λ_n/2) sign(B̂_j),  if B̂_j ≠ 0,  (7)

| Σ_{k=r_j+1}^{n} Σ_{ℓ=q_j+1}^{n} Y_{k,ℓ} − Σ_{k=r_j+1}^{n} Σ_{ℓ=q_j+1}^{n} Û_{k,ℓ} | ≤ λ_n/2,  if B̂_j = 0,  (8)

where q_j and r_j are the quotient and the remainder of the Euclidean division of (j − 1) by n, respectively, that is (j − 1) = nq_j + r_j. In (7), sign denotes the function defined by sign(x) = 1 if x > 0, −1 if x < 0 and 0 if x = 0. Moreover, the matrix Û is blockwise constant and satisfies Û_{i,j} = µ̂_{k,ℓ} if t̂_{1,k−1} ≤ i ≤ t̂_{1,k} − 1 and t̂_{2,ℓ−1} ≤ j ≤ t̂_{2,ℓ} − 1, k ∈ {1, ..., |Â_1(λ_n)|}, ℓ ∈ {1, ..., |Â_2(λ_n)|}, where the t̂_{1,k}, t̂_{2,ℓ}, Â_1(λ_n) and Â_2(λ_n) are defined in (5).
LEMMA 2. Let (E_{i,j})_{1≤i,j≤n} be random variables satisfying (A1). Let also (v_n) and (x_n) be two positive sequences such that v_n x_n² / log(n) → ∞. Then,

P( max_{1≤r_n<s_n≤n, s_n−r_n≥v_n} (s_n − r_n)⁻¹ | Σ_{j=r_n+1}^{s_n} E_{n,j} | ≥ x_n ) → 0, as n → ∞.

3. IMPLEMENTATION

LEMMA 3. For all v in ℝ^{n²}, the computation of X v or X^⊤ v requires at worst 2n² operations.
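The 2n² bound of Lemma 3 comes from the cumulative-sum structure of T. A minimal armadillo sketch of the product X v (function name ours): since X = T ⊗ T with T lower triangular of ones, X v = Vec(T V T^⊤), where Vec(V) = v; left multiplication by T is a cumulative sum down each column and right multiplication by T^⊤ a cumulative sum along each row.

#include <armadillo>

// Compute X v = Vec(T V T') in about 2n^2 operations.
arma::vec X_times_v(const arma::vec& v, arma::uword n) {
  arma::mat V = arma::reshape(v, n, n);    // Vec^{-1}(v)
  arma::mat M = arma::cumsum(V, 0);        // T V: cumulative sums down columns
  M = arma::cumsum(M, 1);                  // (T V) T': cumulative sums along rows
  return arma::vectorise(M);               // Vec(T V T')
}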
LEMMA 4. Let A = {a_1, ..., a_K} and, for each j in A, consider the Euclidean division of j − 1 by n given by j − 1 = nq_j + r_j. Then

(X^⊤X)_{A,A} = ( (n − (q_{a_k} ∨ q_{a_ℓ})) × (n − (r_{a_k} ∨ r_{a_ℓ})) )_{1≤k,ℓ≤K}.  (9)

Moreover, for any non-empty subset A of distinct indices in {1, ..., n²}, the matrix X_A^⊤ X_A is invertible.

LEMMA 5. Assume that we have at our disposal the Cholesky factorization of X_A^⊤ X_A. The updated factorization on the extended set A ∪ {j} only requires solving a triangular system of size |A|, with complexity O(|A|²). Moreover, the downdated factorization on the restricted set A \ {j} requires a rotation with negligible cost to preserve the triangular form of the Cholesky factorization after a column deletion.
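The closed form (9) can be evaluated directly; a minimal sketch (function name ours) builds the K × K Gram matrix of the active columns from the quotients and remainders alone, without ever forming the n² × n² matrix X:

#include <algorithm>
#include <armadillo>
#include <vector>

// Gram matrix (X'X)_{A,A} of Lemma 4; the indices in A are 1-based.
arma::mat active_gram(const std::vector<arma::uword>& A, arma::uword n) {
  const arma::uword K = A.size();
  arma::mat G(K, K);
  for (arma::uword k = 0; k < K; ++k)
    for (arma::uword l = 0; l < K; ++l) {
      arma::uword qk = (A[k] - 1) / n, rk = (A[k] - 1) % n;
      arma::uword ql = (A[l] - 1) / n, rl = (A[l] - 1) % n;
      G(k, l) = double((n - std::max(qk, ql)) * (n - std::max(rk, rl)));
    }
  return G;
}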
Remark 2. We were able to obtain a closed-form expression of the inverse (X_A^⊤ X_A)⁻¹ for some special cases of the subset A, namely, when the quotients/remainders associated with the Euclidean divisions of the elements of A are endowed with a particular ordering. Moreover, for addressing any general problem, we rather solve systems involving X_A^⊤ X_A by means of a Cholesky factorization which is updated along the homotopy algorithm. These updates correspond to adding or removing one element at a time in A and are performed efficiently, as stated in Lemma 5.

These lemmas are the building blocks of our LARS implementation given in Algorithm 1, where we detail the leading complexity associated with each part. The global complexity is O(mn² + ms²), where m is the final number of steps in the while loop. These steps include all the successive additions and deletions needed to reach s, the final targeted number of active variables. At the end of the day, we have m blockwise predictions Ŷ associated with the series of m estimations of B̂(λ). Concerning the memory requirements, we only need to store the n × n data matrix Y once. Indeed, since we have at our disposal the analytic form of any submatrix extracted from X^⊤X, we never need to compute nor store this large n² × n² matrix. This paves the way for quickly processing data with thousands of rows and columns.

Algorithm 1. Fast LARS for two-dimensional change-point estimation
Input: data matrix Y, maximal number of active variables s.
// Initialization
Start with no change-point: A ← ∅, B̂ ← 0.
Compute the current correlations ĉ ← X^⊤Y with Lemma 3.   // O(n²)
while λ > 0 and |A| < s do
  // Update the set of active variables
  Determine the next change-point(s) by setting λ ← ‖ĉ‖∞ and A ← {j : |ĉ_j| = λ}.
  Update the Cholesky factorization of X_A^⊤ X_A with Lemma 4.   // O(|A|²)
  // Compute the direction of descent
  Get the unnormalized direction w̃_A ← (X_{·,A}^⊤ X_{·,A})⁻¹ sign(ĉ_A).   // O(|A|²)
  Normalize w_A ← α w̃_A with α ← 1/√(w̃_A^⊤ sign(ĉ_A)).
  Compute the equiangular vector u_A ← X_A w_A and a ← X^⊤ u_A with Lemma 3.   // O(n²)
  // Compute the direction step
  Find the maximal step preserving equicorrelation: γ_in ← min⁺_{j∈A^c} { (λ − ĉ_j)/(α − a_j), (λ + ĉ_j)/(α + a_j) }.
  Find the maximal step preserving the signs: γ_out ← min⁺_{j∈A} { −B̂_j / w_{A,j} }.
  The direction step preserving both is γ̂ ← min(γ_in, γ_out).
  Update the correlations ĉ ← ĉ − γ̂ a and B̂_A ← B̂_A + γ̂ w_A accordingly.   // O(n)
  // Drop variables crossing the zero line
  if γ_out < γ_in then
    Remove the corresponding change-point(s): A ← A \ {j ∈ A : B̂_j = 0}.
    Downdate the Cholesky factorization of X_A^⊤ X_A.   // O(|A|)
Output: sequence of triplets (A, λ, B̂) recorded at each iteration.
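The update of Lemma 5 amounts to extending a Cholesky factor by one row. A minimal sketch (function name ours, assuming a non-empty A): given the lower factor L of G = X_A^⊤ X_A, the cross-products g = X_A^⊤ X_j and the squared norm γ = X_j^⊤ X_j of the entering variable — both available in closed form from (9) — one triangular solve of cost O(|A|²) suffices.

#include <armadillo>
#include <cmath>

// Extend the lower Cholesky factor L of G when a variable enters A.
arma::mat cholesky_add(const arma::mat& L, const arma::vec& g, double gamma) {
  arma::vec w = arma::solve(arma::trimatl(L), g);   // solve L w = g
  double d = std::sqrt(gamma - arma::dot(w, w));    // new diagonal entry
  arma::uword K = L.n_rows;                         // assumes K >= 1
  arma::mat L_new(K + 1, K + 1, arma::fill::zeros);
  L_new.submat(0, 0, K - 1, K - 1) = L;             // copy the old factor
  L_new.submat(K, 0, K, K - 1) = w.t();             // new last row
  L_new(K, K) = d;
  return L_new;
}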
4. NUMERICAL EXPERIMENTS
4·1. Synthetic data
The goal of this section is to assess the statistical and numerical performance of our methodology. We generated observations according to Model (1), where U is a symmetric blockwise constant matrix defined by

(µ⋆_{k,ℓ})_{k∈{1,...,K⋆_1+1}, ℓ∈{1,...,K⋆_2+1}} =
  [ 1 0 1 0 1
    0 1 0 1 0
    1 0 1 0 1   ,  (10)
    0 1 0 1 0
    1 0 1 0 1 ]

and the E_{i,j} are zero-mean i.i.d. Gaussian random variables of variance σ², where σ is in {1, 2, 5}. Some examples of data generated from this model with n = 500 and K⋆_1 = K⋆_2 = 4 can be found in Figure 2 (top).

Statistical performance. For each value of σ in {1, 2, 5}, we generated 1000 matrices following the model described in Section 4·1 with (t⋆_{1,k})_{1≤k≤K⋆_1} = ([nk/(K⋆_1+1)] + 1)_{1≤k≤K⋆_1} and (t⋆_{2,k})_{1≤k≤K⋆_2} = ([nk/(K⋆_2+1)] + 1)_{1≤k≤K⋆_2}, where [x] denotes the integer part of x and K⋆_1 = K⋆_2 = 4. Figure 2 (middle) displays the mean square error n⁻²‖B − B̂‖₂² for the different samples (in gray) and the median of the mean square errors (in red) as a function of the number of active variables s defined in Algorithm 1. We can see from this figure that, even in the high noise level case, the mean square error is small. Moreover, the ROC curves displayed in the bottom part of Figure 2 show that the change-points in rows, (t⋆_{1,k})_{1≤k≤K⋆_1}, are properly retrieved with a very small error rate even in high noise level frameworks. The same results hold for the change-points in columns, (t⋆_{2,k})_{1≤k≤K⋆_2}, but are not displayed since they are very similar.

Numerical performance. We implemented Algorithm 1 in C++ using the linear algebra library armadillo developed by Sanderson (2010), and also provide an interface to the R platform (R Core Team, 2015) through the R package blockseg, which will be available from the Comprehensive R Archive Network (CRAN). All experiments were conducted on a Linux workstation with an Intel Xeon 2.4 GHz processor and 8 GB of memory. We generated data as in Model (10) for different values of n, n ∈ {100, 250, 500, 1000, 2500, 5000}, and different values of the maximal number of activated variables, s ∈ {50, 100, 250, 500, 750}. The median runtimes obtained from 4 replications (plus 2 for warm-up) are reported in Figure 3. The left (resp. right) panel gives the runtimes in seconds as a function of s (resp. of n). These results give experimental evidence for the theoretical complexity O(mn² + ms²) that we established in Section 3 and thus for the computational efficiency of our approach: applying blockseg to matrices containing 10⁷ entries takes less than 2 minutes for s = 750.

Comparison to other methodologies. Since, to the best of our knowledge, no two-dimensional methods are available, we propose to compare our approach to an adaptation of the CART procedure of Breiman et al. (1984) and to an adaptation of the approach of Harchaoui & Lévy-Leduc (2010) (HL), which is dedicated to univariate observations. We adapt the CART methodology by using the successive boundaries provided by CART as change-points for the two-dimensional data. The associated ROC curve is displayed in red in Figure 5. For adapting the HL methodology, we apply it to each row of the data matrix and, for each λ, we obtain the change-points of each row. The change-points appearing in the different rows are claimed to be change-points for the two-dimensional data either if they appear in at least one row (the associated ROC curve for this approach, called HL1, is displayed in purple in Figure 5)
Fig. 2. Top: Examples of matrices Y generated from the model described in Section 4·1. Middle: Mean square errors n⁻²‖B − B̂‖₂² for the different realizations in gray and the median of the mean square errors in red, as a function of the number of nonzero elements in B̂ for each scenario. Bottom: ROC curves for the estimated change-points in rows.
or if they appear in at least ([n/2] + 1) rows (the associated ROC curve for this approach, called HL2, is displayed in blue in Figure 5). Since the procedures HL1 and HL2 are much slower than ours, the ROC curves are displayed for matrices of size 250 × 250. In this comparison, four different
Fig. 3. Left: Computation time (in seconds) for various values of n as a function of the sparsity level s = |A| reached at the end of the algorithm. Right: Computation time (in seconds) as a function of the sample size n.
types of matrices U were considered, corresponding to the following (µ⋆_{k,ℓ}):

µ⋆,(1) = [ 1 0 1 0 1        µ⋆,(2) = [ 1 0 0 0 0
           0 1 0 1 0                   0 1 0 0 0
           1 0 1 0 1   ,               0 0 1 0 0   ,
           0 1 0 1 0                   0 0 0 1 0
           1 0 1 0 1 ]                 0 0 0 0 1 ]

µ⋆,(3) = [ 1 0 0 0 0        µ⋆,(4) = [  0 −1 −1 −1 −1
           0 1 1 1 1                   −1 −1  0 −1  0
           0 1 1 0 0   and             −1  0  1  0  1   ,
           0 1 0 1 0                   −1 −1  0 −1  0
           0 1 0 0 1 ]                 −1  0  1  0  1 ]

with the corresponding matrices (B_{i,j})_{(i,j) ∈ {t⋆_{1,0},...,t⋆_{1,4}} × {t⋆_{2,0},...,t⋆_{2,4}}}:

B(1) = [  1 −1  1 −1  1      B(2) = [  1 −1  0  0  0
         −1  2 −2  2 −2               −1  2 −1  0  0
          1 −2  2 −2  2   ,            0 −1  2 −1  0   ,
         −1  2 −2  2 −2                0  0 −1  2 −1
          1 −2  2 −2  2 ]              0  0  0 −1  2 ]

B(3) = [  1 −1  0  0  0      B(4) = [  0 −1  0  0  0
         −1  2  0  0  0               −1  1  1 −1  1
          0  0  0 −1  0   and          0  1  0  0  0   .
          0  0 −1  2 −1                0 −1  0  0  0
          0  0  0 −1  2 ]              0  1  0  0  0 ]
The first configuration (µ⋆,(1)) is the one used in Equation (10). The second one (µ⋆,(2)) is close to the block diagonal model, which corresponds to the so-called cis-interactions in human HiC experiments. The third (µ⋆,(3)) and fourth (µ⋆,(4)) cases are chosen because some rows only
contain one nonzero parameter in the associated matrix B, which makes the corresponding change-points more difficult to detect. The corresponding Y matrices are displayed in Figure 4 when the variance of the additive noise is equal to 1.

Fig. 4. From left to right: Examples of simulated matrices Y for µ⋆,(1), µ⋆,(2) and µ⋆,(3) with σ = 1.
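For reference, a minimal armadillo sketch of the simulation design of Section 4·1 (function name ours): equally spaced change-points t_k = [nk/(K+1)] + 1 and the 0/1 checkerboard block means of (10), plus iid Gaussian noise of level σ.

#include <armadillo>

// Simulate Y = U + E as in (1), with the checkerboard U of (10).
arma::mat simulate_Y(arma::uword n, arma::uword K, double sigma) {
  arma::uvec t(K + 2);
  for (arma::uword k = 0; k <= K + 1; ++k)
    t(k) = n * k / (K + 1) + 1;            // t_0 = 1, ..., t_{K+1} = n + 1
  arma::mat U(n, n);
  for (arma::uword k = 0; k <= K; ++k)
    for (arma::uword l = 0; l <= K; ++l)
      U.submat(t(k) - 1, t(l) - 1, t(k + 1) - 2, t(l + 1) - 2)
       .fill(double((k + l) % 2 == 0));    // checkerboard of 0/1 blocks
  return U + sigma * arma::randn(n, n);    // add the Gaussian noise E
}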
We can see from Figure 5 that our method outperforms the other ones and is less sensitive to the shape of the matrix U.

Model selection. In the previous experiments, we did not need to explain how to choose the number of estimated change-points, since we used ROC curves for comparing the methodologies. However, in real data applications, it is necessary to propose a methodology for estimating the number of change-points. This is what we explain in the following. In practice, we take s = K²max, where Kmax is an upper bound for K⋆_1 and K⋆_2. For choosing the final change-points, we adapt the well-known stability selection approach devised by Meinshausen & Bühlmann (2010). More precisely, we randomly choose, M times, n/2 columns and n/2 rows of the matrix Y, and for each subsample we select s = K²max active variables. Finally, after the M data resamplings, we keep the change-points which appear a number of times larger than a given threshold. By the definition of the change-points given in (5), a change-point t̂_{1,k} or t̂_{2,ℓ} may appear several times in a given set of resampled observations. Hence, the score associated with each change-point corresponds to the sum of the number of times it appears in each of the M subsamplings. To evaluate the performance of this methodology, we generated observations according to the model defined in (1) and (10) with s = 225 and M = 100. The results are given in Figure 6, which displays the score associated with each change-point for a given matrix Y (top). We can see from the top part of Figure 6 some spurious change-points appearing near the true change-point positions. In order to identify the most representative change-point in a given neighborhood, we keep the one with the largest score among a set of contiguous candidates. The result of such a post-processing is displayed in the bottom part of Figure 6 and in Figure 7. More precisely, the boxplots associated with the estimation of K⋆_1 (resp. the histograms of the estimated change-points in rows) are displayed in the bottom part of Figure 6 (resp. in Figure 7) for different values of σ and different thresholds (thresh), expressed as a percentage of the largest score. We can see from these figures that, when thresh is in the interval [20, 40], the number and the location of the change-points are very well estimated, even in the high noise level case. In order to further assess our methodology, we generated observations following (1) with n = 1000 and K⋆_1 = K⋆_2 = 100, where we used for the matrix U the same shape as that of the matrix µ⋆ given in (10), except that K⋆_1 = K⋆_2 = 100. In this framework, the proportion
Fig. 5. ROC curves for the estimated change-points in rows for our method (in green), HL1 (in purple), HL2 (in blue) and CART (in red), with one column per noise level σ = 1, 2, 5 and 10. Each row is associated with a matrix: the first row for µ⋆,(1), then µ⋆,(2), µ⋆,(3) and µ⋆,(4).
of change-points is thus ten times larger than that of the previous case. The corresponding results are displayed in Figures 8, 9 and 10. We can see from the last figure that taking a threshold equal to 20% provides the best estimates of the number and of the positions of the change-points. This threshold corresponds to the lower bound of the threshold interval obtained in the previous configuration.
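For concreteness, the resampling scheme used throughout this model-selection study can be sketched as follows (names ours, not part of blockseg; the `fit` argument stands for any routine returning the estimated row change-points of a submatrix, e.g. a LARS fit with s = K²max active variables as in Algorithm 1):

#include <armadillo>
#include <functional>
#include <map>
#include <set>

// Accumulate stability-selection scores for row change-points over M
// random subsamples of n/2 rows and n/2 columns of Y.
std::map<arma::uword, arma::uword> stability_scores(
    const arma::mat& Y, arma::uword M,
    const std::function<std::set<arma::uword>(const arma::mat&)>& fit) {
  const arma::uword n = Y.n_rows;
  std::map<arma::uword, arma::uword> score;     // change-point -> score
  for (arma::uword m = 0; m < M; ++m) {
    arma::uvec rows = arma::sort(arma::randperm(n, n / 2));
    arma::uvec cols = arma::sort(arma::randperm(n, n / 2));
    for (arma::uword t : fit(Y.submat(rows, cols)))
      score[rows(t - 1) + 1] += 1;              // map back to original indices
  }
  return score;   // keep the change-points whose score exceeds the threshold
}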
Fig. 6. Top: Scores associated with each estimated change-point for different values of σ; the true change-point positions in rows and columns are located at 101, 201, 301 and 401. Bottom: Boxplots of the estimation of K⋆_1 for different values of σ and thresh after the post-processing step.
Fig. 7. Barplots of the estimated change-points for different variances (columns) and different thresholds (rows) for the model µ⋆,(1).
Fig. 8. Boxplots of the estimation of K⋆_1 for different values of σ and thresholds (thresh) after the post-processing step.
Fig. 9. Barplots of the estimated change-points for different variances (columns) and different thresholds (rows) in the case where n = 1000 and K⋆_1 = K⋆_2 = 100.
Fig. 10. Zoom of the barplots of Figure 9.
4·2. Application to HiC data
In this section, we applied our methodology to publicly available data (http://chromosome.sdsc.edu/mouse/hi-c/download.html) already studied by Dixon et al. (2012). More precisely, we studied the interaction matrices of Chromosomes 1 and 19 of the mouse cortex at a resolution of 40 kb, and we compared the number and the location of the estimated change-points found with our approach with those obtained by Dixon et al. (2012) on the same data, since no ground truth is available. We display in Figure 11 the number of change-points in rows found by our approach as a function of the threshold thresh used in our adaptation of the stability selection approach presented in the previous section. We also display in this figure a red line corresponding to the number of change-points found by Dixon et al. (2012).
Fig. 11. Number of change-points in rows found by our approach as a function of the threshold ('•'), in %, for the interaction matrices of Chromosome 1 (left) and Chromosome 19 (right) of the mouse cortex. The red line corresponds to the number of change-points found by Dixon et al. (2012).
We also compute the two parts of the Hausdorff distance for the change-points in rows, which is defined by

d(t̂_B, t̂) = max( d_1(t̂_B, t̂), d_2(t̂_B, t̂) ),  (11)

where t̂ and t̂_B are the change-points in rows found by our approach and by Dixon et al. (2012), respectively. In (11),

d_1(a, b) = sup_{b∈b} inf_{a∈a} |a − b|,  (12)

d_2(a, b) = d_1(b, a).  (13)
More precisely, Figure 12 displays the boxplots of the d_1 and d_2 parts of the Hausdorff distance without taking the supremum, in orange and blue, respectively. We can observe from Figure 12 that some differences indeed exist between the segmentations produced by the two approaches, but that the boundaries of the blocks are quite close when the numbers of estimated change-points are the same, which is the case when thresh = 1.8% (left) and 10% (right). In the case where the numbers of estimated change-points are on a par with those of Dixon et al. (2012), we can see from Figure 13 that the change-points found with our strategy present a lot of similarities with those found by the HMM-based approach of Dixon et al. (2012).
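A minimal sketch of (11)–(13) for two sets of change-points stored as vectors (function names ours):

#include <algorithm>
#include <armadillo>

// d1(a, b) = sup over elements of b of the inf over elements of a of |a - b|.
double d1(const arma::vec& a, const arma::vec& b) {
  double worst = 0.0;
  for (double bj : b) {                      // supremum over the elements of b
    arma::vec gaps = arma::abs(a - bj);
    worst = std::max(worst, gaps.min());     // infimum over the elements of a
  }
  return worst;
}

// The Hausdorff distance d of (11) is the larger of the two parts.
double hausdorff(const arma::vec& a, const arma::vec& b) {
  return std::max(d1(a, b), d1(b, a));
}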
Fig. 12. Boxplots for the infimum parts of the Hausdorff distances d_1 (orange) and d_2 (blue) between the change-points found by Dixon et al. (2012) and our approach for Chromosome 1 (left) and Chromosome 19 (right) of the mouse cortex, for the different thresholds in %.
Fig. 13. Topological domains detected by Dixon et al. (2012) (upper triangular part of the matrix) and by our method (lower triangular part of the matrix) from the interaction matrices of Chromosome 1 (left) and Chromosome 19 (right) of the mouse cortex, with a threshold giving 232 (resp. 85) estimated change-points in rows and columns.
Our method also gives access to the estimated change-point positions for different values of the threshold. Figure 14 displays the different change-point locations obtained for these different values of the threshold. However, contrary to our method, the approach of Dixon et al. (2012) can only deal with binned data at the resolution of several kilobases of nucleotides. The very low computational burden of our strategy paves the way for processing data collected at a very high resolution, namely at the nucleotide resolution, which is one of the main current challenges of molecular biology.
Fig. 14. Plots of the estimated change-point locations (x-axis) for different thresholds (y-axis), from 0.5% to 50% by steps of 0.5%. The estimated change-point locations associated with thresholds which are multiples of 5% are displayed in red.
5. PROOFS
5·1. Proofs of statistical results
Proof of Lemma 1. A necessary and sufficient condition for a vector B̂ in ℝ^{n²} to minimize the function Φ defined by Φ(B) = Σ_{i=1}^{n²} (Y_i − (X B)_i)² + λ_n Σ_{i=1}^{n²} |B_i| is that the zero vector in ℝ^{n²} belongs to the subdifferential of Φ at B̂, that is,

(X^⊤(Y − X B̂))_j = (λ_n/2) sign(B̂_j),  if B̂_j ≠ 0,

| (X^⊤(Y − X B̂))_j | ≤ λ_n/2,  if B̂_j = 0.
310
1≤k≤K2
it is enough to prove that both terms in (14) tend to zero for proving (6). We shall only prove that the second term in the rhs of (14) tends to zero, the proof being the same for the first term. Since PK2? P(max1≤k≤K2? |b P(|b t2,k − t?2,k | > nδn ), it is enough to prove that t2,k − t?2,k | > nδn ) ≤ k=1 for all k in {1, . . . , K2? }, P(An,k ) → 0, where An,k = {|b t2,k − t?2,k | > nδn }. Let Cn be defined by ? ? Cn = max ? |b (15) t2,k − t2,k | < Imin,2 /2 . 1≤k≤K2
315
It is enough to prove that, for all k in {1, . . . , K2? }, P(An,k ∩ Cn ) and P(An,k ∩ Cn ) tend to 0, as n tends to infinity. Let us first prove that for all k in {1, . . . , K2? }, P(An,k ∩ Cn ) → 0. Observe that (15) implies t2,k ≤ t?2,k . t2,k < t?2,k+1 , for all k in {1, . . . , K2? }. For a given k, let us assume that b that t?2,k−1 < b Applying (7) and (8) with rj + 1 = n, qj + 1 = b t2,k on the one hand and rj + 1 = n, qj + 1 = t?2,k on the other hand, we get that t?2,k −1 t?2,k −1 X X Yn,j − Ubn,j ≤ λn . j=bt2,k j=b t2,k P P Hence using (1), the notation: E([a, b]; [c, d]) = bi=a dj=c Ei,j and the definition of Ub given by Lemma 1, we obtain that ? t2,k )(µ?K ? +1,k − µ bK1? +1,k+1 ) + E(n; [b t2,k , t?2,k − 1]) ≤ λn , (t2,k − b 1
which can be rewritten as follows ? t2,k )(µ?K ? +1,k − µ?K ? +1,k+1 ) + (t?2,k − b t2,k )(µ?K ? +1,k+1 − µ bK1? +1,k+1 ) (t2,k − b 1
320
1
1
+E(n; [b t2,k , t?2,k − 1]) ≤ λn .
19
A Fast Approach for Multiple Change-point Detection in Two-dimensional Data Thus, P(An,k ∩ Cn ) ≤ P(λn /(nδn ) ≥ |µ?K ? +1,k − µ?K ? +1,k+1 |/3) 1
1
+ P({|µ?K ? +1,k − µ bK1? +1,k+1 | ≥ |µ?K ? +1,k − µ?K ? +1,k+1 |/3} ∩ Cn ) 1
1
1
+ P({|E(n; [b t2,k , t?2,k − 1])|/|t?2,k − b t2,k | ≥ |µ?K ? +1,k − µ?K ? +1,k+1 |/3} ∩ An,k ). (16) 1
325
1
? /3, v = nδ and The first term in the rhs of (16) tends to 0 by (A2). By Lemma 2 with xn = Jmin n n (A3) the third term in the rhs of (16) tends to 0. Applying Lemma 1 with rj + 1 = n, qj + 1 = b t2,k on the one hand and rj + 1 = n, qj + 1 = (t?2,k + t?2,k+1 )/2 on the other hand, we get that
? ? )/2−1 (t?2,k +t?2,k+1 )/2−1 (t2,k +t2,k+1 X X b Yn,j − Un,j ≤ λn . j=t? j=t? 2,k
2,k
Since b t2,k ≤ t?2,k , Ubn,j = µ bK1? +1,k+1 within the interval [t?2,k , (t?2,k + t?2,k+1 )/2 − 1] and we get that bK1? +1,k+1 |/2 ≤ λn + |E(n, |[t?2,k , (t?2,k + t?2,k+1 )/2 − 1])|. (t?2,k+1 − t?2,k )|µ?K ? +1,k+1 − µ 1
Therefore the second term in the rhs of (16) can be bounded by P λn ≥ (t?2,k+1 − t?2,k )|µ?K ? +1,k − µ?K ? +1,k+1 |/12 1 1 ? ? −1 ? ? + P (t2,k+1 − t2,k ) E(n, , |[t2,k , (t2,k + t?2,k+1 )/2 − 1]) ≥ |µ?K ? +1,k − µ?K ? +1,k+1 |/6 1
1
By Lemma 2 and (A2), (A3) and (A4), we get that both terms tend to zero as n tends to infinity. We thus get that P(An,k ∩ Cn ) → 0, as n tends to infinity. Let us now prove that P(An,k ∩ Cn ) tend to 0, as n tends to infinity. Observe that
330
P(An,k ∩ Cn ) = P(An,k ∩ Dn(`) ) + P(An,k ∩ Dn(m) ) + P(An,k ∩ Dn(r) ), where Dn(`) = ∃p ∈ {1, . . . , K ? }, Dn(m) = ∀k ∈ {1, . . . , K ? }, Dn(r) = ∃p ∈ {1, . . . , K ? },
b t2,p ≤ t?2,p−1 ∩ Cn , t?2,k−1 < b t2,k < t?2,k+1 ∩ Cn , b t2,p ≥ t?2,p+1 ∩ Cn .
335
Using the same arguments as those used for proving that P(An,k ∩ Cn ) → 0, we can prove that (m) (`) P(An,k ∩ Dn ) → 0, as n tends to infinity. Let us now prove that P(An,k ∩ Dn ) → 0. Note that K2? −1
P(Dn(`) )
≤
X
? ? P({t?2,k − b t2,k > Imin /2} ∩ {b t2,k+1 − t?2,k > Imin /2})
k=1 ? +P(t?2,K ? − b t2,K2? > Imin /2). 2
(17)
V. B RAULT, J. C HIQUET AND C. L E´ VY-L EDUC
20
Applying (7) and (8) with rj + 1 = n, qj + 1 = b t2,k on the one hand and rj + 1 = n, qj + 1 = t?2,k on the other hand, we get that t?2,k −1 t?2,k −1 X X Yn,j − Ubn,j ≤ λn . j=bt2,k j=b t2,k 340
Thus, ? ? P({t?2,k − b t2,k > Imin /2} ∩ {b t2,k+1 − t?2,k > Imin /2}) ? ? ≤ P(λn /(nδn ) ≥ |µK ? +1,k − µK ? +1,k+1 |/3) 1
1
? +P({|µ?K ? +1,k − µ bK1? +1,k+1 | ≥ |µ?K ? +1,k − µ?K ? +1,k+1 |/3} ∩ {b t2,k+1 − t?2,k > Imin /2}) 1
1
1
+P({|E(n; [b t2,k , t?2,k − 1])|/(t?2,k − b t2,k ) ≥ |µ?K ? +1,k − µ?K ? +1,k+1 |/3} 1
∩{t?2,k
1
? −b t2,k > Imin /2}).
(18)
Using the same arguments as previously we get that the first and the third term in the rhs of (18) tend to zero as n tends to infinity. Let us now focus on the second term of the rhs of (18). Applying (7) and (8) with rj + 1 = n, qj + 1 = b t2,k+1 on the one hand and rj + 1 = n, qj + 1 = t?2,k on the other hand, we get that b bt2,k+1 −1 t2,k+1 −1 X X Ubn,j ≤ λn . Yn,j − j=t? j=t? 2,k
2,k
Hence, t2,k+1 − 1])| ≤ λn . bK1? +1,k+1 )(b t2,k+1 − t?2,k ) + E(n, [t?2,k ; b |(µ?K ? +1,k − µ 1
The second term of the rhs of (18) is thus bounded by ? P({λn (b t2,k+1 − t?2,k )−1 ≥ |µ?K ? +1,k − µ?K ? +1,k+1 |/6} ∩ {b t2,k+1 − t?2,k > Imin /2}) 1
1
t2,k+1 − 1])| ≥ |µ?K ? +1,k − µ?K ? +1,k+1 |/6} +P({(b t2,k+1 − t?2,k )−1 |E(n, [t?2,k ; b 1
1
? /2}), ∩{b t2,k+1 − t?2,k > Imin
which tend to zero by Lemma 2, (A2), (A3) and (A4). It is thus proved that the first term in the rhs of (17) tends to zero as n tends to infinity. The same arguments can be used for addressing ? /2. the second term in the rhs of (17) since b t2,K2? +1 = n and hence b t2,K2? +1 − t?2,K ? > Imin 2
(r)
345
350
Using similar arguments, we can prove that P(A_{n,k} ∩ D_n^{(r)}) → 0, which concludes the proof of Proposition 1.

5·2. Proofs of computational lemmas
Proof of Lemma 3. Consider X v for instance (the same reasoning applies to X^⊤ v): we have X v = (T ⊗ T)v = Vec(TVT^⊤), where V is the n × n matrix such that Vec(V) = v. Because of its triangular structure, T operates as a cumulative sum operator on the columns of V. Hence, the computation of the jth column is done by induction in n operations. The total cost for the n columns of TV is thus n². Similarly, right multiplying a matrix by T^⊤ boils down to performing cumulative sums over the rows. The final cost for X v = Vec(TVT^⊤) is thus 2n² in the case of a dense matrix V, and possibly less when V is sparse.
Proof of Lemma 4. Let A = {a_1, ..., a_K}; then

(X^⊤X)_{A,A} = (T ⊗ T)^⊤_{•,A} (T ⊗ T)_{•,A},  (19)

where (T ⊗ T)_{•,A} (resp. (T ⊗ T)^⊤_{•,A}) denotes the columns (resp. the rows) of T ⊗ T lying in A. For j in A, let us consider the Euclidean division of j − 1 by n given by (j − 1) = nq_j + r_j; then (T ⊗ T)_{•,j} = T_{•,q_j+1} ⊗ T_{•,r_j+1}. Hence, (T ⊗ T)_{•,A} is the n² × K matrix defined by

(T ⊗ T)_{•,A} = [ T_{•,q_{a_1}+1} ⊗ T_{•,r_{a_1}+1} ; T_{•,q_{a_2}+1} ⊗ T_{•,r_{a_2}+1} ; ... ; T_{•,q_{a_K}+1} ⊗ T_{•,r_{a_K}+1} ].

Thus, (T ⊗ T)_{•,A} = T_{•,Q_A} ∗ T_{•,R_A}, where

Q_A = {q_{a_1} + 1, ..., q_{a_K} + 1},  R_A = {r_{a_1} + 1, ..., r_{a_K} + 1},

and ∗ denotes the Khatri-Rao product, which is defined as follows for two n × n matrices A and B:

A ∗ B = [ a_1 ⊗ b_1 ; a_2 ⊗ b_2 ; ... ; a_n ⊗ b_n ],

where the a_i (resp. b_i) are the columns of A (resp. B). Using Liu & Trenkler (2008), we get that

(T ⊗ T)^⊤_{•,A} (T ⊗ T)_{•,A} = ( T^⊤_{•,Q_A} T_{•,Q_A} ) ◦ ( T^⊤_{•,R_A} T_{•,R_A} ),

where ◦ denotes the Hadamard or entry-wise product. Observe that, by definition of T, (T^⊤_{•,Q_A} T_{•,Q_A})_{k,ℓ} = n − (q_{a_k} ∨ q_{a_ℓ}) and (T^⊤_{•,R_A} T_{•,R_A})_{k,ℓ} = n − (r_{a_k} ∨ r_{a_ℓ}). By (19), (X^⊤X)_{A,A} is a Gram matrix which is positive definite, since the vectors T_{•,q_{a_1}+1} ⊗ T_{•,r_{a_1}+1}, T_{•,q_{a_2}+1} ⊗ T_{•,r_{a_2}+1}, ..., T_{•,q_{a_K}+1} ⊗ T_{•,r_{a_K}+1} are linearly independent.
Proof of Lemma 5. The operations of adding/removing a column to/from a Cholesky factorization are classical and well treated in numerical analysis textbooks, see e.g. Golub & Van Loan (2012). An advantage of our setting is that there is no additional computational cost for computing X^⊤X_j when entering a new variable j, thanks to the closed-form expression (9).
6. CONCLUSION
In this paper, we proposed a novel approach for retrieving the boundaries of a blockwise constant matrix corrupted with noise by rephrasing this problem as a variable selection issue. Our approach is implemented in the R package blockseg, which will be available from the Comprehensive R Archive Network (CRAN). In the course of this study, we have shown that our method has two main features which make it very attractive. Firstly, it is very efficient both from the theoretical and the practical point of view. Secondly, its very low computational burden makes its use possible on very large data sets coming from molecular biology.
ACKNOWLEDGEMENTS
The authors would like to thank the French National Research Agency (ANR), which partly supported this research through the ABS4NGS project (ANR-11-BINF-0001-06).
REFERENCES
Breiman, L., Friedman, J. H., Olshen, R. A. & Stone, C. J. (1984). Classification and Regression Trees. Statistics/Probability Series. Belmont, California, U.S.A.: Wadsworth Publishing Company.
Brodsky, B. & Darkhovsky, B. (2000). Non-parametric Statistical Diagnosis: Problems and Methods. Kluwer Academic Publishers.
Dixon, J. R., Selvaraj, S., Yue, F., Kim, A., Li, Y., Shen, Y., Hu, M., Liu, J. S. & Ren, B. (2012). Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485, 376–380.
Efron, B., Hastie, T., Johnstone, I. & Tibshirani, R. (2004). Least angle regression. The Annals of Statistics 32, 407–499.
Golub, G. H. & Van Loan, C. F. (2012). Matrix Computations, 3rd ed. JHU Press.
Harchaoui, Z. & Lévy-Leduc, C. (2010). Multiple change-point estimation with a total variation penalty. Journal of the American Statistical Association 105, 1480–1493.
Hoefling, H. (2010). A path algorithm for the fused lasso signal approximator. J. Comput. Graph. Statist. 19, 984–1006.
Kay, S. (1993). Fundamentals of Statistical Signal Processing: Detection Theory. Prentice-Hall, Inc.
Lévy-Leduc, C., Delattre, M., Mary-Huard, T. & Robin, S. (2014). Two-dimensional segmentation for analyzing Hi-C data. Bioinformatics 30, i386–i392.
Liu, S. & Trenkler, G. (2008). Hadamard, Khatri-Rao, Kronecker and other matrix products. Int. J. Inform. Syst. Sci. 4, 160–177.
Meinshausen, N. & Bühlmann, P. (2010). Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72, 417–473.
Osborne, M. R., Presnell, B. & Turlach, B. A. (2000). A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis 20, 389–403.
R Core Team (2015). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Sanderson, C. (2010). Armadillo: An open source C++ linear algebra library for fast prototyping and computationally intensive experiments. Tech. rep., NICTA.
Tartakovsky, A., Nikiforov, I. & Basseville, M. (2014). Sequential Analysis: Hypothesis Testing and Changepoint Detection. CRC Press, Taylor & Francis Group.
Tibshirani, R. J. & Taylor, J. (2011). The solution path of the generalized lasso. Ann. Statist. 39, 1335–1371.
Vert, J.-P. & Bleakley, K. (2010). Fast detection of multiple change-points shared by many signals using group LARS. In Advances in Neural Information Processing Systems.