Under consideration for publication in Knowledge and Information Systems
Algorithms for unimodal segmentation with applications to unimodality detection

Niina Haiminen, Aristides Gionis, Kari Laasonen
Helsinki Institute for Information Technology, Basic Research Unit, Department of Computer Science, University of Helsinki, Finland
[email protected]
Abstract. We study the problem of segmenting a sequence into k pieces so that the resulting segmentation satisfies monotonicity or unimodality constraints. Unimodal functions can be used to model phenomena in which a measured variable first increases to a certain level and then decreases. We combine a well-known unimodal regression algorithm with a simple dynamic-programming approach to obtain an optimal quadratic-time algorithm for the problem of unimodal k-segmentation. In addition, we describe a more efficient greedy-merging heuristic that is experimentally shown to give solutions very close to the optimal. As a concrete application of our algorithms, we describe methods for testing if a sequence behaves unimodally or not. The methods include segmentation error comparisons, permutation testing, and a BIC-based scoring scheme. Our experimental evaluation shows that our algorithms and the proposed unimodality tests give very intuitive results, for both real-valued and binary data.

Keywords: Unimodal; Segmentation; Regression; Algorithms; BIC; Binary data
Received November 1, 2004; Revised January 16, 2005; Accepted February 19, 2005

1. Introduction

The problem of regression, which deals with fitting curves or functions to a set of points, is among the most well-studied problems in statistics and data mining. Regression functions represent data models that can be used for knowledge extraction, understanding of the data, and prediction (Hand et al. 2001). In many cases, the data are assumed to come from an a priori known class of distributions, and the task is to find the parameters of the distribution that best fits the data. In this paper we focus on the case where the regression functions are required to be monotonic or unimodal. The problem of
computing monotonic and unimodal regression functions has drawn attention in statistics (Lee 1983, Robertson et al. 1988, Stout 2000) and computer science (Pardalos et al. 1999, Pardalos et al. 1995), because it arises in a wide range of applications, such as statistical modeling (Robertson et al. 1988), operations research (Kaufman et al. 1993), medicine and drug design (Hardwick et al. 2000), and image processing (Restrepo et al. 1993).

Unimodal functions can be used to model phenomena in which a measured variable shows a single-mode behavior: its expected value first rises to a certain level and then drops. Examples of data that exhibit unimodal behavior include (i) the size of a population of a species over time, (ii) daily volumes of network traffic, and (iii) stock-market quotes in quarterly or annual periods, when small fluctuations are ignored. Monotonicity is a special case of unimodality. A monotonic regression function for a real-valued sequence can be computed in linear time by variants of the classic "pool adjacent violators" (PAV) algorithm (Ayer et al. 1955). Recently, Stout (2000) has shown how to cleverly organize the PAV operations to achieve a linear-time algorithm for the seemingly more complex problem of unimodal regression. However, one drawback of the above methods is that they do not restrict the size of the resulting regression, and they can potentially report models with very high complexity.

In this paper we address the issue of compact representation of unimodal regression functions by studying the following problem: given a univariate sequence and an integer k, partition the sequence into k segments and represent each segment by a constant value, so that the segment values satisfy unimodality (or monotonicity) constraints and the total representation error is minimized. We call this problem unimodal (or monotonic) k-segmentation. The problem is polynomial, but naive dynamic-programming algorithms have running times of the order of high-degree polynomials.

In this paper we show that the unimodal k-segmentation problem can be solved in O(n^2 k) time, which is the same as the time required to solve the unrestricted k-segmentation problem, i.e., segmenting without the unimodality constraints. Our algorithm combines the PAV algorithm with dynamic-programming techniques. The algorithm can be extended to handle higher-mode segmentations, e.g., bimodal, and the optimality proof holds with minor changes. In addition to the optimal algorithm, we describe a fast greedy heuristic, which runs in time O(n log n) and in practice gives solutions very close to optimal.

Additionally, we discuss how to apply our algorithms in order to devise unimodality tests for sequences. We explore three alternatives: comparing segmentation errors, permutation testing, and a scoring scheme based on the Bayesian information criterion (BIC). We found that all tests are able to distinguish between unimodal and non-unimodal sequences, both for real-valued and binary data. In our experimental evaluation we also demonstrate that the proposed algorithms provide intuitive segmentations.

This paper is organized as follows. In Section 2 we introduce the notation and define the problem of unimodal k-segmentation. In Section 3 we describe our algorithms, while the optimality proofs are given in Section 4. Section 5 introduces our methods for unimodality detection. In Section 6 we discuss our experiments, and Section 7 contains a short conclusion.
2. Preliminaries

A real-valued sequence X consists of n values (also referred to as points), i.e., X = ⟨x_1, ..., x_n⟩. The problems of regression and segmentation seek to represent the sequence X in a way that satisfies certain constraints. For the monotonic- and unimodal-
regression problems, monotonicity constraints are imposed on the regression values of the sequence. For the segmentation problem, the representation of the sequence is constrained to a small number of piecewise constant segments.

Monotonic regression: The goal of increasing monotonic regression is to map each point x_i of the sequence X to a point x̂_i, so that x̂_1 ≤ x̂_2 ≤ ... ≤ x̂_n, and the regression error

E_R = \sum_{i=1}^{n} (x_i - \hat{x}_i)^2        (1)
is minimized. Decreasing monotonic regression is defined in a similar fashion. Equation (1) defines the regression error with respect to the L2 norm. In general, any Lp norm, 1 ≤ p ≤ ∞, can be used; the norms L1 (sum of absolute values) and L∞ (max) are also commonly used in statistics.

Unimodal regression: Similarly to monotonic regression, a unimodal regression maps each value x_i of the sequence X to a value x̂_i. However, in this case it is required that the regression values increase up to some point x̂_t and then decrease for the rest of the points. In other words, the goal is to minimize the regression error E_R defined by Equation (1), subject to the unimodality constraints x̂_1 ≤ ... ≤ x̂_{t−1} ≤ x̂_t ≥ x̂_{t+1} ≥ ... ≥ x̂_n. Note that only one point of the sequence (or multiple points with the same value) is mapped to the point x̂_t, called the top of the unimodal regression. If this were not the case, the regression error could be reduced by making a point mapped to the top, with a value higher than x̂_t, a new top of the regression. We can also define unimodal regression that is first decreasing and then increasing. In the rest of the paper, without loss of generality, we refer by "unimodal" to regressions whose values are first increasing and then decreasing. Since monotonicity is a special case of unimodality, all our results hold also for monotonic segmentations.

Let us denote by r_j, j = 1, ..., m, a set of consecutive points x_i ∈ X that are represented by the same value x̂ in the regression. That value x̂, common for all the points in r_j, is denoted by r̂_j. The subsequence r_j can be represented by the indices of its first and last points, denoted here by f_j and l_j, respectively. Thus the regression U can be written as ⟨(f_1, l_1, r̂_1), ..., (f_m, l_m, r̂_m)⟩, with f_j ≤ l_j, f_1 = 1, l_m = n, and f_j = l_{j−1} + 1, for j = 2, ..., m. By definition, the values r̂_j are strictly increasing up to x̂_t and then strictly decreasing. The pairs of indices (f_j, l_j), for j = 1, ..., m, will be referred to as regression segments. The value of m is not specified as input to the regression problem; any solution that satisfies the regression constraints is feasible, and the optimal solution is defined over all feasible solutions. The solutions improve as m increases, up to some sequence-dependent value, after which the error does not decrease anymore when the number of regression segments is increased.

k-segmentation: The task of k-segmentation is to represent the sequence with k piecewise constant segments with as small an error as possible. A segmentation of the original sequence is a sequence of k segments s_j, j = 1, ..., k, with s_j = (f_j, l_j, s̄_j). Thus each segment s_j is specified by two boundary points f_j, l_j and a value s̄_j. Naturally we have f_j ≤ l_j, f_1 = 1, l_k = n, and f_j = l_{j−1} + 1, for j = 2, ..., k. A point x_i in the
original sequence is represented by the value of the segment s_j to which the point x_i belongs, that is, x̄_i = s̄_j where f_j ≤ i ≤ l_j. Given a value for k, the goal is to find the segmentation that minimizes the total error, defined as

E_S = \sum_{i=1}^{n} (x_i - \bar{x}_i)^2        (2)
Unimodal segmentation: We now define the problem of unimodal segmentation, which is the focus of this paper. Unimodal segmentation combines the two previous problems in a natural way: we seek to represent a sequence by a small number of segments, subject to the unimodality constraints.

Problem 1 (Unimodal segmentation). Given a sequence X = ⟨x_1, ..., x_n⟩ and an integer k ≤ n, find the k-segmentation S of X that minimizes the error of the segmentation E_S, as defined by Equation (2), and satisfies the unimodality constraints s̄_1 ≤ ... ≤ s̄_p ≥ ... ≥ s̄_k.

The unimodal regression and segmentation problems can also be considered in higher dimensions with appropriately defined error metrics, but the algorithms and results presented in this paper apply only to univariate sequences. We are not aware of any algorithms for monotonic or unimodal regression for higher-dimensional data.
3. Algorithms

In this section we describe our main algorithm for the problem of unimodal segmentation. The algorithm combines in a natural way two previously known algorithms: (i) the PAV algorithm for unimodal regression, and (ii) the dynamic-programming algorithm for k-segmentation. In fact, the algorithm is quite simple: it first applies PAV on the original sequence, and then uses dynamic programming on the resulting unimodal sequence to obtain a k-segmentation. Our algorithm runs in time O(n^2 k) for a sequence of n points. In the next section we prove that this simple algorithm produces the optimal unimodal k-segmentation. To provide relevant background, in Sections 3.1 and 3.2 we describe in slightly more detail the PAV and the dynamic-programming algorithms. In addition to the optimal algorithm, we describe in Section 3.4 a more efficient greedy-merging heuristic that is experimentally shown to give solutions very close to optimal. In Section 3.5 we also briefly discuss a naive algorithm that can easily be shown to produce the optimal result, but with a prohibitively expensive running time.
3.1. Regression algorithms

Computing a monotonic regression of a sequence can be done in linear time by the classic "pool adjacent violators" (PAV) algorithm of Ayer et al. (Ayer et al. 1955). The PAV algorithm is surprisingly simple: it starts by considering each point of the sequence as a separate regression value, and as long as two adjacent values violate the monotonicity constraint they are merged and replaced by their weighted average. The process continues until no violators remain. It can be shown that the process computes the optimal regression regardless of the order in which the pairs are merged (Robertson et al. 1988). Based on the PAV algorithm, the unimodal regression can be easily computed in
Algorithms for unimodal segmentation with applications to unimodality detection
5
O(n^2) time (Frisén 1986, Geng et al. 1990): just try all the points in the sequence as candidate top points, find the optimal monotonic regression left and right of the top point, and select the best solution from among all the candidates. However, Stout (Stout 2000) was able to devise a linear-time algorithm for the problem of unimodal regression. He realized that in the above quadratic-time algorithm some information is recomputed due to independent calls to monotonic regression, and he showed how to cleverly organize these calls to achieve linear-time computation of the regression.
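To make the two building blocks concrete, the following Python sketch implements PAV for increasing monotonic regression and the quadratic-time unimodal regression described above. The function names and the (mean, count) segment representation are ours, chosen so that the output can be fed directly to the segmentation step of Section 3.3; this is an illustrative sketch under those assumptions, not the implementation used in the paper.

def pav_increasing(x):
    """Pool adjacent violators for increasing monotonic regression.
    Returns pooled segments as (mean, count) pairs; repeating each mean
    'count' times gives the regression values."""
    segs = []  # each entry holds [sum, count] of a pooled segment
    for value in x:
        segs.append([float(value), 1])
        # merge backwards while two adjacent segment means violate monotonicity
        while len(segs) > 1 and segs[-2][0] * segs[-1][1] > segs[-1][0] * segs[-2][1]:
            s, c = segs.pop()
            segs[-1][0] += s
            segs[-1][1] += c
    return [(s / c, c) for s, c in segs]

def regression_error(x, segments):
    """Sum of squared errors between x and a piecewise-constant fit."""
    err, i = 0.0, 0
    for mean, count in segments:
        err += sum((v - mean) ** 2 for v in x[i:i + count])
        i += count
    return err

def unimodal_regression(x):
    """Quadratic-time unimodal regression: try every split point, fitting an
    increasing prefix and a decreasing suffix (Stout's algorithm achieves the
    same result in linear time)."""
    best_err, best_segs = float("inf"), None
    for t in range(len(x) + 1):
        left = pav_increasing(x[:t])
        # a decreasing regression is an increasing regression of the reversed data
        right = pav_increasing(x[t:][::-1])[::-1]
        err = regression_error(x[:t], left) + regression_error(x[t:], right)
        if err < best_err:
            best_err, best_segs = err, left + right
    return best_segs, best_err

Representing each pooled segment only by its mean and length is enough for the later steps, because once the regression is fixed the error within each regression segment is fixed as well.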
3.2. Segmentation algorithm

From the classic result by Bellman (Bellman 1961), the problem of k-segmentation can be solved by dynamic programming in time O(n^2 k). The solution is based on computing in an incremental fashion an (n × k)-size table E_S, where the entry E_S[i, p] denotes the error of segmenting the sequence ⟨x_1, ..., x_i⟩ using p segments. The computation is based on the equation

E_S[i, p] = \min_{1 \le j \le i} \left( E_S[j-1, p-1] + E[j, i] \right),        (3)

where E[j, i] is the error of representing the subsequence ⟨x_j, ..., x_i⟩ with one segment.
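The recurrence translates directly into code. The sketch below (names are ours) precomputes prefix sums so that the one-segment error E[j, i] is available in constant time. The optional weights argument is a mild generalization, not part of Equation (3), that lets each point stand for several original points; that weighted form is what the OPT algorithm of Section 3.3 needs. It assumes 1 ≤ k ≤ len(values).

import math

def dp_segmentation(values, k, weights=None):
    """Optimal k-segmentation by dynamic programming (Equation (3)).
    Returns (error, starts), where starts[p] is the 0-based index of the
    first point of segment p."""
    n = len(values)
    w = weights if weights is not None else [1.0] * n
    # prefix sums of weight, weight*value and weight*value^2
    W = [0.0] * (n + 1)
    S = [0.0] * (n + 1)
    Q = [0.0] * (n + 1)
    for i, (v, wi) in enumerate(zip(values, w)):
        W[i + 1] = W[i] + wi
        S[i + 1] = S[i] + wi * v
        Q[i + 1] = Q[i] + wi * v * v

    def seg_err(a, b):
        # weighted SSE of representing values[a..b] (inclusive) by their mean
        tw = W[b + 1] - W[a]
        ts = S[b + 1] - S[a]
        tq = Q[b + 1] - Q[a]
        return tq - ts * ts / tw

    E = [[math.inf] * (k + 1) for _ in range(n + 1)]  # E[i][p]: first i points, p segments
    B = [[0] * (k + 1) for _ in range(n + 1)]         # back-pointers to segment starts
    E[0][0] = 0.0
    for p in range(1, k + 1):
        for i in range(1, n + 1):
            for j in range(p, i + 1):                 # last segment covers values[j-1 .. i-1]
                cand = E[j - 1][p - 1] + seg_err(j - 1, i - 1)
                if cand < E[i][p]:
                    E[i][p], B[i][p] = cand, j - 1
    starts, i = [], n
    for p in range(k, 0, -1):                         # walk the back-pointers
        starts.append(B[i][p])
        i = B[i][p]
    return E[n][k], starts[::-1]

With unit weights this computes exactly the table of Equation (3) in O(n^2 k) time.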
3.3. The OPT algorithm

In this section we discuss our main algorithm, which we call OPT. Given a sequence X, OPT first finds the unimodal regression U of X, resulting in m regression segments r_j. The values of the regression segments are then segmented to reduce the number of segments from m to k. Pseudocode is given in Figure 1.

OPT
Input: X = ⟨x_1, ..., x_n⟩
Output: Unimodal segmentation S = s_1, s_2, ..., s_k
U ← UNIMODAL-REGRESSION(X)   % U has m regression segments
if m ≤ k then
    S ← U
else
    S ← k-SEGMENTATION(U)
end if

Fig. 1. The OPT algorithm.

If the number of regression segments is at most k, i.e., m ≤ k, the m unimodal segments of U are taken to be the output of the algorithm. One can see that the m segments of U form the best k-segmentation. The reason is that, since using more segments can only help, the error of the optimal k-segmentation cannot be greater than the error of the m-segmentation U. On the other hand, U is optimal over all possible segmentations, and in particular its error cannot be greater than that of the optimal k-segmentation; therefore the two have to be equal. If it is required that the final segmentation has exactly k segments, then we can insert "artificial" segment boundaries and increase the number of segments from m to k without changing the error of the solution.

More interesting is the situation in which the number of required segments k is smaller than the number of regression segments, i.e., m > k. In this case, OPT views the m segments of U as weighted points, and it applies the dynamic-programming segmentation algorithm on those points. In the next section we will show that the resulting segmentation is unimodal, and in fact that it is the optimal unimodal k-segmentation. In other words, the optimal k-segmentation never needs to split any regression segment r_j into two segments r_j^p and r_j^s; it only needs to combine regression segments into larger segments.

As far as the running time of the algorithm is concerned, the unimodal regression algorithm runs in O(n) time, and the dynamic-programming algorithm for producing k segments on a sequence of m points runs in time O(m^2 k). Since m = O(n), the overall complexity of OPT is O(n^2 k). In all of our experiments m was much smaller than n, so in practice the actual running time of O(n + m^2 k) might be significantly less than the worst case of O(n^2 k). Note that one can also apply the approximate segmentation technique of Guha et al. (Guha et al. 2001) and obtain a (1 + ε)-approximation to the optimal unimodal k-segmentation in time O((1/ε) k^2 n log n).
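For illustration, the following sketch glues together the unimodal_regression and dp_segmentation functions from the earlier sketches (again, the names and interface are ours and this is only a sketch of the procedure in Figure 1, not the authors' implementation). The key point is the weighting: each regression segment enters the dynamic program as its mean value weighted by its length, and the error it contributes is added on top of the fixed regression error.

def opt_unimodal_segmentation(x, k):
    """OPT: unimodal regression followed by weighted k-segmentation of the
    regression segments. Returns (boundaries, error), where boundaries are
    the 0-based start indices of the final segments in x."""
    segs, reg_err = unimodal_regression(x)        # [(mean, count), ...]
    counts = [c for _, c in segs]
    offsets = [0]                                  # point index where each regression segment starts
    for c in counts:
        offsets.append(offsets[-1] + c)
    if len(segs) <= k:
        return offsets[:-1], reg_err               # the regression itself is already optimal
    means = [m for m, _ in segs]
    added_err, starts = dp_segmentation(means, k, weights=counts)
    return [offsets[s] for s in starts], reg_err + added_err

The total error decomposes as the (fixed) regression error plus the weighted error of grouping the regression means, which is why segmenting the m weighted points suffices.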
3.4. The GREEDY algorithm

For large sequences the quadratic running time of OPT can be a bottleneck. In this section we describe a more efficient algorithm, called GREEDY. The GREEDY algorithm consists of the same two steps as OPT: unimodal regression and segmentation. The difference is that in the segmentation step, instead of applying the expensive dynamic-programming algorithm, a greedy merging process is performed: starting with the m regression segments, we iteratively merge the two consecutive segments that yield the least error, until reaching k segments. Since the error of a segment s = (f, l, s̄) can be computed by the formula

E(s) = \sum_{i=f}^{l} (x_i - \bar{s})^2 = \sum_{i=f}^{l} x_i^2 - \frac{1}{l-f+1} \Big( \sum_{i=f}^{l} x_i \Big)^2,

the error of each potential merge can be computed in constant time by keeping two precomputed arrays over the sequence: the sums of values and the sums of squared values of all prefixes of the sequence. We can store the error values of merging each two consecutive segments in a priority queue. Initially, each point is a segment on its own, and the queue contains the errors associated with merging any two consecutive points. As segments are merged, the entries of the segments adjacent to the newly merged segment are also updated. This structure yields an overall running time of O(n log n) for GREEDY. A simplified sketch of the merging step is given below.

As we have already mentioned, the GREEDY algorithm does not produce optimal solutions. Consider a simple example of m = 4 monotone regression points with values ⟨1, 2+ε, 3−ε, 4⟩, for vanishingly small ε, and segmentations with k = 2. The algorithm GREEDY will generate the segmentation ⟨1, 2+ε, 3−ε⟩⟨4⟩ or ⟨1⟩⟨2+ε, 3−ε, 4⟩ with error E_S = (1 − 2)² + (2 + ε − 2)² + (3 − ε − 2)² + (4 − 4)² ≈ 2, while the optimal solution is the segmentation ⟨1, 2+ε⟩⟨3−ε, 4⟩ with error ≈ 1. In fact, we have not been able to find a counterexample with an error ratio worse than 2, even when looking at much more complicated situations. Furthermore, in all of our experiments the ratio of the errors was smaller than 1.2. An interesting open problem is to examine if there exists some guarantee for the quality of the results produced by GREEDY.
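The sketch below illustrates the merging step on the regression output (means and counts, as returned by the earlier pav/unimodal_regression sketches). To keep it short it rescans all adjacent pairs at every step, which costs O(m^2) instead of the O(n log n) obtained with the priority-queue bookkeeping described above; the names and the exact interface are ours.

def greedy_merge(means, counts, k):
    """Greedy bottom-up merging of adjacent segments (simplified sketch).
    Each segment is kept as (weight, sum, sum of squares); its SSE is
    sumsq - sum^2 / weight.  Treating every input point as 'count' copies
    of its mean measures only the error added on top of the regression."""
    segs = [(c, m * c, m * m * c) for m, c in zip(means, counts)]
    while len(segs) > k:
        best_i, best_cost = 0, None
        for i in range(len(segs) - 1):            # error of each candidate merge
            w = segs[i][0] + segs[i + 1][0]
            s = segs[i][1] + segs[i + 1][1]
            q = segs[i][2] + segs[i + 1][2]
            cost = q - s * s / w
            if best_cost is None or cost < best_cost:
                best_i, best_cost = i, cost
        i = best_i                                 # perform the cheapest merge
        segs[i:i + 2] = [(segs[i][0] + segs[i + 1][0],
                          segs[i][1] + segs[i + 1][1],
                          segs[i][2] + segs[i + 1][2])]
    levels = [s / w for w, s, _ in segs]
    error = sum(q - s * s / w for w, s, q in segs)
    return levels, error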
3.5. A dynamic-programming alternative

For comparison with our main algorithm we briefly sketch an alternative "brute force" algorithm. We first explain how this algorithm works for monotonic segmentation. The idea is to extend the dynamic-programming solution of Equation (3) by requiring that each segment takes a value from a finite set L = {l_1, ..., l_N}, where l_i ≤ l_j for i < j. If we denote by E_S[i, p, h] the segmentation error for the sequence ⟨x_1, ..., x_i⟩, with exactly p segments, and where no segment exceeds the h-th value of the set L, we have the equation

E_S[i, p, h] = \min_{1 \le j \le i} \left( E_S[j-1, p-1, h-1] + E[j, i, l_h] \right),

where E[j, i, l_h] is the error of the sequence ⟨x_j, ..., x_i⟩ using at most the value l_h as its level. The crucial observation is that the admissible values for the set L are only the averages of subsequences of the original sequence, and thus |L| = n(n−1)/2 = O(n^2). Computing the above equation by dynamic programming gives the optimal monotonic k-segmentation, but it requires time O(n^4 k), which is prohibitively expensive in practice.

The unimodal k-segmentation can be computed by performing two monotonic k-segmentations, one increasing from left to right and one decreasing from right to left. In fact, the computation can be done with no additional overhead by utilizing information stored in the dynamic programming tables (we omit the details).
4. Analysis of the algorithms

In this section we prove some properties of our algorithms, namely that they always produce unimodal segmentations, and that the OPT algorithm is indeed optimal. Our first lemma is quite intuitive, and it shows that the algorithms OPT and GREEDY both produce unimodal segmentations.

Lemma 1. Let X be a sequence and U a unimodal regression of X. Any possible way of combining consecutive segments of U into larger segments yields a unimodal segmentation for X.

Proof: We prove the lemma by induction, by showing that combining any two segments of a unimodal k-segmentation gives a unimodal (k − 1)-segmentation. Consider merging segments s_j and s_{j+1}. By our assumption, the subsequences s_1 ... s_{j−1} and s_{j+2} ... s_k are unimodal. There are three possible cases of merging, with respect to the top of the regression r_t:

1. Neither s_{j−1}, s_j, s_{j+1} nor s_{j+2} contains r_t. All the values in s_j and s_{j+1} are then in between the levels of their neighboring segments, and merging does not violate unimodality.
2. The top r_t is contained in either s_{j+2} or s_{j−1}. In the first case, the average of the points in s_j and s_{j+1} is ≥ ŝ_{j−1}; in the second it is ≥ ŝ_{j+2}. Thus ŝ_{j+2}, or ŝ_{j−1}, can be arbitrary without violating unimodality.
3. The top is contained in s_j or s_{j+1}. All the points in s_j and s_{j+1} are either ≥ ŝ_{j−1} or ≥ ŝ_{j+2} (or both). Thus the level of the new segment cannot be both < ŝ_{j−1} and < ŝ_{j+2}, so again merging does not violate unimodality.

Thus, merging any two consecutive segments gives a unimodal (k − 1)-segmentation.
Fig. 2. Second case in Theorem 1: the segment boundary b_h splits the regression segment r_j (with prefix r_j^p and suffix r_j^s) between the segment levels ŝ_h and ŝ_{h+1}; moving the boundary to b'_h yields a smaller error.
Next we show that the OPT algorithm indeed gives the optimal solution.

Lemma 2. Let X be a sequence and R be an optimal increasing monotonic regression of X. For any regression segment r_j of R, and any split of r_j into a prefix segment r_j^p and a suffix segment r_j^s, we have r̄_j^p ≥ r̄_j^s.

Proof: The lemma follows from the definition of increasing monotonic regression. Let us assume that the points in some prefix r_j^p of r_j ∈ R have an average a_j^p that is smaller than the average a_j^s of the points in the suffix r_j^s. Now the error associated with the regression can be decreased by assigning the points in r_j^s to any level between r̂_j and a_j^s. If the new level satisfies a_j^s < r̂_{j+1}, then we can simply separate r_j^s into a new regression segment and assign it the level a_j^s. If a_j^s ≥ r̂_{j+1}, we can separate r_j^s into a new segment with level r̂_{j+1}. Both of these operations clearly preserve the monotonicity of R by adding one regression segment r_j^s in between r_j and r_{j+1}, whose level r̂_j^s is in the interval (r̂_j, min{a_j^s, r̂_{j+1}}]. The resulting regression R' has error E_{R'} < E_R, which contradicts the optimality of R.

Theorem 1. Given a sequence X and an integer k, the algorithm described in Section 3.3 yields the optimal unimodal k-segmentation.

Proof: Let R be the unimodal regression computed by PAV and S be the optimal unimodal k-segmentation. We prove the optimality of the algorithm by showing that the segment boundaries in S never split any regression segment r_j ∈ R.

First consider the increasing monotonic regression. Let us assume that there exists a segment boundary b_h such that, for some regression segment r_j, its prefix r_j^p and suffix r_j^s belong to two different segments s_h and s_{h+1}. Let the levels of the segments be ŝ_h and ŝ_{h+1}. There are now two possible cases: either r̄_j^s is closer to ŝ_h or it is closer to ŝ_{h+1} (when ŝ_h and ŝ_{h+1} are equally close, we can choose either one). In the first case, the error of the segmentation is reduced when the points in r_j^s are assigned to segment s_h instead of segment s_{h+1} (the level values ŝ_h and ŝ_{h+1} can remain unchanged). In the second case, the error is reduced when the points in r_j^p are assigned to segment s_{h+1}. This holds because if the average of the suffix, r̄_j^s, is closer to ŝ_{h+1} than to ŝ_h, then also r̄_j^p is closer to that level, since r̄_j^p ≥ r̄_j^s by Lemma 2, and ŝ_{h+1} ≥ ŝ_h. This case is demonstrated in Figure 2, with b'_h marking the new segment boundary yielding a smaller error than the boundary b_h. The argument goes similarly for the decreasing case.

The top regression segment r_t does not cause any difficulties, as it will always consist of a single point, and can therefore not be divided between segments. If r_t contained more than one point,
the error of the regression could be reduced by making the highest point into a new regression segment (this would not violate the unimodality constraint). In each case, the error of the segmentation S can be reduced by moving some segment boundary that occurs within a regression segment. This means that S was not an optimal segmentation. Thus the segment boundaries can never split any regression segment.
5. Unimodality detection

In the previous sections we showed how to construct the best unimodal model of a given sequence. Next we approach the task of deciding how good this model is, i.e., deciding whether the best unimodal segmentation of a sequence is a significantly better way to describe the sequence than the best unrestricted segmentation. Unrestricted here means again a segmentation with no unimodality constraints. Deciding whether a sequence is better described by a unimodal or an unrestricted model is, in effect, equivalent to deciding whether the sequence is unimodal or not.

In this section we present three alternative approaches to unimodality detection: comparing segmentation errors, a method based on permutation tests, and a scoring scheme based on the Bayesian information criterion (BIC). All methods are based on obtaining the best unimodal and unrestricted segmentations and comparing them to each other. In the first two approaches, we only compare the errors induced by the unimodal and unrestricted segmentations. In the BIC approach, we compare the errors as well as the costs of the two models. Experimental results from applying the three techniques to real sequences are discussed in the next section.
5.1. Comparing segmentation errors

One simple way of studying the relationship between the best unimodal and the best unrestricted model is to compare the segmentation errors induced by the two models. For this we need to fix some suitably large number of segments k. The intuition behind this approach is that if a sequence behaves unimodally, then its best unrestricted segmentation resembles its best unimodal segmentation. In that case, the corresponding segmentation errors are also close to each other. If the ratio of the two segmentation errors is close to one, we decide that the sequence is unimodal; otherwise we declare it to be non-unimodal.
5.2. Permutation tests

Another method to identify the existence of unimodal structure is to perform permutation tests. First we obtain a unimodal segmentation, again for some large enough k, and then we randomly permute the resulting segments. If the average unimodal segmentation error of the random permutations is significantly higher than that of the original sequence, it means that the unimodal structure of the data was lost in the permutation process. For non-unimodal data, permuting the segments should not have a significant effect on the unimodal segmentation error, as the unimodal segmentation was not a particularly good model of the unpermuted sequence to begin with. If the ratio of the original error to the average error of the permuted sequences is close to zero, it means that the sequence is better described as unimodal.
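A minimal sketch of this test, assuming the opt_unimodal_segmentation function from the sketch in Section 3.3 and a plain Python list x (the function and parameter names are ours):

import random

def permutation_unimodality_score(x, k, n_perm=100, seed=0):
    """Ratio of the unimodal segmentation error of x to the average error
    over sequences obtained by permuting the k discovered segments.
    Values close to zero suggest that x is unimodal."""
    rng = random.Random(seed)
    boundaries, orig_err = opt_unimodal_segmentation(x, k)
    # cut the sequence into the discovered segments
    pieces = [x[b:e] for b, e in zip(boundaries, boundaries[1:] + [len(x)])]
    perm_errors = []
    for _ in range(n_perm):
        order = pieces[:]
        rng.shuffle(order)                       # concatenate the segments in random order
        permuted = [v for piece in order for v in piece]
        perm_errors.append(opt_unimodal_segmentation(permuted, k)[1])
    return orig_err / (sum(perm_errors) / len(perm_errors))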
10
N. Haiminen, A. Gionis, K. Laasonen
5.3. BIC-based scoring

We can also use a measure inspired by the Bayesian information criterion (BIC) (Hand et al. 2001) to assess the goodness of a segmentation. This score has the form B = −2l + R, where l is the log-likelihood of the fit. The penalty term R favors simplicity: the more complex models will have a larger value of R. The conventional form of the BIC sets R = m log n, where m is the number of parameters and n is the length of the sequence. The penalty is thus proportional to the number of parameters in the model. When we are dealing with regressions, it is not immediately clear what the value of m should be. Even if we could come up with a relationship between m and the number of segments k, it would not account for the differences between the regression models; any k-segment regression would have the same penalty term, whether unimodal or not.

A more systematic approach is to consider the minimum number of bits required to encode the segmentation parameters. If there are k segments, the k − 1 segment boundaries take (k − 1) log n bits to encode. But since we are interested only in comparing two different models, we will ignore this term, as it is included in all the different BIC scores. The next task is to find out how many bits are required to encode the segment levels s̄_j. As mentioned in Section 3.5, there are P = |L| = n(n−1)/2 possible segment values. In an unrestricted regression, we choose k values from this set, so the encoding cost will be R_free = k log P. In a unimodal regression the segment levels are first ascending, then descending. This means that we do not need to encode the ordering of the segments, only which group of segment values was used. Making the simplifying assumption that there are k/2 ascending and k/2 descending segments, the encoding cost is

R_uni = 2 \log \binom{P}{k/2}.        (4)

As a final refinement, we observe that whenever two neighboring values x_j and x_{j+1} are equal, no regression method is going to place a segment endpoint between them. Therefore, to find the number of possible levels, we can count fragments, i.e., contiguous runs of identical values. If there are s fragments, we substitute P' = s(s − 1)/2 for P in (4). This change has a more pronounced effect on binary sequences, where the number of fragments may be much smaller than the number of bits.
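The two penalty terms are straightforward to compute; the sketch below (helper names are ours) uses the log-gamma function so that the binomial coefficient in Equation (4) never has to be formed explicitly, since P is typically of the order n^2/2.

from math import lgamma, log

def log_comb(a, b):
    """Natural logarithm of the binomial coefficient C(a, b)."""
    return lgamma(a + 1) - lgamma(b + 1) - lgamma(a - b + 1)

def penalty_unrestricted(k, P):
    """R_free = k log P: k segment levels chosen freely among P candidates."""
    return k * log(P)

def penalty_unimodal(k, P):
    """R_uni = 2 log C(P, k/2), Equation (4): only the two (unordered) groups of
    ascending and descending levels need to be encoded."""
    return 2 * log_comb(P, k / 2)

For a binary string, P would be replaced by P' = s(s − 1)/2, with s the number of fragments.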
6. Experiments

In this section we present results from applying our unimodality detection techniques. We experimented both with real time-series data and with randomly generated binary sequences. This section also includes results on the performance of the algorithm GREEDY.
6.1. Experiments on real data

The aim of these experiments was to test the methods for measuring the unimodality of the data. The three methods described in the previous section were applied to 8 sequences of 1600 and 2000 points extracted from exchange-rate data (Keogh). The idea was to compare the results of our methods between sequences that seemed unimodal by visual inspection and sequences that did not seem to exhibit unimodal behavior.
Fig. 3. Left: four seemingly unimodal (U1–U4) and four non-unimodal (N1–N4) extracts from exchange-rate data. Right: sequence (dashed line), regression segments (points), and segmentation by OPT (horizontal lines) with k = 20.
The dataset that we experimented on consisted of four sequences that seemed roughly unimodal and four sequences that did not seem to display unimodal behavior (see Figure 3).

Comparing segmentation errors. As we discussed in the previous section, the first approach is based on comparing the errors of unimodal and unrestricted segmentations. For the sequences shown in Figure 3, the errors of the unrestricted and unimodal segmentations can be found in Figure 4. There seems to be a clear difference between the unimodal and non-unimodal sequences. For k = 1, both algorithms give the same result (with the average of the sequence as the only level). When k increases, the unrestricted error starts to deviate from the unimodal error.
Fig. 4. Segmentation error as a function of k for the unrestricted segmentation and for OPT, k = 2, ..., 20, on the unimodal (left) and non-unimodal (right) sequences.
When k reaches the number of regression segments in the sequence, the unimodal error does not decrease anymore, while the unrestricted error keeps decreasing up to k = n. For the non-unimodal sequences, the unimodal error stays at a level clearly higher than the unrestricted error. For, say, k = 20 the difference between the two types of sequences is already clear (as it is for a wide range of values of k).

In order to compare the errors of the two segmentations and decide about the unimodality of each sequence, we select a large enough value of k to identify differences in the behavior of the two types of sequences. The results are shown in Figure 6. For the ratios reported in Figure 6 we used k = 20 segments, but as can be deduced from Figure 4, similar behavior is exhibited for a wide range of the parameter k. From Figure 6 it can be seen that, for the data used here, a simple threshold of, e.g., 0.3 can be used to classify the sequences: values higher than the threshold imply unimodality of the sequence.

Permutation tests. For the second method of deciding unimodality, we first applied OPT on each data sequence. Then we permuted the discovered segments to obtain sequences consisting of a concatenation of the k segments in a random order, and we applied OPT on the permuted sequences. We chose k to be 20, and performed 100 iterations for each sequence. A histogram of the errors is presented in Figure 5. It is clearly visible that the errors of the unimodal sequences deviate more from the permutation errors than those of the non-unimodal sequences.

To obtain a statistic similar to the one in the previous experiment, we measured the ratio of the segmentation error of the original sequence to the average error of the permutations. The results are shown in Figure 6. This ratio also separates the unimodal and non-unimodal sequences, and we can set a threshold at, e.g., 0.2. Values smaller than the threshold imply unimodality of the sequence.
Fig. 5. Histograms of the unimodal segmentation errors over 100 segment permutations for each sequence (k = 20), for the unimodal (left) and non-unimodal (right) sequences; the error of the original sequence (o) and the average error of the permutations (*) are marked on each histogram.
To obtain an alternative statistical measure of unimodality, we performed a standard t-test to see whether the original error seems to stem from the same distribution as the permutation errors. The results are also shown in Figure 6. The larger values for the unimodal sequences indicate that their errors differ significantly from the error distribution of the random permutations. Again, a threshold value can be used to distinguish between the two types of sequences.

BIC-based scoring. In the last set of experiments we obtained the B-optimal segmentations for the sequences. The BIC-like B scores were calculated by the formula B = −2l + 2 \log \binom{P}{k/2}, where P = n(n−1)/2 and l is the log-likelihood of the data given the model. The scores are plotted as a function of k in Figure 7, along with the values of k that resulted in the best scores. The results are similar to those obtained by comparing the segmentation errors only. As k increases, the unimodal and unrestricted B scores start deviating more from each other. For the unimodal sequences, the scores remain close to each other even as k increases. For the non-unimodal sequences, the scores are clearly far apart for a large range of values of k. This behavior further validates the intuition that a unimodal model is a good way to model the seemingly unimodal sequences, while for the non-unimodal sequences we should prefer models with no unimodality constraints.

The best B scores and the ratios of the best unimodal versus the best unrestricted scores are shown in Figure 6.
        m     E_k     E_O     E_G     E_k/E_O  E_O/E_P  t-test  B_k     B_O     B_k/B_O
  U1    151   51.3    121.7   129.0   0.42     0.18     4.374   308.2   320.6   0.96
  U2    158   53.9    91.7    101.3   0.59     0.09     4.929   302.2   302.2   1.00
  U3    262   26.3    27.5    36.9    0.96     0.02     4.478   241.8   241.8   1.00
  U4    176   57.9    186.1   192.7   0.31     0.17     4.507   339.0   382.8   0.86
  N1    124   57.2    417.3   419.8   0.14     0.50     2.215   310.4   549.0   0.57
  N2    91    92.3    750.6   751.5   0.12     0.76     1.143   388.1   881.4   0.44
  N3    117   111.5   663.2   668.8   0.17     0.59     2.080   405.4   833.5   0.49
  N4    110   87.2    433.0   437.3   0.20     0.40     2.301   396.0   611.3   0.65

Fig. 6. Unimodality measures. m: number of regression segments; E_k: error of unrestricted k-segmentation; E_O: error of OPT; E_G: error of GREEDY; E_P: average error of OPT for randomly permuted sequences; B_k: B score of best unrestricted k-segmentation; B_O: B score of best unimodal k-segmentation.
The ratio of the scores shows quite clearly that modeling the unimodal sequences with unimodality constraints does not affect the modeling cost significantly. On the other hand, for the non-unimodal sequences, the choice between a unimodal and an unrestricted model has a large impact on the modeling cost. Thus the ratio of the scores seems to separate the unimodal and non-unimodal sequences correctly. Furthermore, we are not obliged to fix a specific value of k, but can search through all values of k (up to the size of the regression, at most) to find the best B scores, and then compare those. The only parameter left to decide is a threshold value for choosing which ratios are high enough to have originated from a unimodal sequence.
6.2. Experiments on binary sequences

We experimented with randomly generated binary sequences, in effect random strings of binary digits. We applied the model selection techniques based on BIC-like scoring, as described in Section 5.3. We obtained both the optimal unimodal and unrestricted k-segmentations, and the segmentation with the lowest score B was selected. For brevity, in the following we say that a string is unimodal if it is best represented by its B-optimal unimodal segmentation, as opposed to its B-optimal unrestricted segmentation.

With binary strings the log-likelihood takes a simple form. If at position i the value x_i is equal to 1, the likelihood is just x̂_i; for x_i = 0, it is 1 − x̂_i. Combining these observations and taking the logarithm, we get the log-likelihood of the string:

l = \sum_{i=1}^{n} \left[ x_i \log \hat{x}_i + (1 - x_i) \log(1 - \hat{x}_i) \right].
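A small sketch of this computation (our naming; the clipping constant is our own choice to keep the logarithm finite when a fitted level is exactly 0 or 1). Here xhat lists the fitted level at every position, i.e., each segment's level repeated over the positions it covers, and penalty is one of the encoding costs R from Section 5.3.

from math import log

def binary_log_likelihood(x, xhat, eps=1e-9):
    """Log-likelihood of a 0/1 string x under fitted levels xhat."""
    l = 0.0
    for xi, p in zip(x, xhat):
        p = min(max(p, eps), 1.0 - eps)          # keep log() finite at the extremes
        l += xi * log(p) + (1 - xi) * log(1.0 - p)
    return l

def b_score(x, xhat, penalty):
    """BIC-like score B = -2 l + R for a segmentation of a binary string."""
    return -2.0 * binary_log_likelihood(x, xhat) + penalty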
The BIC-like score is then specified as B = −2l + R, where R is the encoding cost of the model.

Figure 8 shows the classification for a set of randomly generated strings of length 50 bits. Each string was first classified manually, and the B score was then computed with varying penalty terms R.
Fig. 7. B scores of the best unrestricted and best unimodal segmentations for k = 2, ..., 20, for the unimodal (left) and non-unimodal (right) sequences; the best score of each kind is marked.

Fig. 8. Model selection for binary strings with penalty terms R = k log n (left), R = 2 \log \binom{P}{k/2} (middle), and R = 2 \log \binom{P'}{k/2} (right). Horizontal and vertical axes show unimodal and unrestricted B scores, respectively.
Strings for which the B score agreed with the manual classification are shown as "Unimod" or "Other". The category "Unimod (false)" indicates strings that were erroneously classified as unimodal, and strings marked as "Other (false)" should have been unimodal but were not classified as such. The leftmost plot shows, for comparison, what happens with the standard BIC score; there are 8 errors for 43 strings. The other plots show the results for the modified B score based on encoding cost, which performs better. There are 3 misclassifications in the rightmost plot, and none in the middle one. All mistakes occur in strings which are not clearly unimodal but not clearly non-unimodal either.

Strings that are clearly not unimodal fall farthest from the diagonal dividing line. For example, bimodal strings cannot be represented well by any unimodal regression, so their unimodal score is larger than the unrestricted score. Conversely, strings that are clearly unimodal are placed above the diagonal.
Fig. 9. Averages of the error ratio GREEDY/OPT as a function of k for two datasets (real data and random walk). Dataset 1: 11 unimodal sequences of 750 points; Dataset 2: 20 non-unimodal sequences of 2000 points.
Since such strings are usually at a constant distance from the diagonal, we are not able to determine whether one string is "more unimodal" than another.
6.3. Performance of GREEDY

We conducted experiments to demonstrate the performance of the algorithm GREEDY compared to OPT, and to validate the two methods for measuring the unimodality of a sequence. First we ran both GREEDY and OPT on two different datasets and measured the average ratio of the errors they produced. Dataset 1 consisted of 11 unimodally behaving subsequences of 750 points extracted from water-level measurements (Keogh), and Dataset 2 of 20 generated random-walk sequences of 2000 points. We observed that the two algorithms give results very close to each other, both for the unimodal and the non-unimodal dataset. The error ratios are displayed in Figure 9. The ratio is one for k = 1, then increases up to some small k, after which it decreases and approaches one again. This shows that the greedy algorithm gives very good results in practice.
7. Conclusion

We have presented two algorithms for the problem of segmenting a sequence into k pieces under monotonicity or unimodality constraints. The first algorithm is an optimal algorithm based on a well-known regression algorithm and on dynamic programming. The second algorithm, GREEDY, is not optimal, but it is more efficient, and we experimentally verified that it gives results close to optimal. Additionally, we described three tests for deciding whether a sequence is unimodal or not: comparing segmentation errors, performing permutation tests, and BIC-based scoring. Our experiments with both real datasets and generated binary sequences provided evidence that the suggested algorithms and the unimodality tests perform very well in practice. An interesting open problem is to examine if there exists some guarantee for the quality of the results produced by GREEDY.

Acknowledgements. We thank Heikki Mannila and Evimaria Terzi for many useful discussions and suggestions.
References

Ayer M, Brunk H, Ewing G, Reid W (1955) An empirical distribution function for sampling with incomplete information. The Annals of Mathematical Statistics 26(4): 641–647
Bellman R (1961) On the approximation of curves by line segments using dynamic programming. Communications of the ACM 4(6): 284
Frisén M (1986) Unimodal regression. The Statistician 35(4): 479–485
Geng Z, Shi N (1990) Isotonic regression for umbrella orderings. Applied Statistics 39(3): 397–402
Guha S, Koudas N, Shim K (2001) Data-streams and histograms. In: Proc. ACM Symposium on Theory of Computing, pp 471–475
Hand D, Mannila H, Smyth P (2001) Principles of Data Mining. MIT Press
Hardwick J, Stout Q (2000) Optimizing a unimodal response function for binary variables. In: Atkinson A, Bogacka B, Zhigljavsky A (eds) Optimum Design. Kluwer, Dordrecht, pp 195–208
Kaufman Y, Tamir A (1993) Locating service centers with precedence constraints. Discrete Applied Mathematics 47: 251–261
Keogh E, UCR Time Series Data Mining Archive, http://www.cs.ucr.edu/~eamonn/TSDMA/
Lee C (1983) The min-max algorithm and isotonic regression. Annals of Statistics 11: 467–477
Pardalos P, Xue G (1999) Algorithms for a class of isotonic regression problems. Algorithmica 23(3): 211–222
Pardalos P, Xue G, Yong L (1995) Efficient computation of the isotonic median regression. Applied Mathematics Letters 8(2): 67–70
Restrepo A, Bovik A (1993) Locally monotonic regression. IEEE Transactions on Signal Processing 41(9): 2796–2810
Robertson T, Wright F, Dykstra R (1988) Order Restricted Statistical Inference. Wiley, New York
Stout Q (2000) Optimal algorithms for unimodal regression. In: Wegman E, Martinez Y (eds) Computing Science and Statistics 32. Interface Foundation of North America, Fairfax Station, VA
Author Biographies

Niina Haiminen received her M.Sc. degree from the University of Helsinki in 2004. She is currently a graduate student at the Department of Computer Science of the University of Helsinki, and a researcher at the Basic Research Unit of the Helsinki Institute for Information Technology. Her research interests include algorithms, bioinformatics, and data mining.
Aristides Gionis received his Ph.D. from Stanford University in 2003, and he is currently a senior researcher at the Basic Research Unit of Helsinki Institute for Information Technology. His research experience includes summer internship positions at Bell Labs, AT&T Labs, and Microsoft Research. His research areas are data mining, algorithms, and databases.
Kari Laasonen received an M.Sc. degree in Theoretical Physics in 1995 from the University of Helsinki. He is currently a graduate student in Computer Science at the University of Helsinki and a researcher at the Basic Research Unit of the Helsinki Institute for Information Technology. His research is focused on algorithms and data analysis methods for pervasive computing.
Correspondence and offprint requests to: Niina Haiminen, Helsinki Institute for Information Technology, P.O. Box 68, 00014 University of Helsinki, Finland. Email: [email protected]