0} Jaccard similarity between x and y for band b: |T b (x)∩T b (y)| b b 6 0, sbxy = |T b (x)∪T b (y)| if |T (x) ∪ T (y)| = b else sxy = 1 Jaccard distance between x and y for band b: dbxy = 1 − sbxy overall similarity for x and y: P score 1 b Sxy = |Bxy | Bxy sxy (average Jaccard similarity between x and y over all bands containing x or y at any time) overall distance between x and y: Dxy = 1 − Sxy P or Dxy = |B1xy | Bxy dbxy
20
Non-negativity We must show Dxy ≥ 0. To this end, note that for any b, |T b (x) ∩ |T b (x)∩T b (y)| T b (y)| ≤ |T b (x) ∪ T b (y)|; thus, sbxy = |T b (x)∪T b (y)| is necessarily less than or equal to 1, b b and so dxy = 1 − sxy must be greater than or equal to zero. Since Dxy is an average of dbxy values, it too is non-negative. Identity of indiscernibles We must show that Dxy = 0 if and only if x = y. First, suppose x = y. Then for all b, T b (x) = T b (y), and so |T b (x) ∩ T b (y)| = |T b (x) ∪ T b (y)|. Then for all b (regardless of whether or not b ever contains x or y), sbxy = 1 and dbxy = 0, and thus Dxy is also zero. To show the reverse implication, suppose that x 6= y. Then without loss of generality we can say that there is some time point t for which x(t) > y(t). Suppose that there exists an observation z ∈ X such that z(t) < x(t), and consider the band b∗ defined by z and y. At t, y is in b∗ ; but x(t) is strictly greater than both y(t) and ∗ ∗ ∗ ∗ ∗ z(t), so x is not in b∗ at t. Then |T b (x) ∩ T b (y)| < |T b (x) ∪ T b (y)|, so sbxy < 1 and ∗ dbxy > 0. Since there is now a positive contribution to Dxy , we know that Dxy > 0. Now suppose there exists an observation z ∈ X such that z(t) ≥ x(t), and consider the band b∗ defined by z and x. At t, x is in b∗ but y is not (since y(t) is less than both z(t) and x(t)). By similar reasoning to the above, we see that Dxy > 0. Thus Dxy is 0 only if x = y. Symmetry We must show that Dxy = Dyx . This is evident from inspection of the |T b (x)∩T b (y)| |T b (y)∩T b (x)| definition; |T b (x)∪T b (y)| = |T b (y)∪T b (x)| , and Bxy = Byx . Triangle inequality We must show that Dxz ≤ Dxy + Dyz . Note that since x, y, and z were selected arbitrarily, it suffices to demonstrate the inequality for this configuration of points. Let Bxyz be the set of all bands that contain x or y or z at any time, and let O = |Bxyz |. Note that we need only be concerned with bands in Bxyz ; if a band never contains either x or y, for example, it does not affect Dxy , and so bands that are not in Bxyz have no effect on either side of the inequality we are trying to show. We also need to denote certain subsets of Bxyz . Specifically, let • Bxz be the set of all bands that contain x or z at any time, and • B−xz be the set Bxyz \Bxz , that is, the set of bands that contain y at some time but do not contain x or z at any time. The sets Bxy , Byz , and so on are defined accordingly. We use Oxy to represent the cardinality of sets of the type Bxy . For convenience, let p be the cardinality of Bxz , and q be the cardinality of B−xz . Note that Bxz and B−xz partition the set Bxyz , so any relevant band will fall into exactly one of these subsets. Also note that O = p + q.
21
Recall the definitions. The bandwise similarity of x and y, sbxy , is: ( |T b ∩T b | |Txb ∪Tyb |
6 0 if |Txb ∪ Tyb | =
1
if |Txb ∪ Tyb | = 0
x
sbxy =
y
The bandwise distance of x and y, dbxy , is simply 1 − sbxy (so if neither x nor y is ever in the band b, then dbxy = 0). Note that, considered for a particular b, dbxy is the Jaccard distance and is itself a distance metric. Finally, the overall distance of x and y, Dxy , is the average of the bandwise distance scores over all valid bands (that is, bands that contain x or y at some time): X −1 Dxy = Oxy dbxy b∈Bxy
X
−1 = Oxy
dbxy
b∈Bxyz
since dbxy = 0 for any b ∈ / Bxy . Initially, we consider the case where q = 0: that is, any band that contains observation y at any time must also contain x or z at some time. Because dbxz ≤ dbxy + dbyz for any given b, we have X X X dbyz . dbxy + dbxz ≤ b∈B
b∈Bxyz
b∈Bxyz
Now, both Oxy and Oyz are necessarily less than or equal to O, and so the quantities O and OOyz are both greater than or equal to one. Then we can multiply terms on the Oxy right-hand side of the inequality by these factors and preserve the inequality: X
dbxz ≤
b∈Bxyz
Then we have:
O X b O X b dxy + dyz . Oxy b∈B Oyz b∈B xyz
xyz
1 X b 1 X b 1 X b dxz ≤ dxy + dyz , O b∈B Oxy b∈B Oyz b∈B xyz
xyz
xyz
but since q = 0, we know that O = Oxz , so the inequality becomes 1 X b 1 X b 1 X b dxz ≤ dxy + dyz , Oxz b∈B Oxy b∈B Oyz b∈B xyz
xyz
xyz
which by definition tells us that Dxz ≤ Dxy + Dyz , as desired. Now, suppose that q 6= 0. We can list the bands in Bxyz as follows: Bxz Bxyz = B−xz
22
We then consider an expanded set of (non-unique) bands, in which we include p copies of Bxz followed by p copies of B−xz : Bxz .. . Bxz ˜ B := B−xz . .. B−xz And finally a set of extended bands, formed by appending a band from Bxz to each ˜ band in B: Bxz Bxz .. .. . . Bxz Bxz B−xz Bxz ˆ := . B .. .. . B xz . .. .. . B−xz Bxz ˆ consists of p copies of Bxz , each reduplicated to form a band of twice The first section of B the original length. In the second section, the first half of each band is some band drawn from B−xz , and the second half of each band is drawn from Bxz . There are a total of pq bands in this section; each band in B−xz appears (as the first half of a band) p times, and each band in Bxz appears (as the second half of a band) q times. Now we examine the behavior of the bandwise distance dbxz on these sets. First, note P P ˜ b ˜ that p b∈Bxyz dbxz = b∈ ˜ B ˜ dxz since each band in Bxyz appears p times in B. ˆ Suppose it is from the first section; that is, it has Now, consider some band b+ ∈ B. the form [ b b ], where b is a band in Bxz . We have: +
+ dbxz
+
|Txb ∩ Tzb | = 1 − b+ |Tx ∪ Tzb+ | 2|Txb ∩ Tzb | =1− 2|Txb ∪ Tzb | |T b ∩ Tzb | = 1 − xb |Tx ∪ Tzb | = dbxz .
as b+ contains two disjoint repetitions of b
Next suppose the band b+ has the form b− b where b is drawn from Bxz and b− is
23
drawn from B−xz . Then we have: +
+ dbxz
+
|T b ∩ Tzb | = 1 − xb+ |Tx ∪ Tzb+ | |T b ∩ Tzb | = 1 − xb = dbxz , b |Tx ∪ Tz | +
+
as b− contributes nothing to either Txb or Tzb , since neither x nor z ever appears in b− . ˆ Each b ∈ Bxz appears (p + q) times, extended in one or the other of these ways, in B. b b+ In either case, dxz = dxz . So X + X dbxz dbxz = (p + q) ˆ b+ ∈B
b∈Bxz
= (p + q)Oxz Dxz . Now we look at the observations x and y, P noting that the same reasoning will apply to P ˜ b b the pair y and z. As before, p b∈Bxyz dxy = b∈ ˜ B ˜ dxy since each band in Bxyz appears p ˜ times in B. ˆ of the form b b , where b ∈ Bxz . We have: Consider a band b+ ∈ B +
+ dbxy
+
|Txb ∩ Tyb | = 1 − b+ |Tx ∪ Tyb+ | 2|Txb ∩ Tyb | =1− 2|Txb ∪ Tyb | |Txb ∩ Tyb | =1− b |Tx ∪ Tyb | = dbxy
provided |Txb ∪ Tyb | = 6 0.
If |Txb ∪ Tyb | = 0, on the other hand, then neither x nor y is ever in b, and therefore neither + is ever in b+ . In this case, then, dbxy = 0 = dbxy . ˆ of the form b− b where b ∈ Bxz and b− ∈ B−xz . We have: Now consider b+ ∈ B + dbxy
+
+
−
−
|Txb ∩ Tyb | = 1 − b+ |Tx ∪ Tyb+ | |Txb ∩ Tyb | + |Txb ∩ Tyb | = 1 − b− |Tx ∪ Tyb− | + |Txb ∪ Tyb |
as b− and b are appended disjointly
≤1
since this fraction is ≥ 0. −
But note that dbxy = 1 −
−
−
|Txb ∩Tyb | |Txb− ∪Tyb− |
−
. Because b− ∈ B−xz , we know Tyb 6= ∅, so that the −
denominator of this fraction is not zero; but the numerator is zero since Txb = ∅. Then
24
−
−
+
the fraction is zero, and dbxy = 1. Thus dbxy ≤ dbxy . + ˆ ˜ of one or the other of these types. Therefore, to a band in B PB maps P Everyb+b ∈ ˜ b ˜ B ˜ dxy . ˆ dxy ≤ b∈ b+ ∈B Combining these results, we have: X (p + q)Oxz Dxz = (p + q) dbxz b∈Bxyz
=
+
X
dbxz
ˆ b+ ∈B
≤
+
+
X
dbxy + dbyz
by triangle ineq. on each b+
ˆ b+ ∈B
=
+
X
dbxy +
ˆ b+ ∈B
≤
X
˜ dbxy
+
˜ B ˜ b∈
=p
X
X
+
dbyz
ˆ b+ ∈B ˜ dbyz ˜ B ˜ b∈
X
dbxy +
b∈Bxyz
X
dbyz
b∈Bxyz
= p(Oxy Dxy + Oyz Dyz ) ≤ p max(Oxy , Oyz )(Dxy + Dyz ), and so Dxz ≤
p max(Oxy , Oyz ) (Dxy + Dyz ). (p + q)Oxz
But p max(Oxy , Oyz ) max(Oxy , Oyz ) = (p + q)Oxz p+q max(Oxy , Oyz ) = O ≤ 1,
as p = Oxz
since Bxy and Byz are both subsets of Bxyz . Then the above inequality becomes Dxz ≤ 1(Dxy + Dyz ) and so Dxz ≤ Dxy + Dyz .
References Charu C. Aggarwal and Philip S. Yu. The igrid index: Reversing the dimensionality curse for similarity indexing in high dimensional space. ACM KDD Conference, 2000. 25
Charu C. Aggarwal, Alexander Hinneburg, and Daniel A. Keim. On the surprising behavior of distance metrics in high dimensional space. International Conference on Database Theory, 2001. Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. When is “nearest neighbor meaningful? ICDT Conference Proceedings, 1999. Yonghan Feng and Sarah M. Ryan. Scenario construction and reduction applied to stochastic power generation expansion planning. Computers and Operations Research, 40:9–23, 2013. Dimitrios Gunopulos and Gautam Das. Time series similarity measures. Tutorial notes of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining., 2000. Andrew Kusiak and Wenyan Li. Short-term prediction of wind power with a clustering approach. Renewable Energy, 35:2362–2369, 2010. Regina Y. Liu. On a notion of data depth based on random simplices. The Annals of Statistics, 18(1):405–414, 1990. Sara Lopez-Pintado and Juan Romo. On the concept of depth for functional data. Journal of the American Statistical Association, 104(486):718–734, 2009. Martin Maechler, Peter Rousseeuw, Anja Struyf, Mia Hubert, and Kurt Hornik. cluster: Cluster Analysis Basics and Extensions, 2013. R package version 1.14.4 — For new features, see the ’Changelog’ file (in the package source). David Meyer, Evgenia Dimitriadou, Kurt Hornik, Andreas Weingessel, and Friedrich Leisch. e1071: Misc Functions of the Department of Statistics (e1071), TU Wien, 2014. URL http://CRAN.R-project.org/package=e1071. R package version 1.6-2. C.E. Murillo-Sanchez, R.D. Zimmerman, C. Lindsay Anderson, and R.J. Thomas. Secure planning and operations of systems with stochastic sources, energy storage, and active demand. Smart Grid, IEEE Transactions on, 4(4):2220–2229, Dec 2013. Arash Pourhabib, Jianhua Z. Huang, and Yu Ding. Short-term wind speed forecast using measurements from multiple turbines in a wind farm. Technometrics, 2015. William M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850, 1971. Simon Wood. mgcv: Mixed GAM Computation Vehicle with GCV/AIC/REML smoothness estimation, 2014. URL http://CRAN.R-project.org/package=mgcv. R package version 1.7-29.
26