IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 11, NOVEMBER 2009
Efficient Similarity Join over Multiple Stream Time Series

Xiang Lian, Student Member, IEEE, and Lei Chen, Member, IEEE

Abstract—Similarity join (SJ) in time-series databases has a wide spectrum of applications such as data cleaning and mining. Specifically, an SJ query retrieves all pairs of (sub)sequences from two time-series databases that ε-match with each other, where ε is the matching threshold. Previous work on this problem usually considers static time-series databases, where queries are performed either on disk-based multidimensional indexes built on static data or by nested loop join (NLJ) without indexes. SJ over multiple stream time series, which continuously outputs pairs of similar subsequences from stream time series, strongly requires low memory consumption, low processing cost, and query procedures that are themselves adaptive to time-varying stream data. These requirements invalidate the existing approaches in static databases. In this paper, we propose an efficient and effective approach to perform SJ among multiple stream time series incrementally. In particular, we present a novel method, Adaptive Radius-based Search (ARES), which can answer the similarity search without false dismissals and is seamlessly integrated into SJ processing. Most importantly, we provide a formal cost model for ARES, based on which ARES can be adaptive to data characteristics, achieving the minimum number of refined candidate pairs, and thus, suitable for stream processing. Furthermore, in light of the cost model, we utilize space-efficient synopses that are constructed for stream time series to further reduce the candidate set. Extensive experiments demonstrate the efficiency and effectiveness of our proposed approach.

Index Terms—Stream time series, ARES, similarity join, synopsis.
1 INTRODUCTION
In the context of stream time series, similarity join (SJ) has many important applications such as data cleaning and data mining [35], [26]. For example, SJ queries can be used to help clean sensor data collected from various sources that might contain inconsistency [26]. As another example [35], in the stock market, it is crucial to find correlations among stocks so as to make trading decisions in a timely manner. In this case, we can also perform SJ over price curves of stocks in order to obtain their common patterns, rules, or trends. In addition to the above two concrete examples, SJ can be applied to a wide spectrum of stream time series applications including Internet traffic analysis [12], sensor network monitoring [36], and so on. Formally, given two time-series databases R and S containing (sub)sequences of length n, an SJ query outputs all pairs ⟨r, s⟩ such that dist(r, s) ≤ ε, where r ∈ R, s ∈ S, dist(·, ·) is a distance function between two series, and ε is a similarity threshold. To the best of our knowledge, there is no previous work on SJ in the scenario of multiple stream time series, where new data arrive continuously over time. Although some proposals in the literature studied the join over data streams [28], [22], [27], their focus is on load-shedding stream data to disk, in the case of memory overflow, such that the joined pairs (involving only a few join attributes) are output
The authors are with the Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China. E-mail: {xlian, leichen}@cse.ust.hk.
Manuscript received 9 Oct. 2007; revised 7 July 2008; accepted 4 Nov. 2008; published online 8 Jan. 2009. Recommended for acceptance by K. Shim. Digital Object Identifier no. 10.1109/TKDE.2009.27.
smoothly. In contrast, the SJ problem over stream time series is processed entirely in memory, where the join predicate considers subsequences with extremely high dimensionality (join attributes), which raises indexing and query efficiency issues that cannot be solved by simple joins. Furthermore, previous methods for the SJ problem in time-series databases mainly focus on static data and can be classified into two categories, SJ with and without indexes. The work most related to the first category is the spatial join [6], [33], [20], which considers each time series of length n as an n-dimensional data point in a spatial database. In particular, this approach builds multidimensional indexes such as R-tree [14] or spatial hash [20] on (reduced) time series and executes the join operator with the help of the indexes. In contrast to the first category, the second one does not rely on any index. Instead, a nested loop join (NLJ) is invoked to exhaustively compare each possible pair. These approaches for static databases, however, cannot be applied to SJ over stream time series either, due to the unique requirements of stream time-series processing such as low memory consumption, low processing cost, and adaptivity to data characteristics. Specifically, it is not efficient to build an R-tree [14] for each stream time series, since the required memory size is large, and the index maintenance and query costs are high. Moreover, the spatial hash join [20] is designed only for static data, and is thus not adaptive to the changes of stream data. Finally, NLJ incurs a high (i.e., quadratic) computation cost, which is not tuned to stream time-series processing. Motivated by this, in this paper, we propose an efficient and effective approach that incrementally performs SJ over multiple stream time series with low space and computation cost, yet is adaptive to changes of data characteristics. Specifically, we construct space-efficient synopses for stream
time series, which are used to facilitate pruning candidate pairs, and thus, reduce the computation cost during SJ processing. Furthermore, we propose a formal cost model for the entire SJ procedure, in light of which each incremental step is adaptive to the change of stream data, such that the total number of refined candidates at each step is minimized. In particular, we make the following contributions:

1. We propose in Section 5.1 an Adaptive Radius-basEd Search (ARES) approach to answer SJ on stream time series, which does not introduce false dismissals.
2. We provide in Section 5.2 a formal cost model for ARES, based on which ARES can achieve the minimum number of refined candidates and be seamlessly integrated into the SJ procedure.
3. We use the space-efficient synopses over stream time series proposed in Section 4 to further prune the candidate set in light of the cost model in Section 5.3, and thus, reduce the computation cost of SJ.
4. We also discuss batch processing and load shedding for the similarity join over stream time series in Sections 5.5 and 5.6, respectively.

In addition, Section 2 briefly overviews previous work on SJ in static databases as well as the related similarity search problem. Section 3 formally defines our SJ problem over multiple stream time series. Section 4 presents the data structures of the synopses that summarize stream time series and support efficient SJ processing. Section 6 illustrates through extensive experiments the query performance of our proposed approach. Finally, Section 7 concludes the paper.
2 RELATED WORK
2.1 Similarity Join

As indicated before, existing work on SJ in static time-series databases can be classified into two categories: SJ with and without indexes. The most related work [6], [33], [20] in the first category is the spatial join, which constructs spatial indexes, for example, R-tree [14] or spatial hash [20], on (sub)sequences in time-series databases and performs the join operation on the indexes. Specifically, Brinkhoff et al. [6] build an R-tree on each database, traverse both R-trees in a depth-first manner, and finally obtain similar pairs in leaf nodes. Huang et al. [33] improve the performance of the spatial join by traversing R-trees in a breadth-first manner. Lo and Ravishankar [20] spatially hash data into buckets for both databases and retrieve as candidates those pairs that fall into the same buckets. In order to answer queries with range predicates, they map each data object from one of the two databases into multiple buckets. The second category, SJ without indexes, performs the well-known NLJ, which exhaustively computes the distance for each pair of data from the two databases and reports the result if they are similar. In particular, each time, NLJ loads one disk page from a database and joins this page with every page in the other database. Note that NLJ is a generic brute-force approach, which can be applied to a join with any predicate. Clearly, these methods for SJ over static databases usually incur large memory consumption for indexes, high processing cost, or an inability to adapt to the changing
data, and thus cannot be applied directly to SJ processing in the stream scenario. To the best of our knowledge, there is no previous work on the SJ problem over multiple stream time series, which involves the unique characteristics of both time series and stream processing. In particular, a time series is a sequence of ordered data values, usually with long length (i.e., high dimensionality), for example, 128, which makes indexing and querying inefficient due to the "dimensionality curse" problem [32], [4], [3]. Moreover, in the stream scenario, SJ query processing has its own properties, such as limited memory, fast arrival rates, and time-varying data. Thus, it is desirable to design SJ techniques that achieve small memory consumption, low processing cost, and high query accuracy, adaptive to the stream time-series data. All these requirements are challenging problems that we need to solve in order to perform efficient SJ over multiple stream time series.

In the data stream literature, there are some proposals to output the result of an equality join between two data streams (e.g., XJoin [28], hash merge join (HMJ) [22], and rate-based progressive join (RPJ) [27]). Specifically, they use a hash function to map the join attributes of each stream item into buckets and perform a hash join on data from pairwise buckets of two streams. In these approaches, the focus is on load shedding buckets to disk when the memory is full and on outputting the join result as early as possible. Furthermore, they consider only a small number (e.g., 1 or 2) of join attributes (i.e., low dimensionality). In contrast, our SJ processing over multiple stream time series is accomplished entirely in memory, and the dimensionality of each subsequence is usually very high, which raises indexing and query efficiency issues that cannot be handled by stream joins.
2.2 Similarity Search

In this section, we briefly review previous work on similarity search, a research problem closely related to SJ. The similarity search problem is one of the most fundamental problems in time-series databases and arises in many applications, such as multimedia retrieval, data mining, and Web search and retrieval. In particular, a similarity query retrieves all the (sub)sequences in the database that are similar to a user-specified query time series, where the similarity between two series is defined by a distance function, for example, Lp-norm [34], Dynamic Time Warping (DTW) [5], Longest Common Subsequence (LCSS) [29], and Edit distance with Real Penalty (ERP) [10]. In this paper, we consider euclidean distance (i.e., L2-norm) as our similarity measure, which has been widely used in many applications [1], [13].

Agrawal et al. [1] first proposed whole matching in the static sequence database, where all the data sequences are of the same length. Specifically, they reduce the dimensionality of the entire data sequences from n to d (d ≪ n) by applying a dimensionality reduction technique, the Discrete Fourier Transform (DFT), and insert the reduced data into a d-dimensional R-tree [14]. Given any query time series Q of the same length n and a similarity threshold ε, we transform Q to a d-dimensional query point q using DFT similarly, issue a range query centered at q with radius ε on the R-tree
Fig. 1. Illustration of SJ over stream time series.
Fig. 2. The general framework for SJ.
index, and finally refine the candidates returned from the range query by checking their real distances to Q. It has been proved that the candidate set produced by the DFT reduction method does not introduce any false dismissals (actual answers to the query that are, however, absent from the candidate set). Faloutsos et al. [13] later proposed the subsequence matching problem in time-series databases, where data and query time series can have different lengths. Given a query time series Q of arbitrary length n, in order to retrieve subsequences of length n that ε-match Q, the proposed method, called FRM for brevity, preprocesses each time series in the database by extracting sliding windows of size w (assuming that n = s·w for a positive integer s), where w is the minimum possible value of n. Then, FRM reduces the dimensionality of each sliding window from w to f using DFT (f ≪ w) and indexes the reduced series in an R-tree [14]. To answer the range query with query series Q, FRM partitions Q into s disjoint windows of equal length w (since n = s·w), converts them into s f-dimensional query points, q1, q2, ..., and qs, respectively, using DFT, and issues a range query on the R-tree centered at each qi with the smaller radius ε/√s, for 1 ≤ i ≤ s, whose results are finally refined by checking their real euclidean distances to Q. According to the lower bounding lemma [13], since the distance between any two reduced data points under any dimensionality reduction technique (e.g., DFT) is never greater than that between the original data, it is guaranteed that no false dismissals are introduced in the query results. In contrast to FRM, the duality-based method (Dual) [24] extracts disjoint windows of size w from the data time series and sliding windows of size w from the query time series. Moon et al. [23] integrated both the FRM and Dual methods into the framework of general match by introducing the concept of J-disjoint windows and J-sliding windows. Note, however, that FRM can generate much fewer candidates than Dual and general match [23]. Moreover, Faloutsos et al. [13] use MBRs to group data converted from consecutive sliding windows in order to reduce the I/O cost. However, since our SJ processing is performed in memory and no I/O is involved, throughout this paper, by FRM we always mean FRM without MBRs. In the literature of stream time series, Liu and Ferhatosmanoglu [19] build VA-Stream or VA+-Stream to improve the query performance of a linear scan. Kontaki and Papadopoulos [16] construct an R-tree on the reduced series with a deferred update policy to facilitate similarity search. Bulut and Singh [7] use multilevel DWT to represent stream data and monitor pattern queries. Lian et al. [18] propose a
multiscale segment mean (MSM) representation for subsequences to detect static patterns over stream time series. However, applying these methods directly to process stream SJ queries faces scalability problems, in terms of either time or space. Sakurai et al. [25] propose the SPRING approach to monitor stream time series and find subsequences that are similar to a given query sequence under the DTW measure. In contrast, our work studies the stream SJ problem under euclidean distance.
3 PROBLEM DEFINITION
In this section, we formally define the problem of SJ over multiple stream time series and illustrate its general framework. Assume that we have m stream time series T1, T2, ..., and Tm, which are synchronized in the sense that new data items of all series arrive at the same time stamp. For each series Ti, we keep the most recent W data items Ti[t−W+1 : t] in memory, where t is the current time stamp. When a new data item, say Ti[t+1], arrives at the next time stamp (t+1), the oldest item Ti[t−W+1] expires, and thus, is evicted from memory. With this sliding window model for each stream time series, an SJ query continuously outputs similar pairs of subsequences (Si, Sj) of length n from any two series Ti and Tj (for i, j ∈ [1, m]), respectively, such that dist(Si, Sj) ≤ ε, where dist(·, ·) is a distance function between two series and ε is the similarity threshold. As mentioned before, there are many distance functions to measure the similarity between two series, such as Lp-norm [34], DTW [5], and ERP [10]. In this paper, we use the euclidean distance (i.e., L2-norm), which has been widely used in many applications including financial, marketing, or production data analysis, and scientific databases (e.g., with sensor data) [1], [13], and leave the discussion of other measures as future work.

Fig. 1 illustrates our stream SJ scenario. In particular, each stream time series Ti maintains a space-efficient synopsis Syni that can facilitate a fast SJ. Whenever a series Ti receives an insertion or expunges an old data item, the synopsis Syni of Ti is incrementally updated. In the case where Ti obtains a new subsequence, say Snew, we join Snew with the other stream time series Tj with the help of Synj, where 1 ≤ j ≤ m, and report the matching pairs as the join result. Fig. 2 illustrates the general framework of the SJ algorithm, which incrementally outputs the SJ result upon an insertion Ti[t+1] into series Ti. Specifically, due to the new insertion, we first obtain a new subsequence Snew in Ti and update the synopsis Syni of Ti accordingly (lines 1 and 2). Then, for each stream time series Tj (1 ≤ j ≤ m), we output all subsequences in Tj that ε-match Snew (lines 3-6).
Fig. 3. Meanings of symbols used.
Fig. 4. Synopsis structure for stream time series.
In particular, we utilize a novel approach, ARES, to obtain a candidate set cand containing subsequences in Tj that are similar to Snew (line 4). Since procedure ARES is based on a formal cost model that we develop, the number of candidates returned is minimal compared to the FRM approach [13] used in static databases. Next, in procedure Synopsis_Pruning, we further prune the candidates in cand with the help of synopsis Synj, resulting in a new candidate set cand″ (line 5). Finally, candidates in cand″ are refined by checking their real euclidean distances to Snew and output (line 6). In the following sections, we first illustrate the data structures of the synopsis and then discuss the detailed procedures ARES and Synopsis_Pruning. Fig. 3 summarizes the commonly used symbols in this paper.
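To make one incremental step of this framework concrete, here is a minimal Python sketch (our illustration, not the authors' implementation): the brute-force scan below stands in for the ARES filter (Section 5.1) and Synopsis_Pruning (Section 5.3), and synopsis maintenance (lines 1 and 2 of Fig. 2) is omitted.

```python
import math

def euclidean(x, y):
    """euclidean (L2) distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def sj_framework_step(streams, i, n, eps):
    """One incremental step of SJ_Framework (Fig. 2) after series i has
    received a new item: join the newest length-n subsequence S_new of
    T_i with every series T_j.  The exhaustive scan over offsets is a
    placeholder for ARES + Synopsis_Pruning."""
    s_new = streams[i][-n:]                    # newest subsequence of T_i
    result = []
    for j, t_j in enumerate(streams):
        for off in range(len(t_j) - n + 1):    # every length-n subsequence
            sub = t_j[off:off + n]
            if euclidean(s_new, sub) <= eps:   # eps-match -> join result
                result.append((i, j, off))
    return result

# toy usage: three streams with in-memory window W = 8, n = 4, eps = 1.0
streams = [[1, 2, 3, 4, 3, 2, 1, 2],
           [2, 2, 3, 4, 3, 3, 1, 2],
           [9, 9, 9, 9, 9, 9, 9, 9]]
print(sj_framework_step(streams, 0, n=4, eps=1.0))
```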
4 DATA STRUCTURES OF SYNOPSIS
In this section, we discuss the synopsis Syni that we maintain for each stream time series Ti. Fig. 4 illustrates the data structure of Syni, which is similar to HistogramBlooms [21] and consists of two parts: an equi-width histogram and bloom filters. In particular, the histogram of Syni has b cells that are used to summarize the f-dimensional data converted from sliding windows of size w in series Ti (f ≪ w), using any dimensionality reduction technique. In contrast to a normal histogram, however, for the k-th cell (1 ≤ k ≤ b), we keep not only the frequency freqik of data in the cell but also a bloom filter BFik with a bits. Bloom filters have been widely used in many applications [21] to check the existence of values. Specifically, a bloom filter BFik is a bit vector initialized with values "0." A position in BFik is set to value "1" only if there exists at least one reduced data point in the k-th cell such that the start offset of its sliding window is hashed into this position by a hash function H. Furthermore, we also store the start offsets of sliding windows in Syni, which are pointed to by the positions into which they are hashed. As an example, in Fig. 4, assume that we have a sliding window Ti[os : os+w−1] of size w from Ti, which is summarized in synopsis Syni as follows: First, we transform it into an f-dimensional point using any dimensionality reduction method. Then, we find the cell of the histogram (e.g., the first one) into which this point falls and increase its frequency (i.e., freqi1) by 1. Next, we update the bloom filter BFi1 by setting its third position to "1" (assuming that H(os) = 3) and storing the start offset os of the sliding window Ti[os : os+w−1] in the cell, pointed to by the third position of BFi1.

Note that there are many dimensionality reduction techniques, such as Singular Value Decomposition (SVD)
[17], DFT [1], Discrete Wavelet Transform (DWT) [9], Piecewise Aggregate Approximation (PAA) [34], Adaptive Piecewise Constant Approximation (APCA) [15], Chebyshev Polynomials (CP) [8], and Piecewise Linear Approximation (PLA) [11]. Since our SJ processing requires small memory consumption and low processing cost, any reduction approach that satisfies these two requirements can be applied in our synopsis. In this paper, we simply use PAA as our dimensionality reduction method, since it can be computed incrementally in the stream environment without consuming extra space. In particular, PAA takes the mean of the values within each sliding window of size w, reducing the dimensionality from w to 1 (i.e., f = 1). Thus, throughout this paper, we assume that synopsis Syni contains a 1-dimensional histogram (i.e., f = 1). Note that, like many other dimensionality reduction techniques, the query efficiency on the PAA-reduced data can be improved by increasing the reduced dimensionality. In the case of PAA, this can be achieved by specifying a smaller size w of the sliding windows (note: not by increasing f). For other reduction methods, a larger f (> 1) value can be used, and our proposed approaches can be applied on the data structure with arbitrary f, as will be discussed in Section 5.2.

Memory size. The total memory consumption of the m synopses in our SJ scenario is m·(b·⌈log2(W−w+1)⌉ + b·⌈log2 a⌉ + (W−w+1)·⌈log2(W−w+1)⌉) bytes, where m is the number of stream time series, b·⌈log2(W−w+1)⌉ is the space for the frequencies in each histogram, b·⌈log2 a⌉ is the space for the bloom filters in each histogram, and ⌈log2(W−w+1)⌉ is the space for the start offset of each window of size w in a series. As an example, assume that a = 2^8, b = 2^6, W = 2^10, and w = 2^8. With a total of 16M (= 2^24) bytes of available memory, our SJ procedure can retain synopses for about 2,000 stream time series of length 1,024, which is very space efficient. We later discuss load shedding, which further reduces the required memory size by discarding start offsets.

Incremental updates of synopsis. Since we use an incremental dimensionality reduction technique, that is, incremental PAA, the update of the synopsis is very efficient. In particular, when a new data item Ti[t+1] arrives, we convert the most recent sliding window Ti[t−w+2 : t+1] of size w into a 1-dimensional point using incremental PAA. Then, we insert this point into a cell (e.g., the k-th cell) of the histogram, that is, we increase its frequency freqik by 1, set the H(t−w+2)-th position in BFik to 1 with a hash function H, and store the start offset (t−w+2) of the sliding window Ti[t−w+2 : t+1], pointed to by this position. For an expired sliding window, say Ti[t−W+1 : t−W+w], we access the H(t−W+1)-th positions of the b bloom filters until the start offset (t−W+1) is found and removed.
Fig. 5. An example of FRM and ARES. (a) FRM. (b) ARES.
Fig. 6. The procedure of ARES.
The corresponding frequency is decreased by 1, and the position in the bloom filter is set to "0" if no other start offsets are mapped to it; otherwise, it remains "1."
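As a concrete illustration of this data structure, the following Python sketch stores, per cell, a frequency, an a-bit bloom filter (as an integer bitmask), and the start offsets bucketed by bit position. It assumes f = 1, a known PAA value domain [lo, hi), and that the caller supplies the expired window's PAA value (the paper instead probes the b bloom filters to locate it); all names are illustrative.

```python
class Synopsis:
    """Sketch of Syn_i (Fig. 4): equi-width histogram + bloom filters."""

    def __init__(self, b=64, a=256, lo=0.0, hi=1.0, w=4):
        self.b, self.a, self.lo, self.hi, self.w = b, a, lo, hi, w
        # hash H(x) = x*A mod a with w*A = c*a (see Section 5.3), so the
        # consecutive disjoint windows of a subsequence share a position
        self.A = a // w
        self.freq = [0] * b                        # per-cell frequencies
        self.bf = [0] * b                          # per-cell bloom filters
        self.offsets = [dict() for _ in range(b)]  # position -> offsets

    def cell(self, paa):
        """Equi-width cell index of a 1-D PAA value."""
        k = int((paa - self.lo) / (self.hi - self.lo) * self.b)
        return min(max(k, 0), self.b - 1)

    def H(self, os):
        return (os * self.A) % self.a

    def insert(self, paa, os):
        """Summarize the sliding window T_i[os : os+w-1] reduced to paa."""
        k, pos = self.cell(paa), self.H(os)
        self.freq[k] += 1
        self.bf[k] |= 1 << pos                     # set position to "1"
        self.offsets[k].setdefault(pos, set()).add(os)

    def expire(self, paa, os):
        """Remove the expired window starting at offset os."""
        k, pos = self.cell(paa), self.H(os)
        self.freq[k] -= 1
        bucket = self.offsets[k].get(pos, set())
        bucket.discard(os)
        if not bucket:                             # no other offset maps here
            self.bf[k] &= ~(1 << pos)              # reset the bit to "0"
```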
5 SJ OVER MULTIPLE STREAM TIME SERIES
In this section, we discuss the SJ, which outputs all similar pairs of subsequences from stream time series without any false dismissals. As mentioned in Fig. 2, SJ is executed incrementally as follows: Whenever a stream time series Ti receives a new data item Ti[t+1], SJ obtains and then outputs all subsequences from stream time series Tj (1 ≤ j ≤ m) that ε-match the new subsequence Snew = Ti[t−n+2 : t+1] from Ti, where n = s·w. Due to the high dimensionality n (e.g., n = 128) of the new subsequence Snew, most indexes on high-dimensional data fail to show good query performance, even compared with a linear scan. Previous work on tackling this problem usually reduces the dimensionality of the series before indexing them. In particular, Agrawal et al. [1] reduce the dimensionality of the entire time series directly from n to d (d ≪ n), whereas Faloutsos et al. [13] divide each subsequence of length n into s disjoint windows (segments) of size w (n = s·w) and reduce the dimensionality of each segment from w to f (f ≪ w). Note that the former approach requires a large value of d for long subsequences (i.e., with large n) in order to obtain a candidate set containing a small number of false positives (candidates that are not actual answers). However, the resulting indexes, such as R-tree [14] or grid, even on the reduced d-dimensional data, are not efficient (due to the "dimensionality curse" problem), in terms of both memory consumption and query performance, making them unsuitable for stream processing. On the other hand, for a fixed memory size available for the reduced data, the latter approach ([13], FRM) has a much lower dimensionality f, since d = f·s, indicating that f = d/s. Thus, FRM can construct an f-dimensional index with s times lower dimensionality than the former method. However, FRM uses the same radius value for all range queries, regardless of the data distribution, and is thus not adaptive to SJ over stream time series, where data distributions continuously change. Therefore, in the next section, we propose the ARES approach, which is adaptive to the data distribution in answering SJ over multiple stream time series. Most importantly, Section 5.2 provides a cost model that formalizes the number of candidates obtained by ARES, based on which an efficient and effective approach for SJ processing is proposed to minimize the total number of refined candidates.
Section 5.3 utilizes space-efficient synopses to prune candidates of the SJ result such that the cost of SJ processing can be further reduced. Section 5.4 illustrates parameter tuning. Finally, Sections 5.5 and 5.6 discuss SJ batch processing and load shedding, respectively.
5.1 Adaptive Radius-Based Search

Recall that, in FRM [13], we obtain s f-dimensional query points, q1, q2, ..., and qs, from the s disjoint windows of the query series, respectively, and then issue s range queries centered at each query point qi, for 1 ≤ i ≤ s, all with the same query radius ε/√s. Fig. 5a illustrates a simple example of FRM, where s = 2 and f = 1. In particular, the query series is divided into two disjoint windows that are reduced to two one-dimensional query points q1 and q2, respectively. Let point q be the two-dimensional point (q1, q2) illustrated in Fig. 5a. Here, the actual data points we want to obtain are those candidates within ε distance from point q (i.e., within the circle). FRM performs the search as follows: For each query point qi, where i = 1 or 2, FRM issues a range query centered at qi with radius ε/√2, and obtains all candidates within the range (i.e., between the two vertical lines l1 and l2 for q1, or between the horizontal lines l3 and l4 for q2), in the shaded region of Fig. 5a. It has been proved [13] that FRM does not introduce any false dismissals. However, in the case where all the points fall into the region between lines l1 and l2 (i.e., the answer to the range query of q1), FRM has to access all these data, which is inefficient and not acceptable for SJ stream processing.

Therefore, we propose a novel ARES approach, whose intuition is illustrated in Fig. 5b, where different radii ρ1 and ρ2 are used for query points q1 and q2, respectively. With too many candidates close to the query point q1, ARES uses a smaller radius ρ1 such that the total number of candidates to be refined is reduced. Furthermore, in order to guarantee no false dismissals, a larger radius ρ2 is applied for query q2. Thus, ARES always chooses appropriate radii for the range queries to minimize the computation cost (i.e., the number of candidates) based on the data distribution, and is efficient for SJ processing.

Before we discuss how to choose the adaptive radii, let us first illustrate the outline of our ARES approach in Fig. 6. Specifically, ARES first chooses radii ρi, for 1 ≤ i ≤ s, adaptive to the data distribution (different from the equal radii in FRM; line 1) such that no false dismissals are introduced. Details of choosing the radii will be described later in this section. Next, it issues s range queries centered at qi with radii ρi for all 1 ≤ i ≤ s (lines 2-4). Finally, the candidates are refined and the actual answers returned (line 5).
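A minimal Python sketch of this outline could look as follows; range_query and choose_radii are caller-supplied placeholders (not part of the paper), and frm_radii shows the FRM special case.

```python
import math

def ares(query_points, eps, range_query, choose_radii):
    """Sketch of the ARES outline (Fig. 6).  range_query(q, r) returns
    candidates whose reduced point lies within r of q; choose_radii picks
    rho_1..rho_s with sum(rho_i^2) >= eps^2 (Theorem 5.1).  Both are
    supplied by the caller here."""
    radii = choose_radii(query_points, eps)          # line 1: adaptive radii
    assert sum(r * r for r in radii) >= eps * eps - 1e-9
    cand = set()
    for q, r in zip(query_points, radii):            # lines 2-4: range queries
        cand |= set(range_query(q, r))
    return cand                                      # line 5: caller refines

def frm_radii(query_points, eps):
    """FRM is the special case rho_i = eps / sqrt(s)."""
    s = len(query_points)
    return [eps / math.sqrt(s)] * s

# toy usage: 1-D reduced points, brute-force range query
data = [0.1, 0.4, 0.5, 0.9, 1.6]
rq = lambda q, r: [p for p in data if abs(p - q) <= r]
print(ares([0.5, 1.5], 0.5, rq, frm_radii))
```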
Fig. 7. Illustration for Theorem 5.1.
In the sequel, we illustrate how to choose the query radii ρi (1 ≤ i ≤ s) without introducing false dismissals. First, we give a theorem showing that, as long as ARES selects s radii satisfying the condition Σ_{i=1}^{s} ρi² ≥ ε², no false dismissals will be introduced.

Theorem 5.1. Given s query points q1, ..., qs reduced from the s disjoint windows of size w in the query time series, and data points p1, ..., reduced from the sliding windows of size w in a data time series T, ARES guarantees no false dismissals in the candidate set cand if the chosen query radii ρ1, ρ2, ..., and ρs satisfy Σ_{i=1}^{s} ρi² ≥ ε².

Proof. As illustrated in Fig. 7, assume that we have s ordered lists L1, ..., Ls, which, from the bottom up, contain data points in ascending order of their distances from the s query points q1, ..., qs, respectively. For each list Li, we retrieve all data points within distance ρi in the list and add them to the candidate set cand. Now we prove, by contradiction, that as long as Σ_{i=1}^{s} ρi² ≥ ε², there are no false dismissals in cand. Assume that there exists one subsequence S of T that does not belong to the candidate set cand but is an actual answer (i.e., S ε-matches the query series). Without loss of generality, assume that subsequence S contains s disjoint windows of size w, which are reduced to data points p1, ..., ps, respectively. Since S is not in the candidate set cand, each pi must have distance from qi greater than ρi, that is, dist(qi, pi) > ρi, for all 1 ≤ i ≤ s (otherwise, subsequence S would be checked, and thus, included in the candidate set). Therefore, we have Σ_{i=1}^{s} dist(qi, pi)² > Σ_{i=1}^{s} ρi². Furthermore, since it holds that Σ_{i=1}^{s} ρi² ≥ ε², we have Σ_{i=1}^{s} dist(qi, pi)² > ε², indicating that subsequence S is not an actual answer, which contradicts our assumption. Thus, the theorem holds. □

Theorem 5.1 indicates that ARES always gives the exact answer and obtains all subsequences similar to a query series as long as Σ_{i=1}^{s} ρi² ≥ ε² holds. Note that there might exist many possible radius combinations satisfying the no-false-dismissal condition of Theorem 5.1. For example, FRM [13] is a special case of ARES, where ρ1 = ρ2 = ... = ρs = ε/√s. However, different selections of radii may result in different computation costs. In the next section, we provide a cost model to decide how to select a "good" radius combination such that the cost of retrieving candidates is minimized.
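As a quick sanity check of the condition in Theorem 5.1, the FRM choice ρi = ε/√s satisfies it with equality:

$$\sum_{i=1}^{s}\rho_i^2 \;=\; \sum_{i=1}^{s}\left(\frac{\varepsilon}{\sqrt{s}}\right)^2 \;=\; s\cdot\frac{\varepsilon^2}{s} \;=\; \varepsilon^2 \;\ge\; \varepsilon^2,$$

so FRM never misses an answer; ARES exploits the freedom to shift radius mass toward sparse query points while keeping the same quadratic sum.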
5.2 Cost Model for ARES

As a second step, we propose a formal cost model for ARES, in terms of the number of refined candidates, based
on which an efficient and effective approach is presented to achieve the near-optimal (i.e., minimum) number of candidates. Specifically, for our SJ problem shown in Fig. 2, we assume that each sliding window of size w extracted from any stream time series Tj (1 ≤ j ≤ m) is converted into a one-dimensional point using incremental PAA and inserted into its synopsis (consisting of a histogram and bloom filters). The query time series Snew of length n (= s·w) in Ti is partitioned into s disjoint windows of size w and then reduced to s one-dimensional query points q1, q2, ..., and qs, respectively. Next, our goal is to select the appropriate radius ρi in ARES (line 1 of Fig. 6) for each query point qi (1 ≤ i ≤ s), under the constraint Σ_{i=1}^{s} ρi² ≥ ε² of Theorem 5.1, such that the total number of candidates is minimized. Note that the case of reducing the dimensionality from w to arbitrary f can easily be handled as an extension, which will be discussed at the end of this subsection. In the sequel, we first consider the case where f = 1.

Let gi(ρi) be the number of data points (candidates) that are within ρi distance from the query point qi in the one-dimensional space, and pdfj(x) be the density function of data points converted from stream time series Tj, where x is within the domain of the reduced data. Note that the density function pdfj(x) can be obtained from the synopsis (i.e., histogram) of Tj. We have the following equation with respect to gi(ρi) and pdfj(x):

$$g_i(\rho_i) = \int_{q_i-\rho_i}^{q_i+\rho_i} pdf_j(x)\,dx. \qquad (1)$$

Furthermore, since we have

$$|cand| = \sum_{i=1}^{s} g_i(\rho_i), \qquad (2)$$

where |cand| is the total number of candidates, our goal is to minimize |cand| under the constraint

$$\sum_{i=1}^{s} \rho_i^2 \;\ge\; \varepsilon^2. \qquad (3)$$

In the sequel, we first assume that data points are uniformly distributed within search radius ε of each query point qi, for 1 ≤ i ≤ s. We later extend this to the nonuniform case. Without loss of generality, in the uniform case, let di be the density of data points within ε distance from the query point qi. Equation (1) is rewritten as

$$g_i(\rho_i) = \int_{q_i-\rho_i}^{q_i+\rho_i} pdf_j(x)\,dx = 2\,d_i\,\rho_i, \qquad (4)$$

where di is the density within ε distance from the query point qi. Similarly, (2) can be rewritten as

$$|cand| = 2\sum_{i=1}^{s} d_i\,\rho_i, \qquad (5)$$

where (3) holds. Thus, we want to select appropriate search radii ρ1, ..., ρs, in order to minimize the total number |cand| of candidates in (5). Fig. 8 illustrates a simple example of
Fig. 8. Choosing optimal search radii.
selecting query radii, where s = 2. Assuming that d1 = 1 and d2 = 0.5, we have |cand| = 2ρ1 + ρ2 by (5), which is the dash-dot line with intercept (0, |cand|) in the ρ1-ρ2 space. Furthermore, the circle in Fig. 8 centered at the origin with radius ε in the first quadrant is the border of the shaded region satisfying the constraint ρ1² + ρ2² ≥ ε². Intuitively, when we move the line 2ρ1 + ρ2 − c = 0 upward in parallel by increasing parameter c from zero to ε, for the first time, the line will intersect the shaded region. In other words, there exists one ⟨ρ1, ρ2⟩-pair (i.e., ⟨0, ε⟩) that satisfies the constraint ρ1² + ρ2² ≥ ε². Note that, at that point, the number of candidates |cand| is minimized. In general, in order to obtain the minimum number of candidates, we should set to ε the radius of the query point (i.e., q2 in the example) that has the minimum density (i.e., d2, since d2 < d1), and to 0 the radii of the other query points (i.e., q1). We summarize the above example in the following theorem.

Theorem 5.2. Assume that data points are uniformly distributed within ε distance from each query point qi with density di, where 1 ≤ i ≤ s. In order to obtain the minimum number of candidates, we always find the query point qj with the minimum density dj and set ρi to 0 if i ≠ j, and to ε otherwise.

Proof. Without loss of generality, assume that

$$d_1 \le d_2 \le \cdots \le d_s. \qquad (6)$$

We want to prove that the total number |cand| of candidates is minimized when ρ1 = ε and ρi = 0 for all 2 ≤ i ≤ s. In other words, we only need to prove the following inequality:

$$2\sum_{i=1}^{s} d_i\,\rho_i \;\ge\; 2\,d_1\,\varepsilon, \qquad (7)$$

where Σ_{i=1}^{s} ρi² ≥ ε². We prove (7) as follows: Based on (6), it holds that

$$2\sum_{i=1}^{s} d_i\,\rho_i \;\ge\; 2\,d_1 \sum_{i=1}^{s} \rho_i. \qquad (8)$$

Furthermore, since we have

$$\sum_{i=1}^{s} \rho_i = \sqrt{\Big(\sum_{i=1}^{s} \rho_i\Big)^2} \;\ge\; \sqrt{\sum_{i=1}^{s} \rho_i^2}, \qquad (9)$$

by combining (8), (9), and (3), we exactly obtain (7), which completes the proof. □
Therefore, Theorem 5.2 shows that, under the assumption of uniform data distribution within ε distance from each query point, the minimum number of candidates is achieved when we select the query point qj with the lowest density among all query points and issue one range query centered at qj with radius ε.

Now we compare FRM [13] with our theoretical solution under the uniform distribution assumption. Since FRM issues s range queries with radii ε/√s centered at the s different query points qi (1 ≤ i ≤ s), the expected number of candidates for each range query with query point qi is given by (2ε/√s)·di. Note that although there are duplicates among the candidates retrieved by different range queries, FRM still needs to retrieve these duplicates. Thus, FRM expects to obtain a total of (2ε/√s)·Σ_{i=1}^{s} di candidates for the s range queries, whose retrieval cost is given by O((2ε/√s)·Σ_{i=1}^{s} di). In contrast, assuming that d1 ≤ dj for j > 1, our ARES approach only needs to retrieve 2·d1·ε candidates with time complexity O(2·d1·ε). Therefore, even if all di are equal to d, ARES can save the computation cost of retrieving as many as 2·d·ε·(√s − 1) (= (2ε/√s)·Σ_{i=1}^{s} d − 2·d·ε) candidates, compared to FRM.

Discussions on nonuniform data distribution. In the case where data are not uniformly distributed within ε distance from each query point qi, the globally optimal solution would be as follows: For each possible value combination of ρ1, ..., and ρs satisfying the constraint Σ_{i=1}^{s} ρi² = ε², we obtain the total number of candidates falling into these ranges and select the combination that results in the smallest candidate set. This solution is globally optimal; however, its computation cost is rather high. Thus, we seek locally optimal solutions instead. Specifically, we divide the problem of retrieving candidates that are within ε distance from the query series into subproblems of finding candidates that are within the intervals [0, δ], (δ, 2δ], ..., and (ε−δ, ε] of distance from the query series, respectively, where δ ≪ ε. Here, δ is a small value such that data points within each interval can be assumed to be uniformly distributed (e.g., δ can be taken as the size of the cells). Therefore, for each subproblem, we apply the ARES strategy discussed above, which incurs much fewer candidates than FRM. In particular, we initially compute the densities d1, d2, ..., ds of the s ranges centered at query points q1, q2, ..., qs, respectively, with the same radius δ; obtain the query point (e.g., q1) with the lowest density (e.g., d1); and issue a range query centered at q1 with radius δ. As a second step, we obtain those candidates whose distances from the query series are within (δ, 2δ]. In particular, with the help of the histogram, we calculate the increased number of candidates obtained by enlarging the search radius of each query point. Note that, in our example, the increased number of candidates for query point q1 is the number of candidates within (δ, 2δ] distance from q1, whereas that for a query point qi (i ≠ 1) is the number of candidates within √((2δ)² − δ²)
distance from qi. Then, we select the query point with the minimum increased number of candidates and perform the search with radius in (δ, 2δ] for q1, or √((2δ)² − δ²) for another qi, where i > 1. This procedure repeats until the total search radius ε is finally reached (i.e., (3) holds).

Discussions on the general case with arbitrary f. Up to now, we have assumed that the reduced dimensionality (via PAA) of each sliding window (extracted from stream time series) is equal to 1, that is, f = 1. Now we discuss the general case, where f ≥ 1, using any dimensionality reduction method. In particular, for all the reduced f-dimensional data, we can construct a generalized synopsis on which our proposed ARES methodology can be applied with small modifications. Specifically, the generalized synopsis structure is similar to that defined in Section 4. The only difference is that the histogram has f dimensions, for f > 1 (rather than the 1D histogram in the case where f = 1). The bloom filters (with start offsets of sliding windows) corresponding to the cells of the histogram are defined in the same way as in Section 4. Given a query sequence Snew of length n, we divide it into s disjoint windows of equal size w (n = w·s) and then reduce these disjoint windows to s f-dimensional query points q1, q2, ..., and qs, respectively, where f ≥ 1. Our ARES approach also needs to estimate the query radii ρi for the range queries with query points qi (1 ≤ i ≤ s). In particular, we can obtain the density di around each query point qi from the f-dimensional histogram in the constructed synopsis and decide the query radii in a manner similar to Theorem 5.1. Note that the space cost of our constructed synopsis for the general case (f ≥ 1) is proportional to the number of cells in the synopsis (i.e., b = e^f cells with frequency information and bloom filters, where e is the number of intervals that divide the data space on each dimension). Thus, compared to the case where f = 1, the number of cells (or bloom filters) increases for f > 1, requiring higher space cost. Moreover, the distance computation between two f-dimensional points requires higher time cost, proportional to f. On the other hand, the size of the resulting candidate set for a large f value is expected to be small, due to the higher pruning power obtained by using more reduced dimensions to prune. Thus, there is a trade-off between the space (constrained by the available memory size) and the pruning power in the stream environment.
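The locally optimal, δ-step radius selection described above might be sketched as follows; hist_count(i, r), which estimates gi(r) from the histogram, is a hypothetical caller-supplied estimator.

```python
import math

def adaptive_radii(hist_count, eps, delta, s):
    """Locally optimal radius selection for nonuniform data (Section 5.2).
    hist_count(i, r) estimates g_i(r), the number of reduced points within
    r of query point q_i, e.g., from the histogram in the synopsis.  Each
    step grows one radius while keeping sum(rho_i^2) = (l*delta)^2."""
    rho = [0.0] * s
    steps = max(1, math.ceil(eps / delta))
    for l in range(1, steps + 1):
        total = min(l * delta, eps)
        prev_sq = sum(r * r for r in rho)        # equals ((l-1)*delta)^2
        best_i, best_extra, best_r = 0, float("inf"), 0.0
        for i in range(s):
            new_r = math.sqrt(max(0.0, total * total - (prev_sq - rho[i] ** 2)))
            extra = hist_count(i, new_r) - hist_count(i, rho[i])
            if extra < best_extra:
                best_i, best_extra, best_r = i, extra, new_r
        rho[best_i] = best_r
    return rho   # satisfies sum(rho_i^2) = eps^2, so Theorem 5.1 holds

# toy usage: pretend the density around query point i is (i + 1)
print(adaptive_radii(lambda i, r: 2 * (i + 1) * r, eps=1.0, delta=0.25, s=3))
# -> all radius mass goes to the sparsest point: [1.0, 0.0, 0.0]
```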
5.3 Pruning with Synopsis

Although ARES can minimize the total number of candidates obtained from the s range queries without introducing false dismissals, there are still many false positives in the candidate set. Fig. 9 illustrates an example of such false positives. Suppose that we have a query subsequence Q of length 2w, which is divided into two disjoint windows and transformed into two 1-dimensional query points q1 and q2, respectively, using PAA. Consider one subsequence Ti[os1 : os1+2w−1] of length 2w from Ti, which similarly has two converted one-dimensional points p1 and p2. Without loss of generality, assume that p1 is within ρ1 (≤ ε) distance from q1, that is, subsequences
Fig. 9. Pruning heuristics.
Ti[os1 : os1+2w−1] and Q form a candidate pair. According to ARES, we would need to compute the euclidean distance between subsequences Ti[os1 : os1+2w−1] and Q from scratch. However, we can save this computation cost if p2 has a distance from q2 greater than √(ε² − dist²(q1, p1)) (or simply ε in the sequel), where dist(q1, p1) is the euclidean distance between the two 1-dimensional points q1 and p1. In other words, if it holds that dist(q2, p2) > ε, then we have dist²(q1, p1) + dist²(q2, p2) > ε², which indicates that this candidate pair is a false positive, and thus, can safely be pruned. This motivates us to further refine the candidate set after ARES with the help of the synopsis (including histograms and bloom filters).

Recall that, during SJ processing, we hash the start offset of each sliding window in Ti into a position of a bloom filter in Syni using a hash function H and set this position to value "1." As in Fig. 9, windows Ti[os1 : os1+w−1] and Ti[os1+w : os1+2w−1] are mapped to positions H(os1) and H(os1+w) of bloom filters BF and BF′, respectively. Since dist(q1, p1) ≤ ρ1, the range query centered at q1 with radius ρ1 covers the cell containing BF. Thus, the pair ⟨Ti[os1 : os1+2w−1], Q⟩ is a candidate. In order to further refine this pair (i.e., prune it if possible), we retrieve all bloom filters BF′(1), BF′(2), ..., in those cells that are within ε distance from query point q2 and check whether or not their H(os1+w)-th positions contain the value "1." If all these positions are "0," then data point p2 does not fall into any of these cells, which indicates that dist(q2, p2) > ε, and the candidate pair is a false positive that can safely be removed; otherwise, it remains a candidate.

We observe that it is quite inefficient to perform bit checking in bloom filters one by one for each candidate pair. In order to speed up this procedure, we use a special family of hash functions H for the bloom filters, which satisfy the condition H(os) = H(os+w) = ... = H(os+(s−1)·w), where os is a positive integer representing the start offset of a sliding window and w is the window size. One instance H of this hash family is defined as H(x) = x·A mod a, where A is a random number within [1, a−1], w·A = c·a, and c is a positive integer. Therefore, in Fig. 9, we have H(os1) = H(os1+w), that is, disjoint and consecutive windows (e.g., Ti[os1 : os1+w−1] and Ti[os1+w : os1+2w−1]) have their start offsets mapped to the same position of (possibly different) bloom filters (i.e., BF and BF′), which enables efficient bit operations among bit vectors. Without loss of generality, assume that H(os1) = H(os1+w) = 3. As illustrated in Fig. 10a, the third (H(os1)-th) position in BF is set to "1," where BF corresponds to the cell into which point p1 from window Ti[os1 : os1+w−1] falls. Next, we retrieve
Fig. 10. Pruning with bit operations (candidate pair ⟨Ti[os1 : os1+2w−1], Q⟩). (a) Successful case. (b) Unsuccessful case.

those bloom filters BF′(1), BF′(2), ..., whose corresponding cells are within ε distance from query point q2, and want to check whether or not the third (H(os1+w)-th) bit in any of them is zero. Since the positions of all bit vectors that need to be verified are at the same place (i.e., the third position), instead of checking these positions one by one, we can perform bit operations efficiently. Specifically, we first use bit OR operations over all bloom filters BF′(1), BF′(2), ..., to obtain a vector V2 (= ⋁_k BF′(k)), and then calculate the vector V, which is the bit AND of BF and V2 (i.e., V = BF ∧ V2). In Fig. 10a, since all bits at the third position of the BF′(k) are "0," the resulting vector V also has "0" at this position, indicating that candidates pointed to by the same (third) location of BF can successfully be pruned, since the start offset (os1+w) is not mapped into any of those bloom filters BF′(k). On the other hand, as illustrated in Fig. 10b, if there exists a bloom filter, say BF′(2), having value "1" at the third position, then the third position of V is also "1," implying that candidates pointed to by this (third) position of BF cannot be pruned, since the start offset (os1+w) may have been hashed into BF′(2).

Note that Fig. 10 only illustrates the example where s = 2. In the case where s > 2, when checking candidates in BF obtained from the query of q1, for each query point qj (1 < j ≤ s), we need to retrieve those bloom filters in cells that are within ε distance from qj and efficiently perform bit OR operations over all of them, resulting in a bit vector Vj. Then, we compute the bit vector V using bit AND of BF and all Vj, where 1 < j ≤ s. That is, V = BF ∧ V2 ∧ V3 ∧ ... ∧ Vs. The meaning of the bits in V is similar to that in the case where s = 2.

Fig. 11. Pruning procedure with synopsis.

Fig. 11 illustrates the detailed procedure of pruning with the synopses constructed for stream time series. In particular, we first compute the bit vector Vj, which is the bit OR of all bloom filters whose corresponding cells are within ε distance from qj, for each 1 ≤ j ≤ s (line 1). Then, we consider the candidates from the range query of each query point qi separately (lines 2-8). In particular, for each qi, we obtain the cells ci in the synopsis whose minimum distance from qi is within ρi, as well as the bloom filters in them (line 3). For each such bloom filter BF, we obtain a bit vector V, which is the bit AND of BF with all Vj, where 1 ≤ j ≤ s and j ≠ i (line 4). Therefore, the resulting bit vector V summarizes the remaining candidates after pruning, where the value "1" in V indicates that candidates pointed to by the same position in BF cannot be pruned, and "0" implies that they can safely be pruned. After pruning with bloom filters, for each remaining candidate ti with start offset osi, we retrieve the s cells c1, c2, ..., ci, ..., and cs, which contain the s start offsets (osi − (i−1)·w), (osi − (i−2)·w), ..., osi, ..., and (osi + (s−i)·w) (corresponding to the consecutive and disjoint windows of the candidate subsequence), respectively (line 6). If any of these offsets (e.g., (osi + (j−i)·w) in cell cj) is absent within ε range of qj, or the summation of the minimum (squared) distances mindist²(qj, cj) is greater than ε², then candidate ti is pruned; otherwise, it is inserted into the candidate set cand″ (lines 7 and 8).
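When bloom filters are stored as integer bitmasks, the bit-operation pruning of Figs. 10 and 11 amounts to a few mask operations, as in this illustrative Python sketch.

```python
def prune_with_bitops(bf, filters_near_other_queries):
    """Bit-operation pruning of Figs. 10 and 11.  bf is the bloom filter
    (an int bitmask) of a cell hit by the range query of q_i;
    filters_near_other_queries[j] lists the bloom filters of the cells
    within eps of query point q_j (j != i).  Bits that stay 1 in V mark
    candidates that cannot be pruned; bits that drop to 0 can be."""
    v = bf
    for filters in filters_near_other_queries:
        v_j = 0
        for f in filters:          # V_j = bit OR of all bloom filters near q_j
            v_j |= f
        v &= v_j                   # V = BF AND V_2 AND ... AND V_s
    return v

# toy usage with 8-bit filters; the start offset hashes to position 3
BF = 0b00001000
near_q2 = [0b00000001, 0b00010000]    # no filter near q2 has bit 3 set
print(bin(prune_with_bitops(BF, [near_q2])))   # 0b0 -> candidate pruned
```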
5.4 Tuning Parameters

In this section, we discuss how to choose the parameters a and s at the beginning of SJ processing, which are the number of bits in each bloom filter and the number of disjoint windows in each subsequence of length n, respectively. Our goal is to achieve the lowest computation cost with appropriate values of a and s. Specifically, let cand′ be the candidate set after pruning with bloom filters the candidates returned from ARES (i.e., all candidates ti in line 5 of Fig. 11). As we know, each position in a bloom filter of a bits is set to 1 with probability (1 − (1 − 1/a)^N), where N is the number of data items that have been hashed into the bloom filter so far. According to line 4 of Fig. 11, a position is set to "1" in bit vector V, which is the bit AND of s vectors, with probability Π_{j=1, j≠i}^{s} (1 − (1 − 1/a)^{gj(ε)}) · (1 − (1 − 1/a)^{gi(ρi)}), where gj(ε) (resp. gi(ρi)) is the number of data points within ε (resp. ρi) distance from query point qj (resp. qi). Therefore, the number |cand′| of candidates after pruning with the synopsis is expected to be

$$|cand'| = \sum_{i=1}^{s} g_i(\rho_i)\prod_{\substack{j=1\\ j\neq i}}^{s}\Big(1-\big(1-\tfrac{1}{a}\big)^{g_j(\varepsilon)}\Big)\Big(1-\big(1-\tfrac{1}{a}\big)^{g_i(\rho_i)}\Big), \qquad (10)$$

which can be rewritten as

$$|cand'| = \prod_{j=1}^{s}\Big(1-\big(1-\tfrac{1}{a}\big)^{g_j(\varepsilon)}\Big)\sum_{i=1}^{s} g_i(\rho_i)\,\frac{1-\big(1-\tfrac{1}{a}\big)^{g_i(\rho_i)}}{1-\big(1-\tfrac{1}{a}\big)^{g_i(\varepsilon)}}. \qquad (11)$$

We simplify (11) using Taylor's Theorem, (1 − (1 − 1/a)^x) ≈ x/a for x ≪ a, and obtain

$$|cand'| = \prod_{j=1}^{s}\Big(1-\big(1-\tfrac{1}{a}\big)^{g_j(\varepsilon)}\Big)\sum_{i=1}^{s}\frac{g_i(\rho_i)^2}{g_i(\varepsilon)}, \qquad (12)$$

where (3) holds. Therefore, (12) provides a formal equation modeling the total number |cand′| of candidates after pruning with bloom filters. At the beginning of SJ processing, we assume a uniform distribution of the underlying data, and thus, (4) holds, that is, gi(ρi) = 2·di·ρi. Furthermore, assuming that d1 is the lowest density among all di, we set ρ1 = ε and ρi = 0 for i ≠ 1. We rewrite (12) as follows:

$$|cand'| = \prod_{j=1}^{s}\frac{2\,d_j\,\varepsilon}{a}\cdot 2\,d_1\,\varepsilon, \qquad (13)$$

where (3) holds. As in lines 6-8 of Fig. 11, for each candidate ti, the refinement cost of ti is O(s) in the worst case, since we need to find (s − 1) locations of start offsets and compute the summed distance over the s windows. Thus, the total computation cost is |cand′|·s, where |cand′| is given in (13). We model the total cost, Cost, of lines 6-8 in Fig. 11 as

$$Cost = |cand'|\cdot s = 2\,d_1\,\varepsilon\cdot\prod_{j=1}^{s}\frac{2\,d_j\,\varepsilon}{a}\cdot s, \qquad (14)$$

which can be rewritten as

$$\log(Cost) = \log(2\,d_1\,\varepsilon) + \sum_{j=1}^{s}\log(2\,d_j\,\varepsilon) - s\log(a) + \log(s). \qquad (15)$$

Without loss of generality, we ignore the first constant term. Let log(2·dj·ε) be a random number generated from a random variable X with mean μ and variance σ², which can initially be obtained from the histograms in the synopses. Therefore, (15) can be simplified (with Cost now denoting the logarithmic cost) as

$$Cost = \sum_{j=1}^{s} X_j - s\log(a) + \log(s), \qquad (16)$$

where Xj ∼ N(μ, σ²). Our goal is to minimize Cost in (16) by selecting the best values of s and a. Let Σ_{j=1}^{s} Xj follow a random variable Z with cumulative density function (CDF) F(λ). Assuming that the Xj are random numbers independently drawn from a random variable X with mean μ and variance σ², we apply the Central Limit Theorem (CLT) [31] and obtain F(λ) = Prob{Z ≤ λ} = Φ((λ − s·μ)/(√s·σ)), where Φ(x) is the CDF of the normal distribution. We approximate Φ(x) with a linear function:

$$\Phi(x) \approx \frac{1}{2}\Big(1 + \sqrt{\tfrac{2}{\pi}}\; x\Big). \qquad (17)$$

By combining (17), we have

$$F(\lambda) = \frac{1}{2}\Big(1 + \sqrt{\tfrac{2}{\pi}}\cdot\frac{\lambda - s\,\mu}{\sqrt{s}\,\sigma}\Big). \qquad (18)$$

Therefore, the expected value E(Z) of Z is

$$E(Z) = \int_{\lambda_{min}}^{\lambda_{max}} \lambda\, F'(\lambda)\, d\lambda = \int_{\lambda_{min}}^{\lambda_{max}} \frac{\lambda}{\sigma\sqrt{2\pi s}}\, d\lambda, \qquad (19)$$

where F′(λ) is the probability density function (PDF) of variable Z, and λmin and λmax are the minimum and maximum possible values of Z, respectively. Equation (19) can be simplified as

$$E(Z) = \frac{1}{2\,\sigma\sqrt{2\pi s}}\Big(\lambda_{max}^2 - \lambda_{min}^2\Big). \qquad (20)$$

Thus, substituting (20) into (16), we have

$$Cost = E(Z) - s\log(a) + \log(s) = \frac{\lambda_{max}^2 - \lambda_{min}^2}{2\,\sigma\sqrt{2\pi s}} - s\log(a) + \log(s). \qquad (21)$$

In practice, s should not be too large, since a large s incurs a high computation cost for combining the results. Therefore, we choose a value of s no greater than 8. After fixing s, the value of a can easily be derived from (21) so as to achieve the minimum cost.
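A small Python sketch of how s and a could be chosen from (21); the memory bound max_bits on the bloom-filter size a is our added assumption, since (21) alone always favors a larger a.

```python
import math

def tuning_cost(s, a, lam_min, lam_max, sigma):
    """The expected cost of (21); lam_min, lam_max, and sigma are the
    statistics of log(2*d_j*eps) taken from the histograms."""
    return ((lam_max ** 2 - lam_min ** 2) / (2 * sigma * math.sqrt(2 * math.pi * s))
            - s * math.log(a) + math.log(s))

def choose_parameters(lam_min, lam_max, sigma, max_bits):
    """Pick s in [1, 8] (the cap recommended above) and a power-of-two
    bloom-filter size a minimizing (21) under the memory bound."""
    best = None
    for s in range(1, 9):
        a = 16
        while a <= max_bits:
            c = tuning_cost(s, a, lam_min, lam_max, sigma)
            if best is None or c < best[0]:
                best = (c, s, a)
            a *= 2
    return best[1], best[2]

print(choose_parameters(lam_min=0.0, lam_max=4.0, sigma=0.5, max_bits=1024))
```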
5.5 Batch Processing

In this section, we discuss the batch processing of SJ over multiple stream time series. In reality, stream data may arrive at the system at different rates or can be delayed and then suddenly arrive in a batch due to various reasons, such as network congestion. In such situations, we need to handle SJ queries over a number of (e.g., t) new data items (i.e., new subsequences Snew^(1), Snew^(2), ..., and Snew^(t)) at the same time. One straightforward way to solve this problem is to invoke procedure SJ_Framework (described in Fig. 2) several times, considering the new subsequences separately as if they arrived one after another. This method requires invoking procedure ARES for each subsequence Snew^(i) (1 ≤ i ≤ t), which incurs a high search cost in finding similar pairs from stream time series. In contrast, batch processing can group consecutive new subsequences (e.g., from Snew^(1) to Snew^(k)) that have temporal correlations (i.e., are close to each other) and handle SJ by invoking procedure ARES only once for each group.

Next, we illustrate the details of batch processing. Let q1^(j), q2^(j), ..., and qs^(j) be the s query points converted from the s disjoint windows of subsequence Snew^(j) (1 ≤ j ≤ t), respectively. For simplicity, we only consider the assumption of uniform data distribution, that is, data are uniformly distributed within ε distance from each query point qi^(j), for 1 ≤ i ≤ s and 1 ≤ j ≤ t. Following the strategy of ARES, which selects a query point with the lowest density and issues a single range query with radius ε, we define a new term, the group density, with which we choose the i-th query points qi^(j) of the subsequences in the group that have the lowest group density and issue one range query for the entire group.

Specifically, assume that we have a group of k subsequences Snew^(1), Snew^(2), ..., and Snew^(k). Let gi([qi^(1), qi^(k)], ε) = Σ_{j=1}^{k} gi(qi^(j), ε), where qi^(j) is the i-th query point of new subsequence Snew^(j), and gi(qi^(j), ε) is the number of candidates obtained from a range query centered at qi^(j) with radius ε. In other words, gi([qi^(1), qi^(k)], ε) is the total number of candidates for the group if we perform a single group range query using the i-th query points. The group density di([qi^(1), qi^(k)], ε) is defined as gi([qi^(1), qi^(k)], ε)/(k·(2ε)), that is, the number of group candidates divided by the total length of the intervals covered by k separate range queries (each with interval 2ε). Let dmin^(k) be the lowest group density di([qi^(1), qi^(k)], ε) over all i in the group. We retrieve all data within the interval [min_{j=1}^{k}{qi^(j)} − ε, max_{j=1}^{k}{qi^(j)} + ε] as group candidates and then refine them using the synopsis, similar to Fig. 11, where the only required modification is to let Vj be the bit OR of all bloom filters within the interval [min_{l=1}^{k}{qj^(l)} − ε, max_{l=1}^{k}{qj^(l)} + ε] (in line 1 of Fig. 11). Finally, the group candidates are checked by computing their real euclidean distances to Snew^(j) (1 ≤ j ≤ k).

The only issue that remains to be addressed is how to partition the new subsequences Snew^(1), Snew^(2), ..., and Snew^(t) into several groups. In particular, we need to decide the total number of groups as well as the group memberships. Our basic idea is to treat consecutive new subsequences as one group in an online fashion, since they tend to be close to each other. Without loss of generality, assume that we have already included k consecutive subsequences Snew^(1), Snew^(2), ..., and Snew^(k) in a group. Then, we need to decide the membership of the (k+1)-th subsequence Snew^(k+1). In order to include Snew^(k+1) in the same group as the previous k subsequences, we require that dmin^(k+1) ≤ dmin^(k). Note that after including Snew^(k+1), the lowest group density may be achieved when we issue the query with the j-th query points of the (k+1) subsequences, instead of the i-th ones of the k subsequences. However, our grouping rationale is to add Snew^(k+1) to the group only if the average number of candidates to be refined for each subsequence in the group does not increase. Otherwise, if dmin^(k) < dmin^(k+1), we start a new group with Snew^(k+1).
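The online grouping rule might be sketched as follows; count_in_range is a hypothetical histogram-based estimator of the number of reduced points in an interval, and the example uses s = 1 query point per subsequence for brevity.

```python
def group_subsequences(point_lists, eps, count_in_range):
    """Online grouping for batch SJ (Section 5.5).  point_lists[j][i] is
    the i-th query point of the j-th new subsequence; count_in_range(lo,
    hi) estimates the number of reduced points in [lo, hi].  A new
    subsequence joins the current group only if the lowest group density
    does not increase."""
    def min_group_density(group):
        best = float("inf")
        for i in range(len(group[0])):           # try each window position i
            # group candidate count: sum of the k individual range queries
            total = sum(count_in_range(q[i] - eps, q[i] + eps) for q in group)
            best = min(best, total / (len(group) * 2 * eps))
        return best

    groups = []
    for q in point_lists:
        if groups and min_group_density(groups[-1] + [q]) <= min_group_density(groups[-1]):
            groups[-1].append(q)
        else:
            groups.append([q])
    return groups

# toy usage (s = 1): step density 2.0 on [0,1), 5.0 on [2,3), 0.1 elsewhere
def cnt(lo, hi):
    seg = lambda x, y: max(0.0, min(hi, y) - max(lo, x))
    covered = seg(0.0, 1.0) + seg(2.0, 3.0)
    return 2.0 * seg(0.0, 1.0) + 5.0 * seg(2.0, 3.0) + 0.1 * (hi - lo - covered)

print([len(g) for g in group_subsequences([[0.5], [0.55], [2.5]], 0.25, cnt)])
# -> [2, 1]: the denser point around 2.5 starts a new group
```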
5.6 Load Shedding

In the previous sections, we always assumed that the memory is large enough to retain all synopses of the stream time series. We now consider the case where this assumption is violated. In particular, assume that the system requires $B$ extra memory; in other words, we have to load-shed (evict) data in the synopses with a total size of $B$ from the memory. Our basic idea for load shedding is to evict start offsets in the synopses, since they consume most of the memory. However, we need to discard these offsets cleverly, such that the accuracy of the SJ query remains high. We propose three load shedding strategies for SJ over stream time series.

First, for each synopsis $Syn_i$ of $T_i$ $(1 \le i \le m)$, we randomly evict $B/m$ start offsets. In particular, for the $k$th cell of the histogram in $Syn_i$ with frequency $freq_{ik}$, a total number of

$$\left\lceil \frac{B}{m} \cdot \frac{freq_{ik}}{\sum_{j=1}^{b} freq_{ij}} \right\rceil$$

start offsets are randomly selected and removed. This method can immediately free the memory when the memory is full, however, without considering any query accuracy issues. The expected ratio of the number of retrieved answers to that of the actual answers is $1 - \frac{B/m}{W - w + 1}$, where $(W - w + 1)$ is the number of start offsets in $Syn_i$.

The second approach always discards those start offsets that are expected to expire the earliest. In particular, we always evict offsets in the order $(t - W + 1), (t - W + 2), \ldots,$ and $(t - W + B/m)$. The rationale is that throwing away these start offsets would only result in inaccuracy of the SJ query within a short period in the near future. This method needs to find the start offsets one by one in the cells of the histogram, which incurs some processing cost before load shedding. However, the amortized SJ cost is the same as that without load shedding (since the cost of removing start offsets can be regarded as expiration); moreover, the query accuracy is high if load shedding occurs infrequently.

In contrast to the second strategy, the third one discards all start offsets in one cell at a time. In this case, we need to decide which cells to load-shed, in order to reduce the number of false dismissals and thus obtain SJ answers with high accuracy. Specifically, we formally model the score $Score(c_k)$ of a cell $c_k$ that can be shed using a probability:

$$Score(c_k) = Prob\left\{ \sum_{j=1}^{s} mindist^2(q_{ij}, c_k) > \varepsilon^2 \right\}, \qquad (22)$$

where $q_{ij}$ is the $j$th converted query point of the new subsequence from series $T_i$. Intuitively, if the cell $c_k$ is far away from all query points $q_{ij}$, the probability that data in $c_k$ are not in the SJ result is high, that is, the score $Score(c_k)$ is high. Without loss of generality, let $X_j$ $(1 \le j \le s)$ be a random variable following the distribution of the values $mindist^2(q_{ij}, c_k)$ in (22) over all series $T_i$ $(1 \le i \le m)$. Assume that $\mu_j$ and $\sigma_j^2$ are the mean and variance of variable $X_j$, respectively. According to the CLT, we have

$$Score(c_k) = 1 - \Phi\left( \frac{\varepsilon^2 - \sum_{j=1}^{s} \mu_j}{\sqrt{\sum_{j=1}^{s} \sigma_j^2}} \right), \qquad (23)$$

where $\Phi(x)$ is the CDF of the standard normal distribution. Note that, here, we use the statistics of the query points $q_{ij}$ at the current time stamp as those in the near future. This is reasonable, since query points at consecutive time stamps have close temporal correlations, that is, they tend to be close to each other. After estimating the scores with (23), our third strategy selects the cells with the highest scores to load-shed, since they are very unlikely to affect the SJ result in the near future; moreover, they may already have expired even if a range query covers the discarded data.
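To illustrate the third strategy, the sketch below estimates $Score(c_k)$ in (23) with the standard normal CDF $\Phi$ (implemented via the error function) and then sheds the highest-scoring cells. The per-window sample lists and the representation of a cell as a (samples, number-of-offsets) pair are illustrative assumptions for this sketch, not the paper's data structures.

import math
from statistics import mean, pvariance
from typing import List, Tuple

def normal_cdf(x: float) -> float:
    # Phi(x): CDF of the standard normal distribution.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def cell_score(mindist_sq: List[List[float]], eps: float) -> float:
    # mindist_sq[j] holds the values mindist^2(q_ij, c_k) over all series T_i,
    # i.e., samples of the random variable X_j; apply the CLT as in (23).
    mu = sum(mean(col) for col in mindist_sq)          # sum_j mu_j
    var = sum(pvariance(col) for col in mindist_sq)    # sum_j sigma_j^2
    if var == 0.0:                                     # degenerate case: no spread
        return 1.0 if mu > eps ** 2 else 0.0
    return 1.0 - normal_cdf((eps ** 2 - mu) / math.sqrt(var))

def shed_cells(cells: List[Tuple[List[List[float]], int]],
               eps: float, budget: int) -> List[Tuple[List[List[float]], int]]:
    # Evict whole cells in decreasing order of score until at least
    # `budget` start offsets have been freed.
    ranked = sorted(cells, key=lambda c: cell_score(c[0], eps), reverse=True)
    victims, freed = [], 0
    for c in ranked:
        if freed >= budget:
            break
        victims.append(c)
        freed += c[1]  # number of start offsets stored in this cell
    return victims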
6 EXPERIMENTAL EVALUATION
In this section, we illustrate through extensive experiments the efficiency and effectiveness of our proposed approach for the SJ query, using the ARES and Synopsis Pruning techniques (denoted by ARES+SP for brevity). Specifically, in our experiments, we tested both real and synthetic data sets, including sstock [30], randomwalk [15], sensor, and EEG [2].
Fig. 12. The parameter settings.
The sstock data set [30] contains the daily closing prices of 193 company stocks from late 1993 to early 1996; the randomwalk data set [15] is synthetically generated using the random walk model; the sensor data set contains temperature time series collected from 54 sensors deployed in the Intel Berkeley Research lab between 28 February and 5 April 2004, which is available online at http://db.csail.mit.edu:80/labdata/labdata.html; and the EEG data set includes intracranial electroencephalographic (EEG) time series recorded from epilepsy patients during the seizure-free interval, from inside and outside the seizure-generating area, which can be found at http://scitation.aip.org/getpdf/servlet/GetPDFServlet?filetype=pdf&id=PLEEE8000064000006061907000001&idtype=cvips.

In order to simulate the SJ scenario, for each of the four data sets we concatenate time series of small length into longer ones of length 204,800, whose data are assumed to arrive at the system continuously in streams (i.e., stream time series). The similarity threshold $\varepsilon$ in our stream SJ is set such that the query selectivity is around 0.1 percent (i.e., the size of the result set divided by the total number of possible pairs). Note that, for a specific application, the choice of $\varepsilon$ is determined by experienced data analysts in that domain.

We measure the performance of SJ in terms of the total time, which is the time the system needs to finish SJ processing over the new data at each time stamp. In particular, the total time is defined as the sum of the filtering time (i.e., the time to prune candidates) and the refinement time (i.e., (the number of candidates) $\times$ (the unit time to refine a candidate pair)).

In the sequel, we first evaluate the performance of SJ with FRM and with ARES+SP, which issue range queries with the same and with different radii, respectively. As a second step, we demonstrate the efficiency and effectiveness of SJ with synopsis pruning after ARES (i.e., ARES+SP), compared with SJ processing with VA+-Stream [19]. Then, we present in Section 6.3 the experimental results of SJ with batch processing. Finally, we compare in Section 6.4 the query performance of SJ under three different load shedding techniques, in terms of the query accuracy.

We conducted our experiments on a Pentium 4 3.2 GHz PC with 1 GB of memory. All experimental results are averaged over 50 runs. Fig. 12 summarizes the tested values of the parameters in the experiments, with their default values in bold font.
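As a simple illustration of the total-time metric defined above (a sketch; the argument names are ours):

def total_time_per_stamp(filtering_time: float, num_candidates: int,
                         unit_refine_time: float) -> float:
    # total time = filtering time + (number of candidates) * (unit refine time)
    return filtering_time + num_candidates * unit_refine_time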
6.1 Performance of FRM versus ARES+SP

In the first set of experiments, we compare the query performance of SJ using FRM with that using ARES+SP, under different values of the parameters w and n. In particular, we evaluate the performance with the total-time measure, the average time to process SJ over the new data at one time stamp.
Fig. 13. Performance of FRM and ARES+SP (versus w). (a) sstock, (b) randomwalk, (c) sensor, and (d) EEG.
For example, assuming that there are 1,000 stream time series, at each time stamp we obtain 1,000 new subsequences, each of which is joined with the 1,000 series; the processing time of all these joins constitutes the total time. Figs. 13a, 13b, 13c, and 13d illustrate the total time of SJ with FRM and with ARES+SP over both real and synthetic data sets (sstock, randomwalk, sensor, and EEG, respectively), where w = 16, 32, 64, 128, and n, a, b, W, and m are set to their default values (i.e., n = 256, a = 128, b = 64, W = 512, and m = 1,000). When the window size w increases, the total time of both approaches decreases. In the figures, we can see that SJ with ARES+SP always outperforms SJ with FRM by an order of magnitude. This is reasonable, since ARES is based on the formal cost model that minimizes the number of candidates; moreover, the synopsis pruning (SP) technique utilizes the synopses to further shrink the candidate set produced by ARES. Therefore, ARES+SP has far fewer candidates than FRM, which is confirmed by the number on each column in Figs. 13a, 13b, 13c, and 13d.
Fig. 14. Performance of FRM and ARES+SP (versus n). (a) sstock, (b) randomwalk, (c) sensor, and (d) EEG.
Figs. 14a, 14b, 14c, and 14d illustrate the performance of SJ with FRM and with ARES+SP over the data sets sstock, randomwalk, sensor, and EEG, respectively, varying the length n of the subsequences from 64 to 512, with the other parameters set to their default values. Note that, for a fair comparison, here we choose different values of the similarity threshold $\varepsilon$ with respect to n such that the SJ queries have the same selectivity (i.e., the output size divided by the maximum possible join size). Similar to the previous experiment, the total time of SJ with ARES+SP is lower than that of SJ with FRM by an order of magnitude, and SJ with ARES+SP has a much smaller candidate set than SJ with FRM. In the special case where n = w = 64, that is, when the entire query subsequence is considered as one window, ARES has the same candidate set as FRM, and the performance of ARES+SP is thus similar to that of FRM. However, as indicated by the number of candidates for ARES+SP, SP can still prune false positives from the candidate set.

Fig. 15. Performance of ARES+SP versus VA+-Stream. (a) sstock, (b) randomwalk, (c) sensor, and (d) EEG.

Fig. 16. Scalability of ARES+SP versus VA+-Stream. (a) sstock, (b) randomwalk, (c) sensor, and (d) EEG.
6.2 Performance of ARES+SP versus VA+-Stream

After confirming the efficiency and effectiveness of SJ with ARES+SP compared with FRM, as a second step, we also investigate the performance of SJ with ARES+SP and SJ with VA+-Stream [19]. Specifically, we build an index structure, VA+-Stream, over all subsequences of length n extracted from each stream time series. When a new data item arrives at a stream time series, we issue a similarity query on the index to retrieve all series similar to the new subsequence. Since each stream time series receives one new subsequence at every time stamp, the total time is obtained by summing up the time of the m similarity searches in VA+-Stream.

As illustrated in Fig. 15, where m = 1,000 and n = 256, SJ processing with VA+-Stream requires as much as 30 seconds to handle the new incoming data of 1,000 series at a single time stamp. In contrast, our approach needs only a few seconds to process the pairwise join among the 1,000 stream time series. Specifically, Figs. 15a and 15b compare our approach with SJ with VA+-Stream, varying parameter a (i.e., the number of bits in the Bloom filters) from 32 to 512. Note that, since the experimental results over the four tested data sets are similar, in this and subsequent experiments we only present the results over two real/synthetic data sets, sstock and randomwalk, due to the space limit. When parameter a increases, the number of candidates after ARES+SP decreases. However, since a large a results in longer bit vectors, it incurs more computation cost for the bit operations during SJ processing.

Figs. 15c and 15d illustrate the performance of SJ with ARES+SP and with VA+-Stream over sstock and randomwalk, respectively, using different values of b. Since a large b indicates more accurate range queries and more Bloom filters, the number of candidates decreases; on the other hand, the total time increases due to the additional bit operations.

Next, we test the scalability of SJ with ARES+SP, compared with that of VA+-Stream. In particular, Figs. 16a, 16b and Figs. 16c, 16d illustrate the experimental results obtained by varying parameter W from 256 to 1,024 and m from 200 to 2,000, respectively. SJ with ARES+SP always outperforms SJ with VA+-Stream by orders of magnitude in terms of the total time, which indicates the robustness of ARES+SP with respect to these parameters.
6.3 Performance of Batch Processing

Next, we demonstrate the performance of batch SJ processing, compared with single SJ processing. Recall that, for batch SJ, instead of searching for similar subsequences for the t new subsequences one by one, batch processing performs the searches for groups of subsequences, looking up the synopsis only once for each group. Fig. 17 illustrates the results of SJ with ARES+SP+batch and with ARES+SP, where the total time of SJ batch processing is defined as the amortized processing time per time stamp, and the other parameters are set to their default values. Since ARES+SP+batch saves the cost of pruning with the synopsis for individual query subsequences multiple times, it requires much less total time than ARES+SP.

6.4 Performance of Load Shedding

Finally, we illustrate the experimental results of SJ with load shedding.
Fig. 17. Performance of ARES+SP+batch versus ARES+SP (versus t). (a) sstock and (b) randomwalk.

Specifically, we test the three methods proposed in Section 5.6. The first approach randomly selects $B/m$ start offsets in each synopsis $Syn_i$ from stream time series $T_i$; the second one evicts the $B/m$ earliest start offsets in each series; and the last method discards all start offsets in the cells with the highest scores, that is, those that are unlikely to contribute to the SJ result. Fig. 18a illustrates the accuracy of the SJ result within $B/m$ time stamps after load shedding over sstock. In particular, we vary the load shedding ratio, defined as the percentage of data (start offsets) discarded in each stream time series, from 10 to 50 percent, and measure the accuracy of SJ by the percentage of false dismissals in the final SJ result. We find that the query accuracy of the first approach is very sensitive to the amount of discarded data, whereas the second one performs better and the last one is always the best. Considering the shedding time (i.e., the time to discard the data), however, as illustrated in Fig. 18b, the first method requires the least time among the three. The second one takes a shedding time between those of the first and third methods when less than 40 percent of the data are discarded; when more than 40 percent are evicted, the second method requires the most time of all, since searching for specific offsets is very costly. The results on randomwalk are similar and are omitted due to the space limit.
Fig. 18. Performance of load shedding techniques (sstock). (a) Query accuracy. (b) Shedding time.
7 CONCLUSIONS
SJ in time-series databases plays an important role in many applications. In this paper, we propose an efficient and effective approach to incrementally perform SJ over multiple stream time series. Specifically, we propose a novel approach, ARES, which is based on a formal cost model that minimizes the resulting number of SJ candidates and thus adapts to the stream data. We then integrate ARES seamlessly into SJ processing and utilize space-efficient synopses, constructed over subsequences from the stream time series, to further prune candidate pairs. Batch processing and load shedding techniques are also discussed. Extensive experiments have demonstrated the efficiency and effectiveness of our proposed approach for answering SJ queries over multiple stream time series.
ACKNOWLEDGMENTS This work was supported by Hong Kong RGC Grants under Project 611608, the National Grand Fundamental Research 973 Program of China under Grant 2006CB303000, the NSFC Key Project Grant 60736013, and the NSFC Project Grant 60763001.
REFERENCES

[1] R. Agrawal, C. Faloutsos, and A.N. Swami, "Efficient Similarity Search in Sequence Databases," Proc. Fourth Int'l Conf. Foundations of Data Organization and Algorithms (FODO), 1993.
[2] R.G. Andrzejak, K. Lehnertz, C. Rieke, F. Mormann, P. David, and C.E. Elger, "Indications of Nonlinear Deterministic and Finite Dimensional Structures in Time Series of Brain Electrical Activity: Dependence on Recording Region and Brain State," Physical Rev., vol. 64, no. 6, pp. 061907-1-061907-8, 2001.
[3] S. Berchtold, C. Böhm, D.A. Keim, and H.-P. Kriegel, "A Cost Model for Nearest Neighbor Search in High-Dimensional Data Space," Proc. 16th ACM SIGACT-SIGMOD-SIGART Symp. Principles of Database Systems (PODS), 1997.
[4] S. Berchtold, D.A. Keim, and H.-P. Kriegel, "The X-Tree: An Index Structure for High-Dimensional Data," Proc. 22nd Int'l Conf. Very Large Data Bases (VLDB), 1996.
[5] D.J. Berndt and J. Clifford, "Finding Patterns in Time Series: A Dynamic Programming Approach," Advances in Knowledge Discovery and Data Mining, Am. Assoc. for Artificial Intelligence, 1996.
[6] T. Brinkhoff, H.-P. Kriegel, and B. Seeger, "Efficient Processing of Spatial Joins Using R-Trees," Proc. ACM SIGMOD, 1993.
[7] A. Bulut and A.K. Singh, "A Unified Framework for Monitoring Data Streams in Real Time," Proc. 21st Int'l Conf. Data Eng. (ICDE), 2005.
[8] Y. Cai and R. Ng, "Indexing Spatio-Temporal Trajectories with Chebyshev Polynomials," Proc. ACM SIGMOD, 2004.
[9] K.P. Chan and A.W.-C. Fu, "Efficient Time Series Matching by Wavelets," Proc. 15th Int'l Conf. Data Eng. (ICDE), 1999.
[10] L. Chen and R. Ng, "On the Marriage of Edit Distance and Lp Norms," Proc. 30th Int'l Conf. Very Large Data Bases (VLDB), 2004.
[11] Q. Chen, L. Chen, X. Lian, Y. Liu, and J.X. Yu, "Indexable PLA for Efficient Similarity Search," Proc. 33rd Int'l Conf. Very Large Data Bases (VLDB), 2007.
[12] C. Cranor, T. Johnson, and O. Spatscheck, "Gigascope: A Stream Database for Network Applications," Proc. ACM SIGMOD, 2003.
[13] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos, "Fast Subsequence Matching in Time-Series Databases," Proc. ACM SIGMOD, 1994.
[14] A. Guttman, "R-Trees: A Dynamic Index Structure for Spatial Searching," Proc. ACM SIGMOD, 1984.
[15] E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra, "Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases," Proc. ACM SIGMOD, 2001.
[16] M. Kontaki and A. Papadopoulos, "Efficient Similarity Search in Streaming Time Sequences," Proc. 16th IEEE Conf. Scientific and Statistical Database Management (SSDBM), 2004.
[17] F. Korn, H. Jagadish, and C. Faloutsos, "Efficiently Supporting Ad Hoc Queries in Large Datasets of Time Sequences," Proc. ACM SIGMOD, 1997.
[18] X. Lian, L. Chen, X. Yu, G.R. Wang, and G. Yu, "Similarity Match over High Speed Time-Series Streams," Proc. 23rd Int'l Conf. Data Eng. (ICDE), 2007.
[19] X. Liu and H. Ferhatosmanoglu, "Efficient k-NN Search on Streaming Data Series," Proc. Symp. Spatial and Temporal Databases (SSTD), 2003.
[20] M.L. Lo and C.V. Ravishankar, "Spatial Hash-Joins," Proc. ACM SIGMOD, 1996.
[21] S. Michel, P. Triantafillou, and G. Weikum, "KLEE: A Framework for Distributed Top-k Query Algorithms," Proc. 31st Int'l Conf. Very Large Data Bases (VLDB), 2005.
[22] M.F. Mokbel, M. Lu, and W.G. Aref, "Hash-Merge Join: A Non-Blocking Join Algorithm for Producing Fast and Early Join Results," Proc. 20th Int'l Conf. Data Eng. (ICDE), 2004.
[23] Y.S. Moon, K.Y. Whang, and W.S. Han, "GeneralMatch: A Subsequence Matching Method in Time-Series Databases Based on Generalized Windows," Proc. ACM SIGMOD, 2002.
[24] Y.S. Moon, K.Y. Whang, and W.K. Loh, "Duality-Based Subsequence Matching in Time-Series Databases," Proc. 17th Int'l Conf. Data Eng. (ICDE), 2001.
[25] Y. Sakurai, C. Faloutsos, and M. Yamamuro, "Stream Monitoring under the Time Warping Distance," Proc. 23rd Int'l Conf. Data Eng. (ICDE), 2007.
[26] S. Subramaniam, T. Palpanas, D. Papadopoulos, V. Kalogeraki, and D. Gunopulos, "Online Outlier Detection in Sensor Data Using Non-Parametric Models," Proc. 32nd Int'l Conf. Very Large Data Bases (VLDB), 2006.
[27] Y.F. Tao, M.L. Yiu, D. Papadias, M. Hadjieleftheriou, and N. Mamoulis, "RPJ: Producing Fast Join Results on Streams through Rate-Based Optimization," Proc. ACM SIGMOD, 2005.
[28] T. Urhan and M.J. Franklin, "XJoin: A Reactively-Scheduled Pipelined Join Operator," IEEE Data Eng. Bull., vol. 23, pp. 27-33, 2000.
[29] M. Vlachos, G. Kollios, and D. Gunopulos, "Discovering Similar Multidimensional Trajectories," Proc. 18th Int'l Conf. Data Eng. (ICDE), 2002.
[30] C.Z. Wang and X. Wang, "Supporting Content-Based Searches on Time Series via Approximation," Proc. 12th Int'l Conf. Scientific and Statistical Database Management (SSDBM), 2000.
[31] E.W. Weisstein, "Central Limit Theorem," http://mathworld.wolfram.com/CentralLimitTheorem.html, 2009.
[32] D.A. White and R. Jain, "Similarity Indexing with the SS-Tree," Proc. 12th Int'l Conf. Data Eng. (ICDE), 1996.
[33] Y.W. Huang, N. Jing, and E.A. Rundensteiner, "Spatial Joins Using R-Trees: Breadth-First Traversal with Global Optimizations," Proc. 23rd Int'l Conf. Very Large Data Bases (VLDB), 1997.
[34] B.-K. Yi and C. Faloutsos, "Fast Time Sequence Indexing for Arbitrary Lp Norms," Proc. 26th Int'l Conf. Very Large Data Bases (VLDB), 2000.
[35] Y. Zhu and D. Shasha, "StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time," Proc. 28th Int'l Conf. Very Large Data Bases (VLDB), 2002.
[36] Y. Zhu and D. Shasha, "Efficient Elastic Burst Detection in Data Streams," Proc. ACM SIGKDD, 2003.
Xiang Lian received the BS degree from the Department of Computer Science and Technology, Nanjing University, in 2003. He is currently working toward the PhD degree in the Department of Computer Science and Engineering, Hong Kong University of Science and Technology. His research interests include query processing over stream time series and uncertain databases. He is a student member of the IEEE.
Lei Chen received the BS degree in computer science and engineering from Tianjin University, China, in 1994, the MA degree from the Asian Institute of Technology, Thailand, in 1997, and the PhD degree in computer science from the University of Waterloo, Canada, in 2005. He is now an assistant professor in the Department of Computer Science and Engineering at Hong Kong University of Science and Technology. His research interests include uncertain databases, graph databases, multimedia and time series databases, and sensor and peer-to-peer databases. He is a member of the IEEE.