Predicting Density-Based Spatial Clusters Over Time

Chih Lai                    Nga T. Nguyen
Graduate Programs in Software Engineering
University of St. Thomas
St. Paul, MN 55125
[email protected]    [email protected]

Abstract

Most existing clustering algorithms are designed to discover snapshot clusters that reflect only the current status of a database. Snapshot clusters do not reveal the fact that clusters may either persist over a period of time or slowly fade away as other clusters gradually develop. Predicting dynamic cluster evolutions and their occurring periods is important because this information can guide users to prepare appropriate actions toward the right areas at the right time for the most effective results. In this paper we first develop a simple but effective approach for predicting the future distance between object pairs. Object pairs that will be close in distance over different periods of time are preserved for the clustering process. Our clustering algorithms then process these object pairs and their periods to discover density-based clusters that may occur or change over time.

1. Introduction

Clustering is the process of grouping data into clusters so that objects within a cluster have high similarity or close distance [1][2][3][5][8]. However, most existing clustering algorithms are designed to discover snapshot clusters that reflect only the current status of a database. Such snapshot clusters conceal the fact that some clusters may persist over periods of time, while other clusters may slowly fade away as new clusters gradually develop. If we can compute the change rates of individual objects from past data and consider these change rates in the clustering process, clusters that may expand, shrink, disappear, emerge, or remain unchanged over time can be predicted.

Predicting these cluster changes over time has many important applications. For example, if we can predict the periods and areas of future concentration of incoming missiles or enemies, preemptive actions, such as firing interceptor weapons at the predicted areas in a specific time window, can be prepared and executed in advance for the most effective results. Similarly, if we analyze historical eating/drinking habits or exercise records, we may be able to predict the spread of particular diseases in certain periods. Early

preventive actions can then be taken to lower the risk. Other applications include predicting periods and areas of congestion for air traffic control or city growth planning.

One solution for discovering future dense clusters is to repeatedly execute the clustering algorithms discussed in [2][3] on extrapolated data at regular intervals. Unfortunately, if the selected intervals are not small enough, many cluster changes over time may not be detected. Although executing more frequent clustering at shorter intervals can alleviate this problem at the expense of high computation cost, no interval is short enough to cover the infinite number of time points on a timeline. Moreover, if no future cluster is detected by this interval-checking approach, repeatedly shortening the interval may simply waste more system resources, because some databases may never develop any future cluster.

It is our goal in this paper to develop a simple but effective approach for predicting density-based clusters over time. We first modify our previous work [6] on air traffic control and develop a simple formula that checks which paired objects will be in the ε−neighborhood [2] of each other. From these pair-wise ε−neighborhood relationships, our first algorithm can then calculate which objects will become Core Objects Over Time (COOTs), during what periods, and in which areas. COOTs tell users where, when, and for how long concentrations may happen so that effective actions can be taken toward the right places at the right time. If detailed cluster contents and their periods are needed, our second algorithm can also produce Clusters Over Time (COTs) from these pair-wise ε−neighborhood relationships. Each COT contains two major items: a set of clusters with their constituent objects, and a time period in which the COT remains unchanged.

On the temporal domain, although some applications may only need to predict COOTs or COTs in the future, our approach is general enough to predict COOTs and COTs that may happen anywhere on a timeline, i.e. both in the future and/or in the past (generating COOTs and COTs in the past can be used to verify the correctness of models that are under study). That is why we use the general term "over time" for our algorithms, instead of a specific term such as "future". In fact, our approach also offers an option that allows users to provide a

Specified Prediction Window (SPW) for predicting COOTs and COTs (e.g. from the current time to the next two hours). An SPW not only reduces unnecessary computation and space by filtering out uninteresting ε−neighborhood relationships in unwanted periods, it also allows users to guide the prediction process to produce more focused results in the periods where they are able to react to the cluster changes. On the functional domain, our notions and algorithms can be applied either to 2D/3D spatial applications or, more generally, to high-dimensional feature spaces.

In Section 2, we review the concept of density-based clusters and related work. Section 3 gives the formal definitions for Core Objects Over Time (COOTs) and Clusters Over Time (COTs). In Section 4.1, we first develop a simple formula for predicting the ε−neighborhood of each object over time. Sections 4.2 and 4.3 discuss algorithms for discovering COOTs and COTs. In Section 5, we analyze the time and space complexity of our algorithms and present a series of experiments. Section 6 concludes our study.

2. Related Work

Ester et al. [2] proposed a density-based clustering concept and algorithm for discovering clusters in spatial databases. A point p is said to be in the ε−neighborhood of another point q if the distance between p and q is less than or equal to a parameter ε. The ε−neighborhood of p is denoted as Nε(p) = {q ∈ D | dist(p, q) ≤ ε}, where D is a database and dist() is a distance function. A point p is a core object if there are at least MinPts points in the ε−neighborhood of p (i.e. |Nε(p)| ≥ MinPts). A point q is directly density-reachable from a core object p if q ∈ Nε(p) and |Nε(p)| ≥ MinPts. A point q is density-reachable from another point p wrt. ε and MinPts if there is a chain of objects p1, …, pn such that p1 = p, pn = q, and pi+1 is directly density-reachable from pi. Two points p and q are density-connected if there is a point o such that p and q are both density-reachable from o wrt. ε and MinPts. A density-based cluster C can then be defined as a set such that ∀p, q ∈ C: p is density-connected to q wrt. ε and MinPts. An algorithm, DBSCAN [2], is then developed to collect all the points that are density-connected into a cluster. Objects that are not part of any cluster are considered to be noise.

Ester et al. also proposed another algorithm, Incremental DBSCAN [3], to handle clustering in dynamic databases. Since the insertion or deletion of an object p in a database affects only the clusters in the neighborhood of p, the major task of the algorithm is to identify the present clusters that are affected by the update. Objects in these affected clusters are then put into two updated seed sets for insertion and deletion. Based on these seed sets, the

algorithm will then update the affected clusters or create new clusters.

Lazarevic et al. [5] introduced a hierarchical clustering approach for predicting multiple continuous variables in a high-dimensional, heterogeneous database environment. The technique consists of three phases: partitioning, localization, and prediction. The partitioning and localization phases are executed recursively to hierarchically group similar variables into partitions. When the number of variables per partition is sufficiently small, the prediction phase takes place. Variables in each partition are predicted using a localized prediction model built from the objects assigned to that particular partition. The major focus of this study is predicting the values of variables, not clustering objects.

Wolfson et al. [7] discussed issues and solutions for modeling moving objects in a database environment. The major goal of this study is to avoid continuous location updates for each moving object. Hence, each object's location is represented as a function of time, and this function is updated only when its prediction error exceeds a certain threshold. Clustering objects over time is not discussed in this research.

3. Problem Definition

In order to predict cluster changes over time, we first need to predict the periods of the ε−neighborhood relationships among objects. Core objects and their presence periods can then be identified from these ε−neighborhood relationships. Once the core objects and their presence periods are identified, objects that are density-connected to the core objects during these periods can be grouped into clusters that exist over certain periods of time. In this section, we extend the definitions for ordinary core objects and density-based clusters to incorporate timing information.

Let D = {O1, O2, …, On} be a database that contains n objects at the current time. Let each object Oi (1 ≤ i ≤ n) have m ≥ 1 attributes; the status of Oi can then be denoted as an m-tuple Oi = <oi,1, oi,2, …, oi,m>. We use another m-tuple Vi = <vi,1, vi,2, …, vi,m> to denote the rates of change (velocity) of the object Oi, where vi,j is the rate of change of the jth attribute of Oi. Hence, the status (position) of Oi at any time T will be referred to as OiT, and it can be expressed as a function of time as: OiT = <oi,1 + vi,1×T, oi,2 + vi,2×T, …, oi,m + vi,m×T>.

Definition 1. An object Oi will become a Core Object Over Time (COOT) in the period [j, k] if |Nε(Oi)| ≥ MinPts is true throughout [j, k]. We denote such a core object over time as COOTi,j,k.

If Oi is identified as a COOT during the period [j, k], that means COOTi,j,k (or Oi) will move from Oij to Oik in

the period [j, k]. Note that the period of a COOTi,j,k is always a closed period: it includes both the begin and the end time points j and k. This may not be the case for COTs, which we define shortly. Since different objects may move in or out of the ε−neighborhood of Oi over time, and Oi itself is moving, Oi may become a COOT in different time periods. To avoid reporting the same object as a COOT multiple times in overlapping periods, we require that any COOTi,j,k satisfy the following condition: ∀COOTi,j',k', if [j, k] ∩ [j', k'] ≠ ∅, then j = j' and k = k'.

Example 1. Assume we have 4 objects (O1 to O4). Let O2 to O4 be in the ε−neighborhood of O1 in different time periods, as shown in Figure 1(a). With MinPts = 3, O1 will become a COOT in two different periods: [T1, T2] and [T3, T4]. If another object O5 is also in the ε−neighborhood of O1 during the period [T2, T3], then O1 will become a COOT in only one period: [T1, T4] (i.e. COOT1,T1,T4).

Figure 1. ε−neighborhoods in different periods (MinPts = 3). Panels (a) and (b) show timelines, between T1 and T4, of the pair-wise ε−neighborhood relationships (O1/O2, O1/O3, O1/O4, O1/O5 in panel (a); O1/O2, O1/O3, O4/O5, O4/O6 in panel (b)).
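To make Definition 1 concrete, here is a minimal Python sketch (our own illustration, not code from the paper; the helper names and the sample objects are assumptions). It evaluates the linear motion model OiT = <oi,1 + vi,1×T, …, oi,m + vi,m×T> and counts how many other objects fall within ε of Oi at a time T, which is the |Nε(Oi)| ≥ MinPts test behind a COOT.

import math

def position_at(o, v, T):
    """Linear motion model: OiT = <oi,1 + vi,1*T, ..., oi,m + vi,m*T>."""
    return [ok + vk * T for ok, vk in zip(o, v)]

def eps_neighbor_count(i, objects, velocities, T, eps):
    """|N_eps(Oi)| at time T: number of other objects within eps of Oi.
    Following the Begin()/End() counter logic of Section 4.2, Oi itself
    is not counted."""
    pi = position_at(objects[i], velocities[i], T)
    count = 0
    for j, (o, v) in enumerate(zip(objects, velocities)):
        if j == i:
            continue
        if math.dist(pi, position_at(o, v, T)) <= eps:
            count += 1
    return count

# Hypothetical data: Oi is acting as a core object at time T if the
# count reaches MinPts (here MinPts = 3, as in Figure 1).
objects    = [(0.0, 0.0), (1.0, 0.5), (0.5, -0.5), (-0.5, 0.5), (9.0, 9.0)]
velocities = [(1.0, 1.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0), (0.0, 0.0)]
print(eps_neighbor_count(0, objects, velocities, T=2.0, eps=2.0) >= 3)   # True

Sweeping T over a period and requiring that this test hold throughout would correspond to the [j, k] interval of a COOTi,j,k; Section 4 computes these intervals analytically rather than by sampling.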

At any time T when new COOTs emerge, or when the objects within the ε−neighborhood of present COOTs change, we need to report a set of Clusters Over Time (COT) that stays unchanged (wrt. its constituent objects) until just before T, so that all the different clusters over time can be captured. In other words, we must first report an existing COT before we can update its contents to reflect the changes that occur at time T. For example, in Figure 1(a), when O1 and O5 move into the ε−neighborhood of each other at T2, a COT that contains O1, O2, and O3 must be reported before we can update the COT to include O5.

We denote a COT reported for the period between q and r as COTp,q,r. That is, q and r specify the time segment in which the constituent objects of COTp,q,r remain unchanged, and p ∈ {1, 2, 3, 4} indicates the openness of the time segment at times q and r. The openness of this time segment can be one of four situations: [q, r], [q, r), (q, r], or (q, r), denoted by 1, 2, 3, or 4, respectively. Note that a bracket indicates a closed time point and a parenthesis indicates an open time point. The openness of a COT period is determined by the previous and the current ε−neighborhood relationship events, as given in Table 1. If multiple ε−neighborhood relationships begin and end at the same time point (e.g. T2 or T3 of Figure 1(a)), the begin events are processed before the end events so that a surge of concentration can also be captured.

Table 1. Openness rules for a COT's period.

                                   Previous ε−Relationship Event
  Current ε−Relationship Event     Begin            End
  Begin                            p=2, [q, r)      p=4, (q, r)
  End                              p=1, [q, r]      p=3, (q, r]

For example, in Figure 1(a), the first COT to be reported, at T2, is COT2,T1,T2 (with O1, O2, O3 in one cluster), and its period is [T1, T2). This is because the previously processed events at T1 are relationship-begin events (O1/O2 and O1/O3) and the current event at T2 is also a relationship-begin event (O1/O5). The time period is open at T2 because COT2,T1,T2 remains unchanged only until just before T2; O5 changes the cluster content at T2. The second COT reported at T2 is COT1,T2,T2 (with O1, O2, O3, O5) because the previously processed event is a begin event (O1/O5) and the current event is an end event (O1/O2). Similarly, the third COT, reported at time T3, is COT4,T2,T3 (with O1, O3, O5). This is because the previous event at T2 is an end event (O1/O2) and the current event at T3 is a begin event (O1/O4). The last two COTs in Figure 1(a) are COT1,T3,T3 (with O1, O3, O4, O5) and COT3,T3,T4 (with O1, O3, O4).

Since COTs can develop or change only when COOTs emerge or change, every cluster included in COTp,q,r must satisfy the following condition: it contains only the objects that are density-reachable from some COOTa,b,c such that b ≤ q ≤ r ≤ c. In other words, the period of a COT must be covered by the period of at least one COOT. Letting n / MinPts be the maximum number of clusters that can develop in D at any time, we have the following formal definition:

Definition 2. COTp,q,r = {Cx | x ≥ 1}, where Cx ⊆ D is a cluster and |COTp,q,r| ≤ n / MinPts. ∀Oi, Oj ∈ Cx, ∃COOTa,b,c such that at any time T with b ≤ q ≤ T ≤ r ≤ c, OiT and OjT are density-reachable from OaT.

As an example, consider Figure 1(b). We have two COOTs, COOT1,T1,T3 and COOT4,T2,T4, and three COTs: COT2,T1,T2 = {{O1, O2, O3}}, COT1,T2,T3 = {{O1, O2, O3}, {O4, O5, O6}}, and COT3,T3,T4 = {{O4, O5, O6}}. The periods of the three COTs are all covered by the periods of the two COOTs.
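A minimal Python sketch of the Table 1 lookup (our own illustration; the event labels and function name are assumptions, not code from the paper): given the previous and the current ε−neighborhood event types, it returns the openness code p and the formatted COT period.

# Table 1 as a lookup: (current_event, previous_event) -> (p, brackets)
OPENNESS = {
    ("begin", "begin"): (2, "[", ")"),   # [q, r)
    ("begin", "end"):   (4, "(", ")"),   # (q, r)
    ("end",   "begin"): (1, "[", "]"),   # [q, r]
    ("end",   "end"):   (3, "(", "]"),   # (q, r]
}

def cot_period(previous_event, current_event, q, r):
    """Format a COT period according to the openness rules of Table 1."""
    p, left, right = OPENNESS[(current_event, previous_event)]
    return p, f"{left}{q}, {r}{right}"

# Example from Figure 1(a): begin events at T1 followed by a begin event
# at T2 yield COT2,T1,T2 with the half-open period [T1, T2).
print(cot_period("begin", "begin", "T1", "T2"))   # (2, '[T1, T2)')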

4. Algorithms

In Section 4.1 we develop efficient formulas for predicting which objects will be in the ε−neighborhood of each other over time. From these ε−neighborhood relationships, an algorithm discussed in Section 4.2 can

calculate which objects will become COOTs during what periods. If detailed cluster contents and their periods are needed, another algorithm discussed in Section 4.3 can be used to produce COTs. Section 4.4 discusses a simple approach to further reduce time/space complexity, and produce more focused COOT and COT predictions.

4.1. Predicting ε−Neighborhood Over Time

In order to construct clusters that may happen over time, we first need to identify objects that may become core objects over time. To obtain this information, we need to know the distance between each distinct pair of objects over time. These distance measurements can then be compared against ε to determine which objects are within the ε−neighborhood of each other. More specifically, we derive efficient formulas in this section to answer the following two questions:

1. Will any two different objects Oi and Oj be within the ε−neighborhood of each other over time?
2. In what period will the ε−neighborhood relationship between Oi and Oj exist?

We first define the set of all distinct pairs of objects in a database D as DP = {<Oi, Oj> | ∀1 ≤ i, j ≤ n, Oi, Oj ∈ D, i ≠ j and i < j}. As defined in the previous section, at any time T the status (position) of an object Oi with m attributes can be expressed as a function of time: OiT = <oi,1 + vi,1×T, …, oi,m + vi,m×T>. The Euclidean distance E between any paired objects <Oi, Oj> ∈ DP at any time T can then be computed as:

E² = Σk=1..m ((oi,k + vi,k×T) − (oj,k + vj,k×T))²

Letting ∆ok = oi,k − oj,k and ∆vk = vi,k − vj,k, we have:

E² = Σk=1..m (∆ok + ∆vk×T)², or

E² = T²×(∆v1² + … + ∆vm²) + 2×T×(∆o1×∆v1 + … + ∆om×∆vm) + (∆o1² + … + ∆om²)    (1)

If E ≤ ε (or E² ≤ ε²), Oi and Oj are within the ε−neighborhood of each other. The difference between E² and ε² at any time T can be expressed as a function fij(T) = E² − ε². More precisely, this difference function fij(T) describes the ε−neighborhood relationship between Oi and Oj over time. Substituting E² in fij(T) with the right-hand side of equation (1), we obtain equation (2):

T²×(∆v1² + … + ∆vm²) + 2×T×(∆o1×∆v1 + … + ∆om×∆vm) + (∆o1² + … + ∆om²) − ε² = fij(T)    (2)

That is, if Oi and Oj evolve according to their change rates, their ε−neighborhood relationship will begin and end at the times when fij(T) = E² − ε² = 0, or simply E = ε. Let A = (∆v1² + … + ∆vm²), B = 2×(∆o1×∆v1 + … + ∆om×∆vm), and C = e² − ε², where e = (∆o1² + … + ∆om²)^(1/2) represents the distance between Oi and Oj at the current time. Equation (2) can then be rewritten as equation (3):

A×T² + B×T + C = fij(T)    (3)

It follows that this ε−neighborhood relationship function can be analyzed as a parabola. More precisely, the existence of an ε−neighborhood relationship between Oi and Oj can be easily tested by equation (4):

B² − 4×A×C ≥ 0    (4)

If B² − 4×A×C ≥ 0, the parabola intersects the fij(T) = 0 axis. That is, E ≤ ε (or E² ≤ ε²) will be true somewhere on the timeline, indicating the existence of an ε−neighborhood relationship between Oi and Oj over time. If this relationship exists, it will occur in the period defined by one of the following equations:

T = (−B ± (B² − 4×A×C)^(1/2)) / (2×A)    if A ≠ 0    (5)
−∞ to ∞    if A = 0 and C = e² − ε² ≤ 0    (6)

When Oi and Oj have the same change rates on all m attributes, A (and B) will be zero and equation (3) has no root. In this case, if the current distance between Oi and Oj is not greater than ε (i.e. C = e² − ε² ≤ 0, or e² ≤ ε²), the ε−neighborhood relationship between Oi and Oj persists forever, from −∞ to ∞ (equation 6). However, if A = 0 and the current distance between Oi and Oj is greater than ε (i.e. C = e² − ε² > 0), no ε−neighborhood relationship between Oi and Oj will ever occur.

Example 2. Let ε be set to 1 and let a database D contain three objects {O1, O2, O3}, each with two attributes. Their current statuses O1, O2, O3 and their change rates V1, V2, V3 are as plotted in Figure 2, which shows the positions and update directions of the three objects. Although the moving path of O1 intersects the moving paths of O2 and O3, this does not mean that O2 and O3 will both be in the ε−neighborhood of O1 over time. In fact, due to their relative positions and relative change rates, O2 will never be in the ε−neighborhood of O1. The ε−neighborhood relationships among these three objects over time can be studied via fij(T).

Figure 2. Object moving paths in Example 2 (positions and movement directions of O1, O2, and O3 on a two-dimensional grid).

Based on equation (3), the ε−neighborhood relationship between O1 and O2 over time can be described as f1,2(T) = 2T² − 12T + 25. Similarly, the ε−neighborhood relationships between O1 and O3, and between O2 and O3, can be expressed as f1,3(T) = 2T² − 20T + 49 and f2,3(T) = 15, respectively. The curves for f1,2(T), f1,3(T), and f2,3(T) are shown in Figure 3. Since only f1,3(T) has B² − 4×A×C ≥ 0, we know that only O1 and O3 will be in the ε−neighborhood of each other. This can also be observed in Figure 3, where only f1,3(T) reaches E ≤ ε, i.e. E² − ε² ≤ 0. Moreover, the period of this ε−neighborhood relationship between O1 and O3 can be computed by equation (5) as [4.29289, 5.70711]. Note that O2 and O3 will never be in the ε−neighborhood of each other. In fact, their distance remains the same over time because O2 and O3 have the same change rate. Hence, the difference between their squared distance and ε² also remains constant: e² − ε² = 4² − 1² = 15 > 0.

Figure 3. Parabola curves representing the ε−neighborhood relationships over time: f1,2(T), f1,3(T), and f2,3(T) plotted against time, where only f1,3(T) drops below the line f(T) = 0.
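The test in equation (4) and the interval in equations (5)–(6) translate directly into a few lines of code. The Python sketch below is our own illustration (the function name and the sample coordinates are assumptions): it computes A, B, and C from two objects' current attributes and change rates, and returns the period of their ε−neighborhood relationship, if any. The sample input is one configuration that reproduces the coefficients of f1,3(T) = 2T² − 20T + 49 from Example 2, so the returned interval matches [4.29289, 5.70711]; the paper's actual coordinates for Example 2 are shown only in Figure 2.

import math

def eps_relationship_period(oi, vi, oj, vj, eps):
    """Return the time interval in which Oi and Oj are within eps of each
    other, following equations (3)-(6): None if the relationship never
    occurs, (-inf, +inf) if it holds forever, otherwise the two roots of
    f_ij(T)."""
    do = [a - b for a, b in zip(oi, oj)]          # delta o_k
    dv = [a - b for a, b in zip(vi, vj)]          # delta v_k
    A = sum(d * d for d in dv)
    B = 2 * sum(o * v for o, v in zip(do, dv))
    C = sum(d * d for d in do) - eps * eps        # e^2 - eps^2
    if A == 0:                                    # same change rates
        return (-math.inf, math.inf) if C <= 0 else None
    disc = B * B - 4 * A * C                      # equation (4)
    if disc < 0:
        return None
    root = math.sqrt(disc)
    return ((-B - root) / (2 * A), (-B + root) / (2 * A))   # equation (5)

# One configuration consistent with f1,3(T) in Example 2 (A=2, B=-20, C=49):
print(eps_relationship_period((0, 0), (1, 1), (5, 5), (0, 0), eps=1))
# -> approximately (4.29289, 5.70711)

Because A is a sum of squares it is never negative, so the two roots, when they exist, already delimit the relationship period in ascending order.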

A weighted distance [4] can also be used with our approach to measure the distance between Oi and Oj:

W² = Σk=1..m wk × ((oi,k + vi,k×T) − (oj,k + vj,k×T))²

where wk is the weight assigned to each of the m variables according to its perceived importance. If the weighted distance is used, the coefficients A, B, and C in equation (3) are replaced by the following: A = (w1×∆v1² + … + wm×∆vm²), B = 2×(w1×∆o1×∆v1 + … + wm×∆om×∆vm), and C = (w1×∆o1² + … + wm×∆om²) − ε².
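If the weighted distance is preferred, only the three coefficients change. A small sketch of that substitution (our own illustration; the weight values are arbitrary assumptions), taking the per-attribute differences ∆ok and ∆vk as inputs, computed as in eps_relationship_period() above:

def weighted_coefficients(do, dv, weights, eps):
    """Weighted versions of A, B, C for equation (3)."""
    A = sum(w * d * d for w, d in zip(weights, dv))
    B = 2 * sum(w * o * v for w, o, v in zip(weights, do, dv))
    C = sum(w * d * d for w, d in zip(weights, do)) - eps * eps
    return A, B, C

# Equal weights reduce to the unweighted case of the previous sketch:
print(weighted_coefficients((-5, -5), (1, 1), (1.0, 1.0), eps=1))   # (2, -20, 49)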

4.2. Finding Core Objects Over Time (COOTs)

In the previous section we discussed a simple but effective approach for checking the ε−neighborhood relationship between paired objects Oi and Oj over time. If an ε−neighborhood relationship between Oi and Oj never occurs, no further action is needed. If such a relationship exists between Oi and Oj, the period of this relationship, denoted [x, y], can be computed by equation (5) or (6). The start time x and the end time y of this period, along with the pair <Oi, Oj>, are inserted into a modified B+ tree for sorting and mining. We refer to this tree as a B+S (B+Segment) tree in our discussion.

A B+S tree has a structure similar to an ordinary B+ tree except for its leaf nodes. In a B+S tree, each entry on a leaf node is a three-tuple <T, BList, EList>. T is used as an indexed key, as in an ordinary B+ tree, and it represents a time point at which some ε−neighborhood relationships start or end. BList contains a list of paired objects (or a pointer to such a list) whose ε−neighborhood relationships start at time T. Similarly, EList contains a list of paired objects whose ε−neighborhood relationships end at T.

When inserting x into a B+S tree, the search and insertion methods of an ordinary B+ tree are used to locate the right leaf node for storing x. One of two scenarios may happen. The first scenario occurs if x is a new key value. In this case, one entry is created on the allocated leaf node of the B+S tree with the following settings: T is set to x, and the pair <Oi, Oj> is appended to BList. The second scenario occurs if an entry with T = x is found on a leaf node of the B+S tree. In this case, no new entry is created; instead, the pair <Oi, Oj> is appended to the BList of the found entry. When inserting the y value into a B+S tree, the same procedure takes place, except that EList is used instead of BList.

After the B+S tree is created, each entry on a leaf node represents the begin and end of the ε−neighborhood relationships between the paired objects held in its BList and EList. Moreover, these entries, from left to right, are sorted in ascending order wrt. T. Hence, if we scan these leaf node entries from left to right (referred to as the scanning process), we can count which objects will have at least MinPts objects within their ε−neighborhood over time. These objects are declared to be COOTs. More specifically, for each entry being scanned, we process BList before processing EList. For every pair <Oi, Oj> in BList, we perform the following Begin() procedure on both Oi and Oj.

Procedure Begin(Oi)
  Oi.Counter++;
  if (Oi.Counter ≥ MinPts) && (Oi.StartTime == Undefined) {
    Oi.StartTime = T;
  }

In Begin(), we first increment the attribute Counter of Oi by one, indicating that Oj will be moving into the ε−neighborhood of Oi at time T. Another attribute of Oi, StartTime, indicates the time at which Oi became a COOT. If Oi.StartTime is Undefined, Oi is currently not a COOT. If Oi.Counter ≥ MinPts is true and Oi is currently not a COOT, the value T of the current leaf node entry is assigned to Oi.StartTime, recording the start time at which Oi becomes a COOT. Note that Oi.Counter and Oi.StartTime are initialized to 0 and Undefined, respectively. For every pair <Oi, Oj> in EList, we perform the following End() procedure on Oi and Oj.

Procedure End(Oi)
  Oi.Counter--;
  if (Oi.Counter < MinPts) && (Oi.StartTime != Undefined) {
    Output Oi as a COOT in the period [Oi.StartTime, T];
    Oi.StartTime = Undefined;   // Oi is no longer a COOT
  }

In End(), we first decrement Oi.Counter by one, indicating that Oj will be moving out of the ε−neighborhood of Oi after time T. If Oi.Counter < MinPts is true and Oi was a COOT (i.e. Oi.StartTime != Undefined) before Oj moves away, Oi is reported as a COOT in the period [Oi.StartTime, T]. Oi.StartTime is then reset to Undefined, indicating that Oi is no longer a COOT after time T.

If Oi is identified as a COOT in the period [Oi.StartTime, T] by the above method, we know that at least MinPts objects will be concentrated around Oi from Oi.StartTime to T. The core of the concentration will be moving from OiStartTime to OiT, and the area of concentration will be anywhere within an ε radius of the core. This information can help users target appropriate actions toward the right places at the right time.
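The following Python sketch is our own end-to-end illustration of the COOT scan (it is not the paper's code): a dictionary of time points sorted in ascending order stands in for the leaf-node entries of the B+S tree, each entry carrying a BList and an EList, and the Begin()/End() counter logic above emits COOT periods. All names and the sample relationship periods are assumptions.

from collections import defaultdict

def find_coots(relationships, min_pts):
    """relationships: {(Oi, Oj): (x, y)} pair -> eps-neighborhood period.
    Returns a list of (object, start, end) COOT periods."""
    # Group pair events by time, as the leaf entries <T, BList, EList> would.
    entries = defaultdict(lambda: {"BList": [], "EList": []})
    for pair, (x, y) in relationships.items():
        entries[x]["BList"].append(pair)
        entries[y]["EList"].append(pair)

    counter = defaultdict(int)
    start_time = {}                     # object -> time it became a COOT
    coots = []
    for T in sorted(entries):           # scanning process, ascending in T
        entry = entries[T]
        for oi, oj in entry["BList"]:   # Begin() on both objects of the pair
            for o in (oi, oj):
                counter[o] += 1
                if counter[o] >= min_pts and o not in start_time:
                    start_time[o] = T
        for oi, oj in entry["EList"]:   # End() on both objects of the pair
            for o in (oi, oj):
                counter[o] -= 1
                if counter[o] < min_pts and o in start_time:
                    coots.append((o, start_time.pop(o), T))
    return coots

# Hypothetical relationship periods in the spirit of Figure 1(a):
rels = {("O1", "O2"): (1, 3), ("O1", "O3"): (1, 4), ("O1", "O4"): (2, 4)}
print(find_coots(rels, min_pts=3))      # -> [('O1', 2, 3)]

The sorted dictionary plays the role of the B+S tree's leaf level only for illustration; the tree itself is what keeps each insertion at O(logQ R), as analyzed in Section 5.1.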

4.3. Discovering Clusters Over Time (COTs)

To find COOTs, we simply count the number of objects that are within the ε−neighborhood of each other during the scanning process, as discussed in the previous section. If each object can additionally keep track of exactly which objects are within its ε−neighborhood during the scanning process, cluster contents over time can be constructed from this information. To achieve this goal, each object Oi ∈ D keeps an object list, OList, that is updated when the scanning process reads paired objects from BList and EList.

For every pair <Oi, Oj> read from the BList of a leaf node entry, Oj and Oi are inserted into the OList of Oi and Oj, respectively, indicating that Oi and Oj move into the ε−neighborhood of each other. However, while processing object pairs from BList, if an insertion event would change the contents of the present clusters, we have to output the present clusters before processing the insertion event so that all the different clusters over time can be captured. The content of the present clusters will change if the number of objects on the OList of an existing COOT increases, or if the number of objects on the OList of an object reaches MinPts. These two conditions can be tested together as ++Oi.Counter ≥ MinPts. Similarly, for every pair <Oi, Oj> read from EList, Oj and Oi are deleted from the OList of Oi and Oj, respectively. If a delete event would change the contents of the present clusters, we have to output those clusters before processing the delete event. The content of the present clusters will change if the number of objects on the OList of an existing COOT decreases (i.e. Oi.Counter-- ≥ MinPts).

To output the present clusters, we loop through all the objects in the database and search for COOTs that have not yet been included in any cluster. Each time such a COOT Oi is found, a new cluster is created to collect its members. It first collects Oi and all the objects Oj on the OList of Oi. This collection process is then executed recursively for every Oj that is itself a COOT and has not been included in another cluster. Finally, a time period is reported based on the rules given in Table 1.

In summary, the three major steps of the COT algorithm are: (1) scan all the leaf node entries of a B+S tree; (2) process the object pairs on the BList and EList of each entry and update the OList of the paired objects; (3) if needed, scan all objects and their OLists to report clusters. We use the following pseudo code to further explain the COT reporting process.

Procedure RptOneCOT(NewEvent, T)
  if ((LastRptTime, LastEvent) == (T, NewEvent)) { return; }
  ∀Oi ∈ D do {
    if (Oi.Counter ≥ MinPts) && (Oi.(LastRptTime, LastEvent) ≠ (T, NewEvent)) {
      Cluster_Members = ∅;                            // new cluster
      Oi.(LastRptTime, LastEvent) = (T, NewEvent);
      Cluster_Members.Add(Oi);
      // recursively collect objects directly reachable from Oi
      // output Cluster_Members in a user-defined format
    }
  }
  // report a time period as P = from LastRptTime to T
  // report the openness of P based on the rules given in Table 1
  (LastRptTime, LastEvent) = (T, NewEvent);

The parameters of the procedure RptOneCOT() are the event that triggers the output procedure and the time T of the leaf node entry from which the event (i.e. an object pair) is read. The event can be either an insertion or a deletion: an insertion indicates that RptOneCOT() was triggered by an object pair in BList, while a deletion indicates that it was triggered by an object pair in EList. Since each object pair in BList or EList may trigger RptOneCOT(), the same set of clusters might be output multiple times. To avoid duplicate outputs, RptOneCOT() first compares its last report time and last triggering event against the current requesting time and current event. If these two pairs of variables are the same, it returns immediately. Otherwise, the last report time and last event are updated with the current time and current event at the end of RptOneCOT(), indicating that no further output is needed while processing the rest of the object pairs on BList or EList. RptOneCOT() then loops through all n objects in D to find COOTs at time T. If an object Oi is a COOT, Oi is added to a new set, Cluster_Members, that will contain all the objects that are density-connected to Oi. To prevent Oi from being included in multiple clusters at time T, Oi's last report time and last triggering event are compared against the current requesting time and

current event, as discussed above. Since all the objects on the OList of Oi are directly reachable from Oi, these objects are also collected into Cluster_Members. If any of these objects is itself a COOT (i.e. it has at least MinPts objects on its own OList), the collection process is executed recursively to gather all the objects that are density-connected to Oi. After this recursive collection, RptOneCOT() outputs Cluster_Members in a user-defined format. RptOneCOT() also generates a time period, from the last COT report time to the current time, for the current COT.
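As an illustration of the recursive collection step (our own sketch, not the paper's code; the OList contents below are hypothetical), the following Python function expands a cluster from one COOT by following OList links: non-COOT neighbors are absorbed but not expanded further.

def collect_cluster(seed, olist, min_pts, assigned):
    """Collect all objects density-connected to the COOT `seed`.
    olist: {object: set of objects currently within its eps-neighborhood}."""
    cluster = set()
    stack = [seed]
    while stack:
        o = stack.pop()
        if o in cluster:
            continue
        cluster.add(o)
        assigned.add(o)
        # Only COOTs (enough neighbors on their OList) are expanded further.
        if len(olist.get(o, ())) >= min_pts:
            stack.extend(n for n in olist[o] if n not in cluster)
    return cluster

# Hypothetical OList snapshot at some time T (MinPts = 2):
olist = {"O1": {"O2", "O3"}, "O2": {"O1", "O3"}, "O3": {"O1", "O2"},
         "O4": {"O5"}, "O5": {"O4"}}
assigned = set()
clusters = [collect_cluster(o, olist, 2, assigned)
            for o in olist
            if len(olist[o]) >= 2 and o not in assigned]
print(clusters)   # -> [{'O1', 'O2', 'O3'}] (set ordering may vary)

Objects such as O4 and O5, which have fewer than MinPts neighbors, can be pulled into the cluster of a neighboring COOT but never start or expand one, mirroring the density-reachability definition of Section 2.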

4.4. Specified Prediction Window (SPW)

If users want to focus the COOT and COT predictions on a particular time period (e.g. from the current time to the next hour), they can provide a time segment [s1, s2], referred to as a Specified Prediction Window (SPW), to filter out ε−neighborhood relationships whose periods do not intersect the SPW. That is, an ε−neighborhood relationship that occurs in a time segment [x, y] is discarded if y < s1 or s2 < x. Otherwise, the relationship, with its period clipped to [x', y'] = [s1, s2] ∩ [x, y], is inserted into the B+S tree for prediction. More precisely, x' = max(x, s1) and y' = min(y, s2). A set of disjoint SPWs can also be used if needed. An SPW not only reduces unnecessary computation and space by filtering out uninteresting ε−neighborhood relationships in unwanted periods, it also allows users to guide the prediction process to produce more focused results in the periods in which an application is able to react to the cluster changes.
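A small Python sketch of the SPW filter (our own illustration; the function name is an assumption): it returns None when a relationship period misses the window and the clipped period otherwise.

def clip_to_spw(period, spw):
    """Clip an eps-neighborhood period [x, y] to the SPW [s1, s2].
    Returns None if the two segments do not intersect."""
    x, y = period
    s1, s2 = spw
    if y < s1 or s2 < x:                 # no overlap: discard the relationship
        return None
    return (max(x, s1), min(y, s2))      # [x', y'] = [s1, s2] ∩ [x, y]

print(clip_to_spw((4.29, 5.71), (0, 5)))    # (4.29, 5)
print(clip_to_spw((7.0, 9.0), (0, 5)))      # None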

5. Evaluation

We first analyze the time and space complexity of our approach in Section 5.1. We then present the simulation results in Section 5.2.

5.1. Complexity Analysis

As mentioned before, our approach first needs to predict the distance between each distinct pair of objects over time. For a database with n objects, there are (n²−n)/2 distinct object pairs. Hence, the computational complexity of this stage is O(n²). If the distance between a pair of objects can become no greater than ε, this object pair, along with its relationship period, is inserted into a B+S tree for the COOT and COT predictions. Note that each object pair is inserted twice into the tree: once for the start time of the relationship period and once for its end time. Hence, if there are R ε−neighborhood relationships, a B+S tree must maintain 2×R object pairs, or at most 2×R leaf node entries. In the rare situation where all object pairs are within the ε−neighborhood of each other at some time, R can reach its maximum of (n²−n)/2. Since the number of leaf node entries dominates the space required by a B+S tree (as in a B+ tree), the space complexity for maintaining a B+S tree is O(R). The time complexity for inserting each object pair into the tree is the height of the B+S tree, logQ R, where Q denotes the number of entries per tree node. Hence, for R ε−neighborhood relationships, the time complexity for building a complete B+S tree is O(R × logQ R).

For predicting COOTs, our algorithm needs to scan all the leaf node entries of a B+S tree. Since the maximum number of leaf node entries is 2×R, the time complexity of COOT prediction is O(R).

For predicting COTs, each object must keep an OList that contains the objects within its ε−neighborhood. If R object pairs have ε−neighborhood relationships, the total space required for maintaining the OLists of all objects is O(2×R) = O(R). The time complexity of the COT algorithm is analyzed based on its three major steps summarized in Section 4.3. If there are R ε−neighborhood relationships, a B+S tree holds 2×R object pairs; hence, the time complexity of the first two steps of the COT algorithm is O(R). The time complexity of the third step can be divided into two parts: scanning the n objects of the database takes O(n), and scanning all the objects on the OLists takes O(2×R) = O(R). Hence, the time complexity of the third step is O(R+n). Since the third step may need to be executed each time the first two steps process an event, the final time complexity of the COT algorithm is O(R×(R+n)).

5.2. Simulation Experiments

In this section, we evaluate the performance of our algorithms. We first measure the execution time for predicting the ε−neighborhood relationships among paired objects. We then assess the execution time of the COT algorithm. Next, we compare our approach with the interval-checking method, which searches for density-based clusters on extrapolated data at regular intervals. More precisely, at each interval we first extrapolate individual objects based on their change rates; DBSCAN is then executed on this extrapolated data to search for clusters.

Our experiments were run with various settings. However, due to space limitations, we present only one set of experiments here, in which 20 randomly generated databases were used. Each database contains 1,500 objects that are randomly placed on a 10,000-pixel × 10,000-pixel area. Each object is also randomly assigned a moving angle (0–360°) and a moving speed (at most 50 pixels per time unit) so that it changes its location over time. ε is set to 250 pixels and MinPts is set to 10. Since the interval-checking approach cannot be executed indefinitely, its execution time segment is set to between time 0 and 100. This time segment is also adopted as the SPW in our approach.

Table 2 shows the average execution time for predicting ε−neighborhood relationships and for the COT algorithm. The table shows that predicting the ε−neighborhood relationships among all paired objects takes about 95% of the total execution time, while the COT algorithm takes only 5%. Note that when more paired objects are within the ε−neighborhood of each other, the cost of the COT algorithm also increases.

Table 2. Execution time in different stages of our approach.

                     ε−relationship    COT algorithm    Total Exe. Time
  Avg. Exe. Time     123.887 sec       6.534 sec        130.421 sec
  % of Total Time    94.99%            5.01%            100%

Table 3 compares the average execution time of the interval-checking approach for different interval lengths. Although DBSCAN is executed less frequently when the interval length is large, the interval-checking approach is still several times slower than our approach. The Speedup in Table 3 is measured by dividing the average execution time of the interval-checking method by the total execution time of our approach given in Table 2.

Table 3. Interval-checking vs. our approach.

  Interval Length         20         10         1
  Avg. Exe. Time (sec)    1366.27    2532.93    23075.62
  Speedup                 10.48      19.42      176.93
  Average Hits            2.5        4.6        41.6
  Hit Rate                1.89%      3.48%      31.49%

Note that while the COT algorithm can detect clusters over periods of time, the interval-checking approach can only detect clusters at particular points in time. Hence, whenever the interval-checking method detects clusters at a time point t, we can always find a COT whose period covers t, but the reverse is not true. We therefore measure the percentage of clusters detected by our approach that are also detected (hit) by the interval-checking method; we refer to this percentage as the hit rate in Table 3. Table 3 shows that, when the interval length is 20, the interval-checking approach hits only 1.89% of the 132.1 COTs per database (on average) detected by our approach. Although a shorter interval length improves the hit rate, the execution time also increases dramatically.

6. Conclusion

Most existing clustering algorithms focus on discovering snapshot clusters, which do not reveal important information such as how long the present clusters will persist, or where and when future clusters may occur. This information is important because it can guide users to prepare appropriate actions toward the right places at the right time for the most effective results. In this study, we propose a simple but effective approach for predicting density-based clusters over time. Our approach first uses efficient formulas to determine the future ε−neighborhood relationships among object pairs. Object pairs that will never be in the ε−neighborhood of each other are filtered out. The COOT and COT algorithms then process the remaining pairs to discover concentrated areas (COOTs) and detailed cluster contents (COTs), respectively. An SPW can further reduce unnecessary computation and space by filtering out uninteresting object pairs in unwanted periods. Our experiments confirm that our approach not only has much higher precision in predicting clusters over time than the interval-checking method, but is also much more efficient. Since objects and their change rates may carry uncertainty, we are investigating the possibility of extending the uncertainty model discussed in [6] and integrating it with our clustering algorithms.

7. References

[1] L. Ertoz, M. Steinbach, V. Kumar, "Finding Clusters of Different Sizes, Shapes, and Densities in Noisy, High Dimensional Data", In Proc. SIAM Int'l Conf. on Data Mining, 2003.
[2] M. Ester, H. Kriegel, J. Sander and X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", In Proc. of 2nd Int'l Conf. on Knowledge Discovery and Data Mining, 1996, pp. 226-231.
[3] M. Ester, H. Kriegel, J. Sander, M. Wimmer and X. Xu, "Incremental Clustering for Mining in a Data Warehousing Environment", In Proc. of 24th Int'l Conf. on VLDB, 1998, pp. 323-333.
[4] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
[5] A. Lazarevic, R. Kanapady, C. Kamath, V. Kumar and K. Tamma, "Localized Prediction of Continuous Target Variables Using Hierarchical Clustering", In Proc. of 3rd IEEE Int'l Conf. on Data Mining, 2003, pp. 139-146.
[6] C. Lai, "Method for Determining Conflicting Paths Between Mobile Airborne Vehicles and Associated System", U.S. Patent 6-564-149, European Patent EP1299742, 2003.
[7] O. Wolfson, B. Xu, S. Chamberlain and L. Jiang, "Moving Objects Databases: Issues and Solutions", In Proc. 10th Int'l Conf. on Scientific and Statistical Database Management, 1999, pp. 111-122.
[8] T. Zhang, R. Ramakrishnan, M. Livny, "BIRCH: An Efficient Data Clustering Method for Very Large Databases", In Proc. ACM SIGMOD Int'l Conf. on Management of Data, 1996, pp. 103-114.
