Mining Spatial Association Rules Without the Distance Parameter

Robert Bembenik, Henryk Rybiński
[email protected], [email protected]
Institute of Computer Science, Warsaw University of Technology

ABSTRACT: This paper focuses on finding spatial association rules. In the first part of the article the specifics of spatial databases are discussed and the existing methods for finding spatial association rules are reviewed. Next, a new approach to mining spatial association rules is introduced that does not require the user to supply distance parameters defining neighborhood. The main motivation is that when we want to discover associations among objects in space, it is difficult to define neighborhood in a way that is both flexible across data sets and unambiguous. We achieve this goal by grouping objects, represented here as points, with an algorithm based on the Delaunay diagram. Once such a diagram has been created for grouping, we reuse it to determine neighborhoods and then, based on this knowledge, proceed to find association rules. The approach is described in detail and evaluated experimentally.

KEYWORDS: knowledge discovery, spatial data mining, spatial association rules, frequent patterns, Delaunay triangulation
1. Introduction
A huge amount of spatial data has been collected by various information systems, e.g. regional sales systems, remote sensing systems, geographical information systems (GIS), computer cartography, and satellite systems. Geographic data consist of spatial objects and non-spatial descriptions of these objects: coordinates and geometry on the one hand, and non-spatial attributes such as the name of a town or its number of inhabitants on the other. Spatial data can be described using two kinds of properties: geometric and topological. Geometric properties include spatial location, area, perimeter, etc. Topological relationships include, among others, adjacency (object A meets object B) and inclusion (A contains B). To perform spatial data mining tasks efficiently, spatial data should be stored in dedicated information systems. An SDBMS (Spatial Database Management System) consists of two parts: a DBMS (Database Management System) and a module that can work with spatial data. It supports multiple spatial data models, abstract spatial data types, and a spatial query language; it also supports spatial indexing and efficient algorithms for spatial operations [SC03, GG98]. Spatial data mining can be defined as the extraction of interesting spatial patterns and features, general relationships between spatial and non-spatial data, and other general data characteristics not explicitly stored in a spatial database system (SDBMS). Spatial properties of objects make knowledge discovery in spatial databases different from classical data mining, because spatial objects remain in relationships with several or many other objects. The efficiency of algorithms in spatial databases depends heavily on efficient processing of spatial relationships, and computing these relationships may be time-consuming. For instance, calculating a few thousand exact relationships among complex spatial objects (e.g. detailed borders of lands, countries, etc.) may take very long (measured in days!) even on powerful machines. For the purpose of our considerations, spatial objects (e.g. shops, cinemas, etc.) are represented as points (each point represents the center of an object). As mentioned before, the core element in the process of mining spatial data is the efficient processing of spatial relationships. Here we propose an efficient method of calculating spatial neighborhoods that is not based on a metric value and is determined unambiguously for every relationship. The rest of the paper is organized as follows: Section 2 summarizes the work pertaining to finding spatial association rules. Section 3 discusses the proposed approach. In Section 4 the necessary terminology is defined. Section 5 describes the advantages of the novel approach. Section 6 reports briefly on the more important implementation issues. In Section 7 the results of the experiments are presented. Section 8 concludes the article.
2. Related work
In the literature several approaches to discovering frequent patterns in spatial contexts have been proposed. Each of them uses a different methodology, but the notion of neighborhood has always depended on a user-specified value and has sometimes even been ambiguous (the result could differ depending on the order of grouping). In [KH95] a method for mining spatial association rules was presented. The following rule is an example of a spatial association rule:

is_a(x, house) ∧ close_to(x, beach) → is_expensive(x). (90%)

It says that if x is a house and x is close to a beach, then in 90% of all the cases the price of x is high. The method is based on a hierarchy of topological relations (spatial predicates) and a hierarchy of non-spatial data. Hierarchies are explicitly given by experts, or can be generated automatically by data analysis. For example, g_close_to is a high-level spatial predicate covering a set of basic spatial predicates: overlap, meet, contains, close_to. Exemplary hierarchies for the non-spatial attributes towns and water are:
• Town: large_town(big, medium_sized) - small_town(…) - …,
• Water: sea(…) - river(…) - lake(…).
The main idea of that technique is to find frequent patterns at a high level of the hierarchy and then, only for the previously discovered frequent patterns, deepen the search to lower levels of the hierarchy. The deepening search process continues until the lowest level of the hierarchy is reached. The mining process proceeds as follows: 1) the set of relevant data is retrieved by a query in a spatial query language, which extracts the requested sets of data; 2) the "generalized close_to" (g_close_to) relationship among the appropriate classes of entities is computed at the high level of the hierarchy. The derived spatial predicates (topological relationships) are collected in a "g_close_to" table, which follows an extended relational model: each field of the table may contain a set of entities (predicates). The support of each entry is computed, and entries whose support is below the minimum threshold are removed. Each g_close_to predicate is then replaced by one of a set of lower-level predicates and a refined computation is performed. The Apriori algorithm [AS94] is used to find frequent predicates. Computations at lower hierarchy levels of the non-spatial attributes continue analogously (the values of support and confidence should be appropriately lower). This model of mining spatial association rules is a 'reference feature centric model' [HSX02]. It enumerates neighborhoods to "materialize" a set of transactions around instances of the reference spatial feature (e.g. finding towns that are in the vicinity of some body of water). All discovered association rules are related to the reference feature. If we considered elements A, B and C, where C is the reference feature and both (A, B) and (B, C) are frequent, the set (A, B) would not be found, since it does not involve the reference feature. The method cannot be easily generalized to the case where no reference spatial feature is specified. Mining frequent neighboring class sets was studied in [M01]. The considered database consists of both non-spatial and spatial objects. The latter are represented as points (x and y coordinates) and are members of given classes of objects. Instances of different classes lying close to each other (the distance value is user-defined) form a neighboring class set. ({circles, squares}, 3) is an example of a 2-neighboring class set with support value 3. If the number of instances of a neighboring class set is larger than a specified value (minimum support), the class set is a frequent neighboring class set.
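The counting of 2-neighboring class sets with a user-specified distance can be sketched as follows. This is a minimal illustration in the spirit of [M01], not its actual algorithm; the toy points, the distance threshold and the function name are made up for the example:

```python
from itertools import combinations
from math import hypot

# Hypothetical toy data: (class_label, x, y) points.
points = [("circle", 0, 0), ("square", 1, 0), ("circle", 5, 5),
          ("square", 5, 6), ("circle", 9, 1), ("square", 9, 2)]

def neighboring_class_sets(points, d, min_support):
    """Count instances of 2-neighboring class sets: pairs of points of
    different classes lying within the user-specified distance d, and
    keep only those class sets meeting the minimum support."""
    counts = {}
    for (c1, x1, y1), (c2, x2, y2) in combinations(points, 2):
        if c1 != c2 and hypot(x1 - x2, y1 - y2) <= d:
            key = frozenset((c1, c2))
            counts[key] = counts.get(key, 0) + 1
    return {k: v for k, v in counts.items() if v >= min_support}

print(neighboring_class_sets(points, d=2.0, min_support=3))
```

Note that the result depends directly on the chosen d: a different threshold yields different class sets, which is exactly the sensitivity our approach removes.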
K-neighboring class sets are computed based on a user-specified distance value and support value using a variation of the Apriori algorithm. This approach is partitioning-sensitive [HSX02]: since the groups are created first and frequent neighboring class sets are calculated on that basis, the instances of k-neighboring class sets may differ depending on the order in which the classes are grouped [M01]. Different groupings may yield different values of the support measure and thus different association rules. [SH01] proposes a method for mining spatial co-location patterns. Co-location patterns represent frequent co-occurrences of a subset of boolean spatial features. The algorithm for mining these patterns is called Co-location Miner. First, co-location row instances are enumerated; then measures of prevalence and conditional probability are computed at the co-location level: the participation index is calculated, and based on it the conditional probabilities are computed. A co-location rule has the form [HSX02]: C1 → C2 (p, cp), where C1 and C2 are co-locations, C1 ∩ C2 = ∅, p is a number representing the prevalence measure and cp is a number representing the conditional probability. For the rule to be valid, the value of the conditional probability has to exceed the level specified by the user at the beginning of the rule-extraction process. This approach allows for multiple definitions of neighborhood relations: the neighbor relation may be defined using topological relationships (e.g. connected, adjacent), metric relationships (e.g. Euclidean distance) or a combination thereof. Such a definition of neighborhood incurs substantial computational cost when one wants to enumerate all neighborhoods (whose number may potentially be infinite [HSX02]); moreover, multi-resolution pruning needs to be done for spatial datasets with strong auto-correlation (datasets where instances of each spatial feature type tend to be located near each other). Multi-resolution pruning entails superimposing d-sized cells on the dataset (where d is a user-defined distance). In this grid two cells are coarse neighbors if their centers are in a common square of size d×d, which imposes an 8-neighborhood on the cells. Co-location rules are generated using coarse neighborhoods, and for those items whose prevalence value is large enough, detailed calculations are done at the fine level. The drawbacks of this approach are the need for multiple calculations (at the coarse and at the fine level) and the cell sizes of the imposed grid (the grid itself does not capture the specifics of the spatial data, and the cell sizes have to be fine-tuned).
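The coarse-grid test described above can be sketched as follows. This is a minimal illustration of the d-sized cell neighborhood; the function names are ours, not [SH01]'s:

```python
def cell_of(x, y, d):
    """Map a point to the index of the d-sized grid cell containing it."""
    return (int(x // d), int(y // d))

def coarse_neighbors(cell_a, cell_b):
    """Two cells are coarse neighbors when their centers fall in a common
    d x d square, i.e. the cell indices differ by at most 1 along each
    axis (the 8-neighborhood, plus the cell itself)."""
    (ax, ay), (bx, by) = cell_a, cell_b
    return abs(ax - bx) <= 1 and abs(ay - by) <= 1

d = 10.0
assert cell_of(3, 7, d) == (0, 0)
assert coarse_neighbors(cell_of(3, 7, d), cell_of(12, 4, d))      # adjacent cells
assert not coarse_neighbors(cell_of(3, 7, d), cell_of(25, 4, d))  # two cells apart
```

The sketch makes the drawback visible: the outcome of the coarse test changes whenever d changes, so d has to be tuned per dataset.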
3. Our approach to mining spatial associations
In the previous section several approaches to mining spatial data have been briefly described. Apart from the mining technique itself, the most important aspect (though not always given enough attention by the authors) seems to be the definition of neighborhood, which is always the basis for spatial calculations. Neighborhood is determined either by a certain partitioning of space or by enumerating the neighbors that lie within some distance from the reference object.
Fig. 1 Neighborhood of points.

Figure 1 shows points in space with a window as a reference defining transactions. Points p1 and p3 are neighbors here, but points p4 and p5 are not considered neighbors. The closeness relation depends strongly on the size of the cell: if we enlarge the cell sixteen times (thick solid line), it turns out that points p4 and p5 are neighbors. Such a definition of closeness is clearly imprecise. Depending on the cell sizes, different neighborhoods, and as a result different association rules, will be obtained. A similar situation occurs when objects lie within some user-defined distance from the reference object. If the distance is too small, an object will have few neighbors and, as a result, the generated association rules will not reflect the real patterns existing in the spatial data. If, on the other hand, the distance defining neighborhood is too large, the object may have too many neighbors and the calculated rules will be distorted as well. To achieve the requested result, the calculations have to be repeated for several values of the distance, prolonging exploration time, especially if the investigated space contains a large number of objects. Our approach focuses on reducing the ambiguity in the process of neighborhood enumeration and on eliminating additional parameters (like distance), and consequently on accelerating the whole process of discovering association rules. From all the data, representing several different object groups, we consider clusters of points as candidates for discovering association rules. These aggregated spatial concentration groups represent and summarize the distribution of the considered points. Points representing noise are located far from other objects; they are thus omitted in the process of finding association rules. Mining association rules in this approach includes the following steps:
1. Creating clusters consisting of the analyzed points.
2. Enumerating object instances of each type.
3. Determining neighborhoods based on the Delaunay diagram.
4. Generating association rules.
For clustering we use an algorithm based on the Delaunay diagram, because the diagram, once built, can later be reused to determine neighborhoods among the analyzed points. The clustering algorithm proposed in [ECL00] (AUTOCLUST) for grouping points in space uses dynamic thresholds instead of user-specified parameters. It removes overly long edges from the Delaunay diagram created for all points in space, removes Delaunay edges connecting clusters, and discovers groups of different types. Discovering cluster boundaries in AUTOCLUST is based on the fact that, in the Delaunay diagram, points that make up a cluster boundary have a greater standard deviation of the lengths of their incident edges, since they possess both short and long edges: short edges connect points inside the cluster, while long edges form connections between clusters or between a cluster and noise. This observation of boundary points' behavior constitutes the basis for the dynamic criterion of edge elimination.
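The clustering step can be sketched along these lines. This is a simplified stand-in, not AUTOCLUST itself: a single global mean-plus-standard-deviation cutoff replaces AUTOCLUST's per-point local statistics, and SciPy's Delaunay triangulation is assumed to be available:

```python
import numpy as np
from scipy.spatial import Delaunay

def delaunay_edges(points):
    """All unique edges of the Delaunay triangulation of a 2D point set."""
    tri = Delaunay(points)
    edges = set()
    for simplex in tri.simplices:
        for i in range(3):
            a, b = sorted((simplex[i], simplex[(i + 1) % 3]))
            edges.add((a, b))
    return edges

def cluster_by_edge_elimination(points):
    """Simplified AUTOCLUST-style grouping: drop every Delaunay edge
    longer than mean + std of all edge lengths (a global criterion,
    unlike AUTOCLUST's local one), then return the connected
    components of the remaining graph as clusters."""
    points = np.asarray(points, dtype=float)
    edges = delaunay_edges(points)
    lengths = {e: np.linalg.norm(points[e[0]] - points[e[1]]) for e in edges}
    cutoff = np.mean(list(lengths.values())) + np.std(list(lengths.values()))
    keep = [e for e in edges if lengths[e] <= cutoff]
    # Connected components via union-find.
    parent = list(range(len(points)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for a, b in keep:
        parent[find(a)] = find(b)
    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```

For two well-separated point groups the long inter-cluster Delaunay edges exceed the cutoff and are removed, so each group emerges as one connected component; the surviving edges are exactly the neighborhood structure reused in the later mining steps.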
Fig. 2 a) Set of points in space; different shapes represent various types of objects. b) The set of points after clustering.

Figure 2 shows an exemplary set of points in space (Fig. 2a) and the points after clustering (Fig. 2b). Various shapes represent different object types. It can be clearly seen that the points form two groups. There are also objects that do not belong to any of the groups; they lie far from them, are considered noise, and will not be included in the process of finding association rules. The next step in the process of mining association rules is labeling object instances of the same type. Neighborhood also has to be determined, based on the Delaunay diagram created during clustering. Two objects are considered neighbors in the Delaunay diagram if there is an edge connecting them. The numbered objects from the first group in Figure 2b, together with the marked neighborhoods, are depicted in Figure 3.
Fig. 3 Delaunay diagram depicting the neighborhood of labeled points.

Association rules are calculated in a way similar to that of [HSX02]. In the next section all the necessary terminology is defined.
4. Definitions
Let us define neighborhood first. Generally this term has many meanings and uses [WWW1]. For example, 'neighborhood' can refer to the small group of houses in the immediate vicinity of one's house, or to a larger area with similar housing types and market values. Neighborhood is also used to describe an area surrounding a local institution patronized by residents, such as a church, school, or social agency. It can also be defined by a political ward or precinct. The concept of neighborhood includes both geographic (place-oriented) and social (people-oriented) components. Our definition of 'neighborhood' is the following: the neighbors of a point are those points with which there exists an immediate linear connection in the diagram (as in Figure 3). For the purpose of creating associations (so that the rules do not depend on the order of elements) we extend this definition with the help of the following theorem.
Theorem 1. After enumerating all neighborhoods in clusters (the neighbors of a point make up one row), the points in one row are either immediate neighbors or are located at most in a 2-neighborhood.
Here is an example of a neighborhood from Fig. 3. Object ∆1 has the following neighbors: □1, □2, ∆2, Ο2, Ο1. According to the above definition and Theorem 1 all the enumerated elements are neighbors. That means that although □1 and Ο1 are not connected in the diagram, they are neighbors, because they belong to the neighborhood of object ∆1. A neighborhood instance of closely located objects B={t1, t2,…, tk}, denoted by I={i1, i2,…, ik}, where ij is an instance of an object of type tj (∀j∈1,…,k), is defined as a set of objects of various types among which there is a connection in the Delaunay diagram. An example of a neighborhood instance of the closely located objects {∆, □, Ο} from Figure 3 is the set {∆3, □3, Ο2}.
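The neighborhood rows used in Theorem 1 can be illustrated as follows. This is a hypothetical sketch; the labels t/s/c stand for triangle, square and circle instances and only echo Fig. 3, they are not the actual data:

```python
from collections import defaultdict

def adjacency(edges):
    """Adjacency lists built from Delaunay edges: two objects are
    neighbors iff an edge of the diagram connects them."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def neighborhood_rows(edges):
    """One row per point: the point together with its direct Delaunay
    neighbors.  Any two points in a row are immediate neighbors or at
    most 2-neighbors, since they share the central point."""
    adj = adjacency(edges)
    return {p: {p} | nbrs for p, nbrs in adj.items()}

# Hypothetical edges echoing Fig. 3: triangle t1 is linked to two
# squares, a second triangle and a circle.
edges = [("t1", "s1"), ("t1", "s2"), ("t1", "t2"), ("t1", "c1")]
rows = neighborhood_rows(edges)
# s1 and c1 are not directly connected, yet both appear in t1's row,
# so by the extended definition they are treated as neighbors.
assert rows["t1"] == {"t1", "s1", "s2", "t2", "c1"}
```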
A table instance of closely located objects is defined as the set of all neighborhood instances.
The participation ratio Wu(B, ti) for closely located objects B={t1, t2,…, tk} and type ti is the fraction of the instances of ti which participate in neighborhood instances of the objects located close to B. It is computed from the following relationship:

Wu(B, ti) = |unique(instances of ti in all table instances of B)| / |instances of ti|.

In Figure 3 the neighborhood instances of the closely located objects {□, Ο} are {(□1, Ο1), (□3, Ο2), (□2, Ο2)}. All objects of type □ participate in a neighborhood instance, so Wu({□, Ο}, □) = 3/3 = 1.
The participation index of closely located objects B={t1, t2,…, tk} is defined as min(i=1,…,k) {Wu(B, ti)}. In Figure 3 the participation ratios are Wu({□, Ο}, □) = 1 and Wu({□, Ο}, Ο) = 1. The participation index of (□, Ο) is thus equal to min(1, 1) = 1.
The confidence of an association rule B1→B2 is the probability of finding an instance of an object B2 in the neighborhood of object B1. We calculate it from the following relationship:

conf(B1→B2) = |unique(all neighborhood instances of B1 ∪ B2)| / |instances of B1|.
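The measures defined above can be sketched in code. This is a minimal illustration; the instance labels mirror our reading of the Fig. 3 example (three squares, two circles) and are not actual data:

```python
def participation_ratio(instances, neighborhood_instances, t):
    """Wu(B, t): the fraction of instances of type t that appear in at
    least one neighborhood instance of the object set B."""
    participating = {i for inst in neighborhood_instances for i in inst
                     if i in instances[t]}
    return len(participating) / len(instances[t])

def participation_index(instances, neighborhood_instances, types):
    """Minimum of the participation ratios over all types in B."""
    return min(participation_ratio(instances, neighborhood_instances, t)
               for t in types)

# Hypothetical data mirroring the Fig. 3 example: three squares, two
# circles, neighborhood instances {(s1,c1), (s3,c2), (s2,c2)}.
instances = {"square": {"s1", "s2", "s3"}, "circle": {"c1", "c2"}}
ninst = [("s1", "c1"), ("s3", "c2"), ("s2", "c2")]
assert participation_ratio(instances, ninst, "square") == 1.0
assert participation_index(instances, ninst, ("square", "circle")) == 1.0
```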
5. Advantages of the approach
The presented approach to mining spatial association rules has the following advantages. Clustering discards single objects located far from the object clusters in the first phase of the mining process; those objects are too far from other objects to participate in interesting associations. The grouping algorithm uses the Delaunay diagram, which is then reused for determining neighborhood relations during the phase of discovering associations; there is thus no need to create additional data structures, which would slow down the mining process. The Delaunay diagram is a structure representing the neighborhood of objects in an unambiguous and concise way: in this structure there is no doubt which objects are neighbors. This is an essential improvement over the spatial data mining methods described in the literature so far. The definitions of neighborhood used there are not unambiguous and very often cause confusion. Data mining without a reliable neighborhood definition returns different results depending on the size of the window or the distance within which other objects are considered neighbors, and forces the user to perform the data mining process multiple times for different values of the respective parameters.
6. The implementation
Association rules produced by our program contain a single item in the consequent. The restriction to single-item consequents is due to the following considerations. In the first place, association rule mining usually produces too many rules even if one confines oneself to rules with only one item in the consequent [Bor04]; allowing more than one item in the consequent merely blows up the output size. Besides, there are hardly any real applications of rules with more than one item in the consequent, and more complex rules add almost nothing to the insights about the data set. Let us consider the simpler rules that correspond to a rule with multiple items in the consequent, that is, rules having the same antecedent and consequents with only single items from the consequent of the complex rule. All of these rules must necessarily be in the output, because neither their support nor their confidence can be less than that of the more complex rule. That is, if we have a rule (a b → c d), we will necessarily also have the rules (a b → c) and (a b → d) in the output. Of course, these two rules together do not say the same as the more complex rule; however, the gains from the additional information the more complex rule gives are rather small, and we decided that this little extra information is not worth having to analyze a much bigger rule set.
Finding frequent spatial itemsets is done using a structure described in [BP04]. For this step the Apriori algorithm [AS94], with modifications specific to spatial calculations, has been adapted; the presentation below is limited to the modifications. The discovered frequent itemsets, as well as the candidate itemsets (i.e. the sets that are potentially frequent), are stored in a tree-like structure called a T-tree. The root of the T-tree is at level 1. A node of the T-tree contains an items table; each field of the items table consists of an item, a support value and a pointer to a node at level l+1. The items belonging to a path in the T-tree, composed of fields of the items tables, form a frequent itemset. The support value of this itemset is stored in the last element of the path. Items stored in the root of the T-tree are frequent 1-itemsets. Elements in the root items table are sorted in descending order of their support.
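The T-tree described above can be sketched as follows. This is a minimal illustration of the structure; the field layout and method names are simplified relative to [BP04]:

```python
class TTreeNode:
    """A T-tree node holds an items table: each entry maps an item to
    its support value and a pointer to a child node one level deeper.
    A path of items from the root forms an itemset whose support is
    stored in the last element of the path."""

    def __init__(self):
        self.items = {}  # item -> [support, child TTreeNode or None]

    def insert(self, itemset, support):
        """Store an itemset's support at the end of its path."""
        node = self
        for item in itemset[:-1]:
            entry = node.items.setdefault(item, [0, None])
            if entry[1] is None:
                entry[1] = TTreeNode()
            node = entry[1]
        node.items.setdefault(itemset[-1], [0, None])[0] = support

    def support_of(self, itemset):
        """Follow the path of items; 0 if the itemset is not stored."""
        node, entry = self, None
        for item in itemset:
            if node is None or item not in node.items:
                return 0
            entry = node.items[item]
            node = entry[1]
        return entry[0]

root = TTreeNode()
root.insert(["a"], 5)         # frequent 1-itemset in the root
root.insert(["a", "b"], 3)    # 2-itemset one level deeper
assert root.support_of(["a", "b"]) == 3
assert root.support_of(["a", "c"]) == 0
```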
7. Experiments
8. Conclusions
In this paper an approach to discovering spatial association rules was presented. It makes it possible to calculate spatial associations without passing a distance parameter (which was necessary in the previous approaches to define either the window or the radius within which objects were considered neighbors). We achieved this by first grouping objects with an algorithm that uses the Delaunay diagram, and then using the once-calculated diagram for determining neighbors. The efficiency of our approach was evaluated experimentally. The proposed approach accelerates the process of discovering spatial association rules, since we do not have to repeat it for different values of a measure defining neighborhood.
References
[AS94] Agrawal R., Srikant R.: Fast Algorithms for Mining Association Rules in Large Databases, Proc. of the Int'l Conf. on Very Large Data Bases (VLDB), Santiago, Chile, 1994.
[Bor04] Christian Borgelt's Apriori web page: http://fuzzy.cs.uni-magdeburg.de/~borgelt/doc/apriori/apriori.html
[BP04] Bembenik R., Protaziuk G.: Mining Spatial Association Rules, Proc. of the IIS:IIP WM'04, Zakopane, Poland, Springer, 2004.
[ECL00] Estivill-Castro V., Lee I.: AUTOCLUST: Automatic Clustering via Boundary Extraction for Mining Massive Point-Data Sets, Proc. of the 5th International Conference on Geocomputation, 2000.
[HSX02] Huang Y., Shekhar S., Xiong H.: Discovering Co-location Patterns from Spatial Datasets: A General Approach, Technical Report TR 02033, Computer Science & Engineering, University of Minnesota Twin Cities, 2002.
[KH95] Koperski K., Han J.: Discovery of Spatial Association Rules in Geographic Information Databases, Proc. of the 4th International Symposium on Large Spatial Databases, August 1995.
[KHA97] Koperski K., Han J., Adhikary J.: Mining Knowledge in Geographical Data, Comm. ACM, 1997.
[M01] Morimoto Y.: Mining Frequent Neighboring Class Sets in Spatial Databases, KDD'01, San Francisco, USA, 2001.
[SC03] Shekhar S., Chawla S.: Spatial Databases: A Tour, Prentice Hall, 2003.
[SH01] Shekhar S., Huang Y.: Discovering Spatial Co-location Patterns: A Summary of Results, Proc. of SSTD, Redondo Beach, USA, 2001.
[WWW1] http://www.gnocdc.org/def/neighborhood.html