Mapping Large Spatial Flow Data with Hierarchical ... - Semantic Scholar

bs_bs_banner

Research Article

Transactions in GIS, 2014, ••(••): ••–••

Mapping Large Spatial Flow Data with Hierarchical Clustering Xi Zhu*,† and Diansheng Guo* *Department of Geography, University of South Carolina † School of Hydropower and Information Engineering, Huazhong University of Science and Technology

Abstract It is challenging to map large spatial flow data due to the problem of occlusion and cluttered display, where hundreds of thousands of flows overlap and intersect each other. Existing flow mapping approaches often aggregate flows using predetermined high-level geographic units (e.g. states) or bundling partial flow lines that are close in space, both of which cause a significant loss or distortion of information and may miss major patterns. In this research, we developed a flow clustering method that extracts clusters of similar flows to avoid the cluttering problem, reveal abstracted flow patterns, and meanwhile preserves data resolution as much as possible. Specifically, our method extends the traditional hierarchical clustering method to aggregate and map large flow data. The new method considers both origins and destinations in determining the similarity of two flows, which ensures that a flow cluster represents flows from similar origins to similar destinations and thus minimizes information loss during aggregation. With the spatial index and search algorithm, the new method is scalable to large flow data sets. As a hierarchical method, it generalizes flows to different hierarchical levels and has the potential to support multi-resolution flow mapping. Different distance definitions can be incorporated to adapt to uneven spatial distribution of flows and detect flow clusters of different densities. To assess the quality and fidelity of flow clusters and flow maps, we carry out a case study to analyze a data set of 243,850 taxi trips within an urban area.

1 Introduction Large data on geographic mobility such as migration, commuting, movements of goods and disease spread have become increasingly available due to the wide adoption of location-aware technologies. In our research, we focus on the origin-destination flow data, a specific type of geographic mobility data that concerns the origin and destination of each movement but ignores the actual trajectory route, for example, taxi trip data that has origin and destination points for each passenger ride. This form of data can also be referred to as flow data or spatial interaction data. It is a challenging problem to map and understand patterns in massive origindestination data due to the problem of occlusion and cluttered display, where thousands or millions of flows overlap and intersect each other. Location-based graph measures and summary statistics, such as in-flow, out-flow, and netflow ratios, are often used to understand location characteristics and spatial patterns of mobility data (Guo et al. 2012). Such measures project the data to a certain perspective but do not allow direct understanding of the connections among locations. Flow maps, on the other hand, can directly plot links among locations and show patterns of geographical movements Address for correspondence: Diansheng Guo, Department of Geography, University of South Carolina, 709 Bull Street, Columbia, SC 29208, USA. E-mail: [email protected] Acknowledgements: This work was supported in part by the National Science Foundation under Grant No. 0748813.

© 2014 John Wiley & Sons Ltd

doi: 10.1111/tgis.12100

2

X Zhu and D Guo

(Tobler 1981, 1987). However, traditional flow maps are not capable of mapping large flow data. A typical origin-destination flow data set, such as migration paths among US counties or taxi trips in a city, can easily have millions of origin-destination pairs. A number of new flow mapping approaches have been proposed to visualize and discover patterns in large flow data sets (e.g. Phan et al. 2005; Cui et al. 2008; Guo 2009, Holten and van Wijk 2009; Andrienko and Andrienko 2011; Verbeek et al. 2011), which we will review in the next section. This article presents a new flow-mapping approach based on flow clustering to effectively generalize large point-to-point spatial flow data, discover major flow patterns, and preserve data resolution as much as possible within the map space. The approach can process large origin-destination flows and adapt to skewed spatial distributions, i.e. flow clusters of different spatial densities. Unlike existing approaches that aggregate locations or bundle flow lines, we directly cluster flows based on both origin and destination similarities among flows. The key contribution of the approach is a new agglomerative clustering method specifically designed for origin-destination flows and scalable to large data size. We demonstrate and evaluate the usefulness of the new method with a case study on taxi trip analysis.

2 Related Work A flow map is the most common away to present flow data, with flows represented by straight or curved lines connecting origin and destination locations. Flow mapping has long been used in a wide range of applications to map human migration (Tobler 1981, 1987), commodity flow (Ullman and Dacey 1960), transportation (Gould 1967), commuting patterns (Corcoran et al. 2009) and crime patterns (Groff and McEwen 2006). However, a flow map quickly becomes illegible as data size increases, with severe cluttering problems caused by massive intersections and overlapping of flows. A number of approaches have been proposed to address this problem for flow mapping. One approach to reduce flow map cluttering is through location aggregation, which combines adjacent places with point clustering (Andrienko and Andrienko 2011), graph partitioning (Guo 2009), or simply larger administrative units (such as states or provinces). By partition or aggregation, points or small units are grouped into larger regions; subsequently flows are aggregated based on the new regions. A review of aggregation methods for movement data can be found in Andrienko and Andrienko (2002). This type of location aggregation is effective in reducing the cluttering problem but also has drawbacks. First, in mobility data it is often true that most flows occur between nearby places such as in commuting and migration. Excessive aggregation of locations will inevitably remove a significant amount of short-distance flows and skip flow patterns at local scales. Second, all aggregated flows will use the same set of regions, which can miss flow trends at different scales or flows that involve different types of regions. For example, there can be two flow clusters: one from region A to B and the other from C to D, where A and C overlap but are not identical. For this case location aggregation will either miss the flow pattern entirely or produce less accurate flow patterns. Third, aggregating locations separately from flow aggregation can be affected by the modifiable areal unit problem (MAUP) (Openshaw 1983), where the way to aggregate locations may dramatically alter the result of the analysis. Different from the above location-aggregation strategy, another type of flow mapping approach is to allow movements to take place only between geographically adjacent places. A classic example is presented in (Tobler 1981, 1987), where each origin-destination flow is rerouted and forced to pass nearby locations. For example, a migration path from Maine to © 2014 John Wiley & Sons Ltd

Transactions in GIS, 2014, ••(••)

Mapping Large Spatial Flow Data with Hierarchical Clustering

3

California will be decomposed into a sequence of moves between adjacent states. As such, the flow map does not have crossing flows, similar to traffic maps that show movement along streets but do not show the origin or destination of a moving object (see Andrienko and Andrienko 2011 for recent examples of such traffic maps). The advantage of this type of flow map is that it reveals a general trend surface and completely removes occlusions. However, it loses the information on flow origins and destinations and completely misses the actual connections between specific places that are not adjacent. A third type of method focuses on minimizing edge crossing in flow maps through edge rerouting (Phan et al. 2005) or edge bundling (Cui et al. 2008; Holten and van Wijk 2009; Verbeek et al. 2011), which are similar in spirit to various graph drawing methods. Since geographic locations (or nodes) cannot be moved freely in the layout (i.e. map), these methods reroute and bundle edges instead to improve the visual clarity of flow maps. Edge bundling changes the shape of edges by “visually bundling them together, analogous to the way electrical wires are merged into bundles along their joint paths and fanned out again at the end, in order to make a tangled web of wires and cables more manageable”. The limitation of edge rerouting or bundling approaches is two-fold. First, a flow map of bundled edges cannot effectively communicate origin-destination connections unless each connection (analogous to an electronic wire) is uniquely identifiable (e.g. with a unique color). In other words, edge bundling compromises the perception of flow connections although it improves visual clarity. Second, they are best suited for relatively small datasets, e.g. the edge rerouting approach (Phan et al. 2005) works best for mapping flows to/from one or two selected locations. To handle large amounts of flow data, filtering out small flows in the map is commonly practiced for enhancing visual analytics. For a spatial flow data, e.g. county-to-county migration data, it is very common that there are many more small flows in terms of flow volume (e.g. number of migrants) than large flows. These large flows often represent a major proportion of the total flow volume. For example, Tobler (1987) observes that 75% of flow connections on the small side contain less than 25% of the flow volume (e.g. migrants). Therefore, almost all flow maps for relatively large data will only map the major flows. However, this strategy can be severely flawed if the spatial units (e.g. counties) are dramatically different in size or the grouping of locations is arbitrary and does not reflect true patterns in the data. For example, if one only shows the larger migration flows among US counties, the map would contain mostly flows among large counties in terms of population, which is less informative and even misleading for interpreting the migration patterns in the US. Therefore, we argue that the aggregation of locations should be carried out based on the analysis of flow patterns, as opposed to using predefined units or arbitrary aggregations without considering inherent flow patterns. Spatial clustering is one of the common approaches for aggregating locations in the literature of spatial analysis and spatial data mining (Han and Kamber 2000). There are a large number of different spatial clustering methods (Ng and Han 1994; Ester et al. 1996; Zhang et al. 1996; Agrawal et al. 1998; Guha et al. 1998; Lu and Thill 2008). These methods can be classified into three categories: partitioning, hierarchical, and density-based clustering. Hierarchical clustering seeks to organize data items into a hierarchy, with two general strategies: agglomerative or divisive clustering. In an agglomerative approach, each observation starts as a single-item cluster at the bottom of the hierarchy, and the most similar pair of clusters is merged at each step towards the top of the hierarchy. In a divisive approach, all observations start in one cluster at the top of the hierarchy, and split the cluster recursively as the distance threshold decreases. In general, an agglomerative clustering method such as the complete-linkage or average-linkage has a computational complexity of O(n3), which is not © 2014 John Wiley & Sons Ltd


4

X Zhu and D Guo

scalable to large data sets. To cluster spatial data based on geographic proximity or some neighborhood relationship, a spatial index can be used to speed up the clustering process. More comprehensive reviews are available elsewhere on general clustering (Jain et al. 1999) and spatial clustering (Han et al. 2001; Guo et al. 2003). There is also a variety of methods for interactive visualization of spatial flows with nonspatial views, such as ordered matrices (Guo 2007), or innovative combinations of maps, matrices and other methods, such as Map2 (Guo et al. 2006), interactive OD maps (Wood et al. 2010), and exploratory visualization (Yan and Thill 2008, 2009). For example, Yan and Thill (2009) integrate a self-organizing map with exploratory visualization and maps to analyze spatial interaction data and their associated attributes. Rinzivillo et al. (2008) introduce a visual analytics system that allows interactive clustering of trajectories with difference dissimilarity measures, chosen one at a time to derive interpretable outcomes. However, such interactive systems do not intend to summarize the entire data set in a single flow map. Instead, they rely on the user to interactively select data or cluster for mapping and progressively make sense of data though an iterative process.

3 Methodology In this section, we present our new flow clustering and mapping method that can extract flow clusters and render a visually clear flow map to generalize patterns in massive spatial flows. First, the flow clustering method considers both origins and destinations in determining the similarity of flows, ensures that a flow cluster represents flows from similar origins to similar destinations. Second, with spatial index and search algorithm, the method is scalable to large data sets. Third, it is a hierarchical clustering method and thus has the potential to support multi-resolution flow mapping. Different distance definitions can be incorporated in the method. In this article we focus on a shared-nearest-neighbor distance measure to capture flow clusters of different densities.

3.1 Overview Let T = {Ti} be a set of origin-destination flows, where Ti = {Oi, toi, Di, tdi} is a directed flow that starts at origin location Oi and time toi, and ends at destination Di and time tdi; n = |T| is the number of flows. The time stamp for flows is optional, which can be omitted if all flows are in the same time period. Let O = {Oi = } be all origin locations and D = {Dj = } be all destination locations that are involved in T. Each location has a pair of spatial coordinates and a unique ID. We treat O and D separately in the method as origins and destinations have different meanings for flows. Conceptually, our agglomerative flow clustering method consists of the following steps: 1. Build a spatial index based on a contiguity definition to allow efficient retrieval of the nearest neighbors for a given origin/destination and the nearest flows for a given flow; 2. Define a distance (or dissimilar) measure between flows, which should allow the detection of flow clusters of different spatial densities; 3. Cluster flows with an efficient agglomerative clustering procedure, which iteratively groups flows into a hierarchy and stops at a level controlled by given parameters; 4. Render a flow map with the discovered flow clusters C = {Cl}, which is a generalization of the original set of flows T = {Ti}, |C| ⪡ |T|. © 2014 John Wiley & Sons Ltd


Mapping Large Spatial Flow Data with Hierarchical Clustering

5

3.2 Neighborhood Search for Points and Flows To facilitate presentation and discussion of the spatial index and subsequent clustering steps, the following definitions on neighborhoods for points and flows are needed: Definition 1.1 (Euclidean Neighborhood of a point) The Euclidean neighborhood of an origin point Op is defined as EN(Op, r) = {Oq ∈ O | EuclideanDist(Op, Oq)

Mapping Large Spatial Flow Data with Hierarchical ... - Semantic Scholar

Mapping Large Spatial Flow Data with Hierarchical ... - Semantic Scholar

Suggest Documents

Efficient Hierarchical Clustering of Large Data Sets ... - Semantic Scholar

Efficient Hierarchical Clustering of Large Data Sets ... - Semantic Scholar

HIERARCHICAL SPATIAL REASONING APPLIED TO SPATIAL DATA ...

Hierarchical Characterization and Flow Modeling ... - Semantic Scholar

Managing Socio-spatial Data as Large Graphs - Semantic Scholar

Likelihood Approximation With Hierarchical Matrices For Large Spatial ...

Spatial Data Infrastructures - Semantic Scholar

Hierarchical Visual Mapping with Omnidirectional

Spatial Data Infrastructures - Semantic Scholar

Deep Mapping and Spatial Anthropology - Semantic Scholar

Boundary primacy in spatial mapping - Semantic Scholar

Going with the Flow: Spatial Distributions of ... - Semantic Scholar

Mapping Data-Flow onto Enhanced

Speciation with gene flow in the large white ... - Semantic Scholar

Using Hierarchical Spatial Data Structures for Hierarchical ... - CiteSeerX

151-31: Analysis of Large Hierarchical Data with Multilevel ... - SAS

151-31: Analysis of Large Hierarchical Data with Multilevel ... - SAS

latent perceptual mapping with data-driven ... - Semantic Scholar

Mapping Big Data into Knowledge Space with ... - Semantic Scholar

Mapping Big Data into Knowledge Space with ... - Semantic Scholar

Spatial Indexing of Global Geographical Data with ... - Semantic Scholar

Hierarchical Agglomerative Clustering with ... - Semantic Scholar

Agglomerative Hierarchical Clustering with ... - Semantic Scholar

Hierarchical Design Rewriting with Maude - Semantic Scholar