A MapReduce Algorithm to Create Contiguity Weights for Spatial Analysis of Big Data

Xun Li, Wenwen Li, Luc Anselin, Sergio Rey, Julia Koschinsky
School of Geographical Sciences and Urban Planning, Arizona State University, Tempe, AZ, USA
[email protected]
ABSTRACT
Spatial analysis of big data is a key component of CyberGIS. However, how to utilize existing cyberinfrastructure (e.g., large computing clusters) to perform parallel and distributed spatial analysis on big data remains a major challenge. Problems such as efficient spatial weights creation, spatial statistics and spatial regression for big data still need investigation. In this research, we propose a MapReduce algorithm for creating contiguity-based spatial weights. The algorithm can create spatial weights from very large spatial datasets efficiently by using computing resources organized in the Hadoop framework. It works in the MapReduce paradigm: mappers are distributed in a computing cluster to find contiguous neighbors in parallel, and reducers then collect the results and generate the weights matrix. To test the performance of this algorithm, we designed an experiment that creates a contiguity-based weights matrix from artificial spatial data with up to 190 million polygons using Amazon's Hadoop framework, Elastic MapReduce. The experiment demonstrates the scalability of this parallel algorithm, which utilizes large computing clusters to solve the problem of creating contiguity weights for big data.

Categories and Subject Descriptors
C.2.4 [Distributed Systems]: Distributed applications

General Terms
Algorithms, Experimentation

Keywords
mapreduce, spatial weights, big data
1. INTRODUCTION
To tackle a new class of challenging scientific problems raised by big spatial data [4], the CyberGIS framework [8] has been proposed as a spatial middleware that can take advantage of the powerful computational resources (e.g., high performance and cloud computing) provided by cyberinfrastructure (CI) [9]. This framework seamlessly integrates distributed geoprocessing components, including spatial data manipulation, geovisualization, spatial pattern detection, spatial process modeling and spatial analysis, and can efficiently utilize the computational capabilities of CI [1]. Among these distributed geoprocessing components, parallel spatial analysis solutions that can handle big spatial data are becoming a prominent part of CyberGIS.

Spatial analysis is a process of data pre-processing, visualization, exploration, model specification, estimation and validation [2]. However, the data structures and algorithms of conventional spatial analysis are designed for, and limited to, desktop computer architectures. They cannot apply spatial analysis to big spatial data because of limited memory space and computing power. It is therefore essential to design and develop a scalable CyberGIS platform to support efficient spatial analysis.

In this research, we focus on spatial weights creation for spatial analysis of big data. Spatial weights are an essential part of spatial analysis because they represent the geographical dependency among spatial objects. Spatial weights matrices are widely used in a variety of spatial analysis methods, such as spatial autocorrelation and spatial regression. Creating spatial weights means extracting the spatial structure, such as spatial neighboring information (contiguity weights) or spatial distances (distance weights), from spatial data. However, traditional spatial weights creation algorithms [6, 3] are designed around, and limited by, local hardware (e.g., CPUs, memory and hard disk), and they are not capable of handling very large spatial datasets.
Therefore, we propose a MapReduce-based algorithm to create contiguity spatial weights from big spatial data. Unlike traditional weights creation algorithms, the proposed algorithm runs in a distributed and parallel fashion on a high performance computing cluster organized with the Hadoop framework [7]. It works in the MapReduce paradigm: the big spatial data is chunked into pieces, mappers are distributed to different nodes in the computing cluster to find the neighbors of the spatial objects in each piece, and reducers then collect the results from the different nodes to generate the weights file. Both mappers and reducers work in parallel to achieve better performance. To test the performance of this algorithm, we designed experiments that create queen contiguity weights files from artificial spatial data with up to 190 million polygons using Amazon's Hadoop framework, Elastic MapReduce. The experiments demonstrate the ability of this algorithm to utilize high performance computing resources to create contiguity-based spatial weights from big spatial data.
2. MAPREDUCE SPATIAL WEIGHTS CREATION

2.1 Spatial Weights
Spatial weights are an essential component of spatial analysis (e.g., spatial autocorrelation tests, spatial regression) whenever a representation of spatial structure is needed. The spatial structure of a set of spatial features is usually described by a spatial weights matrix W with N rows and N columns, where N is the number of geometric features. When feature_i and feature_j are defined as neighbors, the cell value w_ij ≠ 0. For a contiguity weights matrix, w_ij is either 1 (feature_i and feature_j are neighbors) or 0 (they are not). In this research, we focus on contiguity weights creation.

We define a point as a tuple of two coordinates, point = (x, y), and a polygon as a set of M points, Polygon = {point_1, point_2, ..., point_M}. For a spatial dataset containing N polygons, DS = {Polygon_1, Polygon_2, ..., Polygon_N}, constructing a contiguity-based weights matrix W means finding all neighboring polygons of every Polygon_i, i ∈ [1, N], in DS. Three types of contiguity determine the value distribution in the weights matrix: rook contiguity (neighbors have to share an edge), bishop contiguity (neighbors have to share a corner), and queen contiguity (neighbors share either a corner or an edge).

Conventional algorithms for creating a contiguity weights matrix use the geometries to detect whether two polygons share any edges or vertices. If contiguity is obtained by comparing vertices or edges for all pairs of geometries, the process is very computationally intensive, with time complexity O(N^2). By spatially indexing the geometries, the algorithms in GeoDa and PySAL (https://github.com/pysal/) achieve a fast O(log N) search for candidate neighbors; a spatial index (e.g., an R-tree) narrows down the number of candidates. Still, the raw point or edge comparison between the candidates and the target geometry is needed to determine whether two geometries are neighbors, so the best performance of this type of algorithm is O(N log N). Moreover, these algorithms rely on computers that can load all geometries into memory. Therefore, they are not capable of creating contiguity weights from very large spatial datasets.
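As a concrete illustration (a toy layout made up for this write-up, not taken from the paper's data), consider three polygons placed side by side in a row, so that polygon 1 touches polygon 2 and polygon 2 touches polygon 3. The queen contiguity matrix W is then:

# Queen contiguity matrix for three polygons in a row (toy example):
# polygon 1 and 2 share an edge, polygon 2 and 3 share an edge,
# polygon 1 and 3 share nothing, and the diagonal is 0 by convention.
W = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
# Contiguity weights are symmetric: w_ij == w_ji.
assert all(W[i][j] == W[j][i] for i in range(3) for j in range(3))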
Algorithm 1: Map Algorithm of Contiguity Weights Creation

point_polygon_dict ← {}
/* System input: one polygon per line: poly_id point_1 ... point_M */
for line ∈ sys.stdin do
    items ← line.split()
    poly_id ← items[0]
    for point ∈ items[1:] do
        if point ∉ point_polygon_dict then
            point_polygon_dict[point] ← set()
        end
        point_polygon_dict[point].add(poly_id)
    end
end
/* Create output for reducers */
for (point, neighbors) ∈ point_polygon_dict.items() do
    if neighbors.length = 1 then
        print neighbors
    else
        for master_poly ∈ neighbors do
            for neighbor_poly ∈ neighbors do
                if master_poly ≠ neighbor_poly then
                    print master_poly, neighbor_poly
                end
            end
        end
    end
end
In this research, we propose a MapReduce algorithm that works with the Hadoop system to create contiguity weights from very large spatial datasets. The algorithm is based on the following strategy: summarize the polygons by the vertices/edges they contain; if a vertex or an edge appears in two polygons, those two polygons are queen-contiguous neighbors. To illustrate the approach clearly, we describe the MapReduce algorithm for queen contiguity weights creation (the map step is listed in Algorithm 1 above); it can easily be modified for rook or bishop contiguity using the same summarization logic.

First, to parallelize the map task across multiple compute nodes, Hadoop chunks the data equally into several portions, each of which is processed by one node. On each node, the mapper creates a dictionary entry for every vertex and appends the associated polygons to the value set. The Hadoop system then shuffles and sorts the dictionaries created on all nodes for the computation in the reduce phase. The reducers merge these dictionaries by key (vertex); the values, which are sets of contiguous polygons sharing the same key, are combined across all dictionaries to generate the contiguity weights file.
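As a toy illustration of this strategy (pure Python; the two squares and their coordinates are invented for this example), two polygons that share an edge also share that edge's two vertices, so they end up together in the value sets of those vertex keys:

# Two hypothetical unit squares, A and B, sharing the edge from (1,0) to (1,1).
polygons = {
    "A": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "B": [(1, 0), (2, 0), (2, 1), (1, 1)],
}

# Map-side summarization: vertex -> set of polygons containing it.
point_polygon_dict = {}
for poly_id, points in polygons.items():
    for point in points:
        point_polygon_dict.setdefault(point, set()).add(poly_id)

# Vertices (1, 0) and (1, 1) map to {"A", "B"}, so A and B are queen neighbors.
shared = {key: ids for key, ids in point_polygon_dict.items() if len(ids) > 1}
print(shared)  # e.g. {(1, 0): {'A', 'B'}, (1, 1): {'A', 'B'}} (set order may vary)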
2.2 A MapReduce Algorithm
Map. The main purpose of the mapper algorithm (Algorithm 1) is to create a {key: value} dictionary with a vertex as the key and the set of polygons containing that vertex as the value. The algorithm first reads data from Hadoop's standard input and processes it line by line. Each line represents the geometry of one polygon in comma-separated format: poly_id, point_1, point_2, ..., point_M. The content is parsed and stored in the dictionary point_polygon_dict. When a mapper finishes processing its data, it iterates over all entries of point_polygon_dict and prepares (key, value) data for the reducers. Since each value in point_polygon_dict is the set of polygons that share the same key (vertex), those polygons are considered neighbors. The mapper then writes key-value pairs {polygon: neighbor_polygon} as neighboring information for the reducers.
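A minimal sketch of this map step as a Hadoop Streaming script in Python is shown below. The comma-separated input layout follows the description above, while the tab-separated output and the handling of polygons without neighbors are illustrative assumptions rather than details fixed by the paper.

#!/usr/bin/env python
# mapper.py: sketch of the map step (vertex -> polygons summarization).
# Assumed input line format: poly_id,point_1,point_2,...,point_M
import sys

point_polygon_dict = {}

for line in sys.stdin:
    items = line.strip().split(",")
    if not items or not items[0]:
        continue
    poly_id, points = items[0], items[1:]
    for point in points:
        point_polygon_dict.setdefault(point, set()).add(poly_id)

# Emit one "master<TAB>neighbor" pair per neighboring relation; a polygon that
# owns a vertex alone is emitted by itself so isolated polygons are not lost.
for point, neighbors in point_polygon_dict.items():
    if len(neighbors) == 1:
        print("%s\t" % next(iter(neighbors)))
    else:
        for master_poly in neighbors:
            for neighbor_poly in neighbors:
                if master_poly != neighbor_poly:
                    print("%s\t%s" % (master_poly, neighbor_poly))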
Figure 1: Artificial data created by duplicating the base map four times in a row.
Reduce. The Hadoop system monitors and collects the outputs from all mappers. Once the progress of the map task reaches a threshold set by the system or predefined by the user, Hadoop starts the reduce task. The reduce task has three steps: shuffle, sort and reduce. In the shuffle step, the Hadoop system shuffles and transfers the map outputs to the reducers as inputs. In the sort step, the map outputs are sorted by the master polygon id (key) into {poly_id: neighbor_poly_id} records. The shuffle and sort steps occur simultaneously to make sure that the input to every reducer is sorted correctly. In the reduce step, the procedure defined in Algorithm 2 is executed in each reducer in parallel to generate the content of the weights file.

Figure 2: The results of using six computer nodes to create contiguity weights from different sizes of data.
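A matching Hadoop Streaming reducer can be sketched as follows. It assumes the tab-separated pairs emitted by the mapper sketch above and, mirroring Algorithm 2, writes one GAL-style record per polygon (the polygon id and its neighbor count on one line, the neighbor ids on the next); these layout details are assumptions for illustration.

#!/usr/bin/env python
# reducer.py: sketch of the reduce step (group neighbor pairs by polygon).
# Hadoop Streaming delivers the mapper output sorted by key, so all pairs
# belonging to one master polygon arrive consecutively.
import sys

def write_weights(poly_id, neighbor_set):
    # GAL-style record: "<id> <number of neighbors>" then the neighbor ids.
    print("%s %d" % (poly_id, len(neighbor_set)))
    print(" ".join(sorted(neighbor_set)))

current_poly = None
current_neighbors = set()

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    master = fields[0]
    neighbor = fields[1] if len(fields) > 1 and fields[1] else None
    if master != current_poly and current_poly is not None:
        write_weights(current_poly, current_neighbors)
        current_neighbors = set()
    current_poly = master
    if neighbor is not None:
        # The set deduplicates pairs emitted once per shared vertex.
        current_neighbors.add(neighbor)

if current_poly is not None:
    write_weights(current_poly, current_neighbors)

The two scripts would typically be launched through the Hadoop Streaming jar, e.g. hadoop jar hadoop-streaming.jar -input <polygon data> -output <weights output> -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py, where the jar path and the input/output locations are placeholders for the actual cluster setup.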
Generate Contiguity Weights File. Since each reducer writes its output only to local disk, an additional merge phase is needed to combine all individual results into a valid weights file. In this research, the distributed copy tool (DistCp) provided by the Hadoop platform is used for this merge task in the MapReduce paradigm. To accelerate the merge, the reducers are configured to compress their output into GNU zip (gzip) format, so that data transfer between the data server node and the computing nodes is fast and the compressed files can be concatenated directly.
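The last point works because concatenated gzip members form a valid gzip stream. As a minimal local sketch of the idea (the part-file pattern and output name are hypothetical), the compressed reducer outputs could be stitched together like this:

# merge_parts.py: concatenate gzip-compressed reducer outputs into one file.
import glob
import shutil

with open("weights.gal.gz", "wb") as merged:
    for part in sorted(glob.glob("reducer_output/part-*.gz")):
        with open(part, "rb") as chunk:
            shutil.copyfileobj(chunk, merged)
# Decompressing weights.gal.gz then yields the full weights file.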
3. EXPERIMENT

3.1 Sample dataset
The base map used in this experiment is the parcel data of the city of Chicago in the United States, obtained from the City of Chicago. This parcel dataset contains 592,521 polygons. To simulate large datasets, we use this base map to create artificial big data: the base map is duplicated several times and the copies are placed side by side to generate an artificial big map. For example, the 4x dataset with 2,370,084 polygons is shown in Figure 1. The largest dataset created for this experiment is the 32x dataset with 18,960,672 polygons. The overall datasets include the 1x, 2x, 4x, 8x, 16x and 32x data.
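A minimal sketch of this duplication step, assuming the base-map polygons are available as Shapely geometries (Shapely and the helper name are illustrative assumptions, not the paper's actual tooling):

# Tile the base map k times along the x-axis to build an artificial big map.
from shapely.affinity import translate

def duplicate_base_map(base_polygons, k):
    # base_polygons: list of shapely Polygons; k: number of side-by-side copies.
    min_x = min(p.bounds[0] for p in base_polygons)
    max_x = max(p.bounds[2] for p in base_polygons)
    width = max_x - min_x
    big_map = []
    for i in range(k):
        big_map.extend(translate(p, xoff=i * width) for p in base_polygons)
    return big_map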
3.2 Testing System
This research uses the Amazon Elastic MapReduce (EMR) service (http://aws.amazon.com/) to create a Hadoop testing system. Amazon EMR provides an easy-to-use, customizable Hadoop system; this research uses the default Hadoop configuration provided by Amazon. We select a cluster of "C3 Extra Large" (C3.xlarge) instances, scaled from 1 node to 18 nodes, running on Amazon EMR. Besides the computing cluster, the Hadoop system runs a Master node to monitor and communicate with all instances. The configuration of a C3.xlarge node includes 7.5 GB of memory, a CPU with 14 ECUs (4 cores x 3.5 units), 80 GB of storage (2 x 40 GB SSD), a 64-bit operating system and moderate network speed (500 Mbps). Besides the Hadoop testing system, we also tested the same MapReduce algorithm on a single computer with the following configuration: 2.93 GHz 8-core CPU, 16 GB memory, 100 GB HD and a 64-bit operating system.
3.3 Results

To test this MapReduce algorithm, we use Python to implement a desktop version and a Hadoop version that can be executed via Hadoop's streaming pipeline. The first experiment runs the MapReduce algorithm on the single testing machine. The running time is shown in Figure 2 (red line). This O(N) algorithm reaches its maximum computation capacity when handling the 16x dataset (9,480,338 polygons), and the running time increases exponentially.

The second experiment runs the MapReduce algorithm on the Amazon EMR Hadoop system. First, we configure a Hadoop system with one Master node and six C3.xlarge nodes to test the algorithm with the 1x, 2x, 4x, 8x, 16x and 32x data respectively. The runtime for each dataset is shown in Figure 2 (blue line). Since Hadoop spends extra time delivering the program to and communicating with the running nodes, it is actually slower than running the same program on the desktop computer for datasets smaller than four times the raw data (about 2 million polygons). However, the bigger the data, the better the performance this algorithm achieves on the Hadoop system. For example, for the 8x data, the algorithm on Hadoop took 167 seconds to complete, much faster than on the desktop computer (482.67 seconds). We can also observe that the running time increases linearly, which indicates that this algorithm scales with growing data size.

Figure 3: The results of using different numbers of computing nodes to create contiguity weights from the 32x data.

In the next test, we create Hadoop systems with 6, 12, 14 and 18 computer nodes to create contiguity weights from the 32x data. The running times are shown in Figure 3. The best performance across all tests is obtained with 18 computer nodes, which create the contiguity weights file for the 32x data in 163 seconds. The running time in Figure 3 does not decline linearly with the increasing number of computing nodes. This is reasonable, since a larger number of computing nodes spends extra time communicating inside the Hadoop system.
Algorithm 2: Reduce Algorithm of Contiguity Weights Creation

current_master_poly ← None
current_neighbor_set ← set()
temp_master_poly ← None
/* System input (sorted by key): master_poly_id neighbor_poly_id */
for line ∈ sys.stdin do
    fields ← line.split()
    temp_master_poly ← fields[0]
    temp_neighbor_poly ← None
    if fields.length > 1 then
        temp_neighbor_poly ← fields[1]
    end
    if current_master_poly = temp_master_poly then
        if temp_neighbor_poly ≠ None then
            current_neighbor_set.add(temp_neighbor_poly)
        end
    else
        if current_master_poly ≠ None then
            WriteWeightsFile(current_master_poly, current_neighbor_set)
        end
        current_master_poly ← temp_master_poly
        if temp_neighbor_poly = None then
            current_neighbor_set ← set()
        else
            current_neighbor_set ← set([temp_neighbor_poly])
        end
    end
end
/* Process the last group if needed */
if current_master_poly ≠ None then
    /* Write GAL results to the output weights file */
    num_neighbors ← current_neighbor_set.length()
    print current_master_poly, num_neighbors
    print current_neighbor_set.items()
end

4. CONCLUSION
In this paper, we propose a MapReduce algorithm to create contiguity weights matrices for spatial analysis of big data. We demonstrate the capability and efficiency of this algorithm by generating the weights file for big spatial data (about 190 million polygons), which is not possible to process on desktop computers. From the results, we conclude that this algorithm can solve the problem of creating contiguity weights for big spatial data by utilizing high performance computing resources, such as the Amazon EC2 cloud computing platform. Ongoing work focuses on extending this MapReduce contiguity weights creation algorithm to distance-based weights creation. Distance-based spatial weights are often used when point data are involved in spatial analysis and statistics. However, creating distance-based weights differs substantially from creating contiguity weights: the distances between geometric features have to be calculated. Existing research on MapReduce-based kNN query processing [5] can be applied to build a MapReduce algorithm that creates distance weights effectively.
5. REFERENCES
[1] L. Anselin. From SpaceStat to CyberGIS: twenty years of spatial data analysis software. International Regional Science Review, 35(2):131–157, 2012.
[2] L. Anselin and S. J. Rey. Spatial econometrics in an age of CyberGIScience. International Journal of Geographical Information Science, 26(12):2211–2226, 2012.
[3] L. Anselin, I. Syabri, and Y. Kho. GeoDa: an introduction to spatial data analysis. Geographical Analysis, 38(1):5–22, 2006.
[4] M. F. Goodchild. Whose hand on the tiller? Revisiting "spatial statistical analysis and GIS". pages 49–59, 2010.
[5] C. Ji, T. Dong, Y. Li, Y. Shen, K. Li, W. Qiu, W. Qu, and M. Guo. Inverted grid-based kNN query processing with MapReduce. In ChinaGrid Annual Conference (ChinaGrid), 2012 Seventh, pages 25–32. IEEE, 2012.
[6] S. J. Rey and L. Anselin. PySAL: a Python library of spatial analytical methods. pages 175–193, 2010.
[7] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop distributed file system. In Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, pages 1–10. IEEE, 2010.
[8] S. Wang. A CyberGIS framework for the synthesis of cyberinfrastructure, GIS, and spatial analysis. Annals of the Association of American Geographers, 100(3):535–557, 2010.
[9] S. Wang, L. Anselin, B. Bhaduri, C. Crosby, M. F. Goodchild, Y. Liu, and T. L. Nyerges. CyberGIS software: a synthetic review and integration roadmap. International Journal of Geographical Information Science, 27(11):2122–2145, 2013.