MapReduce Based Scalable Range Query Architecture for Big Spatial Data

Umut Kızgındere, Süleyman Eken, Ahmet Sayar
Computer Engineering Department, Kocaeli University, İzmit, 41380, Turkey
[email protected], {suleyman.eken, ahmet.sayar}@kocaeli.edu.tr

Abstract—Finding all objects that overlap a given range query is very important for extracting useful information from big spatial data. In this study, in order to realize range queries on large amounts of spatial data, three datasets of different sizes are created and a MapReduce computation model is set up to test the scalability of range queries. Experimental results show that processing times for range queries decrease as the number of conventional machines increases.

Keywords—Big spatial data, range query, MapReduce, scalability

I. INTRODUCTION

With every passing day, the size of spatial big data increases expeditiously. Spatial big data exhibit high volume (scale of data), high velocity (how fast data are being generated), high variety (different variations of data types) and high veracity (uncertainty of data). The increase is largely owing to new technologies, which generate and collect vast amounts of structured, semi-structured or unstructured data. These sources include scientific sensors such as environmental monitoring, location based social networks (LBSNs), geographic information systems (GIS), and the Internet of Things (IoT). Such datasets bring problems related to storage, analysis, and visualization [1-2]. Maintaining, querying and analyzing such big data sets is getting harder, and sometimes impossible, with conventional systems. Distributed systems are generally used to overcome these challenges by storing and processing large scale data in a parallel manner. These systems have to be scalable both for adding new conventional processors (computing nodes) and for running different jobs simultaneously [3-4].

Distributed systems begin with a new form of file system, known as a distributed file system (DFS), which manages storage across a network of machines [5]. Since they are network-based, all the complications of network programming kick in, making distributed file systems more complex than regular disk file systems. For example, one of the biggest challenges is making the file system tolerate node failure without suffering data loss. A DFS provides replication of data, or redundancy, to protect against the frequent media failures that occur when data is distributed over potentially thousands of low cost computing nodes. Hadoop uses the Hadoop Distributed File System (HDFS), the open source counterpart of the Google File System; HDFS is designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware [6-7].

To extract more worthy and useful information from big spatial data, spatial queries are widely used in many applications. There are different types of spatial queries, such as selection queries, join queries, and k nearest neighbor (kNN) queries. In this paper, we present a MapReduce based range query architecture. It enables finding rectangles intersecting with a query region (window query) using a distributed computing framework. Here, a rectangle can be anything that can be modelled with its bottom left and upper right coordinates. For instance, a rectangle can be a mosaic image belonging to a satellite object. Tectonic plates, biomes, watersheds, and sea ice are also examples of rectangle data [8].

This paper is organized as follows: Section 2 introduces the preliminaries on Apache Hadoop. Section 3 describes the related works. Section 4 presents the problem definition. Section 5 details the proposed MapReduce based range query architecture to find rectangles intersected with a user defined region (spatial selection query), and Section 6 concludes the paper.

II. PRELIMINARY KNOWLEDGE OF HADOOP

Hadoop is a distributed master-slave architecture that consists of HDFS -a scalable and reliable file system- for storage and MapReduce for computational capabilities. HDFS has two kinds of nodes: a namenode (master) and datanodes (slaves). The namenode manages the file system namespace and stores metadata for all files and directories. Hadoop runs MapReduce jobs in a parallel manner. To process large amounts of data with the MapReduce programming model, the developer has to define two functions: Map and Reduce. Effects of algorithms on high performance computing are inevitable [9]; Map and Reduce are such algorithms. Inputs and outputs of these functions are records as <key, value> pairs. After users upload input data to HDFS and start jobs by implementing Map and Reduce functions, the jobs are executed on worker nodes as MapTasks or ReduceTasks. Hadoop converts the input files into InputSplits, and each task processes one InputSplit. The InputSplit size should be configured carefully, because an InputSplit can span more than one block if it is chosen to be larger than the HDFS block size (each block in HDFS is 64 MB by default). After all Map tasks are finished, their outputs are sorted and become the input of the Reducer. In other words, the output format of the map function and the input format of the reduce function are the same. Once the Reduce phase is finished and its output has been written back to HDFS, the user retrieves the resulting data. The content of the records can be changed by implementing another class derived from the RecordReader class [10]. The main advantages of the Hadoop MapReduce framework are scalability, cost effectiveness, flexibility, speed, and resilience to failures [11].

978-1-5090-0478-2/15/$31.00 ©2015 IEEE
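Hadoop distributes this Map, sort/shuffle, and Reduce flow across a cluster; purely as an illustration of the record flow described above, it can be sketched on a single machine in Python, using the classic word-count job as the example. The helper names below are our own, not part of any Hadoop API.

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate the Map -> sort/shuffle -> Reduce flow on one machine."""
    # Map phase: every input record yields zero or more (key, value) pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    # Shuffle/sort: map outputs are sorted by key before the Reduce phase,
    # so each reduce call sees one key together with all of its values.
    intermediate.sort(key=itemgetter(0))
    results = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        results.extend(reduce_fn(key, [v for _, v in group]))
    return results

# Word count: Map emits (word, 1) for every word, Reduce sums the 1s.
def wc_map(_, line):
    return [(word, 1) for word in line.split()]

def wc_reduce(word, counts):
    return [(word, sum(counts))]

lines = list(enumerate(["big spatial data", "spatial range query", "big data"]))
print(run_mapreduce(lines, wc_map, wc_reduce))
# [('big', 2), ('data', 2), ('query', 1), ('range', 1), ('spatial', 2)]
```

The same skeleton is reused in Section V: only the map and reduce functions change, while the surrounding sort/group machinery is provided by the framework.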
III. RELATED WORK

In the literature, there are some works using the distributed programming framework to process spatial queries. They can be classified into two types: (i) In the first type, high selectivity queries such as selection queries and kNN queries are handled. After processing the spatial query, only a small fraction of the spatial objects is returned. A few techniques utilizing popular spatial indices such as the R-tree and its variants have been proposed to process high selectivity queries in HDFS [12-13]. (ii) In the second type, low selectivity queries such as the kNN join are handled. After processing the spatial query, many spatial objects are returned. Several techniques have been proposed to process kNN (or similar) joins using the MapReduce framework [14-18].

As Hadoop is not well suited to processing spatial data, SpatialHadoop [19] has been developed on top of Hadoop. SpatialHadoop is an open source Hadoop-extended framework for processing spatial data sets efficiently, with awareness of spatial constructs and operations. Although SpatialHadoop is well suited and designed for spatial data, it does not tackle schema-like spatial datasets such as georeferenced datasets and similar datasets. For this reason, GISQF (Geographic Information System Query Framework) [20] has been developed on top of SpatialHadoop. It is capable of three types of queries: (a) spatial selection queries, (b) circle-area queries, which give all events in a specific region, and (c) aggregation queries.

In an earlier work, we proposed two approaches to overcome the mosaic selection problem [21-23] by means of finding rectangular sub regions intersecting with a range query. The former is based on a hybrid of Apache Hadoop and HBase, and the latter is based on Apache Lucene. Their effectiveness has been compared in terms of response time under a varying number of mosaics [24]. In both approaches, we focused on vertical scalability (different data sizes) instead of horizontal scalability.

Rectangles representing the boundary of a spatial object are a kind of polygon, so polygon (or rectangle) intersection is another topic that we deal with. Detecting whether two rectangle objects intersect is a fundamental problem in computational geometry. Mount presents a survey on geometric intersection [25]. In the literature, there are parallelized versions of several classical polygon intersection algorithms. Parallelizations of the Sutherland-Hodgman and Liang-Barsky algorithms have been done on classic parallel architectures [26-27]. Parallelization of the plane-sweep algorithm for multi-cores is discussed in [28-29]. Puri and Prasad present a parallelization of a plane-sweep based algorithm relying only on parallel primitives such as prefix sum and sorting; they tested their multi-threaded algorithms with real world and synthetic datasets [30].

IV. PROBLEM DEFINITION

Rectangles are defined in a 2-D plane as polygons with their Cartesian coordinates, and are queried by 2-D range queries. Range queries are also called window queries and are defined with rectangles. They are used mostly for regional selections. A range query is a general process to analyze and display spatial data defined by their coordinates in space, and is used in many science and application domains including GIS, astronomy, computer aided design/manufacturing and computer graphics [31].

The spatial data are defined/queried with 2-dimensional (x, y) Cartesian coordinates. The set of (x, y) coordinate values is accessed with range queries. Ranges are called minimum bounding rectangles (MBR) or minimum bounding boxes (MBB); both are the same and are formulated as R = [(minx, miny), (maxx, maxy)], where (minx, miny) refers to the lower left corner and (maxx, maxy) to the upper right corner. minx, miny, maxx and maxy are assumed to be integers or rational numbers. Fig. 1 illustrates the MBRs of different types of spatial objects. MBRs are simple polygons as long as their line segments do not intersect among themselves; in other words, they may be concave. In this study, we tackle intersections of concave polygons.

Fig. 1. MBRs of different types of spatial objects

Fig. 2 illustrates the range query problem for different spatial data. In Fig. 2, the dotted red rectangle represents the query window (range query) and the others represent different spatial objects. Our goal is to find, among millions of rectangles, those coincident with the query rectangle by means of a distributed programming framework. According to Fig. 2, two MBRs (geom_2 and geom_3) intersect the range query.

Fig. 2. An example of range query
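The overlap test implied by this MBR formulation can be sketched in a few lines; the coordinates below are made up for illustration and only loosely mirror the geom_1/geom_2/geom_3 objects of Fig. 2.

```python
from collections import namedtuple

# An MBR R = [(minx, miny), (maxx, maxy)], as formulated above.
MBR = namedtuple("MBR", "minx miny maxx maxy")

def intersects(q, r):
    """Two MBRs overlap unless one lies strictly left/right/below/above the other."""
    return not (q.minx > r.maxx or r.minx > q.maxx or
                q.miny > r.maxy or r.miny > q.maxy)

# Toy data: a query window and three spatial objects (coordinates invented).
query = MBR(2, 2, 6, 6)
geoms = {"geom_1": MBR(7, 7, 9, 9),   # entirely above and right of the query
         "geom_2": MBR(1, 1, 3, 3),   # overlaps the query's lower left corner
         "geom_3": MBR(5, 0, 8, 4)}   # overlaps the query's right edge
print([name for name, r in geoms.items() if intersects(query, r)])
# ['geom_2', 'geom_3']
```

Note that the predicate is phrased negatively (no separating axis between the two rectangles), which correctly treats rectangles that merely touch on an edge as intersecting.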
V. SCALABLE RANGE QUERY ARCHITECTURE

Details of the proposed range query architecture are as follows. All rectangles, with their bottom left and upper right coordinates, are stored in HDFS, and the query region is specified by the user. A spatial region query can be resolved with one MapReduce job consisting of Map and Reduce functions. In the Map function, a filtering strategy is used to find the rectangles intersecting the query region, and the results of the Map stage are stored in the distributed file system directly. In the filtering phase, the bottom left and upper right coordinates of every mosaic are examined to decide whether it intersects the query region or not. Pseudo-code for the filter step is as follows:

Algorithm 1 Map
Input: {Init: q: query; r: rectangle; Key: line number of the file; Value: line content of the file}
Output: {Key: coordinates of intersected rectangle; Value: 1}
1. begin
2.   split a line and extract the coordinates of a rectangle as r.minx, r.miny, r.maxx, and r.maxy
3.   if !(q.minx > r.maxx) && !(r.minx > q.maxx) && !(q.miny > r.maxy) && !(r.miny > q.maxy) then
4.     output (r, 1)
5. end

where (q.minx, q.miny) and (q.maxx, q.maxy) are the bottom left and upper right coordinates of the query region, respectively, and (r.minx, r.miny) and (r.maxx, r.maxy) are the bottom left and upper right coordinates of a rectangle in the rectangle dataset. Each Mapper processes a file, extracts the rectangles (r) intersecting the range query (q), and emits the key/value pair <r, 1>. Pseudo-code for the Reduce function can be defined as follows:

Algorithm 2 Reduce
Input: {Init: sum: total number of rectangles intersected with range query; Key: coordinates of intersected rectangle; Value: 1}
Output: {Key: unused; Value: unused}
1. begin
2.   for each rectangle r in the intersected rectangle list do
3.     sum += 1
4.   end for each
5.   output (sum)
6. end

The Reducer receives key/value pairs of the form <r, 1>. It simply adds up the 1s to provide a final count of the rectangles and sends the result to the output as the value <sum>. After the map and reduce functions are defined, jobs are executed on worker nodes as MapTasks/ReduceTasks. The JobTracker is the main Hadoop process for controlling and scheduling tasks. It assigns Mapper or Reducer roles to the worker nodes by initializing TaskTrackers on them; each TaskTracker runs its Mapper or Reducer task and reports the progress to the JobTracker.

To test the system and evaluate the results, we set up an HDFS cluster with two Hewlett-Packard nodes. Each node has an Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz, 8GB memory, and a 160GB SATA disk. The operating system is Ubuntu with kernel 3.13.0-37-generic; the Hadoop version is 2.6.0 and the Java version is 1.7.0. We built four different test platforms using these two nodes: (i) one NameNode and one DataNode working in the same node, (ii) one NameNode and two DataNodes working in the same node, (iii) one NameNode and two DataNodes working in one node and one DataNode working in another node (three DataNodes in total), and (iv) a traditional Java implementation instead of the distributed framework. Each node has the Hadoop framework installed on a virtual machine. Although virtualization causes some performance loss in total execution efficiency, installation and management of Hadoop become easier by cloning virtual machines.

In order to verify the efficiency of the proposed approach, three datasets of different sizes (1GB, 3GB, and 5GB) are created. Each dataset is composed of millions of rectangle names and their bottom left and upper right coordinates. The average processing speeds of the four test platforms are compared; the results can be seen in Fig. 3. According to the experimental results, when the NameNode and DataNode are in the same computer, they spend more time than the traditional Java implementation, because coordination and data flow between the NameNode and DataNode require extra time. For example, the average processing time for 300 million MBRs is 9.57 minutes with the traditional Java implementation and 11.01 minutes with the NameNode and DataNode in the same computer. The experimental results also show that average processing times decrease as conventional machines are added, as seen in Fig. 3. For example, the average processing time for 300 million MBRs is 7.01 minutes on the third test platform and 6.18 minutes on the fourth test platform.

Fig. 3. Comparison of process times (in minutes) of the different test platforms for the 1GB, 3GB, and 5GB datasets

VI. RESULTS AND FUTURE WORKS

Range query, both on point and general geographic object datasets, has received significant attention in the literature. In this paper, we have shown that the MapReduce parallel programming paradigm can be used to process range queries on big spatial data. The performance evaluation demonstrates the feasibility of processing range queries with MapReduce. The proposed approach can be used in object extraction, object recognition, and image stitching as a preprocessing step. In the near future, we plan to extend the proposed system by implementing other operations for the polygon coverage problem, for example union, difference, etc., on big spatial datasets.
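To make the pseudo-code of Section V concrete, the Map filter and the counting Reduce can be simulated end-to-end on a single machine. This is an illustrative sketch, not the Hadoop job itself: the input line format "name minx miny maxx maxy" and all coordinates are assumptions made for the example.

```python
# Single-machine sketch of the Section V job: Map filters rectangles against
# the query window (Algorithm 1) and Reduce counts the survivors (Algorithm 2).

def map_fn(query, line):
    """Parse one HDFS line and emit <r, 1> if r intersects the query window."""
    name, minx, miny, maxx, maxy = line.split()
    r = tuple(float(v) for v in (minx, miny, maxx, maxy))
    qminx, qminy, qmaxx, qmaxy = query
    # Same negated separating test as Algorithm 1, line 3.
    if not (qminx > r[2] or r[0] > qmaxx or qminy > r[3] or r[1] > qmaxy):
        yield (r, 1)

def reduce_fn(pairs):
    """Add up the 1s to produce the final count (Algorithm 2)."""
    return sum(v for _, v in pairs)

query = (2.0, 2.0, 6.0, 6.0)  # (minx, miny, maxx, maxy) of the query window
lines = ["geom_1 7 7 9 9",    # rejected by the filter
         "geom_2 1 1 3 3",    # intersects
         "geom_3 5 0 8 4"]    # intersects
pairs = [kv for line in lines for kv in map_fn(query, line)]
print(reduce_fn(pairs))
# 2
```

In the real job, each Mapper would run map_fn over its own InputSplit and the framework would route all emitted pairs to the Reducer, which applies reduce_fn once.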
REFERENCES
[1] M. Wessler, Big Data Analytics for Dummies, John Wiley & Sons, Inc., Hoboken: New Jersey, 2013.
[2] A. Rajaraman, J. D. Ullman, Mining of Massive Datasets, Cambridge, United Kingdom: Cambridge University Press, 2012.
[3] İ. Demir, A. Sayar, "Hadoop Optimization for Massive Image Processing: Case Study Face Detection", International Journal of Computers Communications & Control, 9(6): 664-671, 2014.
[4] İ. Demir, A. Sayar, "Hadoop plugin for distributed and parallel image processing", 20th Signal Processing and Communications Applications Conference, Muğla, Turkey, pp. 1-4, 2012.
[5] U. Ergun, S. Eken, A. Sayar, "Güncel Dağıtık Dosya Sistemlerinin Karşılaştırmalı Analizi", 6. Mühendislik ve Teknoloji Sempozyumu, Ankara, Turkey, pp. 213-218, 2013. (in Turkish)
[6] K. Shvachko, H. Kuang, S. Radia, R. Chansler, "The Hadoop Distributed File System", IEEE/NASA Goddard Conference on Mass Storage Systems and Technologies, pp. 1-10, 2010.
[7] J. Dean, S. Ghemawat, "MapReduce: simplified data processing on large clusters", Communications of the ACM, 51(1): 107-113, 2008.
[8] F. Martínez, A. J. Rueda, F. R. Feito, "A new algorithm for computing Boolean operations on polygons", Computers & Geosciences, 35: 1177-1185, 2009.
[9] G. C. Fox, M. S. Aktas, G. Aydin, H. Gadgil, S. Pallickara, M. E. Pierce, A. Sayar, "Algorithms and the Grid", Computing and Visualization in Science, 12(3): 115-124, 2009.
[10] Official Hadoop Web Site, http://hadoop.apache.org/, 2015.
[11] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, I. Stoica, "Spark: cluster computing with working sets", Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, pp. 1-7, 2010.
[12] H. Liao, J. Han, J. Fang, "Multi-dimensional Index on Hadoop Distributed File System", IEEE Fifth International Conference on Networking, Architecture and Storage, pp. 240-249, 2010.
[13] X. Liu, J. Han, Y. Zhong, C. Han, X. He, "Implementing WebGIS on Hadoop: A case study of improving small file I/O performance on HDFS", IEEE International Conference on Cluster Computing and Workshops, pp. 1-8, 2009.
[14] W. Lu, Y. Shen, S. Chen, B. C. Ooi, "Efficient Processing of k Nearest Neighbor Joins Using MapReduce", Proceedings of the VLDB Endowment, 5(10): 1016-1027, 2012.
[15] C. Zhang, F. Li, J. Jestes, "Efficient Parallel kNN Joins for Large Data in MapReduce", Proceedings of the 15th International Conference on Extending Database Technology, pp. 38-49, 2012.
[16] A. Akdogan, U. Demiryurek, F. Banaei-Kashani, C. Shahabi, "Voronoi-Based Geospatial Query Processing with MapReduce", IEEE Second International Conference on Cloud Computing Technology and Science, pp. 9-16, 2010.
[17] M. I. Andreica, N. Tapus, "Sequential and MapReduce-Based Algorithms for Constructing an In-Place Multidimensional Quad-Tree Index for Answering Fixed-Radius Nearest Neighbor Queries", Acta Universitatis Apulensis, pp. 131-151, 2010.
[18] P. Lu, G. Chen, B. C. Ooi, H. T. Vo, S. Wu, "ScalaGiST: Scalable Generalized Search Trees for MapReduce Systems", Proceedings of the VLDB Endowment, 7(14): 1797-1808, 2014.
[19] A. Eldawy, M. F. Mokbel, "A demonstration of SpatialHadoop: An efficient MapReduce framework for spatial data", Proceedings of the VLDB Endowment, 6(12): 1230-1233, 2013.
[20] K. M. Al-Naami, S. Seker, L. Khan, "GISQF: An Efficient Spatial Query Processing System", 2014 IEEE 7th International Conference on Cloud Computing, pp. 681-688, 2014.
[21] S. Eken, A. Sayar, "An automated technique to determine spatio-temporal changes in satellite island images with vectorization and spatial queries", Sadhana, 40, Part 1, pp. 121-137.
[22] A. Sayar, S. Eken, U. Mert, "Registering LandSat-8 Mosaic Images: A Case Study on the Marmara Sea", IEEE 10th International Conference on Electronics Computer and Computation, pp. 375-377, 2013.
[23] A. Sayar, S. Eken, U. Mert, "Tiling of Satellite Images to Capture an Island Object", Communications in Computer and Information Science, 459, pp. 195-204, 2014.
[24] S. Eken, A. Sayar, "Big data frameworks for efficient range queries to extract interested rectangular sub regions", International Journal of Computer Applications, 119(22): 36-39, 2015.
[25] D. M. Mount, "Geometric Intersection", in The Handbook of Discrete and Computational Geometry, 2nd Edition, eds. J. E. Goodman and J. O'Rourke, Chapman & Hall/CRC, Boca Raton, pp. 857-876, 2004.
[26] B. O. Schneider, J. van Welzen, "Efficient polygon clipping for an SIMD graphics pipeline", IEEE Transactions on Visualization and Computer Graphics, 4(3): 272-285, 1998.
[27] T. Theoharis, I. Page, "Two parallel methods for polygon clipping", Computer Graphics Forum, 8(2): 107-114, 1989.
[28] M. McKenney, T. McGuire, "A parallel plane sweep algorithm for multi-core systems", Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp. 392-395, 2009.
[29] A. B. Khlopotine, V. Jandhyala, D. Kirkpatrick, "A variant of parallel plane sweep algorithm for multicore systems", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 32(6): 966-970, 2013.
[30] S. Puri, S. K. Prasad, "Output-Sensitive Parallel Algorithm for Polygon Clipping", 43rd International Conference on Parallel Processing, pp. 241-250, 2014.
[31] A. Sayar, S. Eken, Ö. Öztürk, "Kd-tree and Quad-tree Decompositions for Declustering of 2-D Range Queries over Uncertain Space", Frontiers of Information Technology & Electronic Engineering, 16(2): 98-108, 2015.
Kirkpatrick, "A variant of parallel plane sweep algorithm for multicore systems", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 32(6): 966-970,2013. S. Puri, S.K. Prasad, "Output-Sensitive Parallel Algorithm for Polygon Clipping", 43rd International Conference on Parallel Processing, pp. 241-250,2014. A Sayar, S. Eken, O. 0ztork, "Kd-tree and Quad-tree Decompositions for Declustering of 2-D Range Queries over Uncertain Space", Frontiers of Information Technology & Electronic Engineering, 16(2): 98-108,2015.