MapReduce Based Scalable Range Query Architecture for Big Spatial Data

Umut Kızgındere#, Süleyman Eken#, Ahmet Sayar#
#Computer Engineering Department, Kocaeli University, İzmit, 41380, Turkey
#[email protected], #{suleyman.eken, ahmet.sayar}@kocaeli.edu.tr

Abstract—Finding all objects that overlap a given range query is very important for extracting useful information from big spatial data. In this study, in order to realize range queries over large amounts of spatial data, three datasets of different sizes are created and a MapReduce computation model is set up to test the scalability of range queries. Experimental results show that processing times for range queries decrease as the number of conventional machines increases.

Keywords—Big spatial data, range query, MapReduce, scalability

I. INTRODUCTION

With every passing day, the size of spatial big data increases rapidly. Spatial big data are characterized by high volume (scale of data), high velocity (how fast data is being generated), high variety (different variations of data types) and high veracity (uncertainty of data). The increase is largely owing to new technologies that generate and collect vast amounts of structured, semi-structured or unstructured data. These sources include scientific sensors such as environmental monitoring, location based social networks (LBSNs), geographic information systems (GIS), and the Internet of Things (IoT). These datasets bring problems related to storage, analysis, and visualization [1-2]. Maintaining, querying and analyzing such big data sets is getting harder, and sometimes impossible, with conventional systems. Distributed systems are generally used to overcome these challenges by storing and processing large scale data in a parallel manner. These systems have to be scalable both for adding new conventional processors (computing nodes) and for running different jobs simultaneously [3-4]. Distributed systems begin with a new form of file system, known as a distributed file system (DFS), which manages storage across a network of machines [5]. Since they are network-based, all the complications of network programming kick in, making distributed file systems more complex than regular disk file systems. For example, one of the biggest challenges is making the file system tolerate node failure without suffering data loss. A DFS provides replication of data, or redundancy, to protect against the frequent media failures that occur when data is distributed over potentially thousands of low cost computing nodes. Hadoop uses the Hadoop Distributed File System (HDFS), an open source counterpart of the Google File System; HDFS is designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware [6-7].

To extract more worthy and useful information from big spatial data, spatial queries are widely used in many applications. There are different types of spatial queries, such as the selection query, join query, and k nearest neighbor (kNN) query. In this paper, we present a MapReduce based range query architecture. It enables finding rectangles intersecting with a query region (window query) using a distributed computing framework. Here, a rectangle can be anything that can be modelled with its bottom left and upper right coordinates. For instance, a rectangle can be a mosaic image belonging to a satellite object. Tectonic plates, biomes, watersheds, and sea ice are also examples of rectangle data [8].

This paper is organized as follows: Section 2 introduces the preliminaries on Apache Hadoop. Section 3 describes the related work. Section 4 presents the problem definition. Section 5 details the proposed MapReduce based range query architecture to find rectangles intersected with a user defined region (spatial selection query), and Section 6 concludes the paper.

II. PRELIMINARY KNOWLEDGE OF HADOOP

Hadoop is a distributed master-slave architecture that consists of HDFS, a scalable and reliable file system, for storage, and MapReduce for computational capabilities. HDFS has two kinds of nodes: the namenode (master) and datanodes (slaves). The namenode manages the file system namespace and stores metadata for all files and directories. Hadoop runs MapReduce jobs in a parallel manner. To process large amounts of data with the MapReduce programming model, the developer has to define two functions: Map and Reduce. Effects of algorithms on high performance computing are inevitable [9]; Map and Reduce are such algorithms. The inputs and outputs of these functions are records in the form of <key, value> pairs. After users upload input data to HDFS and start jobs by implementing Map and Reduce functions, the jobs are executed on worker nodes as MapTasks or ReduceTasks. Hadoop converts the input files into InputSplits, and each task processes one InputSplit. The InputSplit size should be configured carefully, because an InputSplit can span more than one block if the InputSplit size is chosen to be larger than the HDFS block size (each block in HDFS is 64 MB by default). After all Map tasks are finished, their outputs are sorted and become the input of the Reducer. In other words, the output format of the map function and the input format of the reduce function are the same. Once the Reduce phase is finished and its output has been written back to HDFS, the user retrieves the resulting data. The content of the records can be changed by implementing a class derived from the RecordReader class [10]. The main advantages of the Hadoop MapReduce framework are scalability, cost effectiveness, flexibility, speed, and resilience to failures [11].

978-1-5090-0478-2/15/$31.00 ©2015 IEEE
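The <key, value> record flow described above can be simulated outside Hadoop. The following Python sketch (an illustration only, not the Hadoop API) runs a map function over input records, sorts and groups the intermediate pairs by key as the framework's shuffle phase would, and feeds each group to a reduce function; a word-count job serves as the example Map/Reduce pair.

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(lines, map_fn, reduce_fn):
    # Map phase: each input record is an (offset, line_content) pair,
    # mirroring Hadoop's default text input records.
    intermediate = []
    for offset, line in enumerate(lines):
        intermediate.extend(map_fn(offset, line))
    # Shuffle/sort: the framework sorts map output by key and groups
    # all values sharing a key before handing them to the Reducer.
    intermediate.sort(key=itemgetter(0))
    results = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        results.extend(reduce_fn(key, [value for _, value in group]))
    return results

# A classic word-count job expressed as Map and Reduce functions.
def wc_map(offset, line):
    return [(word, 1) for word in line.split()]

def wc_reduce(word, counts):
    return [(word, sum(counts))]

counts = dict(run_mapreduce(["big spatial data", "big data"], wc_map, wc_reduce))
```

Note that, as in Hadoop, the map output format and the reduce input format coincide: both are lists of <key, value> pairs.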

III. RELATED WORK

In the literature, there are some works using the distributed programming framework to process spatial queries. They can be classified into two types. (i) In the first type, high selectivity queries such as selection queries and kNN queries are handled: after processing the spatial query, only a small fraction of the spatial objects is returned. A few techniques, which utilize popular spatial indices such as the R-tree and its variants, have been proposed to process high selectivity queries in HDFS [12-13]. (ii) In the second type, low selectivity queries such as the kNN join are handled: after processing the spatial query, many spatial objects are returned. Several techniques have been proposed to process kNN (or similar) joins using the MapReduce framework [14-18].

As Hadoop is not well suited to processing spatial data, SpatialHadoop [19] has been developed on top of Hadoop. SpatialHadoop is an open source Hadoop-extended framework for processing spatial data sets efficiently; it has awareness of spatial constructs and operations. Despite the fact that SpatialHadoop is well suited and designed for spatial data, it does not tackle schema-like spatial datasets such as georeferenced datasets and similar datasets. For this reason, GISQF (Geographic Information System Query Framework) [20] has been developed on top of SpatialHadoop. It is capable of three types of queries: (a) spatial selection query, (b) circle-area query, which returns all events in a specific region, and (c) aggregation query.

In an earlier work, we proposed two approaches to overcome the mosaic selection problem [21-23] by means of finding rectangular sub regions intersecting with a range query. The former is based on a hybrid of Apache Hadoop and HBase, and the latter is based on Apache Lucene. Their effectiveness has been compared in terms of response time under a varying number of mosaics [24]. In both approaches, we focused on vertical scalability (different data sizes) instead of horizontal scalability.

Rectangles representing the boundary of a spatial object are a kind of polygon, so polygon (or rectangle) intersection is another topic that we deal with. Detecting whether two rectangle objects intersect is a fundamental problem in computational geometry; Mount presents a survey on geometric intersection [25]. In the literature, there are parallelized versions of several classical polygon intersection algorithms. Parallelizations of the Sutherland-Hodgman and Liang-Barsky algorithms have been done on classic parallel architectures [26-27]. Parallelization of the plane-sweep algorithm for multi-cores is discussed in [28-29]. Puri and Prasad present a parallelization of a plane-sweep based algorithm relying only on parallel primitives such as prefix sum and sorting; they tested their multi-threaded algorithms with real world and synthetic datasets [30].

IV. PROBLEM DEFINITION

Rectangles are defined in a 2-D plane as polygons with their Cartesian coordinates, and are queried by 2-D range queries. Range queries are also called window queries and are themselves defined by rectangles. They are used mostly for regional selections. A range query is a general process to analyze and display spatial data defined by their coordinates in space, and is used in many science and application domains including GIS, astronomy, computer aided design/manufacturing and computer graphics [31].

The spatial data are defined and queried with 2-dimensional (x, y) Cartesian coordinates. The set of (x, y) coordinate values is accessed with range queries. Ranges are called minimum bounding rectangles (MBR) or minimum bounding boxes (MBB); both are the same and are formulated as R = [(minx, miny), (maxx, maxy)], where (minx, miny) refers to the lower left corner and (maxx, maxy) to the upper right corner. minx, miny, maxx and maxy are assumed to be integers or rational numbers. Fig. 1 illustrates the MBRs of different types of spatial objects. MBRs are simple polygons, since their line segments do not intersect among themselves; in other words, they are convex. In this study, we tackle intersections of such polygons.

[Figure omitted in this text version: three spatial objects (geom_1 among them) with their (minx, miny)/(maxx, maxy) corners annotated.]
Fig. 1. MBRs of different types of spatial objects

Fig. 2 illustrates the range query problem for different spatial data. In Fig. 2, the dotted red rectangle represents the query window (range query) and the others represent different spatial objects. Our goal is to find, from among millions of rectangles, those coincident with the query rectangle by means of a distributed programming framework. According to Fig. 2, two MBRs (geom_2 and geom_3) intersect with the range query.

[Figure omitted in this text version: a dotted query window overlapping some of the MBRs, with (minx, miny)/(maxx, maxy) corners annotated.]
Fig. 2. An example of range query
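The R = [(minx, miny), (maxx, maxy)] representation and the window-overlap test it supports can be sketched in a few lines of Python (a minimal illustration under our own naming; the paper itself defines only the notation):

```python
def mbr(points):
    # Minimum bounding rectangle R = [(minx, miny), (maxx, maxy)]
    # of a set of (x, y) vertices of a spatial object.
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys)), (max(xs), max(ys))

def overlaps(a, b):
    # Two MBRs intersect unless one lies strictly to the left/right of,
    # or strictly above/below, the other.
    (aminx, aminy), (amaxx, amaxy) = a
    (bminx, bminy), (bmaxx, bmaxy) = b
    return not (aminx > bmaxx or bminx > amaxx or
                aminy > bmaxy or bminy > amaxy)

# A triangular spatial object and a query window, as in Fig. 2.
geom = mbr([(2, 1), (5, 4), (3, 6)])   # its MBR is ((2, 1), (5, 6))
query = ((4, 5), (8, 9))
hit = overlaps(geom, query)
```

The `overlaps` predicate is exactly the filter condition used by the Map function of Section V.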

V. SCALABLE RANGE QUERY ARCHITECTURE

Details of the proposed range query architecture are as follows. All rectangles, with their bottom left and upper right coordinates, are stored in HDFS. The query region is specified by the user. The spatial region query can be resolved with one MapReduce job, which consists of a Map and a Reduce function. In the Map function, a filtering strategy is used to find the rectangles intersected with the query region; the results of the Map stage are stored in the distributed file system directly. In the filtering phase, the bottom left and upper right coordinates of every mosaic are examined to determine whether it intersects with the query region or not. Pseudo-code for the filter step is as follows:

Algorithm 1 Map
Input: {Init: q: query; r: rectangle; Key: line number of the file; Value: line content of the file}
Output: {Key: coordinates of intersected rectangle; Value: 1}
1. begin
2.   split a line and extract the coordinates of a rectangle as r.minx, r.miny, r.maxx, and r.maxy
3.   if !(q.minx > r.maxx) && !(r.minx > q.maxx) && !(q.miny > r.maxy) && !(r.miny > q.maxy) then output (r, 1)
4. end

where (q.minx, q.miny) and (q.maxx, q.maxy) denote the bottom left and upper right coordinates of the query region, respectively, and (r.minx, r.miny) and (r.maxx, r.maxy) denote the bottom left and upper right coordinates of a rectangle in the rectangle dataset. Each Mapper processes a file, extracts the rectangles (r) intersected with the range query (q) and emits the key/value pair <r, 1>. Pseudo-code for the Reduce function is as follows:

Algorithm 2 Reduce
Input: {Init: sum: total number of rectangles intersected with the range query; Key: coordinates of intersected rectangle; Value: 1}
Output: {Key: unused; Value: unused}
1. begin
2.   for each (rectangle r in intersected rectangle list) do
3.     sum += 1
4.   end for each
5.   output (sum)
6. end

The Reducer receives key/value pairs of the form <r, 1>. It simply adds up the 1s to provide a final count of the rectangles, and sends the result to the output as the value <sum>. After the map and reduce functions are defined, jobs are executed on worker nodes as MapTasks/ReduceTasks. The JobTracker is the main Hadoop process for controlling and scheduling tasks; it assigns Mapper or Reducer roles to the worker nodes by initializing TaskTrackers in those nodes. Each TaskTracker runs its Mapper or Reducer task and reports progress to the JobTracker.

To test the system and evaluate the results, we have set up an HDFS cluster with two Hewlett-Packard nodes. Each node has an Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz, 8GB memory, and a 160GB SATA disk. The operating system is Ubuntu with kernel 3.13.0-37-generic; the Hadoop version is 2.6.0 and the Java version is 1.7.0. We set up four different test platforms using these two nodes: (i) one NameNode and one DataNode working in the same node, (ii) one NameNode and two DataNodes working in the same node, (iii) one NameNode and two DataNodes working in one node and one DataNode working in another node (three DataNodes in total), and (iv) a traditional Java implementation instead of the distributed framework. Each node has a Hadoop framework installed on a virtual machine. Although virtualization causes some performance loss in total execution efficiency, installation and management of Hadoop become easier by cloning virtual machines.

In order to verify the efficiency of the proposed approach, three datasets of different sizes (1GB, 3GB, and 5GB) were created. Each dataset is composed of millions of rectangle names together with their bottom left and upper right coordinates. The average processing times of the four test platforms are compared; the results can be seen in Fig. 3. According to the experimental results, when the NameNode and DataNode are in the same computer, they spend more time than the traditional Java implementation, because coordination and data flow between the NameNode and DataNode require extra time. For example, the average processing time for 300 million MBRs with the traditional Java implementation is 9.57 minutes, while with the NameNode and DataNode in the same computer it is 11.01 minutes. The experimental results also show that average processing times decrease as the number of conventional machines increases, as seen in Fig. 3. For example, the average processing time for 300 million MBRs with the third test platform is 7.01 minutes, and with the fourth test platform it is 6.18 minutes.

[Bar chart omitted in this text version; x-axis: data size (1GB, 3GB, 5GB), y-axis: time (min); series: traditional Java implementation; one NameNode and one DataNode in the same node; one NameNode and two DataNodes in the same node; one NameNode and two DataNodes in one node and one DataNode in another node.]
Fig. 3. Comparison of process times of different test platforms

VI. RESULTS AND FUTURE WORKS

Range query, both on point datasets and on general geographic object datasets, has received considerable attention in the literature. In this paper, we have shown that the MapReduce parallel programming paradigm can be used to process range queries on big spatial data. The performance evaluation demonstrates the feasibility of processing range queries with MapReduce. The proposed approach can be used in object extraction, object recognition, and image stitching as a preprocessing step. In the near future, we plan to extend the proposed system with other polygon operations, for example union, difference, etc., for implementing the coverage problem on big spatial datasets.
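As a paper-and-pencil check of Algorithms 1 and 2, the filter-and-count job can be simulated in plain Python. This is a sketch of the logic only — the actual system runs Java Map/Reduce tasks over HDFS, and the whitespace-separated input line format assumed here is ours:

```python
def map_filter(line, q):
    # Algorithm 1 (Map): parse a line of the form "name minx miny maxx maxy"
    # (an assumed format) and emit (coordinates, 1) if the rectangle
    # intersects the query region q = (q_minx, q_miny, q_maxx, q_maxy).
    _name, minx, miny, maxx, maxy = line.split()
    r = tuple(float(v) for v in (minx, miny, maxx, maxy))
    q_minx, q_miny, q_maxx, q_maxy = q
    if not (q_minx > r[2]) and not (r[0] > q_maxx) and \
       not (q_miny > r[3]) and not (r[1] > q_maxy):
        return [(r, 1)]
    return []

def reduce_count(pairs):
    # Algorithm 2 (Reduce): add up the 1s emitted by the mappers to
    # obtain the total number of intersected rectangles.
    return sum(value for _, value in pairs)

dataset = [
    "rect_a 0 0 2 2",    # entirely outside the query window
    "rect_b 3 3 6 6",    # overlaps it
    "rect_c 5 5 9 9",    # overlaps it
]
query = (4.0, 4.0, 8.0, 8.0)
emitted = [pair for line in dataset for pair in map_filter(line, query)]
total = reduce_count(emitted)
```

With this toy dataset the job counts two intersected rectangles, matching what Fig. 2 shows for geom_2 and geom_3.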

REFERENCES
[1] M. Wessler, Big Data Analytics for Dummies, John Wiley & Sons, Inc., Hoboken, New Jersey, 2013.
[2] A. Rajaraman, J. D. Ullman, Mining of Massive Datasets, Cambridge University Press, Cambridge, United Kingdom, 2012.
[3] İ. Demir, A. Sayar, "Hadoop Optimization for Massive Image Processing: Case Study Face Detection", International Journal of Computers Communications & Control, 9(6): 664-671, 2014.
[4] İ. Demir, A. Sayar, "Hadoop plugin for distributed and parallel image processing", 20th Signal Processing and Communications Applications Conference, Muğla, Turkey, pp. 1-4, 2012.
[5] U. Ergün, S. Eken, A. Sayar, "Güncel Dağıtık Dosya Sistemlerinin Karşılaştırmalı Analizi" [A Comparative Analysis of Current Distributed File Systems], 6. Mühendislik ve Teknoloji Sempozyumu, Ankara, Turkey, pp. 213-218, 2013. (in Turkish)
[6] K. Shvachko, H. Kuang, S. Radia, R. Chansler, "The Hadoop Distributed File System", IEEE/NASA Goddard Conference on Mass Storage Systems and Technologies, pp. 1-10, 2010.
[7] J. Dean, S. Ghemawat, "MapReduce: simplified data processing on large clusters", Communications of the ACM, 51(1): 107-113, 2008.
[8] F. Martínez, A. J. Rueda, F. R. Feito, "A new algorithm for computing Boolean operations on polygons", Computers & Geosciences, 35: 1177-1185, 2009.
[9] G. C. Fox, M. S. Aktas, G. Aydin, H. Gadgil, S. Pallickara, M. E. Pierce, A. Sayar, "Algorithms and the Grid", Computing and Visualization in Science, 12(3): 115-124, 2009.
[10] Official Hadoop Web Site, http://hadoop.apache.org/, 2015.
[11] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, I. Stoica, "Spark: cluster computing with working sets", Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, pp. 1-7, 2010.
[12] H. Liao, J. Han, J. Fang, "Multi-dimensional Index on Hadoop Distributed File System", IEEE Fifth International Conference on Networking, Architecture and Storage, pp. 240-249, 2010.
[13] X. Liu, J. Han, Y. Zhong, C. Han, X. He, "Implementing WebGIS on Hadoop: A case study of improving small file I/O performance on HDFS", IEEE International Conference on Cluster Computing and Workshops, pp. 1-8, 2009.
[14] W. Lu, Y. Shen, S. Chen, B. C. Ooi, "Efficient Processing of k Nearest Neighbor Joins Using MapReduce", Proceedings of the VLDB Endowment, 5(10): 1016-1027, 2012.
[15] C. Zhang, F. Li, J. Jestes, "Efficient Parallel kNN Joins for Large Data in MapReduce", Proceedings of the 15th International Conference on Extending Database Technology, pp. 38-49, 2012.
[16] A. Akdogan, U. Demiryurek, F. Banaei-Kashani, C. Shahabi, "Voronoi-Based Geospatial Query Processing with MapReduce", IEEE Second International Conference on Cloud Computing Technology and Science, pp. 9-16, 2010.
[17] M. I. Andreica, N. Tapus, "Sequential and MapReduce-Based Algorithms for Constructing an In-Place Multidimensional Quad-Tree Index for Answering Fixed-Radius Nearest Neighbor Queries", Acta Universitatis Apulensis, pp. 131-151, 2010.
[18] P. Lu, G. Chen, B. C. Ooi, H. T. Vo, S. Wu, "ScalaGiST: Scalable Generalized Search Trees for MapReduce Systems", Proceedings of the VLDB Endowment, 7(14): 1797-1808, 2014.
[19] A. Eldawy, M. F. Mokbel, "A demonstration of SpatialHadoop: An efficient MapReduce framework for spatial data", Proceedings of the VLDB Endowment, 6(12): 1230-1233, 2013.
[20] K. Mohammed Al-Naami, S. Seker, L. Khan, "GISQF: An Efficient Spatial Query Processing System", 2014 IEEE 7th International Conference on Cloud Computing, pp. 681-688, 2014.
[21] S. Eken, A. Sayar, "An automated technique to determine spatio-temporal changes in satellite island images with vectorization and spatial queries", Sadhana, 40, Part 1, pp. 121-137.
[22] A. Sayar, S. Eken, U. Mert, "Registering LandSat-8 Mosaic Images: A Case Study on the Marmara Sea", IEEE 10th International Conference on Electronics Computer and Computation, pp. 375-377, 2013.
[23] A. Sayar, S. Eken, U. Mert, "Tiling of Satellite Images to Capture an Island Object", Communications in Computer and Information Science, 459, pp. 195-204, 2014.
[24] S. Eken, A. Sayar, "Big data frameworks for efficient range queries to extract interested rectangular sub regions", International Journal of Computer Applications, 119(22): 36-39, 2015.
[25] D. M. Mount, "Geometric Intersection", in The Handbook of Discrete and Computational Geometry, 2nd Edition, eds. J. E. Goodman and J. O'Rourke, Chapman & Hall/CRC, Boca Raton, pp. 857-876, 2004.
[26] B. O. Schneider, J. van Welzen, "Efficient polygon clipping for an SIMD graphics pipeline", IEEE Transactions on Visualization and Computer Graphics, 4(3): 272-285, 1998.
[27] T. Theoharis, I. Page, "Two parallel methods for polygon clipping", Computer Graphics Forum, 8(2): 107-114, 1989.
[28] M. McKenney, T. McGuire, "A parallel plane sweep algorithm for multi-core systems", Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp. 392-395, 2009.
[29] A. B. Khlopotine, V. Jandhyala, D. Kirkpatrick, "A variant of parallel plane sweep algorithm for multicore systems", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 32(6): 966-970, 2013.
[30] S. Puri, S. K. Prasad, "Output-Sensitive Parallel Algorithm for Polygon Clipping", 43rd International Conference on Parallel Processing, pp. 241-250, 2014.
[31] A. Sayar, S. Eken, O. Öztürk, "Kd-tree and Quad-tree Decompositions for Declustering of 2-D Range Queries over Uncertain Space", Frontiers of Information Technology & Electronic Engineering, 16(2): 98-108, 2015.
Kirkpatrick, "A variant of parallel plane sweep algorithm for multicore systems", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 32(6): 966-970,2013. S. Puri, S.K. Prasad, "Output-Sensitive Parallel Algorithm for Polygon Clipping", 43rd International Conference on Parallel Processing, pp. 241-250,2014. A Sayar, S. Eken, O. 0ztork, "Kd-tree and Quad-tree Decompositions for Declustering of 2-D Range Queries over Uncertain Space", Frontiers of Information Technology & Electronic Engineering, 16(2): 98-108,2015.