Spatial Data Mining Analytical Environment for Large Scale Geospatial Data
A Dissertation
Submitted to the Graduate Faculty of the University of New Orleans in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Engineering and Applied Science Computer Science
by Zhao Yang
M.S. University of New Orleans, 2012
December 2016
ACKNOWLEDGEMENT
After an intensive period of four years, today is the day: writing this note of thanks is the finishing touch on my dissertation. It has been a period of intense learning for me, not only in the scientific area, but also on a personal level. Writing this dissertation has had a big impact on me, and I would like to reflect on the people who have supported and helped me so much throughout this period. I would like to express my heartfelt gratitude to my major professor, Dr. Mahdi Abdelguerfi, for his belief in me and the support he extended throughout my work. It has been a wonderful experience to work under his guidance. I would like to thank Dr. Shengru Tu, who admitted me and taught the “Big Data” course that is the foundation of my dissertation. I would like to thank Dr. Elias Ioup; we spent five years working on the research project together, which will remain a cherished memory in my life. I would like to thank Dr. Christopher M. Summa and Dr. Dimitrios Charalampidis for serving on my thesis committee. Lastly, I would like to thank my cat, my friends, and my family for their love and support throughout. I would also like to thank my family for their wise counsel and sympathetic ear. You are always there for me. Finally, there are my friends. We were not only able to support each other by deliberating over our problems and findings, but also happily by talking about
things other than just our papers. Thank you very much, everyone!
Table of Contents
ABSTRACT ........................................................................ v
CHAPTER 1 INTRODUCTION .......................................................... 1
CHAPTER 2 BACKGROUND ............................................................ 4
  2.1 Big Data and R ............................................................ 4
  2.2 Panoply, NetCDF and R ..................................................... 8
  2.3 Hybrid Cloud .............................................................. 9
  2.4 In-Memory Computing ...................................................... 10
  2.5 Spatial Data Warehouse and Spatial Data Mining ........................... 12
  2.6 Previous work ............................................................ 14
CHAPTER 3 THE FRAMEWORK ........................................................ 21
  3.1 Framework Overview ....................................................... 21
  3.2 Bi-Directional Spatial ETL Server ........................................ 24
  3.3 In-Memory Spatial Index and Spatial Query ................................ 27
CHAPTER 4 FRAMEWORK IMPLEMENTATION ............................................. 32
  4.1 Framework Structure ...................................................... 32
  4.2 R with package Spatstat .................................................. 33
  4.3 R with Hadoop database (RHbase solution) ................................. 36
  4.4 R with Hadoop database (In-memory extension solution) .................... 40
  4.5 R with Map-Reduce ........................................................ 46
    4.5.1 R with package Plyrmr ................................................ 46
    4.5.2 R with package Rmr2 .................................................. 47
  4.6 Estimating the System Resource ........................................... 50
CHAPTER 5 APPLICATIONS OF THE FRAMEWORK ........................................ 53
  5.1 Spatial data warehouse ................................................... 53
  5.2 Spatial data mining case study ........................................... 65
  5.3 Random forest method to process large file ............................... 74
  5.4 Shark alert map .......................................................... 79
CHAPTER 6 CONCLUSION AND FUTURE WORK ........................................... 85
Bibliography ................................................................... 88
Appendix ....................................................................... 92
VITA .......................................................................... 110
ABSTRACT
We propose a framework for processing and analyzing large-scale geospatial and environmental data using a “Big Data” infrastructure. Existing Big Data solutions do not include a specific mechanism to analyze large-scale geospatial data. In this work, we extend HBase with a spatial index (R-Tree) and HDFS to support geospatial data, and we demonstrate its analytical use with common geospatial data types and the data mining technology provided by the R language. The resulting framework has a robust capability to analyze large-scale geospatial data using spatial data mining and to make its outputs available to end users.
Keywords Big Data, Hadoop, Hbase, MapReduce, R, Data Mining, Geographic Information System
CHAPTER 1 INTRODUCTION
Nowadays, many applications continuously generate large-scale geospatial data. For example, vehicle GPS tracking, aerial surveillance drones, LiDAR (Light Detection and Ranging), world-wide spatial networks, and high-resolution optical or Synthetic Aperture Radar imagery all generate huge amounts of geospatial data. For instance, the geospatial image data generated by a single 14-hour flight mission of a General Atomics MQ-9 Reaper drone with a Gorgon Stare sensor system exceeds 70 terabytes [1]. However, as data collection increases, our ability to process this large-scale geospatial data in a flexible fashion remains limited. The ability to analyze large-scale geospatial data is a requirement for many geospatial intelligence [2] users, but current techniques for analyzing this data are overly specialized or ad hoc. These techniques are not designed to allow for user-defined analytics methods, and commercial analytical products cannot accommodate customers' specialized requirements. GIS users are expected to use raw datasets with unknown statistical information, yet this implicit statistical information is insufficient for analytical purposes. In order to successfully analyze statistical information, users require a variety of analytical functions in an integrated environment. In the field of ocean and acoustic modeling, there is still limited use of data mining on geospatial data. Our framework will provide a new approach for analyzing statistical information and distributions on
large-scale geospatial data. The raw geospatial data, with unknown error and uncertainty information [3], can be analyzed over a heterogeneous infrastructure in a well-defined manner. Existing user-oriented geospatial analytic environments are limited to smaller datasets. Much of the current work in large-scale analytics focuses on automating analysis tasks, for example detecting suspicious activity in wide-area motion imagery [4]. Neither approach provides geospatial analysts with the flexibility to employ creativity and discover new trends in data while still efficiently operating over extremely large data sets. Currently, large-scale user-defined analytics must be created by users with expertise in distributed computing frameworks, programming, data processing, and storage, in addition to the geospatial and statistical subject matter. A new approach is necessary, one that hides the details of the distributed computing frameworks behind common geospatial analysis tools while still supporting large-scale analytics. When undertaking this project, the following factors need to be considered. First, raw geospatial data comes in many forms; therefore, the framework must be able to represent all forms of raw data—in a file or in a database, for instance. Second, the language must support user-defined data mining analysis. Lastly, the data mining functions must be able to run in parallel. The framework must support complete features with reliability, availability, and scalability for processing statistical information end-to-end. In this project, we have developed a framework to store and process large-scale geospatial data over a “Big Data” [5] infrastructure while providing users with a common
geospatial analysis front-end that hides these infrastructure details. The geospatial data is stored in various data stores, including file-based HDFS [6] and HBase [7] (a NoSQL database running on top of HDFS). The spatial data mining functions are implemented with the R system. The framework provides high-performance analytical features, and it is flexible and easily extendable. Compared to the SAS [8] system, our framework supports a variety of user-generated geospatial data mining applications rather than a limited set of SAS analytical functions. It is convenient to migrate current large-scale geospatial data analytical tasks to our framework. The framework is constructed on an open-source software platform; however, using it does not require comprehensive programming skills, and the spatial data mining tasks are easy to share, access, and visualize. This dissertation is organized as follows: Chapter 2 deals with the background knowledge necessary to comprehend and undertake the project; the concepts and definitions of Hadoop, the R system, and the hybrid cloud are presented there. Chapter 3 presents the framework, its architecture, and its advantages. The designs of the various components that make up the framework are presented in detail in Chapter 4, where the implementation of all the framework components is discussed. Chapter 5 presents examples of how to use the framework to analyze large-scale geospatial data; three such applications are presented. Chapter 6 discusses the conclusions we have drawn and future directions for our work.
CHAPTER 2 BACKGROUND
2.1 Big Data and R
In recent years, Big Data has become a hot topic. Big Data deals with large-scale and complex data more efficiently than traditional database management systems and data storage tools, and it provides various methods to extract, transform, load, manipulate, share, and analyze large-scale data with high performance. Hadoop [4] is the most popular solution for distributing and processing large-scale data sets. Hadoop, as a data storage infrastructure, can easily scale from a single node to thousands of nodes. To design a reliable, scalable infrastructure with high availability, we have chosen Apache Hadoop as the data storage layer in our framework. Here is a sample structure of a Hadoop cluster:
Figure 2-1. Multi-node Hadoop cluster [9]
A Hadoop cluster allows users to process large-scale data in a distributed computing environment with simple methods, such as the map-reduce model. Hadoop clusters may consist of thousands of data nodes, and with Hadoop as the data storage infrastructure, the framework can handle petabyte-level data. A Hadoop cluster provides high-availability features, including autonomous handling of hardware and software failures, and it has proven reliable as high-availability, commercial-level infrastructure, as in services like Microsoft Azure and Amazon EC2/S3. A Hadoop cluster consists of Client nodes, Master nodes, and Slave nodes. The Master nodes are the management part of Hadoop: the Name Node manages the HDFS (Hadoop Distributed File System) storage infrastructure, while the Job Tracker manages the MapReduce tasks, which run in parallel. Finally, the Slave nodes are
used to store the data and perform the processing work. Slave nodes consist of Data Node and Task Tracker daemons and are controlled by the Master nodes. The Job Tracker controls the Task Tracker daemon, and the Name Node controls the Data Node daemon.
Figure 2-2. Hadoop Server Roles [10]
Users can administer and access the Hadoop cluster through various interfaces and methods. First, a user can load data into the cluster using shell script commands. Second, the user can submit Hadoop commands to interact with Hadoop data; Hadoop supports Java, C++, and Python. The output of a Hadoop command is downloaded to the user's machine for analytical purposes.
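Equivalent interactions can also be scripted from R rather than from the shell; the following is a minimal sketch using the RHadoop rhdfs package (an assumption on our part: it requires rhdfs to be installed and the HADOOP_CMD environment variable to point to the Hadoop executable, and the file paths shown are placeholders).

library(rhdfs)
hdfs.init()                                           # connect to the cluster configured via HADOOP_CMD
hdfs.put("poi_grid.csv", "/data/poi_grid.csv")        # load a local file into HDFS
hdfs.ls("/data")                                      # list files stored in HDFS
hdfs.get("/data/poi_grid.csv", "poi_grid_copy.csv")   # download output to the user machine for analysis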
HBase is designed on the basis of Hadoop and ZooKeeper [11]; ZooKeeper is a centralized service that provides configuration and coordination services between different products. The framework uses Apache HBase because the system needs random, real-time read/write access to the geospatial data. This framework will support large-scale geospatial data on top of the Hadoop infrastructure.
Figure 2-3. The Hadoop Ecosystem [12]
HBase's design is based on Google's BigTable [13] model. As the figure above shows, HBase uses HDFS as the data storage layer and provides access to the data through BigTable-like data structures. In our framework, HBase is the NoSQL DBMS used to store and process large-scale geospatial data. The R software [14] provides functions similar to SAS and SPSS in the fields of statistical
computing and data analysis. R has become very popular among statistical programmers; it was developed from the S programming language, a statistical programming language designed by John Chambers at Bell Labs [15]. Statisticians use its various statistical analysis methods, for instance linear and nonlinear modeling, classification, clustering, and graphical procedures. R packages can be written using C, C++, FORTRAN, Java, and Python. In the R community, many programmers have developed numerous packages providing new functions and extensions.
2.2 Panoply, NetCDF and R
In our framework, we have chosen Panoply [16] as the visualization tool. Panoply can plot geospatial data on multiple platforms such as Mac, Windows, and Linux. It also supports multiple source file formats like NetCDF [17], HDF, and GRIB. NetCDF [17] is used to share large-scale multidimensional data, which includes bioinformatics data, climate data, and geospatial data. The NetCDF format includes the metadata of the content in the file—for example, both the coordinate values and the attributes' information are stored in the metadata. As a result, a NetCDF file is said to be self-describing [17]. Additionally, these files can be shared between different platforms such as Mac, Windows, and Linux. NetCDF is supported by many platforms and is machine-independent [17], so data is not compromised when shared. With the ncdf and ncdf4 packages, R has the capability to read and write the netCDF
source file. The netCDF3 and netCDF4 formats are both widely used; however, the netCDF4 format has better support for large-scale data and better compression performance.
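As a brief illustration of reading a NetCDF source file from R, the following sketch uses the ncdf4 package; the file name and variable names are placeholders rather than datasets from this work.

library(ncdf4)
nc <- nc_open("ocean_grid.nc")     # open an existing NetCDF file
print(nc)                          # show the self-describing metadata (dimensions, variables, attributes)
lon <- ncvar_get(nc, "lon")        # read the coordinate variables
lat <- ncvar_get(nc, "lat")
val <- ncvar_get(nc, "depth")      # read a gridded variable indexed by (lon, lat)
ncatt_get(nc, "depth", "units")    # read an attribute stored in the metadata
nc_close(nc)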
2.3 Hybrid Cloud
A hybrid cloud is a mix of public cloud and private cloud: an organization can use both to compose a unified cloud service. As a cloud computing service, a private cloud may focus on a specific departmental purpose, while a public cloud may focus on efficiency and cost. The concept of a public cloud is not limited to publicly available services; it can be a corporation-level public cloud that is still private to outsiders. A hybrid cloud may consist of heterogeneous architectures and platforms.
Figure 2-4. Hybrid Cloud [18]
Figure 2-4 shows a hybrid architecture that is presently used in commercial solutions. This hybrid cloud provides a platform-as-a-service (PaaS) capability, which allows a user to run cloud-based applications. For large-scale geospatial data, the public cloud can be used as the data warehouse while the private cloud can be used as a departmental cloud.
2.4 In-Memory Computing
In the era of Big Data, in-memory computing has become a popular solution for processing large amounts of data in a server's RAM. The purpose of in-memory computing is to achieve the fastest possible processing speed.
Figure 2-5. In-Memory Computing [19]
The traditional way of computing reads and writes database records on disk, which incurs substantial I/O cost. Another approach is to let application programs load the entire data set into memory, which avoids the high I/O cost of database access. Multiple concurrent users and programs can access the data, enhancing query performance. With both the application code and the application data in memory, computation tasks run faster.
Figure 2-6. IBM: What is In-Memory Computing? [20]
In-memory computing provides a real-time response for Big Data applications including data analytics, business intelligence reporting tasks, etc. [21].
2.5 Spatial Data Warehouse and Spatial Data Mining
Central to a spatial data warehouse system is the effective handling of large-scale geospatial data. A spatial data warehouse stores large amounts of information about the coordinates of individual spatial objects in space. For OLTP spatial databases, spatial computation is expensive, and the cost of rendering online processing is not acceptable [22]
for most users. Driven by the paradigm of Internet companies such as Google, spatial data warehouses can be a solution for accelerating spatial data mining operations. The spatial data warehouse is so crucial to enabling an enterprise system that its effective usage is the technical centerpiece of this work. On-Line Transaction Processing (OLTP) is the traditional model for enterprise data processing. OLTP databases focus on transactions involving the input, update, and retrieval of data. On-Line Analytical Processing (OLAP) data warehouses focus on queries that collate, summarize, and analyze their contents. Sample data mining techniques in the OLAP process include applying statistics, artificial intelligence, and machine learning techniques to find previously unknown or undiscovered relationships in the data [23]. These methods provide a different perspective from analytical techniques in which the goal is to prove or disprove an existing hypothesis. Spatial data mining is the process of discovering potentially useful patterns from large-scale geospatial datasets [24]. Discovering geospatial patterns from geospatial datasets is more difficult than recognizing statistical patterns in traditional analysis objects such as numeric and categorical data. The complexity of spatial data types, spatial relationships, and spatial autocorrelation is still an open problem to be solved. This work seeks to optimize large-scale geospatial data handling by addressing the remaining open research problems regarding spatial data warehouses and spatial data mining that will most likely impact a future, large-scale enterprise system implementation.
2.6 Previous work
In the paper “From Databases to Dataspaces: A New Abstraction for Information Management [25],” the authors highlight the demand for accessing data from anywhere and in any format. They propose the concept of a “dataspace” to provide a universal API and interface for any kind of data. Their work inspired the industry's subsequent Big Data concepts.
Figure 2-7. Space Filling Curves [26]
In “Digital halftoning with space filling curves [27],” the author presents a method to access POIs with space filling curves. However, this method focuses on algorithms for computing the average intensities of regions and determining aperiodic dot patterns; these algorithms provide a solution for handling dispersed-dot error. Still, the author has only scratched the surface of the problem. For example, we cannot yet say definitively whether space filling curves are better for querying geospatial data, or under what conditions they are better; nor do we understand the impact of space filling curves on the process of data mining or decision making. In the paper “Efficient Spatial Query Processing for Big Data [28],” the author defines a lightweight and scalable spatial index on Big Data. The results of this experiment show that the index is both effective and efficient. The author defines spatial operators such as containing, containedIn, intersects, and withinDistance. Such a system provides a possible solution for improving the performance of spatial queries on Big Data. In our framework, we use a similar spatial index design, built on HBase, to improve spatial query performance. In “MAD Skills: New Analysis Practices for Big Data [29],” the authors propose a deep data analytics method referred to as Magnetic, Agile, and Deep (MAD). Their work is based on the Greenplum parallel database [29]. They provide algorithms that perform both SQL and MapReduce analysis on Big Data. The paper gives a general direction for the analysis of Big Data. However, Greenplum is built on PostgreSQL, which follows the relational DBMS model, while NoSQL is now the mainstream solution for Big Data analytics.
Their solution, unlike ours, is not based on a NoSQL platform. The company Revolution Analytics [30] provides the RHadoop project, a mainstream solution for Big Data analytics in R. The prototypes of current Big Data solutions at companies like Google and Yahoo are based on web systems; they use Hadoop to store web pages and R to analyze customer behavior. Our research object is geospatial data rather than web data, which is quite different. To use the R-Hadoop solution, a great deal of customization and optimization work for geospatial data needs to be done. The RHbase package is an example of a connectivity tool for accessing HBase through R; it is designed for general Big Data rather than for the best spatial query performance, and it is limited by the memory capacity of the R client. An improved spatial query mechanism for R is therefore needed for geospatial Big Data analytics, and our work provides this optimization for spatial query in R with HBase connectivity. In “SpatialHadoop: towards flexible and scalable spatial processing using mapreduce [31],” the authors propose a framework that combines MapReduce and spatial data. Their work provides a new language called Pigeon [32] for spatial query and data mining support on Hadoop. The SpatialHadoop solution requires the user to be a professional programmer who knows the technical details of MapReduce; it defines its own data mining functions but misses the well-established R analysis functions. Our work differs from SpatialHadoop in that we chose R, which is popular among statisticians, as the solution for data mining and as a visualization
tool. Mainstream data analytics solutions, such as Teradata Aster Analytics [33], a popular Big Data analytics product in industry, chose an R-Hadoop platform similar to ours, which has proven robust in real industry environments. Moreover, our in-memory spatial query is also designed and optimized for the R system. In conclusion, our work is designed to be a simple, robust, and user-friendly solution for R analysts, business analysts, and data scientists. There is a commercial solution, called the SAS Intelligence Platform architecture [34], that performs analytical processing of large-scale data. It also supports concurrent access by multiple users. However, this commercial solution is based on multi-tier machines and lacks the flexibility to achieve high performance the way a Hadoop solution does. There are four tiers of machines in the SAS approach:
Data sources -- The storage layer that supports various types of data sources.
SAS servers -- The SAS server layer that performs the analytical processing. Each type of workload must map to a different SAS server.
The middle tier -- The web interface for analytical jobs.
The client tier -- The desktop client to generate analysis reports in a web browser environment.
Figure 2-8. Architecture of the SAS Intelligence Platform [34]
In the SAS approach, the machines perform specific roles at different layers, and each machine must be installed and configured by the administrator. For large-scale data analysis, the user must add enough machines to handle the task. In contrast, Hadoop allows the autonomous management of its nodes, and a Hadoop cluster automatically handles workload management. Therefore, the storage support of the SAS approach is not as scalable as a Hadoop cluster. The SAS solution provides several analytical methods, such as the KRIGE2D, SIM2D, SPP, and VARIOGRAM procedures for spatial analysis. User-defined analytical packages are hard to add and implement on this commercial platform. Currently, there are no commercial systems that support spatial data mining over a Big Data infrastructure. Data providers are mandated to use existing geospatial data systems such as traditional DBMSs. All current systems support sophisticated geospatial and
environmental constructs, such as datums, projections, topology, and grids by default. However, none of these systems supports a state-of-the-art Big Data framework. The RHadoop packages, developed by Revolution Analytics, provide a means to undertake this kind of analytical task. First, RHadoop supports various data sources such as the local file system, HDFS, and traditional DBMSs. Second, R was not originally designed for Big Data analytical work: R's analytical tasks run in memory, and there is a limit to the amount of memory R can access. RHadoop provides a workaround that allows R to support parallelism with acceptable performance [35].
Figure 2-9. R and Hadoop [36]
The Big Data analytics solution has two advantages beyond its analytical framework. First, it defines a simple analytical model on Big Data, which does not fit
the current statistical analysis software. The model also maintains high performance and accuracy with large-scale data. Second, the solution supports user-defined analytical packages, which avoids expensive commercial solutions. Revolution Analytics provides several R packages for the Big Data analytics solution. For example, the RHipe package merges the environments of R and Hadoop, which allows users to divide data into subsets and form multiple divisions. The Rmr package is used to perform MapReduce [37] tasks between R and the Hadoop environment. The RProtoBuf package is used to load serialized data from other MapReduce [37] environments and to communicate among different MapReduce jobs. R is a good choice for Big Data analytical tasks, and Hadoop is a popular solution for Big Data infrastructure. As a result, in this framework we have chosen R as the spatial data mining tool to analyze geospatial Big Data.
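As a small, hedged illustration of the MapReduce style these packages enable, the following sketch uses the rmr2 package; it assumes a working RHadoop installation, and the computation itself is purely illustrative.

library(rmr2)
ints <- to.dfs(1:1000)                          # write a small in-memory vector to HDFS
job <- mapreduce(
  input  = ints,
  map    = function(k, v) keyval(v %% 10, v),   # key each value by its last digit
  reduce = function(k, v) keyval(k, sum(v)))    # sum the values for each key
from.dfs(job)                                   # read the (key, value) results back into R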
CHAPTER 3 THE FRAMEWORK
3.1 Framework Overview
Hadoop has gained significant acceptance in the Big Data user community. However, none of the existing Big Data frameworks supports spatial data mining over large-scale geospatial data. In this project, we have developed a method which makes it possible to integrate spatial data mining technology with geospatial data. We outline the process of analyzing large-scale geospatial data using the R language. The resulting data is compatible with existing Big Data solutions and compliant software. The Hadoop software provides the data storage and access capabilities for the geospatial data used in this framework. Our primary goal is to support retrieving large-scale geospatial data. Environmental data is represented as both gridded data and vector data. The R language provides spatial data mining capabilities for geospatial data comparable to those of commercial products. As such, it is a perfect solution for the large-scale geospatial data analytical environment. However, as mentioned above, the current R environment does not provide an analytical processing method for Big Data because of R's memory limits. As a result, a critical performance issue occurs when R is used to load large-scale raw data: loading such data into R usually results in a system halt or low system performance.
Some of the current open research problems for spatial data warehouses with spatial data mining tools, especially those relevant to an enterprise system for large-scale geospatial data, are:
R analytical functions run in memory and are limited by data frame size;
No spatial index definition on HBase to improve query performance;
No standard mechanisms to access geospatial grid files stored on the Hadoop Distributed File System (HDFS);
No clear understanding of how to run a spatial data mining analytical package through R to access geospatial data stored on Hadoop;
No standard representations that support ETL process of homogeneous and heterogeneous geospatial data across different sources;
No broadly applicable approaches for dealing with spatial data mining packages.
Substantial prior work in the large-scale spatial data mining area was detailed in the previous-work section; the methodology and technical approach selected here build upon that recent progress. Our aim is a new approach to spatial data mining analytics for large-scale geospatial data. As discussed above, Hadoop provides a data infrastructure to store and process geospatial data. Combining Hadoop with the existing R system provides a comprehensive capability to analyze and share environmental data without removing its associated variable information. The following objectives are set for our framework:
1. Provide a distributed method to run statistical computation over large-scale test data;
2. Improve large-scale data querying performance;
3. Extend R applications to large-scale data;
4. Provide various data sources and methods for geospatial data analysis;
5. Support a wide variety of data formats from different data sources;
6. Make it easy to define, control, manipulate, and manage asymmetrical data sources through a unique system interface;
7. Provide various analytical methods for a geospatial data analyst, with the ability to return a visualized analytical result;
8. Provide efficient data storage management;
9. Provide a reliable, available, and scalable service.
Here is the design of our framework:
Figure 3-1. Geospatial Big Data analytical framework
In this project, we have developed a framework to implement spatial data mining over Big Data infrastructures. The next chapter describes the implementation of these capabilities using an analytical model. We then expand to three use cases and describe our overall environmental data analytical architecture implementation.
3.2 Bi-Directional Spatial ETL Server
The purpose of this framework is to store and analyze large volumes of geospatial data. The first analytical step is to store the geospatial data in the spatial data warehouse. The
second analytical step is to run ETL (Extract, Transform, and Load) over the raw data. The third analytical step is to perform analytical functions in a spatial online analytical processing (SOLAP) environment. To design the spatial data warehouse, we built conceptual, physical, and logical models, effective spatial indexes, SOLAP, and other analytical functions. We used Hadoop as the platform for the data warehouse. The spatial query HPC works as one component of the ETL server. The basic functions of the spatial ETL are:
Use HBase as Spatial Data Warehouse
Extract geospatial data from different geospatial data sources, especially grid files; the data may be in homogeneous or heterogeneous formats;
Transform the geospatial data to a universal format which fits HBase;
Load the geospatial data into HBase (a minimal load sketch in R follows this list).
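A minimal sketch of the load step, assuming a data frame grid_poi has already been extracted from a grid file and transformed to the poi_cf column layout used later in Chapter 4; the table name, column names, key format, and the grid_poi data frame itself are illustrative assumptions, and rows are inserted one at a time only for clarity.

library(rhbase)
hb.init()
hb.new.table("geos_test", "poi_cf")              # one column family for the POI attributes
for (i in seq_len(nrow(grid_poi))) {
  rowkey <- sprintf("row%08d", i)                # sequential keys preserve data locality
  hb.insert("geos_test", list(list(rowkey,
    c("poi_cf:latitude", "poi_cf:longitude", "poi_cf:mean", "poi_cf:standarddeviation"),
    list(grid_poi$latitude[i], grid_poi$longitude[i],
         grid_poi$mean[i], grid_poi$standarddeviation[i]))))
}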
The traditional ETL concept is one-way: it extracts, transforms, and loads data from the source database to the data warehouse. We designed an infrastructure called "Bi-Directional ETL" which can perform ETL work between a public cloud and a private cloud. For instance, the figure below shows a public cloud for a large corporation and a private cloud for a specific department within the corporation.
Public Cloud: Corporation Cloud
Private Cloud: Ocean Investigation Vessel Cloud
Figure 3-2. Bi-Directional ETL
The ETL process can be bi-directional:
Public to Private: the vessel downloads (ETL) the navigation data for a specific area to the private cloud.
Private to Public: the vessel uploads (ETL) the ocean investigation data to the public cloud.
Both the public and private clouds can utilize the ETL server to process the geospatial data, which means the ETL process works bi-directionally.
3.3 In-Memory Spatial Index and Spatial Query
We chose HBase as the platform for the spatial data warehouse. The spatial index is an efficient method for indexing geospatial data, and to ensure spatial query performance we chose to build the spatial index in memory. Comparing random reads versus sequential reads is one way of assessing database query efficiency: a sequential read allows the DBMS to access large amounts of data from adjacent locations on the physical disk, whereas a random read accesses data in arbitrary order. The read path of HBase is shown in the following figures.
Figure 3-3. HBase Read Path [38]
Figure 3-4. HBase Read Path Detail [39]
HBase has two kinds of caches: the memory store and the block cache. The performance of sequential reads relies on the cache hit ratio. When HBase runs a query, it first looks at the block cache and the memory store, then at the HFiles; both the block cache and the memory store reside in memory. In the HBase performance evaluation experiment [40], sequential reads performed better than random reads. We use an in-memory spatial index for the spatial data warehouse (HBase). The spatial data warehouse is designed to store historical data and run read-only analytical functions. Most administrators of a data warehouse perform a "regular" incremental load every one to
three months. Therefore, even if the ETL process takes two or three days, it is still considered acceptable. Our design has no physical index stored on HBase. The geospatial raw data is stored as grid files, and each grid file contains the POIs in a specific area, which means the cluster ratio of the geospatial raw data is very high. The best query performance when accessing HBase records is obtained through the rowkey. The distribution of HBase data is determined by the rowkey generation function: hashed rowkeys ensure the best spread of the data, while sequential rowkeys ensure data locality. Once HBase locates the first row of geospatial data, the following reads have a high chance of being sequential. To get sequential reads in range-scan mode, the rowkeys of the data must be sequential. In our use case in Chapter 5.1, we chose pre-sorted grid files as data sources; the generated sequential keys ensure that an HBase range scan performs as a sequential read. The experiment in [40] shows that sequential reads outperform random reads in HBase, so we have chosen to build the spatial index in memory. The process of spatial query is depicted in the following figure.
Figure 3-5. In-Memory R-Tree & Spatial Query
The first step of a spatial query is to filter the spatial data from HBase. Users assign four coordinates (upper left, upper right, lower left, and lower right) as boundary conditions of the spatial query, and HBase retrieves this area of geospatial data as a data window. The second step is building the spatial index in memory: the R-tree is built from the data filtered out of HBase by the boundary conditions in the first step. The last step is to run the spatial query in memory. The spatial query, which can access both the data and the spatial index, can be single or multiple in one process.
The design of this framework combines scale-out (horizontal) and scale-up (vertical) methods. Hadoop works as a scale-out storage system: it expands the scalability of geospatial data storage. The spatial query HPC works as a scale-up computation system: it accelerates the ETL process. Designing computer architecture is an art; one must find the balance between cost, speed, and performance. The basic rule is to save money on the Hadoop data nodes, which are used as the storage system, and to invest more in the spatial query HPC, because the spatial query task is both memory intensive and data intensive.
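As an illustration of this query flow from the R client's side, the sketch below retrieves a key range from the spatial data warehouse and restricts it to a user-supplied bounding box using a spatstat window. The in-memory R-tree index itself is a separate component of the framework and is not shown; here the window subset simply stands in for the index lookup, and the table name, key range, and window values are illustrative.

library(rhbase)
library(spatstat)
hb.init()
# Step 1: range-scan the candidate rows for the query window from HBase
iter <- hb.scan("geos_test", startrow = "row00000001", end = "row00099999", colspec = "poi_cf")
rows <- list()
while (length(batch <- iter$get(1)) > 0) rows <- c(rows, batch)
iter$close()
# Values come back in column order: latitude, longitude, mean, standarddeviation
lat <- sapply(rows, function(r) r[[3]][[1]])
lon <- sapply(rows, function(r) r[[3]][[2]])
# Steps 2-3: build the in-memory point pattern over the bounding box and query it
win <- owin(xrange = c(-130, -126), yrange = c(49, 51))
pts <- ppp(lon, lat, window = win)   # points outside the window are rejected
npoints(pts)                         # e.g., count the POIs inside the query window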
CHAPTER 4 FRAMEWORK IMPLEMENTATION
4.1 Framework Structure
The Hadoop software provides the core data accessing capabilities for the data used in this work. Our primary goal is to support accessing large-scale geospatial data. Geospatial data is represented as both gridded and vector data. The R software provides spatial data analytical capabilities for geospatial data comparable to the solutions provided by SAS. As such, it is a perfect solution for a large-scale geospatial data analytical environment. However, as mentioned above, it does not provide an analytical method for Big Data. As a result, a severe performance bottleneck occurs when the R software is used to load large-scale raw data: loading that data into R usually results in low system performance or even a system halt. We chose various methods to access the data stored in HDFS. The first method is the spatial data mining package Spatstat [41], which can analyze raw geospatial data. The second method is the package RHbase, which provides a path to access data stored in the Hadoop database. The third method uses the R MapReduce packages: the programmer can perform MapReduce tasks on HDFS objects directly through the packages Plyrmr and Rmr2.
Figure 4-1. Analytical environment structure
4.2 R with package Spatstat
We chose Spatstat as the R package to perform statistical analysis, especially for spatial point patterns. Spatial point patterns can be stored in two-dimensional data formats. The Spatstat package uses various analytical methods to discover useful patterns in large-scale spatial data. Compared to common large datasets, geospatial data is harder to extract exact patterns from because of the nature of specific geospatial data sources and their associated data structures. The package Spatstat uses real numbers (e.g., geospatial POIs with uncertainty information), categorical values (e.g., fishery production by species), and logical values (e.g., saline water/freshwater) to mark the point patterns [41], and it can analyze point patterns containing a huge number of points. The region of spatial data can be a complicated shape, such as an arbitrary polygon or a binary pixel image mask. The
package Spatstat is capable of analyzing three- or higher-dimensional point pattern datasets. The following are the data formats that Spatstat can analyze:
Two dimensional space data regions;
Pixel images in two-dimensional space;
Spatial patterns of line segments in two-dimensional space;
Tessellations in two-dimensional space [41].
The package Spatstat supports a variety of statistical analysis methods such as model fitting, spatial data sampling, and statistical formulation. The package provides methods for Gibbs point process models, spatial inhomogeneity and dependence analysis, and cluster process models. We can use maximum likelihood or approximations as frequentist statistical methods to fit point process models such as Poisson point process models, Gibbs point process models, and random cluster process models. Complete spatial randomness corresponds to a Poisson process, which is characterized by the point intensity. A random and uniform distribution of points is called a homogeneous Poisson point process; an unevenly distributed intensity is called an inhomogeneous Poisson point process. Thus, the spatial model can be spatially homogeneous or inhomogeneous. The spatial trend is modeled as a function of the Cartesian coordinates and spatial covariates. Clustering or repulsion interactions can be included and modeled as Gibbs models.
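A brief, hedged example of these model-fitting methods, assuming a point pattern object X such as the one constructed in the code listing below (the interaction radius and the choice of models are illustrative):

library(spatstat)
fit.hom   <- ppm(X ~ 1)                      # homogeneous Poisson model (constant intensity)
fit.inhom <- ppm(X ~ x + y)                  # inhomogeneous Poisson model with a log-linear spatial trend
fit.gibbs <- ppm(X ~ 1, Strauss(r = 0.05))   # Gibbs model with pairwise (repulsive) interaction
lambda <- density(X)                         # kernel estimate of the spatial intensity
plot(lambda)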
Here is the sample code. The following example covers loading geospatial raw data into R and illustrates one way to create a point pattern (ppp) object for the package Spatstat; Chapter 5.1 shows the details of an analytics function on point patterns.

> library(spatstat)
# Here is a simple recipe to create a point pattern (ppp) object from raw data in R.
# geo_data$"poi_cf:latitude"
# geo_data$"poi_cf:longitude"
# geo_data$"poi_cf:mean"
# geo_data$"poi_cf:standarddeviation"
> east <- geo_data$"poi_cf:longitude"
> north <- geo_data$"poi_cf:latitude"
> X <- ppp(east, north, c(-130, -126), c(49, 51))
> X
planar point pattern: 1000 points
window: rectangle = [-130, -126] x [49, 51] units
> df <- data.frame(geo_data$"poi_cf:mean", geo_data$"poi_cf:standarddeviation")
> colnames(df) <- c("mean", "standarddeviation")
> X <- ppp(east, north, c(-130, -126), c(49, 51), marks = df)

4.3 R with Hadoop database (RHbase solution)

> library(rhbase)
# initialize hbase connection
> hb.init()
attr(,"class")
[1] "hb.client.connection"
# create a new table in hbase
> hb.new.table("geos_rhbase", "poi_cf",
+   opts = list(maxversions = 5,
+               x = list(maxversions = 1L, compression = 'GZ', inmemory = TRUE)))
[1] TRUE
> hb.list.tables()
$student_rhbase
      maxversions compression inmemory bloomfiltertype bloomfiltervecsize
info:           5        NONE    FALSE            NONE                  0
      bloomfilternbhashes blockcache timetolive
info:                   0      FALSE         -1
> hb.describe.table("geos_test")
        maxversions compression inmemory bloomfiltertype bloomfiltervecsize
poi_cf:           3        NONE    FALSE            NONE                  0
        bloomfilternbhashes blockcache timetolive
poi_cf:                   0       TRUE         -1
# insert a POI into the hbase table
> hb.insert("geos_test", list(list("row2",
+   c("poi_cf:latitude", "poi_cf:longitude", "poi_cf:mean", "poi_cf:standarddeviation"),
+   list(49.0000, 130.0000, 2698, 26.98))))
[1] TRUE
> hb.get('geos_test', 'row2')
[[1]]
[[1]][[1]]
[1] "row2"

[[1]][[2]]
[1] "poi_cf:latitude"          "poi_cf:longitude"
[3] "poi_cf:mean"              "poi_cf:standarddeviation"

[[1]][[3]]
[[1]][[3]][[1]]
[1] 49

[[1]][[3]][[2]]
[1] 130

[[1]][[3]][[3]]
[1] 2698

[[1]][[3]][[4]]
[1] 26.98

# ONLY if you want to clean up the table
> hb.delete.table('geos_test')
[1] TRUE
# scan from the beginning
> iter <- hb.scan("geos_test", startrow = "row2", end = "row2", colspec = "poi_cf")
> while (length(row <- iter$get(1)) > 0) { print(row) }
> iter$close()
> geos_data