International Conference on Computing, Communication and Automation (ICCCA2017)
Investigation into the efficacy of geospatial big data visualization tools

Rabindra K. Barik, KIIT University, Bhubaneswar, India
Noopur Gupta, IIIT-Bhubaneswar, Bhubaneswar, India
Rakesh K. Lenka, IIIT-Bhubaneswar, Bhubaneswar, India
Syed Mohd Ali, IIIT-Bhubaneswar, Bhubaneswar, India
Ananya Satpathy, IIIT-Bhubaneswar, Bhubaneswar, India
Ankit Raj, IIIT-Bhubaneswar, Bhubaneswar, India
Abstract: Big data is an emerging field as well as a major challenge. A vast amount of data is generated and stored daily, and only 20% of it is structured. Unstructured data is difficult to analyze and work with. Geospatial data has already exceeded available storage capacity and is now considered a big data problem. Visualization makes the task easier because it helps in finding patterns and relationships in the data. Several visualization tools are designed specifically for geospatial data. This paper investigates and reviews some of the popular tools for geospatial big data visualization: Whitebox GAT, ArcMap, GeoMesa, HadoopViz and GRASS GIS are critically analyzed. Finally, the tools are summarized with recommendations according to parameters such as code availability, desktop processing, online processing, mobile client processing, online course availability and API compatibility.
Keywords: Big Data; GIS; interactive visualization; visualization; geospatial data
I. INTRODUCTION
Data are growing exponentially because of the digitalization of everything from satellite images to social media posts, likes and comments, medical records, police records, online shopping details, log files generated on every user login, and data stored from many other sources for future reference. This growth has led to the concept of big data, one of the latest trends in emerging technology. The demand for big data has been increasing in government agencies, finance, IT companies, and trade and commerce. Big data are characterized by the 5 V's: Volume, Velocity, Variety, Value and Veracity. It is therefore important to handle big data efficiently, and special tools are required for proper analysis and visualization. Data visualization has become an integral part of every discipline, particularly GIS (Geographical Information Systems) and Remote Sensing (RS). It involves the creation and study of visual representations of data. The main goal of data visualization is to communicate information efficiently through statistical and information graphics. It is one of the important steps of data analysis or data science.
Data visualization has become an active area of research, teaching and development [1][22]. The problems posed by these huge data sets are storage, analysis, processing, sharing and visualization. The purpose of analyzing and visualizing big data is to draw interesting patterns from the data sets, to help people understand and draw correlations between different aspects of a data set, and to represent it in a visual context [2][20]. In GIS and RS, geospatial data play an important role in big data visualization. The next section describes geospatial data in detail.

II. GEOSPATIAL DATA
To date, big data has been defined differently from academic, research, technological and industrial perspectives. Geospatial data describe objects and things in relation to longitude and latitude coordinates. Such data are traditionally collected using photography, ground surveying and remote sensing. They can also be collected through laser scanning, mobile mapping and geotagged web content [1]. The increasing amount of geospatial data hints at the upcoming challenges of analyzing, managing, processing, storing, visualizing and verifying these data sets [2]. Geospatial data collection has been shifting from a data-sparse to a data-rich paradigm. These data sets are categorized into three types: raster data, vector data and graph data. Raster data, such as digital aerial photographs taken by cameras and satellites, consist of a matrix (grid) in which information is stored in each cell, and they are best suited to continuous data. Vector data are built from points, lines and polygons, which are best for storing data with categorical or discrete boundaries. Graph data mainly appear in the form of city maps containing roads and landmarks, with roads represented by edges and landmarks by nodes. Typical graphs are drawn on a Cartesian grid whose scales are shown on two axes (x and y) [3].
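The three data categories can be illustrated with a minimal sketch. The class names, field names and sample coordinates below are hypothetical and chosen only for illustration; they do not correspond to any of the reviewed tools.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Coordinate = Tuple[float, float]  # (longitude, latitude); illustrative values only


@dataclass
class Raster:
    """Grid of cells, each storing one value (suited to continuous data)."""
    cells: List[List[float]]  # row-major: cells[row][col]

    def value_at(self, row: int, col: int) -> float:
        return self.cells[row][col]


@dataclass
class VectorFeature:
    """A point, line or polygon with discrete boundaries."""
    kind: str  # "point", "line" or "polygon"
    coordinates: List[Coordinate]


@dataclass
class RoadGraph:
    """Landmarks are nodes; roads are edges between landmark names."""
    nodes: Dict[str, Coordinate]
    edges: List[Tuple[str, str]]

    def neighbours(self, name: str) -> List[str]:
        """Return the landmarks directly connected to `name` by a road."""
        found = []
        for a, b in self.edges:
            if a == name:
                found.append(b)
            elif b == name:
                found.append(a)
        return sorted(found)


# Hypothetical sample data for each category.
elevation = Raster(cells=[[10.0, 12.5], [11.0, 13.0]])
park = VectorFeature(
    kind="polygon",
    coordinates=[(85.80, 20.29), (85.81, 20.29), (85.81, 20.30), (85.80, 20.29)],
)
city = RoadGraph(
    nodes={"station": (85.84, 20.27), "museum": (85.83, 20.26), "airport": (85.82, 20.25)},
    edges=[("station", "museum"), ("museum", "airport")],
)
```

The raster answers "what is the value at this cell", the vector feature answers "where is the boundary", and the graph answers "what is connected to what", which is why the three categories are usually stored and visualized differently.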
Geospatial big data can be integrated and used in many areas, e.g. wearable sensors, health, and big data/smart data analytics [17][4][22][23]. Visualization of geospatial data is therefore essential for better analytics, as illustrated in the next section.
III. GEOSPATIAL DATA VISUALIZATION
Data visualization is an easy and quick way to represent complex data sets. Interesting patterns that are easy to analyze can be drawn from these representations. For many years, maps have been used to visualize vital geographic information, and the way the world was seen was largely shaped by the images of physical and political maps. Nowadays, geospatial data are analysed with numerous techniques that study their topological, geometric or geographic properties, and most of these techniques use place and route algorithms. Complex issues arise in geospatial analysis, and they form the basis of current research on the availability of geospatial data. In recent years, there has been an explosion in the amount of geospatial data generated by mobile phones and space telescopes, among other devices. For example, space telescopes generate up to 150 GB of geospatial data weekly, medical devices produce spatial images (X-rays) at a rate of 50 PB every year, and a NASA archive of satellite images holds 1 PB of data that increases by 25 GB daily [6]. Both storing the data and searching for a particular item within it are time-consuming and tedious. The actual challenge is not only to process this massive amount of data but to process data of high diversity, as discussed in the next section.

IV. CHALLENGES
Older techniques for visualising geospatial data sets can no longer handle the growing data sets. Visualization tools should be interactive with low latency. To reduce latency, a tool can use pre-computed data, parallelize data processing and rendering, and use a predictive middleware. Parallelization is required for faster execution. The challenge is to break the entire problem into many small problems so that all of them can run simultaneously [5].
Geospatial big data analysis faces the following challenges:
• Rapid advancement in computing and networking technologies
• Addressing the temporal dimension by analyzing dynamic data
• Various kinds of noise, such as visual noise caused by a high rate of image change, which results in information loss
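The use of pre-computed data to reduce latency, mentioned among the measures above, can be sketched as follows. This is a minimal illustration under assumed names (`precompute_counts`, `count_in_view`) and an assumed uniform grid; it is not taken from any of the reviewed tools. Points are aggregated into grid cells once, and later view requests are answered from the small table of cell counts instead of rescanning every point.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Point = Tuple[float, float]  # (x, y) in map units
Cell = Tuple[int, int]       # (column index, row index) of a grid cell


def precompute_counts(points: Iterable[Point], cell_size: float) -> Dict[Cell, int]:
    """Aggregate points into grid cells once, ahead of any user interaction."""
    counts: Counter = Counter()
    for x, y in points:
        counts[(int(x // cell_size), int(y // cell_size))] += 1
    return dict(counts)


def count_in_view(counts: Dict[Cell, int], min_cell: Cell, max_cell: Cell) -> int:
    """Answer a view request from the pre-computed cells (bounds inclusive)."""
    return sum(
        n
        for (cx, cy), n in counts.items()
        if min_cell[0] <= cx <= max_cell[0] and min_cell[1] <= cy <= max_cell[1]
    )


# Hypothetical sample points.
points = [(0.5, 0.5), (0.7, 0.2), (1.5, 0.5), (3.2, 3.9)]
grid = precompute_counts(points, cell_size=1.0)
visible = count_in_view(grid, min_cell=(0, 0), max_cell=(1, 1))
```

The trade-off is the usual one for pre-computation: the aggregation cost is paid once up front, and the answer is only as fine-grained as the chosen cell size.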
V. OBJECTIVE OF THE CURRENT STUDY
The objective is to analyze various big data visualization tools used for geospatial data so that the details provided about the tools are useful to a wide range of users. A user can then choose the tool best suited to the size and type of the data set at hand. The present paper gives a brief overview of some efficient visualization tools.

VI. VISUALIZATION TOOLS
The collection and publication of digital data about places, people and phenomena has increased drastically in recent years. The increased availability of data offers great opportunities for improving understanding of the world by integrating previously distinct areas such as weather, transportation and demographics. However, combining such varied data is a challenge because of differences in coverage, quality, compatibility and update frequency. Thus, developing technologies for handling such inconsistencies is critical. Some of the most popular visualization tools have been reviewed [6][7][8]. A small number of geospatial big data visualization tools are investigated and summarized in this section.

A. Whitebox GAT. The Whitebox Geospatial Analysis Tools (Whitebox GAT) project began in 2009 and was later positioned as the replacement for the Terrain Analysis System (TAS). It is mainly intended to provide a platform for advanced geospatial analysis. It is open source, cross-platform GIS software that targets all major operating systems with JRE 8.0 or above. The user interface consists of a tool bar, a side panel and a menu for accessing and manipulating tools and data layers, plus a central area for visualizing the data layers. The Whitebox tools list has a tree-view structure under the tools tab in the side panel. The software takes a distinctive approach to open access: it provides a View Code button, which embodies this concept [9]. Whitebox GAT mostly works with geospatial data structured using the vector or raster data models. The Whitebox raster format combines two files. The raster data file (*.tas) is a row-major, flat binary file containing either 32-bit or 64-bit floating point data, 32-bit integer data, or unsigned byte values. This data file must be accompanied by an ASCII header file (*.dep), which contains information about the grid structure, geographic properties and other salient characteristics of the raster data. Figure 1 shows the hydrological routes of Guelph in the Whitebox GAT environment.
Fig. 1. Hydrological routes of Guelph.
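The row-major, flat binary layout described for the Whitebox raster data file can be illustrated with a short decoding sketch. The text does not list the fields of the header file, so the grid dimensions, cell type and byte order are passed in as plain parameters here; the function name `decode_raster` and the in-memory sample are assumptions made for illustration, and this is not the Whitebox GAT reader itself.

```python
import struct
from typing import List

# struct format characters for the four cell types named in the text.
CELL_FORMATS = {"float32": "f", "float64": "d", "int32": "i", "uint8": "B"}


def decode_raster(
    payload: bytes,
    rows: int,
    cols: int,
    cell_type: str = "float32",
    little_endian: bool = True,  # assumption: byte order would come from the header
) -> List[List[float]]:
    """Decode a row-major flat binary payload into a list of rows."""
    prefix = "<" if little_endian else ">"
    fmt = f"{prefix}{rows * cols}{CELL_FORMATS[cell_type]}"
    expected = struct.calcsize(fmt)
    if len(payload) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(payload)}")
    flat = struct.unpack(fmt, payload)
    # Row-major: the first `cols` values form row 0, the next `cols` row 1, etc.
    return [list(flat[r * cols:(r + 1) * cols]) for r in range(rows)]


# A 2 x 3 grid of 32-bit floats, built in memory instead of read from disk.
sample = struct.pack("<6f", 1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
grid = decode_raster(sample, rows=2, cols=3)
```

In practice the rows, columns, cell type and byte order would be taken from the accompanying header file before the binary file is decoded, which is why the two files must travel together.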
B. ArcMap. ArcMap is one of the desktop applications of ArcGIS and is used to create maps, perform geospatial analysis, manage geographic data and present results. ArcMap represents geographic information as a collection of layers and other elements in a map. Common map elements include the data frame containing map layers, a scale bar, a north arrow, a title, descriptive text and a symbol legend. A map saved in ArcMap is stored as a file with the .mxd extension. ArcMap offers two main views, data view and layout view. Typical tasks performed in ArcMap are compiling and editing geospatial data sets. It also uses geoprocessing to automate work and perform various analyses, and it organizes and manages geospatial databases and ArcGIS documents. It can publish map documents as map services using ArcGIS for Server and document the geographic location. Moreover, sharing the created data models and databases is easy in ArcMap. ArcMap runs only on Windows 8 or higher, and a minimum of 4 GB of disk space is required. It also requires Python 2.7.12 and Numerical Python 1.9.3 or above. Microsoft .NET Framework 4.5 or above must be installed before installing ArcMap [10]. Figure 2 illustrates the hydrological routes of the western region of India in the ArcMap environment.
Fig. 2. Hydrological routes of western region of India.

C. GeoMesa. GeoMesa is Apache-licensed open source software that enables large-scale geospatial analytics on cloud and distributed computing systems. It helps in analyzing huge spatio-temporal data sets. It provides spatio-temporal data persistence on top of the Accumulo, HBase and Cassandra distributed column-oriented databases. It works with GeoServer, a geographic information server, which facilitates integration with a wide range of existing mapping clients. GeoMesa is capable of storing gigabytes to petabytes of geospatial data. It serves up tens of millions of points in seconds and ingests data faster than 10,000 records per second per node. It also supports Apache Spark big data analytics. It drives a map through GeoServer or other OGC (Open Geospatial Consortium) clients. In terms of architecture, GeoMesa supports scalable, cloud-based data storage technologies, including Apache Accumulo, Apache HBase and Google Cloud Bigtable. GeoMesa takes advantage of Apache Spark for large-scale analytics of stored and streaming data, and it streams data using the Apache Kafka message broker. To run, GeoMesa requires Java JDK 8 along with Apache Maven 3.2.2 or later. The GeoMesa API (geomesa-native-api in the source distribution) is intended for developers who just want to geo-index their data [11]. Figure 3 shows the interface for a catalogue of human societal-scale behavior.

Fig. 3. Catalogue of human societal-scale behavior.
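The text does not describe how GeoMesa builds its geo-index, so the following is only an illustrative sketch of one common way to index points in a sorted key-value store: a Z-order (bit-interleaving) key. The function names, the default bit depth and the sample values are assumptions, and the code does not use or reproduce the GeoMesa API.

```python
def quantize(value: float, lo: float, hi: float, bits: int) -> int:
    """Map a value in [lo, hi] to an integer cell index in [0, 2**bits - 1]."""
    if not lo <= value <= hi:
        raise ValueError(f"{value} is outside [{lo}, {hi}]")
    cells = 1 << bits
    return min(int((value - lo) / (hi - lo) * cells), cells - 1)


def interleave(x: int, y: int, bits: int) -> int:
    """Interleave the low `bits` bits of x (even positions) and y (odd positions)."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key


def z_order_key(lon: float, lat: float, bits: int = 16) -> int:
    """Turn a longitude/latitude pair into a single sortable integer key."""
    return interleave(
        quantize(lon, -180.0, 180.0, bits),
        quantize(lat, -90.0, 90.0, bits),
        bits,
    )


# Hypothetical point, keyed at the default resolution.
key = z_order_key(85.82, 20.30)
```

Points that are close on the map tend to receive nearby keys, so a range scan over sorted keys in a column-oriented store can retrieve a spatial neighbourhood without reading the whole table.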
D. HadoopViz. HadoopViz is an extensible MapReduce-based framework for visualizing big geospatial data. It has several advantages over existing systems. It uses a smoothing technique that allows it to produce more image types, namely those that require fusing nearby records together. HadoopViz employs a three-phase approach, partition-plot-merge, which enables automatic partitioning. Moreover, it proposes a novel visualization abstraction that allows the same efficient core algorithms to be used with dozens of image types (scatter plots, road networks, brain neurons). Without HadoopViz, one needs to implement one algorithm for visualizing satellite data, another for visualizing tweets, another for heat map visualization, and so on. HadoopViz defines five functions: create-canvas, plot, merge, smooth and write. Create-canvas initializes the empty drawing on which records are plotted. Plot does the actual drawing of input records. Merge combines the various canvases to form the final picture. Smooth combines nearby records. Write generates the final picture from the canvas [12]. HadoopViz has been evaluated on real data sets, including world road networks (165 million polylines) and NASA satellite data (14 billion points). The NASA data set can be visualized with HadoopViz in only 90 seconds [13]. Figure 4 illustrates the temperature heat map of 14 billion points of NASA satellite data using HadoopViz.
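The partition-plot-merge idea can be sketched in a few lines on a single machine. This is a simplified illustration and not the HadoopViz implementation: the function signatures, the round-robin partitioning and the text rendering are assumptions, the phases run sequentially here rather than as MapReduce tasks, and the smooth function is omitted.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, int]   # (column, row) pixel coordinates
Canvas = List[List[int]]  # canvas[row][col] holds a record count


def create_canvas(width: int, height: int) -> Canvas:
    """Initialize the empty drawing on which records are plotted."""
    return [[0] * width for _ in range(height)]


def plot(canvas: Canvas, records: Sequence[Point]) -> Canvas:
    """Draw each input record by incrementing the pixel it falls on."""
    for col, row in records:
        canvas[row][col] += 1
    return canvas


def merge(a: Canvas, b: Canvas) -> Canvas:
    """Combine two partial canvases of equal size into one."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def write(canvas: Canvas) -> str:
    """Render the final picture; here as text, '#' for any non-empty pixel."""
    return "\n".join("".join("#" if v else "." for v in row) for row in canvas)


def visualize(records: Sequence[Point], width: int, height: int, partitions: int) -> Canvas:
    # Partition: deal the records into chunks (round-robin, an assumption).
    chunks = [records[i::partitions] for i in range(partitions)]
    # Plot: each chunk gets its own canvas; the chunks are independent,
    # so in a cluster setting this step could run in parallel.
    partials = [plot(create_canvas(width, height), chunk) for chunk in chunks]
    # Merge: fold the partial canvases into the final canvas.
    result = create_canvas(width, height)
    for partial in partials:
        result = merge(result, partial)
    return result


records = [(0, 0), (1, 0), (1, 0), (2, 1)]
image = visualize(records, width=3, height=2, partitions=2)
picture = write(image)
```

Because merge is a cell-wise sum, the final canvas does not depend on how the records were partitioned, which is the property that lets the plot phase be distributed freely.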
Fig. 4. Temperature Heat Map of 14 billion points of NASA Satellite using Hadoop Viz.

E. GRASS. GRASS is a free and open source software tool used for geospatial data management and analysis, image processing, map generation, geospatial modeling and visualization. GRASS is currently used in academic and commercial settings as well as by government agencies and environmental consulting companies around the world. The general system requirements for running GRASS are a C compiler, GNU make 3.81 or above, zlib, PROJ4, GDAL/OGR and Python 2.6 or above [14][15]. Figure 5 shows dune migration at Jockey’s Ridge State Park, NC, using GRASS GIS.

Fig. 5. Dune migration at Jockey’s Ridge State Park, NC using GRASS GIS.

VII. RESULT AND DISCUSSIONS
A. Limits/Demerits. Whitebox GAT works on multiple operating systems, including Mac, Windows and Linux. It has an extension known as GoSpatial, a command-line program developed in the Go programming language for analyzing and representing geospatial data. GoSpatial can run independently of any other software and is intended to provide additional analytical support for Whitebox GAT. The related files can be manipulated and displayed based on classification, scan angle or GPS time, along with elevation and intensity. Due to changes in Google Code practices, the Whitebox GAT repository became read-only and no longer accepts new code. Whitebox GAT mainly works on raster data sets. The visualization in ArcMap is interactive and easy to use, but ArcMap is not available free of cost. GeoMesa allows relational projections on query results, i.e. it returns only a specified subset of columns, which reduces network overhead. It computes new attributes by combining original attributes. It is open source, so its code can easily be modified and used as required. No suitable tutorial videos are available for learning this tool. GRASS GIS works on Windows, Linux and Mac OS. It is managed by an SQL-based DBMS. It manipulates only raster and vector data. HadoopViz is an open source MapReduce framework and is faster than the other tools discussed. It can process a large amount of data simultaneously and can be deployed on large clusters. A Hadoop cluster has to be set up before using HadoopViz, and the user needs a good knowledge of Hadoop [16].

B. Comparative Analysis. The geospatial data visualization tools are compared on the following parameters: open source, API, interactive visualization, integration with popular sources, desktop clients, online clients, MOOCs and mobile applications. Among the investigated tools, some are open source and one is closed source. The API parameter records whether object classes or sub-routines that facilitate use of the tool are available. The interactive visualization parameter records whether the result of the visualization is interactive. The integration parameter records integration with popular sources such as Hadoop, Apache, Accumulo, etc. The desktop and online client parameters record whether a tool works in a desktop-based or online-based mode. The MOOCs parameter records whether online tutorials and videos on how to use the tool are available. Some tools also have mobile applications, which makes them interactive for users. Table I shows the comparison of the tools discussed above on these parameters.

VIII. CONCLUSION
Big data, in this world of growing data, has given rise to new data management and analytics software that has made a significant mark on the field of data analytics. Such software addresses the volume and diversity of data and decreases information access time. This paper has discussed why big data visualization is important, with an emphasis on geospatial big data visualization. It has also identified the challenges faced while visualizing and analyzing huge data sets. To overcome these challenges, newer techniques have been developed, some of which are surveyed in this paper. HadoopViz, GeoMesa, GRASS GIS, Whitebox GAT and ArcMap are the geospatial data visualization tools discussed above. The merits and demerits of each tool are stated so that the user can conveniently choose the right tool for visualizing a particular geospatial data type, i.e. whether the data is raster data, a shapefile or in .csv format. All the tools discussed here are quite efficient to work with and give an accurate visualization of data. No tool can be declared the best, as the choice depends on the requirements and on the volume and type of the data. This paper can help businesses choose their tool of interest.
TABLE I. COMPARISONS OF GEOSPATIAL BIG DATA VISUALIZATION TOOLS

Tool | Open Source | Integration with popular sources | Interactive Visualization | Desktop Client | Online Client | Mobile Application | MOOCS | API | Free of Cost
GeoMesa | Y | Y | Y | Y | N | N | N | Y | Y
ArcMap | N | N | Y | Y | Y | Y | Y | N | N
Whitebox GAT | Y | Y | Y | Y | N | N | N | N | Y
HadoopViz | Y | Y | Y | Y | N | N | Y | N | Y
GRASS GIS | Y | N | Y | Y | N | N | N | Y | Y
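Choosing a tool from Table I amounts to filtering rows by the required parameters, which can be sketched as below. The Y/N values are transcribed from Table I; the assignment of the last three rows to Whitebox GAT, HadoopViz and GRASS GIS follows the order in which the values appear and should be treated as an assumption. The column keys and the function name `tools_with` are hypothetical.

```python
from typing import Dict, List

# Column keys, in the order of the Table I header.
COLUMNS = [
    "open_source", "integration", "interactive", "desktop",
    "online", "mobile", "moocs", "api", "free",
]

# One Y/N flag per column, transcribed from Table I.
# Row assignment for the last three tools is assumed from the value order.
ROWS = {
    "GeoMesa": "YYYYNNNYY",
    "ArcMap": "NNYYYYYNN",
    "Whitebox GAT": "YYYYNNNNY",
    "HadoopViz": "YYYYNNYNY",
    "GRASS GIS": "YNYYNNNYY",
}

TABLE: Dict[str, Dict[str, bool]] = {
    tool: {column: flag == "Y" for column, flag in zip(COLUMNS, flags)}
    for tool, flags in ROWS.items()
}


def tools_with(*required: str) -> List[str]:
    """Return the tools for which every required parameter is marked Y."""
    return [tool for tool, row in TABLE.items() if all(row[r] for r in required)]


open_with_api = tools_with("open_source", "api")
online_capable = tools_with("online")
```

A filter like this only narrows the candidates; as the conclusion notes, the final choice still depends on the volume and type of the data.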
REFERENCES
[1] Li, S., Dragicevic, S., Castro, F.A., Sester, M., Winter, S., Coltekin, A., Pettit, C., Jiang, B., Haworth, J., Stein, A. and Cheng, T., “Geospatial big data handling theory and methods: A review and research challenges,” ISPRS Journal of Photogrammetry and Remote Sensing, Vol. 115, pp. 119-133, 2016.
[2] van Manen, N., Scholten, H.J. and van de Velde, R., “Geospatial technology and the role of location in science,” Springer Netherlands, 2009.
[3] Sui, D., “Opportunities and impediments for open GIS,” Transactions in GIS, Vol. 18, No. 1, pp. 1-24, 2014.
[4] Chang, K.T., “Introduction to geographic information systems,” Boston: McGraw-Hill Higher Education, pp. 117-122, 2006.
[5] Ali, S.M., Gupta, N., et al., “Big Data Visualization: Tools and Challenges,” 2nd International Conference on Contemporary Computing and Informatics (IC3I), 2016.
[6] Cartwright, W., Crampton, J., Gartner, G., Miller, S., Mitchell, K., Siekierska, E. and Wood, J., “Geospatial information visualization user interface issues,” Cartography and Geographic Information Science, Vol. 28, No. 1, pp. 45-60, 2001.
[7] Lenka, R.K., Barik, R.K., Gupta, N., Ali, S.M., Rath, A. and Dubey, H., “Comparative Analysis of SpatialHadoop and GeoSpark for Geospatial Big Data Analytics,” arXiv preprint arXiv:1612.07433, 2016.
[8] Rink, K., Bilke, L. and Kolditz, O., “Visualisation strategies for environmental modelling data,” Environmental Earth Sciences, Vol. 72, No. 10, pp. 3857-3868, 2014.
[9] Lindsay, J.B., “Whitebox GAT: A case study in geomorphometric analysis,” Computers and Geosciences, Vol. 95, pp. 75-84, 2016.
[10] Shaner, J. and Wrightsell, J., “Editing in ArcMap,” Esri, 2000.
[11] Hughes, J.N., Annex, A., Eichelberger, C.N., Fox, A., Hulbert, A. and Ronquest, M., “GeoMesa: a distributed architecture for spatio-temporal fusion,” in SPIE Defense + Security, pp. 94730F-94730F, 2015.
[12] Eldawy, A., Mokbel, M.F. and Jonathan, C., “HadoopViz: A MapReduce framework for extensible visualization of big spatial data,” in Data Engineering (ICDE), 2016 IEEE 32nd International Conference on, pp. 601-612, 2016.
[13] Taylor, R.C., “An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics,” BMC Bioinformatics, Vol. 11, No. 12, p.S12010, 2010.
[14] Neteler, M. and Mitasova, H., “Open source GIS: a GRASS GIS Approach,” Springer Science and Business Media, Vol. 689, 2013.
[15] Clerici, A. and Perego, S., “A Set of GRASS GIS-Based Shell Scripts for the Calculation and Graphical Display of the Main Morphometric Parameters of a River Channel,” International Journal of Geosciences, Vol. 7, No. 02, pp. 135, 2016.
[16] Bivand, R., “Geocomputation and open source software: components and software stacks,” NHH Dept. of Economics Discussion Paper, Vol. 23, 2011.
[17] Barik, R.K., Samaddar, A.B. and Gupta, R.D., “Investigations into the Efficacy of Open Source GIS Software,” Map World Forum, 2009.
[18] Ma, Y., Wu, H., Wang, L., Huang, B., Ranjan, R., Zomaya, A. and Jie, W., “Remote sensing big data computing: challenges and opportunities,” Future Generation Computer Systems, Vol. 51, pp. 47-60, 2015.
[19] Dasgupta, A., “Big data: The future is in analytics,” Geospatial World, 2013.
[20] Evangelidis, K., Ntouros, K., Makridis, S. and Papatheodorou, C., “Geospatial services in the cloud,” Computers and Geosciences, Vol. 63, pp. 116-122, 2014.
[21] Lee, J.G. and Kang, M., “Geospatial big data: challenges and opportunities,” Big Data Research, Vol. 2, No. 2, pp. 74-81, 2015.
[22] Barik, R.K., Dubey, H., Samaddar, A.B., Gupta, R.D. and Ray, P.K., “FogGIS: Fog Computing for Geospatial Big Data Analytics,” 3rd IEEE Uttar Pradesh Section International Conference on Electrical, Computer and Electronics, 2016.
[23] Dubey, H., Yang, J., Constant, N., Amiri, A.M., Yang, Q. and Mankodiya, K., “Fog data: enhancing telehealth big data through fog computing,” in Proceedings of the ASE Big Data & SocialInformatics 2015, ACM, pp. 14, 2015.