GISTAM 2018 4th International Conference on Geographical Information Systems Theory, Applications and Management

PROCEEDINGS Funchal, Madeira, Portugal 17-19 March, 2018

EDITORS Cédric Grueau Robert Laurini Lemonia Ragia

http://www.gistam.org

SPONSORED BY

PAPERS AVAILABLE AT

DIGITAL LIBRARY

GISTAM 2018 Proceedings of the 4th International Conference on Geographical Information Systems Theory, Applications and Management

Funchal, Madeira - Portugal March 17 - 19, 2018 Sponsored by INSTICC - Institute for Systems and Technologies of Information, Control and Communication Technically Co-sponsored by IEEE GRSS - IEEE Geoscience and Remote Sensing Society IEEE Portugal Section - Institute of Electrical and Electronics Engineers In Cooperation with ACM SIGSPATIAL - ACM Special Interest Group on Spatial Information SIFET - Società Italiana Di Fotogrammetria e Topografia EuroSDR EUROGEO - European Association of Geographers ASPRS - The Imaging and Geospatial Information Society GEO - Group on Earth Observations GITA - Geospatial Information and Technology Association ISPRS - International Society for Photogrammetry and Remote Sensing CIG - Canadian Institute of Geomatics EAGE - European Association of Geoscientists and Engineers AIGeo - Associazione Italiana di Geografia fisica e Geomorfologia

Copyright © 2018 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved

Edited by Cédric Grueau, Robert Laurini and Lemonia Ragia

Printed in Portugal ISBN: 978-989-758-294-3 Depósito Legal: 437578/18

http://www.gistam.org [email protected]

BRIEF CONTENTS

Invited Speakers
Organizing Committees
Program Committee
Auxiliary Reviewers
Selected Papers Book
Foreword
Contents


INVITED SPEAKERS

Uwe Stilla, Technische Universitaet Muenchen, Germany
Kostas E. Katsampalos, Aristotle University of Thessaloniki, Greece
Eugene Fiume, University of Toronto, Canada


ORGANIZING COMMITTEES

CONFERENCE CHAIR
Lemonia Ragia, Technical University of Crete, Greece

PROGRAM CHAIR
Cédric Grueau, Polytechnic Institute of Setúbal/IPS, Portugal

HONORARY CHAIR
Robert Laurini, Knowledge Systems Institute, United States

SECRETARIAT
Vitor Pedrosa, INSTICC, Portugal

GRAPHICS PRODUCTION AND WEBDESIGNER
André Poeira, INSTICC, Portugal

WEBMASTER
João Francisco, INSTICC, Portugal
Carolina Ribeiro, INSTICC, Portugal


PROGRAM COMMITTEE

Thierry Badard, Laval University, Canada
Jan Blachowski, Wroclaw University of Science and Technology, Poland
Thomas Blaschke, University of Salzburg, Austria
Alexander Brenning, Friedrich Schiller University, Germany
Jocelyn Chanussot, Grenoble Institute of Technology - Institut Polytechnique de Grenoble, France
Filiberto Chiabrando, Politecnico di Torino - DIATI, Italy
Keith Clarke, University of California, Santa Barbara, United States
Antonio Corral, University of Almeria, Spain
Joep Crompvoets, KU Leuven, Belgium
Paolo Dabove, Politecnico di Torino, Italy
Anastasios Doulamis, National Technical University of Athens, Greece
Suzana Dragicevic, Simon Fraser University, Canada
Qingyun Du, Wuhan University, China
Arianna D'Ulizia, IRPPS - CNR, Italy
Ana Paula Falcão, Instituto Superior Técnico, Portugal
Ana Fonseca, Laboratório Nacional de Engenharia Civil (LNEC), Portugal
Cidália Maria Parreira da Costa Fonte, Coimbra University, Portugal
Jinzhu Gao, University of the Pacific, United States
Lianru Gao, Chinese Academy of Sciences, China
Georg Gartner, Vienna University of Technology, Austria
Luis Gomez-Chova, Universitat de València, Spain
Gil Rito Gonçalves, Coimbra University, Portugal
José Alberto Álvares Pereira Gonçalves, Oporto University, Portugal
Cédric Grueau, Polytechnic Institute of Setúbal/IPS, Portugal
Hans W. Guesgen, Massey University, New Zealand
Bob Haining, University of Cambridge, United Kingdom
Stephen Hirtle, University of Pittsburgh, United States
Wen-Chen Hu, University of North Dakota, United States
Haosheng Huang, University of Zurich, Switzerland
Simon Jirka, 52° North, Germany
Wolfgang Kainz, University of Vienna, Austria
Harry D. Kambezidis, National Observatory of Athens, Greece
Eric Kergosien, Univ. Lille 3, France
Andreas Koch, University of Salzburg, Austria
Jacek Kozak, Jagiellonian University, Poland
Artur Krawczyk, AGH University of Science and Technology, Poland
Roberto Lattuada, myHealthbox, Italy
Robert Laurini, Knowledge Systems Institute, United States
Songnian Li, Ryerson University, Canada
Christophe Lienert, Canton of Aargau, Dept. Construction, Traffic & Environment, Switzerland
Vladimir V. Lukin, Kharkov Aviation Institute, Ukraine
Paulo Marques, Instituto de Telecomunicações / ISEL, Portugal
Gavin McArdle, University College Dublin, Ireland
Fernando Nardi, Università per Stranieri di Perugia, Italy
Anand Nayyar, KCL Institute of Management and Technology, India
Volker Paelke, Bremen University of Applied Sciences, Germany
Dimos N. Pantazis, TEI of Athens, Greece
Massimiliano Pepe, Università degli Studi di Napoli "Parthenope", Italy
Marco Piras, Politecnico di Torino, Italy
Alenka Poplin, Iowa State University, United States
Dimitris Potoglou, Cardiff University, United Kingdom
Lemonia Ragia, Technical University of Crete, Greece
Jorge Gustavo Rocha, University of Minho, Portugal
Mathieu Roche, Cirad, France
Markus Schneider, University of Florida, United States
Sylvie Servigne, INSA Lyon, France
Yosio Edemir Shimabukuro, Instituto Nacional de Pesquisas Espaciais, Brazil
Francesco Soldovieri, Consiglio Nazionale delle Ricerche, Italy
Luís de Sousa, EAWAG - Swiss Federal Institute of Aquatic Science and Technology, Switzerland
Uwe Stilla, Technische Universitaet Muenchen, Germany
Jantien Stoter, Delft University of Technology, Netherlands
Jose Pablo Suárez, Universidad de Las Palmas de Gran Canaria, Spain
Nicholas Tate, University of Leicester, United Kingdom
José António Tenedório, Universidade NOVA de Lisboa / CICS.NOVA, Portugal
Ana Cláudia Moreira Teodoro, Oporto University, Portugal
Fabio Giulio Tonolo, ITHACA - Information Technology for Humanitarian Assistance, Cooperation and Action, Politecnico di Torino, Italy
Theodoros Tzouramanis, University of the Aegean, Greece
Michael Vassilakopoulos, University of Thessaly, Greece
Lei Wang, Louisiana State University, United States
May Yuan, University of Texas at Dallas, United States

AUXILIARY REVIEWERS

Ludwig Hoegner, Technical University Munich, Germany
Sven Kralisch, University of Jena, Germany
Yusheng Xu, TUM, Germany

SELECTED PAPERS BOOK

A number of selected papers presented at GISTAM 2018 will be published by Springer in a CCIS Series book. This selection will be done by the Conference Chair and Program Chair, among the papers actually presented at the conference, based on a rigorous review by the GISTAM 2018 Program Committee members.


FOREWORD

This book contains the proceedings of the 4th International Conference on Geographical Information Systems Theory, Applications and Management (GISTAM 2018). The conference is sponsored by the Institute for Systems and Technologies of Information, Control and Communication (INSTICC) and held in cooperation with the International Society for Photogrammetry and Remote Sensing (ISPRS), the Canadian Institute of Geomatics (CIG), the ACM Special Interest Group on Spatial Information (ACM SIGSPATIAL), the European Association of Geographers (EUROGEO), the Associazione Italiana di Geografia fisica e Geomorfologia (AIGeo), the Group on Earth Observations (GEO), The Imaging and Geospatial Information Society (ASPRS), the Società Italiana Di Fotogrammetria e Topografia (SIFET), EuroSDR, the Geospatial Information and Technology Association (GITA) and the European Association of Geoscientists and Engineers (EAGE). The conference is also technically co-sponsored by the IEEE Geoscience and Remote Sensing Society (IEEE GRSS).

The purpose of the International Conference on Geographical Information Systems Theory, Applications and Management is to create a meeting point for researchers and practitioners who address new challenges in geo-spatial data sensing, observation, representation, processing, visualization, sharing and management, in all aspects concerning both information and communication technologies (ICT) and management information systems and knowledge-based systems. GISTAM 2018 received 58 paper submissions from 31 countries, of which 28% were accepted as full papers, which shows the intention of preserving a high-quality forum for the next editions of this conference.

The conference program includes three invited talks delivered by internationally distinguished speakers: Uwe Stilla (Technische Universitaet Muenchen, Germany), Kostas E. Katsampalos (Aristotle University of Thessaloniki, Greece) and Eugene Fiume (University of Toronto, Canada). The GISTAM program also includes a Doctoral Consortium on Geographical Information Systems Theory, Applications and Management chaired by Prof. Armanda Rodrigues. A short list of authors will be selected and invited to submit extended revised versions of their papers for a book with the best papers of GISTAM 2018, to be published by Springer in the CCIS Series. To recognize the best submissions and the best student contributions, several awards are conferred at the closing session of the conference, based on the combined marks of paper reviewing, as assessed by the Program Committee, and the quality of the presentation at the conference venue.

We would like to express our thanks to all participants. First of all, to the authors, whose quality work is the essence of this conference. Next, we thank all the members of the program committee and the auxiliary reviewers for their diligence and expert reviewing. We must deeply thank the invited speakers for their excellent contribution in sharing their knowledge and vision. Finally, special thanks to all the members of the INSTICC team whose collaboration was fundamental for the success of this conference.

We wish you all an inspiring conference and an unforgettable stay in Funchal, Madeira, Portugal. We look forward to having additional research results presented at the next edition of GISTAM, details of which are available at http://www.gistam.org/.

Cédric Grueau, Polytechnic Institute of Setúbal/IPS, Portugal
Robert Laurini, Knowledge Systems Institute, United States
Lemonia Ragia, Technical University of Crete, Greece

CONTENTS

INVITED SPEAKERS

KEYNOTE SPEAKERS

Photogrammetry Meets BIM
Uwe Stilla (page 5)

Positioning Techniques and Accuracy Requirements in Geo-sciences and Engineering
Kostas E. Katsampalos (page 7)

More than Pretty Pictures - Synergies Between Computer Graphics and Geographic Information Systems
Eugene Fiume (page 9)

PAPERS

FULL PAPERS

Hotspot Analysis of the Spatial and Temporal Distribution of Fires
Chien-Yuan Chen and Qi-Hua Yang (page 15)

A Newly Emerging Ethical Problem in PGIS - Ubiquitous Atoque Absconditus and Casual Offenders for Pleasure
Koshiro Susuki (page 22)

Enhanced Address Search with Spelling Variants
Konstantin Clemens (page 28)

Automatic Tree Annotation in LiDAR Data
Ananya Gupta, Jonathan Byrne, David Moloney, Simon Watson and Hujun Yin (page 36)

Improvements to DEM Merging with r.mblend
Luís Moreira de Sousa and João Paulo Leitão (page 42)

Mapping and Monitoring Airports with Sentinel 1 and 2 Data - Urban Geospatial Mapping for the SCRAMJET Business Networking Tool
Nuno Duro Santos, Gil Gonçalves and Pedro Coutinho (page 50)

Outdoors Mobile Augmented Reality Application Visualizing 3D Reconstructed Historical Monuments
Chris Panou, Lemonia Ragia, Despoina Dimelli and Katerina Mania (page 59)

An Interactive Story Map for the Methana Volcanic Peninsula
Varvara Antoniou, Paraskevi Nomikou, Pavlina Bardouli, Danai Lampridou, Theodora Ioannou, Ilias Kalisperakis, Christos Stentoumis, Malcolm Whitworth, Mel Krokos and Lemonia Ragia (page 68)

GIS and Geovisualization Technologies Applied to Rainfall Spatial Patterns over the Iberian Peninsula using the Global Climate Monitor Web Viewer
Juan Antonio Alfonso Gutiérrez, Mónica Aguilar-Alba and Juan Mariano Camarillo Naranjo (page 79)

Land-use Classification for High-resolution Remote Sensing Image using Collaborative Representation with a Locally Adaptive Dictionary
Mingxue Zheng and Huayi Wu (page 88)

Optimal Estimation of Census Block Group Clusters to Improve the Computational Efficiency of Drive Time Calculations
Damon Gwinn, Jordan Helmick, Natasha Kholgade Banerjee and Sean Banerjee (page 96)

ResPred: A Privacy Preserving Location Prediction System Ensuring Location-based Service Utility
Arielle Moro and Benoît Garbinato (page 107)

Elcano: A Geospatial Big Data Processing System based on SparkSQL
Jonathan Engélinus and Thierry Badard (page 119)

VOLA: A Compact Volumetric Format for 3D Mapping and Embedded Systems
Jonathan Byrne, Léonie Buckley, Sam Caulfield and David Moloney (page 129)

Minimum Collection Period for Viable Population Estimation from Social Media
Samuel Lee Toepke (page 138)

Spatiotemporal Data-Cube Retrieval and Processing with xWCPS
George Kakaletris, Panagiota Koltsida, Manos Kouvarakis and Konstantinos Apostolopoulos (page 148)

SHORT PAPERS

Improving Urban Simulation Accuracy through Analysis of Control Factors: A Case Study in the City Belt along the Yellow River in Ningxia, China
Rongfang Lyu, Jianming Zhang, Mengqun Xu and Jijun Li (page 159)

Creating a Likelihood and Consequence Model to Analyse Rising Main Bursts
Robert Spivey and Sivaraj Valappil (page 167)

Identifying the Impact of Human Made Transformations of Historical Watercourses and Flood Risk
Thomas Moran, Sivaraj Valappil and David Harding (page 173)

Evaluation of AW3D30 Elevation Accuracy in China
Fan Mo, Junfeng Xie and Yuxuan Liu (page 180)

Comparison of Landsat and ASTER in Land Cover Change Detection within Granite Quarries
R. S. Moeletsi and S. G. Tesfamichael (page 187)

Quantifying Land Cover Changes Caused by Granite Quarries from 1973-2015 using Landsat Data
Refilwe Moeletsi and Solomon Tesfamichael (page 196)

Standardized Big Data Processing in Hybrid Clouds
Ingo Simonis (page 205)

An Automated Approach to Mining and Visual Analytics of Spatiotemporal Context from Online Media Articles
Bolelang Sibolla, Laing Lourens, Retief Lubbe and Mpheng Magome (page 211)

An Example of Multitemporal Photogrammetric Documentation and Spatial Analysis in Process Revitalisation and Urban Planning
Agnieszka Turek, Adam Salach, Jakub Markiewicz, Alina Maciejewska and Dorota Zawieska (page 223)

Positional Accuracy Assessment of the VGI Data from OpenStreetMap - Case Study: Federal University of Bahia Campus in Brazil
Elias Nasr Naim Elias, Vivian de Oliveira Fernandes and Mauro José Alixandrini Junior (page 231)

Representing GeoData for Tourism with Schema.org
Oleksandra Panasiuk, Zaenal Akbar, Thibault Gerrier and Dieter Fensel (page 239)

Logic Modeling for NSDI Implementation Plan - A Case Study in Indonesia
Tandang Yuliadi Dwi Putra and Ryosuke Shibasaki (page 247)

Delimitation of Urban Areas with Use of the Plataform Google Engine Explorer
Sherlyê Francisco de Carvalho, Jhonatha Fiorio Conceição Guimarães, Carla Bernadete Madureira Cruz and Elizabeth Maria Feitosa da Rocha de Souza (page 255)

An Alternative Raster Display Model
Titusz Bugya and Gábor Farkas (page 262)

Geospatial Data Sharing Barriers Across Organizations and the Possible Solution for Ethiopia
Habtamu Sewnet Gelagay (page 269)

A Concept for Fast Indoor Mapping and Positioning in Post-Disaster Scenarios
Eduard Angelats and José A. Navarro (page 274)

Location Intelligence for Augmented Smart Cities Integrating Sensor Web and Spatial Data Infrastructure (SmaCiSENS)
Devanjan Bhattacharya and Marco Painho (page 282)

Towards Rich Sensor Data Representation - Functional Data Analysis Framework for Opportunistic Mobile Monitoring
Ahmad Mustapha, Karine Zeitouni and Yehia Taher (page 290)

Evaluation of Two Solar Radiation Algorithms on 3D City Models for Calculating Photovoltaic Potential
Syed Monjur Murshed, Alexander Simons, Amy Lindsay, Solène Picard and Céline De Pin (page 296)

Investigating the Use of Primes in Hashing for Volumetric Data
Léonie Buckley, Jonathan Byrne and David Moloney (page 304)

Synthetic Images Simulation (SImS): A Tool in Development
Carlos Alberto Stelle, Francisco Javier Ariza-López and Manuel Antonio Ureña-Cámara (page 313)

Unmanned Aerial Survey for Modelling Glacier Topography in Antarctica: First Results
Dmitrii Bliakharskii and Igor Florinsky (page 319)

AUTHOR INDEX (page 327)


INVITED SPEAKERS

KEYNOTE SPEAKERS

Photogrammetry Meets BIM
Uwe Stilla
Photogrammetry and Remote Sensing, Technische Universitaet Muenchen, Muenchen, Germany

Abstract:

Building Information Modeling (BIM) is increasingly gaining the attention of researchers in architecture, engineering and construction (AEC) as well as in geo-information science (GIS). While Photogrammetry and Remote Sensing have established a strong link to GIS over decades, the field of mapping and reconstruction of spatial building structures in the context of BIM is still young and challenging. The challenge becomes clear when comparing the modeling of buildings in GIS (e.g. CityGML) and BIM (e.g. IFC), which have different starting situations and objectives. Introducing digital methods for the built environment in civil engineering is required to reduce the backlog of digitization in industry and is being pushed by many countries. This presentation addresses the challenges and possibilities of Photogrammetry in supporting the monitoring of existing buildings and buildings under construction using BIM. Three different ways of acquiring photogrammetric point clouds of construction sites and a method for automatic progress monitoring using a 4D BIM are shown and discussed.

BRIEF BIOGRAPHY

Prof. Stilla is the Head of the Department of Photogrammetry and Remote Sensing and course director of the Bachelor's and Master's Programs "Geodesy and Geoinformation" at the Technical University of Munich. His research interests include the analysis of images and point clouds in the field of photogrammetry and remote sensing. His publication list shows more than 400 entries. Prof. Stilla is the Chair of the ISPRS Working Group II/III "Pattern Analysis in Remote Sensing", a Principal Investigator of the International Graduate School of Science and Engineering (IGSSE), a member of the Scientific Board of the German Commission of Geodesy (DGK), Vice Chairman of the Commission for Geodesy and Glaciology (KEG) of the Bavarian Academy of Sciences and Humanities, Munich, Germany, and President of the German Society of Photogrammetry, Remote Sensing and Geoinformation (DGPF).


Positioning Techniques and Accuracy Requirements in Geo-sciences and Engineering
Kostas E. Katsampalos
Aristotle University of Thessaloniki, Thessaloniki, Greece

Abstract:

Twenty-three centuries ago, Eratosthenes estimated the circumference of the earth, with an error of less than 10%, by measuring a surface distance based on camel traveling time. At the end of the 19th century, astronomers were measuring ϕ and λ with an accuracy of 1 arcsec, which is equivalent to 30 meters. Present-day Global Navigational Satellite Systems (GNSS), like GPS, Galileo and GLONASS, have reached accuracies better than 1 mm in 3D, including the height above mean sea level. Repeated positioning estimations reveal the 3D displacements due to complex tectonic motions. Therefore, from the era of determining ϕ, λ at epoch t, we have reached today a 13-component group which should be assigned to any point: X, Y, Z, σX, σY, σZ, vX, vY, vZ, σvX, σvY, σvZ, at a selected epoch t. Many international or regional agencies (like EUREF/EPN) and a number of national mapping and cadastral agencies (civil, military, public or private) operate CORS (continuously operated reference stations), like the Hellenic HEPOS, providing high-accuracy positional services to the general public. In areas of significant tectonic motion, these networks offer valuable geodata to all sciences, under a unique scientific umbrella since the end of 2017. GIS, cartography, cadastre, navigation, construction engineering, as-built databases, transportation, the internet of things, and most interdisciplinary applications have a lot to benefit from cheap and accurate satellite positioning.
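The 13-component description mentioned above can be written out explicitly. The grouping below (position, positional uncertainties, velocities, velocity uncertainties, plus the reference epoch) is a reading of the abstract rather than the speaker's exact notation:

```latex
% Illustrative rendering of the 13-component point description
% (grouping assumed from the abstract, not the speaker's notation):
% three coordinates, their uncertainties, three velocity components,
% their uncertainties, and the reference epoch t.
\[
  \mathbf{p}(t) = \bigl(
    X,\, Y,\, Z,\;
    \sigma_X,\, \sigma_Y,\, \sigma_Z,\;
    v_X,\, v_Y,\, v_Z,\;
    \sigma_{v_X},\, \sigma_{v_Y},\, \sigma_{v_Z},\;
    t
  \bigr)
\]
```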

BRIEF BIOGRAPHY

Kostas Katsampalos is Professor of Geodesy and Surveying and Dean of the Faculty of Engineering, Aristotle University of Thessaloniki, Greece. He studied Surveying Engineering at the National Technical University of Athens (1975) and received an M.Sc. (1976) and a Ph.D. in Geodetic Science (1981) at the Ohio State University, USA, where he was also a Research Associate. He was Assistant Professor (1985), Professor (1994) and Chairman of the School of Surveying Engineering (1995 and 2011). He is a member of the scientific committee of the Technical Chamber of Greece, the professional association of surveying engineers, and the National Cadastral and Mapping Organization, national correspondent in FIG, CLGE and EuroGeographics, and scientific coordinator for the Hellenic Positioning System (HEPOS) since 2007. He has participated in more than 20 research projects and international thematic networks. He is the author of three textbooks, 70 research papers, and five software packages for geodetic and cartographic transformations and multimedia applications. He teaches geodetic surveying, computer programming and GPS applications at undergraduate and graduate level.


More than Pretty Pictures - Synergies Between Computer Graphics and Geographic Information Systems
Eugene Fiume
Department of Computer Science, University of Toronto, Toronto, Canada

Abstract:

Computer graphics is best known for its applications in various entertainment industries. This may disguise the fact that many core concepts in computer graphics are shared with geographic information systems. After exploring some cautionary tales regarding how humans process images, I will explore possible synergies between CG and GIS, including parameterisation problems, big data, computer animation, visual and physical modelling and simulation, mixed reality, and hardware acceleration.

BRIEF BIOGRAPHY Eugene Fiume is Professor and Dean of the Faculty of Applied Sciences at Simon Fraser University, Canada. He is the former Chair of Department of Computer Science at the University of Toronto. Following his B.Math. degree from the University of Waterloo and M.Sc. and Ph.D. degrees from the University of Toronto, he was an NSERC Postdoctoral Fellow and Maitre Assistant at the University of Geneva, Switzerland. He was awarded an NSERC University Research Fellowship in 1987 and returned to the University of Toronto to a faculty position. He has written two books and (co-)authored over 130 papers on these topics. He has won several awards: two teaching awards; Innovation Awards from ITRC for research in computer graphics, and Burroughs-Wellcome for biomedical research; an NSERC Synergy Award for innovation and industrial collaboration in visual modelling; and the 2014 CHCCS Achievement Award. He was elected Eurographics Fellow in July 2014, and became a Fellow of the Royal Society of Canada in September 2014. He was papers chair for SIGGRAPH 2001, past chair of the SIGGRAPH Awards Committee (2003-2008) and the ACM Paris Kanellakis Awards Committee (2011), general co-chair of Symposium for Computer Animation 2008, Pacific Graphics International 2011, and the Eurographics Symposium on Rendering 2016.


PAPERS

FULL PAPERS

Hotspot Analysis of the Spatial and Temporal Distribution of Fires
Chien-Yuan Chen¹ and Qi-Hua Yang²
¹Dept. of Civil and Water Resources Engineering, National Chiayi University, Syufu Rd., Chiayi City, Taiwan, China
²Tainan City Government Fire Bureau, Yonghua Rd., Anping Dist., Tainan City, Taiwan, China
[email protected], [email protected]

Keywords:

Geographic Information System, Hotspot, Fire Prevention, Geo-Statistical Analysis.

Abstract:

Fire can take lives and destroy structures. However, modern technology can assist authorities to make decisions on fire disaster prevention. Geographic information systems can play a vital role in fire prevention and mitigation by predicting potential hotspots for fires. This study collected and analysed data on fires in Tainan City in southern Taiwan. Spatial statistics analysis tools employing average nearest neighbour analysis and global analysis through Moran's I were used to analyse whether the fires had a clustered pattern and to plot a fire hotspot map using Getis-Ord Gi* analysis. The results showed that the highest fire risk index is that for people over 80 years old, followed by those between the ages of 60 and 80. The spatial distributions of fire locations, injuries, deaths, factory fires, house fires, and wild fires have clustered patterns in the city. The fire hotspots surround the downtown districts, which have high population density and highly developed commercial and industry areas. The fire cold spots are located in the lowly developed mountainous and coastal areas, which have lower population density. Residents in hotspots should be able to better understand their fire risk through studying the hotspot map. Moreover, authorities can identify hotspots for decision making on fire prevention and urban development planning.

1 INTRODUCTION

The incidence of fires has been reduced with the progress of fireproof technologies for buildings and products. However, injuries and deaths due to fires still continuously occur, and more attention must be paid to the fire warning signs. Further reduction in the number of fires requires communities to promote disaster prevention cognition and the establishment of public safety warning systems. More than 60% of fires occur in houses, as statistics by the National Fire Agency in Taiwan demonstrate (http://www.nfa.gov.tw/). This indicates that although society is generally concerned with public safety, people often lose sight of safety within their daily environment.

Spatial statistics analysis has been used in various areas for disaster mitigation. Applications that have employed spatial statistics analysis tools include the spatial analysis of crimes committed in the Taichung port area (Lee et al., 2012). The gathering mode used to collect criminal cases was identified using average nearest neighbour analysis, and hotspot analysis was employed to assess cold spot and hotspot positions regarding crime for the reference of coastguards. In one analysis of spatial clusters of dengue fever in Kaohsiung city (Yan and Hsueh, 2010), research was conducted from the perspective of the geography of the disease, and a geographic information system (GIS) was employed to create a disease map and study the spread of dengue fever. Whether there was a spatial aggregation in the city was determined using the average nearest neighbour method and point density analysis to locate the village with the highest incidence of dengue fever. Hotspot analysis using Getis-Ord Gi* and spatial autocorrelation coefficients Moran's Index (Moran's I) was also employed to study the spread of Anopheles gambiae and Anopheles funestus in Kenya (Kelly-Hope et al., 2009). Spatial analysis was used to examine betel nut plantation hotspots in the upper Shui-Li Creek watershed using the autocorrelation coefficients of Moran's I and the G-statistic, with the objective of investigating the management strategy of betel nut plantations (Yeh et al., 2013). Liang et al. (2014) used spatial analysis to perform a risk assessment of invasive species and employed hotspot analysis


Getis-Ord Gi* to identify hotspot areas and plan management strategies. Truong and Somenahalli (2011) used the spatial autocorrelation coefficient Moran's I to identify pedestrian-vehicle crash hotspots and unsafe bus stops using hotspot analysis Getis-Ord Gi*. Pedestrian-vehicle crash hotspots were concluded to correlate strongly with the locations of bus stops. Hotspot analysis Getis-Ord Gi* and the spatial autocorrelation coefficient Moran's I were also used to map forest fire risk zones in the Yeguare Region of Honduras (Cáceres, 2011). Factors such as slope, elevation, and distance to tributaries affected the risk of a forest fire. Fires exhibit a spatial aggregation distribution and can be related to population density. The characteristics of fires are well-suited to the use of spatial statistics and an autocorrelation analysis to identify hotspot areas and risk factors for disaster prevention and management.

2 STUDY AREA AND METHODOLOGY

The ML 6.4 earthquake on 3 March 2010, which had its epicentre in Jianxin village in southern Taiwan, caused a building owned by a spinning and weaving company in Tainan City to catch fire. Furthermore, a technology factory caught fire on 28 July 2011, causing substantial economic losses in the city. The fire on 23 October 2012 at the Beimen branch of Sinying Hospital in the Beimen District of Tainan City resulted in the deaths of 13 elderly people and injured 69 others. These serious fires and various other factors led us to choose Tainan City as a study area because of its variety of lifestyles and areas, including villages, mountainous areas, coastal areas, and industrial areas. The diversity of the city has caused both its population and industrial development to increase rapidly.

This analysis was completed through three steps: a literature review and data collection, statistics analysis, and GIS spatial statistics analysis. The study area was divided into a grid, each square of which was 1000 × 1000 m² in size. Fire-related data were separated by injuries and deaths, age and gender of the injured and deceased individuals, fire location, land use, and population density. The coordinates of the fires were overlaid onto an administrative map to create a fire point density map to represent fire locations. Tools for average nearest neighbour and global analysis using Moran's I and Getis-Ord Gi* analysis were employed to analyse if the fires displayed a clustered, dispersed, or random pattern on the fire hotspot map. The null hypothesis was the default hypothesis and states that there is no association between fire occurrence and the factors. The null hypothesis was assumed to be true until evidence indicated otherwise. The rejection of the null hypothesis concluded that there were reasons to believe that a relationship between fire and the other factors existed. The tools used for spatial statistics analysis are explained:

(1) Average nearest neighbour analysis

Euclidean distance was used in the nearest neighbour analysis. The average nearest neighbour distance tool measures the distance between each feature centroid and its nearest neighbour's centroid location to predict the nearest neighbour index. Five values obtained by the analysis included observed mean distance, expected mean distance, nearest neighbour ratio, z-score, and p-value. The z-score and p-value were used for judging the possibility to reject the spatial random pattern of the null hypothesis. A z-score less than −2.58 or greater than 2.58 and a p-value lower than 0.01 with a confidence level of over 99% were used to reject the null hypothesis and confirm a clustered pattern. The average nearest neighbour ratio (NNR) is calculated using the observed average distance divided by the expected average distance, with the expected average distance being based on a hypothetical random distribution with the same number of features covering the same total area. If NNR is less than 1, the study pattern is clustered; if the index is greater than 1, the trend is toward a dispersed pattern.

(2) Global analysis by Moran's I

The spatial autocorrelation tool global Moran's I measures spatial autocorrelation based on both feature locations and feature values simultaneously. The method measures each feature centroid and its nearest neighbour's centroid location to analyse the spatial autocorrelation of each fire. The eigenvalues of this technique included the Moran's I, expected index, variance, z-score, and p-value. The same conditions for z-score, p-value, and confidence level as for average nearest neighbour analysis were used to reject the null hypothesis. The method evaluates the pattern of fires as clustered, dispersed, or random. If Moran's I is greater than 0 (positive value), the fires were clustered; the fires were dispersed if the index is less than 0 (negative value), and the fires were randomly distributed if the index is close to 0.

(3) Hotspot analysis using Getis-Ord Gi*

The point density tool calculates the density of point features around each raster cell; the tool yields a heat map for visualization. However, the hotspot analysis using the Getis-Ord Gi* method yields a true statistical hotspot analysis. The Getis-Ord Gi* value of the target feature shows where fire hotspots (clusters of high values) and cold spots (clusters of low values) exist in the area accompanied by the z-score and p-value. The z-score and p-value were used to support the rejection of the null hypothesis. The values could help to judge the clustered pattern in high or low values and whether the fires exhibited a random pattern in the analysis. Fires in highly clustered patterns have a higher z-score and lower p-value; fires in a highly dispersed pattern have a lower z-score and a lower p-value. The closer the z-score is to 0, the less visible the clustered pattern. Figure 1 shows the flowchart of the spatial statistics analysis tools used in this study.

Figure 1: Method and tools used for the fire hotspot analysis using the GIS. [Flowchart boxes: fire events; coordinate analysis; attribute analysis; point density map; nearest neighbour analysis; global analysis (Moran's I); hotspot analysis (Getis-Ord Gi*); fire hotspot map.]
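As an illustration of the average nearest neighbour statistic described in step (1) above, the following Python sketch computes the NNR and its z-score from projected point coordinates, using the standard expected-distance and standard-error formulas under complete spatial randomness. The study itself used the ArcGIS Spatial Statistics toolbox, so this stand-alone NumPy/SciPy version only approximates that workflow, and the sample coordinates are hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree

def average_nearest_neighbour(points, area):
    """Average nearest neighbour ratio (NNR) and z-score for a point pattern.

    points : (n, 2) array of projected x/y coordinates (metres)
    area   : study-area size in square metres
    """
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    # Distance from each point to its nearest neighbour (k=2: the point itself plus one neighbour).
    dist, _ = cKDTree(pts).query(pts, k=2)
    d_obs = dist[:, 1].mean()
    # Expected mean distance and standard error under complete spatial randomness.
    d_exp = 0.5 / np.sqrt(n / area)
    se = 0.26136 / np.sqrt(n * n / area)
    nnr = d_obs / d_exp
    z = (d_obs - d_exp) / se
    return nnr, z

# Hypothetical example: 2179 fire locations scattered over a roughly 2200 km^2 city extent.
rng = np.random.default_rng(0)
xy = rng.uniform(0, 47000, size=(2179, 2))       # metres
nnr, z = average_nearest_neighbour(xy, 47000.0 ** 2)
print(f"NNR = {nnr:.2f}, z = {z:.2f}")           # ~1.0 and small |z| for random points
```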

3 RESULTS AND DISCUSSION

3.1 Statistics Analysis of Fires

3.1.1 Number and Types of Fire Events

The number of fires totalled 2179 in 10 years (2004-2013) in the city, which included 1502 building fires (68.94%), 216 traffic-accident fires (9.91%), 176 wild fires (8.07%), 1 boat fire (0.05%), and 284 other types of fires (13.03%) (Figure 2). There were 217.9 fires per year on average, which is equivalent to 0.6 fires per day, or three fires every 5 days, on average. Buildings such as houses, factories, shops, and warehouses are the main structures used by city residents in daily life, and fire disaster prevention is emphasized in buildings to reduce the threat to human life and economic losses. In relation to fires in buildings, this study demonstrates that there were 0.41 building fires per day, which is equivalent to two building fires every 5 days on average.

Figure 2: Statistics on the types of fires in the study area (building 1502; traffic 216; wild 176; boat 1; aircraft 0; others 284).

In general, building fires have decreased in the city in recent years. There were 193.8 building fires per year on average, with a total of 969 building fires in the years 2004-2008. The number of building fires decreased to 106.6 per year in the period 2009-2013, with a total of 533 building fires. The reduction ratio was thus 45% (Figure 3). This finding may be a result of progress in fireproof technologies and the implementation of fire disaster prevention strategies such as those that increased public awareness.

Figure 3: Statistics on building fires in the study area (annual counts, 2004-2013: 225, 208, 196, 180, 160, 132, 124, 84, 98, 95).

3.1.2 Fire Death Rate and Fire Risk Index

The fire death (or injury) rate is the number of fatalities (injuries) per million people in the population. The fire risk index is based on the average death (injury) ratio and is calculated as the fire death (injury) ratio at various ages divided by the fire death (injury) ratio in the total population, following the U.S. Fire Administration's definitions (https://www.usfa.fema.gov/):

Fire death (injury) rate = number of fire deaths (injuries) / population (in millions)  (1)

Fire risk index = fire death (injury) ratio at various ages / fire death (injury) ratio in the total population  (2)
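A minimal Python sketch of equations (1) and (2) is given below; the counts and populations are hypothetical illustration values, not the study's figures.

```python
def fire_death_rate(deaths, population):
    """Fire deaths per million people, equation (1)."""
    return deaths / (population / 1_000_000)

def fire_risk_index(deaths_in_group, population_of_group, deaths_total, population_total):
    """Ratio of a group's fire death rate to the whole population's rate, equation (2)."""
    return fire_death_rate(deaths_in_group, population_of_group) / \
           fire_death_rate(deaths_total, population_total)

# Hypothetical numbers for illustration only.
total_pop, total_deaths = 1_880_000, 9          # city-wide population and deaths in one year
elderly_pop, elderly_deaths = 120_000, 3        # residents aged over 80

print(round(fire_death_rate(total_deaths, total_pop), 2))    # 4.79 deaths per million people
print(round(fire_risk_index(elderly_deaths, elderly_pop,
                            total_deaths, total_pop), 2))    # 5.22, i.e. a high-risk group
```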


The fire death rate was 6.99 in 2004 in the study area, but only 2.13 in 2010. The rate increased abruptly to 12.37 in 2012 because a deadly hospital fire occurred in this year, causing 13 deaths. The average fire death rate was 4.65 over the 10 years this study examined, but would be lower than 6.0 if the year 2012 was excluded, as shown in Figure 4.

Figure 4: Changes in the fire death rate in the study area.

The statistical analysis discovered that the fire risk index was less than 1.0 on average for those under 60 years old, 1.76 for those aged 60-79, and 4.27 for those aged over 80 years old in the city (Figure 5). This indicates that those aged over 60 are a high-risk group regarding fires. The result coincides with the statistics provided by the Tokyo Fire Department (http://www.tfd.metro.tokyo.jp/), which states that more than 90% of fire deaths are of individuals over the age of 65. The results perhaps reflect the fact that elderly people are less able to escape due to mobility issues and are therefore more exposed to the effects of a fire. The fire risk index was highest in the Beimen District due to the fire on 23 October 2012 at the Beimen branch of the Sinying Hospital, which caused 59 injuries. These statistics were compared with the spatial statistics analysis for further analysis.

Figure 5: Fire risk index in different ages in the study area (age groups 0~9, 10~19, 20~39, 40~59, 60~79 and over 80; years 2004-2013).

3.1.3 Spatial Statistics Analysis of Fires

This study used the coordinates of fire locations to create a fire point density map and overlaid this with the population density to perform relevance analysis using a GIS. A total of 2179 fires were imported into the spatial analysis for 2004-2013. The fires were concentrated in the southwest area of the city. The highest fire density area was the industrial area of Yongkang District, with 261 fires, including 194 building fires (Figure 6). There were 163 fire injuries and 69 deaths in 2004-2013 in the city. The fire point density map shows that a high density of fires was concentrated in the high human activity areas surrounding the downtown for industry and commercial purposes (Figures 7 and 8).

Figure 6: Distribution of fire points and population density map in the study area.


Figure 7: Distribution of fire injuries and corresponding population in the study area.

Figure 8: Distribution of fire deaths and corresponding population in the study area.

3.1.4 Spatial Aggregation Pattern of Fires

(1) Average nearest neighbor analysis

In the nearest neighbor analysis, the calculated observed mean distance was 334.6 m and the expected mean distance was 542.2 m. The average NNR was thus 0.62, which is smaller than 1.0 and indicates a clustered pattern of fires. The z-score was −31.62, which is smaller than −2.58, and the p-value was 0.00; thus, the null hypothesis was rejected. These values demonstrate that the fires were in a clustered pattern in the city, with a less than 1% probability of their being in a random pattern. The results of the average nearest neighbor analysis for fires are shown in Figure 9.

Figure 9: Results of the NNR distribution obtained using average nearest neighbor analysis.

(2) Global Moran's I analysis

Global Moran's I analysis showed that the Moran's I was 0.48, which is larger than 0.00, and that the fires had a positive clustered pattern. The z-score was 83.96, larger than 2.58, and the p-value was 0.00, which rejects the proposition of complete spatial randomness (null hypothesis). In summary, the fires in the city had a clustered pattern, with a less than 1% probability that they could be in a random pattern (Figure 10).

Figure 10: Moran's I obtained using global Moran's I analysis.
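For readers who want to reproduce this kind of statistic outside a GIS package, the following NumPy sketch computes global Moran's I for gridded fire counts with binary queen-contiguity weights. The study used the ArcGIS Spatial Autocorrelation (Global Moran's I) tool, so the weighting scheme and the toy grid below are assumptions for illustration only.

```python
import numpy as np

def global_morans_i(values, weights):
    """Global Moran's I for a value vector and an (n, n) spatial weight matrix."""
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    n = len(x)
    z = x - x.mean()
    s0 = w.sum()
    return (n / s0) * (z @ w @ z) / (z @ z)

def queen_weights(rows, cols):
    """Binary queen-contiguity weights for a regular rows x cols grid."""
    n = rows * cols
    w = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == dc == 0:
                        continue
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        w[i, rr * cols + cc] = 1.0
    return w

# Hypothetical 5 x 5 grid of fire counts per 1000 m x 1000 m cell (clustered in one corner).
counts = np.array([
    [1, 2, 8, 9, 7],
    [0, 3, 9, 10, 8],
    [0, 1, 4, 6, 5],
    [0, 0, 1, 2, 2],
    [0, 0, 0, 1, 1],
])
I = global_morans_i(counts.ravel(), queen_weights(5, 5))
print(f"Moran's I = {I:.2f}")   # clearly positive for this clustered pattern
```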

The types of fires including house, industry, and wild fires and those fires that caused injuries and deaths were all found to exhibit a clustered pattern


by using the GIS spatial statistics analysis in the study area (Table 1).

Table 1: Summary of the spatial analysis results for fires in 2004-2013 in the study area.

                              All types of fire   Industry fire   House fire   Wild fire
Average nearest neighbor
  NNR                         0.62                0.47            0.58         0.68
  Z-score                     -31.62              -16.06          -23.43       -7.78
  p-value                     0.00                0.00            0.00         0.00
Global Moran's I
  Moran's I                   0.48                0.17            0.41         0.04
  Z-score                     83.96               29.85           72.02        6.68
  p-value                     0.00                0.00            0.00         0.00
Type of distribution          clustered           clustered       clustered    clustered

(3) Fire hotspot analysis

The outlines of fire concentration areas were analyzed using the tool Getis-Ord Gi* to identify fire hotspots in the city. Figure 11 displays the fire hotspot map of the city. Two major hotspots (standard deviation larger than 2.58) are displayed on the map. The largest fire hotspot surrounds the administrative area of the rapidly developing districts, which is close to the downtown center where more jobs are available, as well as good educational and medical facilities and living conditions. The other hotspot is located at the deputy downtown center, Hsinying District, which has similar conditions to those of the largest hotspot. A review of the fire spatial distribution on the point density and hotspot maps reveals that all types of fires occurred in the area surrounding the city's administration center, which has a population density of more than a million residents per square kilometer.

Figure 11: Fire hotspot map in the study area.
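A minimal sketch of the Getis-Ord Gi* statistic behind hotspot maps like Figure 11 is shown below, applied to gridded fire counts with a fixed distance band. The study used the ArcGIS Hot Spot Analysis (Getis-Ord Gi*) tool, so the neighbourhood definition and the example data here are assumptions for illustration only.

```python
import numpy as np

def getis_ord_gi_star(values, coords, band):
    """Gi* z-scores for each observation, using binary weights within a distance band.

    values : (n,) attribute values (e.g. fire counts per grid cell)
    coords : (n, 2) cell-centre coordinates
    band   : fixed distance band defining the neighbourhood (the focal cell is included, hence Gi*)
    """
    x = np.asarray(values, dtype=float)
    xy = np.asarray(coords, dtype=float)
    n = len(x)
    xbar = x.mean()
    s = np.sqrt((x ** 2).mean() - xbar ** 2)
    d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=2)   # pairwise distances
    w = (d <= band).astype(float)                                 # binary weights, self included
    wx = w @ x
    sw = w.sum(axis=1)
    sw2 = (w ** 2).sum(axis=1)
    denom = s * np.sqrt((n * sw2 - sw ** 2) / (n - 1))
    return (wx - xbar * sw) / denom

# Hypothetical 1000 m grid: cell centres and toy fire counts.
coords = np.array([[i * 1000.0, j * 1000.0] for i in range(5) for j in range(5)])
counts = np.arange(25, dtype=float) % 7
z = getis_ord_gi_star(counts, coords, band=1500.0)
hot = z > 2.58                              # 99% confidence hotspots, the threshold used in the paper
print(np.round(z, 2), hot.sum())
```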

(4) Fire cold spot area

Two major fire cold spot areas exist in the map in

Figure 11. One is located in the mountainous area, which has a considerably lower population and a low amount of land developed for economic use, as the land is unsuitable for agricultural use. The other is located in the coastal area, which has a similarly small population and low use of land; most of this area is used for fish farms and a limited amount is used for agricultural use.

3.2 Deployment of Firefighters

The ratio of the general population to each firefighter is 1773:1 in Taiwan, and the ratio is 2047:1 in the study area of Tainan City (National Fire Agency in Taiwan http://www.nfa.gov.tw/ on May 2014). The ratio is higher than that in some other developed countries, such as the United States, where the ratio was 1000/1.33 in New York in 2009–2010 (UFOA, 2011). In some districts of Tainan such as Yongkang District, the ratio is as high as 3668:1 in the hotspot area, which is close to the 4000:1 absolute maximum population that can be adequately served by firefighters. The deployment of firefighting teams must be focused on the hotspot areas of the city.

4 CONCLUSIONS

This study used GIS spatial statistics analysis to investigate the fire hotspot area distribution in the study area, Tainan City in southern Taiwan, using fire data from the years 2004-2013. The point density map shows the fire, injury, and death distributions in the city. Spatial statistics analysis tools for the average nearest neighbor and global analysis using Moran's I were employed to analyze whether the fires had a clustered pattern and to plot the fire hotspot map using the Getis-Ord Gi* analysis. The results showed the following: (1) The highest fire risk index is for people over the age of 80, followed by those aged 60–80. (2) The spatial distribution of fire locations, injuries, deaths, industrial fires, house fires, and wild fires had clustered patterns. (3) The fire hotspot is the downtown area, which has high population density, and the cold spot areas are located in underdeveloped mountainous or coastal areas with lower population density. (4) Fire hotspots are highly correlated with house fires, and fire deaths are concentrated in the downtown area. Finally, the results can provide valuable insights for governments in relation to land development and


urban planning, and could help plan future firefighting resource requirements. This study suggests that other types of disasters can be included in the analysis because non-fire-related disasters also require the assistance of firefighters.

REFERENCES

Cáceres, C.F., 2011. Using GIS in hotspots analysis and for forest fire risk zones mapping in the Yeguare region, southeastern Honduras. Saint Mary's University of Minnesota University Central Services Press, Winona, MN, Resource Analysis, 13, 14pp.

Kelly-Hope, L.A., Hemingway, J. and McKenzie, F.E., 2009. Environmental factors associated with the malaria vectors Anopheles gambiae and Anopheles funestus in Kenya. Malaria Journal, doi: 10.1186/1475-2875-8-268.

Lee, Q.C., Chen, C.W., Luo, D.C., Hong, F.F., 2012. A spatial analysis of criminal cases in Taichung port area. Journal of Taiwan Maritime Safety and Security Studies, Vol 3, No 4, 39-60. (in Chinese with English abstract).

Liang, L., Clark, J.T., Kong, N., Rieske, L.K. and Fei, S., 2014. Spatial analysis facilitates invasive species risk assessment. Forest Ecology and Management, 315, 22-29.

National Fire Protection Association, http://www.nfpa.org/.

Truong, L.T. and Somenahalli, S.V.C., 2011. Using GIS to identify pedestrian-vehicle crash hotspots and unsafe bus stops. Journal of Public Transportation, 14(1), 99-114.

UFOA, 2011. Available: http://www.ufoa.org/researchfiles/file00000009.pdf.

Yan, L.E., Hsueh, Y.H., 2010. The analysis of spatial cluster of dengue fever in Kaohsiung City 2010. The International Conference on Eco-Society and Sustainable Development, 129-153. (in Chinese with English abstract).

Yeh, C.K., Chuang, Y.C., Liaw, S.C., 2013. The spatial analysis of betel nut plantation hotspots in the upper Shui-Li Creek watershed. Journal of Chinese Soil and Water Conservation, 44(3), 202-214.


A Newly Emerging Ethical Problem in PGIS - Ubiquitous Atoque Absconditus and Casual Offenders for Pleasure
Koshiro Susuki
Faculty of Humanities, University of Toyama, 3190 Gofuku, Toyama, Japan
[email protected]

Keywords:

Ubiquitous Mapping, Absconditus, PGIS, Geographic Information Ethics, Cyberbullying, Casual Offenders for Pleasure.

Abstract:

Thanks to the recent technological advances of cellular phones, the practical realization of GeoAPI and SNS, and the consolidation of wireless LAN networks, hardware has become capable of providing portable high-speed Internet access and interactive SNS, and people can now communicate far more casually and unboundedly via the Internet. Currently, PGIS studies mainly look at the 'sunny side' of GIT progress. Although there are also relevant studies on online ethics, they rely unduly on a spontaneously arising equilibrium maintained by mutual surveillance among the people involved. However, this is an over-optimistic and ingenuous perception of this exponential technological advance. In this paper, the author illustrates the existence of 'casual offenders for pleasure' by referring to two recent online cyberbullying incidents. Because the implications of technology-aided ubiquitous mapping can be very hard to see or to grasp, especially for people not educated and trained to see them, the advances prompt people to nonchalantly lower technical and ethical barriers. Further studies are essential to establish geographic information ethics and offer a clear-cut answer to this newly emerging problem.

1 INTRODUCTION

On 6 November 2012, a murder occurred in Zushi City, Kanagawa Prefecture, Japan. Although the victim had secretly relocated to an apartment at that time to escape from the criminal's repeated stalking, the 40-year-old criminal somehow found the apartment into which his ex-lover had moved, invaded it, and stabbed her to death before hanging himself. It is commonly called the 'Zushi stalker murder case'. According to the Japanese Metropolitan Police Department, the number of stalking cases recognized in 2013 was the highest on record, 21,089, of which 15 resulted in serious incidents, including the Zushi case. Among them, there is a reason why the Zushi incident has been given particular attention. Immediately after the incident, the suspicion emerged that the criminal had prepared for the crime using a major online portal's Q & A bulletin board for more than one year before committing the crime. An anonymous suspect who had uploaded the questions one after another disappeared from the web after the incident, leaving only a series of questions. The remaining writing still vividly conveys how the person gradually obtained the

knowledge related to the incident while keeping the murderer's aim secret, such as: how to figure out an address from a phone number, how to analyse Exif metadata from a photo, how to uncover a locked private account on Facebook, how to request a professional detective search, how to purchase weapons, and how to move to a site in Zushi. This case symbolically presents an emerging ethics agenda for Geographic Information Science, or GIScience. Although it foretold the potential threat posed by the trend in Geospatial Information Technology (GIT) towards GIT-aided ubiquitous mapping and cartography (Morita, 2003; Reichenbacher, 2007; Gartner et al., 2007), little attention has been paid in the five years since to the implications of such a technology-aided incident. In this paper, the author critically summarizes existing debates in relevant fields to clarify what was overlooked and what should be considered.

2 THE RISE OF PGIS AND INTERNET PRIVACY

Since the 1990s, the possibilities of geospatial analysis in conjunction with GIS have dramatically


increased in the context of consolidation of geostatistical data, high precision of GPS, improvement of PC processing capability, and speeding up of LAN access. Geographers gradually became aware of the magnitude of the social impacts of GISystem. GISystem became capable of analysing and outputting even personal level data (Miller, 2007). GIT innovation has increased the necessity of dealing with GIS from an interdisciplinary science perspective, examining the consequential social influence of the innovation as well as the functionality of the system itself. This has become GIScience. Since the middle of the 2000s, this situation has dramatically changed even further. Thanks to the consolidation of wireless LAN networks, such as Wi-Fi, which was initially established in 2000, hardware became capable of providing portable high-speed Internet access. Moreover, interactive web services such as Mixi, Facebook, Twitter, posting bulletin boards, Flickr, etc. were launched one after another over the course of the decade. As a result, communication between people far more casually and unboundedly via the Internet, or socalled Web 2.0, became a sudden reality. Likewise, in the case of GIT, when map integration technology (GeoAPI) was put into practical use on the web, Google began to provide Street View and Google Earth continuously in 2006-2007. From then on, everyone could freely geo-tag and share photos and texts on the web maps. One of the most positive aspects stemming from these technological innovations is the rise of the Participatory Geographic Information Systems, PGIS. The evolution has prompted public citizens who were simply receivers of geographic information to become senders, sharers, and communicators of geographic information with use of Social Networking Services (SNS) and online mapping devices (Turner, 2006; Crampton, 2010). Sometimes such grass-roots mappers voluntarily participate in regional policy planning and local governance, called Volunteered geographic information (VGI) (Goodchild, 2007), or bottom-up GIS in Talen’s (2000) nomenclature. Moreover, such mappers can utilize GIT in the post-disaster construction and damage repair process by simply digitizing satellite imagery of the afflicted areas on OpenStreetMap to help find ways around severed roads (Norheim-Hagtun and Meier, 2010). Cartographers generally interpret the phenomenon positively, as a people-powered, net-rooted, undisciplined, alternative and Dionysiac way of

mapping (Crampton, 2010; Kingsbury and Jones, 2009). On the other hand, it became apparent that there are potential threats stemming from this advance. boyd and Ellison’s (2007) review of studies dealt with online-inherent privacy issues and summarized these as (1) damaged reputation due to rumors and gossip; (2) unwanted contact and harassment or stalking; (3) surveillance-like structures due to backtracking functions; (4) use of personal data by third-parties; and (5) hacking and identity theft (Debatin et al. 2009). Previous studies on SNS privacy issues mainly focused on ethical questions involving the remote monitoring of users conducted by the service provider. Above all, the invasion of privacy and surveillance of geographic space as an exercise of public power are subjects of considerable discussion in GIScience (Armstrong, 2002). Although the discussion about ubiquitous mapping is still limited, some scholars coined the term ‘geosurveillance’ (Crampton, 2003) to enable critical discussion about the potential risks of privacy infringement through aggregation of users' attributes and location information collected by public authority and SNS providers. Although the development of Information and Communications Technology (ICT) permits people to share geographic information in a friendlier manner, users remain under scrutiny more tightly because of the geosurveillance (Monmonier, 2002). Dobson and Fisher (2003) defined the term ‘geoslavery’ as ‘a practice in which one entity, the master, coercively or surreptitiously monitors and exerts control over the physical location of another individual, the slave’ (p. 48). Many scholars metaphorically refer to the ‘big brother’ motif in George Orwell’s famous novel 1984 to describe the power and position of a master (e.g. Klinkenberg, 2007; Propen, 2005), and Bentham’s panopticon for the systems and techniques of monitoring (Dobson and Fisher 2007; Koskela, 2002). Although many studies have been extremely conscious of the potential risks of geosurveillance by public powers, their discussions regarding privacy infringement at individual levels lack diversity. There are many empirical studies on individual offenders and victims via SNS. For instance, Gross and Acquisti (2005), one of the classic empirical studies on SNS profiles, found 89% of users used their real names on their Facebook profiles, and 61% used identifiable information in their posts. Jones and Soltren (2005) found that 62% of student users did not configure any privacy setting despite the fact 74% of them knew about Facebook privacy options.



They also pointed out that 70% of Facebook users posted personal information. In other words, users could not defend their privacy effectively even though they cared about its leakage, in what Barnes (2006) termed the 'privacy paradox'. Some other studies found another rationale: that the tendency to inadequately protect one's private information was the consequence of exhibitionistic motives (McGrath, 2004; Ong et al., 2011). These studies demonstrate how potential victims are vulnerable and undefended against anonymous third-party offenders on the Internet, but they do not tell much about the offenders. Little has been studied regarding the offenders, except for cyberbullying and cyberstalking studies that mostly focused on adolescent students in a criminological context (Smith et al., 2008; Wolak et al., 2008). In sum, previous online privacy studies in GIScience can be summarized as either emphasizing the risks of privacy infringement by public power or as criminological studies of SNS. However, in the Web 2.0 era, the panoptic one-to-many relationship becomes the many-surveilling-the-many situation that Shilton (2009) described as little brothers and Rose-Redwood (2006) termed the omnopticon. In such views, the progress of PGIS may encompass the participatory panopticon and a total loss of privacy (Whitaker, 1999). Kawaguchi and Kawaguchi (2012) rephrased the omnopticon as 'paradoxical others' to describe the feeling of discomfort upon being disclosed on Google Street View. Liberally interpreted, these views suggest that an omnoptic mutual-surveillance environment restrains people from deviant behaviours as a sort of unseen hand of God. But why do we rule out plausible alternatives? In this paper, the author brings up two cases to examine a possibility not yet discussed in the preceding contributions: the existence of casual offenders for pleasure. Before the Web 2.0 era, most of the people who could create and manage maps were knowledgeable experts who had generally been educated in, and had internalized, codes of professional ethics. However, in the ubiquitous mapping circumstance, people can participate in mapping far more casually, without being aware that they are in a position of power to create geographic information, and without knowledge of cartography or ethics. Thus, the premise that the net-rooted, undisciplined, alternative and Dionysiac people do what experts expect of them no longer applies.


3

CASES OF THE CASUAL OFFENDERS FOR PLEASURE

3.1

Individual Online Peepers

On 20 February 2015, there was a case of murder in Kanagawa prefecture, Japan. The then-13-year-old victim had tried to withdraw from the perpetrators' circle, and was found bound and stabbed to death by the three juvenile criminals. The case received much media coverage because of the atrociousness of the crime, which can hardly be attributed to their age. However, this case became especially memorable not only because of the savagery, but also in the context of the present paper. A then-15-year-old podcaster, whose handle name was Noeru (Noël), somehow located the chief culprit's family's house and webcast it across the globe. Figure 1 is a screenshot of the delivered movie (now deleted) showing a symbolic composition of a journalist holding out a microphone to a nameless boy, seen from behind. The figure demonstrates that even a boy can go toe-to-toe with the professional media in terms of competence for information transmission. Needless to say, Noeru (and other mappers) could determine a location by simply specifying aggregated place names and utilizing Google Street View to find the same exterior appearance of the home broadcast by the mass media, thereby detecting the exact target location (termed 'dataveillance' by Clarke, 1988). Why would the non-involved boy do this? It is reasonable to surmise that he aspired to fame and increased advertisement revenue, even at the cost of being seen as an online 'weirdo'.

Figure 1: A screenshot of the podcasted movie (http://www.afreecatv.jp/noeru) *now deleted.


3.2

Private Sanctions and Collective Droves

On 15 May 2012 in Hachioji, Tokyo, an elementary school child was on his way home from school. Suddenly, two junior high school students surrounded him while making a video recording with a cell phone. The two adolescents found a pretext for quarrelling with the boy, causing him to move backward and whimper in fear. The adolescents then uploaded the movie file to one of the adolescents' YouTube accounts for kicks (Yomiuri Online, 21 July 2017). Immediately after the upload, the URL was disseminated on the Internet by SNS, and appeared on the famous online bulletin board 2channel with a fusillade of accusations. An anonymous person promptly created a portal site using @wiki, a free rental wiki maintained by a limited liability company, Atfreaks (Figure 2). The website served as a 'traffic cop', directing thousands of seekers to the appropriate information. As the sub-domain name /dqntokutei/ eloquently shows, the creator of the domain cared less about right or wrong than about tokutei (identifying) the dqn (an argot for 'homeboys') who deserved to be sanctioned.

Figure 2: The top page screenshot of the promptly created wiki (https://www34.atwiki.jp/dqntokutei/).

Subsequently, thousands of Internet users (mainly consisting of 2channel viewers) voluntarily began Googling for information about the captured location, as well as analysing the past uploaded files on the YouTube account. The power of collective intelligence was used to pinpoint the filmed location before long, by scoping out distinctive landmarks

captured in the setting and comparing them with images on Google Street View. The school uniform of the perpetrators also revealed the school they attended. Likewise, some of the amateur investigators examined the contents of past uploaded movies and found that the uploaders' faces and their neighbourhoods were visible in some of the files. These online droves dataveillanced all information published online, found the two nameless targets, and privately sanctioned them through complaint calls to the schools and police stations. Five years on from this initial burst of enthusiasm, the portal site remains on the Internet, exposing the faces and locations of the involved individuals to the public gaze.

4

CONCLUDING REMARKS

In 1495 A.D., in medieval Germany, the Ewiger Landfriede passed by Maximilian I, German king and emperor of the Holy Roman Empire, prohibited the Fehde (feud) as a self-help right to take vengeance. In a related move, a Reichskammergericht, the supreme court, was established for the first time in European history (Jackson, 1994). As this event clearly demonstrates, the modern concept of law and justice could not have been made possible without the consignment of individual rights of vengeance to the public power. Five centuries and a few decades on, an overwhelming innovation in GIT is prompting the resurgence of this pre-modern principle in an all-too-modernized guise. The recently realized ubiquitous mapping based on Web 2.0 circumstances is making it an open possibility for people to create and use geographic information anywhere and at any time, without advanced map-use skills (Gartner et al., 2007). However, as the meaning of the word illustrates, ubiquitous stands for being omnipresent, like air, health, and water, all largely taken for granted. In Latin, an antonym for ubiquitous is absconditus, signifying hidden, hard to see or to grasp (Lewis, 1890). Although air is everywhere, its existence is largely overlooked precisely because of its ubiquitous nature. Likewise, in a ubiquitous mapping situation, its presence becomes very hard to see or to grasp, especially for people not educated and trained to 'see' it. As the examples in this study demonstrate, technological advances also enable people to participate nonchalantly by lowering the technical, intellectual, and ethical barriers. For the time being, PGIS studies mainly look at the sunny side of the progress in GIT.



Relevant studies on online ethics place undue reliance on a spontaneously arising equilibrium maintained by mutual surveillance among the people involved (Rose-Redwood, 2006; Kawaguchi and Kawaguchi, 2012). However, this view of the exponential technological advance is over-optimistic and ingenuous. GIS is only a device and a tool. As the Zushi murder case at the beginning of this article shows, people can utilize the new technologies in both good ways and bad. Further studies are clearly essential to establish geographic information ethics through a collaboration of relevant fields such as information ethics, comparative jurisprudence, and geographical education as well as GIScience, in order to offer a clear-cut answer to this newly emerging problem.

ACKNOWLEDGEMENTS

This work was supported by JSPS KAKENHI Grant Number JP17H00839. Some parts of this article are based on the following conference presentations conducted by the author: the 63rd Annual Conference of The Japanese Society for Ethics in 2012, the Kyoto Collegium for Bioethics in 2014, the conferences of the Association of Japanese Geographers in 2014 and 2015, and a keynote speech at Hokuriku Geo-Spatial Forum 2017.

REFERENCES

Armstrong, M.P., 2002. Geographic information technologies and their potentially erosive effects on personal privacy. Studies in Social Sciences, 27(1), 19-28.
Barnes, S.B., 2006. A privacy paradox: Social networking in the United States. First Monday, 11(9) doi: http://dx.doi.org/10.5210/fm.v11i9.1394
boyd, d.m., Ellison, N.B., 2007. Social network sites. Journal of Computer-mediated Communication, 13, 210-230.
Calvert, C., 2004. Voyeur War - The First Amendment, Privacy and Images from the War on Terrorism. Fordham Intellectual Property, Media and Entertainment Law Journal, 15(1), 147-168.
Clarke, R., 1988. Information technology and dataveillance. Communications of the ACM, 31(5), 498-512.
Crampton, J.W., 2003. Cartographic rationality and the politics of geosurveillance and security. Cartography and GIS, 30(2), 135-148.
Crampton, J.W., 2010. Mapping: A Critical Introduction to Cartography and GIS, Wiley-Blackwell. Malden, MA.


Debatin, B., Lovejoy, J.P., Horn, A.K., Hughes, B.N., 2009. Facebook and online privacy: Attitudes, behaviors, and unintended consequences. Journal of Computer-Mediated Communication, 15(1), 83-108.
Dobson, J.E., Fisher, P.F., 2003. Geoslavery. Technology and Society Magazine, 22(1), 47-52.
Dobson, J.E., Fisher, P.F., 2007. The Panopticon's changing geography. Geographical Review, 97(3), 307-323.
Gartner, G., Bennett, D.A., Morita, T., 2007. Towards ubiquitous cartography. Cartography and Geographic Information Science, 34(4), 247-257.
Goodchild, M.F., 2007. Citizens as sensors. GeoJournal, 69(4), 211-221.
Gross, R., Acquisti, A., 2005. Information revelation and privacy in online social networks. Proceedings of the 2005 ACM Workshop on Privacy in the Electronic Society, 71-80.
Jackson, W.H., 1994. Chivalry in twelfth-century Germany: the works of Hartmann von Aue (Vol. 34), Boydell & Brewer Ltd. Cambridge.
Jones, H., Soltren, J.H., 2005. Facebook: Threats to privacy. Project MAC: MIT Project on Mathematics and Computing, 1, 1-76.
Kingsbury, P., Jones, J.P., 2009. Walter Benjamin's Dionysian adventures on Google Earth. Geoforum, 40, 502-513.
Klinkenberg, B., 2007. Geospatial technologies and the geographies of hope and fear. Annals of the Association of American Geographers, 97(2), 350-360.
Kawaguchi, Y., Kawaguchi, K., 2012. What does Google Street View bring about? - Privacy, discomfort and the problem of paradoxical others. Contemporary and Applied Philosophy, 4, 19-34.
Koskela, H., 2002. 'Cam Era' - the contemporary urban Panopticon. Surveillance & Society, 1(3), 292-313.
Lewis, C.T., 1890. An elementary Latin dictionary. American Book Company. New York, Cincinnati, and Chicago.
Miller, H., 2007. Place-based versus people-based GIScience. Geography Compass, 1(3), 503-535.
Monmonier, M., 2002. Spying with Maps: Surveillance technologies and the future of privacy, University of Chicago Press. Chicago.
Morita, T., 2005. A working conceptual framework for ubiquitous mapping. Proceedings of XXII International Cartographic Conference. A Coruña, Spain.
McGrath, J.E., 2004. Loving Big Brother: Performance, privacy and surveillance space. Psychology Press. London.
Norheim-Hagtun, I., Meier, P., 2010. Crowdsourcing for crisis mapping in Haiti. Innovations, 5(4), 81-89.
Ong, E.Y.L., Ang, R.P., Ho, J.C.M., Lim, J.C.L., Goh, D.H., Lee, C.S., Chua, A.Y.K., 2011. Narcissism, extraversion and adolescents' self-presentation on Facebook. Personality and Individual Differences, 50(2), 180-185.


Propen, A.D., 2005. Critical GPS: Toward a new politics of location. ACME: An International Journal for Critical Geographies, 4(1), 131-144.
Reichenbacher, T., 2007. Adaptation in mobile and ubiquitous cartography. In Cartwright, W., Peterson, M.P. and Gartner, G. eds. Multimedia Cartography, Springer. Berlin Heidelberg, 383-397.
Rose-Redwood, R.S., 2006. Governmentality, geography, and the geo-coded world. Progress in Human Geography, 30(4), 469-486.
Shilton, K., 2009. Four billion little brothers?: Privacy, mobile phones, and ubiquitous data collection. Communications of the ACM, 52(11), 48-53.
Smith, P.K., Mahdavi, J., Carvalho, M., Fisher, S., Russell, S., Tippett, N., 2008. Cyberbullying: Its nature and impact in secondary school pupils. Journal of Child Psychology and Psychiatry, 49(4), 376-385.
Talen, E., 2000. Bottom-up GIS. Journal of the American Planning Association, 66, 279-294.
Turner, A., 2006. Introduction to Neogeography, O'Reilly Media, Inc.
Yomiuri Online, 21 July 2017. http://www.yomiuri.co.jp/national/news/20120721OYT1T00402.htm (deleted)
Whitaker, R., 1999. The End of Privacy: How total surveillance is becoming a reality, The New Press. New York.
Wolak, J., Finkelhor, D., Mitchell, K.J., Ybarra, M.L., 2008. Online "predators" and their victims: myths, realities, and implications for prevention and treatment. American Psychologist, 63(2), 111-128.


Enhanced Address Search with Spelling Variants
Konstantin Clemens
Technische Universität Berlin, Service-centric Networking, Germany
[email protected]

Keywords:

Geocoding, Postal Address Search, Spelling Variant, Spelling Error, Document Search.

Abstract:

The process of resolving names of spatial entities like postal addresses or administrative areas into their whereabouts is called geocoding. It is an error-prone process for multiple reasons: Names of postal address elements like cities, streets, or districts are often reused for historical reasons; structures of postal addresses are only coherent within countries or regions - around the globe, addresses are not structured in a canonical way; human users might not adhere even to the locally common format for specifying addresses; also, humans often introduce spelling mistakes when referring to a location. In this paper, a log of address searches from human users is used to model user behavior with regard to spelling mistakes. This model is used to generate spelling variants of address tokens, which are indexed in addition to the proper spelling. Experiments show that augmenting the index of a geocoder with spelling variants is a valuable approach to handling queries with misspelled tokens. It enables the system to serve more such queries correctly compared to a geocoding system supporting edit distances: While the recall of such a system is improved this way, its precision remains on par at the same time.

1

INTRODUCTION

Nowadays, digital maps and digital processing of location information are widely used. Besides various applications for automated processing of location data, like (Can et al., 2005), (Sengar et al., 2007), (Borkar et al., 2000), or (Srihari, 1993), users rely on computers to navigate through an unknown area or to store, retrieve, and display location information. Internally, however, computers reference locations through a coordinate system such as WGS84 latitude and longitude coordinates (National Imagery and Mapping Agency, 2004). Human users, on the other hand, refer to locations by addresses or common names. The process of mapping such names or addresses to their location on a coordinate system is called geocoding. There are two aspects to this error-prone process (Fitzke and Atkinson, 2006), (Ge et al., 2005), (Goldberg et al., 2007), (Drummond, 1995): First, the geocoding system needs to parse the user query and derive the query intent, i.e., the system needs to understand which address entity the query refers to. Then, the system needs to look up the coordinates of the entity the query was referring to and return it as a result. Already the first step is a non-trivial task, especially when considering the human factor: Some address elements are often misspelled or abbreviated by users in a non-standard way. Also, while postal addresses

seem structured and appear to adhere to a well-defined format, (Clemens, 2013) shows that each format only holds within a specific region. Considering addresses from all over the world, address formats often contradict each other, so that there is no pattern that all queries would fit in. In addition to that, like with spelling errors, human users may not adhere to a format, leaving names of address elements out or specifying them in an unexpected order. Such incomplete or missorted queries are often ambiguous, as the same names are reused for different and oftentimes unrelated address elements. Various algorithms are employed to mitigate these issues. Even with the best algorithms at hand, however, a geocoding service can only be as good as the data it builds upon, as understanding the query intent does not lead to a good geocoding result if, e.g., there is no data to return. Many on-line geocoding services like those offered by Google (Google, 2017), Yandex (Yandex, 2017), Yahoo! (Yahoo!, 2017), HERE (HERE, 2017), or OpenStreetMap (OpenStreetMap Foundation, 2017b) are easily accessible by the end user. Because most of these systems are proprietary solutions, they reveal neither the data nor the algorithms used. This makes it hard to compare distinct aspects of such services. An exception to that is OpenStreetMap: The crowd-sourced data is publicly available for everyone. Open-source projects like Nominatim



(OpenStreetMap Foundation, 2017a) provide geocoding services on top of that. In this paper, data from OpenStreetMap is used to create a geocoding service that is capable of deriving the user intent from a query, even if it contains spelling errors or is stated in a nonstandard format. Nominatim - the reference geocoder for OpenStreetMap data - is used as one of the baselines to compare with. Thereby, the recall of a geocoding system is the ratio of successful responses containing the result queried for, whereas the precision describes the ratio of responses not containing different and therefore wrong results. For ambiguous queries, most geocoding systems return responses with multiple results. Obviously, at most one result can be the one queried for, while all other results can only be wrong. Therefore, such responses can be regarded either as successfully served and increasing the recall, or as failures reducing precision. Because this paper aims at increasing the recall by reducing the ambiguity of queries, each response with more than one result is counted as non-successful, affecting the precision metric of the respective geocoder negatively. In this paper, a novel approach is suggested to increase the recall of a geocoder. The idea is to make the system capable of supporting specific, most commonly made spelling errors. Usually, this is achieved by allowing edit distances between tokens of the query and the address. That, however, inherently increases the ambiguity of queries and leads to a lower precision of the system: More responses contain results that queries did not refer to. The suggested approach aims to avoid that by only allowing specific spelling variants that are made often, while avoiding spelling variants that are not made at all - edit distances lack this differentiation. For that, the most common spelling mistakes users make are derived from a log of real user queries. These spelling variants are indexed in addition to the correctly spelled address tokens. Variants of geocoding systems created this way are evaluated with regard to their precision and recall metrics, and compared to a similar system supporting edit distances, as well as to Nominatim. In (Clemens, 2015a) and (Clemens, 2015b), similar measurements have shown that TF/IDF (Salton and Yang, 1973) (Salton et al., 1975) or BM25f (Robertson et al., 2004) based document search engines like Elasticsearch (Elastic, 2017) handle incomplete or shuffled queries much better than Nominatim. This paper is a continuation of that work. It adds to both the indexing mechanism proposed in (Clemens, 2015a) and (Clemens, 2015b) and the way the system performance is measured.

Work on comparing geocoding services has been undertaken in, e.g., (Yang et al., 2004), (Davis and Fonseca, 2007), (Roongpiboonsopit and Karimi, 2010), or (Duncan et al., 2011). Mostly, such works focus on the recall aspect of a geocoder: Only how often a system can find the right result is compared. Also, other evaluations of geocoding systems treat every system as a black box. Thus, a system can be algorithmically strong, but perform poorly in a measurement because it is lacking data. Vice versa, a system can look better than others just because of great data coverage, despite being algorithmically poor. In this paper, the algorithmic aspect is evaluated in isolation, as all systems are set up with the same data. Also, a different way of measuring the geocoders performance is proposed: Based on real user queries a statistical model is created which is used to generate erroneous, user-like queries out of any given valid address. This approach allows to measure a system on a much greater number of addresses. Another approach to the geocoding problem is to find an address schema that is easy to use and standardized in a non-contradicting way. While current schemata of postal addresses are maintained by the UPU (Universal Postal Union, 2017), approaches like (what3words, 2017), (Coetzee et al., 2008), (Mayrhofer and Spanring, 2010), (Fang et al., 2010), or (geo poet, 2017) are suggesting standardized or entirely alternative address schemata. (Clemens, 2016) shows that such address schemata are beneficial in some scenarios, though they are far from being adopted into everyday use. In the next section, the steps to set up such geocoding systems are described. Afterwards, in Section 3 the undertaken measurements are described in detail. Next, in Section 4, the observed results are discussed and interpreted. Finally, in the last section, the conclusions are summarized and further work is discussed.

2

SETTING UP A GEOCODER

The experiment is conducted on the OpenStreetMap data set for Europe. This data set is not collected with a specific application in mind. For many use cases, it needs to be preprocessed from its raw format before it can be consumed. As in (Clemens, 2015a) and (Clemens, 2015b), the process for preprocessing OpenStreetMap data built into Nominatim has been used. Though a long-lasting task, reusing this process ensures all systems are set up with exactly the same data, thereby enabling the comparability of the algorithmic part of those systems.


Figure 1: Example of a document indexed in the geocoding system.

Thus, first, Nominatim has been set up with OpenStreetMap data for Europe as the baseline geocoding system. Internally, Nominatim uses a PostGIS (PostGIS, 2017) enabled PostgreSQL (PostgreSQL, 2017) database. After the preprocessing, this database contains assembled addresses along with their parent-child relationships: A house number level address is the child of a street level address, which in turn is the child of a district level address, etc. This database is used to extract address documents that are indexed in Elasticsearch, as defined in (Clemens, 2015a) and (Clemens, 2015b). Note that in this paper, the geocoding of only house number level addresses is evaluated. Therefore, though OpenStreetMap data also contains points of interest with house number level addresses, only their addresses but not their names have been indexed. Similarly, no parent level address elements, such as streets, postal code areas, cities, or districts have been indexed into Elasticsearch. All house number addresses with the same parent have been consolidated into one single document. Every house number has thereby been used as a key to specify the respective house number level address. Figure 1 shows an example document containing two house numbers 7 and 9, along with their WGS84 latitude and longitude coordinates and spelled-out addresses. The TEXT field of the document is the only one indexed; the TEXT fields mapped by the house numbers are only used to assemble a human-readable result. Because Elasticsearch retrieves full documents, and because the indexed documents contain multiple house number addresses, a thin layer around Elasticsearch is needed to make sure only results with house numbers specified in queries are returned. That is a non-trivial task, as given a query, it is not known upfront which of the tokens specifies the house number. Therefore, this layer has been implemented as follows: First, the query is split into tokens. Next, one token is assumed to be the house number; a query for documents is executed containing all the other tokens. This is repeated for each token, trying out every token as a house number. Because each time only one token is picked to specify the house number, this approach fails to support house numbers that are specified in multiple tokens.
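The document structure described above can be illustrated with the following sketch; the concrete field names, coordinates, and address strings are assumptions chosen for illustration and are not taken from the original figure.

# Illustrative sketch of one indexed address document (hypothetical values).
# Only the top-level TEXT field is indexed; the per-house-number entries are
# used solely to assemble human-readable results.
example_document = {
    "TEXT": "musterstrasse 10115 berlin mitte deutschland",
    "7": {
        "LAT": 52.5321,   # hypothetical WGS84 coordinates
        "LON": 13.3846,
        "TEXT": "Musterstrasse 7, 10115 Berlin, Deutschland",
    },
    "9": {
        "LAT": 52.5323,
        "LON": 13.3849,
        "TEXT": "Musterstrasse 9, 10115 Berlin, Deutschland",
    },
}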

Figure 2: Average number of tokens per document for various amounts of spelling variants.

Nevertheless, it is good enough for the vast majority of cases. For every result document returned by Elasticsearch, the house number map is checked. If the token assumed to be the house number happens to be a key in that map, the corresponding value is considered a match and the house number address is added to the result set. Finally, the result set is returned. As edit distances are specified in the query to Elasticsearch, this layer allows enabling edit distances easily: A parameter passed to the layer is forwarded to Elasticsearch, which then also returns documents with fuzzily matching tokens. Also note that, as house numbers are used as keys in the documents, neither edit distances nor spelling variants are supported on house numbers. That, however, is a natural limitation: If a query specifies a different house number than the one intended, especially if it is a house number that exists in the data, there is no way for a geocoding system to still match the right house number value. Having the baseline systems Nominatim and Elasticsearch supporting edit distances set up, the next step is to create a similar system that indexes spelling variants. For that, the spelling variants to be indexed need to be defined first. HERE Technologies, the company behind the HERE geocoding system (HERE, 2017), provided logs of real user queries issued against the various consumer offerings of the company, like their website or the applications for Symbian, Windows Phone, Android and iOS mobile phones. The log contained data from a whole year and included the queries users have issued along with the results users chose to click on. For this paper, a user click is considered the selection criterion of a result, linking input queries to their intent, i.e. the addresses users were querying for. Given such query and result pairs, first both were tokenized and the Levenshtein distance (Levenshtein, 1966) from every query token to every result token was computed. With edit distances at hand, the Hungarian method (Kuhn, 1955) was used to align every query token to a result token.
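As a minimal sketch of this wrapper layer (not the original implementation), the leave-one-out handling of house number tokens could look as follows; search_fn is a placeholder for the actual Elasticsearch full-text query against the TEXT field and is an assumption of this sketch.

from typing import Callable, Dict, List

def geocode(query: str, search_fn: Callable[[str], List[Dict]]) -> List[str]:
    # Split the query into tokens and try every token as the house number.
    tokens = query.lower().split()
    results = []
    for i, house_number in enumerate(tokens):
        # Query the index with all remaining tokens.
        remaining = " ".join(tokens[:i] + tokens[i + 1:])
        for document in search_fn(remaining):
            # Documents are shaped like the example above: house numbers are
            # keys mapping to the spelled-out address used as the result.
            if house_number != "TEXT" and house_number in document:
                results.append(document[house_number]["TEXT"])
    return results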

Enhanced Address Search with Spelling Variants

From these computations, several observations were extracted:

1. Some query tokens are superfluous as they do not match (close enough) to any result token. Such tokens are ignored.

2. As the result is a fully qualified address, result tokens have an address element type, such as city, street, house number, or country. Thus, for each query, the query format, i.e., which address elements were spelled out in what order, is known.

3. Some query tokens matched to result tokens are misspelled. Thus, for each spelling variant of a token, the specific spelling mistake made is known.

For this paper, the following classes of spelling variants were considered:

• inserts: Characters are appended after a tailing character, prepended before a leading character, or inserted between two characters, e.g., s is often inserted between the characters s and e, as apparently the double-s in sse sounds like a correct spelling for many users.

• deletes: Characters are removed after a character that is left as the tailing one, before a character that is left as the leading one, or between two characters that are left next to each other, e.g., oa between the characters r and d are often deleted, as users often abbreviate road as rd.

• replacements: One character is replaced by a different character, e.g., ß is often replaced by an s in user queries so that Straße becomes Strase instead.

• swaps: Two consecutive characters are swapped with each other, e.g., ie is often times swapped into ei, as, to users, both sounds may seem similar.

Thus, from each query and result pair, the query format used as well as the set of spelling variations can be deduced. Doing so for all queries while counting the occurrences of each query format and each spelling variation results in a statistical model capable of two things: For a given token, the model can determine the possible spelling variations, each with their observed count or relative probability. Also, out of a set of available address elements, the model can select and order elements such that the resulting choices correspond to formats human users use, each with their observed count or relative probability too. Because the spelling mistakes made as well as the query formats used are Pareto distributed (Arnold, 2015), the model contained a long tail of mistakes and formats used only very few times. To reduce the noise,

the model was cleansed by stripping off the 25% of all observations from the long tail of rare spelling mistakes and query formats. In addition to that, all query formats that did not contain a house number were removed too, as the goal was to generate queries for addresses with house numbers. Because the log used is, unfortunately, proprietary, neither the log nor the trained model can be released with this publication. However, a similar log of queries from another source would enable the creation of a similar model. Having the user model at hand, the spelling variants for indexing were derived as follows: Given a document to be indexed, its TEXT field was tokenized first. Next, for each token the N most common spelling variants were fetched from the model and appended to the field. Thus, the field contained both the properly spelled tokens as well as N spelling variants for each token. Every house number level address from Nominatim was extracted from the database, augmented with spelling variants and indexed in Elasticsearch. For N the values 5, 10, 20, 40, 80, 160, 320, and 640 were chosen. Note that, given a model, especially for short tokens, the number of applicable spelling variations is limited. In the most extreme cases, no spelling variant can be derived from the model for a given token at all. Figure 2 shows the resulting token counts of the TEXT field for every N. There is only a minor increase between indexing 320 and 640 spelling variants, as with 320 spelling variants almost all observed variants have already been generated. An interesting aspect of the described approach is that, besides lowercasing, no normalization mechanisms have been exploited. While users often choose to abbreviate common tokens like street types, or avoid choosing the proper diacritics, the idea is that the model would observe common replacements of Avenue with Av., or Straße with Strasse, and generate corresponding spelling variants for indexing. Like with the index without spelling variants, house numbers are not modified in any way here. In total, three geocoding systems were set up with exactly the same address data indexed: Nominatim as the reference geocoder for OpenStreetMap data, Elasticsearch with documents containing aggregated house numbers and a layer to support edit distances, and Elasticsearch with indexed spelling variants. While the edit distance was specified at query time by passing a parameter to the layer wrapping Elasticsearch, for the various numbers of indexed spelling variants distinct Elasticsearch indices have been set up. As the same layer has been used for all Elasticsearch based indices, the setup supported the possibility to query an index with spelling variants indexed while allowing an edit distance at the same time, thereby evaluating the effect of the combination of the two approaches.
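A sketch of how such a learned edit model could be applied at indexing time is given below; the model representation (character contexts mapped to observed edits with counts) and all names are assumptions of this sketch, not the actual implementation.

from heapq import nlargest
from typing import Dict, List, Tuple

# Hypothetical learned model: (left char, right char) context -> observed edits
# with their counts. An edit is ("insert", chars), ("delete", chars),
# ("replace", char) or ("swap", "").
EditModel = Dict[Tuple[str, str], Dict[Tuple[str, str], int]]

def spelling_variants(token: str, model: EditModel, n: int) -> List[str]:
    # Apply every edit observed for every character context of the token and
    # keep the n variants with the highest observed counts.
    scored = []
    for i in range(len(token) + 1):
        left = token[i - 1] if i > 0 else "^"
        right = token[i] if i < len(token) else "$"
        for (kind, arg), count in model.get((left, right), {}).items():
            if kind == "insert":
                variant = token[:i] + arg + token[i:]
            elif kind == "delete" and token[i:i + len(arg)] == arg:
                variant = token[:i] + token[i + len(arg):]
            elif kind == "replace" and i < len(token):
                variant = token[:i] + arg + token[i + 1:]
            elif kind == "swap" and i + 1 < len(token):
                variant = token[:i] + token[i + 1] + token[i] + token[i + 2:]
            else:
                continue
            if variant != token:
                scored.append((count, variant))
    return [variant for _, variant in nlargest(n, scored)]

Each token of a document's TEXT field would then be passed through such a function and the returned variants appended to the field before indexing.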

3

MEASURING THE PERFORMANCE

To evaluate the geocoding systems for precision - the ratio of responses not containing results not queried for - and recall - the ratio of responses containing only the right result - 50000 addresses have been sampled from the data used in these systems. Using the generative user model, for each address a query format has been chosen so that the distribution of the query formats corresponded to the observed distribution of the query formats preferred by users. Next, for each query, one to five query tokens have been chosen to be replaced with a spelling variant. Again, the spelling variants picked were distributed in the same way the spelling variants of human users were distributed. Thus, common query formats and frequent spelling mistakes were often present in the test set, while rare query formats and rare spelling variants were selected rarely. This way, six query sets with 50000 queries each have been generated. One contained all tokens in their original form, while the others had between one and five query tokens replaced with a spelling variant respectively. Note that a query did not always have the desired number of spelling variants: The token to be replaced with a spelling variant was chosen at random. For some tokens, as discussed, no spelling variant can be generated by the model. These tokens were left unchanged, making the query contain fewer spelling variants than anticipated. Also, sometimes the house number token was chosen to be replaced. Given the setup of the documents in the indices, where house numbers are used as keys in a map, such queries had no chance of being served properly. This, however, does not pollute measurement results, as it equally applies to all systems evaluated. Because generated queries and indexed addresses originate from the same Nominatim database, both share the same unique identifier. Therefore, inspecting the result set of a response for the result a query has been generated from is a simple task. Each test set was issued against indices with 5, 10, 20, 40, 80, 160, 320, and 640 indexed spelling variants, against the index with no spelling variants that allowed edit distances of 1 and 2, and against the two baselines: An index with neither spelling variants indexed nor edit distances allowed, as well as Nominatim. Additionally, each query set was issued against the combination of the two approaches: Indices with spelling variants indexed were queried so that edit distances were allowed.

For every query set, responses were categorized into three classes: (i) Responses that yielded no result, (ii) responses that yielded only the correct result the query was generated from, and (iii) responses containing at least one wrong result that - as the query was not generated from it - was not the query intent. As the classes cover all possible cases and do not overlap, it is sufficient to consider two of the three metrics: While the ratio of cases in (ii) is exactly the recall of a geocoding system, the ratio of responses with wrong results in (iii) allows computing precision with ease. In fact, knowing the distributions of spelling variants, it is possible to calculate how many responses will include the expected result without any measurement: The portion of spelling variants indexed is exactly the portion of spelling variants in queries that an index will be able to serve. There is, however, no simple way to calculate the precision, as it heavily depends on the data and on how ambiguous queries with spelling variants become. This, in turn, makes it impossible to compute the recall as it is defined for this experiment. These measurements allow observing the development of both metrics while the number of indexed spelling variants or the number of supported edit distances is increased.
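Given the shared identifiers, deriving both metrics from the three response classes is straightforward; the following sketch, with hypothetical function and variable names, shows the computation under the class definitions above.

def evaluate(responses, expected_ids):
    # responses[i] is the list of result identifiers returned for query i;
    # expected_ids[i] is the identifier of the address query i was generated from.
    no_result = only_correct = with_wrong = 0
    for results, expected in zip(responses, expected_ids):
        if not results:
            no_result += 1               # class (i): no result at all
        elif results == [expected]:
            only_correct += 1            # class (ii): exactly the right result
        else:
            with_wrong += 1              # class (iii): at least one wrong result
    total = len(expected_ids)
    recall = only_correct / total        # ratio of responses with only the right result
    precision = 1 - with_wrong / total   # ratio of responses without any wrong result
    return recall, precision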

4

RESULTS

Figure 3 shows an overview of the recall and inversed precision of some select systems tested. The blue chart denotes the performance of Nominatim, while the green chart denotes the performance of Elasticsearch with neither spelling variants indexed nor edit distances allowed. On the left-hand side, for recall, Nominatim performs slightly better for queries with no or one spelling mistake. That is most likely due to the normalization mechanisms that are built into Nominatim, but missing in Elasticsearch: Likely, a chunk of commonly made spelling variants can be handled through normalization. For no spelling mistake, both charts show higher recall compared to the red and yellow charts plotting the recall of the index with 320 spelling variants per token indexed, and the recall of enabling the edit distance of one, respectively. These two systems gain a slightly lower recall, due to their slightly lower precision visible on the right-hand side. As discussed, more queries become ambiguous when spelling variants are indexed, or edit distances allowed, leading to more responses containing results that the respective query was not generated from. As expected, the more spelling variants there are present in queries, the more recall

Enhanced Address Search with Spelling Variants

Figure 3: Recall (left, more is better) and inversed precision (right, less is better) of select systems.

Figure 4: Detailed overview on the performance of indexing spelling variants and allowing edit distance.

drops. Without exception, the index with 320 spelling variants per token indexed outperforms the index allowing an edit distance of one. For zero or one spelling variant, Nominatim has the lowest precision, returning the most responses with results the query did not ask for, while, as expected, the strictest system, with neither spelling variants indexed nor edit distances allowed, performs best. The other two systems - one allowing an edit distance of one, the other indexing 320 spelling variants for each token - perform very similarly. Thereby, for no spelling variants the system allowing an edit distance of one performs slightly worse, while for any number of spelling variants in the queries, it performs slightly better. However, the margin of difference between the two systems with regards to precision is minor compared to the margin of difference for the same two systems for recall. Generally, both the ratio of replies with the

correct result as well as the ratio of replies containing wrong results drop more, the more spelling variants are present in the query. That is due to the number of replies with no result growing, as neither system can process queries containing too many spelling mistakes. The detailed experiment results are denoted in Figure 4. Each line in the charts represents the development of recall or inversed precision on a specific test set. The legend specifies the allowed number of spelling errors in the queries of a test set. The top two charts show the recall and the inversed precision of the six test sets depending on how many spelling variants per token were indexed. Unsurprisingly, the more spelling errors a query contains, the less responses with only correct results are retrieved. At the same time, however, the ratios of responses containing wrong results decrease. Thus, the more errors a



user makes, the fewer results are discovered by the system overall. This behavior is also observable on the bottom two charts showing the performance on the six test sets depending on what edit distance was allowed. Interestingly, increasing the allowed edit distance to be greater than one does not improve the recall on any test set. At the same time, it worsens the precision, as with an allowed edit distance of two more candidates fit the queries, resulting in more responses containing wrong results. That symptom is not observable when indexing spelling variants. As discussed, indexing 640 spelling variants for every token of the document almost maxed out the total number of tokens generated. The observation is that for every test set, indexing more spelling variants leads to a clear improvement of recall. This pattern is also observable when enabling an edit distance of one, though to a lesser extent. Overall, on every test set, the recall of the index containing spelling variants is greater than that of the index allowing edit distances, while their precision is of similar size. The blue chart, showing the test set containing zero spelling variants, best visualizes the impact of allowing edit distances or indexing spelling variants on the left-hand side: Indexing spelling variants or allowing an edit distance both reduce the recall by a similar degree, though the recall of the geocoder indexing 640 spelling variants is slightly greater compared to enabling an edit distance of one.

Table 1: Configurations yielding best recall.

variants in query       0     1     2     3     4     5
variants indexed        0   640   320   160   320   640
edit distance           0     0     0     1     1     1
only correct result   61%   43%   26%   13%    6%    3%
also wrong result     21%   16%    9%    7%    7%    6%

In Table 1, the combinations of indexed spelling variants and allowed edit distances that led to the best results with regard to recall on the various test sets are listed. Interestingly, the number of spelling variants in the index varies between 160 and 640. That is an artifact of the random generation of queries. The numbers also show that for one or two spelling errors in queries, allowing edit distances on top of indexed spelling variants does not lead to any improvement of recall. Only if three or more query tokens are misspelled does a combination of indexed spelling variants and edit distances yield better performance.

5

CONCLUSION

As already observed in previous papers, here too, Nominatim does not handle spelling mistakes well.

Using a statistical model to derive and index common spelling variants, however, has proven to be a viable approach to serving queries with spelling errors. Compared to allowing edit distances, it yields more responses containing only the right result, while only marginally increasing the number of responses with wrong results. Interestingly, this approach implicitly incorporates any standardization logic that would be of help: Exactly those abbreviations or misspelled diacritics that are commonly made are indexed as spelling variants. The experiment also suggests indexing all possible spelling variants a cleansed model can generate: No smaller number of indexed spelling variants turned out to be an optimum beyond which the performance of the index would degrade. Also, while indexed spelling variants outperform edit distances on all query sets, a combination of the two showed slightly better results for queries with many typos. Going forward, it is worth investigating how spelling variants can be indexed without obtaining a statistical user model first. In this paper, user clicks were used to learn how often and which typos are made. Users, however, can only click on results they receive. Thus, a query token may be spelled so differently that the system will not present the proper result to the user. Even if that spelling variant were common, without a result to click on, no model could learn that spelling variant so that it can be indexed. Further, the set of supported spelling variants might be defined more precisely. The model could learn more circumstances of an edit, like, e.g., four or more characters that surround an observed edit, as opposed to two characters only. Pursuing this idea to its full extent, a model could learn specific spelling variants for specific tokens instead of edits that can be applied in different scenarios, though doing so would probably require utilizing normalization mechanisms independent of the model. Another interesting study would be to measure how much such a model degrades over time. Assuming that user behavior changes, it is likely that the kind of spelling errors common at one point in time will no longer be common some time later. Thus, if a geocoder relies only on indexed spelling variants, its performance would degrade over time.

REFERENCES

Arnold, B. C. (2015). Pareto distribution. Wiley Online Library.
Borkar, V., Deshmukh, K., and Sarawagi, S. (2000). Automatically extracting structure from free text addresses. IEEE Data Engineering Bulletin, 23(4):27–32.


Can, L., Qian, Z., Xiaofeng, M., and Wenyin, L. (2005). Postal address detection from web documents. In International Workshop on Challenges in Web Information Retrieval and Integration, 2005. (WIRI’05), pages 40–45. IEEE. Clemens, K. (2013). Automated processing of postal addresses. In GEOProcessing 2013: The Fifth International Conference on Advanced Geograhic Information Systems, Applications, and Services, pages 155–160. Clemens, K. (2015a). Geocoding with openstreetmap data. GEOProcessing 2015: The Seventh International Conference on Advanced Geograhic Information Systems, Applications, and Services, page 10. Clemens, K. (2015b). Qualitative Comparison of Geocoding Systems using OpenStreetMap Data. International Journal on Advances in Software, 8(3 & 4):377. Clemens, K. (2016). Comparative evaluation of alternative addressing schemes. GEOProcessing 2016: The Eighth International Conference on Advanced Geograhic Information Systems, Applications, and Services, page 118. Coetzee, S., Cooper, A., Lind, M., Wells, M., Yurman, S., Wells, E., Griffiths, N., and Nicholson, M. (2008). Towards an international address standard. 10th International Conference for Spatial Data Infrastructure. Davis, C. and Fonseca, F. (2007). Assessing the certainty of locations produced by an address geocoding system. Geoinformatica, 11(1):103–129. Drummond, W. (1995). Address matching: Gis technology for mapping human activity patterns. Journal of the American Planning Association, 61(2):240–251. Duncan, D. T., Castro, M. C., Blossom, J. C., Bennett, G. G., and Gortmaker, S. L. (2011). Evaluation of the positional difference between two common geocoding methods. Geospatial Health, 5(2):265–273. Elastic (2017). Elasticsearch. https://www.elastic.co/ products/elasticsearch. Fang, L., Yu, Z., and Zhao, X. (2010). The design of a unified addressing schema and the matching mode of china. In Geoscience and Remote Sensing Symposium (IGARSS), 2010. IEEE. Fitzke, J. and Atkinson, R. (2006). Ogc best practices document: Gazetteer service-application profile of the web feature service implementation specification-0.9. 3. Open Geospatial Consortium. Ge, X. et al. (2005). Address geocoding. geo poet (2017). http://geo-poet.appspot.com/. Goldberg, D., Wilson, J., and Knoblock, C. (2007). From Text to Geographic Coordinates: The Current State of Geocoding. URISA Journal, 19(1):33–46. Google (2017). Geocoding API. https://developers. google.com/maps/documentation/geocoding/. HERE (2017). Geocoder API Developer’s Guide. https:// developer.here.com/rest-apis/documentation/geocoder/. Kuhn, H. W. (1955). The hungarian method for the assignment problem. Naval research logistics quarterly, 2(1-2):83–97. Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. In Soviet physics doklady, volume 10, pages 707–710.

Mayrhofer, A. and Spanring, C. (2010). A uniform resource identifier for geographic locations (’geo’uri). Technical report, RFC 5870, June. National Imagery and Mapping Agency (2004). Department of Defense, World Geodetic System 1984, Its Definition and Relationships with Local Geodetic Systems. In Technical Report 8350.2 Third Edition. OpenStreetMap Foundation (2017a). Nomatim. http:// nominatim.openstreetmap.org. OpenStreetMap Foundation (2017b). OpenStreetMap. http://wiki.openstreetmap.org. PostGIS (2017). http://postgis.net/. PostgreSQL (2017). http://www.postgresql.org/. Robertson, S., Zaragoza, H., and Taylor, M. (2004). Simple BM25 extension to multiple weighted fields. In Proceedings of the thirteenth ACM international conference on Information and knowledge management, pages 42–49. ACM. Roongpiboonsopit, D. and Karimi, H. A. (2010). Comparative evaluation and analysis of online geocoding services. International Journal of Geographical Information Science, 24(7):1081–1100. Salton, G. and Yang, C.-S. (1973). On the specification of term values in automatic indexing. Journal of documentation, 29(4):351–372. Salton, G., Yang, C.-S., and Yu, C. T. (1975). A theory of term importance in automatic text analysis. Journal of the American society for Information Science, 26(1):33–44. Sengar, V., Joshi, T., Joy, J., Prakash, S., and Toyama, K. (2007). Robust location search from text queries. In Proceedings of the 15th annual ACM international symposium on Advances in geographic information systems, page 24. ACM. Srihari, S. (1993). Recognition of handwritten and machine-printed text for postal address interpretation. Pattern recognition letters, 14(4):291–302. Universal Postal Union (2017). http://www.upu.int. what3words (2017). what3words. https://map. what3words.com/. Yahoo! (2017). BOSS Geo Services. https://developer. yahoo.com/boss/geo/. Yandex (2017). Yandex.Maps API Geocoder. https://tech. yandex.com/maps/geocoder/. Yang, D.-H., Bilaver, L. M., Hayes, O., and Goerge, R. (2004). Improving geocoding practices: evaluation of geocoding tools. Journal of medical systems, 28(4):361–370.


Automatic Tree Annotation in LiDAR Data
Ananya Gupta1,2, Jonathan Byrne2, David Moloney2, Simon Watson1 and Hujun Yin1

1School of Electrical and Electronic Engineering, The University of Manchester, Manchester, U.K.
2Movidius Group, Intel Corporation, Dublin, Ireland
{ananya.gupta, simon.watson, hujun.yin}@manchester.ac.uk, {jonathan.byrne, david.moloney}@intel.com

Keywords:

Airborne LiDAR, Urban Areas, Classification, Tree Detection, Voxelization.

Abstract:

LiDAR provides highly accurate 3D point cloud data for a number of tasks such as forest surveying and urban planning. Automatic classification of this data, however, is challenging since the dataset can be extremely large and manual annotation is labour intensive if not impossible. We provide a method of automatically annotating airborne LiDAR data for individual trees or tree regions by filtering out the ground measurements and then using the number of returns embedded in the dataset. The method is validated on a manually annotated dataset for Dublin city with promising results.

1

INTRODUCTION

Trees are critical to the healthy functioning of the ecosystem and provide a number of benefits to the environment, such as regulation of water systems, maintaining air quality, carbon sequestration and promoting biodiversity. Hence, up-to-date tree inventories are extremely important for the monitoring and preservation of ecological environments, so much so that one of the key items on the top ten initiatives of the World Economic Forum on the Future of Cities in 2015 was to increase green canopy (Treepedia, 2015). LiDAR sensors are a good tool for acquiring dense point cloud data for surveying at short ranges. These sensors measure distance by timing a laser pulse reflected from a target and have been applied in a number of remote sensing applications ranging from mapping (Schwarz, 2010) and landslide investigations (Jaboyedoff et al., 2012) to tree inventories (Shendryk et al., 2016b). LiDAR systems are particularly suited to surveying forest canopies due to their active sensors and their ability to penetrate canopies. Currently, most of the research on isolating trees focuses on forested areas, with little emphasis given to urban areas, where environments are more complex due to the presence of multiple types of natural and artificial objects. However, surveying trees in urban areas is of paramount importance for applications such as city planning, estimating the green canopy of areas and monitoring solar radiation (Jochem et al., 2009).

In this work, we provide an algorithm to identify trees in urban areas using information embedded in the LiDAR data, without requiring human intervention. Contrary to the common use of the Canopy Height Model (CHM) for this task, our method works directly with the LiDAR data and uses the number of returns information to isolate trees.

2

RELATED WORK

A number of methods have been developed to segment trees in LiDAR data, with the most common being based on the CHM (Lu et al., 2014; Ferraz et al., 2016; Reitberger et al., 2009; Smits et al., 2012; Mongus and Zalik, 2015). Hyyppa et al. (2001) pioneered work in this area; they used the information from the highest laser returns to build a tree height model and then used region growing techniques for tree segmentation. Koch et al. (2006) used local maximum filters to identify potential tree regions, followed by the use of a pouring algorithm and knowledge-based assumptions to identify tree crowns. Li et al. (2012) took advantage of the spacing between treetops at their highest points to identify trees and used a region growing algorithm to segment them. More recently, Shendryk et al. (2016a) used Euclidean distance clustering to delineate trunks in eucalypt forests. These methods proved highly effective in identifying trees in forested areas but are unsuitable for use in urban environments



since the assumption of highly dense collections of trees does not apply to isolated individual trees. Pioneering work in urban tree detection was based on machine learning techniques. Secord and Zakhor (2007) used a combination of aerial images and LiDAR data for segmentation and classification with Support Vector Machines (SVM). They extended this work to using features derived from depth images of LiDAR data with an SVM classifier (Chen and Zakhor, 2009). Carlberg et al. (2009) used a cascade of binary classifiers to progressively identify water, ground, roofs and trees by conducting 3D shape analysis. Segmenting foreground and background and classifying object-like clusters was used to locate different 3D objects in an urban environment (Golovinskiy et al., 2009). Decision trees and Artificial Neural Networks have also been used for segmenting features from Digital Surface Models for classification (Höfle et al., 2012). The main drawback with these methods is that they need pre-labelled data in order to train their models and cannot work in an unsupervised manner. There has been some work done on identifying trees in urban environments without the need for labelled data. Liu et al. (2013) proposed a method for extracting tree crowns by filtering out ground points and using a spoke wheel method to get tree edges. Wu et al. (2013) proposed a voxel-based method to extract individual trees from mobile laser scanning data, but their method is not suitable for use with airborne LiDAR scans. Zhang et al. (2015) developed a method to estimate tree metrics for urban forest inventory purposes by detecting treetops and using a region growing algorithm for segmentation.

We propose a new technique for automatically detecting trees in urban environments by using the number of returns information embedded in LiDAR data. Similar to previous approaches, our method starts with ground removal; however, following that step we voxelise the point cloud data and show that trees can easily be extracted from this subsampled data by using certain heuristics and data analysis.

Figure 1: LiDAR point cloud showing decreasing number of returns, where Red > Yellow > Green > Blue.

Figure 2: Voxelized point cloud showing number of returns.

3 METHODOLOGY

Our method for labelling trees is based on four distinct steps: ground filtering, voxelizing non-ground point cloud data, isolating tree-like regions using the information gained from the number of returns, and post-processing to remove false positives.

3.1 Ground Filtering

A Digital Terrain Model (DTM) is used to represent the surface of the Earth, and there is a vast body of research on extracting ground points from LiDAR scans in order to produce a DTM. There are a number of different algorithms for ground filtering, such as morphological filtering, surface-based adjustment and statistical analysis (Chen et al., 2017). We use a Progressive Morphological Filter (PMF) (Zhang et al., 2003) to identify ground points. PMF applies the morphological operations of dilation and erosion with progressively increasing window sizes to identify non-ground points. We filter out the ground points identified using this technique; the results are shown in Figure 3(b). However, this is not successful in removing all the ground points, hence we apply statistical outlier removal to the filtered point cloud and obtain a much cleaner result, as can be seen in Figure 3(c).
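This ground-filtering step can be reproduced with off-the-shelf tools. The following is a minimal sketch (not the authors' code) using the PDAL library, whose filters.pmf and filters.outlier stages provide a progressive morphological filter and statistical outlier removal; file names and parameter values are illustrative only.

```python
# Hedged sketch: PMF ground classification followed by statistical outlier removal
# on the remaining non-ground points. File names and parameters are illustrative.
import json
import pdal

pipeline = {
    "pipeline": [
        "dublin_tile.laz",                      # hypothetical input LiDAR tile
        {"type": "filters.pmf",                 # progressive morphological filter
         "max_window_size": 33,
         "slope": 1.0,
         "initial_distance": 0.5,
         "cell_size": 1.0},
        {"type": "filters.range",               # drop points classified as ground (class 2)
         "limits": "Classification![2:2]"},
        {"type": "filters.outlier",             # statistical outlier removal on the rest
         "method": "statistical",
         "mean_k": 8,
         "multiplier": 2.0},
        "non_ground_clean.laz"                  # hypothetical output file
    ]
}

pdal.Pipeline(json.dumps(pipeline)).execute()
```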


(a) Original Point Cloud

(b) Result of filtering with PMF

(c) Result after cleaning for noise
Figure 3: Ground Filtering.

3.2 Voxelization

An aerial LiDAR survey returns an extremely large volume of data, with an hour-long survey generating over a billion unique points (Geosystems, 2015). Even after filtering the ground points, the dataset can retain more than half of the original points if the survey was over an occupied region, such as forests or urban areas. Hence, LiDAR point cloud data is often converted to a mesh in order to reduce its dimensionality. However, meshing algorithms can be error prone when there are voids in the data, since they make assumptions about the shape in order to produce water-tight meshes. Meshing algorithms such as Poisson reconstruction require normals for the points, which are not directly available from the LiDAR output. They also remove tall thin objects such as lampposts, tree trunks and chimneys during the fitting process. We convert the data into a volumetric occupancy grid in order to overcome the limitations of meshing algorithms while reducing its dimensionality. A fixed-size three-dimensional grid is overlaid on the point cloud and the occupancy of each cell depends on the presence of points within the cell, i.e. the cell is unoccupied if there are no points in the cell's volume and occupied otherwise. In this case, each volumetric element (voxel) represents a region in the subsampled point cloud. Following the original tile dimensions of 100 m × 100 m, we convert each tile into a voxel grid of dimensionality 256 × 256 × 256, hence limiting the resolution of the voxel grid to ≈ 0.39 m × 0.39 m × 0.39 m per voxel. Any further increase in resolution does not noticeably increase accuracy and causes an exponential increase in processing time due to the increased dimensionality of the data. Furthermore, we use VOLA (Byrne et al., 2017) to sparsely encode the voxel representation. VOLA is a hierarchical 3D data structure which encodes only occupied voxels, with one bit per voxel packed into standard unsigned 64-bit integers. Unlike standard octrees, which do not explicitly encode empty voxels, we use a 2-bits-per-voxel approach to encode additional information per voxel, such as colour, number of returns and intensity value.
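As an illustration only (the VOLA encoding itself is a bit-packed hierarchical structure and is not reproduced here), the basic voxelisation of a tile into a dense occupancy grid with a per-voxel number-of-returns attribute can be sketched as follows; the grid size and tile extent follow the values above, while the cubic vertical extent and the per-voxel aggregation are assumptions.

```python
# Hedged numpy sketch of the voxelisation step: a 100 m tile is mapped onto a
# 256^3 grid (~0.39 m voxels) and the number-of-returns attribute is aggregated
# per voxel for later use. Assumes a cubic tile extent; not the VOLA implementation.
import numpy as np

def voxelise(points, num_returns, origin, tile_size=100.0, grid=256):
    """points: (N, 3) array of x, y, z; num_returns: (N,) per-point returns; origin: (3,) tile corner."""
    idx = np.floor((points - origin) / (tile_size / grid)).astype(int)
    idx = np.clip(idx, 0, grid - 1)              # guard points lying on the tile border
    occupancy = np.zeros((grid, grid, grid), dtype=bool)
    returns = np.zeros((grid, grid, grid), dtype=np.uint8)
    occupancy[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    # keep the maximum number of returns observed within each voxel
    np.maximum.at(returns, (idx[:, 0], idx[:, 1], idx[:, 2]), num_returns.astype(np.uint8))
    return occupancy, returns
```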

3.3 Isolating Tree Regions

LiDAR pulses reflect from surfaces they encounter, such as buildings, vegetation and ground. Each pulse can return to the LiDAR sensor once or multiple times, depending on the number of surfaces it encounters. Trees typically have a high number of returns since the laser pulses can reflect from multiple edges of leaves and branches. Other features that can have a high number of returns are the edges of buildings and window ledges. However, these latter values are more scattered than in the case of trees, which have a large number of high returns closely packed together, as can be seen in Figure 1. We use this insight to isolate tree regions by identifying voxels with multiple returns (greater than 3 per voxel) and then performing a connected component analysis on these voxels. Regions with at least a minimum number of connected voxels are then identified as tree regions, while regions smaller than the threshold are discarded as noise from buildings, corners, etc.
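A minimal sketch of this step is given below, selecting multi-return voxels and grouping them with a 3D connected component analysis; the return threshold follows the text above, while the minimum component size is an illustrative value (the paper does not state one).

```python
# Hedged sketch of the tree-region isolation: voxels with more than 3 returns are
# grouped into 3D connected components and small components are discarded as noise.
import numpy as np
from scipy import ndimage

def isolate_tree_regions(returns_grid, return_threshold=3, min_voxels=50):
    candidate = returns_grid > return_threshold           # multi-return voxels
    labels, n = ndimage.label(candidate)                  # 3D connected components
    sizes = ndimage.sum(candidate, labels, index=np.arange(1, n + 1))
    keep = np.flatnonzero(sizes >= min_voxels) + 1        # component ids large enough to keep
    return np.isin(labels, keep), labels
```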


3.4 Identifying Individual Trees

The tree regions isolated using connected components typically capture only the tree canopy, as trunks may have only a few disconnected voxels. Hence, in order to recover tree trunks, the maximum and minimum x and y coordinates of each region are identified, along with the maximum z coordinate. These coordinates are then used to place a three-dimensional bounding box in the original data, which is extended down to ground level in order to capture the trunk information. The width-to-length ratio of the bounding box is constrained so that one dimension is never more than twice the other. Any regions not matching these constraints are discarded as false positives. This allows walls covered with ivy to be discarded, since walls are typically long but not very thick, whereas trees have similar widths and lengths and hence a width-to-length ratio ≈ 1.
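A small sketch of the bounding box and aspect-ratio test described above is given below; the helper function and the assumption that ground level is z = 0 are illustrative, not the paper's exact implementation.

```python
# Hedged sketch: per-region bounding box extended to ground level, kept only if the
# footprint width-to-length ratio stays within a factor of two.
import numpy as np

def region_bounding_box(region_points, max_ratio=2.0):
    """region_points: (N, 3) coordinates of one connected tree-like region."""
    x_min, y_min = region_points[:, 0].min(), region_points[:, 1].min()
    x_max, y_max = region_points[:, 0].max(), region_points[:, 1].max()
    z_max = region_points[:, 2].max()
    width, length = x_max - x_min, y_max - y_min
    if max(width, length) > max_ratio * max(min(width, length), 1e-6):
        return None                                       # discard, e.g. ivy-covered walls
    # the box is extended down to ground level (assumed z = 0 after ground removal)
    return (x_min, y_min, 0.0, x_max, y_max, z_max)
```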

4 EXPERIMENTS AND RESULTS

4.1 Data

This method was tested on a dense LiDAR dataset of Dublin city (Laefer et al., 2015). The dataset was captured at an altitude of 300 m using a TopEye system S/N 443. It consists of over 600 million points with an average point density of 348.43 points/m² and covers an area of 2 km² in Dublin city centre. We tested our results against the labels from Ningal (2012), containing tree annotations around some of the major streets in Dublin from 2008. In order to get more up-to-date results, we manually annotated the region north of the Liffey river for trees.

4.2 Evaluation Metric

There are three different outcomes for the purposes of detection: True Positives (TP), where trees are correctly recognised; False Positives (FP), where regions are incorrectly identified as trees; and False Negatives (FN), where trees are not detected. Based on these counts we evaluate the following metrics (Goutte and Gaussier, 2005):

    r = TP / (TP + FN) × 100
    p = TP / (TP + FP) × 100
    Fscore = 2 × (r × p) / (r + p)        (1)

where r (recall) is the tree detection rate, p (precision) is the tree detection precision and Fscore is the total accuracy.

4.3 Results

The extracted trees are shown in Figure 4 along with the original labels. We compared our labelled outputs with two different sets of annotations, the first from 2008 and the second from 2015; the results are summarised in Table 1.

(a) Experiment 1

(b) Experiment 2
Figure 4: Map of survey area with tree locations shown in yellow and the outputs of our labelling algorithm shown in red.

Table 1: Summary of Results.

Experiment   Trees   TP    FP   FN    p      r      Fscore
1            313     178   45   135   0.57   0.80   0.66
2            535     469   56   66    0.88   0.89   0.88

The results of Experiment 1 (against the 2008 annotations) seem to suggest that the accuracy of our labelling method is extremely low, with an Fscore of 0.66. On further analysis, we discovered that the urban landscape had changed considerably between 2008, when the tree labels were acquired, and 2015, when the LiDAR survey was flown, due to roadworks and the construction of the city tram. Hence, we annotated a section of the city using imagery from 2015 to obtain a fair analysis of our method.

The results from the second experiment show that our algorithm performs well, identifying almost 90% of the trees in Dublin correctly, but has some weak points. It is unable to distinguish between multiple trees packed closely together, as shown in Figure 5(a), and assumes the entire canopy is a single tree, leading to a number of missed detections. It also mislabels heavy ivy and bushes as trees, since those produce multiple returns as well. Our method performs extremely well in labelling isolated trees irrespective of size; one such case is shown in Figure 5(b), where all trees along the side of O'Connell Street have been correctly identified.

(a) Missed trees due to combined canopies

(b) Correctly labelled individual trees
Figure 5: Results.

5 CONCLUSIONS

This paper addresses the challenge of automatically labelling trees in LiDAR data from urban environments. Most previous work in this area focused on extracting trees in forested regions, but those techniques cannot directly be applied to urban environments due to the complexity of the scene and the presence of multiple object types. The proposed method uses the number of returns information present in the LiDAR data to isolate tree regions and identifies individual trees by voxelizing the data and finding clusters resembling trees using connected component analysis. It deals well with partially occluded trees and achieved a satisfactory accuracy of almost 90% in central Dublin, showing the effectiveness of the method. The method has some drawbacks, namely that it is unable to separate all individual trees within a large clump. In order to address these drawbacks, this work can be extended in a number of ways in the future:
• Improving individual tree detection in clumps by isolating individual trunks along with the tree canopy.
• Combining photogrammetry data with the LiDAR point cloud for more robust interpretation.

• Utilising the labelled trees from this method to train a more robust machine learning based classifier which can generalise across point cloud datasets for tree detection.
• Identifying building edges and windows using the number of returns information and extrapolating from those to obtain building reconstructions.

ACKNOWLEDGMENT

This research was undertaken while A. Gupta was an intern at Intel Corporation and was partly funded by the HiPEAC4 Network of Excellence under the EU's H2020 programme, grant agreement number 687698.


REFERENCES Byrne, J., Caulfield, S., Xu, X., Pena, D., Baugh, G., and Moloney, D. (2017). Applications of the VOLA Format for 3D Data Knowledge Discovery . In International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery, pages 1–8. Carlberg, M., Gao, P., Chen, G., and Zakhor, A. (2009). Classifying urban landscape in aerial lidar using 3D shape analysis. In Proceedings - International Conference on Image Processing, ICIP, pages 1701–1704. IEEE. Chen, G. and Zakhor, A. (2009). 2D tree detection in large urban landscapes using aerial LiDAR data. In Image Processing (ICIP), 2009 16th IEEE International Conference on, pages 1693–1696. IEEE. Chen, Z., Gao, B., and Devereux, B. (2017). State-of-theArt: DTM Generation Using Airborne LIDAR Data. Sensors, 17(1):150. Ferraz, A., Saatchi, S., Mallet, C., and Meyer, V. (2016). Lidar detection of individual tree size in tropical forests. Remote Sensing of Environment, 183:318–333. Geosystems, L. (2015). Leica scanstation p30/p40. Technical report, Heerbrugg, Switzerland. Golovinskiy, A., Kim, V. G., and Funkhouser, T. (2009). Shape-based recognition of 3D point clouds in urban environments. In 2009 IEEE 12th International Conference on Computer Vision, pages 2154–2161. IEEE. Goutte, C. and Gaussier, E. (2005). A Probabilistic Interpretation of Precision, Recall and F-Score, with Implication for Evaluation. In European Conference on Information Retrieval, pages 345–359. H¨ofle, B., Hollaus, M., and Hagenauer, J. (2012). Urban vegetation detection using radiometrically calibrated small-footprint full-waveform airborne LiDAR data. ISPRS Journal of Photogrammetry and Remote Sensing, 67(1):134–147. Hyyppa, J., Kelle, O., Lehikoinen, M., and Inkinen, M. (2001). A segmentation-based method to retrieve stem volume estimates from 3-D tree height models produced by laser scanners. IEEE Transactions on Geoscience and Remote Sensing, 39(5):969–975. Jaboyedoff, M., Oppikofer, T., Abell´an, A., Derron, M.-H., Loye, A., Metzger, R., and Pedrazzini, A. (2012). Use of LIDAR in landslide investigations: a review. Natural Hazards, 61(1):5–28. Jochem, A., H¨ofle, B., Hollausb, M., and Rutzingerc, M. (2009). Object detection in airborne LIDAR data for improved solar radiation modeling in urban areas. In Laser scanning, volume 38, pages 1–6. International Society for Photogrammetry and Remote Sensing (ISPRS). Koch, B., Heyder, U., and Weinacker, H. (2006). Detection of Individual Tree Crowns in Airborne Lidar Data. Photogrammetric Engineering & Remote Sensing, 72(4):357–363. Laefer, D. F., Abuwarda, S., Vo, A.-V., Truong-Hong, L., and Gharibi, H. (2015). 2015 Aerial Laser and Photogrammetry Survey of Dublin City Collection Record.

Li, W., Guo, Q., Jakubowski, M. K., and Kelly, M. (2012). A New Method for Segmenting Individual Trees from the Lidar Point Cloud. Photogrammetric Engineering & Remote Sensing, 78(1):75–84. Liu, J., Shen, J., Zhao, R., and Xu, S. (2013). Extraction of individual tree crowns from airborne LiDAR data in human settlements. Mathematical and Computer Modelling, 58(3-4):524–535. Lu, X., Guo, Q., Li, W., and Flanagan, J. (2014). A bottomup approach to segment individual deciduous trees using leaf-off lidar point cloud data. ISPRS Journal of Photogrammetry and Remote Sensing, 94:1–12. Mongus, D. and Zalik, B. (2015). An efficient approach to 3D single tree-crown delineation in LiDAR data. ISPRS Journal of Photogrammetry and Remote Sensing, 108:219–233. Ningal, T. (2012). PhD Thesis. PhD thesis, UCD School of Geography. Reitberger, J., Krzystek, P., Stilla, U., and Sensing, R. (2009). Benefit of airborne full waveform lidar for 3D segmentation and classification of single trees. Schwarz, B. (2010). Lidar: Mapping the world in 3D. Nature Photonics, 4(7):429–430. Secord, J. and Zakhor, A. (2007). Tree Detection in Urban Regions Using Aerial LiDAR and Image Data. IEEE Geoscience and Remote Sensing Letters, 4(2):196– 200. Shendryk, I., Broich, M., Tulbure, M. G., and Alexandrov, S. V. (2016a). Bottom-up delineation of individual trees from full-waveform airborne laser scans in a structurally complex eucalypt forest. Remote Sensing of Environment, 173:69–83. Shendryk, I., Broich, M., Tulbure, M. G., McGrath, A., Keith, D., and Alexandrov, S. V. (2016b). Mapping individual tree health using full-waveform airborne laser scans and imaging spectroscopy: A case study for a floodplain eucalypt forest. Remote Sensing of Environment, 187:202–217. Smits, I., Prieditis, G., Dagis, S., and Dubrovskis, D. (2012). Individual tree identification using different LIDAR and optical imagery data processing methods. Biosystems and Information Technology, 1(1):19–24. Treepedia (2015). MIT Senseable City Lab. Wu, B., Yu, B., Yue, W., Shu, S., Tan, W., Hu, C., Huang, Y., Wu, J., and Liu, H. (2013). A Voxel-Based Method for Automated Identification and Morphological Parameters Estimation of Individual Street Trees from Mobile Laser Scanning Data. Remote Sensing, 5(2):584–611. Zhang, C., Zhou, Y., and Qiu, F. (2015). Individual Tree Segmentation from LiDAR Point Clouds for Urban Forest Inventory. Remote Sens, 7:7892–7913. Zhang, K., Chen, S. C., Whitman, D., Shyu, M. L., Yan, J., and Zhang, C. (2003). A progressive morphological filter for removing nonground measurements from airborne LIDAR data. IEEE Transactions on Geoscience and Remote Sensing, 41(4 PART I):872–882.


Improvements to DEM Merging with r.mblend

Luís Moreira de Sousa1 and João Paulo Leitão2
1 ISRIC - World Soil Information, Droevendaalsesteeg 3, Building 101, 6708 PB Wageningen, The Netherlands
2 Swiss Federal Institute of Aquatic Science and Technology (EAWAG), Urban Water Management Department (SWW), Überlandstrasse 133, CH-8600 Dübendorf, Switzerland
[email protected], [email protected]

Keywords: Digital Elevation Model (DEM), Terrain Analysis, Raster.

Abstract: r.mblend is an implementation of the MBlend method for merging Digital Elevation Models (DEMs). This method produces smooth transitions between contiguous DEMs of different spatial resolution, for instance when acquired by different sensors. r.mblend is coded using the Python API provided by the Geographic Resources Analysis Support System (GRASS) and is fully integrated in that GIS software. It introduces improvements to the original method and provides the user with various parameters to fine-tune the merging procedure. This article showcases the main differences between r.mblend and two conventional DEM merge methods: Cover and Average.

1 INTRODUCTION

In the Geographic Information Systems (GIS) domain, the representation of terrain elevation has been predominantly performed using the raster data format, in what are called Digital Elevation Models (DEMs). The discretisation of elevation by a regular grid is rather useful in software development, with a direct correspondence to a two-dimensional array. This ease of development has fostered the creation of numerous spatial analysis methods (de Smith et al., 2015), making DEMs ever more convenient. DEMs have traditionally been acquired by stereoscopic sensors on board airborne or space-borne vehicles. For decades DEMs remained an expensive and inaccessible type of data. The emergence of technologies like Light Detection and Ranging (LiDAR) sensors, and small and easy to operate Unmanned Aerial Vehicles (UAVs), has made the acquisition of high resolution DEMs considerably simpler and less expensive (Küng et al., 2011). With multiple DEMs obtained by different methods and at different spatial resolutions available, spatial analysts today often face the need to combine or merge several of these data sets. However, there is no obvious method for doing so; a direct merge of overlapping DEMs produces artefacts along borders, leading to inconsistent terrain aspects and slopes (Katzil and Doytsher, 2005; Luedeling et al., 2007). Spatial analysis conducted on such merged DEMs inevitably produces unreliable results, be it in view-shed computation, overland water flow, least cost path, etc.

The MBlend method (Leitão et al., 2016) proposes to merge two overlapping DEMs by retaining the highest spatial resolution DEM and introducing a smooth transition into the lower resolution DEM. Modifications are applied only to the lower resolution DEM, producing a single DEM that covers the entire study area with the highest possible accuracy, while also ensuring smooth transitions between the original DEMs. r.mblend is an implementation of the MBlend method, coded in the Python language as an add-on to the Geographic Resources Analysis Support System (GRASS). It introduces an advanced and flexible computation of the transition between DEMs, which the user may tune through various parameters. This article compares results obtained using r.mblend with those of two conventional DEM merging methods on two different test cases. Section 2 describes the methods used, Section 4 presents the test cases used for comparison and Section 6 rounds up the results.

2 RASTER MERGING

2.1 Conventional Methods

Common GIS programmes provide simple functions to merge raster data sets. They usually require the inputs to have the same cell size and to be in the same coordinate system. These simple methods can be classified into two different types: Cover and Average (Eastman, 2012). Cover type methods do not apply any adjustments to the input DEMs; they are simply superimposed. The DEM resulting from this method assumes the cell values of the first input across its entire extent and the values of the second input in areas not covered by the first. The resulting DEM can exhibit significant elevation discontinuities along the boundary between the input DEMs, resulting in erroneous slope and aspect values (Hickey, 2000). Average methods assign to the merged DEM the average elevation within areas where the input DEMs overlap. Outside the overlapping area the resulting DEM assumes the value of the existing input; only values within the overlapping area are changed. Some of these methods try to tackle the discontinuity issue using a weighted average, as is the case with IDRISI (Eastman, 2012). A subset of these, usually referred to as Blend methods, go further, using an averaging weighting function that may be linear, smoothed (e.g. bicubic), or discontinuous; this way more weight can be given to one of the inputs in certain areas, e.g. closer to the borders. It must be noted, though, that these averaging methods act as low pass filters, therefore reducing the accuracy of the resulting DEM.
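As a minimal illustration (not a reproduction of any particular GIS tool), the two conventional merge types can be expressed with simple array operations, assuming the two DEMs have already been resampled to a common grid and that NaN marks cells without data:

```python
# Hedged sketch of the Cover and Average merge types on co-registered DEM arrays
# (NaN marks cells not covered by an input).
import numpy as np

def cover_merge(primary, secondary):
    """Cover: keep the first input wherever it has data, fill the rest with the second."""
    return np.where(np.isnan(primary), secondary, primary)

def average_merge(dem_a, dem_b, weight_a=0.5):
    """Average/Blend: weighted mean in the overlap, original values elsewhere."""
    overlap = ~np.isnan(dem_a) & ~np.isnan(dem_b)
    merged = cover_merge(dem_a, dem_b)
    merged[overlap] = weight_a * dem_a[overlap] + (1.0 - weight_a) * dem_b[overlap]
    return merged
```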

2.2 MBlend

The MBlend method differs from conventional methods in two essential aspects: it is aware of the different spatial resolutions of its inputs and it modifies only the areas for which solely low resolution data are available (Leitão et al., 2016). The method works by identifying two edges: the border between the low and high resolution inputs (near edge) and the border around the area of the low resolution input not overlapping with the high resolution input (far edge). Points are set along each of these two edges; those on the near edge take the difference between the two inputs at that location, while those on the far edge take the value zero (see Figure 1). From these points a transition surface is computed, spatially restricted to the area of the low resolution input not overlapping with the high resolution input. Finally, the transition surface is subtracted from the low resolution input; the resulting DEM assumes the values of the high resolution input within its extent and, outside it, the values of the low resolution input minus the transition surface. The MBlend method consists of seven essential steps (a hedged sketch of the corresponding GRASS operations is given after the list):
1. Obtain the low resolution extent - this is the extent of the study area that is only covered by the low resolution DEM. It can be obtained by vectorising the extent of each DEM and then applying an intersection.
2. Compute differences - obtained by subtracting the low resolution from the high resolution DEM.
3. Obtain the near edge - the differences map is vectorised into points. A buffer around the low resolution extent is then used to select from these difference points those that lay along the border between the two rasters (see Figure 1 (a)).
4. Obtain the far edge - the low resolution DEM is vectorised to points and those along the border are selected using an internal buffer to the low resolution extent.
5. Build the interpolation points set - the value zero is assigned to the points in the far edge; it is then merged with the near edge into a single data set.
6. Interpolate the smoothing surface - a new raster surface is created by interpolation using the edge points data set. The resulting surface smoothly transitions from the full difference between the two input DEMs along the near edge towards zero along the far edge (see Figure 1 (b)).
7. Apply smoothing - the smoothing surface is added to the low resolution raster. The result is then patched with the high resolution raster to obtain a single data set covering the entire study area.
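The steps above map naturally onto standard GRASS modules. The sketch below (illustrative only, not the r.mblend source) outlines the core raster operations with the grass.script API; map names are assumptions and the buffer-based selection of near- and far-edge points (steps 3 to 5) is only summarised in a comment.

```python
# Hedged outline of the MBlend idea with standard GRASS modules via grass.script.
# Map names are illustrative; the sign convention follows step 7 above.
import grass.script as gs

gs.mapcalc("diff = high_dem - low_dem")                   # step 2: differences
gs.run_command("r.to.vect", input="diff", output="diff_points", type="point")
# steps 3-5: buffers around the low-resolution-only extent would select the near-edge
# points (value = diff) and far-edge points (value = 0) into an "edge_points" map
gs.run_command("v.surf.idw", input="edge_points", column="value",
               output="smooth_surface", npoints=50)       # step 6: IDW interpolation
gs.mapcalc("low_adjusted = low_dem + smooth_surface")     # step 7: apply smoothing
gs.run_command("r.patch", input="high_dem,low_adjusted", output="blended_dem")
```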

3 IMPLEMENTATION

3.1 The GRASS Add-on Development Environment

GRASS is a Geographical Information System (GIS) originally developed by the US Army Corps of Engineers with a focus on spatial data management and analysis (Neteler et al., 2012). It is characterised by a deep dataset management and archiving structure and a vast range of analysis operations, also known as modules. GRASS manages multiple data-set types: raster, vector, imagery and voxel (3D). GRASS was originally written in C, with its modern structure now also coded in C++. Since 2012, in the wake of version 6.4.2 (https://grass.osgeo.org/announces/announce_grass642.html), an API to the GRASS C library was made available for the Python programming language (Sanner et al., 1999). This API greatly simplified the development of new GRASS modules, also facilitating the integration of popular Python libraries such as NumPy (http://www.numpy.org/) or Pandas (http://pandas.pydata.org/). A system to host new modules – called “add-ons” – was also created, whereby third party developers commit their code to the GRASS repository, thus making their module(s) automatically available to all GRASS users. These “add-ons” can be added to every GRASS installation with the module g.extension. This module connects automatically to the GRASS repository, downloads and installs the required binaries or code. Developed within this environment, r.mblend is versioned and managed on GitHub (https://github.com/ldesousa/r.mblend) and released under the European Union Public Licence (http://ec.europa.eu/idabc/eupl.html).
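For instance, the add-on can be installed from a running GRASS session through the Python API roughly as follows (a minimal sketch):

```python
# Hedged example: installing the r.mblend add-on via g.extension from the GRASS Python API.
import grass.script as gs

gs.run_command("g.extension", extension="r.mblend")
```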

3.2 Improvements Over the Original MBlend Method

(a) Edges in original method

(b) Smooth surface

The main difference between the r.mblend implementation and the original method concerns the computation of the far edge. By default, r.mblend uses only those points in the far edge that are farthest away from the near edge, with the aim of obtaining a geometrically even transition in the smoothing surface. In detail, this computation is performed by r.mblend as follows:
1. Compute a distance map within the low resolution area relative to the high resolution raster.
2. Vectorise the distance raster into a points data set.
3. Normalise the distance values and select those above a certain threshold (by default 95% of the maximum distance).
4. Use an inner buffer to the interpolation area to further select only those points along the low resolution raster border.
Figure 1 presents these differences to the original proposal with a simple case. The user is able to adjust the distance cut-off, thus controlling the weight of the far edge in the smoothing surface interpolation. The r.mblend implementation also provides the user with the option of using the average difference between the two input DEMs as the value assigned to the far edge interpolation points (instead of zero, as in the original proposal). In this mode the resulting DEM remains closer to the high resolution input. This may be useful when the differences between the two DEMs are spatially uncorrelated.


(c) Edges with 95% distance cut off (d) Smooth surface Figure 1: Interpolation points edges with original method (a) and a 95% cut off to the maximum distance (c) and the respective smoothing surfaces (b and d). Near edge in green, far edge in yellow.

3.3 Model Parameters

The r.mblend module takes the following arguments:
• high - name of the high resolution DEM;
• low - name of the low resolution DEM (overlapped by the high resolution input);
• output - name of the resulting blended DEM;
• far_edge - percentage of the maximum distance to the high resolution DEM used to determine the far edge;
• inter_points - number of points (from both edges) to use in the interpolation;
• -a - optional flag indicating that the average difference between the two input rasters should be assigned to the far edge (instead of zero).

The far_edge argument is bounded between 0 and 100; by default a value of 95 is used. Values closer to zero translate into a higher number of points in the far edge, impacting the shape of the differences raster. The Inverse Distance Weighting (IDW) method is used to interpolate the smoothing surface. This method takes as a parameter the number of points (from the two edges) used to interpolate each new cell value. By default 50 points are used; inter_points provides the user with a means to tweak this value. The higher the number of points used, the smoother the resulting smoothing surface; however, it also means a lengthier computation time.
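A typical invocation with the default parameter values described above might look as follows; this is a sketch assuming the add-on is installed, and the map names are illustrative.

```python
# Hedged usage sketch of r.mblend from the GRASS Python API; map names are illustrative.
import grass.script as gs

gs.run_command(
    "r.mblend",
    high="uav_dem",          # high resolution input
    low="lidar_dem",         # low resolution input, overlapped by the high resolution DEM
    output="blended_dem",    # resulting blended DEM
    far_edge=95,             # distance cut-off (percent) used to pick far-edge points
    inter_points=50,         # number of edge points used by the IDW interpolation
)
# add flags="a" to assign the average difference (instead of zero) to the far edge
```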

4 TEST CASES

4.1 A - Lucerne

For the first test, two DEMs representing an urban catchment in the city of Lucerne, Switzerland, were employed. This is a relatively smooth surface, but it includes a number of detailed man-made features. The lower resolution DEM was obtained with an airborne LiDAR sensor and provided to this study by the official cadastral service of the Canton of Lucerne. It has a cell side of 0.5 metres and a vertical accuracy of approximately 0.5 metres (Figure 2a). This dataset was last updated in July 2012 (Doe, 2014). The high resolution DEM was obtained with a conventional camera mounted aboard an electrically powered, fixed-wing UAV. This UAV made several flights at an altitude of 114 metres over the study area in March 2014. Overlapping images were acquired from different angles, allowing for stereoscopic depth rendition. The resulting DEM has a spatial resolution of 0.5 metres and a vertical accuracy of 0.2 metres (Figure 2b).

(a) LiDAR

(b) UAV stereoscopic
Figure 2: Overlapping DEMs of different resolutions used in test case A (Lucerne).

4.2 B - North Carolina

The second test case was derived from the open spatial data set for North Carolina distributed with GRASS as sample data (https://grass.osgeo.org/download/sample-data/). This data set includes a 10 metre cell-side DEM representing relatively rugged terrain with carved valleys and sparse man-made features. A section of this DEM was cropped to be used as the high resolution input (Figure 3b). The original DEM was then converted to a lower spatial resolution with 60 metre side cells, to which spatially uncorrelated noise was added (Figure 3a).

5 RESULTS

To compare r.mblend with the Cover and Average methods, the Mosaic tool provided with the ArcGIS software was used. This tool is able to merge DEMs using both types of conventional methods. All results were then assessed with GRASS. A high pass filter was used for a first assessment of the merged DEMs produced by each of the three methods. A 5-by-5 cell filter was used in order to highlight zones of transition, e.g. sharp edges, walls and so forth. Figure 4 presents a detail of these results for the Lucerne test case. Immediately standing out is the artificial step introduced by the Cover and Average methods along the border between the two input DEMs. The step is not so marked with Average, but still present; contrariwise, on closer inspection a loss of detail is visible with this method, with various small transitions in the UAV DEM losing magnitude.


As for r.mblend, it shows no border at all, while preserving the fine detail in the high resolution DEM. In the North Carolina test case the artificial step introduced by the Cover and Average methods is also present, even if less marked (Figure 5). In this case the transitions within the larger cell areas stand out considerably more. Since this is rugged terrain, the 60 metre cells introduce relevant ridges and cliffs. It is also interesting to observe the effects of the Average method on the high resolution area, introducing the artificial ridges from the 60 metre DEM.

Taking the differences from the original to the output DEMs provides another point of assessment. Figure 6 shows together the differences from the blended result to the high resolution DEM and the differences from the result to the low resolution DEM in the areas not covered by the high resolution input. This analysis is not presented for the Cover method since it does not change the inputs. The differences between Average and r.mblend are striking, appearing in opposite areas. r.mblend applies changes only to the low resolution DEM, with a smooth transition surface; Average leaves the low resolution data untouched, while applying irregular and many times severe changes to the high resolution data. A similar pattern in the differences is evident in the North Carolina test (Figure 7). r.mblend again yields the smooth transition surface, applying changes solely to the low resolution input. As before, the Average method introduces broad changes to the area where both inputs overlap, in this case coinciding with the full extent of the high resolution DEM.

(a) Low resolution

(b) High resolution
Figure 3: Overlapping DEMs of different resolutions built from the North Carolina data set.

(a) Cover method

(b) Average method

(c) r.mblend
Figure 4: Results of high pass filter applied on merged DEMs in test case A.

(a) Cover method

(b) Average method

(c) r.mblend
Figure 5: Results of a high pass filter applied on merged rasters in test case B.

(a) Average method

(b) r.mblend
Figure 6: Differences from resulting blended DEM to inputs in test case A.



(a) Average method

(b) r.mblend
Figure 7: Differences from resulting blended DEM to inputs in test case B.

6 SUMMARY AND FUTURE WORK

This article compared the r.mblend GRASS add-on with conventional methods to merge overlapping DEMs of different spatial resolution. Using two different test cases, it was possible to assess its advantages on smooth and rugged terrain. r.mblend eliminates the steps introduced along the border of the areas where the merging inputs overlap; these steps are less marked in rugged terrain but still present. This smooth transition is not achieved at the expense of loss of detail, as the high resolution DEM is left untouched. This contrasts particularly with the Average method, which visibly degrades the information from the high resolution input. r.mblend presents itself as a clearly superior alternative to the conventional methods assessed.

Presently, r.mblend operates on a single execution thread. All operations conducted are relatively straightforward, except for the interpolation of the smoothing surface. For the Lucerne case study presented above, this operation may take in the order of dozens of minutes. However, it is possible to parallelise this operation, since there is no dependence between cells of the resulting surface. The GRASS Python API provides elementary tools for parallelisation, spawning GRASS commands as sub-processes. Therefore, an obvious evolution of r.mblend is to slice the interpolation area and run the interpolation independently on each slice. Other avenues of improvement also concern the smoothing surface interpolation. Alternative methods beyond IDW can be made available to the user, as well as their respective parameters. This would provide the user with further degrees of freedom to tune the module output. Finally, r.mblend can also be extended to automatically apply a high pass filter on the resulting DEM, providing it as a secondary output. This is a useful asset to assess the quality of the resulting DEM, either visually or in more elaborate analysis.

REFERENCES

de Smith, M. J., Goodchild, M. F., and Longley, P. A. (2015). Geospatial Analysis: A Comprehensive Guide to Principles, Techniques and Software Tools - Fifth Edition, chapter Geocomputational methods and modeling, pages 625–672. Winchelsea Press. Doe, R. (2014). GIS Kanton Luzern. https://rawi.lu. ch/themen/gis_kanton_luzern. Accessed: 30-102014. Eastman, J. (2012). IDRISI Selva. Clark University, MA, USA. Hickey, R. (2000). Slope angle and slope length solutions for GIS. Cartography, 29(1):1–8. Katzil, Y. and Doytsher, Y. (2005). Spatial rubber sheeting of dtms. In Proceedings of the 6th Geomatic Week Conference, Barcelona, Spain, volume 811. K¨ung, O., Strecha, C., Beyeler, A., Zufferey, J.-C., Floreano, D., Fua, P., and Gervaix, F. (2011). The accuracy of automatic photogrammetric techniques on ultralight uav imagery. In UAV-g 2011-Unmanned Aerial Vehicle in Geomatics, number EPFL-CONF-168806. ˇ (2016). ImLeit˜ao, J., Prodanovi´c, D., and Maksimovi´c, C. proving merge methods for grid-based digital elevation models. Computers & Geosciences, 88:115 – 131. Luedeling, E., Siebert, S., and Buerkert, A. (2007). Filling the voids in the srtm elevation modela tin-based delta


surface approach. ISPRS Journal of Photogrammetry and Remote Sensing, 62(4):283–294. Neteler, M., Bowman, M. H., Landa, M., and Metz, M. (2012). Grass gis: A multi-purpose open source gis. Environmental Modelling & Software, 31:124–130. Sanner, M. F. et al. (1999). Python: a programming language for software integration and development. J Mol Graph Model, 17(1):57–61.


Mapping and Monitoring Airports with Sentinel 1 and 2 Data
Urban Geospatial Mapping for the SCRAMJET Business Networking Tool

Nuno Duro Santos1, Gil Gonçalves2 and Pedro Coutinho3
1 Bluecover Technologies, Lisboa, Portugal
2 University of Coimbra & INESC-Coimbra, Coimbra, Portugal
3 WATERDOG mobile Lda., Porto, Portugal
[email protected], [email protected], [email protected]

Keywords: Earth Observation, Satellites, Open Data, SCRAMJET, Urban Mapping, Spatial Resolution.

Abstract: SCRAMJET is an online tool that allows business travellers to connect and plan to meet in any of the airports included in their trip. To successfully deliver, SCRAMJET needs accurate and up-to-date worldwide airport mapping information. This paper describes an assessment of the use of Earth Observation (EO) products, in particular the Sentinel programme, for improving airport mapping and monitoring its changes. The first step is to verify the data availability of Sentinel-1 and Sentinel-2 at a global scale, and then evaluate their adequacy for airport mapping. For monitoring airport changes, the analysis tested multispectral change detection methods and interferometry processing techniques. The main conclusion was that the acquisition frequency of both Sentinels is a great benefit to assure up-to-date information at a global scale. The recommended approach for a target of 200 airports is to perform the airport mapping assisted by Sentinel data for validation and improvements, and to monitor changes by integrating a Sentinel-2 change detection chain (using NIR/SWIR bands) in parallel with OpenStreetMap change detection processing.

1 INTRODUCTION

SCRAMJET is a web and mobile product to connect business travellers at airports that is being developed by WATERDOG at ESA BIC Portugal. One of the key assets of the tool is maintaining reliable, updated and accurate airport maps to ensure travellers can agree on a meeting point while planning their trip and, once physically at the airport, find each other. The maps comprise both indoor and outdoor features, including the buildings' morphology, gate identification and Points of Interest (shops, toilets, etc.), as depicted in Figure 1.

Figure 1: Outdoor and indoor mapping needs.

The typical usage scenarios of the maps are:
• The user knows his gate and the gate of the person to meet and uses the map to choose the meeting place, for example Gate 21 or a coffee shop POI;
• The user lands and a location-based tool running on his phone provides rough indoor guidance, visually identifying the place to go.

SCRAMJET will have its own airport map information and the research presented in this paper is crucial for two development and maintenance needs:
• Airport Mapping: the initial geographical information of all airports is obtained from OpenStreetMap (OSM) and Google Maps is used for validation. Nonetheless, many airports have incomplete or outdated mapping data on these platforms that needs to be validated and improved.
• Monitoring the Airport Changes: airports may be subject to works, renovations and extensions that need to be detected.

The available literature on automatic airport mapping and monitoring from remote sensing image data is very scarce. First, previous works on automatic airport mapping mainly focused on runway detection, as runways are the primary characteristic of an airport. Wang et al. (2013) used a Hough transform to judge whether an airport exists in a Very High Resolution (VHR) image. Then a scale invariant feature transform in conjunction with a hierarchical discriminant regression tree was employed to detect the airport area. Aytekin et al. (2013) used a texture-based runway detection algorithm that uses the Adaboost machine learning package for identifying 32x32 pixel image tiles as runway or non-runway. Second, for automatically monitoring the airport changes, Digital Change Detection algorithms that provide binary land cover “change/no-change” information can be used (Jensen, 2015). In fact, by automatically detecting the spatial regions within a bi-temporal image pair where meaningful change is likely to have occurred, a human operator (or another process) can then analyse the changes using his/her knowledge. ESA's Sentinel missions are providing reliable and timely open data on land, ocean and atmosphere with high spatial and temporal resolutions for state-of-the-art research activities and services, e.g., natural resources management and urban land cover mapping (Malenovský et al., 2012). In this context, synergetic use of Sentinel 1/2 data has been used for urban land cover mapping and change detection (Ban et al., 2017; Haas and Ban, 2017). Although the potential of Sentinel 1/2 data has been highlighted in the above works, the effective use of these data in the context of mapping and monitoring airports needs to be assessed. This study aims to confirm the needs and verify how Earth Observation satellites, in particular the latest Sentinel satellites, can be used to assure the best up-to-date outdoor mapping for an initial target of 200 world airports. The work assesses the temporal and spatial suitability of the Sentinels (or other EO data) and defines a service chain design for airport mapping and monitoring changes.


2 EO DATA AVAILABILITY

The first step of the analysis was to confirm the temporal availability of EO data at a global scale, by defining a timeframe for validation on a set of worldwide airports. The selected timeframe was the latest two months before the study started, from 1st December 2016 to 31st January 2017, and nine airports were chosen, representative of different geographical regions in the USA, Europe and Asia. During this period, two Sentinel-1 and one Sentinel-2 satellites were operational. Sentinel-1 (synthetic-aperture radar) was operating with the S1A and S1B satellites, while Sentinel-2 (multispectral) had only the S2A satellite active (S2B was launched only on 7 March 2017). The data procurement results of Sentinel-1 (S1) and Sentinel-2 (S2) for these airports are presented in Table 1.

Table 1: Sentinels data availability.

Region   Airport      Sentinel-2                    Sentinel-1
Europe   Lisbon       S2A 2016-12-19                2017-01-19 (S1A IW VV-VH)
Europe   München      None (dense cloud coverage)   2017-01-25 (S1A IW VV-VH)
Europe   Istanbul     S2A 2017-02-02                2017-01-14 (S1A IW VV-VH)
Europe   Malaga*      S2A 2016-12-20                2014-11-27 (S1A SM HH-HV)
USA      Atlanta      S2A 2016-11-28                2017-01-06 (S1A IW VV-VH)
USA      NYC/JFK      S2A 2016-12-04                2017-01-12 (S1A IW VV-VH)
USA      Miami        S2A 2017-01-06                2017-01-01 (S1A IW VV-VH)
Asia     Ben Gurion   S2A 2017-02-10                2017-01-04 (S1A IW VV-VH)
Asia     Abu Dhabi    S2A 2016-12-25                2017-01-07 (S1A IW VV)
Asia     Shanghai     S2A 2017-01-29                2017-01-22 (S1A IW VV-VH)

S2A has visible data from almost all airports, including the 4 relevant bands for this study with 10 m spatial resolution: B2, B3, B4 and B8.

Figure 2: S2A True colour composition.

S1A and S1B were also capturing data over all the airports, but using different modes. The main operational mode over land is Interferometric Wide (IW) High Resolution, typically using single or dual polarisation, with a spatial resolution up to 25 m. The best resolution mode is Stripmap (SM) Full Resolution, with a spatial resolution up to 10 m, which is used only on request, typically for extraordinary events such as emergency management. Both acquisition modes are available in the SLC product format, needed for interferometry applications, and in the GRD product format, which is geo-referenced and derived from SLC.


The acquisitions in IW mode were widely available for all nine airports selected. Malaga* was the only aerodrome found that had been acquired in Stripmap Full Resolution mode, and it was thus added to the baseline.

Figure 3: S1A IW VH-VV and SM HH-HV RGD compositions.

2.1 Satellite Open Data Availability

Considering the technical specifications (namely the spatial and temporal resolutions) required by SCRAMJET, two open satellite data products have been identified as the most useful: Sentinel-1 and Sentinel-2 data. Concerning the Sentinel-2 data, it was found that:
• The temporal frequency of Sentinel-2 is adequate. During the selected timeframe S2A captured on average one good quality image per month, and good images are available for 90% of the airports.
• S2B was launched on 7 March 2017 and will increase temporal availability.
• It may be difficult to capture images during the winter season at some airports (e.g. Munich, Atlanta) due to dense cloud coverage.
• Airports at the intersection of granules or tiles need special handling, such as JFK, which lies right at the intersection of 4 granules.
Regarding the Sentinel-1 data, it was found that:
• IW acquisitions are available for all 9 airports in dual polarisation VV-VH, except for Abu Dhabi, where acquisitions are done in single VV polarisation.
• Very few Stripmap (SM) Full Resolution images are available in the archive. The ones found were acquired over special zones, namely the Strait of Gibraltar and a region of Germany.
• Almost all acquisitions are available in both SLC and GRD product formats.
Other open satellite data, in particular Landsat 8, were dropped since the first results indicated that S2 has better spatial resolution.
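This kind of availability check can be scripted against the Copernicus Open Access Hub. The sketch below uses the sentinelsat package; the credentials, coordinates and thresholds are illustrative, and the hub endpoint reflects the situation at the time of the study.

```python
# Hedged sketch: querying Sentinel-2 availability over one airport for the study timeframe.
from sentinelsat import SentinelAPI

api = SentinelAPI("user", "password", "https://scihub.copernicus.eu/dhus")
lisbon_airport = "POINT(-9.13 38.77)"            # approximate WKT location, illustrative
products = api.query(
    lisbon_airport,
    date=("20161201", "20170131"),               # study timeframe
    platformname="Sentinel-2",
    cloudcoverpercentage=(0, 30),                # keep reasonably cloud-free acquisitions
)
print(len(products), "candidate Sentinel-2 products")
```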


3 AIRPORT MAPPING

The adequacy of the available data to meet the mapping requirements was assessed, focusing on the spatial and spectral resolution and on the quality of the OSM data. The initial analysis covered only the Sentinels but was later extended to analyse commercial solutions. The three airports (Lisbon, Istanbul, Abu Dhabi) used as the analysis baseline were thus extended with Malaga and Malaysia in order to address relevant data found for these areas.

3.1 Mapping with Sentinel-2

Three airports were selected for study from the initial nine: Lisbon, Istanbul and Abu Dhabi. The approach was to build RGB composites with the better resolution bands, layered with existing Points of Interest from OpenStreetMap. For the Lisbon airport, the S2A image was composed with OSM data (Fig. 4), resulting in the following findings:
• The visibility is slightly blurred. It is hard to identify planes and gates.
• Many gates are mapped in OpenStreetMap (20 "aeroway"=>"gate", 1 "aeroway"=>"helipad").
• An infrared composition in Lisbon during winter (when the grass is more intense) may be an advantage to identify the airport morphology.

Figure 4: Lisbon S2A true colour composition with OSM.

Regarding the Istanbul airport, the S2A image depicted in Fig. 5 highlights that:
• Although it has good visibility, additional support photos and maps need to be used for mapping.
• The gates identified by red polygons are not available on OSM.


Figure 5: Istanbul S2 true colour composition with OSM.

The Abu Dhabi airport (27th in Asia) was selected, not being as busy as Dubai International Airport (3rd in Asia). The S2A image composition in Figure 6 shows that:
• It has very good visibility: parked airplanes and the new gates under construction are visible.
• 41 gates are mapped in OSM (the new gates were not yet available in OSM).

Figure 6: Abu Dhabi S2 true colour composition with OSM.

3.2 Mapping with Sentinel-1

Two study areas were analysed: the Lisbon airport, using images acquired in the default Interferometric Wide High Resolution mode, and the Malaga airport, acquired in Stripmap Full Resolution mode. After performing the geo-corrections of both S1A GRD products, an RGB composite was produced with the two polarisation bands (refer to the Malaga RGB composite in Fig. 7).

Figure 7: Malaga S1A SM HH-HV RGB compositions.

The analysis concluded that:
• Both modes allow a good identification of building areas and runways. They allowed us to easily identify that Google Maps was showing an outdated image of Malaga airport, with a single runway, acquired before its expansion in June 2012.
• S1A Stripmap FR (Malaga) has a more appropriate spatial resolution than S1A IW HR (Lisbon).

3.3 Mapping with Non-Open EO Data

Considering that the Sentinels may not have enough spatial resolution for the needs, alternative commercial satellite data with better resolution were analysed. The project identified two very high-resolution solutions, from Pléiades (0.5 m) and Deimos-2 (1 m to 4 m), with competitive prices. An example of Langkawi Airport in Malaysia using Pléiades imagery from 2017 is provided in Figure 8.

Figure 8: Malaysia Pleiades true colour composition.

The analysis concluded that:
• Pansharpened images with 50 cm resolution and 4 bands offer excellent detail of the airport, allowing recognition of the plane types.
• Temporal acquisition is not as flexible as with the Sentinels at cost-effective prices.

3.4 Mapping Sources Analysis

The analysis highlighted that there is no single solution for all sites, as presented in Table 2.

Table 2: Analysis of the airport mapping sources.

Airport     S2                     S1                     Pléiades              OSM
Lisbon      Blurred                Low resolution (IW)    N/P                   20 gates
Istanbul    Good visibility        N/P                    N/P                   No gates
Abu Dhabi   Very good visibility   N/P                    N/P                   41 gates
Malaga      N/P                    Good resolution (SM)   N/P                   N/P
Malaysia    N/P                    N/P                    Very high resolution  N/P

The best and most relevant mapping sources depend on the particularities of each site. The conclusions per mapping source are presented below:
• Sentinel-2 images may be used to support morphology and gate visual mapping and validation. The spatial resolution may be just on the limit. Acquisitions with good visibility are fine for gates, but airplanes are hardly recognisable.
• Sentinel-1 can also support the identification of runways and built-up areas, but GRD IW High Resolution products are limited to a spatial resolution of about 25 m.
• Some cases may need commercial very high-resolution images.
• OSM does not offer a complete mapping solution in all the cases analysed. Not all airports have gates identified in OSM.
• Additional support photos and maps may be used for morphology and gates. Note that airport buildings do not have clear boundaries; they are often confused with surrounding buildings (hotels, etc.).
The mapping conclusion is that the acquisition frequency of the Sentinels is a great benefit and the solution shall definitely be based on a combination of different sources.

4 MONITORING THE AIRPORT CHANGES

The usage of change detection methods could be useful to flag airport morphology changes. The two typical binary land cover changes that we want to detect are:
• Urban to Demolition
• Bare Soil/Vacant Land/Demolition to Urban
In this context, two detection approaches were evaluated:
a) Change detection with Sentinel-2: detect abrupt changes using image pairs, before and after the event, and a reasonable number of pixels (between 9 = 3x3 and 25 = 5x5).
b) InSAR with Sentinel-1: use an InSAR technique to detect surface deformations by analysing the phase difference between two radar signals acquired over the same area at different times.
The usage of Advanced InSAR for the identification of subsidence hotspots at airports was kept on standby at this stage. Although it may resolve millimetre-scale movements of infrastructure, the usage of multiple images was considered to have high costs (storage and computation).

4.1 Study Area

The area selected for testing was the Rio de Janeiro airport, which was renewed for the 2016 Olympic Games. The works started in 2014 and finished in April 2016.

Figure 9: Google Maps historical data of Rio de Janeiro airport.

The gates were extended with a new area and more car parks were constructed.

4.2 Detecting Changes with Sentinel-2

The pair of S2 images selected for testing were the first cloud-free images available for this airport (Fig. 10): one image was acquired in 2015 during the renewal, and the other in 2016 after the renewal.


Figure 10: Sentinel-2 images used in change detection.

For change detection using a pair of images, three main categories of methods could be used:
• Simple Detection: use the Mean Difference, Ratio of Means or Root Mean Square Difference of the relevant bands (typically the visible and near infrared bands 2, 3, 4 and 8).
• Normalised index change detection: produce normalised indicators related to built-up areas (using S2 bands) and compare them. The most relevant index is the Normalized Difference Built-up Index (NDBI), applied to Landsat TM with the SWIR1 and NIR bands (Zha et al., 2003).
• Post Classification Comparison: make a supervised classification of the pair and compare the results (e.g. land cover comparison, built-up area comparison).

In this paper, only the first two categories were analysed and are presented hereafter. Post Classification was abandoned since it was considered more relevant at global and regional scales (world, country, region) than at local scales such as airport gate details.

Ratio of Means Detection with the NIR Band
This analysis started with simple detectors. The ratio of means was first applied to the NIR band (B8) of Sentinel-2:

    R(i, j) = NIR_2015(i, j) / NIR_2016(i, j)

where NIR_year(i, j) denotes the local mean of band B8 in a small window centred on pixel (i, j). The results achieved were quite acceptable, allowing easy identification of the new gate area and of the reconstructed car park that had not initially been identified during the Google Maps inspection. Although this detector was successfully applied to the Rio airport study area, it needs to be bounded and normalised to be applied widely to other airports:

    D(i, j) = 1 - min( NIR_2015(i, j) / NIR_2016(i, j), NIR_2016(i, j) / NIR_2015(i, j) )

Figure 11: Change detection with NIR.

Ratio of Means Detection with the SWIR Band
The second approach was the ratio of means using the SWIR band (B11) of Sentinel-2. Although this band has a lower spatial resolution, the results achieved are also quite acceptable.

Figure 12: Change detection with SWIR.
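The bounded ratio-of-means detector defined above can be sketched as follows for a pair of co-registered single-band rasters; the band file names, window size and decision threshold are illustrative assumptions, not values from the paper.

```python
# Hedged sketch of the bounded ratio-of-means change detector on two co-registered
# Sentinel-2 band rasters (e.g. B8). File names, window size and threshold are illustrative.
import numpy as np
import rasterio
from scipy.ndimage import uniform_filter

def local_mean(path, size=3):
    with rasterio.open(path) as src:
        band = src.read(1).astype("float32")
    return uniform_filter(band, size=size)        # mean over a size x size window

m2015 = local_mean("rio_B8_2015.tif")
m2016 = local_mean("rio_B8_2016.tif")

eps = 1e-6                                        # avoid division by zero
ratio = (m2015 + eps) / (m2016 + eps)
change = 1.0 - np.minimum(ratio, 1.0 / ratio)     # bounded in [0, 1); 0 means no change
changed_pixels = change > 0.3                     # illustrative decision threshold
```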

Note that this detector also needs an improvement in order to be bounded and normalised.

Root Mean Square Differences Detection with 4 Bands
The third detector was the Root Mean Square Difference computed with the visible and near infrared bands (B2, B3, B4 and B8). Because the results obtained were unclear, it was dropped.

NDBI Index Detection
The last multispectral detector was the Normalized Difference Built-up Index (NDBI), which is referred to in the change detection literature as a promising method (Jensen, 2015; Zha et al., 2003). For its usage with Sentinel-2 imagery, the S2 SWIR and NIR bands were used as follows:

    NDBI(i, j) = (B11(i, j) - B8(i, j)) / (B11(i, j) + B8(i, j))

Nonetheless, the change detection with NDBI 2015 and NDBI 2016 produced confusing results.

Figure 13: Change detection result with NDBI 2015-2016.

The change detection conclusion was that simple detectors with the NIR and SWIR bands could solve the problem in this study area. The usage of these bands shall be further verified and confirmed on other airports. The usage of spectral unmixing techniques at pixel level, where there are significant changes in land cover, could be an alternative approach for a future analysis to fine-tune the detections.

4.3    Interferometry Processing with Sentinel-1

For the analysis of the interferometry processing, a pair of Sentinel-1 images from 2015 and 2016 were used. Both images were acquired in IW mode with dual polarization VV-VH (Fig. 14).

Figure 14: IW product pair used in change detection.

The GRD products were used to produce an RGB colour composite from the VH and VV polarization images. The composite allowed the extended gates in 2016 to be identified (Fig. 15).

Figure 15: RGB composition of S1A IW VV-VH pair.

The SLC products were used for the interferometry processing. The pair of products was captured in IW mode with three sub-swaths (IW1, IW2 and IW3) using Terrain Observation with Progressive Scans SAR (TOPSAR). The SNAP (Sentinel Application Platform) tool was used, following the TOPS Interferometry Tutorial (Veci, 2015). The co-registration of the relevant IW1 sub-swath was performed, the interferogram was produced, the topographic phase was removed and phase filtering was applied. The interferogram results after the ellipsoid correction are shown in the figure below.

Figure 16: Coherence and phase interferogram 2015-2016.

The results were not effective. Although the interferogram requires a detailed interpretation, this preliminary analysis did not spot relevant changes in the coherence and phase interferograms. Additionally, the spatial resolution of the IW acquisitions may not be sufficient for the monitoring cases.
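The interferometric chain described above can also be scripted. The sketch below uses the ESA SNAP Python bindings (snappy) and follows the operator sequence of the TOPS tutorial (Veci, 2015); it is only an outline, and the operator parameters (sub-swath, polarisation, product names) are assumptions that should be checked against the installed SNAP version.

from snappy import GPF, ProductIO, HashMap

def op(name, source, **params):
    # Thin wrapper around a SNAP operator call (older SNAP versions may
    # require loading the operator SPI registry first).
    p = HashMap()
    for k, v in params.items():
        p.put(k, v)
    return GPF.createProduct(name, p, source)

def tops_interferogram(slc_2015, slc_2016, subswath="IW1", pol="VV"):
    master = ProductIO.readProduct(slc_2015)
    slave = ProductIO.readProduct(slc_2016)
    # Select the relevant sub-swath and polarisation, then update the orbits.
    master = op("TOPSAR-Split", master, subswath=subswath, selectedPolarisations=pol)
    slave = op("TOPSAR-Split", slave, subswath=subswath, selectedPolarisations=pol)
    master = op("Apply-Orbit-File", master)
    slave = op("Apply-Orbit-File", slave)
    # Co-register the pair, form the interferogram, deburst, remove the
    # topographic phase and filter the phase, as described above.
    stack = op("Back-Geocoding", [master, slave])
    ifg = op("Interferogram", stack)
    ifg = op("TOPSAR-Deburst", ifg)
    ifg = op("TopoPhaseRemoval", ifg)
    return op("GoldsteinPhaseFiltering", ifg)

# Hypothetical product names:
# ProductIO.writeProduct(tops_interferogram("S1A_2015.zip", "S1A_2016.zip"),
#                        "ifg_2015_2016", "BEAM-DIMAP")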

5    CONCLUSIONS

The study confirmed that much of the OpenStreetMap and Google Maps information on the airports is incomplete or outdated. Although the Sentinels lack spatial resolution, they can be an asset to validate and trigger mapping improvements. The acquisition frequency of both Sentinels is considered a great benefit to ensure up-to-date information at a global scale.


The recommended SCRAMJET approach for a target of 200 airports is to do mapping assisted by the Sentinels and possibly other commercial EO data, and to monitor changes using a Sentinel-2 semi-automatic change detection method.

Mapping
The mapping solution shall be based on a combination of multiple sources, including OpenStreetMap, the Sentinels, Google Maps, local photos and other commercial EO data. The system shall extract the relevant OSM data to create the initial mapping information. An extended EO chain (Figure 17) is recommended, with automatic data acquisition and pre-processing of the Sentinels data. The Sentinels data shall be used to validate the mapping and to trigger improvements based on visual inspection of the Sentinels and other complementary sources.

Figure 17: Mapping solution with EO extended chain.

Monitoring Changes
The foreseen solution is also to integrate the extended EO change detection with the OSM change detection, checking for changes every 3 months as depicted in Figure 18. Semi-automatic change detection with Sentinel-2 is suggested, taking advantage of its update frequency. Implementing automatic detections to generate alerts is technically feasible, but a visual inspection will be required to confirm them and trigger the updates. The change detection algorithm needs to select cloud-free images, normalize the processing and finally be fine-tuned over a wide number of airports to become fully automated. The automation shall consider the costs of creating EO baselines (storage) and of processing EO images (computing). The changes detected with S2 are real and faster but will probably include many false positives, while the changes detected from OSM are more accurate but more delayed.

Figure 18: Monitor changes extended with S2 detections.

An initial automated proof-of-concept to validate the study conclusions is recommended as a next step. A pilot with 3-4 airports shall start by automating data acquisition and pre-processing for mapping purposes. The change detection processing chain with the NIR and SWIR bands shall be further analysed with alternative approaches and automated afterwards, in order to start collecting results and to fine-tune the algorithm.

Figure 19.

ACKNOWLEDGEMENTS

This work was developed by BLUECOVER for WATERDOG under the ESA BIC Portugal contract. The work of Gil Gonçalves at INESC-Coimbra was supported by the Foundation for Science and Technology (FCT) of Portugal under the project grant UID/MULTI/00308/2013. A special thanks to WATERDOG and ESA BIC Portugal for the publishing authorization.

REFERENCES

Aytekin, Ö., Zongur, U., Halici, U., 2013. Texture-based airport runway detection. IEEE Geosci. Remote Sens. Lett. 10, 471–475. doi:10.1109/LGRS.2012.2210189
Ban, Y., Webber, L., Gamba, P., Paganini, M., 2017. EO4Urban: Sentinel-1A SAR and Sentinel-2A MSI data for global urban services. 2017 Jt. Urban Remote Sens. Event, JURSE 2017, 0–3. doi:10.1109/JURSE.2017.7924550
Haas, J., Ban, Y., 2017. Sentinel-1A SAR and Sentinel-2A MSI data fusion for urban ecosystem service mapping. Remote Sens. Appl. Soc. Environ. 8, 41–53. doi:10.1016/j.rsase.2017.07.006
Jensen, J.R., 2015. Introductory Digital Image Processing: A Remote Sensing Perspective, 4th ed. Prentice Hall Press, Upper Saddle River, NJ, USA.
Malenovský, Z., Rott, H., Cihlar, J., Schaepman, M.E., García-Santos, G., Fernandes, R., Berger, M., 2012. Sentinels for science: Potential of Sentinel-1, -2, and -3 missions for scientific observations of ocean, cryosphere, and land. Remote Sens. Environ. 120, 91–101. doi:10.1016/j.rse.2011.09.026
Veci, L., 2015. SENTINEL-1 Toolbox SAR Basics Tutorial. ESA, 1–20.
Wang, X., Lv, Q., Wang, B., Zhang, L., 2013. Airport detection in remote sensing images: A method based on saliency map. Cogn. Neurodyn. 7, 143–154. doi:10.1007/s11571-012-9223-z
Zha, Y., Gao, J., Ni, S., 2003. Use of normalized difference built-up index in automatically mapping urban areas from TM imagery. Int. J. Remote Sens. 24, 583–594. doi:10.1080/01431160304987


Outdoors Mobile Augmented Reality Application Visualizing 3D Reconstructed Historical Monuments

Chris Panou², Lemonia Ragia¹, Despoina Dimelli¹ and Katerina Mania²

¹School of Architectural Engineering, Technical University of Crete, Kounoupidiana, Chania, Greece
²Department of Electrical and Computer Engineering, Technical University of Crete, Kounoupidiana, Chania, Greece
[email protected], [email protected], [email protected], [email protected]

Keywords: Augmented Reality, 3D Reconstruction, Cultural Heritage, Computer Graphics.

Abstract: We present a mobile Augmented Reality (AR) tourist guide to be utilized while walking around cultural heritage sites located in the Old Town of the city of Chania, Crete, Greece. Instead of the traditional static images or text presented by mobile, location-aware tourist guides, the main focus is to seamlessly and transparently superimpose geo-located 3D reconstructions of historical buildings, in their past state, onto the real world, while users hold their consumer-grade mobile phones walking on-site, without markers placed onto the buildings, offering a Mobile Augmented Reality experience. We feature three monuments: the 'Giali Tzamisi', an Ottoman mosque; part of the south side of a Byzantine wall; and the 'Saint Rocco' Venetian chapel. Advances in mobile technology have brought AR to the public by utilizing the camera, GPS and inertial sensors present in modern smart phones. Technical challenges, such as accurate registration of 3D content with the real world in outdoor settings, have prevented AR from becoming mainstream. We tested commercial AR frameworks and built a mobile AR app which offers users, while visiting these monuments in the challenging outdoor environment, a virtual reconstruction displaying the monument in its past state superimposed onto the real world. Position tracking is based on the mobile phone's GPS and inertial sensors. The users explore interest areas and unlock historical information, earning points. By combining AR technologies with location-aware, gamified and social aspects, we enhance interaction with cultural heritage sites.

1    INTRODUCTION

AR is the act of superimposing digital artefacts on real environments. In contrast to Virtual Reality, where the user is immersed in a completely synthetic environment, AR aims to digitally complement reality (Azuma et al., 2001; Zhou et al., 2008). In comparison to older systems that used a combination of cumbersome hardware and software, recent advances in mobile technology have led to an integrated platform including GPS functionality, ideal for the development of AR experiences, often referred to as Mobile AR (MAR). Applications in the fields of medical visualization, maintenance/repair, annotation, robot path planning and entertainment enrich the world with information the users cannot directly detect with their own senses (Nagakura et al., 2014; Niedmermair et al., 2011; Kutter et al., 2008; Ragia et al., 2015). In this paper, we present the design and implementation of a MAR digital guide application for Android devices that provides on-site 3D

visualisation and reconstructions of historical buildings in the Old Town of Chania, Crete, Greece. 3D imagery of how archaeological sites existed in the past is superimposed over their real-world equivalents, as part of a smart AR tourist guide. Instead of the traditional static images or text presented by mobile, location-aware tourist guides, we aim to enrich the sightseeing experience by providing 3D imagery visualising the past glory of these sites in the context of their real surroundings, seamlessly, without markers placed onto the buildings, offering a MAR personalized, gamified experience showcasing the city's cultural wealth. The mobile AR application features a multimedia database that holds records of various monuments. The database also stores the users' documentation of their visits and interactions in the areas of interest. User requirements gathering and AR development in the challenging outdoor environment of a city pose significant technical as well as user interaction challenges. Reliable position and pose tracking is paramount so that 3D content is

59 Panou, C., Ragia, L., Dimelli, D. and Mania, K. Outdoors Mobile Augmented Reality Application Visualizing 3D Reconstructed Historical Monuments. In Proceedings of the 4th International Conference on Geographical Information Systems Theory, Applications and Management (GISTAM 2018), pages 59-67 ISBN: 978-989-758-294-3 Copyright © 2018 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved


accurately superimposed on real settings at the exact position required; this is one of the major technical problems of AR technologies. Our system features a geo-location and sensor approach which, compared to optical tracking techniques, allows for free user movement throughout the site, independent of changes in the building's structure. The proposed MAR application provides an easily extendable platform for future additions of digital content, requiring a minimal amount of development and technical expertise. The goal is to provide a complete and operational AR experience to the end-user by tackling AR technical challenges efficiently, as well as offering insight for future development in similar scenarios.

1.1    Motivation

Since the Neolithic era, the city of Chania has faced many conquerors and the influences of many civilizations through time. Byzantine, Arabic, Venetian and Ottoman characteristics are evident around the cultural center of the town, clustered towards the old Venetian Harbor. In order to provide the tourist with a view of the past, based on cutting-edge AR technologies, we designed a historical route throughout the city consisting of a selection of historical buildings, to be digitally reconstructed and presented in their past state through an AR paradigm. The final selection includes three monuments that represent key historical periods of the Town of Chania (Figure 1). The Glass Mosque is located in the Venetian Harbor of Chania; it is the first mosque built in Crete and the only one surviving in the city, dating from the second half of the 17th Century. The mosque is a jewel of Islamic art in the Renaissance and featured a small but picturesque minaret, demolished in 1920 or in 1939. The Saint Rocco temple is a Venetian chapel that consists of two different forms of vaulted roof aisles. Although the southernmost part is preserved in good condition, the northern and oldest one has had its exterior painted over, covering its stony façade, while a residential structure is built on top. The Byzantine wall was built over the old fortifications of the Chydonia settlement around the 6th and 7th century AD. Its outline is irregular, with a longitudinal axis from the East to the West, where its two central gates were located. The wall consists of rectilinear parts, interrupted by small oblong or polygonal towers, many of which are now partly or completely demolished. The scope of this work is to virtually restore partially or fully damaged buildings and structures on


historic sites and enable visitors to see them integrated with their real environment, while using a sophisticated AR mobile tourist guide. We aim to deliver geo-located information to the users as well as calculate accurate registration positioning between the real-world monument and 3D digitisations, while users document their visits. By integrating digital maps and a location-aware experience we aim to urge the users to further investigate interest areas in the city of Chania, Crete, and uncover their underlying history by exploiting cutting-edge AR mobile technologies.

Figure 1: The Glass Mosque (left), the Saint Rocco Temple (right), the Byzantine wall (middle).

1.2    Previous Work

AR has been utilized for a number of applications in cultural heritage. One of the initial MAR systems provided on-site help and AR reconstructions of the ruins of ancient Olympia, Greece (Vlahakis et al., 2001). The system utilized a compass and a DGPS receiver and, combined with live view images from a webcam, obtained the user's location and orientation. Visitors carried a heavy backpack computer and wore a see-through Head Mounted Display (HMD) to display the digital content. The system was a cumbersome MAR unit not acceptable by today's standards. MARCH (Choudary et al., 2009) was a MAR application developed in Symbian C++, running on a Nokia N95. The system made use of the phone's camera to detect images of cave engravings and overlay them with an image indicating the ancient drawings. Although this was the first attempt at a real-time MAR application without the use of grey-scale markers, the system still needed the placement of coloured patches at the corners of 2D images captured in caves, and the experience was not tested in cave environments. With the advent of mobile devices, more sophisticated AR experiences are made possible such


as the one for the Bergen-Belsen memorial site (Pacheco et al., 2014), a former WWII concentration camp in northern Germany which was burned down after its liberation. The application integrated database interaction, reconstruction modelling and content presentation in a hand-held device. Real-time tracking was performed with the device's GPS and orientation sensors, and navigation was conducted either via the map or the camera. The system superimposed the reconstructed building models on the phone's camera feed. Focusing more on the promotion of cultural heritage in outdoor settings, VisAge (Julier et al., 2016) was an application aiming to turn users into authors of stories and cultural histories in urban environments. The system featured an online portal where users could create their stories using routes through physical space. A story is a set of spatially distributed POIs (Points of Interest). Each POI has its own digital content consisting of images, text or audio. A viewing tool was developed for mobile tablets in Unity using Vuforia's tracking library to overlay the digital content on the real space. The users could follow routes in the city and experience new stories. Tracking was performed using feature detection algorithms on the camera's feed. As with any optical approach, content delivery is not guaranteed due to the lighting variations of the outdoor setting. Further work in 3D reconstructions was shown in CityViewAR (Lee et al., 2012), a mobile outdoor AR application that was developed to allow people to explore destroyed buildings after the major earthquakes in Christchurch, New Zealand. Besides providing stories and pictures of the buildings, the main feature of the application is the ability to visualize 3D models of the buildings in AR, displayed on a map. Finally, a practical solution presented in (Hable et al., 2012) targeted guiding groups of visitors in noisy indoor environments, based on design decisions such as analogue audio transmission and reliably trackable AR markers. However, preparation of the environment with fiducials is time consuming and the supervision of the visits by experts is necessary to avoid accidents and interference with the working environment.

2    METHODOLOGY

We present a MAR application that, besides offering geo-localised textual information concerning cultural sites to a visitor, also superimposes location-aware 3D reconstructions of historical buildings, positioned exactly where the real-world monuments are located,

displayed on visitors' Android mobile phones. The geo-location approach used for real-time tracking is based on sensors available to both high- and low-end mobile phones, eliminating hardware restrictions and allowing for easy integration of added historical buildings. It also offers the opportunity to visualize historical sites in non-intrusive ways, without placing markers or patches on their walls, and it introduces gamified elements of cultural exploration. The adopted geo-location approach employs the GPS and the inertial sensors of the device so that, when the specific location of the actual monument is registered, the 3D reconstruction of it is displayed. Locations containing latitude and longitude information are received from the GPS, while the accelerometer and geo-magnetic sensors are used to estimate the device pose in the earth's frame. A visual reconstruction is then matched to the user's position and viewing angle, displaying the overlaid models on the mobile phone's screen. This implementation offers the most reliable registration of 3D content accurately superimposed on the real-world site, demanding fewer actions from the users and, therefore, ensuring a robust and intuitive experience (Střelák et al., 2016). The preparation of the 3D models required the acquisition of historical information and their accurate depiction in scale with the real world. Due to the lack of accurate plots and outlines of the buildings, Lidar and DSM data were exported from Open Street Maps and used to create the final models. Designing for a mobile device means that limited processing power and the requirements of the AR technologies need to be taken into account. Complex geometries can impair performance, so a low-poly, high-resolution texture approach was adopted in order to avoid frame rate drops. The final models are then processed through Google Sketch-up to georeference and position them onto the real world while looking at the screen of a mobile phone. Digital maps and an AR camera displaying the interest areas were integrated to assist in navigation through the geo-located content. The client-server architecture ensures that personalised experiences are provided by storing information about user visits and progress. Changes in the server can be conducted without interfering with the mobile application, allowing for an easily extendable platform where new monuments could be added as visiting areas. The monuments' information and assets are stored in a database and delivered to the mobile application on a location-request basis. The application was developed for Android in Java, employing the Wikitude JavaScript API for the AR views. It


features a local database based on SQLite caching the downloaded content.

2.1    3D Modeling and Texturing

In order to record the past state of the selected monuments, old photographs, historical information and estimates from experts were utilized. The 3D models visualizing their past state are presented in real size, superimposed over the real-world monument, and must be in proportion with their surroundings. Therefore, accurate measurements of their structure are necessary. Due to the lack of schematics and plots, we relied on data derived from online mapping repositories which provide outlines and height. The outlines of the three monuments were acquired from Open Street Map (OSM). By selecting specific areas of the monuments on the map, we can then export an .osm file that contains the available information concerning that area, including building outlines and height, where available. This file is essentially an xml file including OSM raw data about roads, nodes, tags etc. The file is then imported into OSM2World, a Java application whose aim is to produce a 3D scene from the underlying input data. The modelling process was focused on preserving a low vertex count, as complex geometry compromises interactive frame rate on systems of low processing power such as mobile phones. Complete 3D reconstructions of the selected monuments were created so that they completely overlap the real ones based on AR viewing. The most important aspect of this strategy is to keep the reconstructions in proportion. The final scale and size are defined in Google Sketch-up, while the final model is positioned in the world coordinate system. The Byzantine wall, being the most abstracted, resulted in 422 vertices. The Glass Mosque counts 7,614 and the Saint Rocco temple 4,919 vertices. Images captured from the real monuments on site were used as references. A diverse range of materials are assigned to different parts of the buildings. Due to the lack of information, the actual texture of the reconstructed parts is unknown, so the aim is to more accurately represent the compositing material rather than the actual surface. The AR framework supports only a power-of-two .png or .jpeg single-material texture map. That means that bumps, normal maps and multi-textures are not included. The materials that compose the entire texture set are baked into one image that will serve as the final texture. UV mapping is the process of unwrapping the 3D shape of the model into a 2D map. This map contains the coordinates of each vertex of the model placed on an image. Taking into


account that the monuments will be displayed on a mobile phone screen in real size, high resolution textures were required. This raises the final size of the texture files, however, the process results in a high quality visual result (2048x2048 pixels).

2.2    Geo-Positioning

In order for the reconstructions to be accurately displayed, combined with real-time viewing of the real world, an initial transformation and rotation is applied. The models were exported in .dae format and imported into Google Sketch-up.

Figure 2: Final 3D models.

The area of the monuments provided by Google Maps was projected on a ground plane. The monument is then positioned on its counterpart on the map. Given that the proportions of the monuments are in line, the final model is scaled to fit on the outlines. The location of the monument is then added to the file and provided to the framework. In order to include the model in the AR framework, we export it to .fbx and use the provided 3D encoder to make a packaged version of the file together with the textures in the custom wt3 format. This file is then channelled to the MAR app.

3    AR SOLUTIONS

The most significant technical challenge of AR systems is the registration issue. The virtual objects and the real ones must be appropriately aligned in order to maintain the illusion of integrated presence. Registration in an AR system relies on the accurate computation of the user's position and pose as well as the location of the virtual objects. In MAR, tracking is conducted either with sensor-based implementations or with computer vision techniques. While computer vision approaches seem to provide pixel-perfect registration, anything that compromises the visibility between the user and the augmented area


can cause the virtual scene to jitter or collapse. In addition to this, adequate preparation of the environment needs to take place. While sensor approaches are more robust, limits in acquiring reliable data lead to low accuracy. The work proposed focuses on providing a complete MAR experience to the end-user. Commercial AR frameworks were employed, initially evaluating both optical and sensor-based approaches.

3.1    Optical Implementation

In computer vision implementations, the live feed from the camera is processed in real-time to identify points in space, known a priori to the system and estimate the user’s position in the AR scene. These tracking methods require that scene images contain natural or intentionally placed features (fiducials) whose positions are known. Since we don’t want to interfere with the monuments, placing fiducial markers in the scenes was out of the question so we relied on pictures and the natural features of the scenes. A basic application was developed in Unity employing the Vuforia plug-in to test the registration. The image recognition methods use “trackable targets” to be recognized in the real environment. The targets are images processed through natural feature detection algorithms to produce 2D point clouds of the detected features to later be identified by the mobile device in the camera’s feed. When detected, pose and position estimations are available relative to the surfaces they correspond to. The denser the point clouds are, the better the estimate. This means that the initial images need to contain a large amount of detectable features. For real world buildings, these features most often represent window and door corners and intense changes in the texture, which do not provide a large enough number of features for the algorithm to track. Taking into account the lighting variations of outdoors, the sets of features provided to the system differed greatly from the actual scene and pose tracking was not achievable. Outdoors environments present very challenging conditions to such implementations. Building façades provided a limited amount of features and together with variations in lighting conditions, such a system would need a huge amount of images and training to reliably track the outdoor scene. Taking also into account the cramped environment of a touristic site, image recognition was not a realistic choice.

3.2    Sensor Implementation

Sensor approaches use long-range sensors and

trackers that report the locations of the user and the surrounding objects in the environment. These tracking techniques do not require any preparation of the environment and their implementation relies on cheap sensors present in every modern smart phone. They require fewer actions by the users and ensure that the AR experience will be delivered independent of external conditions. The AR system is aware of the distance to the virtual objects, because that model is built into the system. The AR system may not know where the real objects are placed in the environment and relies on a "sensed" view with no feedback on how closely the two align. Specifically in MAR, modern devices are equipped with a variety of sensors able to provide position and pose estimations. In outdoor settings, the Assisted GPS allows for position tracking with an accuracy of up to 3 meters and, for pose estimation, the system combines data from the accelerometer and geo-magnetic sensors. The accelerometer provides the orientation relative to the centre of the Earth and the geo-magnetic sensor relative to the North. By combining this information, an estimate of the device's orientation in the world coordinate system is provided. For this approach, we used the Wikitude JavaScript API in combination with our own location strategy. The Wikitude JavaScript API was selected due to its robust results, licensing options, big community and customer service. The implementation includes the Android Location API and the Google Play Services Location API. For the latter, the application creates a request, specifying location frequency and battery consumption priority. After it is sent out, location events are fired including information about latitude, longitude, altitude, accuracy and bearing. These are then provided to the AR activities to transform and scale the corresponding digital content on the screen.
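To make the sensor-fusion step concrete, the following minimal Python sketch reproduces the kind of computation performed on the device (mirroring the approach of Android's SensorManager.getRotationMatrix and getOrientation): the accelerometer supplies the up direction, the magnetometer supplies North, and their cross products yield azimuth, pitch and roll. The sensor readings in the example are made-up values, not measurements from the application.

import numpy as np

def orientation_from_sensors(gravity, geomagnetic):
    # Estimate azimuth, pitch and roll (radians) from device-frame
    # accelerometer (m/s^2) and magnetometer (uT) vectors.
    a = np.asarray(gravity, dtype=float)
    e = np.asarray(geomagnetic, dtype=float)
    h = np.cross(e, a); h /= np.linalg.norm(h)   # points roughly East
    a = a / np.linalg.norm(a)                    # points up
    m = np.cross(a, h)                           # points roughly North
    r = np.vstack([h, m, a])                     # device-to-world rotation
    azimuth = np.arctan2(r[0, 1], r[1, 1])       # heading relative to North
    pitch = np.arcsin(-r[2, 1])
    roll = np.arctan2(-r[2, 0], r[2, 2])
    return azimuth, pitch, roll

# Device lying flat with its top edge pointing North: azimuth is ~0.
print(orientation_from_sensors([0.0, 0.0, 9.81], [0.0, 22.0, -40.0]))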

4    SYSTEM ARCHITECTURE

4.1    Client-Server Implementation

One of the main challenges we faced in designing the proposed MAR application was the lack of established guidelines for the application and integration of AR technologies in outdoor heritage sites. The overall design of the system consists of two main parts: the mobile client and the server. The server facilitates a database developed in MySQL which holds the records about the monuments (name, description, latitude, longitude etc.) and user-specific information. The information is delivered to the mobile unit on a location-request basis. The database


is exposed to the users via a RESTful web service. A basic registration to the system is required. The functions provided include storing data about visits, marked places and the overall progress, visualizing all sites in a specific area.

4.2    Mobile AR Application

The mobile application's architecture has been designed to be extendable and respond easily to changes in the underlying model of the server. It is based on three main layers. The Views layer is where the interactions with the users take place. Together with the background location service they act as the main input points to the system. The events that take place are forwarded to the Handling layer, which consists of two modules following the singleton pattern. The data handler is responsible for interaction with the local content and communicating with the views, while the Rest Client is responsible for communicating with the Web Service. The Model layer consists of basic helper modules to parse the obtained JSON files and interact with the local database. The actions flow from the Views layer to the lower components. Responding to a user event or a location update, a call is made to the Handling layer which will access the model to return the requested data. The Handling layer is the most important of the three; all interactions, exchange of information and synchronization pass through this layer. The Rest Client provides an interface for receiving and sending information to the remote database as requested by the other layers. It is responsible for sending data about user visits, saved places, updates in progress and personal information. It also provides functions for receiving data from the Server concerning Points of Interest (POI) information and images. It also allows for synchronizing and queuing the requests. While the Rest Client is responsible for the interactions with the server, the Data Handler manages the communication of the local content. Information received is parsed and stored in the local DB. The handler responds to events from the views and background service and handles the business logic for the other components. It forms and serves the available information based on the state of the application. The views are basic user-interface components facilitating the possible interactions with the users. The Map View is a fragment containing a 2D map developed with the Google Maps API. It displays the user's location, as obtained by the background service, and the POIs as markers on the map. The AR view is based on the Wikitude JavaScript API and is


where the AR experiences take place. It is a Web view with a transparent background overlaid on top of a camera surface. It displays the 3D reconstructions while it receives location updates from the background service and orientation updates from the underlying sensor implementation. It also contains a navigation view, where the POIs are displayed as labels on the real world. Interactivity is handled in JavaScript and it is independent from the native code. The View pagers are framework-specific UI elements that display lists of the POIs, details for each POI, user leader-boards and user profiles. Finally, the Notification View is used when the application is in the background and aims to provide control over the location service. It is a permanent notification on the system tray where the user can change all preferences of the location strategy and start, stop or pause the service at will. The aim of the standalone background service is to allow users to roam freely in the city while receiving notifications about nearby POIs. It is responsible for supplying the locations obtained by the GPS to the Map View and AR Views. The location Provider obtains the locations and offers the option to swap between the Google Play Services API and the Android Location API, i.e., two location strategies. In order to offer control over battery life and data usage, the users can customize its frequency settings from the preferences. The views requesting location updates are registered as Listeners to the Location Service and receive locations containing latitude, longitude, altitude, accuracy information etc. The Location Event Controller serves the location events to the registered views. The user's location is continuously compared to that of the available POIs and, if the corresponding distance is in an acceptable range, the user can interact with the POI. The events sent include entering and leaving the active area of a POI. If the application is in the background, a notification is issued leading to the AR Views and Map Views. The Model layer consists of standard storing units and handlers to enable parsing JSON files obtained from the server and interfaces to interact with the local DB. It stores the historical information concerning the current locale, user-specific information and additional variables needed to ensure the optimal flow of the application. The local assets, including the 3D models and the html and JavaScript files required by the Wikitude API, are stored in this layer and provided to the Handling layer as requested. The SQLite Helper is the component responsible for updating the local storage and offers an interface to the Data Handler containing all available interactions.
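The proximity test performed by the Location Event Controller reduces to a great-circle distance between the user and each POI. A minimal sketch of that check is given below; the 30 m interaction radius and the POI coordinates are illustrative values, not the ones used by the application.

from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_M = 6371000.0

def haversine_m(lat1, lon1, lat2, lon2):
    # Great-circle distance in metres between two WGS84 lat/lon points.
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi, dlmb = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

def active_pois(user_lat, user_lon, pois, radius_m=30.0):
    # Return the POIs whose active area contains the user's position.
    return [p for p in pois
            if haversine_m(user_lat, user_lon, p["lat"], p["lon"]) <= radius_m]

# Illustrative POI records; the real ones are served by the MySQL database.
pois = [{"name": "Glass Mosque", "lat": 35.5186, "lon": 24.0180},
        {"name": "Saint Rocco", "lat": 35.5179, "lon": 24.0208}]
print(active_pois(35.5184, 24.0182, pois))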


5    MAR USER INTERFACE

The main technical challenge was to visualize, on-site, the 3D reconstructions of monuments displayed on a standard mobile phone, merged with the real world, offering a location-aware experience. In this section, we present the final user interface of the application and the flow of the experience. Upon activation of the application, the user is welcomed by a splash screen and is requested to create an account or log in with an existing one. After the login process, the application checks the location and requests the POIs and the historical information from the server while it transfers to the main screen. The map activity forms the main screen of our application and facilitates the core of the functionality, as shown in Figure 4. POIs are displayed on the map in their corresponding geo-locations. By clicking on a marker, the user can see the info window of the POI containing information, a thumbnail and the distance between POIs.

By interacting with a bar at the bottom of the screen, the user navigates the app; its entries include the user profile and preferences, the leader-boards and the library, while a round button placed in the middle is used to swap between map and camera navigation. The Camera View displays the POIs over the real-world view as 2D labels, which contain basic information, and also shows them as dots on a radar to assist in navigation. By clicking on a label, a bottom drawer appears which holds additional information and allows for more interactions. Users can save the POI for later reference, access the reconstruction if available, or return to the map. The main concept of the experience is to reveal the available information under controllable conditions, to facilitate a diverse range of textual, image and 3D interactions.

Figure 3: 3D reconstruction of the Glass mosque featuring the now demolished minaret, as seen by MAR users.

Figure 5: Reconstructions of the demolished towers of the Byzantine Wall and the restoration of the Rocco temple.

Figure 4: Map view displaying the POIs.

When in the initial state, the user is shown the available 3D reconstructions on the map. After visiting the monuments and viewing them in AR, the rest of the POI locations are unlocked and displayed on the map, indicated as question marks. The goal is to visit them and classify them into the historical periods based on their architectural characteristics and on clues obtained in the library page and from the already visited monuments. The exploration of the POIs can be conducted freely by employing the background service of the app. The users enable or disable this functionality. The aim of this approach is to urge the visitors to observe the monuments, consult the information available and even interact with each other and with locals to make each decision. The more areas they visit and unlock, the more points they earn for themselves.


The 3D reconstructions are the main feature we aimed to provide, as shown in Figures 3 and 5. By situating information in the context of its real surroundings, we aim to elevate the communication of cultural and historical information from static forms to visual standards. The 3D reconstructions are accessed from the navigation views. When the user is in close proximity to the monuments, the location Event Handler informs the views to update the content and enable the AR experiences. In this screen, a reconstructed 3D model of the monument is overlaid on the camera, and the GPS and inertial sensors are exploited to display the monument in its real-world location. The users can freely move around the real site to view the monuments from all available angles. They can click on the model to get information or access the slider, available at the bottom right, to change between the available 3D models. The user can shift between visualizing either the whole model or the reconstructed parts. The Library page is where a collection of the historical information is displayed (Figure 6). It consists of a view pager containing the historical periods in chronological order. Each page includes a historical briefing and an image showing the active area for that period as well as a list containing the monuments that have been correctly classified. The locked monuments are contained in a separate list. The users can see the monument-specific information by selecting the items on the list. The user can also acquire historical information, including text and images, mark and save monuments or get directions to specific locations.

6    EVALUATION

The MAR functionality, as well as the generic user interface of the application, was constantly evaluated from the start of the technical development. Users commented that the use of the AR camera view instead of the map during navigation limited their movements and perception of their surroundings, and they generally refrained from using it except to locate specific sites and to classify the monuments. During the classification process, the AR camera proved useful as it helped locate specified monuments. Moreover, when the AR camera was on in conjunction with the GPS, high battery consumption was an issue. Following such comments, the AR camera was defined as a standalone activity instead of as a map replacement. In relation to the users' general impression of the AR experiences, user feedback was quite promising. Most users had never been acquainted with a similar application and were very excited to see the reconstructions superimposed on the real-world monument. Although the registration problem was commented on by most users, the geo-location approach was intuitive. The instant tracking method was overall challenging for an unaccustomed audience, but after an initial explanation and guidance, users got used to it and proceeded to experiment with placing the models in the annotated area so that they are accurately overlaid on the actual real-world building, as viewed on the mobile phone.

Figure 6: Pager View of Library, Monument details page.

7    CONCLUSIONS

We presented the design of a mobile Augmented Reality application, geo-located and gamified, aimed at consumer-grade mobile phones, increasing the synergy between visitors and cultural heritage sites. In addition to exploring application screens and web content, we offer a novel AR approach for visualizing historical information on-site, outdoors. By offering 3D reconstructions of cultural sites through AR, we enhance the digital guide experience while the user is visiting sites of cultural heritage and bridge the gap between digital content and the real world. Our design was focused on providing an expandable platform that could incorporate additional sites, enabling future experts to display their digitized collections using cutting-edge AR technologies. Although outdoors Mobile Augmented Reality is still hampered by technical challenges concerning


localization and registration, it offers novel experiences to a wide audience. The availability and technological advances of modern smart phones will allow, without a doubt, seamless and fascinating AR experiences enhancing the understanding of historical sites and datasets. The field of Augmented Reality is quickly evolving. Tracking and registration in AR are far from solved. Employing a low-level AR SDK, such as the AR-toolkit, would provide access to low-level functionality. In its current state, the application does not support interaction between the users, apart from the overall rankings. Extending the platform to include comments, likes, shares etc. and adding an additional communication layer would increase interest in cultural heritage. Gamification combined with scavenging and treasure hunts, location-aware storytelling etc. would add to an even more immersive experience and increase visitor involvement and engagement.

REFERENCES

Azuma, R. T., Baillot, Y., Behringer, R., Feiner, S., Julier, S., MacIntyre, B., 2001. Recent Advances in Augmented Reality. Computer Graphics and Applications, IEEE, 21(6):34–47.
Choudary, O., Charvillat, V., Grigoras, R., Gurdjos, P., 2009. MARCH: Mobile AR for Cultural Heritage. In Proc. 17th ACM Int. Conf. on Multimedia, 1023-1024.
Hable, R., Rößler, T., Schuller, C., 2012. evoGuide: Implementation of a Tour Guide Support Solution with Multimedia and AR content. Proc. 11th Conf. on Mobile and Ubiquitous Multimedia, No 29.
Julier, S. J., Schieck, A. F., Blume, P., Moutinho, A., Koutsolampros, P., Javornik, A., Rovira, A., Kostopoulou, E., 2016. VisAge: Augmented Reality for Heritage. In Proc. ACM PerDis 2016, 257-258.
Kutter, O., Aichert, A., Bichlmeier, C., Traub, J., Heining, S.M., Ockert, B., Euler, E., Navab, N., 2008. Real-time Volume Rendering for High Quality Visualization in AR. In Proc. AMI-ARCS 2008, NY, USA.
Lee, J. Y., Lee, S. H., Park, H. M., Lee, S. K., Choi, J. S., Kwon, J. S., 2010. Design and Implementation of a Wearable AR Annotation System using Gaze Interaction. Consumer Electronics (ICCE), 2010, 185–186.
Nagakura, T., Sung, W., 2014. Ramalytique: Augmented Reality in Architectural Exhibitions. In Proc. CHNT 2014.
Niedmermair, S., Ferschin, P., 2011. An AR Framework for On-Site Visualization of Archaeological Data. In Proc. 16th Int. Conf. on Cultural Heritage and New Technologies, 636-647.
Pacheco, D., Wierenga, S., Omedas, P., Wilbricht, S., Knoch, H., Paul, F. M. J., 2014. Spatializing Experience: a Framework for the Geolocalization, Visualization and Exploration of Historical Data using VR/AR Tech. Proc. ACM VRIC 2014, No 1.
Ragia, L., Sarri, F., Mania, K., 2015. 3D Reconstruction and Visualization of Alternatives for Restoration of Historic Buildings: A New Approach. In Proc. GISTAM 2015.
Střelák, D., Škola, F., Liarokapis, F., 2016. Examining User Experiences in a Mobile Augmented Reality Tourist Guide. In Proc. PETRA 2016, No 19.
Vlahakis, V., Karigiannis, J., Tsotros, M., Gounaris, M., Almeida, L., Stricker, D., Gleue, T., Christou, I. T., Carlucci, R., Ioannidis, N., 2001. Archeoguide: First Results of an AR, Mobile Computing System in Cultural Heritage Sites. Proc. VAST 2001, 131-140, ACM.
Zhou, F., Been-Lirn, H. D., Billinghurst, M., 2008. Trends in Augmented Reality Tracking, Interaction and Display: A Review of 10 Years of ISMAR. Proc. ISMAR 2008, 193-202.


An Interactive Story Map for the Methana Volcanic Peninsula

Varvara Antoniou¹, Paraskevi Nomikou¹, Pavlina Bardouli¹, Danai Lampridou¹, Theodora Ioannou¹, Ilias Kalisperakis², Christos Stentoumis², Malcolm Whitworth³, Mel Krokos⁴ and Lemonia Ragia⁵

¹Department of Geology and Geoenvironment, National and Kapodistrian University of Athens, Panepistimioupoli Zografou, 15784 Athens, Greece
²up2metric P.C., Engineering - Research - Software Development, Michail Mela 21, GR-11521, Athens, Greece
³School of Earth and Environmental Sciences, University of Portsmouth, Burnaby Road, Portsmouth PO1 3QL, U.K.
⁴School of Creative Technologies, University of Portsmouth, Winston Churchill Avenue, Portsmouth PO1 2DJ, U.K.
⁵Natural Hazards, Tsunami and Coastal Engineering Laboratory, Technical University of Crete, Chania, Greece
{vantoniou, evinom, d.lampridou}@geol.uoa.gr, {pavlina.bard, theod.ioannou}@gmail.com, {ilias, christos}@up2metric.com, {malcolm.whitworth, mel.krokos}@port.ac.uk, [email protected]

Keywords: GIS Story Map, Geomorphology, Methana Peninsula, Greece, Volcano, Geotope, Hiking Trails.

Abstract: The purpose of this research is the identification, recording, mapping and photographic imaging of the special volcanic geoforms as well as the cultural monuments of the volcanic Methana Peninsula. With the use of novel methods, the aim is to reveal and study the impressive topographic features of the Methana geotope and discover its unique geodiversity. The proposed hiking trails, along with Methana's archaeology and history, will be highlighted through the creation of an 'intelligent' interactive map (Story Map). Two field trips have been conducted for the collection of further information and the digital mapping of the younger volcanic flows of Kammeni Chora with drones. Through the compiled data, thematic maps were created depicting the lava flows and the most important points of the individual hiking paths. The thematic maps were created using a Geographic Information System (GIS). Finally, those maps were the basis for the creation of the main Story Map. The decision to use Story Maps was based on the numerous advantages on offer, such as user-friendly mapping, ease of use and interaction, and user-customized displays.

1    INTRODUCTION

Recent advancements in digital Geographic Information Systems (GIS) technologies can provide new opportunities for immersively engaging public audiences with complex multivariate datasets. Story Maps can be not only robust but also versatile tools for visualising spatial data effectively and, when combined with multi-media assets (e.g. photos or videos) and narrative text, they can provide support for scientific storytelling in a compelling and straightforward way. Thus, Story Maps can be used to disseminate scientific findings and make them easy to access and understand for broader non-technical audiences (Janicki, J. et al., 2016; Wright, D.J. et al., 2014). The aim of the present research is to identify, record, map and photographically image the special volcanic geomorphs as well as the cultural

monuments of the Methana Peninsula (East Peloponnese, Greece). The Methana peninsula is composed of 32 volcanic craters with rough topography, belonging to the western part of the Hellenic Volcanic Arc. Using Story Maps along with novel methods and research tools, it is planned to reveal and highlight the peculiar geomorphs of the Methana geotope and discover its unique geodiversity. Adopting Story Maps for this work offers a number of advantages compared to traditional methods: user-friendly mapping, ease of use and understanding of the provided information, increased interactivity compared to analogue or simple web maps, a display customized to the user's needs, the ability to import different kinds of media (images and videos) and, ultimately, the ability to add explanatory text covering a wide range of heterogeneous information.

68 Antoniou, V., Nomikou, P., Bardouli, P., Lampridou, D., Ioannou, T., Kalisperakis, I., Stentoumis, C., Whitworth, M., Krokos, M. and Ragia, L. An Interactive Story Map for the Methana Volcanic Peninsula. In Proceedings of the 4th International Conference on Geographical Information Systems Theory, Applications and Management (GISTAM 2018), pages 68-78 ISBN: 978-989-758-294-3 Copyright © 2018 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved


2    STUDY AREA

The Methana volcanic peninsula (Methana Volcano) is located in the western Saronic Gulf, approx. 163 km from Athens, covering an area of 50 km². Methana Volcano is at the western part of the Aegean volcanic arc, which extends from the Saronic Gulf up to the Kos-Nisyros volcanic field at its eastern part (Fig. 1). The Aegean volcanic arc belongs to the Hellenic Orogenic Arc, which is formed along the convergent plate boundary of the northwards subducting African plate underneath the active margin of the European plate (Nomikou et al., 2013). The peninsula of Methana has the longest recorded volcanic history of any volcanic centre in the Aegean Volcanic Arc, consisting of 30 volcanic cones. Particularly noteworthy are the historical references regarding the volcanic activity, in the 3rd century BC, of the submarine volcano Pausanias, lying offshore the northwest part of the Methana peninsula (Pavlakis et al., 1990). Throughout the Methana peninsula there is a well-developed network of hiking trails, passing through historic settlements, small churches, hot springs and unique geomorphological features attributed to the volcanic history (lava formations) and the complex tectonic regime of the area (Pe-Piper and Piper, 2013). The overall length of the hiking network is approximately 60 km and, based on the present study, the hiking distances range between 0.5 km and 5 km. Moreover, the trails are rated into different difficulty levels and in several cases appropriate equipment is needed. Volcanic activity in the area is considered to have begun in the late Pliocene (Gaitanakis and Dietrich, 1995), and the last eruption took place in 230 BC at Kammeni Hora, giving andesitic lava, as recorded by the ancient geographer Strabo (Georgalas, 1962). The Quaternary volcanic rocks on Methana consist of domes and flows radiating from the central part of the peninsula, overlying older, undated volcanic rocks (inferred Pliocene or early Pleistocene in age). At a map scale, many of the domes are elongated in an east–west or northeast–southwest direction. The volcanic style and rate of eruption are closely related to periods of change in regional tectonic style (Pe-Piper and Piper, 2013). Moreover, Pe-Piper and Piper (2013) deciphered the volcanological evolution of the Volcano in great detail based on geochemical and geochronological analyses and field observations. The following volcanic history has been identified (Fig. 2):
• Phase A. Late Pliocene. Small domes of andesite and dacite were extruded on N–S-striking faults in eastern and southern Methana. Either synchronously or later, a larger volcanic edifice grew somewhere near the present centre of the peninsula.
• Phase B. Erosion of the central edifice to form the volcanoclastic apron, perhaps associated with faulting and uplift.
• Phase C. Eruption of basaltic andesite now preserved in northern Methana around Kounoupitsa, at Ag. Andreas and Akri Pounda. A series of explosive Plinian eruptions deposited in the northern and eastern parts of the volcanoclastic apron and at Akri Pounda. Erosion of the central edifice and volcanoclastic deposition on the apron continued. The age of phase C is poorly constrained — the 1.4 ± 0.3 Ma date on a dome in northern Methana is only tentatively correlated with this phase.
• Phase D. Andesite flows in the north-western part of the peninsula and dacites in the south show some geochemical similarities to phase C (e.g. high TiO2 content), but overlie the volcanoclastic apron and its associated erosion surface in eastern Methana. Imprecise radiometric dates range from 0.5 to 0.9 Ma.
• Phase E. The north-western dacite volcanoes were formed and are dated at 0.6 ± 0.2 Ma in this study.
• Phases F and G. These phases were characterised by the eruption of the central andesite volcanoes and the E–W fissure dacites. Some explosive pyroclastic eruptions preceded major andesite and dacite eruptions. Available radiometric ages from phase G cluster between 0.29 and 0.34 Ma.
• Phase H. Eruption of the Kammeni Hora flows, probably within the last 0.2 Ma, with the most recent eruption in historic times.

2.1    Geomorphology

The Methana peninsula is characterized by rough topography, generated by the complex regional tectonic regime in combination with the volcanic activity. The mountainous relief of the peninsula, 740 masl at its highest point, falls to the sea with no lowland plain. Abrupt and sudden changes in slope gradient alternate with flat basinal areas (Fig. 3) filled by Quaternary sediments, where at the same time volcanic agglomerates commonly fill depressions between domes (James et al., 1994). Moreover, the volcanic landforms are dissected by stream gullies, reflecting the intense erosion. This rugged terrain, with the well-developed drainage system and the steep slopes, is prone to landslides and rockfalls induced by geomorphologic and geologic controls.


Figure 1: Topographic map of the southern Aegean Sea combining onshore and offshore data. The four modern volcanic groups are indicated within red boxes together with the names of the main terrestrial and submarine volcanic centers along the volcanic arc (Nomikou et al., 2013).

Figure 2: Schematic cross-sections (E–W or SE–NW) and maps of Methana, showing inferred relationship of volcanic stratigraphy to evolution of regional fault patterns. Cross sections illustrate stratigraphy; no representation of the magmatic plumbing system is attempted (Pe-Piper and Piper, 2013).


Figure 3: Morphological Map of Methana peninsula.

3    DATA COLLECTION

To tackle the challenge of creating the Story Map of Methana volcanic peninsula, different types of datasets have been collected (Fig. 4). All the available literature regarding the geology, geodiversity, archaeology and biodiversity has been compiled and geospatial data have been downloaded from open source portals. Moreover, two field trips took place in September in order to acquire field data.

3.1    Field Trips

Two field trips have taken place in order to collect new photographic material, to trace paths and find places of special interest, attaching representative photographs or videos, etc. In order to collect all these new data, up-to-date technology has been used, namely the Collector for ArcGIS software and GPS. Furthermore, an aerial campaign with an unmanned aerial vehicle (UAV) was conducted. A commercial, off-the-shelf quadrocopter (DJI Phantom 4 Pro Plus) with a 21 MP digital RGB camera, from the University of Portsmouth and the up2metric company, was used. Flights were performed at different areas of interest, over the City of Methana, the Kammeni Chora village and the volcanic formations at the western part of Methana. There, video sequences and images at constant time intervals were captured, to guarantee a higher than 80% image overlap (Fig. 5).

Figure 4: Chart showing the different types of datasets used in this study.



Figure 5: Photo taken during the fieldtrips capturing Methana Volcano.

4

METHODOLOGY

4.1

Geo-database

For the present research, a geo-database was created, that is, a systematic collection of the existing information for the study area, organised in a user-friendly and functional system that supports effective ways of data visualisation. Specifically, our database was divided into two subsets of data. The first subset consists of the available data files for the area (bibliography, topographic map, geological and tectonic structure, etc.). Those that were in analogue form were converted into digital form, in order to make further use of them. This subset also includes the vector data files that have been designed for the data collection in the field work. The second subset consists of the data files that resulted from the processing of the aforementioned files, or from their modification with new field data, and these are the files on which the thematic maps were based. These files were gradually transformed into


a format suitable for online use (feature or image services) and easily applied to the Story Map.

4.2

Field Data Collection

A Geographical Information System, the ArcGIS platform from ESRI, with both desktop and online applications, was used to accomplish this study (https://www.esri.com/). The creation of information layers, including existing and new data, has been performed through the ArcMap v.10.5.1 software (“ArcGIS for Desktop – ArcMap,” n.d.). In addition, ArcGIS Online (“ArcGIS Online - ArcGIS Online Help,” n.d.) has been used in order to construct the online map (webmap) on which the collected data would be presented. Finally, Collector for ArcGIS (“ArcGIS Collector,” n.d.), compatible with both Android and iOS, has been used for the data collection. This application supports functionality to collect and update spatial and descriptive data through mobile devices (tablets or smartphones). More specifically, its advantages are:


- Convenient collection of points, lines and elements that cover a large area.
- Data collection and update using the map or the GPS signal.
- Photo and video attachments confirming the collected descriptive data.
- Capability to download maps to a mobile device and use them even with no internet access.
- Capability of monitoring specific areas and composing reports about them.

In more detail, the methodology unfolds as follows (Fig. 6). Firstly, the collection and organization of existing vector and grid data was carried out, together with their spatial and descriptive analysis where necessary. For this purpose, a geodatabase has been created via the ArcMap v.10.5.1 software, in which all information layers that would also appear on the online map have been added, including the editable ones. Apart from the type of spatial information, each information layer hosts all the necessary fields for the descriptive information. This information is either the already existing one or the one collected during field work. The pre-existing information layers include the coastline, settlements, geological formations (Fig. 7) and tectonic structures of the island. Two editable information layers have been created for the field data collection (Fig. 8): one point and one polyline vector file which, apart from spatial information, also include descriptive information and photos or videos for each collected feature. In the second part of this study, the information layers were uploaded to the online platform of ArcGIS and converted to feature services, a file type that can show information online (https://goo.gl/mBTiKF). In the next step, a webmap was created and the individual parameters for each of the information layers were defined, e.g. its symbol and whether tags and pop-up menus appear (DiBiase et al., 1992; Newman et al., 2010). Moreover, a refresh interval for the information layer used for data collection has been defined. Specific symbols for each user group have been created, so that each group can be directly identified. Imagery, which is available from the ArcGIS platform, was assigned as the background of the above information layers. In this research, GIS technology was used only to collect, analyse and visualize data, using desktop and online interactive techniques, because its main aim was to disseminate this way of data presentation to

the public, combining scientific information about the volcanic peninsula with archaeology and history.
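As an aside for readers who want to mirror the structure of these editable layers outside the ArcGIS ecosystem, the following is a minimal sketch using GeoPandas; the field names and coordinates are hypothetical placeholders, not the schema actually used in this study.

```python
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical schema mirroring the editable point layer described above:
# geometry plus descriptive fields and a link to an attached photo.
records = [
    {
        "name": "Kammeni Chora lava flow",      # placeholder feature name
        "category": "volcanic landform",
        "description": "Historic andesite flow in NW Methana",
        "photo_url": "https://example.org/photos/kammeni_chora.jpg",
        "geometry": Point(23.33, 37.62),        # lon, lat in WGS84, illustrative only
    }
]

points = gpd.GeoDataFrame(records, geometry="geometry", crs="EPSG:4326")

# Export to a format that can be uploaded to an online platform and
# published as an editable feature layer for field collection.
points.to_file("methana_points.geojson", driver="GeoJSON")
```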

4.3

UAV Survey

The acquired video samples were used to create small demonstration videos and panoramic photos. Still images, captured at constant time intervals to guarantee a higher than 80% overlap, were also used to generate photogrammetric 3D textured models. For the latter, the drone camera was calibrated and all images were oriented with a standard Structure-from-Motion (SfM) approach. This procedure includes the establishment of sparse multi-image point correspondences. This is achieved by 2D feature extraction and matching among images, employing feature descriptors at multiple image scales. The point correspondences were filtered through standard RANSAC outlier detection and all mismatched points were identified and eliminated. Image orientations were initialized through closed form algorithms and finally optimal estimations of exterior and interior orientation parameters were computed through a standard self-calibrating bundle adjustment solution. After image orientation, dense point clouds were generated by means of dense stereo and multi-image matching algorithms. Through 3D triangulation, the 3D point clouds were converted to 3D mesh models. Photorealism was finally achieved by computing texture for each 3D triangle via a multi-view algorithm, using a weighted blending scheme. Photorealistic texture was estimated by means of interpolation, using all images which view each particular surface triangle (Fig. 9). The photogrammetric processing was performed using the Pix4DMapper commercial software, assisted by in-house developed algorithms for dense stereo matching (Stentoumis et al., 2014) and the refinement of the 3D model’s texture (Karras et al., 2007).
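The sparse-correspondence stage of this pipeline (multi-scale feature extraction, descriptor matching and RANSAC outlier rejection) can be illustrated with a short OpenCV sketch. This is only a simplified stand-in for the Pix4DMapper workflow actually used; the file names are placeholders.

```python
import cv2
import numpy as np

# Two overlapping UAV frames (placeholder file names).
img1 = cv2.imread("frame_001.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame_002.jpg", cv2.IMREAD_GRAYSCALE)

# 1. Feature extraction and descriptor computation at multiple image scales.
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# 2. Descriptor matching with a ratio test to discard ambiguous matches.
matcher = cv2.BFMatcher()
matches = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]

# 3. RANSAC filtering of the tentative correspondences, as in the
#    outlier-detection step described above.
pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])
F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.999)
inliers = int(mask.sum()) if mask is not None else 0
print(f"{len(good)} tentative matches, {inliers} RANSAC inliers")
```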

5

STORY MAP

In order to compose this Story Map, all the available information was uploaded to the online platform. Users have the possibility of either using a private server or uploading the information directly to ArcGIS Online. The latter approach was followed during the deployment of this Story Map (Fig. 10).



Figure 6: Workflow to be followed in order to use mobile devices for collecting data via Collector for ArcGIS application.

Figure 7: Screenshot of ArcGIS Desktop 10.5.1, which shows the geological information (spatial and descriptive) for the area.



Figure 8: Screenshot which shows data that have been collected during field trips.

Figure 9: Photogrammetric 3D image of Nisaki (Methana peninsula).



Figure 10: Workflow highlighting the procedure followed to produce a Story Map (Antoniou, 2015).

A specific template, called Story Map Series, was implemented to present the available information. Story Map Series comes with three layout options: tabbed, side accordion and bulleted. The first one was selected for the main Story. Web maps, narrative text, images, tables, videos, external websites and scenes, which correspond to 3D presentations of data, were used. Also, other Story Maps and apps were embedded, such as Story Map Shortlist, Story Map Series-Side Accordion and Time Aware. Finally, Story Map Cascade was used as a home page (Fig. 11). Thematic maps were created in ArcGIS Online and were based on the collected data, the fieldwork and the literature review, depicting the most important and unique points. More specifically: the first tab, using Story Map Series-Side Accordion, gives general information about Methana peninsula, including its geographical position and geomorphology and a brief description of the area’s points of interest. The text is accompanied by webmaps showing the spatial distribution of these points, with imagery from ESRI’s basemap gallery as the basemap (Fig. 12). The second tab presents the geological setting of the peninsula. The text explains the volcanic activity of the area and, in addition, an embedded Time Aware application presents the geological-volcanic evolution of the island. A 5 m hillshade of the area was used as the basemap and a Scene (3D presentation) was created. The next three tabs present the main hiking trails in the area. A Story Map Tour application was embedded


in each one of them. The text describes the morphology of each path and gives detailed information for every point of interest, while a webmap shows their spatial distribution. Users are able to select either a point on the map or a photo or video from the carousel and gather further information. The last tab indicates the research team responsible for the creation of this Story Map. Finally, in order to give users the ability to choose the language they prefer, a Story Map Cascade was used as the Story Map’s home page.

6

CONCLUSIONS

The use of a Story Map has plenty of advantages, since it presents useful and attractive information about the study area. The use of explanatory text and the incorporation of multimedia help the end user to engage in scientific knowledge transfer and provide a better understanding of Methana’s volcanic geodiversity. The user of the Story Map can navigate easily through the content by pop-ups, by swiping up and down and through slides. As it is user-friendly, the interface adapts to the user’s display screen (mobile phones, computers or tablets) and every single user has the ability to customize the application to his or her needs (for example, unveiling a specific volcanic cone).


Figure 11: Methana Volcano Story Map structure.

Figure 12: Representative Screenshot of the Story Map.

In conclusion, the Methana Volcano Story Map is a good example of a web map that provides information to a wide audience, developing interest and possibly motivating the public to learn more about the displayed area, or even to visit it.

ACKNOWLEDGEMENTS This work was supported and funded by the Municipality of Troizinia - Methana in the framework of the Applied Research Program “Evaluation and exploitation of the geotope of the Methana Volcano” of the National and Kapodistrian University of Athens.



REFERENCES Antoniou, V., 2016. Development of modern, onlineinteractive applications using WebGIS for processing and mapping geoenvironmental information and real time field data, to prevent and manage natural disasters. Master Thesis, National and Kapodistrian University of Athens, 138p. “ArcGIS Collector.” n.d. https://goo.gl/KoBxzn (accessed 3/1/2018). “ArcGIS for Desktop – ArcMap.” n.d. https://goo.gl/OPVwsG (accessed 3/1/2018). “ArcGIS Online - ArcGIS Online Help.” n.d. https://goo.gl/rJUyYn (accessed 3/1/2018). DiBiase, D., MacEachren, A.M., Krygier, J.B. and Reeves, C., 1992. “Animation and the Role of Map Design in Scientific Visualization.” Cartography and Geographic Information Systems, 19 (4): 201–2014. Gaitanakis, P. and Dietrich, A.D., 1995. Geological map of Methana peninsula, 1:25 000, Swiss Federal Institute of Technology, Zurich. Georgalas, G., 1962. Catalogue of the Active Volcanoes of the World, Including Solfatara Fields. Part 12: Greece. International Association of Volcanology, Rome,40pp. James, P. A., Mee, C. B. and Taylor, G. J., 1994. Soil Erosion and the Archaeological Landscape of Methana, Greece. Journal of Field Archaeology, 21(4), p.395. Janicki, J., Narula, N., Ziegler, M., Guénard, B., and Economo, E. P., 2016. Visualizing and interacting with large-volume biodiversity data using client–server webmapping applications: The design and implementation of antmaps.org. Ecological Informatics, 32, 185–193. Karras, G., Grammatikopoulos, L., Kalisperakis, I. and Petsa, E., 2007. Generation of orthoimages and perspective views with automatic visibility checking and texture blending. Photogrammetric Engineering and Remote Sensing, 73(4), 403-411. Newman G., Zimmerman D., Crall A. Laituri M., Graham J. and Stapel L., 2010. User-friendly web mapping: lessons from a citizen science website, International Journal of Geographical Information Science, 24, 12, 1851- 1869. Nomikou, P., Papanikolaou, D., Alexandri, M., Sakellariou, D. and Rousakis, G., 2013. Submarine volcanoes along the Aegean Volcanic Arc. Tectonophysics, 507-508, 123-146. Pe – Piper, G. and Piper, D., 2013. The effect of changing regional tectonics on an arc volcano: Methana, Greece, Journal of Volcanology and Geothermal Research, 260, 146 – 163. Pavlakis, P., Lykousis, V., Papanikolaou, D. and Chronis, G. 1990. Discovery of a new submarine volcano in the western Saronic Gulf: the Paphsanias Volcano. Bulletin of the Geological Society of Greece, 24, 59-70. “Pix4DMapper” n.d. https://pix4d.com/ (accessed 3/1/2018). Stentoumis, C., Grammatikopoulos, L., Kalisperakis, I. and Karras, G., 2014. On accurate dense stereo-matching


using a local adaptive multi-cost approach. ISPRS Journal of Photogrammetry and Remote Sensing, 91, 29-49. Wright, D.J., Verrill, A., Artz, M. and Deming, R, 2014. Story maps as an effective social medium for data synthesis, communication, and dissemination, Eos, Trans. AGU, 95, Fall Meet. Suppl., Abstract IN33B3773.

GIS and Geovisualization Technologies Applied to Rainfall Spatial Patterns over the Iberian Peninsula using the Global Climate Monitor Web Viewer Juan Antonio Alfonso Gutiérrez, Mónica Aguilar-Alba and Juan Mariano Camarillo Naranjo Departamento de Geografía Física y Análisis Geográfico Regional, Universidad de Sevilla, Spain [email protected], [email protected]

Keywords:

Geovisualization, Precipitation, Climate Data, Spatial Databases, Iberian Peninsula, Spatial Analysis.

Abstract:

Web-based GIS and geovisualization are expanding rapidly, but few examples yet exist regarding the diffusion of climatic data. The Global Climate Monitor (GCM) (http://www.globalclimatemonitor.org), created by the Climate Research Group of the Department of Physical Geography of the University of Seville, was used to characterize the spatial distribution of precipitation in the Iberian Peninsula. The concern about the high spatial-temporal variability of precipitation in Mediterranean environments is accentuated in the Iberian Peninsula by its physiographic characteristics. However, despite its importance in water resources management, it has scarcely been addressed from a spatial perspective. Precipitation is characterized by positively skewed frequency distributions, so conventional statistical measures lose representativeness. For this reason, a battery of robust and non-robust statistics little used in the characterization of precipitation has been calculated and evaluated quantitatively. The results show important differences that might have significant consequences in the estimation and management of water resources. This study has been carried out using Open Source technologies and has involved the design and management of a spatial database. The results are mapped through a GIS and are incorporated into a web geoviewer (https://qgiscloud.com/Juan_Antonio_Geo/expo) in order to facilitate access to them.

1

INTRODUCTION

Current climate research benefits from the existence of large, global climate databases produced by various international organizations. The common denominator is their availability and accessibility under the 'open data' paradigm. Very often, these new datasets cover the entire earth with a more regular spatial distribution (normally gridded), have a longer and more homogeneous time span and are built under more robust procedures. Many varied sources of information are at the basis of the global datasets that are accessible on the reference web portals of the subject. It is important to note the wide availability of these global datasets and their quality. In most cases these datasets are distributed under open-database licenses. This distribution has favoured the increasingly widespread use of global data by scientists, and the emergence of countless references from studies based on these data (Folland et al., 2001; Jones &

Moberg, 2003; New, Hulme & Jones, 2000, etc.). However, the complexity of the very technical distribution formats (netCDF or huge plain text files with millions of records) limits these datasets to a very small number of users, almost exclusively scientists. For non-expert users, it is important to develop and offer new environmental tools in an open and transparent manner, because stakeholders, users, policy makers, scientists and regulators prefer and demand it (Carslaw and Ropkins, 2012; Jones et al., 2014). This open data and open knowledge paradigm, also referred to by some as 'the fourth paradigm' (Edwards et al., 2011), responds, in relation to climate data, to the double challenge that climate science is currently facing: on the one hand, it has to guarantee the availability of data to permit more exploration and research and, on the other, it has to reach citizens. This leads directly to the use of Open Source technologies supported by an extensive worldwide community of users that provide tested evidence in very demanding applications. Such a large



and proven implementation in results and experience has also motivated the use of these open technologies in our research. Particularly, the Climate Research Group of the University of Seville has remarkable experience with PostgreSQL/PostGIS as the core of its data management and research tools. Concerning geovisualization, it is worth noting that it is a crucial element in applications for decision support; web services are particularly appropriate for this (Vitolo et al., 2015). Also, webmapping has become an effective tool for public access to information in general and to climate knowledge and climate monitoring in particular, especially considering the ongoing advances in web GIS and geovisualization technologies. Despite the fact that there are some very specialized geoviewers to access and download the data (for example, the Global Climate Station Summary by NCDC-NOAA), in most cases these viewers are very general and/or offer poor geodisplays (European Climate Assessment). The Global Climate Monitor (GCM) system belongs to this field, more precisely to the field of researching possibilities of dissemination of monitored climatic information through the use of geospatial web viewers. In this work, we focus on showing the possibilities of the GCM in climatic research by analysing the spatial distribution of precipitation in the Iberian Peninsula, located in southern Europe. Understanding the spatial distribution of rainfall is an important element for the management of natural resources in the Iberian Peninsula. With the exception of the northern mountain range, the Iberian Peninsula is included in the Mediterranean climate domain (De Castro et al., 2005; Martin-Vide, 2011b), showing the inter-annual irregularity characteristic of this type of climate (García-Barrón et al., 2011). Its main characteristics are marked fluctuations from rainy years to periods of drought, together with an irregular intra-annual regime with minimum values of rainfall during the summer months (García-Barrón et al., 2013). Due to these characteristics, decision-making in water management requires the delivery of accurate scientific information that provides objective criteria for the technical decisions in the water planning process directly affecting the environment and society (Krysanova et al., 2010; Cabello et al., 2015). Furthermore, in order to advance the knowledge of the rainfall regime in the Iberian Peninsula, researchers have related the annual, seasonal or monthly volume to synoptic situations mainly linked to patterns of atmospheric circulation and weather types (Muñoz-Diaz and Rodrigo, 2006; López-


Bustins et al., 2008; Casado et al., 2010; Hidalgo-Muñoz et al., 2011; Cortesi et al., 2014; Ríos-Cornejo et al., 2015). Nevertheless, there are few studies dedicated to the analysis of the spatial variation of rainfall over the Iberian Peninsula. The present study introduces a complementary aspect by calculating a set of robust and non-robust statistics for the characterization of precipitation. Robust statistical measures are not commonly used either for characterizing precipitation or for estimating water resources in environmental management and planning, despite the positive skewness of the frequency distributions. The differences between both types of measurements have also been obtained quantitatively in order to evaluate the possible bias when estimating precipitation volumes. This work has involved the implementation of a high-volume spatial database that allows the analysis and processing of the data. Results can be viewed using GIS technologies, and this information is disseminated and made accessible to final users by building up a new geoviewer (https://qgiscloud.com/Juan_Antonio_Geo/expo).

2

STUDY AREA AND DATA

The Iberian Peninsula is located at the southwestern end of Europe, close to North Africa, and is surrounded by the Mediterranean Sea to the east and by the Atlantic Ocean to the west. Due to its transitional situation between the middle latitudes and the subtropics, and to its complex orography, its climatic characteristics and types are very diverse, with complex patterns of spatio-temporal variability of most climatic variables (Garcia-Barrón et al., 2017). The data currently displayed in the GCM correspond to the CRU TS3.21 version of the Climate Research Unit (University of East Anglia) database, a product that provides data at a spatial resolution of half a degree in latitude and longitude, spanning from January 1901 to December 2012 on a monthly basis. From January 2013, the datasets that feed the system are the GHCN-CAMS temperature dataset and the GPCC First Guess precipitation dataset of the Global Precipitation Climatology Centre (Deutscher Wetterdienst, Germany). The data that are currently offered in the display come from three main datasets: the CRU TS3.21, the GHCN-CAMS, and the GPCC first guess monthly product. The basic features of these products are shown in Table 1.


Table 1: Basic features of the datasets included in the Global Climate Monitor.

Dataset                  | CRU TS3.21                                            | GHCN-CAMS                             | GPCC first guess
Spatial resolution       | 0.5º x 0.5º lat by lon                                | 0.5º x 0.5º lat by lon                | 1º x 1º lat by lon
Time span                | January 1901 to December 2012                         | January 1948 to present expired month | August 2004 to present expired month
Time scale               | Monthly                                               | Monthly                               | Monthly
Spatial Reference System | WGS84                                                 | WGS84                                 | WGS84
Format                   | netCDF                                                | Grib                                  | netCDF
Variables                | Total precipitation amount; Average mean temperature  | Average mean temperature              | Total precipitation amount

Figure 1: Global Climate Monitor view.

The CRU TS3.21 dataset is a high-resolution grid product that can be obtained through the British Atmospheric Data Centre website or through the Climatic Research Unit website. It has been subjected to various quality controls and homogenization processes (Mitchell & Jones, 2005). It is important to note that this database is also offered in the data section of their website and in their global geovisualization service (Intergovernmental Panel on Climate Change, 2014). Apart from this organization, CRU TS is one of the most widely used global climate databases for research purposes. The GHCN and GPCC datasets are used to update monthly data. The data used for this study are historical series of monthly

precipitation from January 1901 to December 2016 that covers the entire Iberian Peninsula. Visually, they are represented in the form of a grid, in an area of 0.5ºx0.5º latitude-longitude (Figure 1). Therefore, the total volume of information takes into account the total months, years and cells that compose the spatial extent. The amount of information of more than 400,000 records with spatial connection requires the use of a database management system. Given the nature of the precipitation data grid, which is not graphically adjusted to the actual physical limits of the Iberian Peninsula, it is necessary to adapt the rainfall information to the limits of the field of study. For this purpose a vector layer containing the physical boundaries of the



different territorial units that make up the study area (Spain and Portugal) was also used. The 1:1,000,000 territorial statistical units (NUTS) of the European territory for the year 2013 have been obtained from the European Statistical Office (Eurostat, http://ec.europa.eu/eurostat/). The Coordinate Reference System in which this information is distributed is ETRS89 (EPSG: 4258), which is an inconvenience when combining it with the rainfall data projected in the WGS84 Spatial Reference System used in the GCM. Therefore, both systems were reconciled through a coordinate transformation geoprocess. The assembly of the open-source technologies used in this study is shown in Table 2. The use of a spatial data server, map server, web application server and web viewers allows scientists to undertake these types of macro-projects based on the use of Big Data information.

Table 2: Open-source technologies used in this study.

Software / Application | Use
PostgreSQL / PostGIS   | Open source spatial database management system used in this work
PgAdmin                | Open source administration and development platform for PostgreSQL
QGIS                   | Free and open source desktop GIS used to export the database's results
QGIS Cloud             | Free web geoviewer used for representing the results and facilitating their distribution
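The same coordinate harmonisation can also be reproduced outside the database with open-source Python tools; a minimal sketch with GeoPandas is shown below, assuming the standard Eurostat NUTS 2013 shapefile and its CNTR_CODE attribute.

```python
import geopandas as gpd

# NUTS 2013 boundaries as distributed by Eurostat in ETRS89 (EPSG:4258).
nuts = gpd.read_file("NUTS_RG_01M_2013.shp")

# Reproject to WGS84 (EPSG:4326), the reference system of the GCM precipitation
# grid, so that both layers can be overlaid and the grid clipped to the study area.
nuts_wgs84 = nuts.to_crs(epsg=4326)

# Keep only the Spanish and Portuguese units that define the Iberian Peninsula.
iberia = nuts_wgs84[nuts_wgs84["CNTR_CODE"].isin(["ES", "PT"])]
iberia.to_file("iberia_wgs84.geojson", driver="GeoJSON")
```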

Particularly noteworthy is the role of the spatial data server PostGIS in handling spatial and alphanumeric information and charts, based on the relational system PostgreSQL. The data are natively encoded in this system, providing high performance and allowing the use of any analytical functions required; but the most important reason for using PostgreSQL/PostGIS is its geographic relational database, which makes it possible to carry out geoprocessing without having to leave the processing core. This allows the regionalization of data in a quasi-automatic way for any territorial scales of interest, based on the SQL language. Many tools in this field are more or less equivalent, such as the SciDB database, which manages multidimensional cubes (especially suitable for satellite image processing), or some scientific libraries found in R or Python. Nevertheless, these technologies present a different approach. While it is true that for raster spatial applications using time


series data such tools are the ideal environment, it can be argued that, in terms of data structure, the same effect can be obtained with a more conventional relational approach. Such is again the case of the R and Python programming languages, which cover the same scientific needs with different technologies. In any case, a geographic relational database can also replace these technologies in many scenarios like the one presented in this work. Another advantage is that PostgreSQL/PostGIS can be directly coupled to R (http://www.joeconway.com/doc/doc.html), thus obtaining a geostatistical database and enlarging the possibilities of use and applications. Through QGIS Cloud, a web geoviewer was developed as an extension of QGIS. The resulting maps of this study are represented in a geoviewer (https://qgiscloud.com/Juan_Antonio_Geo/expo). Open source systems fit the main aim of the Global Climate Monitor project: rapid and friendly geovisualization of global climatic datasets and indicators for expert and non-expert users, rather than a focus on data analytics.

3

METHODOLOGY

The methodology followed in this work is presented in Figure 2. First, the data were downloaded from the GCM in CSV format and converted into a database to be modelled. The proposed theoretical model defines the conceptual and relational organization which generates the physical database itself. The conceptual model is simple and consists of several tables: the meteorological stations table, the monthly precipitation values table and the table with the geographical limits of the Iberian Peninsula. Then, a series of queries is performed in SQL on the monthly precipitation data series in order to obtain the seasonal and annual values as well as the statistical measures calculated from them. The added value and contribution of using GIS and geospatial databases is that both allow massive geospatial time and space analysis and geovisualization. The aim is to get a series of statistics that will help us characterize the spatial distribution of the precipitation gridded series. The analysis of variability, dispersion, maxima and minima and frequency histograms of the precipitation series in the Iberian Peninsula determined the need to use adequate statistical measures to fulfil our goals. This way of working, through the management of a database system, allows the analysis of precipitation


Figure 2: The methodology flow chart.

at three different levels: a joint analysis of the historical series, where all the monthly values recorded in the database are collected; a seasonal analysis obtained by grouping months with a theoretically similar behaviour, which allows the comparison between them; and finally an intra-seasonal analysis, which gives the possibility of studying the statistical behaviour and variability of precipitation within the months of each season. In the Mediterranean area climate variables, and particularly precipitation, present an extremely variable and irregular behaviour, so the frequency distributions tend to be asymmetrical. Typically, non-robust statistics (mean, standard deviation and coefficient of variation) are commonly used in this type of climatic study. But these measures are susceptible to extreme values that may detract from their statistical representativeness. For this reason we decided to incorporate robust statistics to eliminate this effect and to be able to assess the differences between them (robust and non-robust) for both the absolute and relative statistics shown in Table 3. It is important to note that, despite the relative simplicity of these calculations, many results can be obtained due to the potential offered by the use of spatial databases. Table 4 shows the statistical measures calculated at different time scales (annual, seasonal and monthly) for each precipitation series. The result has provided a total of 136 outcomes, which were

viewed using GIS, each with their corresponding cartographies for the entire Iberian Peninsula.

Table 3: Statistical measures calculated for each series.

           | Centrality statistical | Absolute statistical dispersion | Relative statistical dispersion
Not Robust | Mean (x̄)               | Standard deviation (s)          | Coefficient of variation (CV)
Robust     | Median (Me)            | Interquartile range (IQR)       | Interquartile coefficient of variation (ICV)
Difference | (x̄ – Me)               | (s – IQR)                       | -

Table 4: Statistical measures calculated at different time scales.

Time scale | Measures of central tendency | Absolute measures of variation | Relative measures of variation
Annual     | 3                            | 3                              | 2
Seasonal   | 12                           | 12                             | 8
Monthly    | 36                           | 36                             | 24
Total      | 51                           | 51                             | 34
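As an illustration of how the measures in Tables 3 and 4 can be derived for a single grid cell, the sketch below computes the non-robust and robust statistics from an annualised monthly series with pandas. The file and column names are hypothetical, and the interquartile coefficient of variation is computed with one common definition (IQR divided by the median), since the paper does not spell out its exact formula.

```python
import pandas as pd

# Monthly series for one 0.5-degree cell, with hypothetical columns: year, month, precip_mm.
df = pd.read_csv("cell_precipitation.csv")
annual = df.groupby("year")["precip_mm"].sum()      # annual totals, 1901-2016

mean, median = annual.mean(), annual.median()
std = annual.std()
q1, q3 = annual.quantile(0.25), annual.quantile(0.75)
iqr = q3 - q1

stats = {
    "mean": mean, "median": median, "mean - median": mean - median,
    "std": std, "IQR": iqr, "std - IQR": std - iqr,
    "CV": std / mean,          # Pearson coefficient of variation
    "ICV": iqr / median,       # interquartile coefficient of variation (assumed definition)
}
print(pd.Series(stats).round(1))
```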

The statistics obtained for each of the cells are incorporated into the open source Geographic Information System QGIS. This is very useful when carrying out a multitude of analysis processes or simply performing a cartographic representation of the results. Finally, each map outcome is included in a new web geoviewer (https://qgiscloud.com/Juan_Antonio_Geo/expo).



4

RESULTS

The management of large volumes of precipitation information through spatial-temporal databases has made it possible to obtain products of relevant climatic interest related to the spatial estimation of precipitation in the context of water resources management. The main results are presented focusing first on statistical measures of central tendency and then on measures of variability. The comparison and quantification of non-robust and robust statistical measures, spatially represented in maps by means of GIS, allow the evaluation of the effect produced when incorporating the rarely used robust statistics. Relative dispersion measures are also calculated to diminish the effect of the very different amounts of rain recorded in the Iberian Peninsula. Concerning measures of central tendency, the picture obtained by calculating the mean or the median perfectly identifies the three large homogeneous climatic zones traditionally described for the Iberian Peninsula. The first, called the Atlantic or humid region, corresponds to the Mesothermal climates, which extend over most of the north coast from Galicia to the Pyrenees; the second, corresponding to the semi-arid or sub-desert region, occupies the southeast of the peninsula around the province of Almeria; and the third is the most extensive region, with Mediterranean climate, which occupies the greater part of the Iberian Peninsula. The spatial division basically matches the 800-mm isohyet separating the humid zone from the Mediterranean one and the 300-mm isohyet that delimits the southeast semi-arid area. Comparing the robust (median) to the non-robust (mean) measures, it can be stated that there is a general overestimation of precipitation. The map of the difference between mean and median shows the predominance of greenish tonalities, which represent mean precipitation values above the median precipitation (Figure 3c). This is a consequence of the positive skewness of the annual precipitation distributions, due to the presence of very rainy years with respect to the rest of the series. The areas where precipitation is overestimated are mainly located in the south, normally characterized by low levels of precipitation. In relation to the non-robust (standard deviation) and robust (interquartile range) dispersion statistics, the most characteristic feature of both maps is the presence of a marked NW-SE gradient (Figure 4). In the interior of the Iberian Peninsula and a great part of the south and southeast, dispersion values are less than 150 mm, linked to lower precipitation records. Therefore, where the annual totals are higher, a greater


dispersion is observed in the precipitation values. Standard deviation values are markedly higher than the interquartile range except for the humid zones of the north Atlantic coast. In the map of the differences between standard deviation and interquartile range, it can be seen that the greatest differences concentrate on the Atlantic coast. These differences indicate the heterogeneity of the precipitation values as a function of their mean and median, which shows the high irregularity in certain areas of the Iberian Peninsula. This is a surprising result since these zones are usually characterized by their pluviometric regularity.



Figure 3: Cartography of annual centrality statistical (1901-2016); a) Mean precipitation, b) Median precipitation, c) Difference between mean and median.


Figure 4: Cartography of annual absolute statistical dispersion (1901-2016); a) Annual standard deviation, b) Annual interquartile range, c) Difference between standard deviation and interquartile range.

Finally, in order to remove the effect of the magnitude of precipitation totals, relative dispersion statistics (the Pearson Coefficient of Variation and the Coefficient of Interquartile Variation) were calculated to compare different zones of the peninsula. In the first map (Figure 5) a considerable decrease from north to south, and even along most of the Mediterranean coast, can be observed. The maps show considerable geographic coherence and identify the areas with the most rainfall contrast. The country becomes distinctly divided into the northern coast, where the stronger influence of Atlantic disturbances produces more regular daily rainfalls, and the rest of the territory. The Mediterranean depressions produce highly contrasting amounts (sometimes very large), especially on the Mediterranean side of the Peninsula, due to its scarce annual precipitation. Though the Pearson Coefficient of Variation (CV) has already been used for precipitation studies in the Iberian Peninsula (Martín-Vide, 2011a), the Coefficient of Interquartile Variation has not been used previously (Figure 5b). It shows a less defined pattern than the CV and much higher values of variation. Most notable is the low variability registered in the northern part with respect to the rest of the territory, as well as the presence of higher values on the Mediterranean coast and in the southwest area of Atlantic influence, decreasing towards the interior of the peninsula.

Figure 5: Cartography of annual relative statistical dispersion (1901-2016); a) Annual coefficient of variation, b) Annual interquartile coefficient of variation.

These differences are not negligible since in many areas, such as the southwest of the Peninsula corresponding to the Guadalquivir river valley, 80% of the water resources are destined to the demand of


irrigated agriculture. So far, the presence of many reservoirs manages to satisfy and balance changes in the availability of water, but the expected changes due to climate change, with increasing temperatures, evapotranspiration and extreme events, threaten a management system that has already exceeded the natural limits of the resource. For this reason our results are particularly relevant in these areas, where major imbalances and drought problems occur. A precise estimation of water resources is needed in order to prepare climate change adaptation and mitigation measures. In addition, the simplest and most efficient way to show all the results obtained in this work is to make them accessible in a geoviewer. This was done using QGIS Cloud, a web geoviewer developed as an extension of QGIS that allows the publication on the network of maps, data and geographic services. This geoviewer has all the potential of a cloud storage and broadcast system that provides such a spatial data infrastructure. In short, it is a platform through which all information and data having a geographic component can be shared, according to the standards of the Open Geospatial Consortium (OGC), represented on the web through WMS services and downloadable as WFS. The results of this study can be viewed at https://qgiscloud.com/Juan_Antonio_Geo/expo.

5

CONCLUSIONS

The spatial rainfall pattern in the Iberian Peninsula is complex and determined by many factors, such as the relief, its layout, orientation and altitude, and the atmospheric circulation. All of them make a generalized characterization of the Iberian precipitation spatial distribution even more difficult. Using the data from the Global Climate Monitor it is possible to carry out this type of study. This climatic data geo-visualization web tool can greatly contribute to providing an end-user tool for the discovery of climatic spatial patterns. Compared to other geoviewers it has objective advantages, such as easy access to and visualization of past and present (near real-time) climatic data, fast visualization response time, and the selection of variables and climatic indicators in a single client environment. The fast and easy way of downloading and exporting data in several accessible formats facilitates the development of climatic studies such as the one presented here.


Using the precipitation data series provided by the GCM, robust and non-robust statistics were calculated for the period 1901-2016 at annual and seasonal scales. Robust statistical measures provided different and complementary knowledge of the precipitation spatial distribution and patterns in the Iberian Peninsula, revealing a general significant overestimation of precipitation. The consequences of this for water resources estimation and allocation are noteworthy for environmental management and water planning and should be taken into account. The two coefficients of variation used emphasize a higher irregularity than has traditionally been considered and reveal different zones of maximum variability. The most novel contribution of this work is to incorporate robust statistical measures and to compare them with the most commonly used ones. The results show important variations in the estimation of the amount of available water resources coming from precipitation. The estimation of the differences between statistics at monthly scales and by river basins remains to be carried out. All these results can be useful for the knowledge of the spatial and temporal distribution of precipitation and, therefore, in the initial computations of the available water resources of river basins for water management (commonly estimated from a few meteorological stations and for periods that are not regularly updated). Nevertheless, the effect of mountain ranges on the GCM data needs to be evaluated when considering river basins. In addition, web-based GIS and geovisualization enable the results obtained to be displayed in maps and seen in a new web geoviewer (https://qgiscloud.com/Juan_Antonio_Geo/expo). This is largely novel since the results obtained in this type of study are usually not shared by means of any tool that would give them greater diffusion through the web. Improving the interface and making it more user-friendly based on the experience of users is a constant goal of the Global Climate Monitor project. In future work we would also like to evaluate the statistics used in this study at a global level and for large areas, relating the results with climatic typologies. We would also like to compare the evolution and changes of precipitation amounts between different standard periods (climatic normals) and other climatic indicators. Other climatic variables will also be estimated and incorporated into the Global Climate Monitor geoviewer. This is an added value of this work, concerned not only with the generation of quality climate information and knowledge, but also with making it available to a large audience.


REFERENCES Cabello Villarejo, V., Willaarts, B.A, Aguilar-Alba, M. & Del Moral Ituarte, L. (2015). River basins as socioecological systems: linking levels of societal and ecosystem metabolism in a Mediterranean watershed. Ecology and Society, 20(3):20. Carslaw, D. C., & Ropkins, K. (2012). Openair — An R package for air quality data analysis. Environmental Modelling & Software, 27–28, pp. 52-61. Casado, M.J., Pastor, M.A. & Doblas-Reyes, F.J. (2010). Links between circulation types and precipitation over Spain. Physics and Chemistry of the Earth, Parts A/B/C, 35(9), pp.437–447. Climate Research Unit (n.d.). British Atmospheric Data Center. Retrieved from: http://www.cru.uea.ac.uk/data Cortesi, N., González-Hidalgo, J. C., Trigo, R. M., & Ramos, A. M. (2014). Weather types and spatial variability of precipitation in the Iberian Peninsula. International Journal of Climatology, 34(8), 2661– 2677. De Castro M, Martin-Vide J, Alonso S. (2005). El clima de España: pasado, presente y escenarios de clima para el siglo XXI. Impactos del cambio climático en España. Ministerio de Medio Ambiente: Madrid. Edwards, P.N., Mayernik, M.S., Batcheller, A.L., Bowker, G.C., Borgman, C.L., 2011. Science friction: Data, metadata, and collaboration. Soc. Stud. Sci. 41, 667– 690. doi:10.1177/0306312711413314 Folland, C.K., Karl, T.R., Christy, J.R., Clarke, R.A., Gruza, G.V., Jouzel, J.. Mann, M., Oerlemans, J., Salinger, M.J. & Wang, S.W. (2001). Observed Climate Variability and Change. In: Climate Change 2001: The Scientific Basis. Contribution of Working Group I to the Third Assessment Report of the Intergovernmental Panel on Climate Change. (pp. 99 181). Cambridge: Cambridge University Press. García-Barrón, L., Aguilar-Alba, M. & Sousa, A. (2011). Evolution of annual rainfall irregularity in the southwest of the Iberian Peninsula. Theoretical and Applied Climatology, 103(1–2), pp.13–26. García-Barrón, L., Morales, J. & Sousa, A. (2013). Characterisation of the intra-annual rainfall and its evolution (1837–2010) in the southwest of the Iberian Peninsula. Theoretical and applied climatology, 114(3–4), pp.445–457 García-Barrón, L., Aguilar-Alba, M., Morales, J. &Sousa, A. Forthcoming (2017). Intra-annual rainfall variability in the Spanish hydrographic basins, International Journal of Climatology. Global Precipitation Climatology Centre (n. d.) Product Access: Download. Retrieved from: http://www. dwd.de/ Hidalgo-Muñoz, J.M. et al. (2011). Trends of extreme precipitation and associated synoptic patterns over the southern Iberian Peninsula. Journal of Hydrology, 409(1), pp.497–511. Intergovernmental Panel on Climate Change (2014). Climate Change 2013 – The Physical Science Basis Working Group I Contribution to the Fifth Assessment

Report of the Intergovernmental Panel on Climate Change. Retrieved from: http://www.ipcc.ch/ report/ar5/wg1/ Jones, P.D. & Moberg, A. (2003). Hemispheric and largescale surface air temperature variations: An extensive revision and an update to 2001. Journal of Climate 16, pp. 206-223. Jones, W. R., Spence, M. J., Bowman, A. W., Evers, L., & Molinari, D. A. (2014). A software tool for the spatiotemporal analysis and reporting of groundwater monitoring data. Environmental Modelling & Software, 55, pp. 242-249. Krysanova, V. et al. (2010). Cross-comparison of climate change adaptation strategies across large river basins in Europe, Africa and Asia. Water Resources Management, 24(14), pp.4121–4160. López-Bustins, J. A., Sánchez Lorenzo, A., Azorín Molina, C., & Ordóñez López, A. (2008). Tendencias de la precipitación invernal en la fachada oriental de la Península Ibérica. Cambio Climático Regional Y Sus Impactos, Asociación Española de Climatología, Serie A, (6), 161–171. Martín Vide, J. (2011a): ‘Estructura temporal fina y patrones espaciales de la precipitación en la España peninsular’. Memorias de la Real Academia de Ciencias y Artes de Barcelona, 1030, LXV, 3, 119162. Martin-Vide J. (2011b). Patrones espaciales de precipitación en España: Problemas conceptuales. In Clima, ciudad y ecosistema, Fernández-García, F., Galán, E, Cañada, R. (eds). Asociación Española de Climatología Serie B, nº 5; 11-32. Mitchell, T. D. & Jones, P. D. (2005). An improved method of constructing a database of monthly climate observations and associated high-resolution grids. International Journal of Climatology, 25, pp. 693– 712. Muñoz-Díaz, D. & Rodrigo, F.S. (2006). Seasonal rainfall variations in Spain (1912–2000) and their links to atmospheric circulation. Atmospheric Research, 81(1), pp.94–110. New, M., Hulme, M. & Jones, PD. (2000). Representing twentieth century space–time climate variability. Part 2: development of 1901–96 monthly grids of terrestrial surface climate. Journal of Climate, 13. pp. 2217– 2238. Ríos-Cornejo, D. et al. (2015). Links between teleconnection patterns and precipitation in Spain. Atmospheric Research, 156, pp.14–28. Vitolo, C., Elkhatib, Y., Reusser, D., Macleod, C. J. A., & Buytaert, W. (2015). Web technologies for environmental Big Data. Environmental Modelling & Software, 63, pp. 185-198.


Land-use Classification for High-resolution Remote Sensing Image using Collaborative Representation with a Locally Adaptive Dictionary Mingxue Zheng1,2 and Huayi Wu1 1Key

Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing, Wuhan University, Wuhan, China 2Faculty of Architecture and the Built Environment, Delft University of Technology, Delft, The Netherlands [email protected], [email protected]

Keywords:

Classification, Locally Adaptive Dictionary, Collaborative Representation, High-Resolution Remote Sensing Image.

Abstract:

Sparse representation is widely applied in the field of remote sensing image classification, but sparsity-based methods are time-consuming. Unlike sparse representation, collaborative representation could improve the efficiency, accuracy, and precision of image classification algorithms. Thus, we propose a high-resolution remote sensing image classification method using collaborative representation with a locally adaptive dictionary. The proposed method includes two steps. First, we use a similarity measurement technique to separately pick out the most similar images for each test image from the total training image samples. In this step, a one-step sub-dictionary is constructed for every test image. Second, we extract the most frequent elements from all one-step sub-dictionaries of a given class. In the step, a unique two-step sub-dictionary, that is, a locally adaptive dictionary is acquired for every class. The test image samples are individually represented over the locally adaptive dictionaries of all classes. Extensive experiments (OA (%) =83.33, Kappa (%) =81.35) show that our proposed method yields competitive classification results with greater efficiency than other compared methods.

1

INTRODUCTION

Recently, high-resolution remote sensing images (HRIs) have frequently been used in many practical applications, such as cascaded classification (Guo et al., 2013), urban area management (Huang et al., 2014), and residential area extraction (Zhang et al., 2015). In particular, HRIs play an increasingly important role in land-use classification (Chen and Tian, 2015; Hu et al., 2015; Zhao et al., 2014). Natural images are generally sparse, and therefore can be sparsely represented and classified (Olshausen and Field, 1997). Sparse Representation based Classification (SRC) (Wright et al., 2009) relies on a sparse linear combination of representation bases, i.e. a dictionary of atoms, and has been successfully applied in the field of image classification (Yang et al., 2009). However, sparsity-based methods are time-consuming. In contrast to sparsity-based classification algorithms, Collaborative Representation based Classification (CRC) (Zhang et al., 2011) yields a very competitive level of

accuracy with a significantly lower complexity. In (Zhang et al., 2012), Zhang et al. pointed out that it is Collaborative Representation (CR) that represents a test image collaboratively with training image samples from all classes, as image samples from different classes often share certain similarities. In (Li and Du, 2014; Li et al., 2014), Li et al. proposed two methods, Nearest Regularized Subspace (NRS) and Joint Within-Class Collaborative Representation (JCR), for hyperspectral remote sensing image classification. These methods could probably also be extended to classify HRIs. The essence of the NRS classifier is an $l_2$ penalty framed as a distance-weighted Tikhonov regularization. This distance-weighted measurement enforces a weight vector structure. Unlike the sparse representation based approach, the weights can be simply estimated through a closed-form solution, resulting in much lower computational cost, but the method ignores the spatial information at neighboring locations. To overcome this disadvantage of NRS, JCR was



proposed. Both methods enhanced classification precision, but also created a serious problem, as irrelevant estimated coefficients generated during processing were scattered over all classes instead of being concentrated in a particular one, thereby adding uncertainty to the final classification results. Additionally, these methods only considered a first “joint” selection of the original training samples, and forewent a second, deeper selection from them, which could be the basis of a more complete and non-redundant dictionary for HRI classification. In this paper, we focus on the CR working mechanism, and propose a high-resolution remote sensing image classification method using CR with a locally adaptive dictionary (LAD-CRC). The LAD-CRC method consists of two stages. First, we use a similarity measure to separately pick out the most similar images for each test image from the total training sample images, constructing a one-step sub-dictionary for each test image. Second, since each test image shares certain similarities with some of the training images, the one-step sub-dictionaries of these test images are highly correlated. Based on this correlation, we extract the most frequent elements from all one-step sub-dictionaries of the test images in a given class, and construct a two-step sub-dictionary for that class. The set of the most frequent elements, that is, the two-step sub-dictionary, constitutes the locally adaptive dictionary of the given class. A test image therefore shares a unique two-step sub-dictionary with the other test images in the same class; we also refer to the two-step sub-dictionary of each class as a locally adaptive dictionary. Test images are individually represented by the locally adaptive dictionaries of all classes. Extensive experiments show that our proposed method not only increases classification precision, but also decreases computing time. The remaining parts of this paper are organized as follows. Section 2 discusses basic CR theory. Section 3 details the proposed algorithm. Section 4 describes the experimental results and analysis of the proposed algorithm. Conclusions are drawn in Section 5.
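To make the first stage of LAD-CRC concrete, here is a rough NumPy sketch of the one-step selection: for a single test vector, the S most similar training vectors of each class are retained. Euclidean distance is used here only as a placeholder similarity; the measure actually used by the method is defined in Section 3.2.

```python
import numpy as np

def one_step_subdictionary(y, X, labels, S=5):
    """Indices of the S training columns of each class most similar to y.

    y: (m,) test feature vector; X: (m, N) training feature vectors as columns;
    labels: (N,) class label of each training column. Placeholder similarity:
    Euclidean distance between feature vectors.
    """
    dists = np.linalg.norm(X - y[:, None], axis=0)     # distance of y to every training sample
    selected = {}
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        selected[c] = idx[np.argsort(dists[idx])[:S]]  # S most similar samples of class c
    return selected
```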

2 BASIC THEORY

In this section, we introduce the general CR model, with its corresponding regularizations, for reconstructing a test image.

2.1 Collaborative Representation (CR)

Suppose that we have C classes of training samples, and all training image samples are denoted by X. Denote by X_{i,j} ∈ R^{m×n_{i,j}} the j-th training image sample of the i-th class, and by X_i ∈ R^{m×n_i} the training image samples of the i-th class; then let X = [X_1, X_2, ..., X_C] ∈ R^{m×N}, with N = Σ_{i=1}^{C} n_i. Given a test sample y ∈ R^{m×n} from class i, we represent it as

y = Xα + ε = X_1 α_1 + ... + X_i α_i + ... + X_C α_C + ε = X_i α_i + Σ_{j=1, j≠i}^{C} X_j α_j + ε     (1)

where α = [α_1; α_2; ...; α_C], α_i is the coefficient vector associated with class i, and ε is a small threshold. A general CR model can be written as

α̂ = argmin_α ‖y − Xα‖_p   s.t.   ‖α‖_q < ε     (2)

where p and q equal one or two; different settings of p and q lead to different instantiations.

2.2 Reconstruction and Classification of HRIs via CR

The working mechanism of CR is that some high-resolution remote sensing images from other classes can help to represent the test image, because training images belonging to different classes share certain similarities. The USA land-use dataset in our experiments is a small-sample-size problem, and X_i is in general under-complete. If we use only X_i to represent the test image y, the representation error will be very large, even when y belongs to class i. One obvious solution is to use many more training samples to represent the test image y. For HRIs, we experimentally set p to two and q to one, and the Lagrangian form of this case is

α̂ = argmin_α ‖y − Xα‖_2 + λ ‖α‖_1     (3)

where the parameter λ is a tradeoff between the data fidelity term and the coefficient prior. We compute the residuals e_i(y) = ‖y − X_i α̂_i‖_2 / ‖α̂_i‖_2 and then identify the class of the test image y via class(y) = argmin_i {e_i}.
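The reconstruction and residual rule in Equations (2)-(3) can be prototyped compactly. The following is a minimal, illustrative sketch (not the authors' implementation); it assumes a training matrix X whose columns are CoALBP feature vectors, a hypothetical list class_ranges of per-class column ranges, and a test vector y, and it uses scikit-learn's Lasso as an off-the-shelf l1-regularized least-squares solver.

import numpy as np
from sklearn.linear_model import Lasso

def cr_classify(X, class_ranges, y, lam=0.001):
    """Collaboratively represent y with all training columns of X and
    label it by the class with the smallest normalized residual.
    X: (m, N) matrix of training feature vectors (one column per sample).
    class_ranges: hypothetical list of (start, end) column index pairs per class.
    y: (m,) test feature vector. lam: weight of the l1 penalty."""
    # Solve alpha_hat = argmin ||y - X alpha||_2^2 + lam * ||alpha||_1 (cf. Eq. 3).
    solver = Lasso(alpha=lam, fit_intercept=False, max_iter=10000)
    solver.fit(X, y)
    alpha = solver.coef_
    residuals = []
    for (start, end) in class_ranges:
        a_i = alpha[start:end]
        norm = np.linalg.norm(a_i) + 1e-12            # avoid division by zero
        r_i = np.linalg.norm(y - X[:, start:end] @ a_i) / norm
        residuals.append(r_i)
    return int(np.argmin(residuals))                  # index of the predicted class

Note that scikit-learn's Lasso scales the data term by the number of rows, so lam here is only proportional to the λ of Equation (3).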

3 THE PROPOSED METHOD

In this section, we detail how to extract sub-dictionaries at each step and finally obtain a locally adaptive dictionary. We then present the complete algorithm of the proposed method for HRI classification.



3.1 Feature Extraction

The set of features adopted in land-use classification (Mekhalfi et al., 2015) consists of three types: Histogram of Oriented Gradients (HOG) (Dalal and Triggs, 2005), Co-occurrence of Adjacent Local Binary Patterns (CoALBP) (Nosaka et al., 2011), and Gradient Local Auto-Correlations (GLAC) (Kobayashi and Otsu, 2008). The results showed that CoALBP produced the most accurate land-use classification results. In our work, CoALBP features are used to construct the sub-dictionaries from the land-use dataset. In the CoALBP representation, a high-resolution remote sensing image is represented by a column vector.

3.2 One-step Sub-dictionary

Suppose we have C classes of test samples; all test samples are denoted by Y, and the test samples of the i-th class are denoted by Y_i. Denote by Y_{t,q} ∈ R^{m×n_{t,q}} the q-th test sample of the t-th class. As in Section 2.1, denote by X_i ∈ R^{m×n_i} the training samples of the i-th class, and let X = [X_1, X_2, ..., X_C] ∈ R^{m×N}. Because of the similarity among image samples, we only need to choose the most similar training samples for every test image, instead of the complete set of training image samples. Here, we use a similarity measure to select the S most similar training images in every X_i and construct a one-step sub-dictionary of Y_{t,q}, denoted by

X_{t,S_q} = [X_{t_q,1,S_{t_q,1}}, ..., X_{t_q,i,S_{t_q,i}}, ..., X_{t_q,C,S_{t_q,C}}]     (4)

where X_{t_q,i,S_{t_q,i}} is the sample set containing the S training samples of the i-th class most similar to the test image Y_{t,q}, i ∈ {1, 2, ..., C}. The index sets S_{t_q,1}, S_{t_q,2}, ..., S_{t_q,C} are respectively subsets of (1, 2, ..., X_1), ..., (1, 2, ..., X_i), ..., (1, 2, ..., X_C), with Σ_{i=1}^{C} |S_{t_q,i}| = C·S, where |S_{t_q,i}| is the number of elements in the subset S_{t_q,i}. The similarity measure is the Euclidean distance

d = √( Σ_{i=1}^{n} (x_i − y_i)² )     (5)

where x = (x_1, ..., x_i, ..., x_n) and y = (y_1, ..., y_i, ..., y_n) are n-dimensional vectors. The smaller the value of d, the more similar x and y.
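A minimal sketch of the one-step selection in Equations (4)-(5), assuming a hypothetical list X_by_class of per-class training matrices (one CoALBP column vector per sample) and a test vector y; only the S columns of each class closest to y in Euclidean distance are kept.

import numpy as np

def one_step_subdictionary(X_by_class, y, S):
    """Select, for every class, the S training columns closest to the
    test vector y and stack them into a one-step sub-dictionary.
    X_by_class: list of (m, n_i) arrays, one per class (assumed input).
    Returns the (m, C*S) sub-dictionary and the chosen column indices."""
    blocks, indices = [], []
    for X_i in X_by_class:
        # Euclidean distance between y and every column of X_i (Eq. 5).
        d = np.linalg.norm(X_i - y[:, None], axis=0)
        keep = np.argsort(d)[:S]            # indices of the S most similar samples
        blocks.append(X_i[:, keep])
        indices.append(keep)
    return np.hstack(blocks), indices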

3.3 Two-step Sub-dictionary

From Section 3.2, the one-step sub-dictionary of all test samples of the t-th class is denoted by

X_{t,S} = [X_{t_1,1,S_{t_1,1}}, ..., X_{t_1,i,S_{t_1,i}}, ..., X_{t_1,C,S_{t_1,C}},
           ...,
           X_{t_q,1,S_{t_q,1}}, ..., X_{t_q,i,S_{t_q,i}}, ..., X_{t_q,C,S_{t_q,C}},
           ...,
           X_{t_{Y_t},1,S_{t_{Y_t},1}}, ..., X_{t_{Y_t},i,S_{t_{Y_t},i}}, ..., X_{t_{Y_t},C,S_{t_{Y_t},C}}]
        = [X_{t,1,S_{t,1}}, ..., X_{t,i,S_{t,i}}, ..., X_{t,C,S_{t,C}}]     (6)

where

X_{t,i,S_{t,i}} = [X_{t_1,i,S_{t_1,i}}, ..., X_{t_q,i,S_{t_q,i}}, ..., X_{t_{Y_t},i,S_{t_{Y_t},i}}]     (7)

collects all selected training samples of the i-th class. The two-step sub-dictionary, that is, the S samples that occur most frequently in X_{t,S}, is denoted by

X̂_{t,S} = [X̂_{t,1,Ŝ_{t,1}}, ..., X̂_{t,i,Ŝ_{t,i}}, ..., X̂_{t,C,Ŝ_{t,C}}]     (8)

All newly selected training samples of the i-th class are denoted by X̂_{t,i,Ŝ_{t,i}}; their number is |Ŝ_{t,i}|, and Σ_{i=1}^{C} |Ŝ_{t,i}| = S. The locally adaptive dictionary of the t-th class is X̂_{t,S}.
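The second selection step keeps the S training samples that occur most often across the one-step sub-dictionaries of all test images of a class. A minimal sketch under stated assumptions: X is the full training matrix and one_step_indices is a hypothetical list holding, for every test image of the class, the global column indices selected in the first step.

import numpy as np
from collections import Counter

def two_step_subdictionary(X, one_step_indices, S):
    """Build the locally adaptive dictionary of one class (cf. Eq. 8) by
    keeping the S training columns that occur most frequently across all
    one-step sub-dictionaries of that class.
    X: (m, N) matrix of all training columns, indexed globally.
    one_step_indices: list of 1-D arrays of global column indices (assumed)."""
    counts = Counter()
    for idx in one_step_indices:
        counts.update(int(i) for i in idx)
    # The S most frequently selected training samples form X_hat_{t,S}.
    most_common = [i for i, _ in counts.most_common(S)]
    return X[:, most_common], most_common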

3.4 The Flow of the Proposed Method for HRI Classification

To summarize the proposed method, we list the following steps.

1) Given a test image Y_{t,q} of the t-th class, the similarity measure is used to construct a one-step sub-dictionary of Y_{t,q} from the training images of all classes, denoted by

X_{t,S_q} = [X_{t_q,1,S_{t_q,1}}, ..., X_{t_q,i,S_{t_q,i}}, ..., X_{t_q,C,S_{t_q,C}}]     (9)

After applying the same process to the other test images of the t-th class, the one-step sub-dictionary of the t-th class is X_{t,S}.

2) A two-step sub-dictionary of the t-th class, that is, the S columns that occur most frequently in X_{t,S}, is constructed and denoted by

X̂_{t,S} = [X̂_{t,1,Ŝ_{t,1}}, ..., X̂_{t,i,Ŝ_{t,i}}, ..., X̂_{t,C,Ŝ_{t,C}}]     (10)

X̂_{t,S} is also called the locally adaptive dictionary of the t-th class.

3) From the foregoing, the proposed method solves

Ψ̂_{t,S} = argmin_{Ψ_{t,S}} { ‖Y_{t,q} − X̂_{t,S} Ψ_{t,S}‖_2² + λ ‖Ψ_{t,S}‖_1 }     (11)

where Ψ̂_{t,S} is the local coefficient matrix corresponding to the locally adaptive dictionary X̂_{t,S}, and Ψ_{t,S} = (Ψ_{t,1}; ...; Ψ_{t,i}; ...; Ψ_{t,C}).


4) After traversing all the classes, we obtain a global coefficient matrix. The label of the test HRI Y_{t,q} is determined by the following classification rule

class(i) = argmin_{i=1,...,C} { ‖Y_{t,q} − X_i Ψ̂_i^g‖_p / ‖Ψ̂_i^g‖_p }     (12)

where X_i is the subpart of X associated with class i and Ψ̂_i^g denotes the portion of the recovered collaborative coefficients Ψ̂^g for the i-th class.

5) Finally, we obtain a 2-D matrix that records the labels of the HRIs.

Additionally, the specific scheme for the global coefficient matrix construction is as follows.

Global coefficient matrix Ψ̂^g construction
Input: (1) the local coefficient matrix Ψ̂_{t,S} ∈ R^{S×n_{t,q}}; (2) an indicator set I with N elements, where I_i = 0 or 1 for i = 1, ..., N, in which "1" means that the corresponding dictionary atom is active and "0" means inactive.
Initialization: set the initial global coefficient matrix Ψ̂^g ∈ R^{N×n_{t,q}} to a zero matrix, and an index v = 1.
For i = 1 to N
    if I_i = 1
        Ψ̂^g(i, :) = Ψ̂_{t,S}(v, :);
        v++;
    End if
End For
Output: the global coefficient matrix Ψ̂^g.
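The construction above amounts to scattering the local coefficients back to their positions in the full dictionary. A direct, illustrative translation (with hypothetical variable names, not the authors' code) could look as follows.

import numpy as np

def build_global_coefficients(psi_local, active, N):
    """Scatter the local coefficient matrix into a global one.
    psi_local: (S, n_tq) coefficients for the locally adaptive dictionary.
    active: length-N indicator array, 1 where the atom of the full
            dictionary X was selected into the local dictionary.
    Returns the (N, n_tq) global coefficient matrix (zeros elsewhere)."""
    psi_global = np.zeros((N, psi_local.shape[1]))
    v = 0
    for i in range(N):
        if active[i] == 1:
            psi_global[i, :] = psi_local[v, :]
            v += 1
    return psi_global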

4 RESULT AND ANALYSIS

The USA land-use dataset (Yang and Newsam, 2010) is widely used for evaluating land-use classification algorithms. It includes 21 classes, and each class has 100 images. Per class, 80 images are selected as training samples and the other 20 images are test samples; the total number of training samples is therefore 1680. Image samples of each land-use class are shown in Figure 1.

Figure 1: Example images of the USA land-use dataset (1 agriculture; 2 airplane; 3 baseball diamond; 4 beach; 5 buildings; 6 chaparral; 7 dense residential; 8 forest; 9 freeway; 10 golf course; 11 harbor; 12 intersection; 13 medium residential; 14 mobile home park; 15 overpass; 16 parking lot; 17 river; 18 runway; 19 sparse residential; 20 storage tanks; 21 tennis court).

4.1 Parameter Setting

The selection of the sample number S in the two steps is critical in LAD-CRC. Experimentally, we set the S value to 210.

Figure 2: The S value of the locally adaptive dictionary per class.

Figure 2 shows the relationship between the number S of the locally adaptive dictionary per class and the classification accuracy. The range of S in LAD-CRC is [50, 230], with a step length of 10. There are two convex points, at S equal to 140 and 210. The accuracy values at these two points are almost the same, but the accuracy trend is more stable around 210. In addition, 140 is not a suitable value because we compress the 1x1680 estimated coefficient vector into a 1x210 coefficient vector to show the rough distribution of estimated coefficients for all methods; it is clearer and more concise to show the distribution of coefficients with the 1x210 vector. Experimentally, the regularization parameter λ is 0.1 in NRS and JCR, and 0.001 in SRC and CRC. Other parameters are the same in all five methods.

4.2 Result Comparison with Other Methods

Using the USA land-use dataset, we conduct experiments to compare against the SRC (Wright et al., 2009), CRC (Wright et al., 2009), NRS (Li et al., 2014), and JCR (Li and Du, 2014) algorithms. Classification accuracy is averaged over five cross-validation evaluations. To facilitate a fair comparison between our proposed algorithm and the other approaches, a fivefold cross-validation is performed in which the dataset is randomly partitioned into five equal subsets. After partitioning, each subset contains 20 images per land-use class; four of these subsets are used for training, while the remaining subset is used for testing. The overall accuracy (OA) averaged over all classes and the Kappa coefficients are shown in Table 1.

Table 1: Classification results for the USA land-use dataset with the proposed LAD-CRC.

            SRC      CRC      NRS      JCR      LAD-CRC
OA (%)      66.95    55.81    71.71    71.10    83.33
Kappa (%)   66.50    52.25    69.75    70.30    81.35

In Table 1, the compared results show that the locally adaptive dictionary in the proposed method can effectively replace the whole dictionary (i.e., the complete set of training image samples) and improves classification accuracy (OA = 83.33; Kappa = 81.35). The idea of extracting sub-dictionaries in two steps refines the information of the complete training set into a locally adaptive dictionary.
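Overall accuracy and the Kappa coefficient over a fivefold cross-validation can be computed with standard tools; the sketch below is only an illustration, assuming a hypothetical callable predict_labels that trains any of the compared classifiers on a training split and returns predicted labels for the test split.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, cohen_kappa_score

def evaluate_fivefold(features, labels, predict_labels):
    """Average overall accuracy (OA) and Cohen's kappa over 5 folds.
    predict_labels(train_X, train_y, test_X) -> predicted test labels."""
    oas, kappas = [], []
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, test_idx in skf.split(features, labels):
        pred = predict_labels(features[train_idx], labels[train_idx],
                              features[test_idx])
        oas.append(accuracy_score(labels[test_idx], pred))
        kappas.append(cohen_kappa_score(labels[test_idx], pred))
    return 100 * np.mean(oas), 100 * np.mean(kappas)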

Figure 3: Confusion matrix for the land-use data set using the proposed method.


The average classification performances of the individual classes using our proposed method with the optimal parameters are shown in the confusion matrix (Figure 3). The average accuracies occur along the diagonal, shown in red to yellow cells in the figure, and are mostly concentrated around 82.62±0.71%. Without loss of generality, in this paper we randomly choose the fifteenth test image sample of class 6 in the fifth cross-validation fold to demonstrate the classification performance of the proposed method. Figures 4(a)-(j) show the estimated construction coefficients and normalized residuals for all five methods. Figures 4(a), 4(c), 4(e), 4(g), and 4(i) show the estimated construction coefficients; the variable on the x axis is the distribution of training samples over all 21 classes (i.e., the label distribution), and the range of training samples of class 6 is [20, 101] in Figure 4(a) and [51, 60] in Figures 4(c), 4(e), 4(g), and 4(i). The value on the y axis is the corresponding estimated construction coefficient for the different classes. Figures 4(b), 4(d), 4(f), 4(h), and 4(j) show the normalized residuals of the different classes. It can be observed that all approaches identify the test sample image properly by the rule of the smallest error, but the coefficient values of the different algorithms differ greatly. From Figures 4(a) and 4(b), the estimated construction coefficients mostly lie on class 6 (from 20 to 101 on the x axis), class 8 (from 102 to 178), class 17 (from 182 to 190) and class 19 (from 191 to 209), but the normalized residual is smallest for class 6, which means the proposed method mainly utilizes training sample images of class 6 to construct the test sample image. From Figures 4(c) and 4(d) for SRC, the normalized residuals of classes 1, 4, 6, 9 and 11 are all small, and the estimated construction coefficients focus almost entirely on class 6 (from 51 to 60 on the x axis), meaning that the test sample image is reconstructed by training sample images of class 6. Similarly, from Figures 4(e) and 4(f) for CRC, the estimated construction coefficients mainly lie on class 6 (from 51 to 60 on the x axis), and the normalized residual of class 6 is clearly the smallest. For NRS and JCR, from Figures 4(g) and 4(i), the distributions of estimated construction coefficients are irregular, but from Figures 4(h) and 4(j), the normalized residual of class 6 is still the smallest. Comparing Figure 4(a) with 4(c) and 4(e), there are many disturbances (estimated construction coefficients in classes 8, 17 and 19). There are two reasons for these noises: (1) due to the selection of sub-dictionaries in two steps, the 210 selected training sample images are very similar to the test sample image of class 6; (2) even though the 210 selected


training image samples mostly belong to class 6, training sample images probably share certain similarities among some classes; therefore, some training samples of other classes appear among the 210 selected training samples. We call these classes "similar classes", such as classes 8, 17, and 19. These two situations result in a part of the estimated construction coefficients of the test image being scattered over the "similar classes". The distribution of normalized residuals in Figure 4(b) matches the effect caused by the "similar classes" well. The coefficient disturbances of LAD-CRC are located only on the "similar classes".

Figure 4: Estimated construction coefficients and normalized residuals for all methods.




In addition, the estimated construction coefficients of CRC are located on all classes. Estimated construction coefficients in other classes have a serious impact on the computed residuals, which is why CRC achieves the worst classification result. Comparing Figure 4(a) with Figures 4(g) and 4(i), the irregular reconstruction coefficient distributions in Figures 4(g) and 4(i) confirm the validity of the proposed method, which refines the information of the complete training set into a locally adaptive dictionary. To conclude, considering that all methods identify the test image sample properly, the proposed method selects the most valuable training image samples. With the construction of a locally adaptive dictionary, we obtain the best classification accuracy. However, the results of the four compared algorithms are approximately 10% lower than those they achieved on other datasets, for which we can give a probable reason. Generally, SIFT is the most


common feature descriptor for HRI classification. In this paper, we choose CoALBP features to collect HRI information. LBP is a descriptor for rotation-invariant texture classification, and CoALBP is an extension of LBP that extracts finer local details. The reason we choose CoALBP instead of SIFT is that feature extraction with the latter takes much more computation time than with the former. Fortunately, the phenomenon that the results are lower than those the methods achieved on other datasets holds for all four compared algorithms without exception, so the comparison results in Table 1 can still testify to the performance of the proposed method, even under the impact of the CoALBP features.

Table 2: Speed for the USA land-use dataset.

           SRC         CRC       NRS        JCR        LAD-CRC
Time (s)   5018.957    7.4216    26.8122    35.3689    2215.7087

Table 2 shows the computation time each method consumes. The computation time, including training and test processes, that the proposed method takes is less than that of SRC, but more than those of CRC, NRS and JCR. In Table 2, the more accurate a method is, the more computation time it generally requires; this demonstrates that accuracy comes at the cost of increased computational effort. It is time consuming to separately find the most similar training images for each test image and the most frequent training images for every class in the two sub-dictionary steps; this process occupies most of the running time of the proposed method.

5 CONCLUSION

In this paper, experimental results clearly show that the proposed method obtains the best classification performance. This means the idea of training dictionaries in two steps is promising and encourages us to explore this direction further. From Figure 4(a), there are still many disturbances (for example, estimated construction coefficients in classes 8, 17 and 19). Effective methods for extracting discriminative information of different classes should be explored to decrease and even eliminate these disturbances. Besides, the time consumed on the sub-dictionaries is also a problem; finding a way to reduce the computing time is necessary. Parallel computing can be considered an ideal direction for future work.

REFERENCES

Chen, S., Tian, Y., 2015. Pyramid of spatial relatons for scene-level land use classification. IEEE Transactions on Geoscience and Remote Sensing 53, 1947-1957.
Dalal, N., Triggs, B., 2005. Histograms of oriented gradients for human detection, Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on. IEEE, pp. 886-893.
Guo, J., Zhou, H., Zhu, C., 2013. Cascaded classification of high resolution remote sensing images using multiple contexts. Information Sciences 221, 84-97.
Hu, F., Xia, G.-S., Hu, J., Zhang, L., 2015. Transferring deep convolutional neural networks for the scene classification of high-resolution remote sensing imagery. Remote Sensing 7, 14680-14707.
Huang, X., Lu, Q., Zhang, L., 2014. A multi-index learning approach for classification of high-resolution remotely sensed images over urban areas. ISPRS Journal of Photogrammetry and Remote Sensing 90, 36-48.
Kobayashi, T., Otsu, N., 2008. Image feature extraction using gradient local auto-correlations, European Conference on Computer Vision. Springer, pp. 346-358.
Li, W., Du, Q., 2014. Joint within-class collaborative representation for hyperspectral image classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 7, 2200-2208.
Li, W., Tramel, E.W., Prasad, S., Fowler, J.E., 2014. Nearest regularized subspace for hyperspectral classification. IEEE Transactions on Geoscience and Remote Sensing 52, 477-489.
Mekhalfi, M.L., Melgani, F., Bazi, Y., Alajlan, N., 2015. Land-use classification with compressive sensing multifeature fusion. IEEE Geoscience and Remote Sensing Letters 12, 2155-2159.
Nosaka, R., Ohkawa, Y., Fukui, K., 2011. Feature extraction based on co-occurrence of adjacent local binary patterns, Pacific-Rim Symposium on Image and Video Technology. Springer, pp. 82-91.
Olshausen, B.A., Field, D.J., 1997. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research 37, 3311-3325.
Wright, J., Yang, A.Y., Ganesh, A., Sastry, S.S., Ma, Y., 2009. Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 31, 210-227.
Yang, J., Yu, K., Gong, Y., Huang, T., 2009. Linear spatial pyramid matching using sparse coding for image classification, Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, pp. 1794-1801.
Yang, Y., Newsam, S., 2010. Bag-of-visual-words and spatial extensions for land-use classification, Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems. ACM, pp. 270-279.
Zhang, L., Yang, M., Feng, X., 2011. Sparse representation or collaborative representation: Which helps face recognition?, Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, pp. 471-478.
Zhang, L., Yang, M., Feng, X., Ma, Y., Zhang, D., 2012. Collaborative representation based classification for face recognition. arXiv preprint arXiv:1204.2358.
Zhang, L., Zhang, J., Wang, S., Chen, J., 2015. Residential area extraction based on saliency analysis for high spatial resolution remote sensing images. Journal of Visual Communication and Image Representation 33, 273-285.
Zhao, L.-J., Tang, P., Huo, L.-Z., 2014. Land-use scene classification using a concentric circle-structured multiscale bag-of-visual-words model. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 7, 4620-4631.


Optimal Estimation of Census Block Group Clusters to Improve the Computational Efficiency of Drive Time Calculations

Damon Gwinn1, Jordan Helmick2, Natasha Kholgade Banerjee1 and Sean Banerjee1
1 Clarkson University, Potsdam, NY, U.S.A.
2 MedExpress, Morgantown, WV, U.S.A.
{gwinndr, nbanerje, sbanerje}@clarkson.edu, [email protected]

Keywords:

Location Selection, Census Block Group, Affinity Propagation.

Abstract:

Location selection determines the feasibility of a new location by evaluating factors such as the drive time of customers, the number of potential customers, and the number and proximity of competitors to the new location. Traditional location selection approaches use census block group data to determine average customer drive times by computing the drive time from each block group to the proposed location and comparing it to all competitors within the area. However, since companies need to evaluate on the order of hundreds of thousands of potential locations and competitors, traditional location selection approaches prove to be computationally infeasible. In this paper we present an approach that generates an optimal set of clusters to speed up drive time calculations. Our approach is based on the insight that in urban areas block groups are comprised of a few adjacent city blocks, making the differences in drive times between neighboring block groups negligible. We use affinity propagation to initially cluster the census block groups. We use population and average distance between the cluster centroid and all points to recursively re-cluster the initial clusters. Our approach reduces the census data for the United States by 80% which provides a 5× speed when computing drive times. We sample 200 randomly generated locations across the United States and show that there is no statistically significant difference in the drive times when using the raw census data and our recursively clustered data. Additionally, for further validation we select 300 random Walmart stores across the United States and show that there is no statistically significant difference in the drive times.

1 INTRODUCTION

Location selection determines the feasibility of a new retail location by evaluating factors such as the drive time of customers to the new location, the number of potential customers, and the number and proximity of competitors to the new location. Locations that are distant from the customer base, out-positioned by a major competitor, or in a rural area with a low population density are less likely to succeed. Drive time computations for a new location are performed by using the census block group data in conjunction with drive time analysis tools, such as the Google Maps Distance Matrix API (Google, 2017). For a proposed location, a trade area is created around the location and drive times are computed from each block group within the trade area to the proposed location. The drive times are then averaged and compared with competing locations to determine if the proposed location is closer than the competition. However, since companies need to evaluate on the

order of thousands of potential locations and competitors, computing drive times from each census block group can be computationally infeasible. In this paper, we present an approach to reduce the computational overhead for drive time calculations by clustering neighboring block groups into a single point. Our insight is that census block groups in urban areas are in close proximity, as shown in Figure 1, making drive time calculations from each block group redundant as the differences in driving time between neighboring block groups are negligible. In this paper we present an approach to estimate an optimal set of census block group clusters. The novelty of our approach is a recursive algorithm to split large clusters into optimal-sized clusters that satisfy user-provided thresholds of population count and average distance between the cluster centroid and cluster members. We first generate an initial set of clusters using affinity propagation (Frey and Dueck, 2007) which automatically estimates the number of clusters for an input set of points. We recursively



Figure 1: Two neighboring census block groups in Washington, DC. As shown by the Google Maps distance and drive time calculation, the block groups are 0.2 miles apart, which equates to a 1 minute driving time. The difference in distance and drive time is insignificant, and the block groups can be clustered to a single point. In this paper, we leverage block group proximity to cluster neighboring block groups into a single point to reduce computational overhead.

split clusters if their human population count or average distance from each member to the cluster centroid is higher than user-provided thresholds, and if there are more than 10 block groups in the cluster. We approximate the distances between each cluster member and the cluster centroid by using the haversine formula (Van Brummelen, 2012). The remainder of this paper is organized as follows: in Section 2 we discuss the related literature in location selection. Section 3 discusses our recursive threshold based cluster splitting approach. In Section 4 we describe our dataset and show the computational improvements gained by clustering block groups. We discuss the practical and statistical differences in drive times using 200 random locations in Section 5. We discuss internal and external validity threats in Section 6. Finally, we conclude the paper in Section 7 and provide potential directions for future research.

2 RELATED WORK

Several approaches use a variety of features extracted from the data in location selection. The approach of Xu et al. (Xu et al., 2016) uses features such as distances to the city center, traffic, POI density, category popularity, competition, area popularity, and local real estate pricing to determine the feasibility of a location. The approach of Karamshuk et al. (Karamshuk et al., 2013) uses features mined from FourSquare along with supervised learning approaches using Support Vector Regression, M5 decision trees, and Linear Regression to determine the optimal location of a retail store. Social media platforms provide novel metrics to evaluate the feasibility of a location. Several approaches determine the popularity of a proposed location by using the reviews of users (Wang et al., 2016a), or by evaluating the number of user check-ins and location centric data from platforms such as Twitter and FourSquare (Karamshuk et al., 2013; Qu and Zhang, 2013; Yu et al., 2013; Yu et al., 2016; Wang et al., 2016b; Chen et al., 2015). User comments posted on review sites provide insights on the personal experience of the consumer at an existing location or similar business. User check-in data provides popularity metrics for a geographical area based on the frequency and duration of a visit. Many approaches use optimal location queries to evaluate the effectiveness of a location by placing higher priority on locations that are closer to the proposed customer base (Xiao et al., 2011; Ghaemi et al., 2010). The approach of Ghaemi et al. (Ghaemi et al., 2012) uses nearest neighbors with results from past optimal location queries to address issues caused by moving locations and customers. Banaei-Kashani et al. (Banaei-Kashani et al., 2014) propose reverse skyline queries to allow optimal location queries to handle multiple criteria such as distance to location and distance to competitors. Since a proposed location may not satisfy all criteria adequately, Kahraman et al. (Kahraman et al., 2003) use fuzzy techniques to reach a compromise between various criteria while evaluating the feasibility of a site. Fuzzy approaches have been used to determine the appropriate number of fire stations at an airport (Tzeng and Chen, 1999), and the optimal location of new convenience stores (Kuo et al., 1999) and factories (Çebi and Otay, 2015; Yong, 2006). Approaches based on the analytic hierarchy process (AHP) use human experts to weight location selection criteria and to generate a combined location rank (Tzeng et al., 2002; Yang and Lee, 1997; Aras et al., 2004). Unlike the prior approaches, our work uses census block group data, and is most closely related to research in the area of retail location selection, service accessibility and market demands using census block groups (Bailey, 2003; Nallamothu et al., 2006; Carr et al., 2009; Guagliardo, 2004; Jiao et al., 2012; Branas et al., 2005; Farber et al., 2014; Blanchard and Lyson, 2002). Several approaches use census block groups and tracts to compute population and drive time estimates for access to trauma centers, hospitals, grocery stores, and supermarkets (Branas et al., 2005; Nallamothu et al., 2006; Carr et al., 2009; Guagliardo, 2004; Jiao et al., 2012; Farber et al., 2014; Blanchard and Lyson, 2002). Unlike our work, these approaches estimate drive times by using urban, suburban, and rural speed thresholds and population densities. In the absence of accurate drive time data for location


Figure 2: Comparison of census block group distribution for Minnesota and Utah. While both states are approximately 85,000 square miles, Utah has large uninhabited areas when compared to Minnesota.

Figure 3: Comparison of clustered census block groups for Minnesota and Utah. While both states have the same land area, Utah has only 51 clusters while Minnesota has 134 clusters. Hence, traditional clustering approaches cannot be applied as we have no prior knowledge of the number of optimal clusters. In our approach we use affinity propagation, which does not assume a base number of optimal clusters.

selection, the approach of Li et al. (Li et al., 2015) computes road segment times from public transit GPS data. Our work differs from these approaches in that all these approaches use unclustered census block groups and drive time estimates for evaluating the feasibility of a proposed location. Unclustered census data introduces computational overhead when selecting across multiple candidate locations and comparing to multiple competitors. Drive time estimates do not accurately depict the time taken by customers to reach the location. Instead we propose using exact drive times obtained from Google Maps, while using clustering to reduce computational overhead.


3 THRESHOLD BASED RECURSIVE CLUSTERING

3.1 Base Clustering Algorithm

We use affinity propagation as our base clustering algorithm as it does not require the user to specify the number of clusters (Frey and Dueck, 2007). In our case, we have no prior knowledge on the optimal cluster size. Further, each state can have a different number of clusters based on the population distribution. For example, as shown in Figure 2, Utah and Minnesota have the same overall land area, but have different census block group distributions due to their


geography. As shown in Figure 3, Utah has 51 clusters while Minnesota has 134 clusters. Densely populated areas are combined into multiple clusters, while sparsely populated areas are combined into a single cluster. Minnesota has several densely populated areas, and hence requires a larger number of clusters to describe the state. We use the affinity propagation algorithm implemented in the Python scikit-learn toolkit using 2000 maximum iterations and 200 convergence iterations (Pedregosa et al., 2011). The convergence iterations control the number of iterations without any changes in the estimated clusters. High maximum and convergence iteration counts provide higher certainty that the resultant clusters will not change.
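A minimal sketch of this base clustering step with scikit-learn, using the iteration settings quoted above; coords is assumed to be an (n, 2) array of block-group centroid coordinates for one state.

import numpy as np
from sklearn.cluster import AffinityPropagation

def base_clusters(coords):
    """Cluster block-group centroids without fixing the number of clusters.
    coords: (n, 2) array of (latitude, longitude) pairs (assumed input)."""
    ap = AffinityPropagation(max_iter=2000, convergence_iter=200)
    labels = ap.fit_predict(coords)
    # One list of member indices per discovered cluster.
    return [np.where(labels == k)[0] for k in np.unique(labels)]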

3.2 Recursive Cluster Splitting

Our recursive cluster splitting method takes as input the initial clusters generated in Subsection 3.1, a user-provided upper bound d̄_bound for the mean distance between the cluster centroid and each cluster point, and a user-provided upper bound p_bound for the total population in each cluster. For the results shown in this paper, we set d̄_bound to 5 and p_bound to 20,000. For each cluster c from Subsection 3.1, we compute the distance d_i between the cluster centroid and the i-th point in the cluster, where i ∈ I_c and I_c represents the indices of all points in the c-th cluster, as

d_i = R · b_i,     (1)

where b_i is given by

b_i = 2 atan2(√a_i, √(1 − a_i)).     (2)

The value of a_i represents the haversine of the central angle between each point, represented by its latitude φ_i and longitude λ_i, and its cluster centroid, represented by φ_c and λ_c, and is computed as

a_i = sin²((φ_c − φ_i)/2) + cos φ_i · cos φ_c · sin²((λ_c − λ_i)/2).     (3)

In Equation (1), R represents the radius of the earth at the equator, i.e., 3959 miles. For cluster c, we compute the mean distance d̄ of all points in the cluster to its centroid as

d̄ = (1 / |I_c|) Σ_{i ∈ I_c} d_i.     (4)

Algorithm 1: Recursive Cluster Splitting.
Input: Sets of latitudes and longitudes for the initial cluster points {{(φ_i, λ_i) : i ∈ I_{c_init}} : c_init ∈ C_init}, the set of latitudes and longitudes of the initial cluster centroids {(φ_{c_init}, λ_{c_init}) : c_init ∈ C_init}, and user-provided bounds d̄_bound and p_bound
Output: Set of final clusters, O
1  for c_init ∈ C_init do
2      P_{c_init} ← {(φ_i, λ_i) : i ∈ I_{c_init}}
3      O = split(P_{c_init}, O)
4  return O
   end
Procedure split(P_c, O)
1  Compute d̄_c using Equation (4)
2  if (d̄_c > d̄_bound ∨ p_c > p_bound) ∧ |I_c| > 10 then
3      Split the cluster represented by the points in P_c into smaller clusters {P_c̄ : c̄ ∈ C_c} using affinity propagation
4      for c̄ ∈ C_c do
5          return split(P_c̄, O)
       end
   else
6      O ← O ∪ P_c
7      return O
   end

We split cluster c into a second set of clusters using affinity propagation if d̄ is higher than the user-specified upper bound d̄_bound, or if the population p_c of the c-th cluster is higher than p_bound, and if the number of points in the cluster is greater than 10. For each newly generated cluster, we recursively perform the average distance computation and the evaluation of the distance, population, and cluster point count to split it further until the user-defined constraints are met. Algorithm 1 summarizes the steps of our approach. The initial clustering algorithm runs in O(k·n²) time and produces R clusters, where k represents the number of iterations until convergence and n represents the number of samples. In our case, the initial clustering algorithm runs with n = 220,334 points and k = 200. Each of the R clusters is re-clustered in O(k·m_i²) time, where k = 200 and m_i represents the number of points in the i-th cluster, i = 1, ..., R.
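A condensed, illustrative sketch of the haversine distance of Equations (1)-(3) and the recursive split of Algorithm 1 (not the authors' implementation); the thresholds mirror the defaults stated above, and points and pops are assumed inputs.

import numpy as np
from sklearn.cluster import AffinityPropagation

R_MILES = 3959.0  # Earth radius at the equator, in miles

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two coordinates in degrees (Eqs. 1-3)."""
    p1, p2, l1, l2 = map(np.radians, (lat1, lat2, lon1, lon2))
    a = np.sin((p2 - p1) / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin((l2 - l1) / 2) ** 2
    return R_MILES * 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

def split(points, pops, d_bound=5.0, p_bound=20000, out=None):
    """Recursively split a cluster until its mean centroid distance and
    population fall below the bounds, or it has 10 or fewer members.
    points: (n, 2) array of (lat, lon); pops: (n,) population counts."""
    out = [] if out is None else out
    lat_c, lon_c = points.mean(axis=0)                       # cluster centroid
    d_mean = np.mean([haversine_miles(la, lo, lat_c, lon_c) for la, lo in points])
    if (d_mean > d_bound or pops.sum() > p_bound) and len(points) > 10:
        labels = AffinityPropagation(max_iter=2000, convergence_iter=200).fit_predict(points)
        for k in np.unique(labels):
            split(points[labels == k], pops[labels == k], d_bound, p_bound, out)
    else:
        out.append(points)
    return out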

3.3 Drive Times Computation

When evaluating the effectiveness of our approach, we compute exact drive times to a potential location from all points enclosed by a bounding box at a user-specified distance (e.g. 5 miles). The bounding box is represented by the coordinates of its north-east and south-west most points. All points within the bounding box are clustered census block groups generated by Algorithm 1, and represent customers who are likely to visit the potential location. We compute the locations of the north-east and south-west most points of the bounding box by using the inverse haversine formula described in Algorithm 2. Given a distance d and the location denoted with latitude φ1 and longitude λ1, we compute the north-east location with latitude φmax and longitude λmax and the south-west location with latitude φmin and longitude λmin. We use the Google Maps API to generate drive times from all points enclosed by the bounding box to the location. For example, to compute the drive time and distance between starting location (44.66, -74.99) and ending location (44.67, -74.98), we call the mapping API using: http://maps.googleapis.com/maps/api/distancematrix/json?units=imperial&origins=44.66,-74.99&destinations=44.67,-74.98. The returned JSON object is shown in Figure 4. The generation of the north-east and south-west most points of the bounding box is performed in O(1) time, while the drive time computations are performed in O(n) time, where n represents the number of points within the bounding box.

Algorithm 2: Bounding Box Computation.
Parameters: MINLAT (min latitude): −90°, MAXLAT (max latitude): 90°, MINLON (min longitude): −180°, MAXLON (max longitude): 180°, R (radius of earth): 6,371 km.
Input: Distance d and location (φ1, λ1)
1   φ = d / R
2   φmin = φ1 − φ
3   φmax = φ1 + φ
4   if φmin > MINLAT ∧ φmax < MAXLAT then
5       λ = sin⁻¹(sin φ / cos φ1)
6       λmin ← λ1 − λ
7       if λmin < MINLON then
8           λmin ← λmin + 2π
        end
9       λmax ← λ1 + λ
10      if λmax > MAXLON then
11          λmax ← λmax − 2π
        end
    else
12      φmin ← max(φmin, MINLAT)
13      φmax ← min(φmax, MAXLAT)
14      λmin ← MINLON
15      λmax ← MAXLON
    end
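Algorithm 2 translates almost directly into code. The sketch below is an illustrative version under the same parameters, returning the south-west and north-east corners in degrees.

import math

MIN_LAT, MAX_LAT = math.radians(-90), math.radians(90)
MIN_LON, MAX_LON = math.radians(-180), math.radians(180)
R_KM = 6371.0  # Earth radius in km, as in Algorithm 2

def bounding_box(lat1, lon1, d_km):
    """South-west / north-east corners of a box of 'radius' d_km around
    (lat1, lon1); inputs in degrees, outputs in degrees."""
    phi1, lam1 = math.radians(lat1), math.radians(lon1)
    dphi = d_km / R_KM
    phi_min, phi_max = phi1 - dphi, phi1 + dphi
    if phi_min > MIN_LAT and phi_max < MAX_LAT:
        dlam = math.asin(math.sin(dphi) / math.cos(phi1))
        lam_min = lam1 - dlam
        if lam_min < MIN_LON:
            lam_min += 2 * math.pi
        lam_max = lam1 + dlam
        if lam_max > MAX_LON:
            lam_max -= 2 * math.pi
    else:  # box touches a pole: clamp latitude, take the full longitude range
        phi_min, phi_max = max(phi_min, MIN_LAT), min(phi_max, MAX_LAT)
        lam_min, lam_max = MIN_LON, MAX_LON
    return (math.degrees(phi_min), math.degrees(lam_min)), \
           (math.degrees(phi_max), math.degrees(lam_max))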

Figure 4: We use the drive times generated by the Google Maps API to determine the differences between raw census block group data and our recursively clustered data. The JSON object payload contains distance and drive time values for a given starting and ending location.

4 RESULTS

We use the 2010 US Census Bureau Block Group dataset, which consists of 220,334 unique block groups representing all 50 states, the District of Columbia, and Puerto Rico (Census, 2010). The dataset consists of:
• STATEFP or State Federal Information Processing Standards (FIPS) code, which is used to identify each state in the US,
• COUNTYFP or county FIPS code, which is used to identify each county within the state,
• POPULATION or the total population of the block group,
• LATITUDE or the latitude of the block group center, and
• LONGITUDE or the longitude of the block group center.

Our approach reduces the size of the census dataset from 220,334 block groups to 41,442 clustered block groups, thereby reducing the dataset by 81.19%. On a per state basis, we see the highest reduction in Rhode Island, with a reduction from 815 block groups to 117 clusters resulting in a reduction of 85.64%. We see the lowest reduction in North Dakota, with a reduction from 572 block groups to 164 clusters, or a reduction of 71.33%. The average maximum distance from the cluster centroid across all states is 4.624 miles. The average distance from the cluster centroid to cluster points across all states is 2.300 miles. On a per state basis, we see the lowest average maximum distance from


Figure 5: Comparing results for raw census block group data, sub-optimal clustered data, and optimally clustered data for Rhode Island, Nevada, North Dakota, and Wyoming. A densely populated state, such as Rhode Island, or a state with dense population localities, can be described by fewer clusters. Sparsely populated states, such as North Dakota and Wyoming, require a larger number of clusters to define the population. (Figure best viewed in color).

the cluster centroid in the District of Columbia at 0.667 miles. The lowest average distance from the cluster centroid to cluster points is also in the District of Columbia at 0.381 miles. We see the highest average maximum distance from the cluster centroid in Alaska at 27.141 miles. The highest average distance from the cluster centroid to cluster points is also in Alaska at 11.369 miles. For a congested state with dense traffic patterns and low inner-city speed limits, such as the District of Columbia, a low cluster centroid to cluster point distance is ideal. On the other hand, for a sparsely populated state, such as Alaska, where speed limits are higher, a larger cluster centroid to cluster point distance has minimal impact. The state-to-state variations in census block group reduction can be explained by the differences in land area and population distribution. As shown in Figure 5, census block group reduction is highest in Rhode Island, as it is a densely populated state with 1021 individuals per square mile. States such as


Figure 6: A densely populated area contains several block groups in close proximity, while a sparsely populated area has larger distances between block groups. In our approach, the densely populated area shown in the top right is reclustered further into smaller clusters to ensure each cluster point is less than 5 miles from the cluster centroid and the total population of the cluster is below 20,000. The sparsely populated area shown on the bottom right will also be reclustered using our approach; however, our algorithm generates fewer sub-clusters. (Figure best viewed in color).

Nevada, where the population density is low (26 individuals per square mile) but highly concentrated in a few localities, also have a higher reduction (84.80%). On the other hand, states with a low population density, such as Wyoming with 6 individuals per square mile, have the lowest reduction. For a densely populated state, such as Rhode Island, we start with 815 census block groups and generate a set of 27 sub-optimal clusters. These initial clusters are sub-optimal, with an average maximum distance from the cluster centroid of 6.825 miles and an average cluster centroid to cluster point distance of 3.272 miles. Using our approach, we generate 117 optimized clusters. The average maximum distance from the cluster centroid is 2.285 miles, and the average distance from the cluster centroid to cluster points is 1.259 miles. On the other hand, for a sparsely populated state, such as South Dakota, we start with 654 census block groups and generate a set of 15 sub-optimal clusters. The average maximum distance in the sub-optimal clusters is 76.994 miles and the average cluster centroid to cluster point distance is 30.607 miles. Our approach generates 177 optimized clusters with an average maximum distance of 8.501 miles and an average cluster centroid to cluster point distance of 4.264 miles. To understand how localities with different population densities are handled by our approach, we show

the changes in cluster distribution after initial clustering and after optimization for two localities in Connecticut in Figures 6 and 7. As shown in Figure 7, after initial clustering, a densely populated area, such as the Hartford area, has a large number of block groups in close spatial proximity. A sparsely populated area, such as the Salisbury area, has very few census block groups, with larger distances between neighboring block groups. As shown in Figure 7, after optimization our approach generates clusters comprised of 10 or more census block groups in densely populated areas, since the distance between cluster members is low. For sparsely populated areas, each cluster consists of 3-4 census block groups, as they are spatially further apart from each other. The 80% reduction in the census dataset results in a 5× increase in computational efficiency on average. As shown in Figure 8, for our random location denoted by the diamond symbol and located at coordinates (41.766458, -72.677643), we generate 253 potential customer groups in a 5 mile bounding box using the census block group data, with an average drive time of 10 minutes 14 seconds. Our approach generates 33 clustered customer groups with an average drive time of 10 minutes 5 seconds.


Figure 7: A densely populated area contains several block groups in close proximity, while a sparsely populated area has larger distances between block groups. By reclustering the densely populated area into multiple smaller clusters, we ensure that the drive time differences between the raw census data and clustered data are minimized. (Figure best viewed in color.)

Figure 8: Effect of clustering on reducing the number of drive time computations in an urban location, such as Hartford, CT. The diamond indicates a proposed location, and the circles indicate block groups. The figure on the left shows the raw census block group data, while the figure on the right shows the clustered block group data.

5 EVALUATION

The typical location selection process involves the evaluation of drive times for several thousand locations across the country and making comparisons to several thousand competitors. We measure the performance of our optimized clustering approach by computing the difference in drive times for 200 random locations generated across the entire US. For each location we create a trade area at a radius of 5 miles from the location and compute the average drive time using both the census block group data and the optimally clustered data. We apply a paired t-test and test the following hypotheses:

NULL: the mean drive time for census block group data is no different from the mean drive time for optimally clustered data.
Alternate: the mean drive time for census block group data is different from the mean drive time for optimally clustered data.

We failed to reject the NULL hypothesis with a p-value of 0.1878. The difference in sample means for the census block group data and clustered data is 0.224301 minutes or 13.5 seconds. The 95% confidence interval lies between [-33.53 seconds, 6.62 seconds]. For the census block group data, we compute drive times to 8855 consumer groups. By using the clustered census block groups we only compute drive times to 1570 locations, resulting in a 5.64× improvement in the number of computations. The differences in drive times obtained from the census and clustered census data are impacted by the number of clustered points found within a trade area for a proposed location. As shown in Figure 9, sparsely populated areas, where the clustering reduces the census block groups to 1 or 2 clusters, show a higher difference in drive times. In sparse areas, we observe drive time differences of up to 2 minutes on average when comparing the census and clustered census data. For densely populated areas, where census block groups are in close spatial proximity, the drive time differences are less than 30 seconds on average. A 2 minute drive time difference in a sparsely populated area, where amenities are in general further apart, may be more acceptable to a consumer. To further validate our approach, we randomly selected 300 Walmart locations and computed the drive times using our optimized clustering approach and the raw census data. We failed to reject the NULL hypothesis with a p-value of 0.08782. The difference in sample means for the census block group data and the clustered data is 0.1464922 minutes or 8.8 seconds. The 95% confidence interval lies between [-18.89 seconds, 1.31 seconds].
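The statistical comparison above is a standard paired t-test on per-location average drive times; a short sketch with SciPy, assuming two equal-length arrays of averages in minutes (hypothetical inputs).

import numpy as np
from scipy import stats

def compare_drive_times(raw_minutes, clustered_minutes, alpha=0.05):
    """Paired t-test on mean drive times computed from the raw census
    block groups versus the clustered block groups (same locations)."""
    raw = np.asarray(raw_minutes)
    clustered = np.asarray(clustered_minutes)
    t_stat, p_value = stats.ttest_rel(raw, clustered)
    mean_diff_seconds = 60 * np.mean(raw - clustered)
    reject_null = p_value < alpha
    return t_stat, p_value, mean_diff_seconds, reject_null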

Figure 9: Drive time differences measured in seconds for census vs. clustered census data. Drive time differences reduce as the number of clustered points in the neighborhood of a proposed location increases.

6 THREATS TO VALIDITY

Internal. The 2010 United States Census block group dataset contains 930 block groups with zero population. These block groups are located in uninhabited areas, such as lakes and national forests. Our approach is not affected, as zero-population block groups are either left unclustered (579 out of 930), since they are not candidates to become members of another cluster, or are consumed into a cluster where they do not add to the cluster's population count. We use the haversine formula to compute distances from cluster members to the cluster centroid. The haversine formula provides the distance as the crow flies and does not factor in natural pathway obstructions for humans, such as bodies of water or mountains. For the purpose of our approach, the haversine distance is used to determine the closeness of cluster members to the centroid, and not as an exact measure of distance. As shown in Figure 10, in sparsely populated areas the differences in drive times between the census and the optimized cluster set are higher. For example, using a randomly generated location in Salisbury, CT, our approach reduces the number of drive time computations from 8 in the census data to 1 in the clustered census data; however, the drive time difference between the two is 4 minutes 38 seconds. In the future, we intend to address these issues by extending the bounding box further out from the proposed location in sparsely populated areas. In this instance, if we increase the bounding box distance to 7.5 miles, the difference in drive time reduces to 1 minute 52 seconds. Additionally, using a population-weighted approach would remove this threat since these block groups would have no impact on the analysis.

External. Our approach uses population data aggregated as census block groups. While census block groups are used only in the United States, our approach can be applied to census tracts, which are used in several other countries, such as Australia, New Zealand, and the United Kingdom.

7 DISCUSSION

In this paper we presented our approach for generating optimized census block group clusters to improve the efficiency of drive time calculations for location selection. Since companies need to evaluate on the order of thousands of potential locations and competitors, computing drive times from each census block group can be computationally infeasible. Our optimization approach allows the user to specify distance


Figure 10: Effect of clustering on reducing the number of drive time computations in sparsely populated areas, such as Salisbury, CT. The diamond indicates a proposed location, and the circles indicate block groups. The figure on the left shows the raw census block group data, while the figure on the right shows the clustered block group data.

and population thresholds and generate a clustered US census block group dataset. Our clustering approach reduces the census block group data from 220,334 groups to 41,442 clustered groups. By reducing the census dataset, we provide an average 5× speed up for the drive time computation process. We demonstrate the robustness of our approach by generating 200 random and 300 Walmart locations across the United States and using the Google Maps Distance Matrix API to generate actual drive times. The drive times generated by the census and clustered census datasets show no practical or statistically significant difference. The largest differences in drive times between the census and clustered census data are found in sparsely populated areas; citizens in these areas are more likely to accept longer travel times due to the lack of amenities. The lowest differences in drive times are found in densely populated areas, where citizens are more likely to notice changes in time and distance. Our current census block group clustering approach uses the haversine formula to determine the proximity of cluster members to the cluster centroid. In the future, we will use geographic data to determine the locations of natural obstructions, such as mountains and waterways, along with transportation data on roadway speed limits, to improve the accuracy of the clustering process using obstacle-aware clustering techniques (Tung et al., 2001). Traffic patterns within urban areas influence drive time calculations; our current approach generates clusters based on spatial proximity, and in the future we will incorporate traffic data to optimize clusters based on congestion trends. Our current approach uses a population threshold of 20,000 and a distance threshold of 5 miles; in the future we will investigate a broader set of thresholds to determine the most effective clustering approach. Our approach utilizes data from the United States; in the future we will investigate the generalizability of our approach by using census tract data from countries such as Australia, New Zealand, and the United Kingdom.

REFERENCES Aras, H., Erdo˘gmus¸, S¸., and Koc¸, E. (2004). Multi-criteria selection for a wind observation station location using analytic hierarchy process. Renewable Energy, 29(8):1383–1392. Bailey, G. W. (2003). Market determination based on travel time bands. US Patent 6,604,083. Banaei-Kashani, F., Ghaemi, P., and Wilson, J. P. (2014). Maximal reverse skyline query. In Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 421–424. ACM. Blanchard, T. and Lyson, T. (2002). Access to low cost groceries in nonmetropolitan counties: Large retailers and the creation of food deserts. In Measuring Rural Diversity Conference Proceedings, November, pages 21–22. Branas, C. C., MacKenzie, E. J., Williams, J. C., Schwab, C. W., Teter, H. M., Flanigan, M. C., Blatt, A. J., and ReVelle, C. S. (2005). Access to trauma centers in the united states. Jama, 293(21):2626–2633. Carr, B. G., Branas, C. C., Metlay, J. P., Sullivan, A. F., and Camargo, C. A. (2009). Access to emergency care in the united states. Annals of emergency medicine, 54(2):261–269. C ¸ ebi, F. and Otay, I. (2015). Multi-criteria and multi-stage facility location selection under interval type-2 fuzzy environment: a case study for a cement factory. international Journal of computational intelligence systems, 8(2):330–344. Census, U. (2010). 2010 us census block group data. http://www2.census.gov/geo/docs/reference/cenpop2 010/blkgrp/CenPop2010 Mean BG.txt. Chen, L., Zhang, D., Pan, G., Ma, X., Yang, D., Kushlev, K., Zhang, W., and Li, S. (2015). Bike sharing station



placement leveraging heterogeneous urban open data. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pages 571–575. ACM. Farber, S., Morang, M. Z., and Widener, M. J. (2014). Temporal variability in transit-based accessibility to supermarkets. Applied Geography, 53:149–159. Frey, B. J. and Dueck, D. (2007). Clustering by passing messages between data points. science, 315(5814):972–976. Ghaemi, P., Shahabi, K., Wilson, J. P., and Banaei-Kashani, F. (2010). Optimal network location queries. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 478–481. ACM. Ghaemi, P., Shahabi, K., Wilson, J. P., and Banaei-Kashani, F. (2012). Continuous maximal reverse nearest neighbor query on spatial networks. In Proceedings of the 20th International Conference on Advances in Geographic Information Systems, pages 61–70. ACM. Google (2017). Google maps distance matrix api. https://developers.google.com/maps/documentation/ distance-matrix/. Guagliardo, M. F. (2004). Spatial accessibility of primary care: concepts, methods and challenges. International journal of health geographics, 3(1):3. Jiao, J., Moudon, A. V., Ulmer, J., Hurvitz, P. M., and Drewnowski, A. (2012). How to identify food deserts: measuring physical and economic access to supermarkets in king county, washington. American journal of public health, 102(10):e32–e39. Kahraman, C., Ruan, D., and Doan, I. (2003). Fuzzy group decision-making for facility location selection. Information Sciences, 157:135–153. Karamshuk, D., Noulas, A., Scellato, S., Nicosia, V., and Mascolo, C. (2013). Geo-spotting: mining online location-based services for optimal retail store placement. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 793–801. ACM. Kuo, R., Chi, S., and Kao, S. (1999). A decision support system for locating convenience store through fuzzy ahp. Computers & Industrial Engineering, 37(1):323– 326. Li, Y., Zheng, Y., Ji, S., Wang, W., Gong, Z., et al. (2015). Location selection for ambulance stations: a datadriven approach. In Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems, page 85. ACM. Nallamothu, B. K., Bates, E. R., Wang, Y., Bradley, E. H., and Krumholz, H. M. (2006). Driving times and distances to hospitals with percutaneous coronary intervention in the united states. Circulation, 113(9):1189– 1195. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning

106

in Python. Journal of Machine Learning Research, 12:2825–2830. Qu, Y. and Zhang, J. (2013). Trade area analysis using user generated mobile location data. In Proceedings of the 22nd international conference on World Wide Web, pages 1053–1064. ACM. Tung, A. K., Hou, J., and Han, J. (2001). Spatial clustering in the presence of obstacles. In Data Engineering, 2001. Proceedings. 17th International Conference on, pages 359–367. IEEE. Tzeng, G.-H. and Chen, Y.-W. (1999). The optimal location of airport fire stations: a fuzzy multi-objective programming and revised genetic algorithm approach. Transportation Planning and Technology, 23(1):37– 55. Tzeng, G.-H., Teng, M.-H., Chen, J.-J., and Opricovic, S. (2002). Multicriteria selection for a restaurant location in taipei. International journal of hospitality management, 21(2):171–187. Van Brummelen, G. (2012). Heavenly mathematics: The forgotten art of spherical trigonometry. Princeton University Press. Wang, F., Chen, L., and Pan, W. (2016a). Where to place your next restaurant?: Optimal restaurant placement via leveraging user-generated reviews. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pages 2371– 2376. ACM. Wang, Y., Jiang, W., Liu, S., Ye, X., and Wang, T. (2016b). Evaluating trade areas using social media data with a calibrated huff model. ISPRS International Journal of Geo-Information, 5(7):112. Xiao, X., Yao, B., and Li, F. (2011). Optimal location queries in road network databases. In Data Engineering (ICDE), 2011 IEEE 27th International Conference on, pages 804–815. IEEE. Xu, M., Wang, T., Wu, Z., Zhou, J., Li, J., and Wu, H. (2016). Demand driven store site selection via multiple spatial-temporal data. In Proceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, page 40. ACM. Yang, J. and Lee, H. (1997). An ahp decision model for facility location selection. Facilities, 15(9/10):241– 254. Yong, D. (2006). Plant location selection based on fuzzy topsis. The International Journal of Advanced Manufacturing Technology, 28(7):839–844. Yu, Z., Tian, M., Wang, Z., Guo, B., and Mei, T. (2016). Shop-type recommendation leveraging the data from social media and location-based services. ACM Transactions on Knowledge Discovery from Data (TKDD), 11(1):1. Yu, Z., Zhang, D., and Yang, D. (2013). Where is the largest market: Ranking areas by popularity from location based social networks. In Ubiquitous Intelligence and Computing, 2013 IEEE 10th International Conference on and 10th International Conference on Autonomic and Trusted Computing (UIC/ATC), pages 157–162. IEEE.

ResPred: A Privacy Preserving Location Prediction System Ensuring Location-based Service Utility
Arielle Moro and Benoît Garbinato
Institute of Information Systems, University of Lausanne, Lausanne, Switzerland
{arielle.moro, benoit.garbinato}@unil.ch

Keywords:

Location Prediction, Location Privacy-preserving Mechanism, Threat Model, Inference Attack, Location-based Services.

Abstract:

Location prediction and location privacy have attracted a lot of attention in recent years. Predicting locations is the next step for Location-Based Services (LBS) because it provides information based not only on where you are but also on where you will be. However, obtaining information from a LBS has a price for the user, because she must share all her locations with the service that builds a predictive model, resulting in a loss of privacy. In this paper we propose ResPred, a system that allows LBS to request location predictions about the user. The system includes a location prediction component containing a statistical location trend model and a location privacy component that blurs the predicted locations by finding an appropriate tradeoff between LBS utility and user privacy, the latter being expressed as a maximum percentage of utility loss. We evaluate ResPred from a utility/privacy perspective by comparing our privacy mechanism with existing techniques on real user locations. Location privacy is evaluated with an entropy-based metric capturing the confusion of an adversary during a location inference attack. The results show that our mechanism provides the best utility/privacy tradeoff and that our model reaches a location prediction accuracy of 60% on average.

1

INTRODUCTION

In recent years, predicting the future locations of users has become an attractive topic for both the research community and companies. Location prediction can boost the creation of new Location-Based Services (LBS) that help users in their daily activities. For example, a LBS could send personalized information to users, such as the menus of restaurants they might like in the vicinity of a location where they will probably be at a specific time, e.g., on Monday between 11:30 a.m. and 12:00 p.m. In order to obtain the future locations of a user, a LBS needs to build a predictive model containing spatial and temporal information. However, this leads to a first location privacy issue, because the user must send all her raw locations to a third-party entity, as described in Figure 1 (a). In this architecture, the LBS, which can be malicious, is installed on the mobile device of the user and gathers all user locations. To preserve location privacy, the idea is to create the location predictive model in a trusted component stored at the operating system level of the mobile device. In this context, the trusted component itself provides the future locations of the user to the LBS, as depicted in Figure 1 (b). Even after a large number of requests

performed by the LBS, it should not be able to reconstruct the entire predictive model of the user, although it may obtain a good partial view of it. This is an undeniable second location privacy issue. It has been demonstrated in the literature that sharing accurate locations has a real cost for a user, because a potential adversary can not only discover a lot of sensitive information related to the user but also identify her by performing simple location attacks, as described by Krumm in (Krumm, 2007). In addition, the authors of (Zang and Bolot, 2011) show that even a small number of a user's locations can severely compromise her location privacy. Because of the availability of various positioning systems on mobile devices, LBS are very convenient for daily activities; consequently, users cannot completely avoid using them. However, users must know that it is fundamental to preserve their privacy when using LBS. Currently, users can only enable or disable the access to locations for specific applications and sometimes reduce the precision of the locations obtained from a positioning system. These options depend on the operating system itself. Such simple choices are not adapted to the context of our work, because we want to preserve the location privacy of the user at a higher level, namely the location prediction level.



[Figure 1: (a) first architecture: the LBS obtains raw locations directly from the positioning system; (b) second architecture: a trusted location prediction service holding the location trend model answers predictLoc(t_future) requests; (c) third architecture: the ResPred system adds a location-privacy process balancing LBS utility and user privacy and answers predictLoc(t_future, r_utility) requests.]

Figure 1: Problem/contribution overview through three different system architectures.

In order to protect the raw locations of a user, some existing Location Privacy Preserving Mechanisms (LPPMs) can be applied, such as spatial perturbation, spatial cloaking, sending dummy locations or spatial rounding, as discussed in (Krumm, 2007; Gambs et al., 2011; Agrawal and Srikant, 2000; Gruteser and Grunwald, 2003; Kido et al., 2005). Nevertheless, these mechanisms may quickly decrease the utility level of a LBS as the level of protection increases, up to the point where the LBS becomes unusable. In this paper, we present a privacy preserving location prediction system called ResPred, where res stands for respect (i.e., respecting the privacy of users) and pred for prediction. This system allows LBS to request future locations of users. For instance, a LBS can display, in advance, information about upcoming public transportation departures located in the vicinity of the predicted location returned by ResPred on the mobile device of the user. Figure 1 (c) presents the ResPred system, which contains two components: one focuses on location prediction and the other on location privacy. We assume that the ResPred system is created at the operating system level of the mobile device and that both the ResPred system and the positioning system are trusted. The system includes a location prediction component based on a statistical location trend model and a location privacy component that blurs the predicted locations by finding an appropriate tradeoff between the LBS utility and the user privacy preference, expressed as a maximum percentage of utility loss. We also assume that the LBS is untrusted, which means that it is a possible adversary. As depicted in Figure 1 (c), the LBS requests the future location of the user by indicating a time duration between the current time and the time of the desired predicted location, and the system returns a predicted location that is found by exploring the location trend model and protected by our LPPM. More specifically, the predicted location is transformed according to the required utility level of the LBS and the maximum utility level that the user is willing to sacrifice in order to protect her location privacy.

We evaluate our system from a utility/privacy perspective, which is the crucial aspect of our approach. In addition, we compute the location prediction accuracy of the location trend model. We chose real mobility traces coming from two datasets, the PrivaMov dataset described in (Ben Mokhtar et al., 2017) and a private dataset collected by a researcher in Switzerland. The first part of the utility/privacy evaluation consists in assessing the utility level of our LPPM and of two other well-known mechanisms described in the literature, namely rounding and Gaussian perturbation. The second part of the utility/privacy evaluation focuses on measuring the confusion level of an adversary performing a location attack on the predicted locations received from the ResPred system. The metric used to evaluate this confusion level is based on the well-known Shannon entropy. The results show that our location privacy preserving mechanism provides the best utility/privacy tradeoff compared to the other evaluated mechanisms, as well as a good location prediction accuracy for the analyzed users. The contributions of this paper are listed below.
• We describe a system, called ResPred, allowing LBS to request the future locations of a user.
• We present a statistical model containing the location trends of a user per time slice, helping to extract short, mid and long-term predicted locations.
• We describe a LPPM enabling an appropriate utility/privacy tradeoff to be reached.
• We use real user locations to assess our system and, more specifically, its two components.
The paper is organized as follows: in Section 2 we begin with the description of the system model containing the formal definitions used in the paper. Section 3 presents the problem addressed in this paper, while the ResPred system is described in Section 4. Then, we present the evaluation of the system from a utility/privacy perspective in Section 5. In addition, we also evaluate the location prediction accuracy of the location trend model of the ResPred system.


Section 6 details the work closest to the two main subjects of this paper, namely location privacy and location prediction. Finally, we highlight the most important findings of the paper and discuss future work in Section 7.

2

SYSTEM MODEL

This section describes the key definitions used to present our system. In order to facilitate the analysis of the locations of a user, time is discretized. We also introduce the Regions Of Interest (ROIs) on which the location predictive model is based. Finally, we present the threat model that describes the context used to evaluate location privacy.

2.1

User and Locations

We consider that a user moves in a geodesic space and owns a mobile device able to detect her locations, as well as when they are captured, via a positioning system, e.g., GPS, WiFi or radio cells. A location is described as a triplet loc = (φ, λ, t), where φ and λ are the latitude and longitude of the location in the geodesic space, and t is the time when the location was obtained from the positioning system. Locations are formally represented as a sequence L = ⟨loc_1, loc_2, ..., loc_n⟩. A subsequence of successive locations of L is described as lsub_i = ⟨loc_1, loc_2, ..., loc_m⟩, in which the first location of this subsequence is noted lsub_i.loc_first and the last location is lsub_i.loc_last. We can express the latitude, longitude and time of a location loc_i by directly writing loc_i.φ, loc_i.λ and loc_i.t respectively.

2.2

Temporal Discretization

In order to discretize time, we compute n slices generated according to the chosen temporal granularity and time span, e.g., every 20 minutes during one week. A time slice is a triplet ts = (t_starting, t_ending, index), where t_starting (e.g., Monday, 7:00 am) and t_ending (e.g., Monday, 7:20 am) are the starting and ending times of the time slice and index is its unique identifier, ranging between 1 and n (n being the total number of computed time slices). For instance, if we generate all time slices of 20 minutes over a period of one week, we obtain 504 time slices. All the possible time slices are represented as a sequence called timeslices, such that timeslices =

⟨ts_1, ts_2, ..., ts_n⟩. In addition, we introduce a function convert(⟨loc_1, loc_2, ..., loc_m⟩) that translates a sequence of one or several successive locations into a sequence of one or several successive time slices called timesliceTab, m being the total number of locations to convert. This sequence is described as timesliceTab = ⟨ts_1, ts_2, ..., ts_n⟩, in which n is the total number of successive time slices.
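To make the discretization concrete, here is a minimal Python sketch (our own illustration, not the authors' code; it assumes the 20-minute granularity and one-week span of the example above, and that the week starts on Monday at 00:00) mapping a timestamp to its time slice index and converting a sequence of locations into the corresponding slice indices.

```python
from datetime import datetime

SLICE_MINUTES = 20                                   # temporal granularity (example value)
SLICES_PER_WEEK = 7 * 24 * 60 // SLICE_MINUTES       # 504 slices for a one-week span

def time_slice_index(t: datetime) -> int:
    """Return the 1-based index of the time slice containing timestamp t."""
    minutes_into_week = t.weekday() * 24 * 60 + t.hour * 60 + t.minute
    return minutes_into_week // SLICE_MINUTES + 1

def convert(locations):
    """Translate successive locations (lat, lon, t) into successive slice indices."""
    return [time_slice_index(t) for (_lat, _lon, t) in locations]

# Example: a location captured on Monday at 07:05 falls into slice 22,
# which covers Monday 07:00-07:20 (2018-03-19 was a Monday).
print(time_slice_index(datetime(2018, 3, 19, 7, 5)))  # -> 22
```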

2.3

Regions of Interest

A region of interest (ROI) is defined as a circular area visited by a user during a certain period of time. It is a quadruplet of the form roi = (φ, λ, Δr, visits). Items φ and λ are the coordinates of the center of the ROI in a geodesic space, Δr is the radius of the ROI and visits is a sequence of subsequences of L, such that visits = ⟨lsub_1, lsub_2, ..., lsub_m⟩, in which each subsequence of successive locations is contained in L, i.e., ∀lsub_i ∈ visits, lsub_i ⊂ L and lsub_i.loc_last.t < lsub_{i+1}.loc_first.t. Each visit of a ROI has a duration equal to or greater than a threshold Δt_min, i.e., ∀lsub_i ∈ visits, lsub_i.loc_m.t − lsub_i.loc_1.t ≥ Δt_min. In addition, all locations of the visits are contained in the ROI spatially described by its first three items, i.e., latitude, longitude and radius. The set containing all ROIs of a user is noted rois = {roi_1, roi_2, ..., roi_n}. The last and important characteristic of a ROI is that there is no spatial intersection between ROIs: if two ROI candidates intersect during the ROI discovery process, they are merged and a new ROI is created from these two candidates.
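As an illustration of these constraints, the sketch below (our own simplified code, not the authors' ROI discovery algorithm; the haversine helper and the data layout, with each location given as (latitude, longitude, datetime), are assumptions) checks whether a subsequence of locations qualifies as a visit of a given ROI and whether two circular ROI candidates intersect and should therefore be merged.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def is_visit(lsub, roi, dt_min_s):
    """A subsequence is a visit if it lasts at least dt_min_s seconds
    and all of its locations fall inside the ROI circle."""
    lat_c, lon_c, radius_m = roi
    duration = (lsub[-1][2] - lsub[0][2]).total_seconds()
    inside = all(haversine_m(lat, lon, lat_c, lon_c) <= radius_m
                 for (lat, lon, _t) in lsub)
    return duration >= dt_min_s and inside

def should_merge(roi_a, roi_b):
    """Two circular ROI candidates intersect when the distance between
    their centers is smaller than the sum of their radii."""
    d = haversine_m(roi_a[0], roi_a[1], roi_b[0], roi_b[1])
    return d < roi_a[2] + roi_b[2]
```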

2.4

Threat Model

We consider a threat model that assumes an honest-but-curious adversary in the form of a LBS using ResPred. The LBS tries to infer future locations of the user based on a location history gathered by requesting ResPred. This location history contains all the predicted locations sent by ResPred and constitutes the only background knowledge on which the location attack is performed. This history is not complete, because we consider that the LBS does not request ResPred constantly but only a limited number of times, either at random during a certain time slice or following the user's usual use of the LBS, e.g., every day at the end of the afternoon. The honest-but-curious behavior of the LBS also means that it will not try to break the sharing protocol or to obtain the location predictive model of the ResPred system. In addition, we consider that the LBS always gives parameters adapted to its service to the ResPred system, more specifically the values of the parameters Δt_future and Δr_utility, as depicted in Figure 1 (c) or Figure 2.




3

PROBLEM STATEMENT

Considering that a LBS wants to estimate the future location of a user, it needs to create a predictive model of the user. In order to reach this goal, the LBS constantly collects locations of the user to update her model, as shown in Figure 1 (a). However, the location privacy of the user is then entirely compromised, because all her raw locations are regularly shared with the LBS. This means that all sensitive information related to the user is given to a third-party entity. For example, from her raw locations the LBS can discover her home and work places, but also her likes and dislikes about religion and/or politics. A first solution is to delegate the creation of the predictive model to a service at the operating system level that we consider as trusted, as shown in Figure 1 (b). In this context, the only service that has access to the raw locations of the user coming from the positioning system is this dedicated service. The latter provides predicted locations to the LBS, which needs them to operate properly. Although the location privacy of the user is increased in this context, there is still a location privacy issue regarding the predicted locations shared with the LBS. From all the predicted locations it gathers, the LBS can still infer precise location habits of the user, especially when it requests the trusted service for the same future time every day, for instance. Consequently, the challenge is to protect the location privacy of the user as much as possible when her predicted locations are shared with a LBS. Although various LPPMs exist in the literature, they do not necessarily meet the utility requirement of a LBS. This means that they can easily compromise the proper functioning of the LBS, up to the point where it becomes unusable for the user. For example, the location information provided by the LBS can be inaccurate or simply erroneous because the precision of the prediction has been made too low by the LPPM. As a result, the user might stop using the LBS. As discussed in the introduction, our approach consists in building a system, including a location predictive model as well as an adapted LPPM, that takes into account the utility requirement of the LBS and the utility/privacy tradeoff expressed by the user, as indicated in Figure 1 (c) or Figure 2.

[Figure 2: (1) the LBS calls predictLoc(Δt_future, Δr_utility); (2) ResPred obtains loc_current from the positioning system via getCurrentLoc() and queries the location trend model with predictLoc(loc_current, Δt_future) = tmpLoc_predicted; (3) the location-privacy process applies protect(tmpLoc_predicted, p_maxUtilityLoss, Δr_utility) = loc_predicted, balancing LBS utility and user privacy; (4) loc_predicted is returned to the LBS.]

Figure 2: ResPred system overview.

4

SYSTEM OVERVIEW

As described in Figure 2, ResPred contains two components. The first component focuses on location prediction, while the second relates to location privacy. The first component is responsible for predicting the future location of the user and includes her predictive location model, called the location trend model. The second component aims at protecting the predicted location computed by the first component and uses a LPPM called the utility privacy tradeoff LPPM. A request from a LBS consists in asking where a user will be in the future. As described in Equation 1, the LBS requests the future location by specifying the time duration Δt_future in seconds from the current time, e.g., 7200 seconds (2 hours) from now. The LBS also indicates its required utility Δr_utility that allows it to operate properly. For instance, if a LBS must call a taxi for a user in advance, the LBS will indicate a utility of a short distance in meters, such as 500 meters. A long distance could compromise the use of the taxi service itself and of the related LBS, because inaccurate information could be displayed to the user. The returned value is a location expressed as a pair loc_predicted = (φ, λ).

predictLoc(Δt_future, Δr_utility) = loc_predicted    (1)


To summarize, ResPred answers the following question: where will the user be in Δt_future second(s) from now?

4.1

Location Prediction Component

The location prediction component contains a predictive model that represents the location trends of a user organized per time slice. As mentioned in Section 2,


[Figure 3: raw locations first go through ROI discovery, then through temporal and spatial matching against the time slices (TS1-TS5), producing the location trend model.]

Figure 3: From ROIs to location trend model.

time is discretized into time slices over a given period, such as 504 time slices during one week (i.e., each time slice lasts 20 minutes). A location trend model is an array in which each cell contains all the ROIs or sequences of successive ROIs visited during a specific time slice. Figure 3 describes the creation process of the location trend model. First, the ROI discovery process discovers all the ROIs of a user by analyzing her raw locations. Second, all raw locations are marked with a specific ROI and a specific time slice, as specified in the temporal and spatial matching step. This step pre-processes the locations for the creation of the location trend model. Finally, we build the location trend model, in which we collect all the ROIs or successive ROIs visited during each time slice. Since the location trend model is a statistical model, each visited ROI or sequence of successive ROIs stored for a given time slice has a visit counter. This highlights the location habits of the user per time slice, i.e., the ROIs or successive ROIs that are the most visited by the user during that time slice, and allows the component to find the predicted locations needed to answer LBS requests. As depicted in Figure 2, the location trend model has to solve the request expressed in Equation 2 and return a temporary predicted location tmpLoc_predicted. The latter is not the final predicted location sent to the LBS at the end of the process, because tmpLoc_predicted must first be protected by the LPPM of the location privacy component.

predictLoc(loc_current, Δt_future) = tmpLoc_predicted    (2)

In order to find tmpLoc_predicted, the location prediction component starts by searching for the target time slice, i.e., the time slice that includes the future time computed by adding the Δt_future duration to the current timestamp loc_current.t. Once this target time slice is found, the location trend model is analyzed to find the location trends, expressed as ROI(s), corresponding to it. tmpLoc_predicted is a triplet tmpLoc_predicted = (φ, λ, Δr), where Δr is a radius expressing the accuracy of the temporary predicted location. There are then two cases to compute the items of tmpLoc_predicted. Firstly, if the analysis highlights that

the most likely visited location in the target time slice corresponds to one ROI, the temporary predicted location takes the same latitude, longitude and radius as that ROI. Secondly, if the analysis shows that the most likely visited locations are two or several successive ROIs, the component merges all these ROIs into one single ROI and computes a new latitude, longitude and radius, which become the items of tmpLoc_predicted. In addition, three specific prediction scenarios can occur during the prediction process. In the best scenario, the component finds the most likely ROI or successive ROIs to compute tmpLoc_predicted by exploring the location trends of the target time slice. Secondly, it can happen that all ROIs or successive ROIs have the same visit counter value; in this case, the last visited ROI or successive ROIs are used to compute tmpLoc_predicted. Finally, it is also possible that no ROI or successive ROIs are recorded for the target time slice. In this specific case, the component explores previous time slices until it finds a visited ROI or successive ROIs to compute tmpLoc_predicted.
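The lookup just described can be summarized by the following Python sketch (again our own simplified code, not the authors' implementation; the model layout, a mapping from time slice index to a collections.Counter over ROI sequences, and the merging rule for successive ROIs, a centroid with a covering radius, are assumptions, and the tie-breaking-by-recency rule is omitted). It reuses time_slice_index and haversine_m from the earlier sketches.

```python
from datetime import timedelta

def predict_loc(model, t_current, dt_future_s, n_slices=504):
    """Return tmpLoc_predicted = (lat, lon, radius_m), or None if the model is empty.

    model: {slice_index: Counter({(roi, ...): visit_count})}, each roi = (lat, lon, radius_m).
    """
    target = time_slice_index(t_current + timedelta(seconds=dt_future_s))
    # Explore the target time slice first, then previous slices, until a trend is found.
    for back in range(n_slices):
        idx = (target - 1 - back) % n_slices + 1
        trends = model.get(idx)
        if trends:
            rois, _count = trends.most_common(1)[0]  # most visited ROI sequence
            return merge_rois(rois)
    return None

def merge_rois(rois):
    """One ROI: return it unchanged. Several successive ROIs: merge them
    (placeholder rule: centroid of the centers, radius covering every member)."""
    if len(rois) == 1:
        return rois[0]
    lat = sum(r[0] for r in rois) / len(rois)
    lon = sum(r[1] for r in rois) / len(rois)
    radius = max(haversine_m(lat, lon, r[0], r[1]) + r[2] for r in rois)
    return (lat, lon, radius)
```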

4.2

Location Privacy Component

The goal of the location privacy component is to protect as much as possible the temporary predicted location found by the location prediction component. The LPPM applied to tmpLoc_predicted depends on two aspects: the LBS utility Δr_utility given by the LBS, and the user privacy preference given by the user and expressed as a maximum utility loss percentage Δp_maxUtilityLoss. The LBS utility means that the LBS can provide useful and relevant information within a radius, expressed in meters, around a reference location. Beyond this distance, there is no guarantee that the LBS is able to operate properly or to provide reliable information to the user. For example, if the LBS is a taxi company application that asks for a predicted location at the end of the day, when the user usually requests a taxi, in order to anticipate the user's request, it will indicate a small utility radius in meters so as not to be far from the user at that future time. The maximum utility loss is expressed as a percentage that indicates the maximum utility the user is willing to sacrifice in order to protect her location privacy. Consequently, its value ranges between 0 (included) and 1 (excluded). A value of 0 means that the user simply does not want to lose any LBS utility; 1 is excluded because the LBS could not work properly if this value were reached. Equation 3 describes the request handled by the component, including the LBS utility Δr_utility and the maximum utility loss percentage Δp_maxUtilityLoss.


Figure 4: Computing new coordinates when the radius of the reference zone is adjusted, i.e., greater or smaller than the radius of tmpLoc_predicted.


Figure 5: Three possible random generations of new coordinates (position x in gray) according to a high maximum utility loss percentage.

protect(tmpLoc_predicted, Δp_maxUtilityLoss, Δr_utility) = loc_predicted    (3)

The location privacy preserving mechanism works in the following manner. The component first creates a reference zone, zone_ref, whose latitude and longitude are those of tmpLoc_predicted and whose radius equals the LBS utility Δr_utility. The goal of the component is then to change the latitude and longitude of tmpLoc_predicted by computing new coordinates. The component creates a new zone, zone_new, having the newly generated latitude and longitude as its center and a radius equal to the LBS utility Δr_utility. In order to compute these new coordinates, the component first generates a random angle that indicates the direction of the new coordinates. Then, a latitude and a longitude are generated randomly in that direction, at a distance between 0 and the threshold beyond which zone_ref and zone_new can no longer intersect, i.e., 2 × Δr_utility. The component must then carefully check that the protected percentage of zone_ref is not greater than the maximum utility loss percentage indicated by the user, i.e., p_maxUtilityLoss. To check this condition, the component computes the area of the intersection between the reference zone zone_ref and the new zone zone_new. This intersection area is divided by the area of zone_ref to obtain a released percentage p_released, which is shared with the LBS. Finally, the component computes the protected percentage, which is equal to p_protected = 1 − p_released.

The new coordinates are validated only if p_protected is lower than or equal to the maximum utility loss percentage given by the user. If this is not the case, new coordinates are generated until the condition is met. When the condition is met, loc_predicted is created with a latitude and longitude corresponding to the new coordinates and is sent to the LBS. There is therefore a clear link between the utility the user is willing to lose and her location privacy: the greater p_maxUtilityLoss, the better the user protects her location privacy. Equation 4 summarizes this condition; the function area computes the area of the elements passed as parameters.

1 − area(zone_ref ∩ zone_new) / area(zone_ref) ≤ Δp_maxUtilityLoss    (4)
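The following Python sketch illustrates this rejection-sampling mechanism (our own illustrative code, not the authors' implementation; the planar treatment of distances, the equirectangular conversion of the offset back to coordinates, and the shortcut for a zero utility loss are our assumptions). Since zone_ref and zone_new share the radius Δr_utility, the standard lens-area formula for two equal circles gives the released percentage.

```python
import math
import random

def circle_overlap_ratio(d, r):
    """Fraction of a circle of radius r covered by another circle of the same
    radius whose center lies d meters away (0 when the circles are disjoint)."""
    if d >= 2 * r:
        return 0.0
    lens = 2 * r * r * math.acos(d / (2 * r)) - (d / 2) * math.sqrt(4 * r * r - d * d)
    return lens / (math.pi * r * r)

def protect(tmp_loc, p_max_utility_loss, r_utility_m):
    """Blur tmp_loc = (lat, lon, radius_m) into loc_predicted = (lat, lon)."""
    lat, lon, _radius = tmp_loc
    if p_max_utility_loss <= 0.0:
        return (lat, lon)  # the user accepts no utility loss: no blurring (assumption)
    while True:
        angle = random.uniform(0.0, 2.0 * math.pi)      # random direction
        dist = random.uniform(0.0, 2.0 * r_utility_m)   # 0 .. 2 * r_utility
        p_released = circle_overlap_ratio(dist, r_utility_m)
        p_protected = 1.0 - p_released
        if p_protected <= p_max_utility_loss:           # Equation 4
            # Equirectangular conversion of the metric offset to degrees.
            dlat = (dist * math.cos(angle)) / 111_320.0
            dlon = (dist * math.sin(angle)) / (111_320.0 * math.cos(math.radians(lat)))
            return (lat + dlat, lon + dlon)
```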

Then we compute the utility average of a target time slice by dividing the number of predicted locations that meet the utility condition by the total number of predicted locations sent to the LBS for this target time slice. Finally, we calculate the average of the utility results obtained for all the target time slices in order to obtain the utility result of the scenario.

5.6.2 Location Privacy Metric

The location privacy metric evaluates the degree of confusion of an adversary, the LBS in our case, during a location attack on the predicted locations received from ResPred. The metric is based on the Shannon entropy, which can compute

[Figure 6 panels: (a) temporary predicted location; (b) Gaussian perturbation (parameter: 0.005); (c) rounding mechanism (rounded to 2 decimals); (d) utility privacy tradeoff LPPM (parameter: 0.9).]

Figure 6: Visual description of the impact of the different LPPMs on a temporary predicted location.

a level of uncertainty, as described in (Shokri et al., 2011). As mentioned in Section 2.4, the location attack performed by an adversary consists in trying to discover one location amongst all the predicted locations sent by ResPred for a specific target time slice, considering that the adversary knows how time is discretized in our location trend model. The goal of a LPPM is to confuse the adversary in order to reduce its probability of finding one single location for a target time slice. In order to compute the location privacy, we create a grid that discretizes the space and compute the density proportion p_density of each visited cell of this grid. The density proportion p_density is the number of predicted locations falling in a visited cell of the grid, out of the total number of predicted locations, during the target time slice. Each cell of the grid is a rectangle of approximately 100 by 180 meters on average, i.e., a difference of 0.001 between two successive latitudes or longitudes. Equation 6 describes the computation of the location privacy

for a specific target time slice, in which i is the index of the i-th cell visited by the user and n is the total number of cells visited by the user during the target time slice. A low entropy result means a low confusion of the adversary, while a high entropy result means a high uncertainty.

res_locationPrivacy = − Σ_{i=1}^{n} p_density_i · log2(p_density_i)    (6)

Finally, we compute an average result for each scenario in the same way as for the utility metric (described at the end of the previous section).
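To make the metric concrete, the following minimal Python sketch (our own code; the 0.001-degree grid step mirrors the cell size mentioned above, and the input is assumed to be a non-empty list of (latitude, longitude) pairs) bins the predicted locations of one target time slice into grid cells and computes the entropy of Equation 6.

```python
import math
from collections import Counter

def location_privacy(predicted_locations, step=0.001):
    """Shannon entropy (Equation 6) over the grid-cell densities of the
    predicted locations received for one target time slice."""
    cells = Counter(
        (math.floor(lat / step), math.floor(lon / step))
        for (lat, lon) in predicted_locations
    )
    total = sum(cells.values())
    return -sum((c / total) * math.log2(c / total) for c in cells.values())

# All predictions in a single cell -> entropy 0 (no adversary confusion);
# predictions spread evenly over k cells -> entropy log2(k).
```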

5.7

Results

The average location prediction accuracy of the location trend model over all evaluated users is 60%. In addition, we obtain a minimum and a maximum location prediction accuracy of 16% and 90% respectively.


Table 2: Utility / location privacy results.

LPPM                          | Utility result | Location privacy result
Utility privacy tradeoff LPPM | 1.0            | 2.81
Grid-based rounding           | 0.62           | 0
Gaussian perturbation         | 0.50           | 2.78

Regarding the utility/privacy tradeoff evaluation, we first compute the average of the utility results of all LBS scenarios per user and then calculate the average utility results over all users. We do exactly the same for the location privacy results. The results are summarized in Table 2. We can clearly see that our LPPM, i.e., the utility/privacy tradeoff LPPM, has the best utility/privacy tradeoff, because both the utility result and the location privacy result reach the highest values. This means that our LPPM meets the LBS utility requirements and is also able to protect the location privacy of the user according to her privacy preference. Although the Gaussian perturbation also has a high location privacy result, it does not reach a reasonable utility result. The location privacy result of the grid-based rounding is equal to 0, indicating that the adversary has no confusion, because the modified locations, i.e., the predicted locations, are always the same for a target time slice. The Gaussian perturbation has the advantage of blurring a location via a single parameter expressing a distance, while the grid-based mechanism requires the creation of a grid, which can take a substantial time, and its exploration before being able to blur a location. Although our mechanism must check a location privacy condition, it computes the new coordinates within a reasonable time. Finally, the blurring impact of the different LPPMs on a temporary predicted location can be seen in Figure 6. In Figure 6 (a), we can see the center as well as the radius of a temporary predicted location, depicted with a marker and a circle. In Figure 6 (b) and (d), 100 new locations, depicted with new markers, are created according to the corresponding LPPM. Regarding the rounding, the coordinates have only been rounded to two decimals in the figure; in the context of the evaluation with a spatial grid, we would have obtained 100 times the same location, because the structure of the grid is fixed and the nearest location is always the same for a single location to blur.

6

RELATED WORK

The related work below tackles the two main subjects of the paper: existing LPPMs, and the different predictive models presented in the literature that are used to compute future user locations.

6.1

Location Privacy Preserving Mechanisms

In a location prediction context, we consider that we need to protect the predicted location that is sent to a LBS, as mentioned in Section 3. To reach this goal, various mechanisms exist to protect the predicted location, such as applying a spatial perturbation (Agrawal and Srikant, 2000; Armstrong et al., 1999; Gambs et al., 2011), using a spatial cloaking mechanism (Gruteser and Grunwald, 2003), sending dummy locations (Kido et al., 2005) or using a rounding mechanism (Agrawal and Srikant, 2000; Krumm, 2007). Applying a spatial perturbation spatially modifies a location, as mentioned by several authors in (Armstrong et al., 1999; Gambs et al., 2011). As described in these papers, we can add spatial noise to the coordinates of a location. However, the more noise is added to the location sent to the LBS, the more the LBS utility decreases in our context, because the LBS may provide information that is not related to the raw predicted location, depending on the level of protection. In the case of the spatial cloaking presented by Gruteser and Grunwald in (Gruteser and Grunwald, 2003), the predicted location should only be sent if the user is considered k-anonymous, meaning that the user cannot be distinguished from at least k − 1 other users. This technique is unfortunately not realistic in our context and not easy to implement, especially when the mobility models of users are not centralized or shared on a common server. As detailed in (Kido et al., 2005), sending dummy locations is interesting for adding noise if and only if multiple predicted locations can be sent to a LBS. In our system, however, it is impossible to use this LPPM, because only one predicted location must be sent to a LBS as the answer to a predictive request supported by ResPred. Utilizing a rounding mechanism, as described in (Agrawal and Srikant, 2000; Krumm, 2007), can be considered, because the predicted location is changed into a new location corresponding to a nearest reference point. If we consider that space is discretized and described with multiple reference points (the vertices of each cell of a grid, for instance), the mechanism consists in modifying a location into a new location corresponding to the nearest reference vertex of the cell in which the location lies, as indicated in the papers cited previously. Cryptographic techniques could also be used to protect locations sent to third parties, as mentioned in (Hendawi and Mokbel, 2012), but our work is not focused on this kind of privacy/security strategy. To summarize, and to the best of our knowledge,


there is no LPPM that can find an appropriate tradeoff between utility and privacy in a location prediction context. For our utility/privacy evaluation, we chose the LPPMs closest to our work, namely the rounding and the spatial perturbation, as detailed in the previous section.

6.2

Location Prediction Requests and Models

As detailed in the complete survey in (Hendawi and Mokbel, 2012), various techniques exist to predict the future locations of users. In the literature, there exist different location predictive models for different types of location prediction requests, such as predicting a future location based on a time duration (Jeung et al., 2008; Sadilek and Krumm, 2012) or predicting the next location that will probably be reached by a user (Gambs et al., 2012; Gidófalvi and Dong, 2012; Ying et al., 2011). Some location prediction papers focus on other location-based predictive requests, such as the prediction of the staying time in a particular ROI or of when the user will reach or leave a ROI (Gidófalvi and Dong, 2012), the prediction of the number of users reaching a specific zone (Chapuis et al., 2016), and much more. Other works focus on range queries, which identify whether one or multiple users will be in a specific area during a specific time window. In (Xu et al., 2016), the authors describe a way to prune an order-k Markov chain model in order to efficiently compute long-term predictive range queries. The main focus of our paper, in terms of prediction, is to compute a future location of a user based on a time duration from the current time. In the literature, it is shown that some predictive models work better for near location predictions while others are more suited to distant location predictions. In (Jeung et al., 2008), the authors present a hybrid prediction model for moving objects. For near location predictions, their model uses motion functions, while for distant location predictions, it computes the predicted location based on trajectory patterns. The structure in which they store the trajectory patterns of a user is a trajectory pattern tree. However, they do not evaluate their model with real mobility traces. Their predictive model is close to our location trend model because they use the notion of patterns based on spatial clusters to fill their model. Nevertheless, the structure of their final model is clearly not the same as ours, because they create a trajectory pattern tree. Sadilek and Krumm propose a method in (Sadilek and Krumm, 2012) to predict long-term human mobility, up to several days in the future. Their method,

which can highlight strong patterns of users, uses a projected eigendays model that is carefully created by analyzing the periodicity of the mobility of a user as well as other mobility features. This work highlights that it is crucial to extract strong patterns for long-term predictions. The location trend model we propose in the ResPred system is close to the model presented by Sadilek and Krumm. However, our model differs in that it is based on ROIs rather than raw locations and takes fewer features into account.

7

CONCLUSION

In this paper, we presented a system called ResPred that computes predicted locations of a user for LBS. This system contains two components. The first component focuses on location prediction by including a predictive model based on location trends expressed as ROI(s). The second component aims at protecting the location privacy of the user by finding an appropriate tradeoff between a utility specified by the LBS and a location privacy preference indicated by the user, expressed as a maximum utility loss percentage. The results clearly show that our LPPM provides the best utility/location privacy tradeoff compared to two other existing LPPMs. In addition, the location trend model is promising if we look at the location prediction accuracy results, especially in the context of location prediction for a certain time duration in the future. Future work will consist in extending the evaluation to more users by finding richer user datasets, which is a real need for the research community. We will also design other inference attacks in order to evaluate location privacy, and possibly compare the computing cost of the different LPPMs. Finally, we will compare the location trend model to other existing close models for similar requests regarding short, mid and long-term location predictions.

REFERENCES

Agrawal, R. and Srikant, R. (2000). Privacy-preserving data mining. In ACM Sigmod Record, volume 29, pages 439–450. ACM.

Armstrong, M. P., Rushton, G., and Zimmerman, D. L. (1999). Geographically masking health data to preserve confidentiality. Statistics in Medicine, 18:497–525.

Ben Mokhtar, S., Boutet, A., Bouzouina, L., Bonnel, P., Brette, O., Brunie, L., Cunche, M., D'alu, S., Primault, V., Raveneau, P., Rivano, H., and Stanica, R. (2017). PRIVA'MOV: Analysing Human Mobility



Through Multi-Sensor Datasets. In NetMob 2017, Milan, Italy.

Chapuis, B., Moro, A., Kulkarni, V., and Garbinato, B. (2016). Capturing complex behaviour for predicting distant future trajectories. In Proceedings of the 5th ACM SIGSPATIAL International Workshop on Mobile Geographic Information Systems, pages 64–73. ACM.

Gambs, S., Killijian, M.-O., and Núñez del Prado Cortez, M. (2011). Show me how you move and I will tell you who you are. Trans. Data Privacy, 4(2):103–126.

Gambs, S., Killijian, M.-O., and Núñez del Prado Cortez, M. (2012). Next place prediction using mobility Markov chains. In MPM - EuroSys 2012 Workshop on Measurement, Privacy, and Mobility, Bern, Switzerland.

Gidófalvi, G. and Dong, F. (2012). When and where next: individual mobility prediction. In Proceedings of the First ACM SIGSPATIAL International Workshop on Mobile Geographic Information Systems, pages 57–64. ACM.

Gruteser, M. and Grunwald, D. (2003). Anonymous usage of location-based services through spatial and temporal cloaking. In Proceedings of the 1st International Conference on Mobile Systems, Applications and Services, pages 31–42. ACM.

Hendawi, A. M. and Mokbel, M. F. (2012). Predictive spatio-temporal queries: A comprehensive survey and future directions. In Proceedings of the First ACM SIGSPATIAL International Workshop on Mobile Geographic Information Systems, MobiGIS '12, pages 97–104, New York, NY, USA. ACM.

Jeung, H., Liu, Q., Shen, H. T., and Zhou, X. (2008). A hybrid prediction model for moving objects. In Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on, pages 70–79. IEEE.

Kido, H., Yanagisawa, Y., and Satoh, T. (2005). An anonymous communication technique using dummies for location-based services. In Proceedings of the International Conference on Pervasive Services 2005, ICPS '05, Santorini, Greece, July 11-14, 2005, pages 88–97.

Krumm, J. (2007). Inference attacks on location tracks. In Proceedings of the 5th International Conference on Pervasive Computing, PERVASIVE'07, pages 127–143, Berlin, Heidelberg. Springer-Verlag.

Kulkarni, V., Moro, A., and Garbinato, B. (2016). A mobility prediction system leveraging realtime location data streams: poster. In Proceedings of the 22nd Annual International Conference on Mobile Computing and Networking, pages 430–432. ACM.

Sadilek, A. and Krumm, J. (2012). Far out: Predicting long-term human mobility. In AAAI.

Shokri, R., Theodorakopoulos, G., Le Boudec, J.-Y., and Hubaux, J.-P. (2011). Quantifying location privacy. In Proceedings of the 2011 IEEE Symposium on Security and Privacy, SP '11, pages 247–262, Washington, DC, USA. IEEE Computer Society.

Xu, X., Xiong, L., Sunderam, V., and Xiao, Y. (2016). A Markov chain based pruning method for predictive range queries. In Proceedings of the 24th ACM


SIGSPATIAL International Conference on Advances in Geographic Information Systems, GIS '16, pages 16:1–16:10, New York, NY, USA. ACM.

Ying, J. J.-C., Lee, W.-C., Weng, T.-C., and Tseng, V. S. (2011). Semantic trajectory mining for location prediction. In Proceedings of the 19th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 34–43. ACM.

Zang, H. and Bolot, J. (2011). Anonymization of location data does not work: A large-scale measurement study. In Proceedings of the 17th Annual International Conference on Mobile Computing and Networking, MobiCom '11, pages 145–156, New York, NY, USA. ACM.

Elcano: A Geospatial Big Data Processing System based on SparkSQL
Jonathan Engélinus and Thierry Badard
Centre for Research in Geomatics (CRG), Laval University, Québec, Canada
[email protected], [email protected]

Keywords:

Elcano, ISO-19125, Magellan, Spatial Spark, GeoSpark, Geomesa, Simba, Spark SQL, Big Data.

Abstract:

Big data are at the core of many scientific and economic issues, and their volume is continuously increasing. As a result, the need for management and processing solutions has become critical. Unfortunately, while most of these data have a spatial component, almost none of the current systems are able to manage it. For example, while Spark may be the most efficient environment for managing Big data, it is only used by five spatial data management systems. None of these solutions fully complies with ISO standards and OGC specifications in terms of spatial processing, and many of them are neither efficient enough nor extensible. The authors seek a way to overcome these limitations. Therefore, after a detailed study of the limitations of the existing systems, they define a system in greater accordance with the ISO 19125 standard. The proposed solution, Elcano, is an extension of Spark complying with this standard and allowing the SQL querying of spatial data. Finally, tests demonstrate that the resulting system surpasses the currently available solutions on the market.

1

INTRODUCTION

Today, it is becoming crucial to develop systems able to manage huge amounts of spatial data efficiently. Indeed, the convergence of the Internet and cartography has brought forth a new paradigm called "neogeography". This new paradigm is characterized by the interactivity of location-based contents and the possibility for the user to generate them (Mericksay and Roche, 2010). This phenomenon, in conjunction with the arrival on the market of new sensors like the GPS chips in smartphones, has resulted in an inflation of the production and retrieval of spatial data (Badard, 2014). This new interest in cartography makes the process more complex, as it becomes more and more difficult to manage and represent such large quantities of data with conventional tools (Evans et al, 2014). The Hadoop environment (White, 2012), currently one of the most important projects of the Apache Foundation, is a de facto standard for the processing and management of Big data. This very popular tool, involved in the success of many startups (Fermigier, 2011), implements MapReduce (Dean and Ghemawat, 2008), an algorithm that allows the distribution of data processing among the

servers of a cluster for faster execution. The data to process are also distributed among the servers by the Hadoop Distributed File System (HDFS), which is provided by default with Hadoop. The result is a high degree of horizontal scalability, which can be defined as the ability to linearly increase the performance of a multi-server system to meet the user's requirements in terms of processing time. A real ecosystem of interoperable elements has been built up around Hadoop, which enables the management of such various aspects as streaming (e.g. Storm), serialization (e.g. Avro) and data analysis (e.g. Hive). The University of California, Berkeley's AMPLab developed a new element of the Hadoop ecosystem, since taken over by the Apache Foundation, namely Spark (http://spark.apache.org/), which offers an interesting alternative to HDFS and MapReduce. In Spark, data and processing code are distributed together in small blocks called RDDs ("Resilient Distributed Datasets") across the RAM of the whole cluster. This architectural choice, which strongly limits hard drive accesses, makes Spark up to ten times faster than conventional Hadoop use in some cases (Zaharia et al, 2010), although at the cost of a greater RAM load (Gu and Li, 2013). Furthermore, a



part of Spark called Spark SQL (Armbrust et al, 2015) dresses up Spark RDDs with a supplementary layer called "DataFrames", which allows the data received by Spark to be organized into temporary tables and queried with the SQL language. Spark SQL also minimizes the duration of Spark processes, thanks to the strategic optimization of queries and the serialization of data. Finally, it allows the definition of personalized data types (UDT: "User Defined Type") and personalized functions (UDF: "User Defined Function"), which respectively make new kinds of data and new processing available from SQL. This opportunity to query Big data with SQL is of paramount importance, as it helps in their analysis, with the goal of a better understanding of the phenomena they represent on the ground. It also empowers analysts with new analytical capabilities, using a query language they already master day to day. Together with the availability of a growing amount of geospatial data, it is profitable to use these capabilities to analyze the spatial component of this huge amount of information, which, according to Franklin, is present in 80% of all business data (Franklin and Hane, 1992). According to a frequently cited study by the McKinsey consultancy (Manyika et al, 2011), a better use of the spatial localization of Big data could yield 100 billion USD to service providers and in the range of 700 billion USD to end users. Lastly, spatial Big data management lies at the heart of many important economic, scientific and societal issues. In this respect, Spark again appears as a promising solution, because it processes spatial data more than 7 times faster than Impala, another Hadoop element managing SQL (You, et al, 2015). Today, some systems relying on Hadoop enable the management of massive spatial data, such as Hadoop GIS (Aji et al, 2013), Geomesa (Hugues et al, 2015) and Pigeon (Eldawy and Mokbel, 2014). But they are prototypes more than mature technologies (Badard, 2014). In addition, most of them rely only on the core version of Hadoop, without fully exploiting the processing power of the RAM as Spark does. For example, Spatial Hadoop (Eldawy and Mokbel, 2013) only uses the MapReduce algorithm of Hadoop. Among these systems, only five propose a management of spatial data relying on Spark. The first two, Spatial Spark (You, et al, 2015) and GeoSpark (Yu et al, 2015), only add a management of the spatial component to the basic version of Spark, which does not fully take advantage


of all the capabilities (e.g. SQL querying) and performance of Spark. Hence, the current Spatial Spark version can only interact with data in command line mode instead of managing SQL queries. GeoSpark only uses its own spatial extension of the Spark RDD type, which does not directly comply with Spark SQL (Yu, 2017). The third, Magellan (Ram, 2015), defines spatial data types directly available in Spark SQL, but without correctly managing some spatial operations like the union of disjoint polygons, symmetric differences involving more than one geometry and the creation of an envelope. The fourth, Simba (Xie et al, 2016), enables the querying of data in SQL for points only and without the possibility to trigger standard spatial functions. Finally, the fifth prototype is the Geomesa extension, which can be used from Spark. This system is nevertheless limited in the spatial operations it offers, because it has been natively designed only for searching for points included in an envelope. Furthermore, it presents limited performance (Xie et al, 2016) in comparison with other solutions. That could apparently be explained by the fact that it imposes the use of a key-value store technology (Accumulo, https://accumulo.apache.org/) to store the spatial data to process. As a conclusion, there is presently no system for the management of geospatial data that fully manages all kinds of 2D geometry data types and that enables their efficient and actionable SQL querying. Each of the five studied prototypes, which pursue a similar goal, presents limited capabilities both in the geometry types it supports and in the spatial processing it offers. Details about this last point are given in the next section.
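As background for the UDT/UDF extension mechanism of Spark SQL mentioned above, the snippet below is a purely illustrative PySpark sketch (our own example, not Elcano code) showing how a user-defined function is registered and then invoked from a SQL query; spatial types and functions would be plugged into Spark SQL through the same kind of registration mechanism.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("udf-demo").getOrCreate()

# Register a trivial user-defined function so that it becomes callable from SQL.
spark.udf.register("squared", lambda x: float(x) ** 2, DoubleType())

# Expose a DataFrame as a temporary table and query it with SQL.
spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ["v"]).createOrReplaceTempView("t")
spark.sql("SELECT v, squared(v) AS v2 FROM t").show()
```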

2

LIMITS IN THE GEOSPATIAL CAPABILITIES SUPPORTED BY CURRENT SOLUTIONS

In order to assess the capabilities of the different geospatial Big data management systems currently relying on Spark to fully manage the 2D spatial component, the ISO-19125 standard can profitably be used as a guideline. Indeed, the two parts of this standard respectively describe the 2D geometry types and the geospatial functions and operators (ISO 19125-1, 2004) and their expression in the SQL language (ISO 19125-2, 2004) that a system must implement to basically store 2D geospatial data and support its querying and its analysis in an


interoperable way. In this context, we will first introduce the geometry types supported by the different systems. Then we will analyze which spatial functions they give access to and whether they can be extended to easily implement the missing ones. Finally, we will study how they manage the spatial indexation issue, which is crucial when dealing with geospatial data.

2.1

Geometry Data Types

A system complying with ISO 19125-1 is supposed to handle the seven main 2D geometry types that can be built by linear interpolation. These can be divided into three simple types (point, polyline and polygon) and four composite types (multipoint, multipolyline, multipolygon and geometry collection). Here is a study of how the current systems meet this standard. Spatial Spark and GeoSpark integrate all these types of geometries because their model relies on the use of the JTS ("Java Topology Suite", https://www.locationtech.org/proposals/jts-topologysuite) library, which has been designed to meet the ISO standards and OGC recommendations (Davis and Aquino, 2003). Geomesa also manages all the geometries in its current version (Commonwealth Computer Research, 2017), while Simba only manages the point type. The case of the HortonWorks Magellan system is more mixed. It enables the processing of points, polylines and polygons. This may seem sufficient if one assumes, as one of the designers of the system (Sriharasha, 2016) does, that compound geometries are reducible to arrays of geometries. But in reality, such an approach can only lead to a dysfunctional system. Indeed, since it is not possible to explicitly create an actual complex geometry, such arrays are not allowed as operands of a spatial function, and returning them as the result of a spatial operation like the union of disjoint polygons causes a type error. Beyond its own implementation, Magellan's limitations are also due to the use of ESRI Tools as a spatial library. The latter does not make it possible to process all the 2D geometry types defined by the ISO-19125 standard: it lacks the geometry collection type, while the multi-polygon type is only partially implemented. Furthermore, the adaptation of WKT ("Well-Known Text") provided by ESRI Tools does not comply with the ISO standards and the OGC recommendations. The limitations of the different solutions studied in relation to the requirements of ISO-19125-1 are summarized in Table 1. Those related to ESRI Tools have been added to give an idea of the limits they impose on the evolution of Magellan.

Table 1: Coverage of the different 2D geometry types specified by ISO-19125 in the studied prototypes.

                 GeoSpark   Spatial Spark   Simba   Geomesa   Magellan   ESRI
Point            Yes        Yes             Yes     Yes       Yes        Yes
Polyline         Yes        Yes             No      Yes       Yes        Yes
Polygon          Yes        Yes             No      Yes       Yes        Yes
Multi-point      Yes        Yes             No      Yes       No         Yes
Multi-polyline   Yes        Yes             No      Yes       No         Yes
Multi-polygon    Yes        Yes             No      Yes       No         In part
Collection       Yes        Yes             No      Yes       No         No

2.2

Spatial Functions and Operators

(ISO 19125-2, 2004) specifies the spatial functions (relations, operations, metric functions and methods) that a spatial data management system should implement in SQL to comply with the ISO 19125-1 standard. It does not specify the way these methods have to be implemented; it only defines their signatures. These functions define the minimal set of operations a system must implement to enable basic and advanced spatial analysis capabilities. Even if these functions have been defined for querying data in classic spatial DBMS, their usage in geospatial Big data management systems still pertains. Nevertheless, the application of the ISO-19125-2 standard requires a system allowing SQL queries and user-defined SQL functions. This section details how the five studied systems partly implement the standard and describes their extension capabilities.

Spatial Spark only uses the core of Spark. Indeed, it only works with RDDs, not with DataFrames or SQL queries. In this context, the application of the ISO 19125-2 standard to Spatial Spark seems impossible without a full reimplementation. As we saw, GeoSpark extends the RDD type of Spark, and is therefore not directly compatible with Spark SQL. Nevertheless, one of its developers indicates that the integration of this point is planned for a future version of the system and that there would be an indirect way of turning these RDDs into DataFrames (Yu, 2017). But he describes neither a general process for it, nor how to apply SQL queries afterwards. Indeed, the current version of GeoSpark does not seem to be compliant with the ISO 19125-2 standard because not all geometry types can be managed from SQL queries.

Simba released its own adaptation of Spark SQL, which might enable the use of SQL queries and the



creation of User Defined Functions. In practice, however, the only accessible geometry is the point. Furthermore, the syntactic analyzer does not always work properly: for example, it forces the user to write "IN" before "POINT(x, y)" even outside any inclusion context. Simba is therefore not a mature and reliable solution that could meet the ISO 19125-2 standard.

Until recently, Geomesa's Spark extension only used Spark's core. A recent version tries to integrate Spark SQL. However, this solution remains constrained by the mandatory use of the CQL format and the Accumulo database (Commonwealth Computer Research, 2017). Indeed, Geomesa does not allow an autonomous and agnostic implementation of ISO-19125-2.

Magellan does not directly manage SQL either, but it defines a User Defined Type for the point. It is therefore tempting to assume that the addition of User Defined Functions to its model should be enough to provide the SQL functions of the ISO-19125-2 standard. In practice, however, the extension of Magellan with these functions only covers two thirds of the spatial relations, half of the spatial operations and a small part of the spatial methods specified by the ISO-19125-2 standard. These limitations are due both to implementation errors and to the choice of the ESRI Tools library, which only partially meets the ISO-19125-2 standard.

In their current state, therefore, none of the studied systems totally comply with the ISO-19125 standard.

2.3

Spatial Indexation Management

Spatial indexation can be defined as the reorganization of spatial data, typically by using their proximity relations, with the purpose of accelerating their processing (Eldawy and Mokbel, 2015). Four of the studied systems provide a spatial indexation component, but it is never both efficient and extensible. The spatial indexation component of Spatial Spark directly uses the methods of the JTS spatial library, which is not designed for Big data processing in a multi-server environment. GeoSpark proposes a more integrated and efficient spatial indexation module (Yu, 2017), but without the possibility of managing it with SQL queries. The indexation component of Simba is described as more efficient by its developers (Xie et al, 2016), but has the important limitations and bugs we already covered. Finally, Geomesa offers poor performance because it relies on a specific database system (Xie et al, 2016), which drastically increases the processing time.


2.4

Synthesis of Limitations

Table 2 sums up the main limitations of the studied systems. It first recalls their most problematic limitations. Then it recalls the geometry types they support and, as a result, their degree of conformance to the ISO 19125-1 standard. Next, it indicates whether they manage SQL and whether they comply with the ISO-19125-2 standard.

Table 2: Limitations of current spatial Big data processing systems.

System          Main limitation                      Geometries    ISO-19125-1   SQL management             ISO-19125-2              Spatial indexation
Magellan        Uses a limited spatial library       Only simple   In part       No, but extensible         In part (by extension)   No
Spatial Spark   Inextensible to SQL                  All           Yes           No                         No                       Yes, but not efficient or not extensible
GeoSpark        Not compatible with Spark SQL        All           Yes           No                         No                       Yes, but not efficient or not extensible
Simba           Syntactic bugs, not extensible       Only point    In part       Yes (replaces Spark SQL)   No                       Yes, but not efficient or not extensible
Geomesa         Forces the use of a NoSQL database   All           Yes           Yes, limited by CQL        In part                  Yes, but not efficient or not extensible

The next section presents a new system designed for the efficient, interoperable management and rapid processing of geospatial Big data (vector data only). It relies on Spark and overcomes the limitations identified in current state-of-the-art solutions. This prototype is named Elcano. Its release as an open source project has not yet taken place but is envisaged.

3

PRESENTATION OF ELCANO

The main objective guiding the design of Elcano is to model a spatial Big data processing and management system that surpasses the other systems studied here. It must therefore integrate every 2D geometry type defined in the ISO-19125 standard. It must also enable the use of the associated spatial functions, in order to improve the analysis of spatial phenomena. All spatial relations, operations and methods defined by the ISO-19125-2 standard must thus be implemented by Elcano. For example, a call to the SQL function ST_Intersects has to indicate whether two generic geometry objects intersect or not. The system must also allow spatial data to be loaded in a simple and generic way, so that it is easy to feed and to extend toward other formats. It must also ensure data persistence in memory, in a compact manner that enables faster processing of the geospatial component. It has to be easily extensible in order to potentially support new geometry types or extensions to the geometry types defined in the ISO-19125 standard (for example the inclusion of elevation in the definition of geometric features, i.e. 2.5D data). Finally, it must offer good processing performance in comparison with current processing systems. A model seeking to meet these objectives is presented and justified below.
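Concretely (a hedged illustration in a Spark shell session, not Elcano's own code), the ST_Intersects requirement amounts to being able to answer a query such as the one below, assuming the function has been registered as an SQL function and that geometries are expressed in WKT; the literal geometries are invented for the example.

// True if the point lies inside the polygon; the call must work for any
// pair of generic geometry objects.
val intersects = spark.sql(
  """SELECT ST_Intersects('POLYGON((0 0, 2 0, 2 2, 0 2, 0 0))',
    |                     'POINT(1 1)') AS result""".stripMargin)
  .first()
  .getBoolean(0)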

3.1

Architecture

Figure 1 illustrates the model on which Elcano is based. In this model, the classes of the geometry package integrate the elementary geometries and spatial functions linked to Elcano. The loader package enables the use of SQL spatial functions. Data persistence for processing and data retrieval is managed by the "Table" class, together with the support of the conversion methods from the GeometryFactory class. Finally, the index package deals with the indexation of spatial data for its faster processing. Details on the way these different capabilities are implemented are given in the next sections.

Figure 1: Elcano's model.

3.1.1 2D Geometry Types Management

The geometry package of Elcano contains a concrete class for each geometry type described in the ISO 19125 standard. These classes use the JTS spatial library, which is specifically designed to comply with many ISO standards (including ISO 19125) and the OGC recommendations (Davis and Aquino, 2003). This choice avoids the problems faced by Magellan, which are due to the integration of an inadequate spatial library, as stated above. The system could have used the JTS classes directly, as Spatial Spark and GeoSpark do, but for optimization purposes it seemed preferable not to be constrained by the implementation of a chosen spatial library. To this end, the Elcano geometry package uses a JTS-independent class hierarchy by applying the "proxy" design pattern (Gamma et al, 1994). This design choice also makes it possible to accelerate, whenever possible, the JTS methods by overriding them.
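A minimal sketch of that proxy arrangement is given below: an Elcano-side geometry hierarchy delegating to a wrapped JTS geometry, so that individual methods can later be overridden with faster implementations. Class and method names are illustrative only and are not Elcano's actual API; depending on the JTS version, the package may be com.vividsolutions.jts instead of org.locationtech.jts.

import org.locationtech.jts.geom.{Geometry => JtsGeometry}
import org.locationtech.jts.io.WKTReader

// JTS-independent abstraction exposed to the rest of the system.
abstract class Geometry {
  def intersects(other: Geometry): Boolean
  def asText(): String
}

// Proxy: delegates to the wrapped JTS geometry; any method can be overridden
// later with an optimized implementation without changing client code.
class JtsProxyGeometry(private val jts: JtsGeometry) extends Geometry {
  override def intersects(other: Geometry): Boolean = other match {
    case o: JtsProxyGeometry => jts.intersects(o.jts)
    case _                   => false
  }
  override def asText(): String = jts.toText
}

object GeometryFactory {
  private val reader = new WKTReader()
  def fromWKT(wkt: String): Geometry = new JtsProxyGeometry(reader.read(wkt))
}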

3.1.2 Spatial SQL Functions Management

In order to make the spatial functions and operators defined in ISO 19125 available as SQL functions in Elcano, different User Defined Functions (UDF) have been defined. All these functions are in fact shortcuts to the different methods supported by the different geometry types (i.e. the classes included in the geometry package) and specified in the ISO 19125 standard. The build() method of the SqlLoader class in the loader package is in charge of declaring all these functions at the initialization stage of the application.
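The following hedged sketch shows how such declarations can be performed with Spark SQL's UDF registration mechanism; it reuses the GeometryFactory from the proxy sketch above and, for brevity, passes geometries as WKT strings rather than through a User Defined Type, so it should not be read as Elcano's actual SqlLoader.

import org.apache.spark.sql.SparkSession

object SqlLoaderSketch {
  // Declare ISO 19125-2 style functions as Spark SQL UDFs at application start-up.
  def build(spark: SparkSession): Unit = {
    spark.udf.register("ST_Intersects", (a: String, b: String) =>
      GeometryFactory.fromWKT(a).intersects(GeometryFactory.fromWKT(b)))

    spark.udf.register("ST_AsText", (wkt: String) =>
      GeometryFactory.fromWKT(wkt).asText())
  }
}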

3.1.3 Spatial Data Persistence

Elcano provides a unified procedure for the loading of all 2D geometry types and their persistence. The Table class of Elcano enables the definition of geometric features in WKT, a concise textual format defined in the ISO 19125 standard. Elcano thus makes it possible to load tabular data (for example from a CSV file where the geometry component of each row is defined in WKT) in the form of an SQL temporary table. The management of more specific formats like JSON (Bray, 2014), GeoJSON or



possible spatial extensions to Big data specific file formats like Parquet (Vorha, 2016) could also be easily added to the system by simply inheriting the Table class.
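A hedged sketch of this loading path is shown below, assuming a CSV file whose "geom" column holds the WKT geometry; the Table class name follows the paper, but the code and file layout are illustrative, and the Parquet subclass only indicates how other formats could be added by inheritance.

import org.apache.spark.sql.{DataFrame, SparkSession}

// Minimal stand-in for Elcano's Table class: loads a CSV whose "geom" column
// contains WKT and exposes it as an SQL temporary table.
class Table(spark: SparkSession, path: String, name: String) {
  def load(): DataFrame = {
    val df = spark.read.option("header", "true").csv(path)
    df.createOrReplaceTempView(name)
    df
  }
}

// Other formats can be supported by simply inheriting the Table class.
class ParquetTable(spark: SparkSession, path: String, name: String)
    extends Table(spark, path, name) {
  override def load(): DataFrame = {
    val df = spark.read.parquet(path)
    df.createOrReplaceTempView(name)
    df
  }
}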

3.1.4 Data Types Extensibility

The GeometryFactory class implements the "abstract factory" design pattern (Vlissides et al, 1995) and provides the extensibility of Elcano. Geometry types other than those defined in the ISO-19125 standard could thus be added in the future, such as Triangles and TINs in order to manage DTMs (Digital Terrain Models).
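A minimal sketch of this extension point, under the same illustrative naming as the previous sketches: clients depend only on a factory trait, so a concrete factory producing new types (e.g. TINs) can be substituted without changing client code.

// Abstract factory: client code asks the factory for geometries and never
// instantiates concrete classes itself, so new families can be plugged in.
trait AbstractGeometryFactory {
  def fromWKT(wkt: String): Geometry
}

object DefaultGeometryFactory extends AbstractGeometryFactory {
  override def fromWKT(wkt: String): Geometry = GeometryFactory.fromWKT(wkt)
}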

3.1.5 Spatial Indexation

The index package of Elcano contains all the classes in charge of the spatial indexation of the data stored in Elcano. It drastically speeds up all spatial processes. This component is inspired by the one implemented in Spatial Spark, but with some hooks for better performance. Its use is illustrated in the benchmark section. Its detailed description is, however, out of the scope of the present paper; it will be detailed in a future publication.

4

BENCHMARK

The present section compares the performance of Elcano with another spatial data management and processing system relying on Spark, namely Spatial Spark. They are also compared with a well-known and widely used classical spatial database management system (DBMS): PostGIS (Obe and Hsu, 2015). Spatial Spark has been chosen among the studied systems that manage spatial indexation because it is the only one that could be extended to support SQL queries (albeit through an important reimplementation). It is therefore the only one of the tested prototypes that really compares to Elcano. As for PostGIS, it is, from our point of view, a reference implementation of the ISO 19125 standard against which we can compare. In addition, it proposes efficient and reliable spatial indexation methods. For the needs of this benchmark, Elcano and Spatial Spark have been installed on a cluster of servers using a master server with 8 GB of RAM and nine slave servers with 4 GB of RAM. Each of these computers uses the CentOS 6.5 operating system and eight Intel Xeon 2.33 GHz processors. PostGIS has been optimized with the pgTune library


(https://github.com/le0pard/pgtune) and tested in comparable conditions. In each of the 3 tests performed, we count the number of elements resulting from a spatial join between two tables. We group the elements of these tables by pair, according to a given spatial relation, namely the intersection. This spatial relation has been chosen because it implies complex and sometimes time-consuming processing. The use of a fast and reliable spatial indexation system is also of importance in such a process. The contents of the tables used in the tests are fixed; the management of changing data is out of the scope of the tests.

Test 1 compares the execution time of the three systems as the data volume increases. It consists in counting the intersections between an envelope around the Quebec province and seven sets of points randomly distributed in an envelope around Canada. These seven sets contain respectively 1 000, 10 000, 100 000, 1 million, 10 million, 100 million and one billion points. Table 3 presents a synthesis of the first test results for the three studied systems. In order to facilitate their comparison, the reported duration cumulates the indexation time and the first query time. It appears that the performance of Elcano is better than that of PostGIS and Spatial Spark beyond one million points. PostGIS is the best choice for lower volumes but encounters a significant slowdown after a certain threshold: it requires many hours to process 100 million points against five minutes for Elcano. The difference between Spatial Spark and Elcano is more tenuous but increases in favor of Elcano as the data volume increases. The drop in PostGIS performance when the data volume increases is probably explained by its weak horizontal scalability: this system is not designed for Big data management. In return, the performance of Elcano when compared to Spatial Spark can be explained by its use of Spark SQL. Indeed, the latter uses specific query optimizations and Spark's caching system (Armbrust et al, 2015). But for low data volumes (under one million points), the classical PostGIS solution is better, probably because of its simpler processing architecture. In a similar way, the better performance of Spatial Spark between one and ten million points can probably be explained by the additional processing imposed by Elcano's use of Spark SQL.
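Assuming a Spark shell session where the ST_Intersects UDF has been registered as in the sketches of Section 3, the Test 1 join can be phrased as a single SQL count; the table and column names below are illustrative and are not those of the benchmark code.

// Count the points of the Canada set that intersect the Quebec envelope.
val result = spark.sql(
  """SELECT COUNT(*) AS n
    |FROM quebec_envelope q JOIN canada_points p
    |ON ST_Intersects(q.geom, p.geom)""".stripMargin)

println(result.first().getLong(0))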


Table 3: Test 1 – Processing time as the data volume increases.

Volume (points)   PostGIS (ms)         Spatial Spark (ms)   Elcano (ms)
1 000             234                  6 543                9 516
10 000            326                  6 622                9 714
100 000           3 783                8 301                9 030
1 000 000         29 898               8 301                10 747
10 000 000        269 257              20 487               17 099
100 000 000       5 752 821            55 017               37 378
1 000 000 000     More than 10 hours   399 100              273 074

Test 2 compares the horizontal scalability of Elcano and Spatial Spark for a number of servers ranging from one to nine. It measures the time needed to count the intersections between the envelope of the Quebec province and one billion points randomly distributed in a bounding box of Canada. PostGIS performance is not measured for this test because the previous test clearly underlines its poor performance for large data volumes, and there is no way to distribute the processing between many servers since PostgreSQL has not been designed for horizontal scalability. Table 4 presents a synthesis of the results of this second test. Spatial Spark and Elcano both appear to have good horizontal scalability. Furthermore, the execution time of the two systems presents a similar drop from one to nine servers: 87.4% for Spatial Spark and 87.2% for Elcano. But Elcano remains approximately 1.5 times faster than Spatial Spark regardless of the number of servers.

Table 4: Test 2 – Horizontal scalability as the number of servers increases.

Servers   Spatial Spark (ms)   Elcano (ms)
1         3 349 414            2 196 344
2         1 718 672            1 123 153
3         1 143 790            762 536
4         875 284              588 401
5         696 195              473 635
6         586 211              391 297
7         511 111              340 784
8         456 446              314 796
9         423 647              280 761

Elcano's superior raw speed in this second test can probably be explained by its use of Spark SQL. Otherwise, the scalability rates of the two systems are very close, maybe because both rely on the JTS spatial library for the implementation of the spatial analysis algorithms.

Test 3 compares more finely the performances of PostGIS, Spatial Spark and Elcano. It counts the intersections between one million points in an envelope of Canada and the points in a copy of this set. Therefore, a total of 100 billion intersection tests (spatial join) are processed. The execution time is split between indexation time, first query time (cold start) and second query time (hot start). Hot start queries are more representative of the response times in a running production environment. Indeed, while indexing is only necessary once for the two given tables, an SQL query must be started for each spatial join operation applied to them. Table 5 offers a summary of the results for this third test. PostGIS presents a spatial indexation time a bit shorter than Spatial Spark, but the execution time of its first SQL query is then much longer. Elcano presents the best performance in all cases: its indexation time is five times lower than with PostGIS and the execution of its first query is two times faster than with Spatial Spark. Elcano is also the only solution to execute a second SQL query on the same data significantly faster than the first: the second execution is 26 times faster. This last point can probably be explained by Spark SQL's caching system.

Table 5: Test 3 – Execution time split between indexation time, first query time and second query time.

Solution        Indexation time (ms)   First query (ms)   Second query (ms)
PostGIS         29 756                 100 742            100 742
Spatial Spark   36 824                 36 824             36 824
Elcano          13 578                 15 754             1 393

To sum up, above a given data volume, Elcano surpasses PostGIS and Spatial Spark in terms of execution speed. It presents a scalability similar to that of Spatial Spark, but a better execution time as the number of servers increases.



5

CONCLUSION AND PERSPECTIVES

In conclusion, while Big data with a spatial component are in the midst of many scientific, economic and societal issues and while Hadoop has become a mature de facto standard for Big data processing, the number of processing and management systems for this type of data that use the Hadoop environment and are available on the market is limited. All available solutions are only prototypes with limited capabilities. Moreover, only five solutions manage spatial data from Spark, which is perhaps the most promising Hadoop module for this type of processing, and none of these systems can entirely handle the geometry types and SQL spatial functions specified in the ISO 19125 standard.

To tackle this issue, the present paper proposes a new spatial Big data processing and management system relying on Spark: Elcano. It is based on the SQL library of Spark and uses the JTS spatial library for its compliance with the ISO standards. Thanks to this approach, all SQL functions and operators defined by the ISO 19125 standard are fully supported. The proposed model on which Elcano relies is not a simple wrapper around JTS: it comes with the possibility of using SQL spatial queries with a data model that can evolve. Furthermore, it integrates the geometric types in a Big data context and comes with a scalable spatial indexation system which will be detailed in an upcoming article. In addition, Elcano offers better performance than Spatial Spark and a similar scalability. The detailed study of all the possibilities in terms of spatial indexation management remains, however, to be done. A way to address it could be to adapt the non-Hadoop solution defined by (Cortés et al, 2015) to the Spark environment, but there are also many classical spatial data indexation modes that could be explored and adapted in order to fulfill the Big data processing requirements.

In a larger perspective, it could be interesting in the near future to enable the management of elevation together with dedicated data types such as Triangles and TINs in the current model. Raster data types, maybe via the use of RasterWKT, are also considered for inclusion. That would make it possible to apply the model to many new challenging situations, such as the processing of large collections of images coupled with vector data analytics capabilities, or the building and analysis of high resolution digital elevation models (DEM) or DTM without being

compelled to split them into tiles in order to be able to process them as a whole.

The current version of Elcano manages only batches of data, but adding the possibility of processing and displaying continuously received data (streaming) could be very interesting (Engélinus and Badard, 2016). Such an extension could indeed enable the design of real-time geospatial analytical tools that would help users (analysts, decision makers, …) make more informed decisions on more up-to-date data and in a shorter period of time. Furthermore, it could provide some advanced features that deal with the temporal dimension of the data, for example by excluding all data outside a defined temporal window (Golab, 2006). Such extensions could allow the modelling of such data as spatiotemporal events or flows and maybe the dynamic detection of "hot spots" (Maciejewsky et al, 2010) in the stream. But, even if Spark can technically handle streaming, taking it into account would induce several conceptual and technical problems. It would be necessary to define a mode of spatial indexation able to manage fluctuating data. Furthermore, what would be the visual variables to use for this type of data in order to represent their dynamic structure? Those defined by Bertin in 1967 (Bertin, 1967) and widely used since are inappropriate because of their strict limitation to a static spatiotemporal context. More recent works have tried to add visual variables to Bertin's model in order to represent motion (MacEachren, 2001; Fabrikant and Goldsberry, 2005), but their application in a context of Big data remains unaddressed. Furthermore, once these conceptual issues are solved, the definition of a system that is effectively able to represent and manage streamed data remains to be done. It could not be a simple add-on to classic geographic information systems (GIS): they are designed to be efficient for classical data only and are not able to deal with the volume and velocity that Big data implies. How then is it possible to manage and represent fluctuating Big data in an efficient way, without losing the horizontal scalability offered by Hadoop? This rich set of problems seems to require the definition of a new type of GIS. This will be the bottom line of our future research works.

ACKNOWLEDGEMENTS

We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC),


funding reference number 327533. We also thank Université Laval and especially the Center for Research in Geomatics (CRG) and the Faculty of Forestry, Geography and Geomatics for their support and their funding. Thanks to Cecilia Inverardi and Pierrot Seban for their thorough proofreading and to Judith Jung for her advice in the writing of this paper.

REFERENCES A. Aji et al, 2013. “Hadoop GIS: a high performance spatial data warehousing system over mapre- duce”. In: Proceedings of the VLDB Endowment 6.11, p. 1009–1020. M. Armbrust et al, 2015. “Spark SQL: Relational data processing in spark”. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, p. 1383–1394. T. Badard, 2014. “Mettre le Big Data sur la carte : défis et avenues relatifs à l’exploitation de la localisation”. In: Colloque ITIS - Big Data et Open Data au coeur de la ville intelligente. Québec : CRG. A. Eldawy, M. F. Mokbel, 2013. "A Demonstration of SpatialHadoop: An Efficient MapReduce Framework for Spatial Data." Proceedings of the VLDB Endowment. J. Bertin, 1967. “Semiologie Graphique: Les Diagrammes, Les Reseaux, Les Cartes”. T. Bray, 2014. The javascript object notation (json) data interchange format, RFC 7158. R. Cortés et al, 2015. “A Scalable Architecture for SpatioTemporal Range Queries over Big Location Data”. In: Network Computing and Applications, IEEE 14th International Symposium, p. 159–166. M. Davis, J. Aquino, 2003. Jts topology suite technical specifications. J. Dean, S. Ghemawat, 2008. “MapReduce: simplified data processing on large clusters”. In: Communications of the ACM 51.1, p. 107–113. J. Engélinus, T. Badard, 2016. “Towards a Real-Time Thematic Mapping System for Strea-ming Big Data”. In: GIScience, Montreal. E. Gamma et al, 1994. Design Patterns: Elements of Reusable Object-Oriented Software. A. Eldawy, M. F. Mokbel, 2014. “Pigeon: A spatial mapreduce language”. In: Data Engineering, 2014 30th International Conference on IEEE, p. 1242–1245. A. Eldawy et M. F. Mokbel, 2015. “The Era of Big Spatial Data: A Survey”. In: Information and Media Technologies 10.2, p. 305–316. M. R. Evans et al, 2014. “Spatial big data”. In: Big Data: Techniques and Technologies in Geoinformatics, p. 149. S. I. Fabrikant, K. Goldsberry, 2005. “Thematic relevance and perceptual salience of dynamic geovisualization displays”. In: Proceedings, 22th ICA/ACI

International Cartographic Conference, Coruna. S. Fermigier. 2011. Big data et open source: une convergence inevitable? URL: http: //projetplume.org. C. Franklin, P. Hane, 1992. “An Introduction to Geographic Information Systems: Linking Maps to Databases [and] Maps for the Rest of Us: Affordable and Fun.” In: Database 15.2, p. 12–15. L. Golab, 2006. “Sliding window query processing over data streams”. Doctorate thesis. University of Waterloo. L. Gu, H. Li, 2013. “Memory or time: Performance evaluation for iterative operation on hadoop and spark”. In: High Performance Computing and Communications & IEEE 10th International Conference, Embedded and Ubiquitous Computing. 2013, p. 721–727. J. N. Hugues et al, 2015. “GeoMesa: a distributed architecture for spatio-temporal fusion”. In: SPIE Defense + Security. International Society for Optics et Photonics. 94730F. ISO 19125-1, 2004. Geographic information -- Simple feature access -- Part 1: Common architecture. ISO/TC 211, 42 pages. URL: https://www.iso.org/standard/ 40114.html. ISO 19125-2, 2004. Geographic information -- Simple feature access -- Part 2: SQL option. ISO/TC 211, 61 pages. URL: https://www.iso.org/standard/40115.html. A. M. MacEachren, 2001. “An evolving cognitivesemiotic approach to geographic visualization and knowledge construction”. In: Information Design Journal 10.1, p. 26–36. R. Maciejewsky et al, 2010. “A visual analytics approach to understanding spatiotemporal hots- pots”. In: IEEE Transactions on Visualization and Computer Graphics 16.2 p. 205– 220. J. Manyika et al, 2011. “Big data: The next frontier for innovation, competition, and productivity”. In: The McKinsey Global Institute. B. Mericksay, S. Roche, 2010. “Cartographie numérique en ligne nouvelle génération: impacts de la néogéographie et de l’information géographique volontaire sur la gestion urbaine participative”. In: Nouvelles cartographie, nouvelles villes, HyperUrbain. R. O. Obe et L. S. Hsu, 2015. PostGIS in action. Manning Publications Co.. Commonwealth Computer Research, 2017. Apache Spark Analysis. URL: http://www.geomesa.org/documenta tion/tutorials/spark.html. S. Ram, 2015. Magellan: Geospatial Analytics on Spark. URL: http://hortonworks.com/blog/magellan-geospati al-analytics-in-spark/. R. Sriharasha, 2016. Magellan’s Github - issue 30. URL: https://github.com/harsha2010/magellan/issues. J. Vlissides et al, 1995. “Design patterns: Elements of reusable object-oriented software”. In: Reading: Addison-Wesley 49.120, p. 11. D. Vorha, 2016. “Apache Parquet”. In: Practical Hadoop Ecosystem. Springer, p. 325–335.



T. White, 2012. Hadoop: The definitive guide. O’Reilly Media, Inc. D. Xie et al, 2016. Simba: Efficient In-Memory Spatial Analytics. URL: https://www.cs.utah.edu/~lifeifei/ papers/simba.pdf. S. You, et al, 2015. “Large-scale spatial join query processing in cloud”. In: Data Engineering Workshops (ICDEW), 31st IEEE International Conference, p. 34– 41. J. Yu, 2017. GeoSpark’s Github- issue 33. URL: https://github.com/DataSystemsLab/GeoSpark/ issues. J. Yu et al, 2015. “Geospark: A cluster computing framework for processing large-scale spatial data”. In: Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems, p. 70. M. Zaharia et al, 2010. “Spark: cluster computing with working sets”. In: Proceedings of the 2nd USENIX conference on Hot topics in cloud computing. T. 10, p. 10.


VOLA: A Compact Volumetric Format for 3D Mapping and Embedded Systems

Jonathan Byrne, Léonie Buckley, Sam Caulfield and David Moloney

Advanced Architecture Group, Intel, Ireland
{jonathan.byrne, leonie.buckley, sam.caulfield, david.moloney}@intel.com

Keywords:

Voxels, 3D Modelling, Implicit Octrees, Embedded Systems.

Abstract:

The Volumetric Accelerator (VOLA) format is a compact data structure that unifies computer vision and 3D rendering and allows for the rapid calculation of connected components, per-voxel census/accounting, Deep Learning and Convolutional Neural Network (CNN) inference, path planning and obstacle avoidance. Using a hierarchical bit array format allows it to run efficiently on embedded systems and maximizes the level of data compression for network transmission. The proposed format allows massive scale volumetric data to be used in embedded applications where it would be inconceivable to utilize point clouds due to memory constraints. Furthermore, geographical and qualitative data is embedded in the file structure to allow it to be used in place of standard point cloud formats. This work examines the reduction in file size when encoding 3D data using the VOLA format. Four real world Light Detection and Ranging (LiDAR) datasets are converted, producing data an order of magnitude smaller than the current binary standard for point cloud data. Additionally, a new metric based on a neighborhood lookup is developed that determines an accurate resolution for a point cloud dataset.

1

INTRODUCTION

The worlds of computer vision and graphics, although separate, are slowly being merged in the field of robotics. Computer vision is taking input from systems, such as Light Detection and Ranging (LiDAR), structured light or camera systems and generating point clouds or depth maps of the environment. The data must then be represented internally for the environment to be interpreted correctly. Unfortunately the amount of data generated by modern sensors quickly becomes too large for embedded systems. An example of the amount of memory required by dense representations is SLAMbench (Nardi et al., 2015) (kFusion) which requires 512 MiB to represent a 5m3 volume with 1 cm accuracy (Mutto et al., 2012). A terrestrial LiDAR scanner generates a million unique points per second (Geosystems, 2015) and an hour long aerial survey can generate upwards of a billion unique points. The result of having such vast quantities of data is that it quickly becomes impossible to process, let alone visualize the data on all but the most powerful systems. Consequently it is rarely used directly. It is simplified by decimation, flattened into a 2.5D Digital Elevation Model (DEM), or meshed using a technique such as Delaunay triangulation or Poisson reconstruc-

tion. The original intention of VOLA was to develop a format that was small enough to be stored on an embedded system and enable it to process 3D data as an internalized model. The model could then be easily and rapidly queried for navigation of the environment as it partitions space based on its occupancy. This paper focuses on the compression rates obtained when using the format on different datasets. Four publicly available large scale LiDAR datasets were examined in this work. The data was obtained by an aerial LiDAR system for San Francisco, New York state, Montreal and Dublin respectively. Although the quality and resolution of the data vary, they present a realistic representation of what would be processed by an embedded system in the real world, except on a much larger scale. This work examines the effect of point density versus compression depth on the data for both dense and sparse mappings and then compares the VOLA format against the original dataset. Our findings show that there are dramatic reductions in file size with a minimal loss of information. Another finding of this work is that average point cloud density is a poor metric for choosing a resolution for the voxel model, as it can be biased by the underlying clusters in the data distribution. An efficient and easily calculated metric based on block occupancy is presented that takes into account the voxel



neighborhood when choosing a resolution.

2

RELATED RESEARCH

There exist several techniques for organizing point cloud data and converting it to a solid geometry. Point clouds are essentially a list of coordinates, with each line containing positional information as well as color, intensity, number of returns and other attributes. Although the list can be sorted using the coordinate values, normally a spatial partitioning algorithm is applied to facilitate searching and sorting the data. Commonly used approaches are the octree (Meagher, 1982) and the KD-Tree (Bentley, 1975). Octrees are based in three dimensional space and so they naturally lend themselves to 3D visualization. There are examples where the octree itself is used for visualizing 3D data, such as the Octomap (Hornung et al., 2013). Octrees are normally composed of pointers to locations in memory, which makes it difficult to save the structure as a binary. One notable exception is the DMGoctree (Girardeau-Montaut, 2006), which uses a binary encoding for the position of the point in the octree. Three bits are used as a unique identifier for each level of the octree. The DMGoctree uses a 32 or 64 bit encoding for each point to indicate the location in the tree to a depth of 10 or 20 respectively. Another recent development is Octnet (Riegler et al., 2016). Their work uses a hybrid grid octree to enable sparsification of voxel data. A binary format is used for representing a set of shallow octrees that, while not as memory efficient as a standard octree, still allows for significant compression. Furthermore they developed a highly efficient convolution operator that reduced the number of multiplications and allowed for faster network operations when carrying out 3D inference.

Another technique for solidifying and simplifying a point cloud is to generate a surface that encloses the points. A commonly used approach that locally fits triangles to a set of points is Delaunay triangulation (Boissonnat, 1984). It maximizes the minimum angle for all angles in the triangulation. Triangular Irregular Networks (TIN) (Peucker et al., 1978) are extensively used in Geographical Information Systems (GIS) and are based on Delaunay triangulation. One issue with this approach is that noise and overlapping points can cause the algorithm to produce spurious surfaces. A more modern and accurate meshing algorithm is Poisson surface reconstruction (Kazhdan et al., 2006).

Table 1: A comparison between VOLA and octrees.

                                VOLA           octree
Implementation                  Bit Sequence   Pointer Based
Traversal Arithmetic            Modular        Pointer
Variable Depth                  Yes            Yes
Dense Search Complexity         O(1)           O(h)
Sparse Search Complexity        O(h)           O(h)
Embedded System Support         Yes            No
Look Up Table (LUT) Support     Yes            No
Easily Save to File             Yes            No
File Structure                  Implicit       Explicit
Cacheable                       Yes            No
Hierarchical Memory Structure   No             Yes

The space is hierarchically partitioned and information on the orientation of the points is used to generate a 3D model. It has been shown to generate accurate models and it is able to handle noise due to the combination of global and local point information. Poisson reconstruction will always output a watertight mesh, but this can be problematic when there are gaps in the data. In an attempt to fill areas with little information, assumptions are made about the shape which can lead to significant distortions. There are also problems with small, pointed surface features, which tend to be rounded off or removed by the meshing algorithm.

Finally, there are volumetric techniques for encoding point clouds. Traditionally used for rasterizing 3D data for rendering (Hughes et al., 2014), the data is fitted to a 3D grid and occupancy of a point is represented using a volumetric element, or "voxel". Voxels allow the data to be quickly searched and traversed due to being fitted to a grid. While this simplifies the data and may merge many points into a single voxel, each point will have a representative voxel. Unlike meshing algorithms, voxels will not leave out features but conversely may be more sensitive to noise. The primary issue with voxel representations is that they encode for everything, including open space. This means that doubling the resolution or the extent of the covered area increases the memory requirements by a factor of 8. An investigation into using sparse voxel approaches to accomplish efficient rendering of large volumetric objects was carried out by (Laine and Karras, 2011). The work used sparse voxel octrees and mipmaps in conjunction with frustum culling to render volumetric scenes in real time. This work was further developed under the name Gigavoxel (Crassin et al., 2009).

VOLA combines the hierarchical structure of octrees with traditional volumetric approaches, enabling it to only encode for occupied voxels. Although there is much similarity with traditional octrees, this approach has several notable differences outlined in Table 1. The approach is described in detail below.


Figure 1: Tree depth one, the space is subdivided into 64 cells. The occupied cells are shown in green.

3

THE VOLA FORMAT

VOLA is unique in that it combines the benefits of partitioning algorithms with a minimal voxel format. It hierarchically encodes 3D data using modular arithmetic and bit counting operations applied to a bit array. The simplicity of this approach means that it is highly compact and can be run on hardware with simple instruction sets. The choice of a 64 bit integer as the minimum unit of computation means that modern processor operations are already optimized to handle the format. While octree formats either need to be fully dense to be serialized or require bespoke serialization code for sparse data, the VOLA bit array is immediately readable without header information. VOLA is built on the concept of hierarchically defining occupied space using “one bit per voxel” within a standard unsigned 64 bit integer. The onedimensional bit array that makes up the integer value is mapped to three-dimensional space using modular arithmetic. The bounding box containing the points is divided into 64 cells. If there are points contained within a cell the bit is then set to 1 otherwise it is set to zero. The result of the first division is shown in Figure 1. For the next level each occupied cell is assigned an additional 64 bit integer and the space is further subdivided into 64 cells. Any unoccupied cells on the upper levels are ignored allowing each 64 bit integer to only encode for occupied space. The bits are again set based on occupancy and appended to the bit array.

Figure 2: Tree depth two. Each occupied cell is subdivided into 64 smaller cells.

The number of integers in each level can be computed by summing the number of occupied bits in the previous level. The resolution increases fourfold for each additional level, as shown in Figure 2. The process is repeated, with each level increasing the resolution of the representation by four, until a resolution suitable for the data is reached. This depends on the resolution of the data itself. The traditional approach is to compute the average points per meter of the dataset and use this to derive a suitable tree depth. One of the issues raised in this work is that the non-uniform distribution of points makes this a poor metric. A new approach that approximates the voxel neighborhood, and is found to more accurately reflect the dataset, is discussed in Section 7.
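The mapping from a cell position to a bit, and the per-level bookkeeping, can be sketched as follows (our reading of the description above, in illustrative code; the exact bit ordering used by VOLA is an assumption):

object VolaSketch {
  // Map a cell coordinate inside a 4x4x4 block to a bit position (0..63)
  // using modular arithmetic; the ordering here is illustrative.
  def bitIndex(x: Int, y: Int, z: Int): Int =
    (x % 4) + (y % 4) * 4 + (z % 4) * 16

  // Mark a cell as occupied inside a 64-bit block.
  def setBit(block: Long, x: Int, y: Int, z: Int): Long =
    block | (1L << bitIndex(x, y, z))

  // One 64-bit integer is appended at the next level for every occupied bit,
  // so the size of the next level is the population count of the current one.
  def nextLevelBlockCount(level: Array[Long]): Int =
    level.map(b => java.lang.Long.bitCount(b)).sum
}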

4

AERIAL LiDAR DATASETS

Real world data obtained from aerial LiDAR scans is used in this work as it is analogous to point cloud data that would be obtained from an embedded system used for 3D navigation, such as in a drone or a self-driving car. The four datasets examined in this work are:

• The 2010 ARRA LiDAR Golden Gate Survey (san, )
• The 2013-2014 U.S. Geological Survey CMGP LiDAR: Post Sandy (new, )
• The Montreal 2012 LiDAR Aerien Survey (mon, a)
• The ALS 2015 Dublin survey (Laefer et al., )


Figure 3: The output obtained for different tree depths ((a) depth 1, (b) depth 2, (c) depth 3, (d) depth 4). The resolution increases by 4 for each axis on every successive subdivision.

Figure 4: San Francisco LiDAR scan density. The points are uniformly distributed with an average density of 0.2 points per meter.


The datasets were chosen as they model complex built-up urban environments and they are publicly available open data (sfl, ; nyl, ; mon, b; dub, ). The density of the point clouds varies significantly between the datasets due to the flight paths used to gather the LiDAR data. Traditional GIS applications are only concerned with generating Digital Elevation Models, which only require a single height value per grid point. Accordingly, the amount of overlap between flight paths was reduced to cover the maximum area in the least time, and the resulting models are sparse. The San Francisco, Montreal and New York datasets followed this approach and so have a low point density, the San Francisco dataset proving to be the sparsest at 0.2 points per meter, followed by New York with 1.5 points per meter and Montreal with 8 points per meter, as shown in Figures 4, 5, and 6. The Dublin dataset was gathered by the Urban Modelling group, which researches techniques for maximizing 3D data and model generation (TruongHong et al., 2013). They used up to 60% overlap in order to increase the point density to 190 points per meter. The data was captured at an altitude of 300m using a TopEye system S/N 443. It consists of over 600 million points with an average point density of 348.43 points/m2. It covered an area of 2 km2 in Dublin city center.

One issue that is obvious with the datasets is that the point distribution is not uniform. Aerial LiDAR scans are inherently biased due to the angle and height at which the data is gathered. This is clearly highlighted in Figures 6 and 7 where the roofs are bright green

Figure 5: The New York LiDAR scan is uniform and of higher density (1.5 points per meter) than the San Francisco scan but is missing the facades on the buildings.

and red, representing a higher point density. There are more point returns generated by higher flatter surfaces, such as the roofs of buildings, rather than occluded street level structures. The average point density does not give an accurate representation of the underlying data which is necessary for choosing the correct level of subdivision when voxelizing the dataset. Our experiments show that the block occupancy, which is analogous to a voxel neighborhood, gives a more realistic metric of the point density by finding the level of subdivision where the voxels become separated due to insufficient point density. The metric will be explained in more detail in the Section 7.

5

EXPERIMENTS

Two experiments are carried out in this work. The first experiment examines the difference between a traditional dense mapping of the space using a 1 bit per voxel representation versus the hierarchical


Figure 6: The Montreal LiDAR scan density is 8 points per meter and starts to show a non uniform distribution. The rooftops and overlapping scan lines show up in green and orange respectively.

Figure 7: The Dublin LiDAR scan density is 190 points per meter and the distribution towards the rooftops is clearly visible.

sparse mapping used by VOLA. The second experiment compares standard and compressed VOLA to the industry standard for point cloud compression, the LAZ format.

5.1

Comparison of Dense Versus Sparse Mapping

VOLA combines a 1 bit per voxel representation with a hierarchical encoding. Rather than using a 32 bit integer to represent a voxel as in standard approaches, the occupancy of each voxel is indicated by setting a bit to 1. This encoding step alone makes the file size 32 times smaller. What is unclear is the magnitude of the reduction in file size that results from the hierarchical compression. The worst case scenario, where the points are spread uniformly throughout the bounding box, would result in dense and sparse volumes being of equal size. The experiment measures the size reduction produced by hierarchically encoding real world data. As both

Figure 8: Sparse versus Dense Encoding.

dense and sparse representations use one bit per voxel, the one variable affecting file size is that the sparse representation may omit empty space. The sparse mapping discards any 64 bit blocks that exclusively contain zeros. This means that empty space is not encoded, but it adds an additional overhead in processing time when packing and unpacking the structure. Theoretically the worst case for a sparse mapping, where points are uniformly distributed throughout the space, would result in the dense mapping and sparse mapping having equal size. Fortunately, point clouds based on real world data generally have non-uniform distributions, with most of the space being empty. (Klingensmith et al., 2015) found that only 7% of the space in a typical indoor scene is in fact occupied. The Dublin LiDAR scan, for example, has an average occupancy of 1.36% per 100 meter tile. This work examines the levels of compression obtained when converting a point cloud to both dense and sparse representations and how this is affected by the depth.
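The quantity being compared can be sketched as follows (illustrative code, not the conversion tool used in the experiments): a dense level stores one bit for every cell of the grid, while the sparse level only keeps the 64-bit blocks that contain at least one set bit.

object SparseVsDense {
  // Bytes needed to store a level densely: one bit per cell, 64^depth cells.
  def denseBytes(depth: Int): Long =
    math.pow(64, depth).toLong / 8

  // Bytes needed to store the same level sparsely: 8 bytes per non-empty block.
  def sparseBytes(blocks: Seq[Long]): Long =
    8L * blocks.count(_ != 0L)
}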

5.2

Comparison with LAS File Format

The VOLA format is compared to the commonly used industry standard for LiDAR data, the LAS file format. LAS is a binary format originally released by the American Society for Photogrammetry and Remote Sensing (ASPRS) for exchanging point cloud data. LAS was designed as an alternative to proprietary formats or the commonly used generic ASCII format. It has the ability to embed information on the dataset and the points themselves, such as the coordinate reference system, number of returns, scan angle, etc. An addition to this format is the compressed LAZ format developed by (Isenburg, 2013). It is a lossless format that takes advantage of the fact that LAS stores the coordinates using a fixed point format to compress


the data. The resulting files are between 7% and 25% of the original size. LAZ has now become the de facto standard for point cloud formats. The comparison with VOLA has two caveats. Firstly, converting a point cloud to a voxel format implicitly simplifies the distribution to a binary grid distribution: a voxel is placed in the grid if there is at least 1 point in the grid location. A voxel could result from a single point or 1000 points, so information on high point concentrations is lost. The result is that the conversion to a voxel format is lossy. The second consideration is that the LiDAR format contains meta-data about the points themselves, such as color, intensity, number of returns, scan angle, etc. Using 1 bit per voxel VOLA means that only the occupancy is recorded. In order to carry out a fairer comparison, 2 bits per voxel VOLA is used to represent the point meta-data. The meta-data is encoded in byte blocks, which means the resolution of the values is reduced from 32 bits to 8 bits, but as much of this data is normally limited to this range (intensity, number of returns) there is no loss of information. There is a loss in the resolution of the data, as 2 bits per voxel records the information for a 64 bit occupancy block rather than the individual voxels. In order to compare point cloud compression with VOLA compression we used a standard compression algorithm on the VOLA format. While VOLA is compact it is not compressed, and so there is still significant room for further file size reduction, whereas LAZ cannot be reduced further. The standard gzip (Deutsch and Gailly, 1996) library was used to compress the files. Gzip uses the deflate algorithm for compression (Deutsch, 1996). The resolution of the data used in this comparison was chosen by the occupancy as calculated in the dense versus sparse experiments. A resolution was chosen where the average block occupancy is above 15%, which means that there is an increased likelihood that the voxels are connected.

6

DENSE VERSUS SPARSE COMPARISON RESULTS

Each dataset was computed to multiple depths (and therefore resolutions) in order to understand how the file size compression was affected by the resolution. The file size in megabytes was recorded for both the dense and sparse representations. The improvement in compression was computed by dividing the dense file size by the sparse file size. An additional measure of occupancy was computed by summing the bits in each 64 bit block in the sparse representation. This does

not give the actual neighborhood of a voxel, e.g., a neighboring voxel could be contained in an adjoining block, but it does give an approximate measure of occupancy.

The results for the San Francisco dataset are shown in Table 2. Although this is the least dense dataset, it contains the largest number of tiles. The magnitude of the reduction increases as the depth (and accordingly the resolution) increases. The maximum compression is 38 times smaller than the dense representation. The initial occupancy is 46% at depth 1, which then increases at depth 2 before decreasing again. This is because the majority of the bounding box is empty space, with most of the points having low height values. It is also noted at depth 4 that the occupancy of the voxel blocks falls below 10%. Each voxel in a block can have up to 26 neighboring voxels, of which 6 can be contiguous, i.e., sharing a common face. As the occupancy drops, the likelihood of contiguity also decreases. This is covered in more detail in Section 7.

Table 2: Depth results for San Francisco dataset for 234887 tiles.

Lvl   Dense (MB)   Sparse (MB)   Mag Red   Block Occup
1     1.87         1.87          1         46.77%
2     122.14       40.2          3.04      51.93%
3     7818.92      881.3         8.87      29.70%
4     500412.66    12918.8       38.74     7.3%

The New York dataset in Table 3 shows a similar reduction in file size for increasing depth with a similar magnitude reduction for depth 4. There is also the same increase in block occupancy at depth 2 before it decreases and is less than 10% at depth 4. Table 3: Depth results for New York dataset for 86804 tiles.

Lvl   Dense (MB)   Sparse (MB)   Mag Red   Block Occup
1     0.694        0.694         1         39.15%
2     45.138       13.197        3.42      60.42%
3     2889.53      318.82        9.06      31.25%
4     184930.71    4858.55       38.06     9.62%

The Montreal dataset results in Table 4 show a slightly more pronounced reduction in file size initially, but the result is only 34 times smaller by depth 4. The occupancy again spikes at depth 2 and reaches 15% at depth 4. This would imply that more detailed features are captured at the higher resolution. The Dublin dataset results are shown in Table 5. Due to the significantly higher point density it was decided to increase the maximum depth to 5. This is


Table 4: Depth results for Montreal dataset for 66299 tiles.

Lvl   Dense (MB)   Sparse (MB)   Mag Red   Block Occup
1     0.53         0.53          1         35.62%
2     34.47        9.44          3.65      62.52%
3     2206.96      233.08        9.47      37.97%
4     141246.04    4106.08       34.4      15.62%

equivalent to each voxel representing a 9.7 cm3 cube in the dataset. There is a smaller reduction in the file size for successive depths compared to the previous datasets, but this increases to 70 times smaller at depth 5. The occupancy spikes at 87% at depth 2 and then drops off to a minimum of 14.72%. Table 5: Depth results for Dublin dataset for 356 tiles.

Lvl   Dense (MB)   Sparse (MB)   Mag Red   Block Occup
1     0.0028       0.0028        1         52.5%
2     0.185        0.065         2.82      87.14%
3     11.85        1.96          6.05      48.33%
4     758.43       40.84         18.57     35.46%
5     48539.947    684.36        70.93     14.72%

6.1

Discussion

As stated earlier, if the data was distributed uniformly throughout the bounding box this would result in dense and sparse volumes being of equal size. The experiments show that this is not the case with real world data. There is an initial low occupancy for the highest level encoding as the majority of points in each 100m3 tile are in the lowest third on the vertical axis. Once this has been removed the remaining space is largely occupied but then decreases for each successive increase in resolution. The magnitude of the reduction decreased for more dense datasets but this was offset by a marked increase in the magnitude of reduction for greater depths. Although higher resolution datasets require higher resolution VOLA models, increasing the resolution of sparse models increased the magnitude of the space saving. There was also a point with all the datasets where the resolution increased to the point that the voxels were no longer connected. The result is a voxel representation where the number of voxels is the same as the number of points (which is essentially a lossless encoding of the data) but is not useful when computing collisions and navigation information. We shall go into more detail on this in the next section.

7

BLOCK OCCUPANCY

Increasing the resolution resulted in a greater number of points being disconnected. Although this means the representation more accurately reflects the underlying point cloud, it is less useful when using the representation for navigation or when applying machine learning to the dataset. For example, analyzing a building facade or detecting buildings using a 3D CNN requires that the data be connected into a contiguous object. The true neighborhood of a voxel is difficult to compute when using a hierarchical encoding, as it requires finding neighboring blocks when a voxel is on the edge of the current block. An alternative approach is to conduct a bitwise comparison on the voxels, or to compute the Euclidean distance between the voxels in a block and ignore neighboring blocks. This simplifies the problem, but it is still computationally expensive due to the number of comparisons required. A more efficient, although less accurate, approach is to compute the occupancy of a block, as was used in the previous experiments. The occupancy of a 64-bit block is computed by summing the bits set to one. This is only an approximation of the true neighborhood, but it gives a clear indication of the likely number of connected components in a block. A comparison of occupancy and its relationship with the number of connected components is shown in Table 6; occupancy is found to closely track the number of connections per block.

Table 6: A comparison of occupancy and the number of connected voxels within a block.

Occup | Contig Vox | StdDev | Connected
100%  | 144        | 0      | 100%
75%   | 80.44      | 3.23   | 55.8%
50%   | 35.31      | 3.38   | 24.5%
25%   | 8.58       | 2.28   | 5.9%
10%   | 1.49       | 1.09   | 0.75%
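For illustration, the short sketch below computes the occupancy and in-block face-contiguity of a single 64-bit block in Python. The bit layout (bit index = x + 4y + 16z for a 4x4x4 block) is an assumption made for this example and is not taken from the VOLA specification.

# Sketch: occupancy and in-block face-contiguity of a 64-bit voxel block.
# Assumes voxels are packed as a 4x4x4 cube with bit index = x + 4*y + 16*z.
def occupancy(block: int) -> float:
    """Fraction of the 64 voxel slots that are set."""
    return bin(block & (2**64 - 1)).count("1") / 64.0

def contiguous_pairs(block: int) -> int:
    """Count face-sharing (6-connected) voxel pairs inside a single block,
    ignoring neighbours that live in adjoining blocks."""
    def is_set(x, y, z):
        return (block >> (x + 4 * y + 16 * z)) & 1
    pairs = 0
    for z in range(4):
        for y in range(4):
            for x in range(4):
                if not is_set(x, y, z):
                    continue
                # Only look in the +x/+y/+z directions so each pair is counted once.
                if x < 3 and is_set(x + 1, y, z): pairs += 1
                if y < 3 and is_set(x, y + 1, z): pairs += 1
                if z < 3 and is_set(x, y, z + 1): pairs += 1
    return pairs

# A fully occupied block gives occupancy 1.0 and 144 face-sharing pairs,
# which matches the 100% row of Table 6.
full = 2**64 - 1
print(occupancy(full), contiguous_pairs(full))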

The point density is traditionally used when working out a suitable resolution for a voxelised model, but this approach oversimplifies the distribution of the data. It assumes the points have a uniform density, although the points tend to be biased towards areas least occluded from the scanner, e.g., aerial LiDAR data has many times more points on rooftops than at ground level. It also takes no account of the spread of points, i.e., points may cover a building consistently but only sparsely. Ignoring this worst case will result in fragmented voxel models. Although block occupancy may be an imperfect metric, it is easy to compute and correlates well with block contiguity. As such it provides a useful mechanism when determining what is a sufficient resolution when processing a dataset.

8

LAS FORMAT COMPARISON RESULTS

The VOLA format is compared against the LASzip (LAZ) format using a two-bits-per-voxel representation. This allows the metadata about the points to be encoded. The caveats are that VOLA is not a lossless format and the resolution of the point information is reduced due to the hierarchical encoding; the point information is averaged over each 64-bit block. There is then a comparison against VOLA when compressed using gzip, a generic compression library. The depth chosen for the data was based on the previous results, where the voxels are still connected. The San Francisco, New York and Montreal datasets are at depth 3 and the Dublin dataset is at depth 4. The VOLA format now allows for an arbitrary amount of additional information to be appended to the structure, although only 2 bits are used in this example.

Table 7: A comparison of the file size reduction when using LAZ and VOLA.

Dataset  | LAS   | LAZ    | %      | VOLA    | %     | VOLAZip | %
San Fran | 224GB | 33GB   | 14.7%  | 1.76GB  | 1.07% | 799MB   | 0.35%
New York | 126GB | 22GB   | 17.4%  | 637MB   | 0.5%  | 336MB   | 0.26%
Montreal | 167GB | 27.7GB | 16.5%  | 466MB   | 0.27% | 189MB   | 0.11%
Dublin   | 36GB  | 3.7GB  | 10.27% | 81.68MB | 0.22% | 33MB    | 0.091%

The results show that VOLA can reduce the file size of the datasets to less than 1% of their original size. Compressing VOLA with generic methods further reduces this by up to 50%. Although LAZ offers significant lossless compression, compressed VOLA reduces the file sizes to less than 5% of the LAZ files.

9

CONCLUSIONS

In this work we showed that encoding real-world data using the hierarchical VOLA encoding massively reduces the file size. We also introduced a metric based on voxel block occupancy that more accurately reflects the underlying point cloud distribution than average point density. Although it is only an approximation of neighborhood, it is easily calculated using VOLA's block format. We then compared the VOLA representation, with point meta-data, against standard LiDAR formats. The file size is reduced to less than 1% of that of the LAS format, albeit at the loss of some resolution and point information. Using a generic

compression algorithm on VOLA results in it being 5% of the file size of the compressed LAZ format. Due to the inherent sparsity of real-world data, a hierarchical encoding that omits empty space makes sense. These results show that it is possible to store large amounts of 3D data in a memory footprint that could easily be accommodated on an embedded system for both mapping and machine learning applications.

10

FUTURE WORK

A generic compression algorithm was used to compress the data. This could be improved using bespoke techniques developed for the underlying 3D data, such as run-length encoding and look-up tables for self-similar features. Our intention is to use such techniques to further reduce the file size.
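As one concrete possibility, the sketch below run-length encodes a stream of 64-bit block values, collapsing the long runs of identical (typically empty) blocks that sparse data produces; it is only an illustration of the idea and is not the encoding the authors intend to implement.

# Sketch (illustration only): run-length encoding of a 64-bit block stream,
# collapsing runs of identical blocks into (count, value) pairs.
from itertools import groupby
from typing import Iterable, List, Tuple

def rle_encode(blocks: Iterable[int]) -> List[Tuple[int, int]]:
    """Encode a sequence of 64-bit block values as (run_length, value) pairs."""
    return [(len(list(group)), value) for value, group in groupby(blocks)]

def rle_decode(runs: List[Tuple[int, int]]) -> List[int]:
    """Expand (run_length, value) pairs back into the original block sequence."""
    return [value for count, value in runs for _ in range(count)]

blocks = [0, 0, 0, 0xFFFFFFFFFFFFFFFF, 0xFFFFFFFFFFFFFFFF, 0x00FF00FF00FF00FF]
assert rle_decode(rle_encode(blocks)) == blocks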


Minimum Collection Period for Viable Population Estimation from Social Media Samuel Lee Toepke Private Engineering Firm, Washington DC, U.S.A. [email protected]

Keywords:

Population Estimation, Volunteered Geographic Data, Twitter, Amazon Web Services, Social Media, Collection Period, Enterprise Architecture, Botometer.

Abstract:

Using volunteered geographic information for population estimation has shown promise in the fields of urban planning, emergency response and disaster recovery. A high volume of geospatially enabled Tweets can be leveraged to create population curves and/or heatmaps delineated by day of week and hour of day. When making these estimations, it is critical to have adequate data, or the confidence of the estimations will be low. This is especially pertinent to disaster response, where Tweet collection for a new city/town/locale may need to be rapidly deployed. Using previously leveraged data removal methods, temporal data quantity is explored using sets of data from increasingly longer collection periods. When generating these estimates, it is also necessary to identify and mitigate data from automated Twitter bots. This work examines the integration of a modern, web services based, Twitter bot assessment algorithm, executes data removal experiments on collected data, describes the technical architecture, and discusses results/follow-on work.

1

INTRODUCTION

Smart devices, e.g. tablets, smart phones, wearables, etc. continue to grow in popularity, and are accessible to a large percentage of the world’s population (Poushter, 2016). Smart phones expose the user to a pervasive Internet connection, and a rich suite of sensors. Access to the global navigation satellite system (GNSS) is a common smart phone functionality; coupled with a social media service, it is possible for the user to create volunteered geospatial information (VGI) (Aubrecht et al., 2016). VGI includes latitude/longitude, and the actual content of the data e.g. an image, text, sensor reading, etc. This data is useful for many tasks including environmental sensing, population estimation, urban planning, event detection (Doulamis et al., 2016), etc. It has been shown that VGI from social media services can be used to supplement population estimations in an urban area at high spatiotemporal resolution (Sinnott and Wang, 2017) (Liang et al., 2013). The estimations can be readily visualized using a heat map overlaid on a geospatial information system (Wood et al., 2007), or by using population curves over time of week/day. This is especially useful in the domain of emergency response and disaster recovery; when precisely directing resources/response

to those affected is of paramount concern. Often, when a disaster occurs, it is necessary for first responders to set up in an area where no social media data collection has previously taken place. If collection/processing code for social media feeds pertinent to the geographic area is deployed immediately, it is critical to know how much confidence a user can apply to the collected data. Generally, attaining more data will lead to a more complete picture; but how much data is enough before one can make a population estimation with confidence? When is it safe to discard data, as it is no longer adding value to the end product, but taking up bandwidth, storage space and processing power? Under the assumption that the estimation will eventually become saturated, e.g. having more data points no longer adds resolution to the end result; previous work (Toepke, 2017) has focused on data removal experiments using publicly available Twitter data. Using a full data set as the objective measure, Tweets were randomly removed in increasing steps of 10%, and the error between each resulting curve and the full data was calculated. Results showed a resilience to loss, with resulting curves still providing useful insight into population movement throughout the day. This current work focuses on leveraging the previously utilized experimentation framework, and



repeating the data removal experiments, but using increasing amounts of data for each set of experiments. The first run includes 1 week of data, and the last run includes 8 weeks of data, with the data set increasing in 1 week steps. Previous work only used the experimental framework on one static set of data, which was approximately five months' worth of collected Tweets. Determining how much data is required for a confident estimation in a given geospatial area is challenging. Comparison of the generated VGI estimations against an objective measure would greatly increase confidence; unfortunately, baselines with adequate spatiotemporal resolution are not readily available. Some of the leading population estimation projects include LandScan (Rose and Bright, 2014) from Oak Ridge National Labs and Urban Atlas from the Joint Research Centre at the European Commission (Batista e Silva et al., 2013). To provide a measure, both products fuse disparate sets of input data, e.g. census tracts, commune boundaries, soil sealing layers, land cover, slope, etc. With a resolution of approximately 30 m² for LandScan (including day/night models) and approximately 10 m² (Copernicus, 2017) for Urban Atlas, both products are high quality, but lack the required spatiotemporal resolution for adequate comparison. One solution would be to find a constrained geospatial area, e.g. a corporate/university campus that implements active security to all rooms/locations/spaces/etc. With a large enough social media user-base as well as cooperation from the owners, the objective human presence data could be compared against models attained through social media aggregation, with the goal of creating a confidence measure. For the purposes of this population representation, it is critical to have human-generated data. One of the ways manipulation of social media can negatively affect the estimation is through the use of Twitter bots (Subrahmanian et al., 2016). Twitter bots are coded by humans, and use Twitter to push an agenda; popular goals include advertising, responding to other bots' keywords to increase re-Tweets, humor, etc. Irrespective of the bot's purpose, if code is generating geospatially enabled Tweets, then the Tweets should be removed from the dataset as they do not represent a human presence. A web services based bot detection framework will be leveraged on a subset of the Tweets collected to ascertain whether the data was human generated. The bot detection functionality is implemented as a proof of concept, and will need to be extended further. This work will discuss the results of the data removal experiments using sets of data with

increasing quantity, explore the bot-ness of the accounts in the data set, delineate architectural details and present follow-on work.

2

BACKGROUND

Within recent years, many free social media services have been created which allow for the generation of VGI. Each of the most popular services focuses on a different niche, e.g. Facebook is a full social environment, Instagram is ideal for posting pictures, Foursquare is a location finding service, and Twitter allows the end-user to post textual messages called Tweets. Twitter exposes a powerful application programming interface (API) that allows developers and researchers access to public Twitter data. Using any compatible programming language, an interested person can query the APIs for Tweets of interest, Tweets from a specific user, Tweets in a given area, etc. and retrieve an immense amount of data (Poorthuis et al., 2014). Social media services are effectively utilized by users with modern smart phones. Today's devices have advanced GNSS units that can provide inexpensive location reporting with reasonable precision (Dabove and Manzino, 2014). With these technologies in place, they become powerful tools, especially in the case of facilitating disaster response (Caragea et al., 2011) (Khan et al., 2017). Despite the popularity of the services/devices, it is necessary to understand that full adoption has not occurred. There are subsets of the population that do not generate social media data, e.g. the very young/old, technology non-adopters, those who face socio-economic challenges, etc. (Li et al., 2013). The combination of social media based population estimations with traditional methods can be beneficial towards the goal of creating a more complete operating picture, even in data-constrained small areas (Lin and Cromley, 2015) (Anderson et al., 2014). One of the primary benefits of VGI, the ability to attain data from untrained sources at low/no cost (See et al., 2016), also introduces risks that must be mitigated. Ideally, the contributor is human and non-malicious; unfortunately, impetus exists to contribute purposefully incorrect data. Especially in the case of using social media to respond to disasters, erroneous data can contribute to innocuous pranks, interference with the movement of supplies, or maliciously facilitate further disaster (Lindsay, 2011). Recent research includes developing ways to classify Twitter accounts based on a combination of markers. BotOrNot, a system to evaluate social bots (Davis et al., 2016), was created and deployed such that end-users can gain insight into bot-like behavior for a Twitter account. BotOrNot (from here on referred to by its current name, Botometer) is utilized on this investigation's Twitter data as a proof of concept. To classify an account, Botometer requires the following input:
• User Information, including id and username.

• Tweet Timeline, a full list of the user's Tweets, and associated information.
• Mentions, a list of the user's Tweets, and associated information, where another Twitter account is mentioned in the text of the Tweet.

Using various models and machine learning tools, Botometer returns scores about the following features for an account (Varol et al., 2017):
• User: considers the metadata of a user's account.
• Friend: evaluates interconnections between friends of the user.
• Network: checks how retweets and mentions interact with each other.
• Temporal: includes analysis of when Tweets are made as well as the frequency of Tweets.
• Content and language: looks for natural language components in each Tweet text.
• Sentiment: evaluates the attitude/feeling of Tweet content.
The service also returns an aggregate score, which assumes the Tweet text is written in English, or a universal score, which removes the content/language/sentiment considerations. Exploration of Botometer as applied to this Twitter data is further discussed in the following architecture and results sections.

The Lisbon Metropolitan Area and five major cities in the United States of America (US) are the geospatial areas of interest for this work. Lisbon was picked to supplement previous work, in which the reasons for the US cities are fully described (Toepke, 2017). The cities are as follows:
• San Diego, California (CA).
• San Francisco, CA.
• San Jose, CA.
• Portland, Oregon (OR).
• Seattle, Washington (WA).

3

ARCHITECTURE

The VGI utilized in this project is retrieved several times an hour from the public Twitter API using web service calls, from a cloud-based enterprise system. The Search API (Twitter, 2017) is utilized extensively for the North American cities, and the infrastructure is fully described in (Toepke, 2017). Querying the Search API presents a number of challenges:
• Rate-limiting of requests: for each Twitter developer account, only 180 requests are permitted inside of a fifteen-minute window.
• Maximum limit on returned Tweets from each request: with a current limit of 100 Tweets. Thus, for each fifteen-minute window, a maximum of 18,000 Tweets can be retrieved.
• Circular geospatial query: instead of a quadrangle, the geospatial queries are defined as a function of circle center (latitude/longitude) and radius in either miles or kilometers.

Figure 1: AWS Twitter Streaming Query/Storage Architecture.

Figure 2: Software Layers for Streaming Query/Storage.

These limitations create an optimization problem, where a developer attempts to cover the maximum amount of geographic area while minimizing the possibility of lost Tweets. The circular query area with a limited maximum response is especially challenging; to fully cover an area, overlap in the queries is unavoidable, with duplicate results being filtered out on the developer's end before database insertion. It is also important to ensure the queries don't saturate, e.g. if a developer chose the center of the circle to be a downtown area and then made the radius 3 miles, one would get back 100 Tweets for every request, but many Tweets would be lost in the response from the Twitter API. Each circular query has to be made small enough that a reasonable number of Tweets are returned below the limit. With trial and error, a set of queries for the US cities was "dialed-in" such that they consistently cover an area while returning an inundation of approximately 30% on average. This low average reduces the geospatial area that can be covered, but protects against a large burst of Tweets during a special event. Nonetheless, there is a maximum number of queries that can exist per time period per developer account, so this is not a holistic solution.

To resolve these limitations, Twitter also makes available a Streaming API, which, once connected to, will continuously push Tweets to the consumer. The Streaming API can be configured to return Tweets from multiple geospatial areas, which is convenient for this use case, as only one solution needs to be deployed for multiple areas of interest. Implementing software for the Streaming API requires a different architecture than the Search API. As long as the consumer has the processing power to process Tweets, and the network connection remains alive, the Twitter API will continue to return data. This requires creating a solution that is fully available, properly sized, and resilient to software failures as well as the eventual disconnections that occur when using web services. The architecture blueprint can be seen in Figure 1. The software layers can be seen in Figure 2. The solution for the Streaming API was designed as follows:
• Amazon Web Services (AWS) Elastic Beanstalk (ELB): an orchestration product that provides a server-as-a-service. Current support for environments includes Java, Python, PHP, .NET, Ruby, Node.js and Docker. ELB manages all the back-end server provisioning, networking, fail-over, updates, security, etc., allowing the end-user to focus on the code to be deployed (Amazon.com, 2017).
• Docker: containerization software that allows for repeatable packaging and deployment of execution environments as well as code. A Dockerfile for Ubuntu 14.04 was created with all updates, necessary permissions, ancillary packages, and built source code.
• Java: a high-level, object-oriented programming language which was used to create the code that makes the connection to the Twitter API and processes Tweets.

• Cron: a job scheduler used in UNIX/Linux, configured in the Docker container to begin the Java code when the container is first started. The custom Java software remains up indefinitely unless a catastrophic error occurs. Using GNU's 'flock' command, the Cron job runs every minute, checking whether the Java code has stopped execution; if so, it restarts the code.
• Hosebird Client: an open-source Java library which manages the ongoing connection to the Twitter Streaming API. The library securely makes the connection, passes geospatial query parameters, takes action on the incoming Tweets, and intelligently reconnects to the service in the case of a network connection break (Client, 2017).
• AWS DynamoDB: a NoSQL datastore-as-a-service that is used for Tweet storage. Like ELB, DynamoDB abstracts the database and prevents the end-user from spending time on underlying administration details.
• AWS CodeCommit, AWS Identity and Access Management, AWS CloudWatch and AWS Elastic MapReduce (EMR) are used for version control, security, monitoring, and data export, respectively.

The architecture was implemented and is currently being used to download Tweets from the Lisbon Metropolitan Area. Once the Tweets have been collected from all cities, they are exported from DynamoDB to a local machine for processing. The export functionality uses AWS EMR, big-data frameworks-as-a-service, to rapidly copy data from DynamoDB to a text file. The resulting Tweets are then used in a number of data removal experiments, of differing time periods, for each city, to ascertain a minimum viable length of time for data collection. The last step in processing, creating visualizations from the data, is completed using GNU Octave (Eaton et al., 2007), an open-source MATLAB alternative.

A subset of all collected Tweets is used for the data removal experiments; the Tweets are for all cities, published in May/June 2017. Arbitrary months were chosen, with the amount of Tweets being appropriate for integration with Botometer considering time/API constraints. The authoring research team deployed Botometer as a representational state transfer web service, and presents an API through the Mashape API Marketplace (Mashape, 2017). The endpoint is web accessible by any compatible programming language, and requires a Mashape API key for access. Another constraint is that for each account that is being queried, the Twitter API must be queried several times to build content for the Botometer request. Careful consideration of rate-limiting was required when designing the architecture, as the Twitter API currently only allows 180 requests per 15-minute window. To facilitate easy data manipulation, a PostgreSQL database was used with the PostGIS extensions to execute geographic queries. A custom Python script, leveraging Botometer's suggested library, botometer-python (GitHub, 2017), was used to collect/populate the bot classification information over a span of several days.
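A minimal sketch of such a collection script is shown below, based on the public botometer-python interface; the credentials, screen names and the storage step are placeholders rather than details of the deployed system.

# Sketch: querying Botometer for a batch of Twitter screen names with the
# botometer-python client. All credentials and account names are placeholders.
import botometer

twitter_app_auth = {
    "consumer_key": "...",
    "consumer_secret": "...",
    "access_token": "...",
    "access_token_secret": "...",
}
bom = botometer.Botometer(
    mashape_key="...",          # Mashape/API marketplace key
    wait_on_ratelimit=True,     # sleep through Twitter's 15-minute rate windows
    **twitter_app_auth,
)

accounts = ["@example_user_1", "@example_user_2"]   # hypothetical screen names
for screen_name, result in bom.check_accounts_in(accounts):
    if "error" in result:       # e.g. protected or deleted accounts
        continue
    english = result["scores"]["english"]
    universal = result["scores"]["universal"]
    # ...store english/universal in the PostGIS support table here...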

4

POPULATION REMOVAL EXPERIMENTS RESULTS/OBSERVATIONS

The experimental data consists of geospatially enabled posts from the Twitter API occurring from 2017-05-01 00:00:00 (GMT) to 2017-06-26 00:00:00 (GMT), for a total of 179,598 Tweets from six cities, published from 30,007 unique Twitter accounts. Publicly available web service APIs were used to download the data in JavaScript Object Notation (JSON) format. While a much larger collected dataset is available, Tweets starting in May 2017 and running for eight weeks are sufficient for this work. Also of note, the Lisbon data collection code started collecting Tweets as of 2017-04-16 14:38:07 (GMT), and its area of interest is larger than that of the U.S. cities. The average query area for each U.S. city is approximately 3.26 km², and the query area for Lisbon is approximately 691.61 km². The query areas in the U.S. cities were designed such that they cover the downtown core areas adequately while minimizing REST API calls. Lisbon has a much larger area of interest because it was the initial test case for leveraging Twitter's Streaming API. The raw Tweet count for the different cities varies, and can be seen in Table 1 and visualized in Figure 3.

Table 1: Total Tweet Count Per City, 2017-05-01 to 2017-06-26.

City              | Tweet Count
San Jose, CA      | 14,975
San Francisco, CA | 18,797
Portland, OR      | 31,848
Lisbon, Portugal  | 31,854
San Diego, CA     | 33,163
Seattle, WA       | 40,271

Figure 3: Total Tweet Count Per City, 2017-05-01 to 2017-06-26.

For each city, the data is broken up into the different days of the week, and then into the different hours of the day. The end result is a graph showing the pattern of Tweets throughout a day. As the different cities receive different volumes of Tweets, the graphs are normalized using a standard method (Abdi and Williams, 2010). An example of the normalized hourly data for each city for a specific day of the week can be seen in Figure 4. Random data removal in increasing steps of 10% is then effected using Java code. The root mean square error (RMSE) (Chai and Draxler, 2014) (Holmes, 2000) is calculated for each resulting hour-of-day graph, compared against the full set of data. An example graph showing data removal in increasing steps of 20% can be seen in Figure 5. The steps of 10% are calculated, but only steps of 20% are shown in the graph to remove clutter and make it easier to visualize movement of the plots. As more data is removed, one can see the data plots increasingly moving away from the full-dataset line. The RMSE experiments are run eight times, starting from May 1, 2017, and using data from an increasing number of days: 7, 14, 21, 28, 35, 42, 49, and 56. The time periods were chosen arbitrarily, increasing each time by one week. The data removal experiments were run until saturation became apparent with the increasing number of days. Results from the first set, using one week's worth of data, can be seen in the top part of Figure 6; a line for Lisbon is not represented due to an inadequate amount of data for those days. As removal of data increases, RMSE as compared with the full data increases. Results from the last set, using eight weeks' worth of data, can be seen in the bottom part of Figure 6. The average RMSE has decreased by approximately 50%, and the slope of the RMSE between ~10% data loss and ~80% data loss is visibly flatter, with the population estimation showing increasing resilience to data loss. Figure 7 shows all the cities averaged together for each data collection length.


Figure 4: Normalized Average Tweet Count Per Hour on Thursdays, 2017-05-01 to 2017-06-26.

Figure 5: Normalized Average Tweet Count Per Hour on Thursdays with Data Removal for Seattle, WA., 2017-05-01 to 2017-06-26.

Once about 5 weeks of data is collected, a decreasing return on increased data can start to be seen. The estimation is becoming saturated with approximately 8 weeks' worth of data, indicating that a reasonable population estimation can be gleaned from between about 6 and 8 weeks' worth of collected data.
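A minimal sketch of the removal-and-error loop described above is given below. It assumes hourly Tweet counts are already aggregated per city and uses min-max normalization, which is an assumption for illustration; the paper only cites a standard normalization method, and the data here is synthetic.

# Sketch of the data-removal experiment: normalize hourly counts, randomly drop an
# increasing fraction of Tweets, and compute RMSE against the full-data curve.
# The normalization choice (min-max) and the synthetic data are assumptions.
import math
import random
from collections import Counter

def hourly_curve(tweet_hours):
    """Tweet count per hour of day (0-23) from a list of hour-of-day values."""
    counts = Counter(tweet_hours)
    return [counts.get(h, 0) for h in range(24)]

def normalize(curve):
    lo, hi = min(curve), max(curve)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in curve]

def rmse(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

tweet_hours = [random.randint(0, 23) for _ in range(5000)]    # synthetic Tweets
full = normalize(hourly_curve(tweet_hours))

for removal in range(10, 100, 10):                            # 10%, 20%, ... 90% removed
    kept = random.sample(tweet_hours, int(len(tweet_hours) * (1 - removal / 100)))
    reduced = normalize(hourly_curve(kept))
    print(f"{removal}% removed -> RMSE {rmse(full, reduced):.4f}")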

5

BOTOMETER RESULTS/OBSERVATIONS

For the dataset's time period, there are 30,007 unique Twitter users who made posts for all six cities of investigation. Python code was used to query the Botometer web service, and the English/Universal scores were updated in the PostGIS database for each Twitter user.


Figure 6: RMSE of Normalized Average Tweet Count per Hour Averaged from All Days of Week with Data Removal, 2017-05-01 to 2017-05-08 (top set) and 2017-05-01 to 2017-06-26 (bottom set).

An example return from the service can be viewed as JSON below.

{
  'categories' : {
    'content' : 0.34,
    'friend' : 0.25,
    'network' : 0.24,
    'sentiment' : 0.27,
    'temporal' : 0.35,
    'user' : 0.02
  },
  'scores' : {
    'english' : 0.26,
    'universal' : 0.24
  },
  'user' : {
    'id_str' : 'XXXXXXXXXX',
    'screen_name' : 'YYYYYYYYYY'
  }
}

For each category, a decimal value between 0 and 1 is returned. A value closer to 0 indicates less bot-like behavior; a value closer to 1 indicates more bot-like behavior. Botometer does not claim to be infallible; detection is difficult and can produce false results. According to the Botometer instructions, the best way to interpret an aggregate score is as follows.
• Green, between 0.00 and 0.39, likely not a bot.

• Yellow, between 0.40 and 0.60, the classifier is unable to determine bot-ness.
• Red, between 0.61 and 1.00, likely a bot.

For the above random user, based on Botometer's heuristics, they are likely not a bot. Indeed, a cursory inspection of the account's Twitter page is indicative of the user being human. All of the 30,007 accounts were run through Botometer, with 835 accounts, or approximately 2.78% of the accounts, returning no data. The service either responded with "Not authorized." or "Sorry, that page does not exist."; either the user has changed their privacy settings since the Tweets were collected, or the account is no longer available. The five US cities use English as their primary language, so the aggregate 'english' score was used from Botometer. As the Lisbon Metropolitan Area primarily speaks Portuguese, the aggregate 'universal' score was used. Results for all Tweets can be visualized in Figure 8. It can be seen that approximately 29% of the Twitter accounts can be classified as bot-like (red), and approximately 23% are listed as ambiguous (yellow) based on the Botometer classification algorithms. These results indicate that bot-like accounts are pervasive, and that identification/removal is necessary for an accurate population estimation from VGI.
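As a small illustration of how these thresholds can be applied in bulk, the sketch below bins a list of aggregate scores into the green/yellow/red categories described above; the score values themselves are made up for the example.

# Sketch: binning Botometer aggregate scores with the green/yellow/red thresholds
# quoted above. The example scores are fabricated.
from collections import Counter

def classify(score: float) -> str:
    if score <= 0.39:
        return "green (likely not a bot)"
    if score <= 0.60:
        return "yellow (indeterminate)"
    return "red (likely a bot)"

scores = [0.26, 0.45, 0.81, 0.12, 0.67]          # hypothetical aggregate scores
counts = Counter(classify(s) for s in scores)
total = len(scores)
for label, n in counts.items():
    print(f"{label}: {n} accounts ({100 * n / total:.1f}%)")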


Figure 7: RMSE of Normalized Average Tweet Count per Hour Averaged from All Days of Week with Data Removal, Averaged for all Cities, for Different Time Lengths of Data.

Figure 8: Botometer Scores for Tweets in the Six Cities, 2017-05-01 to 2017-06-26.

6

FOLLOW-ON WORK

• One of the drawbacks of the current implementation is that Twitter is the only social media service that is being used. Obtaining data from other popular social media products, and performing similar tests, can create more insight into the minimum viable estimation time period. Implementation/extension of the Botometer algorithms for other social media services would also be useful.
• The current bot detection implementation is a prototype, functioning only on the data exported for processing. The architecture would benefit from using web services to create a pseudo-realtime connection to the Botometer service, annotating Tweets after retrieval from the Twitter API and before insertion into DynamoDB. A local support table holding the bot-ness data for each Twitter user would greatly reduce calls to the Botometer web service.
• A total overhaul of the architecture is required, mainly to accommodate the Twitter Search API restrictions. The Streaming API prototype using Docker/ELB has proven successful; the five American cities can be ported to this architecture with minimal difficulty.

7

CONCLUSIONS

This work has built on previous investigations, further exploring temporal implications of population estimations from social media data. A new architecture was deployed, new data from Lisbon, Portugal was attained, and a modern bot detection algorithm was explored. Using removal techniques from previous work, experiments were run on different time periods, in multiple cities, to create a baseline minimum amount of time that collection code would have to run (6-8 weeks), before a population estimation with reasonable confidence can be obtained. This is pertinent when a new geographic area is being investigated, or a new social media feed is being implemented for an existing area. Having a minimum viable time period can bring a greater confidence to the end user when leveraging this method for population estimation.

REFERENCES Abdi, H. and Williams, L. (2010). Normalizing data. Encyclopedia of research design. Sage, Thousand Oaks, pages 935–938. Amazon.com, I. (2017). Aws elastic beanstalk - deploy web applications. Anderson, W., Guikema, S., Zaitchik, B., and Pan, W. (2014). Methods for estimating population density in data-limited areas: Evaluating regression and treebased models in peru. PloS one, 9(7):e100037. ¨ Aubrecht, C., Ozceylan Aubrecht, D., Ungar, J., Freire, S., and Steinnocher, K. (2016). Vgdi–advancing the concept: Volunteered geo-dynamic information and its benefits for population dynamics modeling. Transactions in GIS. Batista e Silva, F., Poelman, H., Martens, V., and Lavalle, C. (2013). Population estimation for the urban atlas polygons. Joint Research Centre. Caragea, C., Mcneese, N., Jaiswal, A., Traylor, G., Woo Hyun, K., Mitra, P., Wu, D., H Tapia, A., Giles, L., Jansen, J., and Yen, J. (2011). Classifying text messages for the haiti earthquake. 8th International Conference on Information Systems for Crisis Response and Management: From Early-Warning Systems to Preparedness and Training, ISCRAM 2011. Chai, T. and Draxler, R. R. (2014). Root mean square error (rmse) or mean absolute error (mae)?–arguments against avoiding rmse in the literature. Geoscientific Model Development, 7(3):1247–1250.


Client, H. (2017). Github - twitter/hbc: A java http client for consuming twitter’s streaming api. Copernicus (2017). Urban atlas 2012 - copernicus land monitoring service. Dabove, P. and Manzino, A. M. (2014). Gps & glonass mass-market receivers: positioning performances and peculiarities. Sensors, 14(12):22159–22179. Davis, C. A., Varol, O., Ferrara, E., Flammini, A., and Menczer, F. (2016). Botornot: A system to evaluate social bots. In Proceedings of the 25th International Conference Companion on World Wide Web, pages 273–274. International World Wide Web Conferences Steering Committee. Doulamis, N. D., Doulamis, A. D., Kokkinos, P., and Varvarigos, E. M. (2016). Event detection in twitter microblogging. IEEE transactions on cybernetics, 46(12):2810–2824. Eaton, J. W., Bateman, D., and Hauberg, S. (2007). GNU Octave version 3.0. 1 manual: a high-level interactive language for numerical computations. SoHo Books. GitHub (2017). Github - iunetsci/botometer-python: A python api for botometer by osome. Holmes, S. (2000). Rms error. Khan, S. F., Bergmann, N., Jurdak, R., Kusy, B., and Cameron, M. (2017). Mobility in cities: Comparative analysis of mobility models using geo-tagged tweets in australia. In Big Data Analysis (ICBDA), 2017 IEEE 2nd International Conference on, pages 816– 822. IEEE. Li, L., Goodchild, M. F., and Xu, B. (2013). Spatial, temporal, and socioeconomic patterns in the use of twitter and flickr. cartography and geographic information science, 40(2):61–77. Liang, Y., Caverlee, J., Cheng, Z., and Kamath, K. Y. (2013). How big is the crowd?: event and location based population modeling in social media. In Proceedings of the 24th ACM Conference on Hypertext and Social Media, pages 99–108. ACM. Lin, J. and Cromley, R. G. (2015). Evaluating geo-located twitter data as a control layer for areal interpolation of population. Applied Geography, 58:41–47. Lindsay, B. R. (2011). Social media and disasters: Current uses, future options, and policy considerations. Mashape (2017). Botometer api documentation. Poorthuis, A., Zook, M., Shelton, T., Graham, M., and Stephens, M. (2014). Using geotagged digital social data in geographic research. Poushter, J. (2016). Smartphone ownership and internet usage continues to climb in emerging economies. Pew Research Center, 22. Rose, A. N. and Bright, E. A. (2014). The landscan global population distribution project: current state of the art and prospective innovation. Technical report, Oak Ridge National Laboratory (ORNL). See, L., Mooney, P., Foody, G., Bastin, L., Comber, A., Estima, J., Fritz, S., Kerle, N., Jiang, B., Laakso, M., et al. (2016). Crowdsourcing, citizen science or volunteered geographic information? the current state of crowdsourced geographic information. ISPRS International Journal of Geo-Information, 5(5):55.


Sinnott, R. O. and Wang, W. (2017). Estimating micropopulations through social media analytics. Social Network Analysis and Mining, 7(1):13. Subrahmanian, V., Azaria, A., Durst, S., Kagan, V., Galstyan, A., Lerman, K., Zhu, L., Ferrara, E., Flammini, A., and Menczer, F. (2016). The darpa twitter bot challenge. Computer, 49(6):38–46. Toepke, S. L. (2017). Data density considerations for crowd sourced population estimations from social media. In Proceedings of the 3rd International Conference on Geographical Information Systems Theory, Applications and Management - Volume 1: GISTAM,, pages 35–42. INSTICC, SciTePress. Twitter, I. (2017). Search api: search/tweets – twitter developers. Varol, O., Ferrara, E., Davis, C. A., Menczer, F., and Flammini, A. (2017). Online human-bot interactions: Detection, estimation, and characterization. pages 280– 289. AAAI Conference on Web and Social Media (ICWSM). Wood, J., Dykes, J., Slingsby, A., and Clarke, K. (2007). Interactive visual exploration of a large spatio-temporal dataset: Reflections on a geovisualization mashup. IEEE transactions on visualization and computer graphics, 13(6):1176–1183.


Spatiotemporal Data-Cube Retrieval and Processing with xWCPS

George Kakaletris¹, Panagiota Koltsida², Manos Kouvarakis¹ and Konstantinos Apostolopoulos¹

¹Communications & Information Technologies Experts S.A., Athens, Greece
²Department of Informatics & Telecommunications, University of Athens, Athens, Greece
{gkakas, manosk, kapostolopoulos}@cite.gr, [email protected]

Keywords:

Query Language, Array Databases, Coverages, Metadata.

Abstract:

Management and processing of big data is inherently interweaved with the exploitation of their metadata, also "big" on their own, not only due to the increased number of datasets that get generated with continuously increased rates, but also due to the need for deeper and wider description of those data, which yields metadata of higher complexity and volume. Taking into account that generally data cannot be processed unless enough description is provided on their structure, origin, etc, accessing those metadata becomes crucial not only for locating the appropriate data but also for consuming them. The instruments to access those metadata shall be tolerant to their heterogeneity and loose structure. In this direction, xWCPS (XPath-enabled WCPS) is a novel query language that targets the spatiotemporal data cubes domain and tries to bring together metadata and multidimensional data processing under a single syntax paradigm limiting the need of using different tools to achieve this. It builds on the domain-established WCPS protocol and the widely adopted XPath language and yields new facilities to spatiotemporal datacubes analytics. Currently in its 2nd release, xWCPS, represents a major revision over its predecessor aiming to deliver improved, clearer, syntax and to ease implementation by its adopters.

1

INTRODUCTION

Petabytes of data of scientific interest are becoming available as a result of humanity's increased interest in, and capability of, monitoring natural processes, as well as modelling them and exploring the results of those theoretical models under different conditions. This trend on its own puts the storage and transfer mechanisms of data infrastructures under severe pressure, not to mention the processing ones. Duplicating or moving those data to their final consumption point is usually beyond the capabilities of those infrastructures, as network and local storage cannot keep up with this trend. It is a direct consequence of this observation that it is important to develop mechanisms to support efficient data identification, filtering and in-situ processing that will reduce the need for unnecessary data movement and duplication. On the other hand, it is also evident that in order to pick the appropriate data and subsequently consume them, one needs to be able to identify those data among a huge number of datasets. The consequence is that more complex and more detailed metadata, covering an increasing number of aspects of the data they describe, are required and produced.

As the volume of produced data grows larger, so does the volume of metadata that offer information about them, and it is evident that handling those metadata needs efficient retrieval mechanisms too; however, those mechanisms can no longer be considered independently of the mechanisms that handle the data, as both are needed together. A data science field where those observations apply to their full extent, and on which the work presented herein focuses, is the geospatial one. Data from this domain are multidimensional, diverse in terms of content and size, and are accompanied by metadata that are essential for their retrieval and processing. Regarding data management, traditional database management systems (DBMSs) do not efficiently support array data, which is the most common form of data encountered here. This led to the development of dedicated array DBMSs like SciDB (Brown, 2010) and Rasdaman (Baumann, 1998), which close this gap by extending the supported data structures with multidimensional arrays of unlimited size, thus enabling the efficient storage of spatiotemporal data. Although array databases manage to handle these types of data, they do not deal with metadata filtering and processing in a unified way.



Our approach manages to deliver efficient cross-disciplinary querying and processing of array data and metadata by offering a unified and friendly way to do so through xWCPS 2.0. xWCPS 2.0 is built on the first specification of the language, xWCPS 1.0 (Liakos, 2014), as defined in the EarthServer project (EarthServer.eu, 2018), and refines its characteristics so that it facilitates implementation, improves expected query performance and eases user adoption and usage. In contrast to traditional approaches, where two different queries are required so as to first filter the semi-structured metadata and then retrieve and process the results, in our approach the same functionality can be achieved by executing just one unified query, resulting in the least amount of data transferred to the final consumption point. To overcome those limitations, we propose a query language with a clear and user-friendly syntax and we offer a working engine for it. The engine's core components are a new metadata management engine, called FeMME, which follows a scalable NoSQL approach fitting the needs of the endeavour, and a proven array database system, Rasdaman; we combine them efficiently to support unified processing and retrieval of array data and metadata.

2

CONCEPTS AND MOTIVATION

The fundamental ideas behind this work have emerged from the EarthServer project series, which set as an objective to establish Agile Analytics on Petabyte Data Cubes as a simple, user-friendly and scalable paradigm. The mandate of the project includes the delivery of a standards-based, declarative query language that enhances geospatial data infrastructures by allowing combined multidimensional data and metadata filtering and processing. xWCPS, in its current 2nd version, is built on top of xWCPS 1.0 and combines two widely known specifications, XPath (W3c.org, 2018) and the Web Coverage Processing Service (WCPS) standard (Baumann, 2010), into a single FLWOR (For-Let-Where-Order by-Return) syntax to achieve the aforementioned result. At the root of the overall approach lies the concept of "coverage" (OGC, 2017), a fundamental element in the Open Geospatial Consortium (http://www.opengeospatial.org/) ecosystem. A coverage refers to data and metadata representing multidimensional space/time-varying phenomena. The OGC has introduced a number of standards and specifications for accessing, retrieving and processing coverages,

the Web Coverage Service (WCS) (Baumann, 2012) being one standard that supports access to raster data handled as coverages. WCS defines requests against these data and returns data with their original semantics (instead of raster images). WCS supports the delivery of rich metadata about coverages; however, it gives the metadata provider great flexibility, making assumptions about the nature and form of those metadata irrelevant. Complementing WCS, the Web Coverage Processing Service offers processing capabilities on top of array data using its defined language, allowing ad-hoc processing of coverage data. Examples include deriving composite indices (e.g. a vegetation index), determining statistical evaluations, and generating different kinds of plots such as data classifications, histograms, etc. The primary structure of the WCPS language comprises the for-where-return clauses. The "for" clause specifies the set of coverages that will be examined by a query. The "return" clause specifies the potential output that may be appended to the list of results in each iteration defined by the "for" clause. Criteria used for determining whether the output of "return" is actually appended are specified by the "where" clause. It has to be noted that WCS and WCPS are implemented by the Rasdaman array database, which is OGC's official Reference Implementation for WCS Core. Rasdaman (Baumann, 1998), the raster data manager, is a fully parallel array engine that allows storing and querying massive multi-dimensional arrays, such as sensor data, satellite imagery, and simulation data appearing in domains like earth, space, and life sciences. This array analytics engine distinguishes itself by its flexibility, performance, and scalability. From simple geo-imagery services up to complex analytics, Rasdaman provides the whole spectrum of functionality on spatio-temporal raster data, on both regular and irregular grids. Although in the EarthServer project they are approached from the geospatial domain standpoint and expressed as coverages, multi-dimensional arrays are far from a domain-specific data form and play a central role in all science, engineering, and beyond. Consequently, a significant number of approaches for retrieval from arrays have been proposed for different applications. In practice, though, arrays typically form part of some larger data structure. Array SQL is a horizontal technology that, by its key enabling features of flexibility, scalability, and information integration, enhances all fields of data management. In this domain it is evident that data cannot be consumed without metadata describing their essence.
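To make the for-where-return structure concrete, the short sketch below submits a WCPS query to a WCS endpoint over HTTP. The endpoint URL, coverage name, threshold and output format are hypothetical placeholders for illustration, not values taken from the paper or from a specific deployment.

# Sketch: submitting a WCPS "for ... where ... return ..." query over HTTP.
# The endpoint, coverage name (MyCoverage), threshold and format are hypothetical;
# exact format identifiers vary between server versions.
import requests

WCPS_ENDPOINT = "http://example.org/rasdaman/ows"   # hypothetical WCS/WCPS endpoint

query = """
for c in ( MyCoverage )
where avg(c) > 100
return encode(c, "png")
"""

response = requests.get(
    WCPS_ENDPOINT,
    params={
        "service": "WCS",
        "version": "2.0.1",
        "request": "ProcessCoverages",   # WCS Processing extension operation
        "query": query,
    },
)
response.raise_for_status()
with open("result.png", "wb") as f:
    f.write(response.content)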



Various and important characteristics reside in their metadata, thus making the consideration of joint filtering and processing of data and metadata a fundamental requirement. However, metadata engagement in this context has been largely ignored until recently, and this is the gap that our approach comes to fill. In a prior approach to accommodate this requirement, xWCPS 1.0 (Liakos et al., 2014) was specified and implemented by merging the WCPS standard with the XQuery language, thus eliminating the limitations of the WCPS and WCS queries and allowing parallel and combined querying and processing of both data and metadata. Although xWCPS 1.0 and its initial implementation met the requirements mentioned above and managed to fill the gap of jointly accessing and processing data and metadata, it was evident that it would not easily cope with the challenges of the near future, where billions of datasets may be present in a federated infrastructure or even in a single data server. The main problems of this approach can be summarized as follows: a) its syntax proved to be cumbersome for users, especially when dealing with the XQuery syntax, and b) its engine implementation relied on XML management systems, with full XQuery support, that could not perform as required for extremely large metadata volumes and complexity. These limitations became evident during the adoption tests of xWCPS 1.0 that took place in the EarthServer-1 project. To overcome all these issues, xWCPS 2.0 has been designed and implemented in the EarthServer-2 project and is presented in detail in the following sections.

3

xWCPS 2.0

The management of big multidimensional datasets, e.g. coverages, poses a number of issues and challenges due to their size, nature and the diversity of the phenomena and processes they might represent. Combined with the velocity at which those data are generated, be it from sensing, simulations or transformations, the demand for efficiently identifying, filtering and processing them (if possible in a distributed manner) has emerged. At the point where users need to refer to large data stores, it becomes clear that the simultaneous utilization of metadata and array data is required, so that the precise piece of data needed is located and processed according to its form and characteristics and the requirements of its consumer/client. To accommodate this in our approach, two well-known standards, XPath for metadata


filtering/extracting and the Web Coverage Processing Service (WCPS) for array data processing, are combined, allowing an operation to be executed utilizing both of them without roundtrips or explicit knowledge of the characteristics of the data and the system they reside on, which has been the case until now, at least in the geospatial domain. The result is a declarative query language that follows the For-Let-Where-Order by-Return paradigm (expressed as FLWOR) and offers a clear, well-defined syntax, improving the way scientific data can be accessed and eliminating the need for prior knowledge of the data identifiers and characteristics. A similar approach was followed in xWCPS 1.0; however, use of full XQuery was assumed, leading to poor user acceptance due to its bewildering syntax. The 2nd release drops XQuery in favour of XPath, with some additional elements, yielding several positive results from both implementation and utilization standpoints. In the rest of this section we provide a brief introduction to the fundamentals of xWCPS 2.0, describing the core idea, its syntax and a number of use cases, and summarizing important notions for this paper. In the rest of the paper, for simplicity, we refer to xWCPS 2.0 with the term xWCPS.

3.1 Approach

One of the fundamental operations a query language must offer is querying for all data residing in the database without prior knowledge of their internal representation. WCPS requires the specification of coverage identifiers in selection queries. These identifiers are part of the database's resource description and can be retrieved by issuing a WCS operation. This step introduces overhead in the querying process, which significantly constrains the user-friendliness of the query language and undermines the overall user experience. Another very common feature a query language must offer is filtering results according to specified criteria. However, when a user asks for array data with WCPS in order to select coverages, it is not possible to define conditions on the accompanying metadata. Finally, and importantly, it is fundamental for a language to be able to return all the available information, containing both data and metadata. xWCPS (XPath Enabled WCPS) is a query language (QL) introduced to fill these gaps, merging two widely adopted standards, namely XPath 2.0, for its capabilities in XML handling, and WCPS, for its raster data processing abilities, into a new construct that enables the simultaneous exploitation of


both coverage metadata and payload in data-processing queries. By combining the two, it delivers a rich set of features that improves the way scientific data can be located and processed, by enabling combined search, filtering and processing on both metadata and OGC coverages' payload. In brief, queries expressed in xWCPS are able to utilize coverage metadata (commonly expressed in XML) by incorporating support for the FLWOR expression paradigm and by providing the appropriate placeholders that enable any XPath, WCPS or combined query to be expressed in its syntax. Expressiveness and coherence are key features of the language, now in its second revision, allowing experts dealing with multidimensional array data to easily adopt it and take advantage of its offerings. In general, xWCPS is designed to offer the following features:
• Coverage Identification based on Metadata: WCPS requires the specification of coverage identifiers in selection queries. xWCPS fills this gap and eliminates the need for prior knowledge of the data by offering a unified, rich, expressive and user-friendly interface that allows coverage selection based on an XPath expression.
• Exploitation of Descriptive Metadata: coverage filtering based on the available metadata using XPath 2.0. For and where clauses can contain XPath 2.0 expressions in order to restrict results to specific metadata.
• Repetitiveness Reduction: xWCPS supports variable manipulation, which allows assigning complex expressions to variables and re-using them in subsequent references, avoiding repetitiveness.
• Extended Set of Results Support: an important feature of xWCPS is the ability to return the data accompanied by their metadata.

3.2 Syntax

Queries are the most fundamental part of the language. A simple WCPS query is based on a "for-where-return" structure. An xWCPS query is composed of several expressions, including the three basic WCPS clauses "for-where-return", while introducing the "let-order by" structure and XPath 2.0. Additionally, xWCPS includes special operators that provide easier search abilities for filtering specific metadata. The top-level grammar of xWCPS is presented in Figure 1. xWCPS acts as a wrapper construct on top of XPath 2.0 and WCPS, thus it does not offer any language-specific operations. Every valid WCPS or XPath 2.0 operation is a valid xWCPS operation; xWCPS combines WCPS with XPath 2.0 operations using a rather simple syntactic formalism.

Figure 1: xWCPS Syntax.

3.2.1 For Statement

The "for" statement snippet is: {for variable_name in for_expression}.

It can also contain the let clause, allowing variable definitions that can be used later on. The for clause binds a variable to each item returned by the in expression. There are three options that can be used in a 'for' statement:
• Use all available coverages: *
• Use all coverages of a specific service: *@endpoint (endpoint can be a URL in double quotes or an alias)
• Use specific coverages: coverageId or coverageId@endpoint (endpoint can be a URL in double quotes or an alias)

3.2.2 Let Statement

The let statement snippet is: {let variable_name := wcps_clause;}

The let clause can initialize variables following an assignment expression that finishes with a semicolon. The use of the let clause can greatly reduce



repetitiveness, making xWCPS far less verbose than WCPS. Moreover, arithmetic operations can be executed between defined variables.

3.2.3 Where Statement

The where statement is used to specify one or more metadata- or coverage-related criteria for filtering the returned result. Currently, combined data and metadata join operations are not allowed in xWCPS. Every XPath or WCPS expression evaluating to a boolean result is a valid xWCPS comparison expression. To declare an XPath expression, the "::" notation should follow the variable; this notation fetches the metadata of the coverage on which the XPath is evaluated.

3.2.4 Order by Statement

The order by statement has the following syntax: {order_by_expression (asc | desc)}

Results can be sorted using ORDER BY. As in FLWOR expressions, the construct takes one or more order expressions, each of which can have an optional order modifier (ASC or DESC). The order by clause is used to rank the returned coverages based on an XPath clause applicable to their metadata. If the direction is not defined explicitly, ascending order is used by default.

3.2.5 Return Statement

The return statement of a query specifies what is to be returned and the way that this result should be represented. It can contain textual results, structured XML results, WCPS-encoded results (e.g. png, tiff, csv) or combinations of binary and textual data as mixed results. Since xWCPS acts as a wrapper construct on top of XPath 2.0 and WCPS, the following options are available:
• Use the encode function of WCPS -> WCPS result
• Use the "::" operator -> fetch metadata -> XML result
• Use an XPath 2.0 expression / function -> XML result
• Use the new "mixed" function to combine both -> multipart result


3.3 Use Cases

The features and functionality introduced with xWCPS are presented in this section through a number of use-case examples. The queries demonstrate the expressive power of our language and its advantages over WCPS in array database search. In the context of the EarthServer project we have tested the effectiveness of xWCPS by searching over array databases with terabytes of data and metadata, after registering the services and their metadata into the catalogue. Six services are part of the EarthServer project and all of them make terabytes of data available. More information is available in the public reports of the project.

3.3.1 Retrieving Data and Metadata using Special Characters

XQuery was a key feature of xWCPS 1.0. Now in its second revision, xWCPS is based on XPath in order to achieve user friendliness and simplified queries for retrieving data and metadata. Special characters are introduced for expressiveness, in order to easily retrieve all coverages and/or filter them by endpoint. The example below shows a query that uses both the * and @ special characters to fetch all coverages from a specific service endpoint and return part of the actual coverage as a result. The encode function of WCPS defines the returned result in this specific case.

{for $c in *@ECMWF return encode($c[ansi("2001-07-31T23:59:00")] * 1000 , "png")}

The following query fetches the metadata of a specific coverage using the :: special character.

{for $c in precipitation@ECMWF return $c::}

3.3.2 Building Coverage Filtering Queries using XPath

Filtering the metadata of a coverage through XPath can be applied in both the where and the return clause: in the where clause to decrease the number of results, and in the return clause to manipulate what is presented as a result. The following example accommodates both filter options by filtering an XML attribute for a specific value and then setting that attribute as the result. In this example, the result contains only XML metadata.

{for $c in *@ECMWF where $c:://RectifiedGrid[@dimension=2] return $c:://RectifiedGrid}


3.3.3 Building Coverage Ordering Queries using XPath and the Let Clause

xWCPS supports the 'let' clause, which allows assigning complex expressions to variables and re-using them in subsequent references, avoiding repetitiveness. In the following example a variable called '$orderByClause' is assigned the id of every coverage that matches the 'for' clause. This variable is first used to order the results and then presented to the user as the returned value. The let clause holds the result of a metadata expression filtered by XPath.

{for $c in *@ECMWF let $orderByClause := $c:://wcs:CoverageId/text(); orderby $orderByClause desc return $orderByClause}

3.3.4 Retrieving a Mixed Form Containing Data and Respective Metadata

An important feature of xWCPS is the ability to return the data accompanied by their metadata, reducing the number of queries required and allowing the user to retrieve a single result containing both. This can be achieved using the 'mixed' clause of xWCPS, as can be seen in the example below:

{for $c in CCI_V2_monthly_chlor_a return mixed(encode ($c[ansi("2001-07-31T23:59:00")] * 1000 , "png"), $c::)}

In the query above, the usage of the mixed clause returns a result that contains the actual coverage, processed as the encode function defines, together with the full set of metadata that accompanies it.

The raw result of an xWCPS query is in JSON format and contains both the metadata and the actual coverage in base64 encoding. For convenience, the xWCPS web application supports the execution of xWCPS queries, including all the above examples. Figure 2 shows how a mixed result is displayed in the web application.
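For illustration only, a client could consume such a mixed result as sketched below; the REST endpoint path and the JSON field names ("metadata", "coverage") are assumptions made for this sketch, not the documented xWCPS API.

import base64
import json
import urllib.request

XWCPS_ENDPOINT = "https://example.org/xwcps/query"   # hypothetical endpoint
query = ('for $c in CCI_V2_monthly_chlor_a '
         'return mixed(encode($c[ansi("2001-07-31T23:59:00")] * 1000, "png"), $c::)')

# Submit the query as JSON and read back the mixed (metadata + data) result.
req = urllib.request.Request(XWCPS_ENDPOINT,
                             data=json.dumps({"query": query}).encode(),
                             headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req, timeout=60) as resp:
    result = json.load(resp)

metadata_xml = result["metadata"]                 # assumed field name
png_bytes = base64.b64decode(result["coverage"])  # assumed field name; base64-encoded coverage
open("chlor_a.png", "wb").write(png_bytes)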

4 IMPLEMENTATION

4.1 Base Architecture

The xWCPS implementation consists of two distinct parts. First, a query parser has been implemented to support query translation based on the language definition presented before. It uses the ANTLR 4 framework (ANTLR, 2018) and translates xWCPS queries into source code. The language is specified using a context-free grammar expressed in Extended Backus-Naur Form (EBNF). Open-source grammars for WCPS and XPath are taken from (ANTLR WCPS, 2018) and (ANTLR XPath, 2018), respectively. Second, the xWCPS engine exploits the FeMME metadata management engine for metadata query support and utilizes registered Rasdaman servers for processing the (geospatial) array queries following the WCPS syntax. The overall architecture of the system is shown in Figure 3.


Figure 3: xWCPS Engine Architecture.

Figure 2: xWCPS Web Application.

xWCPS offers both a web application for end users and a REST API for machine-to-machine interaction. Any valid xWCPS query can be executed and the results are returned to the consumer. The flow for executing a query is the following. Initially, the for and where clauses of the query are analysed, producing a composite query which is evaluated against FeMME. Following the composite query strategy, XPath evaluation on FeMME can be optimized by restricting the number of coverages that are considered for the XPath execution.


As soon as the first evaluation step is completed, the return statement is executed using the items returned by that step. Depending on its contents, the return clause can utilize either FeMME or Rasdaman: the encode function is evaluated in its entirety by Rasdaman and its supported geospatial operations, while other expressions, such as XPath and metadata retrieval, use FeMME to generate the returned result.
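A high-level, runnable sketch of this two-step flow is given below. Every function in it is a stand-in stub introduced for illustration (the actual engine API is not shown in the paper); only the orchestration order, metadata evaluation via FeMME followed by per-coverage evaluation of the return clause against FeMME or rasdaman, reflects the description above.

def femme_find_coverages(for_clause, where_clause):
    # Stub for the composite metadata query evaluated by FeMME.
    return [{"id": "precipitation", "endpoint": "https://example.org/rasdaman",
             "metadata": "<RectifiedGrid dimension='2'/>"}]

def rasdaman_wcps(endpoint, wcps_query):
    # Stub for delegating an encode(...) expression to a rasdaman endpoint.
    return b"...png bytes..."

def femme_xpath(metadata_xml, xpath):
    # Stub for an XPath evaluation carried out by FeMME.
    return metadata_xml

def execute_xwcps(parsed_query):
    # Step 1: restrict the candidate coverages using the for/where clauses.
    coverages = femme_find_coverages(parsed_query["for"], parsed_query.get("where"))
    # Step 2: evaluate the return clause for each matched coverage.
    results = []
    for cov in coverages:
        if parsed_query["return"]["kind"] == "encode":          # WCPS result
            results.append(rasdaman_wcps(cov["endpoint"], parsed_query["return"]["wcps"]))
        else:                                                    # XML / metadata result
            results.append(femme_xpath(cov["metadata"], parsed_query["return"].get("xpath")))
    return results

print(execute_xwcps({"for": "*@ECMWF", "return": {"kind": "encode", "wcps": "encode($c, 'png')"}}))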

4.2 FeMME: Metadata Management Engine

Metadata play a significant role in the evaluation of an xWCPS query: they are the means of identifying and filtering the available coverages through the executed queries. The goal of the metadata management engine is to amalgamate all this information into one catalogue offering federated metadata search over the coverages' descriptive information. In order to support the execution of the metadata part of the queries, such an engine, termed FeMME (Federated Metadata Management Engine), has been designed and implemented. The main principles it adheres to are metadata schema agnosticism, in order to support storage, querying and manipulation of descriptive metadata from various data sources, and (XPath) querying performance efficiency. FeMME has been designed to be pluggable and to support the storage of metadata available through different protocols and standards. To this end, a number of sub-components enable the harvesting of metadata for every available collection of coverages, which are first initialized with the metadata available through WCS and can then be enriched from other catalogues supporting them. xWCPS uses FeMME as its central point for identifying and retrieving the required information for each collection of coverages, for gaining access to the DescribeCoverage metadata, and for executing the XPath queries.

4.2.1 XPath Performance Efficiency

In order to overcome the inherent memory and speed limitations of in-memory XPath evaluation, which proved to be the main limitation of the initial implementation, it was decided to exploit the speed and flexibility of NoSQL systems to implement XPath. The technology used initially was MongoDB. The approach followed at first was to flatten an XML document and store each XML element as a separate document in MongoDB. Building a custom parser allowed us to transform an XPath expression into a MongoDB query and evaluate it in the database. An unforeseen


issue was that this method of "indexing" an XML document resulted in the creation of a large number of documents; for example, for a typical response the number of documents produced was over 1000. As the amount of indexed metadata increased, so did the overhead. As a result, a different approach was followed. Each XML document is transformed into a single JSON object that reflects the XML document's structure and hierarchy: each XML element maps to a JSON node, children elements map to children nodes, and namespaces and attributes are also transformed into children nodes. This way it is possible to transform XML to JSON and vice versa without losing any information. In order to achieve better performance, different technologies were evaluated, and Elasticsearch was chosen to provide the storage and querying capabilities. Elasticsearch also stores data as JSON documents but, in contrast to MongoDB, indexes every field of a JSON document, which promised much better performance for queries that look for a value at any level of the document.
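As an illustration of the XML-to-JSON mapping described above, the following minimal sketch turns each XML element into a JSON (dictionary) node, keeps attributes under a reserved key and collects repeated child tags into lists. The reserved key names are assumptions, not the actual FeMME schema, and namespace handling is omitted for brevity.

import xml.etree.ElementTree as ET

def element_to_json(elem):
    node = {}
    # Attributes become a child node under a reserved key.
    if elem.attrib:
        node["@attributes"] = dict(elem.attrib)
    # Element text is preserved so the transformation stays lossless.
    text = (elem.text or "").strip()
    if text:
        node["#text"] = text
    # Child elements map to child nodes; repeated tags become lists.
    for child in elem:
        child_json = element_to_json(child)
        existing = node.get(child.tag)
        if existing is None:
            node[child.tag] = child_json
        elif isinstance(existing, list):
            existing.append(child_json)
        else:
            node[child.tag] = [existing, child_json]
    return node

xml = '<Coverage id="precipitation"><Grid dimension="2">ECMWF</Grid></Coverage>'
print(element_to_json(ET.fromstring(xml)))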

4.3 Federated Geospatial Queries Execution

The geospatial (array) parts of xWCPS queries are executed remotely, by interacting with the appropriate array database engine (e.g. rasdaman) using a WCPS query. FeMME holds all the required information for each registered array database addressed by xWCPS, in order to identify the appropriate service endpoint that holds the coverages defined or specified in the xWCPS query. The system can interact with more than one data management service within a single query, allowing the concurrent retrieval of data that are not part of the same engine. This feature vastly simplifies the implementation of applications that aggregate and present array data from multiple sources in a unified way, rendering end-user application development quite straightforward. The performance of this part of the execution relies heavily on the interconnection and performance of the array database engines that comprise the federation. It has to be noted that the volume of the data transported may become a source of bottlenecks and delays.
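The sketch below illustrates how the WCPS part of a query could be dispatched concurrently to several registered endpoints. The endpoint URLs are placeholders, and the KVP-style ProcessCoverages request shown is only one common way of invoking WCPS on rasdaman, not necessarily the exact mechanism used by the engine.

from concurrent.futures import ThreadPoolExecutor
import urllib.parse
import urllib.request

ENDPOINTS = ["https://service-a.example.org/rasdaman/ows",   # placeholder endpoints
             "https://service-b.example.org/rasdaman/ows"]

def run_wcps(endpoint, wcps_query):
    # Issue a WCS ProcessCoverages (WCPS) request against one endpoint.
    params = urllib.parse.urlencode({"service": "WCS", "version": "2.0.1",
                                     "request": "ProcessCoverages",
                                     "query": wcps_query})
    with urllib.request.urlopen(f"{endpoint}?{params}", timeout=60) as resp:
        return resp.read()

wcps = 'for c in (precipitation) return encode(c, "png")'
with ThreadPoolExecutor() as pool:
    # Fan the same WCPS expression out to every registered service concurrently.
    results = list(pool.map(lambda ep: run_wcps(ep, wcps), ENDPOINTS))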


5 APPLICATIONS

One of the main applications is applying the solution over the Meteorological Archival and Retrieval System (MARS). MARS is the main repository of meteorological data at the European Centre for Medium-Range Weather Forecasts (ECMWF). It hosts operational and research data, as well as data from special projects. The archive holds petabytes of data, mainly using the GRIB format for meteorological fields and the BUFR format for meteorological observations. Most of the data produced at ECMWF daily is archived in MARS and is therefore available to users via its services. The MARS archive integration aims to bring the more than 100 PB MARS archive of ECMWF to its audience via the WCS and WCPS standards. Due to its enormous size it is practically infeasible to ingest the data into a system capable of exposing them via the aforementioned standards, as this would require twice the storage space. Thus, a different approach had to be designed, allowing only the subset of data addressed by a WCS/WCPS operation to be moved out of the archive when required. xWCPS was chosen as the best candidate to address these requirements due to its simplicity, expressiveness and filtering capabilities. As the project bases its array processing capabilities on the Rasdaman engine, the objective is to move as little data as feasible into the Rasdaman data store, on demand, and to offload the remaining processing to the Rasdaman engine. Utilizing the metadata management capabilities of FeMME allowed on-the-fly data retrieval from MARS and subsequent ingestion into Rasdaman. As soon as the MARS data reside in Rasdaman, the aforementioned workflow of an xWCPS query can be carried out. A descriptive diagram of how the MARS system works is shown in Figure 4.

Figure 4: MARS System Integration.

The FeMME and xWCPS subsystems are surrounded by a web UI that allows the exploration of coverages on a familiar virtual globe, offered by the NASA Web World Wind 3D virtual globe technology (NASA WWW, 2018), which allows rendering the coverage data and metadata (bounding boxes). The combined platform supports the handling, retrieval, processing and visualization of data in various Coordinate Reference Systems, making it possible to use the same stack for rendering Earth as well as other solar-system body data.

6 EVALUATION

We evaluate our approach from two perspectives: (a) the language definition and (b) its prototypical implementation. Regarding the language definition, as opposed to xWCPS v1.0 we have an apparently simpler grammar, giving away a significant part of the XQuery processing features whose complexity was the main drawback of the initial approach. This move is, however, in line with the OGC Extended Coverage Implementation Schema, which also suggests XPath for querying hierarchical metadata. It has to be noted that hierarchical metadata are assumed to be the prominent model, be they in JSON or XML form, although they are not the only or most powerful model in place. Naturally, due to the smaller specification size, potential clashes between WCPS and XPath are far fewer and are resolved in a more intuitive manner, while the syntactic sugar that was added supports common use cases identified during the course of the hosting project and based on feedback provided by the rest of the partners, who are experts in the geospatial domain. xWCPS significantly eases the implementation of applications, as it removes the need for the multi-step approach followed by coverage data consumers, which includes at least the steps of locating the dataset, extracting the metadata needed for processing it and, finally, processing its content. That approach not only implies round trips but also requires that the client understands the form of the data/metadata infrastructure, and it reduces the opportunity for server-side optimisation of data management and processing. Regarding prototyping, the current implementation is based on the Rasdaman array database for WCPS queries on the one hand and the FeMME engine for metadata retrieval on the other, with the two parts of the execution orchestrated by the prototype's execution engine. This has the drawback that little optimisation can be performed at query time, as the internal engine structures and processes run separately. Nevertheless, the built-in heuristics (e.g. assuming that XPath execution is always faster than accessing the array data) manage to avoid common pitfalls. In return for this approach, we achieve a



much cleaner implementation that does not rely on a particular engine's characteristics and may be moved from one context to another (e.g. a different array DB or metadata management engine), which is a stronger requirement at the stage of prototyping a language implementation.

7 FUTURE WORK

Having provided an implementation of xWCPS and demonstrated its potential and usefulness, there is still space for several improvements, both in the language definition and in the implementation of the engine that supports it. Full support of the XPath 2.0 specification is the first priority, allowing more efficient filtering of the available metadata and leading to queries that require less data transfer and offer better response times. Subsequently, we plan to work on optimizing the response time of xWCPS queries by improving the FeMME metadata engine, which is the core component of the engine, transforming it from a single component into a distributed one. Apart from that, user-experience features such as auto-complete and a visual query editor could be further investigated based on client feedback, together with the security aspects arising between the different layers of the architecture and the multiple Rasdaman engines registered in the catalogue.

ACKNOWLEDGEMENTS This work has been partially supported by the European Commission under grant agreement H2020 654367, “Agile Analytics on Big Data Cubes (EarthServer 2)” and is powered by rasdaman array database engine and NASA Web World Wind 3D virtual globe.

REFERENCES

Liakos, P., Koltsida, P., Kakaletris, G., and Baumann, P. (2015). xWCPS: Bridging the gap between array and semi-structured data. In Knowledge Engineering and Knowledge Management, pages 120-123. Springer. Baumann, P. (2012). OGC® WCS 2.0 Interface Standard - Core. OGC 09-110r4, version 2.0. OGC. Baumann, P. (2010). The OGC web coverage processing service (WCPS) standard. GeoInformatica, 14(4):447-479. Baumann, P., Dehmel, A., Furtado, P., Ritsch, R., and Widmann, N. (1998). The multidimensional database


system rasdaman. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 575-577. ACM. Brown, P. G. (2010). Overview of SciDB: Large scale array storage, processing and analysis. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 963-968. ACM. Baumann, P., Mazzetti, P., Ungar, J., Barbera, R., Barboni, D., Beccati, A., Bigagli, L., Boldrini, E., Bruno, R., Calanducci, A., Campalani, P., Clements, O., Dumitru, A., Grant, M., Herzig, P., Kakaletris, G., Laxton, J., Koltsida, P., Lipskoch, K., Mahdiraji, A.R., Mantovani, S., Merticariu, V., Messina, A., Misev, D., Natali, S., Nativi, S., Oosthoek, J., Pappalardo, M., Passmore, J., Rossi, A.P., Rundo, F., Sen, M., Sorbera, V., Sullivan, D., Torrisi, M., Trovato, L., Veratelli, M.G., Wagner, S., 2016. Big data analytics for earth sciences: the EarthServer approach. Int. J. Digital Earth 9:3-29. W3c.org. (2018). XML Path Language (XPath) 2.0 (Second Edition). [online] Available at: http://www.w3c.org/TR/xpath20 [Accessed 2 Jan. 2018]. OGC. (2017). The OpenGIS® Abstract Specification Topic 6: Schema for coverage geometry and functions, Version 7. [online] Available at: http://portal.opengeospatial.org/files/?artifact_id=19820 [Accessed Dec. 2017]. ANTLR. (2018). ANTLR. [online] Available at: http://www.antlr.org [Accessed 5 Dec. 2017]. Earthserver.eu. (2018). Home | EarthServer.eu. [online] Available at: http://earthserver.eu [Accessed 2 Jan. 2018]. ANTLR WCPS. (2018). WCPS Grammar. [online] Available at: http://www.rasdaman.org/browser/applications/petascope/petascope_main/src/main/java/petascope/wcps/parser/wcps.g4 [Accessed 1 Jan. 2018]. ANTLR XPath. (2018). antlr/grammars-v4. [online] Available at: https://github.com/antlr/grammars-v4/tree/master/xpath [Accessed 1 Jan. 2018]. NASA WWW. (2018). Web World Wind. Available at https://worldwind.arc.nasa.gov/web/ [Accessed 1 Jan. 2018].

SHORT PAPERS

Improving Urban Simulation Accuracy through Analysis of Control Factors: A Case Study in the City Belt along the Yellow River in Ningxia, China Rongfang Lyu, Jianming Zhang, Mengqun Xu and Jijun Li College of Earth Environmental Sciences, Lanzhou University, Tianshui South Road 222, Lanzhou, China [email protected]

Keywords:

Urban Simulation, Spatial Heterogeneity, Macro-control Influence, SLEUTH-3r Model, City Belt along Yellow River in Ningxia.

Abstract:

Spatial heterogeneity of urban expansion and the macro-scale influence of socioeconomic development are the two main problems in urban-expansion modelling. In this study, we used the SLEUTH-3r model to simulate urban expansion at a fine scale (30 m) for a large urban agglomeration (22,000 km2) in north-western China. Multiple spatial constraint factors were integrated into the model through Ordinary Least Squares regression and Binary Logistic Regression to simulate the spatial heterogeneity in urban expansion. A critical parameter, the diffusion multiplier (DM), was used to simulate the macro-scale influence of socioeconomic development in the urban model. These two methods have greatly enhanced the ability of the SLEUTH-3r model to simulate urban expansion with high heterogeneity, and to adapt to urban growth driven by socioeconomic development and government policy.

1 INTRODUCTION

Urbanization, an unprecedented global phenomenon, has significantly altered natural landscapes and human lives (Zhang et al., 2012). Urban expansion, a significant manifestation of urbanization, has brought numerous threats to ecosystems, such as the loss of natural resources (Delphin et al., 2016), climate change (Singh et al., 2017), and biodiversity loss (Haase et al., 2012). Therefore, it is critical to predict urban expansion patterns for sustainable development, especially in metropolitan areas, which form the basic unit of future socioeconomic development (Poyil and Misra, 2015). Urbanization is a dynamic process influenced by geophysical, environmental, demographic, and social factors at multiple scales (Akın et al., 2014). Complicated interactions between these factors, and the associated temporal changes, lead to spatial and temporal heterogeneity in urban expansion (Li et al., 2017). A number of techniques have been developed to simulate urban expansion, ranging from static models based on gravity theory and optimization mathematics to dynamic models (Berling-Wolff and Wu, 2004). In particular, the cellular automata (CA) model is widely used in urban simulation for its

simplicity, flexibility, intuitiveness, and transparency in modeling complex systems (Santé et al., 2010). However, the CA model often fails to capture the magnitude of urban expansion driven by political and economic strategies (Qi et al., 2004). Despite its successful application in many cities, the SLEUTH model is also a CA model and fails to consider the macro-scale driving influence of socioeconomic development (Berberoğlu et al., 2016, Chaudhuri and Clarke, 2013). Since urbanization in China is strongly driven by government policies, it is essential to integrate these macro-scale control factors into urban models. The SLEUTH model has typically been used to simulate urban land distribution in a single city at coarse resolution (Chaudhuri and Clarke, 2013), but not for large urban agglomerations consisting of several cities with high spatial heterogeneity (Jat et al., 2017). Several approaches have been developed to evaluate the effects of driving forces on urban expansion, such as binary (Haregeweyn et al., 2012), multiple linear (Gao and Li, 2011), and geographically-weighted regressions (Su et al., 2012), the analytic hierarchy process (Thapa and Murayama, 2012), and logistic regression (Long et al., 2012). Among them, multiple linear and binary



regression, both reliable and easy to manipulate, were selected to integrate multiple factors into the SLEUTH model so as to simulate urban spatial expansion with high heterogeneity (Liu et al., 2014). To date, most urban studies in China have focused on fast-growing coastal and major interior cities; urban growth in inland north-western China, especially in large urban agglomerations, has not been well described. Our study helps to bridge this gap, as the study area is a large city belt in north-western China. The main objectives of our study were to: (1) identify factors that control urban expansion and quantify their impacts, (2) simulate urban expansion with high spatial heterogeneity, and (3) integrate the macro-scale driving influence of socioeconomic development into the model to simulate urban expansion with the proper magnitude.

2 STUDY AREA AND METHODS

2.1 Study Area

The City Belt along the Yellow River in Ningxia (CBYN), located in northwestern China, is a large urban agglomeration consisting of four cities: Shizuishan, Yinchuan, Wuzhong and Zhongwei

(Fig. 1). The study area, with the Tengger desert to the west, the Maowusu desert to the east, and the Ulan Buh desert to the north, is one of the core areas of the western Longhai-Lanxin line economic belt. Since 2000, socioeconomic development in this area has been deliberately enhanced by the government through the West Development Project. Gross Domestic Product (GDP) increased from 5,045.93 million Yuan in 1990 to 223,550.29 million Yuan in 2013, with an annual growth rate of 188.27%, while the population increased at an annual rate of 2.75% (Ningxia Statistical Yearbook, 1990-2014). Growing industry and commerce in the urbanized areas provide more work opportunities and attract population from the rural areas, further promoting urbanization.

2.2 Data Collection and Processing

Twelve scenes of Landsat MSS/TM/ETM+/OLI images, covering the study area in 1989, 1999, 2006 and 2016, were used as the primary source data (path/rows 129/33, 129/34 and 130/34). The images were preprocessed in ENVI 5.3, including geographical registration, radiometric calibration and atmospheric correction, and were then exported into eCognition 8.7 for an object-based classification. Reference samples were identified through Google Earth and field surveys to examine the classification accuracy. The Kappa coefficients

Figure 1: Location and administrative division of the study area (Shizuishan, Yinchuan, Wuzhong and Zhongwei): a) the study area in China; b) the study area in Ningxia Hui Autonomous Region; c) topography and the city centers of Shizuishan, Yinchuan, Wuzhong and Zhongwei.



(a consistency test between the classification results and reference samples) reached 0.93, 0.89, 0.91 and 0.87 in 1989, 1999, 2006 and 2016, respectively; thus, the results were considered reliable. The ASTER DEM data (version 4.1) (https://search.earthdata.nasa.gov/) were resampled to 30 m in ArcGIS 10.3 and used to generate slope and hillshade layers. Transportation layers were extracted from the satellite images and by visual interpretation using Google Earth. All the input layers were resampled to 30 m in ArcGIS 10.3, and then imported into Photoshop CS6 to be exported in GIF format. Socioeconomic data, such as population and GDP, were obtained from the Ningxia Statistical Yearbook (1990-2014), compiled by the statistical bureau of Ningxia Hui Autonomous Region and the Ningxia Survey Office of the National Statistical Bureau, and published by China Statistics Press.
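For readers unfamiliar with the metric, the Kappa coefficient reported above measures the agreement between classified and reference labels corrected for chance agreement. A small, self-contained illustration follows; the labels are placeholders, not the study's reference data.

import numpy as np

reference  = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])   # e.g. 1 = urban, 0 = non-urban
classified = np.array([1, 0, 1, 0, 0, 1, 0, 1, 1, 1])

po = np.mean(reference == classified)                    # observed agreement
pe = sum(np.mean(reference == c) * np.mean(classified == c) for c in (0, 1))  # chance agreement
kappa = (po - pe) / (1 - pe)
print(round(kappa, 2))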

2.3 Overview of the SLEUTH Model

The SLEUTH model (Clarke et al., 1997) is designed to simulate urban growth and land use change. The name is formed from the first letters of the input layers: slope, land cover, excluded, urban, transportation, and hillshade. The model simulates urban expansion with four rules: spontaneous growth, which simulates random urbanization; new spreading center growth, which establishes new urban centers; edge growth; and road-influenced growth. The model behaviour is controlled by five growth coefficients (diffusion, breed, spread, road gravity, and slope) that range from 0 to 100, indicating the relative contribution of each growth type to overall urban growth. Moreover, self-modification is applied to better predict rapid or depressed urban growth. Model calibration allows users to obtain parameters describing past urban expansion, while prediction helps forecast urban growth and land use change under different scenarios. Due to the large amount of input data, we selected the 3r version of the SLEUTH model (SLEUTH-3r) for our study; it makes more efficient use of computer memory and has higher simulation accuracy for dispersed settlements (Jantz et al., 2010). Two new accuracy parameters, the area fractional difference (AFD) and the clusters fractional difference (CFD), were designed in the SLEUTH-3r model to compare urban pixels and clusters between simulated and real maps. Besides these, the Lee-Sallee metric, a shape index of the spatial fit between the actual and predicted urban maps, was also used in our study to examine the simulation accuracy.

2.4 Simulating Spatial Heterogeneity

To address spatial heterogeneity in urban expansion, we first established a suitability system of factors driving urban growth from past studies (details in 2.4.1 below). Second, we detected the spatial relationships between the factors and urban expansion through the Ordinary Least Squares (OLS) regression model in ArcGIS 10.3 (details in 2.4.2 below). Finally, the suitability for urban expansion was calculated and mapped through Binary Logistic Regression with the weighted factors derived from the former step (details in 2.4.3 below). The suitability map was then transformed into the excluded layer for the SLEUTH-3r model.

2.4.1 Suitability-Factor System

Different types of explanatory variables have been identified (Gao and Li, 2011, Su et al., 2012) and categorized based on physical conditions, ecological protection, and socio-economic development (Table 1). Ecological factors are protected from urban expansion and are assigned a value of 100 in the excluded layer. The slope factor is not included in the system, as it is already part of the SLEUTH-3r model. All variables were first normalized into the range 0-1 to eliminate the effect of magnitude. Based on correlation analysis, multicollinearity did not exist among the explanatory variables in the subsequent regression analysis.

Table 1: Factors influencing urban development.

Type            Factor                       Code
Physical        Elevation                    XE
Physical        Geomorphic type              XM
Ecological      Water areas                  XW
Ecological      National natural reserves    XN
Socio-economic  Growth rate of GDP           XG
Socio-economic  Growth rate of population    XP
Socio-economic  Distance to city centers     XD1
Socio-economic  Distance to county centers   XD2

2.4.2 Weights Estimation

OLS, which minimizes the sum of squared vertical distances between observed values and simulated values (Gao and Li, 2011), was used to explore the relationships between urban expansion and its driving factors, as follows:

$$Z = C + \sum_{i=1}^{n} w_i X_i + e_r \quad (1)$$

where $Z$ was the dependent variable, $C$ was the constant parameter, $w_i$ was the parameter of the independent variable $X_i$, and $e_r$ was the error term.



Because the non-urbanized area greatly surpassed the urbanized area in CBYN, we randomly selected 5,000 points in each area, with a distance between points of > 300 m to minimize the impact of spatial autocorrelation. The "extract multi values to points" tool in ArcGIS 10.3 was used to obtain the values of the driving parameters and of urban expansion (0 for non-urbanized and 1 for urbanized) at each point. These values were then used to establish the OLS model in ArcGIS 10.3.
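Outside ArcGIS, the weight-estimation step of Eq. (1) can be sketched as an ordinary least-squares fit; the arrays below are random placeholders standing in for the sampled factor values and the urban indicator at the 10,000 points.

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((10000, 6))                 # placeholder normalized factor values per sample point
z = (rng.random(10000) > 0.5).astype(float)  # placeholder urban indicator (0 / 1)

A = np.column_stack([np.ones(len(X)), X])  # prepend a column of ones for the constant term C
params, *_ = np.linalg.lstsq(A, z, rcond=None)
C, weights = params[0], params[1:]         # C and the factor weights w_i of Eq. (1)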

2.4.3 Generating Suitability Maps

If the probability of a cell being suitable for urbanization followed the logistic curve described in Eq. (2), the possibility of a cell being urbanized was estimated with Eq. (3):

$$\ln\frac{p_i}{1-p_i} = C + \sum_{i=1}^{n} w_i X_i \quad (2)$$

$$p_i = \frac{1}{1+\exp\left(-C - \sum_{i=1}^{n} w_i X_i\right)} \quad (3)$$

where $p_i$ was the probability of a cell becoming urbanized, $X_i$ was the driving factor for urban expansion, $w_i$ was the coefficient of each factor derived from OLS, and $C$ was a constant.

2.5 Socioeconomic Factors in the Model

In the SLEUTH-3r model, spontaneous urban growth was the foundation of the other growth types, and was mainly determined by a diffusion multiplier (DM), the diffusion coefficient (DC), and the size of the input images (Jantz et al., 2010). DM could therefore largely determine the simulated magnitude of urban growth in the model, and allowed the integration of socioeconomic development into the model. The DM value was 0.005 in the original version and 0.015 in the 3r version of the SLEUTH model, and neither could generate enough urban growth (AFD ranging from -0.847 to -0.06). Thus, the first problem was obtaining an appropriate DM. As discussed above, DM is related to simulation magnitude, so we explored the relationship between DM and the simulated magnitude of urban area and clusters (AFD and CFD) to find an appropriate DM. We selected the annual growth rates of GDP and population as representatives of socioeconomic development and generated an indicator (SE) using factor analysis in SPSS 22.0. We then explored the relationship between SE and DM through regression analysis in SPSS 22.0, so that DM could be used to represent different socioeconomic development conditions.

3 RESULTS

3.1 Urban Expansion Suitability Map

Multiple linear regression analysis processed in SPSS 22.0 gave the same results as OLS in ArcGIS 10.3 (Eq. (4)). The six factors had different effects on urban expansion, as indicated by their coefficients, and the influence of geophysical factors was greater than that of socioeconomic factors. The regression model was as follows (Eq. (4)):

$$\ln\left(\frac{p_i}{1-p_i}\right) = 1.53 - 1.32 X_E - 0.4 X_M - 0.51 X_{D1} - 0.53 X_{D2} + 0.05 X_G + 0.02 X_P \quad (4)$$

where $p_i$ was the urbanization probability of each cell. Based on binary logistic regression, a probability map for urban suitability was generated (Fig. 2a). We then converted it into an excluded layer, with values ranging from unsuitable for urbanization (value = 100) to suitable (value = 0), for the SLEUTH-3r model using the "map algebra" tool in ArcGIS 10.3 (Fig. 2b). The transformation equation was as follows (Eq. (5)):

$$R_E = \frac{\max(R_{suit}) - R_{suit}}{\max(R_{suit}) - \min(R_{suit})} \times 100 \quad (5)$$

where $R_E$ and $R_{suit}$ were the raster maps of the excluded layer and the suitability map, respectively.
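A minimal numpy sketch of how Eqs. (3) and (5) translate into a raster computation is given below; the factor stack and its ordering are illustrative assumptions, and the coefficients are taken from Eq. (4).

import numpy as np

coeffs = np.array([-1.32, -0.4, -0.51, -0.53, 0.05, 0.02])  # XE, XM, XD1, XD2, XG, XP (Eq. (4))
intercept = 1.53

factors = np.random.rand(6, 100, 100)          # placeholder stack of normalized factor rasters
logit = intercept + np.tensordot(coeffs, factors, axes=1)
p = 1.0 / (1.0 + np.exp(-logit))               # Eq. (3): urbanization probability per cell

# Eq. (5): rescale the suitability map into the 0-100 excluded layer,
# where 100 = unsuitable (excluded) and 0 = most suitable.
excluded = (p.max() - p) / (p.max() - p.min()) * 100.0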

Figure 2: Suitability map for urbanization probability (a) and excluded map for SLEUTH-3r model (b).

3.2 Determination of DM

We explored the relationships between AFD/CFD and DM in the calibration mode of the model with the five growth coefficients ranging from 0 to 100


and an increment of 50. We found that the minimum values of AFD and CFD were almost the same (-0.847 and -0.73) under different DM, while the maximum values increased with increasing DM. The relationships between the maximum values of AFD/CFD and DM were established through regression analysis in SPSS 22.0. The equations and fitted curves were as follows (Eq. (6) with R² of 0.975, Eq. (7) with R² of 0.997, and Fig. 3):

$$AFD_{max} = 3.323 + 0.79 \times \ln(D_M) \quad (6)$$

$$CFD_{max} = 2.69 + 202.67 \times D_M - 188.45 \times D_M^2 + 71.49 \times D_M^3 \quad (7)$$

Figure 3: Maximum and minimum values of AFD and CFD over increasing DM.

From the testing data shown in Figure 3, three DM values (0.03, 0.04 and 0.05) were considered to have the largest opportunity to simulate a sufficient amount of urban area with fewer clusters. We calibrated the model with these three DM values (Table 2), and 0.04 was the most suitable value for DM in our study. Under a DM of 0.04, the maximum value of AFD was 0.783 which, as discussed in Section 2.5, was appropriate for DM determination.

Table 2: Coarse calibration performance of the model under different DM.

DM     AFD    CFD    Lee-Sallee
0.03   0.002  6.9    0.301
0.04   0.001  5.162  0.351
0.05   0.001  6.283  0.309

3.3 The Socioeconomic Factor

The socioeconomic development indicator (SE) was generated with the following equation (Eq. (8), Section 2.5):

$$SE = 8.23 \times 10^{-7} \times GDP_S + 2.74 \times 10^{-5} \times P_S - 0.94 \quad (8)$$

We obtained 30 values of DM through the method discussed in Section 3.2 for the five different areas (the four cities and the whole region) in the six periods (1989-1999, 1999-2006, 2006-2016, 1989-2006, 1999-2016, and 1989-2016). The relationship between DM and SE was estimated with regression analysis in SPSS 22.0 (Eq. (9), with an R² of 0.981). Therefore, the SLEUTH-3r model could predict urban expansion driven by different socioeconomic development conditions by setting the DM value accordingly:

$$D_M = 0.083 \times SE + 0.043 \times SE^2 - 0.011 \times SE^3 + 0.056 \quad (9)$$

3.4 Simulation Accuracy of the Model

The SLEUTH-3r model was calibrated to find the combination of coefficients that best simulated historical urban expansion through the "brute-force" method (Silva and Clarke, 2002). The selection criterion used the minimum absolute values of CFD and AFD (< 0.05). The model was then initialized in 1989 and run in prediction mode to 2016, with the coefficients derived from the calibration. In the prediction mode, we utilized two scenarios: one (S1) came from the suitability map, and the other (S2) coded water with 100 and other land with 50, as a comparison. To evaluate the simulation accuracy, we calculated the Kappa metric and the spatial topology for the predicted maps (Table 3). The Kappa metric (consistency between predicted and real maps) in 2016 under S1 reached 0.77, while under S2 it was 0.56, indicating that S1 could significantly improve model accuracy. Urban spatial topology can further describe the simulation accuracy (Kantakumar et al., 2016); it was classified based on the proportion of built-up area (using 30% and 50% as boundaries) within a neighborhood of 3×3 cells through the "block statistics" tool in ArcGIS 10.3. Prediction under S1 accurately simulated the area of the urban core and 74.82% of the real urban fringe, but 172.86% of the scattered settlement, indicating that most of the simulation error occurred in scattered settlements. Under S2, the main error occurred in simulating the urban core (at 78.29%) and the urban fringe (at 62.38%). Overall, integrating the effects of multiple drivers into the model can greatly enhance its ability to simulate urban expansion with high spatial heterogeneity.

Table 3: Urban spatial pattern predicted in 2016 under different scenarios.

       Urban area (km2)   Kappa   Urban core (km2)   Urban fringe (km2)   Scattered settlement (km2)
S1     1205.81            0.77    1165.67            146.25               349.05
S2     931.51             0.56    911.66             121.94               212.81
Real   1182.123           -       1164.51            195.48               201.93



4 DISCUSSION

The documentation and source code of the SLEUTH model are publicly available, so interested researchers have been able to modify and improve it. Several successful efforts have reduced computation time and increased model efficiency, including OSM (Dietzel and Clarke, 2007), pSLEUTH (Guan and Clarke, 2010), SLEUTH-3r (Jantz et al., 2010), and SLEUTH-GA (Shan et al., 2008), among others. These modifications helped to overcome some of the limitations, enhance model applicability, and provide suggestions for more accurate simulation (Chaudhuri and Clarke, 2013). Using the SLEUTH-3r model, we simulated urban expansion in CBYN during 1989-2016 and confronted three main problems. First, the existing methods for determining DM were not appropriate for our study, as they could not generate a sufficient urban growth area. Second, urban growth in China, which is largely driven by socioeconomic development at the macro scale, could not be effectively expressed in this model. Third, spatial heterogeneity in urban growth, such as between cities and villages in a large urban agglomeration, was an important source of simulation error that needed to be addressed.

4.1 Parameters Driving Urban Growth

Similar to most studies that have analysed urban expansion, the factor system we built in this study was incomplete, due to a lack of data and the presence of unknown urban-growth driving factors (Hietel et al., 2007). For example, urban planning has been shown to greatly affect urban expansion (Long et al., 2012); however, it has not been included in this study due to lack of data. The incomplete picture of the factors driving urbanization was one source of simulation error. In 1989-2016, physical factors impacted urban expansion more than socioeconomic conditions did at the spatial scale. Elevation and morphology exhibited significantly negative effects on urban expansion in CBYN, as low-elevation and flat areas were more suitable for urban growth. Previous studies suggested that the effect of elevation on urban expansion depends on the topography (Li et al., 2013). Positive effects of elevation on urban expansion have been shown in Lagos, Nigeria, where low-elevation areas necessitated drainage, possibly increasing the cost of building construction (Dewan and Yamaguchi, 2009). In CBYN, areas of high elevation were more likely to be situated in the mountains, where the costs of development were higher than at low elevations.


The significant relationships between urban expansion and the social factors of proximity to urban centers (negatively correlated) and growth rates of GDP and population (positively correlated) were consistent with previous findings (Luo and Wei, 2009, Poelmans and Rompacy, 2009). Moreover, the effects of proximity exceeded those of economic development and population growth. This was mainly due to the coarser resolution of the census data compared with the other factors. The spatial heterogeneity of urban and suburban areas could not be expressed by GDP or population data, indicating that data at finer scales are needed. Previous studies on megacities in China and the USA have shown that positive relationships exist between socioeconomic development and urban expansion, especially in developing countries (Kuang et al., 2014), and that socioeconomic factors will play an increasingly important role in urbanization. For example, studies in Beijing (Liu et al., 2014) suggested that the importance of urbanization drivers varied over time, and that the effects of physical and neighborhood factors decreased with increasing socioeconomic factors. Compared with Beijing, CBYN developed at a slower pace in the past thirty years, as indicated by an urban population rate of 67.56% in CBYN in 2015 versus 86% in Beijing in 2010 (Liu et al., 2014). As a result, the impacts of socioeconomic development were less important than those of geophysical conditions, but they are expected to increase in the future.

4.2 Implications of Model Simulation

Chinese megacities are in a stage of development at which population growth, economic development, and policy significantly influence urban expansion patterns and rates. This is unlike megacities in developed countries, where population and economic conditions are not important forces of urban growth (Kuang et al., 2014). The effects of socioeconomic development on urban expansion were classified in this study into two categories, spatial heterogeneity and temporal dynamics; the former was expressed in the excluded layer derived from the suitability map, and the latter was reflected in the changing value of DM. Spatial differences in physical conditions, cultural background, socioeconomic development, and human preferences were responsible for the high heterogeneity in urban distribution and expansion (Lin et al., 2014); this was also reflected in DM, with values ranging from 0.008 to 0.38 among the different cities. This heterogeneity increased the difficulty of precise urban simulation and can be an important


source of simulation error. Linear or logistic regression-based models cannot capture heterogeneous urban expansion due to their dependence on weights (Hu and Lo, 2007). Artificial neural network models also have a limited capacity for accurate modeling of spatial heterogeneity (Almeida et al., 2008). The SLEUTH model can simulate urban growth at coarse resolution well and has been successfully applied to cities all over the world (Akın et al., 2014, Al-shalabi et al., 2012, Bihamta et al., 2014). However, it is still inadequate for simulating urban growth with high heterogeneity, or at high resolution over large scales (Jat et al., 2017). In our study, integrating various spatial factors into the model greatly enhanced the simulation accuracy in an urban agglomeration. The influence of socioeconomic growth on urban expansion, and the fundamental function of DM in controlling the magnitude of urbanization (suggested by Eq. (9)), allowed DM to exert a temporal influence in the model. The high correlation between DM and SE further supports this conclusion. Future research needs to focus on predicting urban expansion under different socioeconomic growth scenarios, and on comparing the effects of government policies on urbanization.

5 CONCLUSIONS

Urban expansion is unavoidable and has significant impacts on ecosystem services and functions. The successful application of the SLEUTH-3r model in the City Belt along the Yellow River in Ningxia at a resolution of 30 m has shown its utility in simulating urban expansion over a large area with high precision. In the past 27 years, the effects of elevation and geomorphology on urban expansion exceeded those of socioeconomic development. We quantitatively integrated these factors into the model to simulate urban expansion with high heterogeneity across a large area with high accuracy. The influence of socioeconomic development was introduced into the model through DM, which can be set interactively. Both of these measures improve model accuracy in simulating urban expansion in urban agglomerations. However, the excessive amount of scattered settlements in the simulation indicates the need for further research.

ACKNOWLEDGEMENTS This research was supported by the National Natural Science Foundation of China under grant No. 41371176 and the Fundamental Research Funds for the Central Universities under grant No. lzujbky_2017_it91.

REFERENCES

Akın, A., Clarke, K. C. & Berberoglu, S. (2014). The impact of historical exclusion on the calibration of the SLEUTH urban growth model. Int J Geographical Inf. Sci, 27: 156-168. Al-shalabi, M., Billa, L., Pradhan, B., Mansor, S. & Al-Sharif, A. A. A. (2012). Modelling urban growth evolution and land-use changes using GIS based cellular automata and SLEUTH models: the case of Sana’a metropolitan city, Yemen. Environmental Earth Sciences, 70(1): 425-437. Almeida, C. M., Gleriani, J. M., Castejon, E. F. & Soares-Filho, B. S. (2008). Using neural networks and cellular automata for modelling intra-urban land-use dynamics. Int J Geographical Inf. Sci, 22(9): 943-963. Berberoğlu, S., Akın, A. & Clarke, K. C. (2016). Cellular automata modeling approaches to forecast urban growth for Adana, Turkey: A comparative approach. Landscape Urban Plan., 153: 11-27. Berling-Wolff, S. & Wu, J. (2004). Modeling urban landscape dynamics: a review. Ecol. Res., 19(1): 119-129. Bihamta, N., Soffianian, A., Fakheran, S. & Gholamalifard, M. (2014). Using the SLEUTH Urban Growth Model to Simulate Future Urban Expansion of the Isfahan Metropolitan Area, Iran. Journal of the Indian Society of Remote Sensing, 43(2): 407-414. Dietzel, C. & Clarke, K. C. (2007). Toward optimal calibration of the SLEUTH land use change model. Transactions in GIS, 11(1): 29-45. Chaudhuri, G. & Clarke, K. C. (2013). The SLEUTH land use change model: A review. The International Journal of Environmental Resources Research, 1(1): 88-104. Clarke, K. C., Hoppen, S. & Gaydos, L. J. (1997). A self-modifying cellular automaton model of historical urbanization in the San Francisco Bay Area. Environ. Plann. B, 24: 247-261. Delphin, S., Escobedo, F. J., Abd-Elrahman, A. & Cropper, W. P. (2016). Urbanization as a land use change driver of forest ecosystem services. Land Use Policy, 54: 188-199. Dewan, A. M. & Yamaguchi, Y. (2009). Land use and land cover change in Greater Dhaka, Bangladesh: using remote sensing to promote sustainable urbanization. Appl. Geogr., 29(3): 390-401. Gao, J. & Li, S. (2011). Detecting spatially non-stationary and scale-dependent relationships between urban landscape fragmentation and related factors using



Geographically Weighted Regression. Appl. Geogr., 31(1): 292-302. Guan, Q. & Clarke, K. C. (2010). A general-purpose parallel raster processing programming library test application using a geographic cellular automata model. Int J Geographical Inf. Sci, 24(5): 695-722. Haase, D., Schwarz, N., Strohbach, M., Kroll, F. & Seppelt, R. (2012). Synergies, Trade-offs, and Losses of Ecosystem Services in Urban Regions: an Integrated Multiscale Framework Applied to the Leipzig-Halle Region, Germany. Ecology and Society, 17(3): 22. Haregeweyn, N., Fikadu, G., Tsunekawa, A., Tsubo, M. & Meshesha, D. T. (2012). The dynamics of urban expansion and its impacts on land use/land cover change and small-scale farmers living near the urban fringe: A case study of Bahir Dar, Ethiopia. Landscape Urban Plan., 106(2): 149-157. Hietel, E., Waldhardt, R. & Otte, A. (2007). Statistical modelling of land-cover changes based on key socio-economic indicators. Ecol. Econ., 62: 496-507. Hu, Z. & Lo, C. P. (2007). Modeling urban growth in Atlanta using logistic regression. Comput. Environ. Urban., 31(6): 667-688. Jantz, C. A., Goetz, S. J., Donato, D. & Claggett, P. (2010). Designing and implementing a regional urban modeling system using the SLEUTH cellular urban model. Comput. Environ. Urban., 34(1): 1-16. Jat, M. K., Choudhary, M. & Saxena, A. (2017). Urban growth assessment and prediction using RS, GIS and SLEUTH model for a heterogeneous urban fringe. The Egyptian Journal of Remote Sensing and Space Science. http://dx.doi.org/10.1016/j.ejrs.2017.02.002. Kantakumar, L. N., Kumar, S. & Schneider, K. (2016). Spatiotemporal urban expansion in Pune metropolis, India using remote sensing. Habitat Inter., 51: 11-22. Kuang, W., Chi, W., Lu, D. & Dou, Y. (2014). A comparative analysis of megacity expansions in China and the U.S.: Patterns, rates and driving forces. Landscape Urban Plan., 132: 121-135. Li, C., Zhao, J. & Xu, Y. (2017). Examining spatiotemporally varying effects of urban expansion and the underlying driving factors. Sustainable Cities and Society, 28: 307-320. Li, X., Zhou, W. & Ouyang, Z. (2013). Forty years of urban expansion in Beijing: What is the relative importance of physical, socioeconomic, and neighborhood factors? Appl. Geogr., 38: 1-10. Lin, J., Huang, B., Chen, M. & Huang, Z. (2014). Modeling urban vertical growth using cellular automata: Guangzhou as a case study. Appl. Geogr., 53: 172-186. Liu, R., Zhang, K., Zhang, Z. & Borthwick, A. G. (2014). Land-use suitability analysis for urban development in Beijing. J Environ Manage, 145: 170-179. Long, Y., Gu, Y. & Han, H. (2012). Spatiotemporal heterogeneity of urban planning implementation effectiveness: Evidence from five urban master plans of Beijing. Landscape Urban Plan., 108: 103-111. Luo, J. & Wei, Y. H. D. (2009). Modeling spatial variations of urban growth patterns in Chinese cities: The case of Nanjing. Landscape Urban Plan., 91(2): 51-64.

166

Poelmans, L. & Rompacy, A. V. (2009). Detecting and modellling spatial patterns of urban sprawl in highly fragmented areas: a case study in the Flanders-Brussels region. Landscape Urban Plan., 93(1): 10-19. Poyil, R. P. & Misra, A. K. (2015). Urban agglomeration impact analysis using remote sensing and GIS techniques in Malegaon city, India. International Journal of Sustainable Built Environment, 4(1): 136-144. Qi, Y., Henderson, M., Xu, M., Chen, J., Shi, P., He, C. & Skinner, W. (2004). Evolcing core-periphery interactions in a rapidly expanding urban landscape: the case of Beijing. Landscape Ecology, 19: 491-497. Santé, I., García, A. M., Miranda, D. & Crecente, R. (2010). Cellular automata models for the simulation of real-world urban processes: A review and analysis. Landscape Urban Plan., 96(2): 108-122. Shan, J., Alkheder, S. & Wang, J. (2008). Genetic algorithms for the calibration of cellular automata urban growth modeling. Photogrammetric Engineering and Remote Sensing, 74: 1267-1277. Silva, E. A. & Clarke, K. C. (2002). Calibration of the SLEUTH urban growth model for Lisbon and Porto, Portugal. Comput. Environ. Urban., 26: 525-552. Singh, P., Kikon, N. & Verma, P. (2017). Impact of land use change and urbanization on urban heat island in Lucknow city, Central India. A remote sensing based estimate. Sustainable Cities and Society, 32: 100-114. Su, S., Xiao, R. & Zhang, Y. (2012). Multi-scale analysis of spatially varying relationships between agricultural landscape patterns and urbanization using geographically weighted regression. Appl. Geogr, 32(2): 360-375. Thapa, R. B. & Murayama, Y. (2012). Scenario based urban growth allocation in Kathmandu Valley, Nepal. Landscape and Urban Planning, 105: 140-148. Zhang, C., Tian, H., Chen, G., Chappelka, A., Xu, X., Ren, W., Hui, D., Liu, M., Lu, C., Pan, S. & Lockaby, G. (2012). Impacts of urbanization on carbon balance in terrestrial ecosystems of the Southern United States. Environ Pollut, 164: 89-101.

Creating a Likelihood and Consequence Model to Analyse Rising Main Bursts

Robert Spivey and Sivaraj Valappil

Thames Water Utilities Ltd, Innovation Department, Reading, U.K.
{robert.spivey, raj.valappil}@thameswater.co.uk

Keywords: Geographical Information System (GIS), Risk Analysis, Spatial Analysis, Spatial Modelling, Data Interpretation, Data Visualisation.

Abstract: A model was created to analyse the likelihood and consequence of a sewage rising main bursting at any given time. Likelihood of failure was analysed through factor analysis using GIS data and historical rising main burst data. Consequence was analysed through spatial analysis in GIS using multiple spatial joins, property density and a cost-of-tankering model built from GIS data. This analysis produced a likelihood and a consequence score for each section of rising main, which were then combined into an overall risk score. These outputs were used to develop a rising main planning tool in the data visualisation programme Tableau to identify high-risk sites and target asset maintenance and rehabilitation works. This paper explains how the tool was created and the benefits of the final outputs.

1 INTRODUCTION

The waste water network is made up of various types of sewer pipes. One of these pipes is known as a rising main.

Figure 1: Rising main diagram (Sharkawi farm, 1999).

As shown in Figure 1, gravity sewers take sewage from houses and connect to a pumping station. The sewage is then pumped up a rising main to the start of another length of gravity sewer. This process continues until the flow reaches a sewage treatment works. Due to the increased pressure from the pumping station there is a risk that the rising main can burst.

  

A sewage rising main bursting can cause a serious issue for the company. This is due to the cost of repair, the cost of tankering, and pollution and flooding fines. To address this issue investment is made into regular replacement of rising main pipes. To identify which areas need the greatest investment, a model was created to identify the areas of rising main that pose the largest risk. Company datasets relating to the sewer network were regularly used throughout this project.

Previously, 3 rising main models have been created by the company:
- 2002/03: Spreadsheet Risk Model
- 2007/08: Probability of Failure x Rolling Ball model
- 2012/13: Updated Probability of Failure x Rolling Ball model

The most recent model is different to previous models because it has split the rising mains into smaller sections and observed other burst factors, such as rising mains located under rail/roads, 'soft' land or 'urban' land, to attempt to identify additional factors that could affect the likelihood of bursting. It also improves the consequence aspect of the risk model, whereas previous models had less robust consequence models.


2 CREATING THE MODEL

2.1 Likelihood

To begin creating this model, historical burst data was obtained which contained a list of every recorded burst since 1994. Burst data is updated regularly every time new bursts are recorded. Bursts are recorded by an eastings and northings coordinate system in a simple Excel spreadsheet. This is then plotted into GIS using the display X/Y feature. When each new burst point is plotted it is saved as a shapefile and then spatially joined to sections of rising main. Based on the distance between the new burst points and the sections of rising main we can work out which section each new burst refers to.
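The same point-to-pipe allocation can be reproduced outside a desktop GIS. The sketch below shows one possible way of doing it in Python with geopandas; it is an illustration only, not the authors' workflow, and the file names, column names and the use of the British National Grid (EPSG:27700) are assumptions.

```python
import pandas as pd
import geopandas as gpd

# Hypothetical inputs: an Excel sheet of bursts (easting/northing columns)
# and a shapefile of rising main sections, both assumed to be in EPSG:27700.
bursts_df = pd.read_excel("bursts.xlsx")
bursts = gpd.GeoDataFrame(
    bursts_df,
    geometry=gpd.points_from_xy(bursts_df["easting"], bursts_df["northing"]),
    crs="EPSG:27700",
)
mains = gpd.read_file("rising_main_sections.shp").to_crs("EPSG:27700")

# Attach each burst to its nearest rising main section; bursts further than
# 50 m from any section stay unmatched, mirroring the point that some bursts
# cannot be allocated with confidence.
matched = gpd.sjoin_nearest(
    bursts, mains[["section_id", "geometry"]],
    how="left", max_distance=50, distance_col="dist_m",
)
print(matched[["easting", "northing", "section_id", "dist_m"]].head())
```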

Figure 2: Diagram of burst identification issue.

As shown in Figure 2, allocating a burst to a specific section of rising main can prove difficult at times, as burst coordinates do not always match up exactly with sections of rising main. This can lead to a burst not being added to the model, as the data is not specific enough to be sure which section of rising main has burst. However, this is only the case for a small percentage of the burst data.

Figure 3: Burst map (Thames Water Utilities Ltd, 2017).

Figure 3 above shows the map of bursts from 1994 up until 2017 across the Thames Valley area. Once we know which section of rising main has burst we can add this to the overall burst database which the model is based upon. This data enabled us to identify certain factors that contributed to a rising main bursting. These factors included:
1. Material
2. Age
3. Ground type
4. Diameter
5. Soil corrosivity

Information regarding material, age, ground type and diameter was accessible through the company records; however, soil corrosivity was identified using the soil map from Cranfield University to show which areas of land are most corrosive. Below are the breakdowns of each factor based on the length in kilometres. Using historical burst data, each category was given a burst rate of number of bursts per kilometre of pipe.

Material
- Plastic
- Iron
- Concrete
- Other
- Unknown

Diameter
- Small (225mm and below)
- Medium (226-600mm)
- Large (above 600mm)
- Unknown

Age
- 1900 or earlier
- 1901-1959
- 1960 or later
- Unknown


Ground Type
- Traffic (rail or road)
- Soft land
- Urban land

Soil Corrosivity
- 0 (Very low corrosivity)
- 1 (Low corrosivity)
- 2 (Low-medium corrosivity)
- 3 (Medium corrosivity)
- 4 (High corrosivity)
- 6 (Very high corrosivity)

Figure 4: Soil corrosivity map (Cranfield University, 2017).

Figure 4 shows the areas of land that are most corrosive within the Thames Valley area. For the purposes of the model, water was assumed to have a soil corrosivity of 0. GIS was used to create this map by adding soil data to GIS and then colour coding based on soil corrosivity score. Each category of burst was then given a burst rate score and matched up with the sections of rising main associated with each category. Some information for the rising mains is unknown due to a lack of information in some of the company records. Any category that had an unknown factor was taken out of the final outputs as it is not an accurate measure. The category with the greatest bursts per km was: other material, medium diameter, 1900 or earlier, soft ground type and soil corrosivity 6.

Figure 5: Tableau table of burst rate categories.

The table above shows the top categories of bursts per km.
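Once each burst has been joined to a pipe section, the burst rate per kilometre for any combination of factors is a simple aggregation. A possible sketch with pandas; the column names are hypothetical stand-ins for the real asset schema.

```python
import pandas as pd

# One row per rising main section, with its factor categories, length in km
# and the count of historical bursts joined to it (hypothetical columns).
sections = pd.read_csv("sections_with_bursts.csv")

factors = ["material", "diameter_band", "age_band", "ground_type", "soil_corrosivity"]

# Drop categories containing unknown values, as in the paper.
known = sections[~sections[factors].isin(["Unknown"]).any(axis=1)]

rates = (
    known.groupby(factors)
    .agg(bursts=("burst_count", "sum"), length_km=("length_km", "sum"))
    .assign(bursts_per_km=lambda d: d["bursts"] / d["length_km"])
    .sort_values("bursts_per_km", ascending=False)
)
print(rates.head(10))   # the worst-performing factor combinations
```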

2.2 Consequence

Consequence was then added to the model through 3 factors:
1. Distance to specific locations
2. Property density
3. Tankering cost

Specific locations were identified by the consequence to the business and society of flooding. Distances considered include:
- Hospitals
- Schools
- Roads (Motorways, A-Roads and B-Roads)
- Water
- Sites of Special Scientific Interest (SSSIs)
- Bio habitats
- Underground Stations
- Railways

The distances to each of these points of interest were analysed through spatial joins in GIS by combining shapefiles of rising main locations and spatial locations of all the areas listed above. Shapefiles of all these points of interest were created by obtaining easting and northing positions for each location and importing this data into GIS from Excel spreadsheets using these easting and northing positions.


Figure 6: Rising main map (Thames Water Utilities Ltd, 2017).

Figure 6 shows a map of rising mains and their location across the Thames Valley area. To analyse distances to various points of interest other spatial data is added to the map then spatially joined from the rising main data. For example Figure 7 shows the rising main data combined with motorway data across the Thames Valley area.

Figure 7: Map of rising mains and motorways (Thames Water Utilities Ltd, 2017).

After the distance to each of these points of interest had been analysed for each section of rising main they were combined into an overall distance ratio by taking an average of all the distances. This allows the model to take into account sections of rising main that are close to more than one point of interest rather than just how close it is to an individual location. After spatial distance data had been analysed we then looked at the property density that each section of rising main falls into. To add this we combined a square kilometre grid across the whole of the Thames Valley area with property data. This allowed us to create a count of properties per each square kilometre. Rising main location data was then added to this grid count to analyse which property grid square each section of rising main was in. This allowed us to allocate a number of properties per section of rising main.
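The property density step amounts to a point-in-grid count followed by a join back to the pipe sections. A minimal sketch with geopandas and shapely is given below; the layer names, column names and the centroid-based join are assumptions, not the authors' exact implementation.

```python
import numpy as np
import geopandas as gpd
from shapely.geometry import box

properties = gpd.read_file("properties.shp")          # one point per property
mains = gpd.read_file("rising_main_sections.shp")

# Build a 1 km x 1 km grid covering the property layer.
xmin, ymin, xmax, ymax = properties.total_bounds
cells = [box(x, y, x + 1000, y + 1000)
         for x in np.arange(xmin, xmax, 1000)
         for y in np.arange(ymin, ymax, 1000)]
grid = gpd.GeoDataFrame({"cell_id": range(len(cells))}, geometry=cells, crs=properties.crs)

# Count properties per grid square.
counts = gpd.sjoin(properties, grid, predicate="within").groupby("cell_id").size()
grid["property_count"] = grid["cell_id"].map(counts).fillna(0).astype(int)

# Attach the count of its grid square to each rising main section via its centroid.
centroids = mains.copy()
centroids["geometry"] = mains.geometry.centroid
joined = gpd.sjoin(centroids, grid[["property_count", "geometry"]], how="left", predicate="within")
joined = joined[~joined.index.duplicated(keep="first")]   # guard against boundary cases
mains["property_count"] = joined["property_count"]
```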


Figure 8: Property density heat map.

Figure 8 shows a heat map of property density across the Thames Valley area. The colour scale ranges from green to red, with red being the highest number of properties. As expected, the highest numbers of properties are located in and around the London area.

After property density had been considered, we added tankering cost to the consequence model. Tankering is the process of providing tankers at the location of the burst so that the waste water fills the tankers rather than flooding across the burst area. In order to add this, a separate model was created to analyse tankering cost. This model was created by combining 5 separate factors to create a tankering cost per section of rising main. These factors include:
1. Distance from pumping stations to tanker depots
2. Distance from pumping stations to sewage treatment works
3. Flow data in the rising mains
4. Diameter of rising main
5. Length of rising main

Flow data, diameter and length were accessible through company records; however, the distances were created by spatially joining locations of pumping stations to tanker depots and sewage treatment works in GIS. At this point in the model, intervention data was added to the outputs. Intervention data is data regarding which lengths of rising main have been recently replaced. These lengths are then removed from the outputs, as it is assumed that a recently replaced pipe has a reduced risk of bursting again. The 3 consequence factors were then combined to create an overall consequence of failure score for each section of rising main.


Likelihood and consequence scores were rated between 0 and 10, with 10 being the highest and 0 the lowest. By taking the average of these scores, an overall risk score was created in order to rank each section of rising main on its risk priority.
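The final scoring step can be expressed in a few lines. The sketch below assumes min-max scaling onto the 0-10 range (the paper does not state how the raw factors were rescaled) and hypothetical column names, and it reuses the high-risk thresholds quoted later in this section.

```python
import pandas as pd

sections = pd.read_csv("scored_sections.csv")   # hypothetical per-section raw scores

def to_0_10(series: pd.Series) -> pd.Series:
    """Min-max scale a raw score onto the 0-10 range used for the risk plot."""
    return 10 * (series - series.min()) / (series.max() - series.min())

sections["likelihood"] = to_0_10(sections["raw_likelihood"])
sections["consequence"] = to_0_10(sections["raw_consequence"])
sections["risk"] = sections[["likelihood", "consequence"]].mean(axis=1)

# High-risk band described in the paper: consequence > 5.5 and likelihood > 5.
high_risk = sections[(sections["consequence"] > 5.5) & (sections["likelihood"] > 5)]
print(high_risk.sort_values("risk", ascending=False).head(10))
print("High-risk length (km):", high_risk["length_km"].sum())
```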

Figure 9: Likelihood consequence plot.

Figure 9 shows the overall plot of the likelihood and consequence scores for each section of rising main. To identify the top sites that need attention, the 10 sections of rising main with the highest risk scores were observed. The highest risk sections were those with a consequence score over 5.5 and a likelihood score over 5. The total length of rising main in this category came to 7km. Hence, if 7km of rising main were replaced it would remove all of the high risk sections from the model. 7km may seem like a large amount of pipe; however, the overall length of rising main incorporated in the model is 2109km. Therefore, only 0.33% falls in the high risk area of this risk plot.

Figure 10: Map of high risk sites.

Figure 10 shows the locations of the sites that, based on the model created, have the greatest risk score.

3 PLANNING TOOL

After this model had been created it was then adapted into a user friendly planning tool. This was created within the data visualisation programme Tableau. Tableau was chosen for this planning tool as it allows spatial and other data files to be combined into one user-friendly, interactive dashboard. The planning tool contains data relating to each section of rising main, for example the region that the rising main falls within and the contact details of the operational staff member responsible for the rising main section. The region was identified by spatially joining the rising main file to the operational regional boundary file in GIS. This is very useful as, if there is an issue with a certain section of rising main, the member of staff responsible can be quickly contacted in order to resolve the issue. The planning tool will be used by many members of staff across the business. Therefore, the planning tool needs to be user friendly in order for staff members from a non-analytical background to use it effectively. This is achieved through easy-access information dashboards that can be filtered through drop down menus relevant to the maps or graphs. Updating the model is also extremely user friendly. New burst data is added to the original burst spreadsheet and Tableau will update all of the models and dashboards based on this new data. This allows the model to stay updated, reducing the need for a new model to be created when the current data set is outdated. The planning tool will be distributed across the business in the form of a packaged workbook file in Tableau Reader. This allows access for all staff across the business without them being able to edit the original file. Due to this, only one Tableau server license is needed to share this tool with the rest of the business.

4 CONCLUSIONS

To conclude, this paper has shown how GIS spatial analysis and modelling is used by the water industry to analyse the impact of a rising main bursting. This model will provide a direction for rising main replacement investment. It allows the business to efficiently replace the minimum amount of rising main pipe based on how detrimental a burst would be in that section, therefore maximising the operational cost saving.


Without the use of the tools within GIS this model would have been a lot more difficult to create. Simple tools in GIS such as spatial joining were influential in the making of this model. The outputs from this project include a list of all rising main sections with their associated risk scores and a user friendly planning tool to be used across the business. This model has areas for improvement using further applications in GIS and other programmes. The model could be improved by adding LiDAR data to the consequence modelling in order to analyse the heights of all the sites listed within the distance factor of consequence. This would give a better insight into the flow of the flooding out of a burst rising main. For example, if a school is downhill from a rising main burst, the flooding is more likely to flow towards the school than if the school were higher than the burst. The model will be further improved by adding a more detailed likelihood model based on further analysis that looks to identify which likelihood factors are most strongly linked to a burst. This is likely to be modelled within the statistical programme R using logistic regression.

REFERENCES

Sharkawi farm (1999). Rising main diagram (online). Available at: http://www.sharkawifarm.com/lift/liftstation-pump-wiring-diagram (Accessed 27th September 2017).
Cranfield University (2017). Soil corrosivity dataset (Accessed 5th September 2017).
Thames Water Utilities Ltd (2017). Sewer network datasets (Accessed 10th August 2017).


Identifying the Impact of Human Made Transformations of Historical Watercourses and Flood Risk

Thomas Moran1, Sivaraj Valappil1 and David Harding2

1Waste Innovation – Thames Water Utilities Limited, Island Road, Reading, U.K.
2Waste Planning & Optimisation – Thames Water Utilities Limited, Island Road, Reading, U.K.
{thomas.moran, sivaraj.valappil, david.harding}@thameswater.co.uk

Keywords: Geographical Information System, Lidar, Culvert, Watercourse, Digital Terrain Model, Sewer, Flood.

Abstract: In the past, many urban rivers were piped and buried either to simplify development, to hide pollution or in an attempt to reduce flood risk; such piped and buried channels are what we term culverted watercourses. A large number of these watercourses are not mapped, and if they are, their original nature is not clearly identifiable because they are recorded as part of the sewer network. Where culverted watercourses are not mapped, having been lost to time and development, we refer to them as so-called 'lost rivers'. There is a lack of awareness of the flood risk in catchments housing these rivers, and because many of them are incorrectly mapped as sewers, there is often confusion over their legal status and responsibility for their maintenance. To identify the culverted watercourses, many datasets were used, including LiDAR data (ground elevation data), historical maps (earliest 1840s), asset data (the sewer network), and the river network. Automatic and manual identification of potential culverted watercourses was carried out, and the mapped assets were then analysed with flooding data to understand the impacts. A GIS map has been created showing all potential lost rivers and sites of culverted watercourses in the North London area.

1 INTRODUCTION

London has a large legacy population of culverted and concealed watercourses, dating from the 19th and 20th centuries. Since these structures were built, changes to the governance of drainage have resulted in many assets being transferred between authorities and in the process, comprehensive records have not always survived. There is often uncertainty surrounding the legal status and responsibility for the maintenance of culverts. For example, many culverted watercourses in London were included on the map of public sewers, where their original status has become obscured over time. This can be a significant obstacle to the proper stewardship of the structures. In addition, the culverting of watercourses causes problems such as increasing upstream flood risk due to blockages, reduced ecological value within concrete channels and with reduced daylight and adverse effects on environmental features and wildlife. The issues are summarised as: Inadequate maintenance and investment – the responsibility for drainage assets varies according to their legal status. For example responsibility for a watercourse normally rests with the owners of land

through which it flows (riparian owners). Where a watercourse is incorrectly mapped or not mapped at all, owners may be unaware of their responsibilities. Different agencies often assume that others are responsible for such assets, and as a result appropriate maintenance regimes are not in place. Many of these assets are critical structures with a high impact of failure. They should be subject to regular inspection and have adequate investment plans for their maintenance and eventual replacement. More immediately, culverts often have grilles at their inlets and outlets which can become easily blocked, or debris causes a blockage within the culvert, potentially leading to flooding. Poor understanding of flood risk – culverted sections of watercourse may drain large, upstream catchments that extend far beyond the urban area. Such a situation may not be clear from drainage records and if it is not appreciated, can lead to understatement of the flood risk as well as concealing potential upstream solutions to flooding. A recent study on drainage capacity relating to the surface water drainage system around the Mill Hill Circus junction in London by Transport for London (TfL) is a good example of this. During periods of medium to heavy rainfall, the Mill Hill Circus roundabout floods


in three different areas. These flooded areas spread across lanes 1 and 2 and a footway (Transport for London, 2014).

Differing legislation and flood risk management – different asset types are governed by various pieces of legislation, which give design criteria for flood risk management and define stewardship responsibilities for agencies. Floods in urbanised areas have a greater impact due to the exclusion of rivers in those areas. The watercourses have been substituted by sewers, which are not designed to convey intense rainfall as effectively as flood defences would be. Also, whilst rivers have a degree of protection against development with regard to flood risk, developers and property owners have the right to connect to sewers, regardless of flood risk.

Funding – funding for different types of drainage comes from multiple sources, e.g. sewerage investment is funded from customer bills, while land drainage comes from a combination of local taxes and levies and central government grants. If an asset is assigned to the wrong owner, they may not be able to access funds to maintain it.

An example of a watercourse that encapsulates all of these issues is the Caterham Bourne, a chalk-fed river that flows from the North Downs into South London. Much of its length is culverted and different culverts are variously mapped as ordinary watercourses, main rivers or as sewers (Surrey County Council, 2015). During a recent severe flood event, there was considerable dispute over responsibility for the different culverts, leading to delays in clearing blockages and prolonging the significant traffic disruption caused by the flooding. In the subsequent investigation into the flooding, the different legal statuses of the culverts meant the different agencies applied different thresholds of risk, since their origins are not clear. This is hampering the development of a coherent flood alleviation strategy and is an obstacle to identifying funding for investment.

Deculverting (or daylighting) these watercourses can bring advantages including ecological benefits, reduced flood risks, recreation for local communities and a stimulus for regeneration. The evidence for these impacts is sparse; however, these are the opportunities that present themselves when considering daylighting the watercourses.

2 METHODOLOGY

To identify the culverted watercourses, many datasets will need to be used, including LiDAR data, historical maps, asset data, and the river network. The potential culverts can then be drawn in GIS and plans can be put in place to ensure they receive the correct maintenance. This section outlines the datasets used and the mapping of the lost rivers through GIS.

2.1 GIS Asset Data

This dataset included:
- Gravity sewers
- Invert levels
- Manholes
- Operational sites
- Sewer end items

During analysis, gravity sewer coverage was the most useful, with surface water sewers being identified as the most likely candidates for being culverts. Of these sewers, pipes with a large (>500mm) diameter were seen to have a higher probability. The following sewer types were included in the analysis:
- Surface (S)
- Storm Overflow (SO)
- Other (O)

Sewer end types were also thought to be useful. The dataset was filtered to include only those that had notes in the watercourse attribute or which had “inlet”, “outfall” or “culvert” in the comments.
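Both filters are plain attribute queries. A hedged sketch with pandas follows; the attribute names stand in for the real asset schema and are assumptions.

```python
import pandas as pd

sewers = pd.read_csv("gravity_sewers.csv")
ends = pd.read_csv("sewer_end_items.csv")

# Candidate culverts: surface (S), storm overflow (SO) or other (O) sewers,
# preferring large diameters (>500 mm).
candidates = sewers[
    sewers["sewer_type"].isin(["S", "SO", "O"]) & (sewers["diameter_mm"] > 500)
]

# Sewer end items with a watercourse name, or with inlet/outfall/culvert
# mentioned in the free-text comments.
keywords = "inlet|outfall|culvert"
end_targets = ends[
    ends["watercourse"].notna()
    | ends["comments"].str.contains(keywords, case=False, na=False)
]
print(len(candidates), "candidate sewers;", len(end_targets), "sewer end targets")
```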

2.2 Historical Mapping

Datasets from around 1840 and 1935 were provided by the National Library of Scotland and were available at various scales, enabling identification of field boundaries. Rivers and watercourses were digitised from this mapping where they were present on the mapping but not on the EA main river network layer. Some smaller watercourses were also identified, as the Ordnance Survey (OS) labelled them with flow direction.

2.3 EA Main River Network

This data was in the form of a shapefile showing the centroids of the main river channels as defined by the EA. The dataset shows both currently exposed watercourses and a number of culverted rivers. However, there did not seem to be any particular logic as to which of these covered watercourses were mapped, and the provenance of the data is unavailable.


Figure 1: 1:10,500 mapping showing flow direction arrow.

2.4 EA Lidar Data

This dataset recently became open data, but the quality and resolution of the available data was variable. However, the 2m digital terrain model (DTM) data was selected to be utilised as it was adequate for picking out drainage channels and it also provided the most complete coverage. The data is supplied in ESRI ASCII format (.asc files) in 10km by 10km tiles. These were converted in Quantum GIS (QGIS) to ERDAS IMAGINE format (.img files) and mosaicked to form a seamless dataset over the study area. This data was run through an automated drainage extraction routine in QGIS since the data is inherently noisy and a number of manmade features interfere with the natural drainage patterns (roads, railways etc.)
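The tiling, format conversion and mosaicking described here can be reproduced with the GDAL Python bindings. The following is a minimal sketch under assumed directory and file names; the drainage-extraction chain itself is shown in Section 3.

```python
import glob
from osgeo import gdal

# Mosaic the 10 km x 10 km ESRI ASCII tiles into one raster, then write it out
# in ERDAS IMAGINE (.img) format, as described in the text.
tiles = glob.glob("lidar_tiles/*.asc")
vrt = gdal.BuildVRT("dtm_2m.vrt", tiles)            # virtual mosaic of all tiles
gdal.Translate("dtm_2m.img", vrt, format="HFA")     # "HFA" is the ERDAS IMAGINE driver
vrt = None                                          # close/flush the VRT dataset
```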

Figure 2: Cuttings and embankments in the EA DTM.

However, this is true of all DTM products. The full resolution 2m DTM was found to give a dense network of drainage, far too detailed for the purpose.

Figure 3: Detailed drainage from 2m DTM overlain on 1900 1:10,500 mapping.

3 GIS ANALYSIS

It was decided to reduce the scale of the DTM to 10m resolution and use thinning and cleaning techniques to produce the final drainage output from the EA LiDAR data. The 2m DTM was resampled to 10m, then the dataset was run through the r.watershed routine in QGIS. The parameters applied to the dataset include:
- Minimum size of exterior watershed basin: 100
- Maximum length of surface flow: 0
- Convergence factor for MFD: 5
- Beautify flat areas: selected

The process produced a raster output with pixels of varying value tracing the drainage pathways. This was then run through a thinning routine (r.thin in QGIS) that removed excess pixels from the drainage, outputting a single pixel path for the drainage. The raster dataset was then converted to shapefile by the r.to.vect routine in QGIS. This still resulted in a fairly complex drainage pattern, so in order to simplify it a little more a cleaning routine was run to remove dangling vectors under 100m in length. It was then run through the v.clean routine in QGIS using rmdangle as the cleaning tool and 100 as the threshold. A considerably simplified drainage output was the result of this process. It was clear that the watercourses digitised from the 1900 historical mapping were the primary indicator of potential lost rivers. A number of large diameter surface water sewers were observed closely following the course of the original watercourses and these became high confidence targets.
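The same chain can be scripted from the QGIS Python console with processing.run(). The sketch below is indicative only: the GRASS algorithm identifiers are as exposed by the QGIS GRASS provider, but the exact parameter keys vary between QGIS releases, so they should be checked with processing.algorithmHelp() before use.

```python
# Intended for the QGIS Python console with the GRASS provider enabled.
# Parameter keys below are indicative and may differ between QGIS versions.
import processing
from osgeo import gdal

# Resample the mosaicked 2 m DTM to 10 m before drainage extraction.
gdal.Warp("dtm_10m.tif", "dtm_2m.img", xRes=10, yRes=10, resampleAlg="average")

# r.watershed: threshold = minimum size of exterior watershed basin,
# convergence = convergence factor for MFD, '-b' = beautify flat areas.
ws = processing.run("grass7:r.watershed", {
    "elevation": "dtm_10m.tif",
    "threshold": 100,
    "convergence": 5,
    "-b": True,
    "stream": "TEMPORARY_OUTPUT",
})

# Thin the drainage raster to single-pixel paths, vectorise it, then remove
# dangling lines shorter than 100 m.
thin = processing.run("grass7:r.thin", {"input": ws["stream"], "output": "TEMPORARY_OUTPUT"})
vect = processing.run("grass7:r.to.vect", {"input": thin["output"], "type": 0,  # line features
                                           "output": "TEMPORARY_OUTPUT"})
processing.run("grass7:v.clean", {"input": vect["output"], "tool": ["rmdangle"],
                                  "threshold": [100], "output": "drainage_clean.shp",
                                  "error": "TEMPORARY_OUTPUT"})
```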


3.1 Buffer Zones

As the sewers did not exactly follow the original watercourses, it was necessary to add a buffer zone around the line of the watercourses. A buffer of 50 meters was used for the historical rivers and EA river network in order to include those sewers that run parallel to the original watercourse. This figure was derived from trial and error so as to include known targets. As the EA LiDAR drainage was less well defined, two buffer zones of 30m and 100m were used; the first to capture high probability targets and the second to capture lower probability targets.
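The buffer-and-select step can be scripted in the same spirit. A sketch with geopandas follows, using the buffer distances quoted above; the layer names are assumptions.

```python
import geopandas as gpd

sewers = gpd.read_file("filtered_sewers.shp")            # S/SO/O sewers of interest
hist_rivers = gpd.read_file("digitised_1900_rivers.shp")
ea_rivers = gpd.read_file("ea_main_river_network.shp")
lidar_drainage = gpd.read_file("drainage_clean.shp")

def within_buffer(lines: gpd.GeoDataFrame, targets: gpd.GeoDataFrame, dist: float) -> gpd.GeoDataFrame:
    """Return the sewers intersecting a `dist`-metre buffer around the target lines."""
    buffered = gpd.GeoDataFrame(geometry=targets.buffer(dist), crs=targets.crs)
    hits = gpd.sjoin(lines, buffered, predicate="intersects")
    return lines.loc[hits.index.unique()]

near_hist = within_buffer(sewers, hist_rivers, 50)          # historical rivers, 50 m
near_ea = within_buffer(sewers, ea_rivers, 50)              # EA main rivers, 50 m
near_lidar_hi = within_buffer(sewers, lidar_drainage, 30)   # higher-probability targets, 30 m
near_lidar_lo = within_buffer(sewers, lidar_drainage, 100)  # lower-probability targets, 100 m
```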

3.2 Sewer End Items

Sewer end items that include a watercourse name or "inlet", "outfall" or "culvert" in the comments were felt to be indicative of natural drainage. These were filtered from the original dataset, buffered to 10m to ensure intersection with the sewer network and then used to select output vectors from the 100m buffered EA LiDAR drainage dataset.

3.3 Lost Rivers Model

The model was constructed in ERDAS IMAGINE Spatial Modeller (Sterling Geo, 2016).

3.3.1 Examples

The following demonstrates some of the features found in this investigation. Firstly, here is the OpenStreetMap (OSM) data over an area in West London, where there is no trace of surface watercourses:

Figure 4: OSM of an area with no surface watercourses.

And this is what the same area looked like around 1900:

Figure 5: OS 1:10,500 Historical Mapping.

Now, the digitised drainage (blue line), the EA river buffer (green shading) and the drainage extracted from the EA LiDAR (green line) can be overlaid.

Figure 6: Overlay showing extracted drainage.

It is clear that the EA LiDAR drainage follows the river quite well, but the railway interferes with the drainage path. The EA main river network is mostly good, but it cuts a corner on the 1900 river path. This matches up to the filtered sewer network in the following way:


Figure 7: Overlay of drainage buffers and sewer network.


The large (>300mm diameter) sewers (thick red line) in this instance provide a close match to the digitised drainage network, with the smaller diameter sewers not relevant.

Figure 8: Target high probability sewers over OSM.

Figure 9: Sewer end targets in yellow.

In other areas, a high concentration of sewer end targets (Figure 9, in yellow) may also be an indicator of former watercourses.

3.4 Output

The following seven shapefiles were produced:

Table 1: Shapefiles in order of probability of being a culverted sewer (Group; Diameter of sewer; Data used).
P7; >300mm; within 50m of the digitised rivers from historical mapping.
P6; >300mm; within 50m of the EA Main River Network.
P5; 300mm; within 50m of the digitised rivers from historical mapping.
P4; within 30m of the drainage network extracted from EA LiDAR DTM.
P3; 300mm; within 50m of the EA Main River Network.
P2; within 100m of the drainage network extracted from EA LiDAR DTM that also intersects with the filtered sewer end outfall/inlet/culvert/watercourse points.
P1; >300mm; within 100m of the drainage network extracted from EA LiDAR DTM.

These criteria proved to be too broad and identified over 1023km of pipes as culverted watercourses, out of the 5245km of pipe in the trial area of North London. Therefore, we decided to use the digitised rivers from historical mapping along with the EA main river network to perform further analysis. Also included was the whole gravity sewer network in the trial area, filtered so only surface (S) and storm overflow (SO) sewers with diameter over 300mm were considered. These pipes were then classified as "highly likely", "possible" and "not likely" to be a culverted sewer in the following way:

Figure 10: Example of highly likely culverted watercourses.

A sewer (green) connecting two watercourses (blue) or following closely to the digitised lost river (pink) was classed as "highly likely" if the diameter is greater than 600mm, or "possible" if between 300mm and 600mm. In addition, when an Ordnance Survey (OS) watercourse ends but the EA river network continues, the sewers connected to this have been classed as possible. See Figure 11 for a "possible" watercourse shown in orange:


Figure 11: Example of a possible culverted watercourse.
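The classification rules quoted above can be captured in a small decision function. The sketch below is an illustration only; the positional tests ("connects two watercourses", "follows a lost river", "at the end of an OS watercourse where the EA network continues") are assumed to be pre-computed boolean flags.

```python
def classify(diameter_mm: float, connects_watercourses: bool,
             follows_lost_river: bool, at_os_end_of_ea_river: bool) -> str:
    """Sketch of the classification rules for >300 mm surface/storm overflow sewers."""
    if connects_watercourses or follows_lost_river:
        if diameter_mm > 600:
            return "highly likely"
        if diameter_mm > 300:
            return "possible"
    if at_os_end_of_ea_river:
        return "possible"
    return "not likely"

# Example: a 750 mm sewer linking two mapped watercourses.
print(classify(750, connects_watercourses=True,
               follows_lost_river=False, at_os_end_of_ea_river=False))  # -> 'highly likely'
```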

All other surface and storm overflow sewers greater than 300mm are classed as "not likely". Since some of these watercourses were not all connected but had the same GISID, it was necessary to categorise the lost rivers by a letter (describing likelihood due to positioning) and a number (identifying the connected river). The details of the categories are as follows:

Table 2: Category description (Category; Sewer description; Probability of being a culverted sewer after manual checking).
A; Between two watercourses and likely; Possible
B; Between two watercourses and highly likely; Highly Likely
C; Follows the path of a lost river and likely; Possible
D; Follows the path of a lost river and highly likely; Highly Likely
E; End of EA river network and likely; Possible
F; End of EA river network and highly likely; Highly Likely
(none); None of the above; Not Likely

4 FLOOD RISK

Analysis was carried out to identify areas of culverted sewer flooding using 70 years of surface water flooding data and 16 years of hydraulic flooding data.

4.1 Results

The rate of "highly likely" watercourses flooding is much greater than the rate of "possible" and "not likely" watercourses flooding.

Figure 12: Rate of flooding per kilometre of pipe.
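The comparison behind Figure 12 is a rate calculation: flooding incidents per kilometre of pipe within each class. A brief sketch with pandas (column names are hypothetical):

```python
import pandas as pd

pipes = pd.read_csv("classified_pipes.csv")   # columns: culvert_class, length_km, flood_incidents

rate = (
    pipes.groupby("culvert_class")[["flood_incidents", "length_km"]].sum()
    .assign(floods_per_km=lambda d: d["flood_incidents"] / d["length_km"])
)
print(rate.sort_values("floods_per_km", ascending=False))
```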


Figure 13 shows a site where "highly likely" culverted watercourses (green) have the same flooding patterns (beige) as water features (blue).

Figure 13: Similarities in flooding patterns between culverted watercourses and water features.

Figure 14 shows an area with hydraulic sewer flooding (brown points circled in red) due to culverted watercourses (green):

Figure 14: Example of hydraulic flooding due to watercourses.


5 CONCLUSIONS

Through spatial modelling and analysis we have produced a lost river map in North London and identified 83 “highly likely” culverted watercourse sites, 12 of these were found to have had hydraulic sewer flooding in the last 8 years. In addition, 47 “possible” culverted sewer sites were found, 5 of which had hydraulic sewer flooding in the last 8 years. There are some obvious examples of where pipes have been culverted and have the same flooding patterns as rivers. There is also evidence to suggest that culverted watercourses are flooding at a higher rate than non-culverted watercourses. Further work has been planned to complete the lost river mapping and identification of culverted sewers across the London area to aid future investigations into the flooding risk of other culverts. More field trials are required to evaluate the asset characteristics and structural conditions of these assets. At those sites, engagement with all relevant agencies will occur to explore the issues and options surrounding the ownership of the assets and responsibility for their stewardship. Observations from this exercise will be incorporated into a draft template for establishing stewardship regimes at similar, high-risk sites.

REFERENCES

Transport for London (2014). Report on "A1 Mill Hill Circus – Capacity Study". Report submitted by AECOM to Transport for London.
Surrey County Council (2015). Report on "S19 Flood Investigation – The Caterham Bourne, London".
Sterling Geo (2016). Report on "Mapping Lost Rivers", submitted to Thames Water Utilities Limited.
CIWEM (2007). Policy Position Statement on Deculverting of Watercourses. Chartered Institution of Water & Environmental Management, London.
Wild, T., Bernet, J., Westling, E. & Lerner, D. (2011). Water and Environment Journal, 25: 412-421.


Evaluation of AW3D30 Elevation Accuracy in China

Fan Mo1, Junfeng Xie1 and Yuxuan Liu2

1Satellite Surveying and Mapping Application Center, NASG, Beijing, China
2School of Remote Sensing and Information Engineering, University of Wuhan, Wuhan, China
{mof, xiejf}@sasmac.cn, [email protected]

Keywords: Digital Surface Model, AW3D30, ALOS DSM, Elevation Accuracy Evaluation, National Control Point Image Database.

Abstract: The AW3D30 dataset is a publicly available, high-accuracy digital surface model; the model's cited nominal elevation accuracy is 5 m (1σ). In order to verify the accuracy of AW3D30, we selected China as the test area, and used field measurement points in the national control point image database as control data. The elevation accuracy of the field measurement points in the national control point image database is better than 1 m. The results show that the accuracy of the AW3D30 satisfies the requirement of 5 m nominal accuracy, with elevation accuracy reaching 2 m (1σ). Accuracy is related to both terrain and slope. Accuracy is better in flat areas than in areas of complex terrain, and the eastern region of China is characterized by better accuracy than the western region.

1 INTRODUCTION

The Advanced Land Observing Satellite (ALOS) is a high-resolution three-line array stereo remote sensing and surveying satellite launched by the Japan Aerospace Exploration Agency (JAXA) on January 24, 2006. Its primary mission is to complete 1:25,000 scale terrain mapping of global key areas (Rosenqvist et al., 2007; Shimada et al., 2010). The Panchromatic Remote-sensing Instrument for Stereo Mapping (PRISM) carried by ALOS has three 2.5-meter resolution panchromatic cameras that are used for forward-view, nadir-view and backward-view earth observation along the track direction, respectively. Precise three-dimensional surface information can be obtained through forward intersection processing (Takaku and Tadono, 2009; Rosenqvist et al., 2014). On May 31, 2016, JAXA released the ALOS Global Digital Surface Model "ALOS World 3D-30 m" (AW3D30). The model provides a global 30-meter grid produced from ALOS stereoscopic images, with a nominal elevation accuracy of 5 meters. The AW3D30 currently possesses the highest accuracy among global public digital elevation models, and its accuracy has been verified in many countries around the world (Zhihua et al., 2017; Takaku et al., 2014; Tadono et al., 2014). In order to verify the accuracy of AW3D30 elevation data in China, we adopted the national

control point image database as an evaluation benchmark. This image database is a core component of the Environment and Disaster Monitoring Engineering Based on Moonlet Constellation. The database contains a total of about 350,000 generalized control points, including about 73,000 field measurement control points. The initial aim of this database is to meet the need of matching with satellite optical images. The overall accuracy of the control point image database can meet plane and elevation requirements of national 1:50,000 scale mapping (Yu, 2012). The elevation accuracy of the field measurement control points acquired by GPSRTK (Global Position System Real Time Kinematic) included in the database is better than 1 m. Therefore, by using the control point image database as a control benchmark, we can precisely and objectively evaluate the elevation accuracy of AW3D30 data in China.

2 DATA DESCRIPTION

2.1 AW3D30

ALOS operated in orbit for 5 years, acquiring a large amount of image data with global coverage.


AW3D30 data were generated from approximately 3 million ALOS PRISM 2.5 m resolution three-line array stereo images that, in general, covered global land areas. Due to the limitations of panchromatic camera imaging, there are few images for global waters. Thus, the AW3D30 data do not include information on ocean elevations. The original digital surface model (DSM) data from AW3D30 are 5-meter grid digital images. However, because of the amount of data generated (and other reasons), JAXA only publicly released the 30-meter grid AW3D30 data. The released AW3D30 data contain two versions related to differences in the 5 to 30 meter down-sample processing: AVE and MED. AVE uses a mean filter to down-sample the raw data, whereas MED uses a median filter. According to JAXA's release plan, AW3D30 data are divided into three versions. Currently released AW3D30 data belong to version 1.1; the main data parameters associated with this version are shown in Table 1 (Takaku et al., 2016; Tadono et al., 2016).

Table 1: Listing of the primary parameters of the AW3D30 (Parameter; Value).
Image file; 16-bit integer, gray value represents elevation, the unit is meters
Each view coverage area; 1° × 1°
Resolution; 1" × 1"
Vertical accuracy; 5 m (RMSE)
Coordinate system; Latitude and longitude (ITRF97 [GRS80])
Elevation type; Normal

Since the elevation type of the AW3D30 data is normal height, it is necessary to introduce a geoid height when performing an elevation accuracy evaluation. In order to bring the data onto the same elevation reference, we used the EGM2008 model to calculate the geoid height of the corresponding points (Pavlis et al., 2013; Hirt et al., 2011).
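In practice the comparison reduces to: sample the DSM at each control point, bring the GPS ellipsoidal heights onto the geoid-referenced datum by subtracting the geoid height (H ≈ h - N), and accumulate the differences. The sketch below uses rasterio and numpy; the file names, column names and the availability of a pre-computed EGM2008 undulation per point are assumptions.

```python
import numpy as np
import pandas as pd
import rasterio

# Control points: longitude, latitude, GPS ellipsoidal height h, EGM2008 geoid height N.
pts = pd.read_csv("control_points.csv")
pts["h_geoid"] = pts["h_ellipsoidal"] - pts["geoid_height"]   # H ≈ h - N

with rasterio.open("AW3D30_tile.tif") as dsm:
    samples = dsm.sample(zip(pts["lon"], pts["lat"]))
    pts["dsm_height"] = [s[0] for s in samples]

diff = pts["dsm_height"] - pts["h_geoid"]
rmse = float(np.sqrt(np.mean(diff ** 2)))
print(f"n = {len(pts)}, mean = {diff.mean():.2f} m, RMSE = {rmse:.2f} m")
```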

2.2 Control Point Image Database

China covers a vast area characterized by large climatic differences between the North and the South. Due to its size, disasters are difficult to monitor in real time, resulting in a greater threat to public safety and economic security. For this reason, the

Ministry of Civil Affairs National Disaster Reduction Center started the national control point image database construction project in 2010. The project took the China Institute of Surveying and Mapping two years to complete. The control point image database covers a total of 31 provinces (Taiwan, Hong Kong and Macau are not covered). The control point image database contains about 350,000 generalized control points, most of which are pass points, as well as some field measurement points and measurement points obtained from large-scale aerial digital orthophoto maps (DOM). The accuracy of the pass points and of the points collected from the large-scale aerial DOM imagery is lower than that of the field measurement points. Therefore, only the field measurement points in the control point image database were selected as experimental control data. During the process of evaluating the elevation accuracy of the DSM, there is no need to measure an image point; hence, there is no measurement error in this process. The accuracy of the selected elevation control data is better than 1 m (Yu, 2012).

3 RESEARCH METHOD

3.1 Nationwide Comprehensive Evaluation

At present, AW3D30 data cover all global lands, including the entire territory of China. In order to macroscopically verify the overall accuracy of the AW3D30 elevation data in China, we selected all the field measurement points in the national control point image database to evaluate the elevation accuracy of AW3D30 data in the coverage area. Then, we individually calculated each province as an independent sample to examine trends in the accuracy of the nationwide AW3D30 data based on a provincial division.
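Treating each province as an independent sample is a grouped version of the same statistic. A short sketch, assuming the per-point elevation differences and a province attribute have already been computed as above:

```python
import numpy as np
import pandas as pd

pts = pd.read_csv("control_points_with_diff.csv")   # columns: province, diff (metres)

by_province = pts.groupby("province")["diff"].agg(
    n="size",
    mean="mean",
    rmse=lambda d: float(np.sqrt(np.mean(d ** 2))),
)
print(by_province.sort_values("rmse"))
```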

3.2 Typical Terrain Evaluation

China is vast, extending from a longitude of E73°33' to E135°05' and latitude of N3°51' to N53°33'. In general, China's terrain is elevated in the west and low in the east, and exhibits a ladder-like distribution. The mainland of China is topographically complex, and can be subdivided into five basic types of terrain: plateaus, mountains, plains, hills and basins. The basic terrain types in mainland China are shown in Table 2.


Given the imaging mechanism of ALOS PRISM sensors, different terrain may exhibit different mapping accuracy. Therefore, in order to validate the elevation accuracy of AW3D30 data under different terrain conditions, we selected typical areas within the five terrain types for accuracy analysis, and quantitatively evaluated the elevation accuracy of AW3D30 data within each type of terrain.

Table 2: Basic types of terrain (landscapes) in China (Terrain; Elevation variations; Typical areas).
Plateau; Qinghai-Tibet Plateau, Inner Mongolia Plateau, Loess Plateau, Yunnan-Guizhou Plateau and Pamirs
Mountain; Elevation >500 meters, with large topographic variations; Great Himalayas, Hengduan Mountains, Nanling, Qinling and Taihang Mountains
Plain; Elevation of; Northeast plain, North