 

Journals Homepage:  www.sustenere.co/journals

SCALING SPATIAL BIG DATA IN A LOCATION-BASED SOCIAL NETWORK

ABSTRACT

The spread of the World Wide Web has resulted in a high volume of voluntarily generated information in different formats, including text, photographs and video. The technological advances of recent years have enabled the emergence and popularization of mobile devices equipped with GPS and Internet connectivity. This scenario contributed to the advent of several location-based applications and aroused the interest of many users in the geographical context of information. One example of such applications is the Location-Based Social Network (LBSN), in which users interact with information classified by geographic region. In the context of Smart Cities, for instance, citizens can pin their criticisms, opinions and comments on various topics related to their city or neighborhood. LBSNs have increasingly attracted the interest of the population and have consequently registered an increase in both the number of interacting users and the volume of shared information. This popularity raises concerns about scalability, since it is essential to provide an environment that keeps users active and motivated to contribute. Thus, LBSNs must ensure acceptable response times, especially for the spatial queries performed by their users; otherwise, such applications may collapse as their loyal users abandon them. Among the several LBSN proposals in the community, it is still difficult to find approaches concerned with scalability. In this context, this paper proposes an approach based on Big Data technologies to provide scalability in LBSNs and thus handle large volumes of spatial data. Our approach exploits NoSQL databases, the Map/Reduce technique and the development of extensions for indexing and querying Spatial Big Data.

KEYWORDS: Spatial Big Data; NoSQL; Social Networks; LBSN; Smart Cities.

ESCALANDO BIG DATA ESPACIAL EM UMA REDE SOCIAL BASEADA EM LOCALIZAÇÃO RESUMO A disseminação da Internet tem resultado na geração de um grande volume de informações nos mais diversos formatos, como textos, fotografias e vídeos. Os avanços tecnológicos dos últimos anos permitiram o surgimento e a popularização de vários dispositivos móveis equipados com GPS e conectividade à Internet. Esse cenário contribuiu para o advento das mais variadas aplicações baseadas em localização, despertando o interesse de vários usuários no contexto geográfico das informações. Um exemplo dessas aplicações são as Redes Sociais Baseadas em Localização (LBSN), onde os usuários interagem com informações classificadas por região geográfica, como no contexto das Cidades Inteligentes, onde os cidadãos interagem depositando suas críticas, opiniões e comentários sobre variados temas relacionados à sua cidade ou bairro. As LBSNs tem despertado cada vez mais o interesse da população e, consequentemente, registrando aumento tanto na quantidade de usuários interagindo como no volume de informações compartilhadas. Essa popularidade traz a tona preocupações com a escalabilidade, uma vez que é primordial prover um ambiente que mantenha os usuários ativos e motivados em contribuir. Desse modo, as LBSNs precisam garantir tempos de resposta aceitáveis, sobretudo nas consultas espaciais realizadas pelos usuários, do contrário, tais aplicações podem entrar em colapso em função do abandono de seus fiéis usuários. Dentre várias LBSNs propostas na comunidade, ainda é difícil encontrar abordagens preocupadas em escalabilidade. Neste contexto, este artigo propõe uma abordagem baseada em tecnologias para Big Data para escalar LBSNs e, dessa forma, lidar com grandes volumes de dados espaciais. Nossa abordagem explora banco de dados NoSQL, a técnica Map/Reduce e o desenvolvimento de extensões para indexação e consulta de Big Data espacial. 
PALAVRAS-CHAVES: Big Data Espacial; NoSQL; Redes Sociais; LBSN; Cidades Inteligentes.

Revista Brasileira de Administração Científica (ISSN 2179‐684X)   © 2014 Sustenere Publishing Corporation. All rights reserved.  Rua Dr. José Rollemberg Leite, 120, CEP 49050‐050, Aquidabã, Sergipe, Brasil  WEB: www.sustenere.co/journals – Contact: [email protected]  

 

Revista Brasileira de  Administração Científica,  Aquidabã, v.5, n.2, Out 2014.    ISSN 2179‐684X    SECTION: Articles  TOPIC: Sistemas e Tecnologia da  Informação 

  Anais do Simpósio Brasileiro de  Tecnologia da Informação (SBTI 2014) 

 

 

DOI: 10.6008/SPC2179‐684X.2014.002.0011 

Maxwell Guimarães de Oliveira
Universidade Federal de Campina Grande, Brasil
http://lattes.cnpq.br/9070169649750195

Ana Gabrielle Ramos Falcão
Universidade Federal de Campina Grande, Brasil
http://lattes.cnpq.br/4654742780714984

Cláudio de Souza Baptista
Universidade Federal de Campina Grande, Brasil
http://lattes.cnpq.br/0104124422364023

Hugo Feitosa de Figueirêdo
Instituto Federal de Educação, Ciência e Tecnologia da Paraíba, Brasil
http://lattes.cnpq.br/9466135849011391

Daniel Farias Batista Leite
Universidade Federal de Campina Grande, Brasil
http://lattes.cnpq.br/9968731111485780

    Received: 31/08/2014  Approved: 15/10/2014  Reviewed anonymously in the process of blind peer. 

Referencing this:
OLIVEIRA, M. G.; FALCÃO, A. G. R.; BAPTISTA, C. S.; FIGUEIRÊDO, H. F.; LEITE, D. F. B. Scaling spatial big data in a location-based social network. Revista Brasileira de Administração Científica, Aquidabã, v.5, n.2, p.141-155, 2014. DOI: http://dx.doi.org/10.6008/SPC2179-684X.2014.002.0011


INTRODUCTION

The spread of the World Wide Web, which acquires thousands of new users each day, has generated a high volume of different types of information. Technological evolution and easier access to GPS-enabled mobile devices, e.g. smartphones and tablets, have enabled the emergence of Location-Based Social Networks (LBSNs). LBSNs provide context-aware services which help associate users with content (VICENTE et al., 2011). In this kind of social network, much information is voluntarily generated in different formats including text, photographs and video, thus giving rise to a crowdsourcing environment. Crowdsourcing explores the perceptual and cognitive abilities of a group of individuals and makes them true human sensors (participatory sensing) (ERICKSON, 2010).

All this volunteered information can be explored in several ways and scenarios. One scenario that has received great attention and several contributions is the Smart Cities approach (HARRISON & DONNELLY, 2011; HELAL, 2011). Several solutions have been applied in the context of Smart Cities around the world. Their main goal is to improve the infrastructure and services delivered by local government in large cities and, ultimately, the population's quality of life. Although digital sensors have often been used for information retrieval, novel human sensor-based approaches have been proposed and have presented promising results (DEMIRBAS et al., 2010; FURTADO et al., 2010; BRABHAM, 2009). These approaches empower the population, allowing citizens to identify issues and search for solutions that improve collective welfare. In Erickson's (2010) point of view, humans, unlike computer systems, can contribute qualitative and deep knowledge, and can more easily analyze and identify incomplete and incoherent data.

In this sense, Falcão et al. (2012) developed the Crowd4City infrastructure, an LBSN applied to the Smart City domain. This LBSN supports participatory human sensors, aiming to provide an environment for identifying and discussing issues related to city governance and the population's shared interests. The solution provides for the active participation of both community and government: the former can share the issues it discovers, taking spatial and temporal dimensions into account, while the latter can better plan effective actions to meet citizens' demands.

The Crowd4City LBSN was conceived using RDBMS (Relational Database Management System) technologies enabled for spatial data handling. As the number of users increases, so does the amount of information they report. Scalability problems such as high processing costs then arise, since traditional RDBMSs were not originally designed to withstand such huge amounts of information. The phenomenon involving huge amounts of information is known as Big Data (NATURE, 2008) or, more precisely in our case, Spatial Big Data (CATTELL, 2010). The latter differs from the former by the addition of the spatial dimension. The term Big Data has emerged recently and is associated with the large volume of data produced daily by several computational devices (OLIVEIRA et al., 2013).

The scalability of systems such as Crowd4City is a primary factor in keeping users interested. If the system does not present acceptable response times, it will suffer a severe decrease in user participation and consequently collapse. Facing these issues, novel database systems supporting Big Data and Spatial Big Data requirements had to be implemented. NoSQL (Not Only SQL), for instance, is a well-known approach for scaling huge volumes of data on a low-cost architecture.

In this work, we propose replacing Crowd4City's RDBMS (SQL-based) with a NoSQL-based architecture and perform a case study focusing on scalability. The main objective of this paper is to improve the response time of LBSNs such as Crowd4City. It will therefore be possible to verify whether NoSQL outperforms SQL in an LBSN environment even without an optimal hardware setup for Big Data processing.

The remainder of this paper is structured as follows: related work is discussed in the next section; the proposed approach is presented in the section after that; a further section addresses the case study and a comparative analysis between the SQL and NoSQL architectures; and finally, in the last section, we conclude the paper and discuss further work to be undertaken.

RELATED WORK

The main goal of NoSQL databases is not to replace relational DBMSs, but to be useful in cases where the database structure may be more flexible, with the aim of achieving better performance (BAPTISTA et al., 2014; SADALAGE & FOWLER, 2012). NoSQL systems were therefore conceived to scale to thousands or millions of users performing simple operations on the data, such as updating or retrieving information. The Apache Hadoop project is a well-known infrastructure for dealing with Big Data (WHITE, 2012).
Hadoop is an open source framework used for maintaining and processing data on a huge scale. It is considered an efficient tool that provides scalability, reliability and distributed computing. It is composed of the Hadoop Distributed File System (HDFS) and the distributed processing technique Map/Reduce (LEE et al., 2011). The Hadoop project comprises several subprojects, including the HBase database. HBase is a NoSQL database based on Hadoop that supports the structured and optimized storage of big tables, but does not support SQL. HBase scales to massive volumes of data, offering support for billions of records and millions of columns (WHITE, 2012; JIANG, 2011) and partial support for spatial data.

To date, numerous studies have been performed on LBSNs and geosocial networks within the context of citizen participation. We now present those most relevant to our proposal.

Shankar et al. (2012) discussed the relevance of location-based services (LBS) and the difficulties in gathering information about locations. In this study, the authors developed the Social Telescope service, which compiles, indexes and classifies locations based on user interactions. The results show that this crowdsourcing-based approach returned relevant and accurate results at a lower cost than other common approaches such as PageRank. In this way, the authors attested to the power of crowdsourcing in location-based services.

Scellato et al. (2011) presented a study of the spatial properties of the social networks arising among users of popular location-based social networks, finding that these LBSNs exhibit universal spatial features across networks and strong heterogeneity across users.

Traynor and Curran (2013) discussed location-based social networks, focusing on their purpose, value and challenges. From the study conducted, they identified that guaranteeing user privacy and security is a difficult challenge; however, these social networks have the power to provide relevant and high-quality information. The authors also identified that the amount of data generated and processed by such services creates a highly interesting context for gathering information concerning collective intelligence.

The aforementioned studies confirm the value of LBSN and LBS; however, none of the authors considers scalability issues. System performance is a key aspect that may determine whether users' attention and interest are held or lost. This aspect is also important for policy makers, since LBSNs are a means of retrieving substantial information about citizens' perception of their environment, which helps in the decision-making process.

Baykurt (2012) discussed the technical and political issues raised by digital citizen-driven public service improvements, highlighting the FixMyStreet.com location-based social network.
Using this system, citizens are able to report problems in their neighborhood such as potholes or bad lighting. Despite confirming the value of civic engagement in public services, this system does not support data processing on a large scale, which affects its performance.

Furtado et al. (2010) developed the WikiCrimes system, a social network where users can report crimes (such as robbery, attempted murder, etc.). WikiCrimes gives citizens better insight into the safety of certain locations in their city, since it provides information analysis tools and employs reputation strategies to prevent false reports. However, once again the performance of the system may be compromised by a large volume of information and users, since the authors did not employ any Big Data approach. The system is also strictly tied to the crime domain.

WeGov (WANDHOFER et al., 2012) is an electronic government project based on connecting citizens and authorities by means of popular social networks such as Twitter and Facebook. In this way, it is possible to gather the population's perception of their city and then prepare new political strategies. However, even though the retrieved information is displayed on a map, the geographical data are underexploited.


Although these related studies address location-based collaborative environments, they do not take into account a high-performance storage approach such as NoSQL. This aspect may be crucial from the users' point of view, since they are usually sensitive to response time. Our proposal therefore addresses performance issues from the LBSN user's point of view: users are interested in fast interactions in environments such as LBSNs, and this must be taken into account.

Falcão et al. (2012) presented the Crowd4City system, a location-based social network that enables participatory democracy, allowing users to report problems or make suggestions in any field regarding their city. Crowd4City explores the available spatial information, allowing users to better analyze the gathered data and to improve the decision-making process.

A NoSQL-Based Storage Architecture for LBSN

In the Crowd4City system, users can contribute by creating georeferenced alerts classified into several categories such as security, transportation and education, among others. These contributions are shared in the social network, and its users can interact with them. All of this shared information can be enriched by users' comments that confirm the real occurrence of the related facts or denounce false ones. Both the population and government agencies can use this information in several ways. The latter can, for example, use it for strategic planning and the general improvement of government services.

The current architecture of the Crowd4City system is composed of three layers: presentation, business and data. The presentation layer contains the user interaction components. The business layer processes the requests delivered by the presentation layer and implements the system's business logic.
The data layer handles data storage and is composed of three repositories: the first stores semantic information, the second stores the multimedia files shared by users, and the last both stores traditional and geographical data and performs geographical operations.

Within the current Crowd4City architecture, our approach focuses on the last repository of the data layer, which stores the data of the LBSN. We replaced the current architecture, based on the PostgreSQL¹ and PostGIS² RDBMS technologies, with one based on HBase (JIANG, 2011), a NoSQL database system, so that the LBSN response times can be improved. The adoption of NoSQL is motivated by the need to process and analyze large volumes of data and by the complexity of such tasks for conventional database systems such as RDBMSs (PATEL et al., 2012). The proposed approach is illustrated in Figure 1 and compared with the current architecture of the Crowd4City system.

¹ PostgreSQL: http://www.postgresql.org
² PostGIS, the spatial extension for PostgreSQL: http://postgis.net


Figure 1: The current and the proposed LBSN data repository of the Crowd4City’s data layer

The Hadoop (WHITE, 2012) and HBase (JIANG, 2011) technologies were chosen for the development of the new data repository, as illustrated in Figure 1. The main reasons for these choices were: the supported data schema, since HBase is column-based and that schema is the most compatible with the current relational schema of Crowd4City (among the other existing NoSQL schemas, such as document-oriented, graph, etc.); the distributed storage and processing environment provided by Hadoop; and the HBase Map/Reduce implementation for queries on the database.

A column-based database schema that fits the HBase constraints was developed from the relational database schema of the Crowd4City system, so that it could store the same information about the LBSN. The stored information basically comprises user registration data and both spatial and non-spatial data related to the georeferenced markers created and maintained by users in the social network environment.

The GeoHash algorithm (DIMIDUK & KHURAN, 2013) was adopted for indexing the spatial data. GeoHash allows spatial data to be stored sorted by spatial proximity. Therefore, a spatial search in a specific geographical range no longer necessarily has to go through the entire database. Sample data using the GeoHash index are shown in Table 1. Each record of the table corresponds to a known point plotted on the map shown in Figure 2. Each bounding box is identified by a unique GeoHash prefix, which is the same for all points contained inside the bounding box. An example can be seen in the first three records of Table 1: all these GeoHashes have the string "6gy8zk" as prefix, which implies that the smallest bounding box containing these three points has the GeoHash "6gy8zk", as can be seen in Figure 2. Likewise, the larger bounding box identified by the GeoHash "6gy8z", obtained by removing the final letter "k" from the previous prefix, contains all the records listed in Table 1.
Table 1: Sample spatial markers stored in the HBase column-based schema developed

Point Number | Spatial Index | Marker ID | Timestamp     | Description
1            | 6gy8zkk52xv5  | 105951847 | 1368637960000 | Noisy dogs
2            | 6gy8zksunj7w  | 105926689 | 1368587880000 | Stickup in street
3            | 6gy8zkvuge12  | 105937451 | 1368601980000 | There are emergency care services
4            | 6gy8zm7969wr  | 105951359 | 1368632460000 | Traffic jam
5            | 6gy8zmnv8bxf  | 105936261 | 1368596760000 | Very big hole!
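To make the prefix property concrete, the following is a minimal sketch of the standard GeoHash encoding in plain Python. It is not the authors' implementation; the function names and the sample coordinates (a point in central São Paulo) are assumptions chosen for illustration only.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard GeoHash alphabet


def geohash_encode(lat, lon, precision=12):
    """Encode a latitude/longitude pair as a GeoHash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, value = 0, 0
    use_lon = True  # bits alternate between axes, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                value = (value << 1) | 1
                lon_lo = mid
            else:
                value <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                value = (value << 1) | 1
                lat_lo = mid
            else:
                value <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[value])
            bits, value = 0, 0
    return "".join(chars)


def geohash_bbox(geohash):
    """Return (lat_min, lat_max, lon_min, lon_max) of a GeoHash cell."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    use_lon = True
    for ch in geohash:
        idx = BASE32.index(ch)
        for shift in range(4, -1, -1):
            bit = (idx >> shift) & 1
            if use_lon:
                mid = (lon_lo + lon_hi) / 2
                if bit:
                    lon_lo = mid
                else:
                    lon_hi = mid
            else:
                mid = (lat_lo + lat_hi) / 2
                if bit:
                    lat_lo = mid
                else:
                    lat_hi = mid
            use_lon = not use_lon
    return lat_lo, lat_hi, lon_lo, lon_hi


# Hypothetical point in central São Paulo (illustrative coordinates).
point_hash = geohash_encode(-23.5505, -46.6333)
print(point_hash[:6], geohash_bbox(point_hash[:6]))
```

Truncating a hash to a shorter prefix yields a larger cell that still contains the point, which is why markers sharing a prefix end up next to each other when row keys are stored in sorted order.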

A web service was developed to establish the communication between the Crowd4City system and the HBase database. Thus, any other LBSN may use the service and benefit from its storage and handling resources for spatial data. For that purpose, it is only necessary to develop a driver that establishes this communication from the data layer.

The web service architecture is composed of three modules: interpreter, executor and dispatcher. The architecture and the interactions between the modules are illustrated in Figure 3. The interpreter receives and interprets service requests. The executor module processes data requests and interacts directly with the HBase database, which performs the spatial queries. The dispatcher organizes the answers and delivers them from the service in JSON format.
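The division of labor among the three modules can be pictured with the short sketch below. It is only an illustration under stated assumptions: the request format, the `prefix_search` operation name, the function names and the in-memory dictionary standing in for HBase are all hypothetical, since the paper does not specify the service's interface.

```python
import json

# Hypothetical in-memory stand-in for the HBase table (GeoHash row key -> marker).
FAKE_STORE = {
    "6gy8zksunj7w": {"marker_id": 105926689, "description": "Stickup in street"},
    "6gy8zm7969wr": {"marker_id": 105951359, "description": "Traffic jam"},
}


def interpret(raw_request):
    """Interpreter: parse and validate an incoming request (assumed to be JSON)."""
    request = json.loads(raw_request)
    if request.get("operation") != "prefix_search":
        raise ValueError("unsupported operation")
    return request


def execute(request, store=FAKE_STORE):
    """Executor: run the query against the data store."""
    prefix = request["prefix"]
    return [
        dict(row, geohash=key)
        for key, row in sorted(store.items())
        if key.startswith(prefix)
    ]


def dispatch(results):
    """Dispatcher: organize the answer and serialize it as JSON."""
    return json.dumps({"count": len(results), "markers": results})


def handle(raw_request):
    """Chain the three modules for a single request."""
    return dispatch(execute(interpret(raw_request)))


print(handle('{"operation": "prefix_search", "prefix": "6gy8zk"}'))
```

Keeping the three steps separate means that only the executor would need to change if the storage backend were swapped.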

Figure 2: The relationship between GeoHash and geographical location illustrated in a map

Figure 3: The architecture of the developed web service

The executor module uses a Java HBase API to access the database. The HBase API encapsulates the Map/Reduce paradigm in its scan() functions, which are used to search the stored data. The submodule of the executor responsible for managing spatial queries uses the GeoTools³ library to evaluate the geographical relationships involved. We needed to develop this submodule because the HBase API does not yet implement spatial functions, as PostGIS does on top of PostgreSQL. The main methods developed perform the BUFFER and CONTAINS queries supported by the Crowd4City system.

³ GeoTools Library: http://www.geotools.org
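Because row keys are kept in sorted order, a search restricted to one GeoHash prefix reduces to reading a contiguous key range. The sketch below imitates that idea in memory with a sorted list and binary search; it does not use the HBase API, and the helper names are assumptions. The row keys are the spatial indexes from Table 1.

```python
from bisect import bisect_left


def prefix_stop_key(prefix):
    """Smallest string greater than every string that starts with `prefix`.

    Assumes a non-empty ASCII prefix, which holds for GeoHash strings.
    """
    return prefix[:-1] + chr(ord(prefix[-1]) + 1)


def prefix_scan(sorted_keys, prefix):
    """Return the contiguous slice of keys that share `prefix`."""
    start = bisect_left(sorted_keys, prefix)
    stop = bisect_left(sorted_keys, prefix_stop_key(prefix))
    return sorted_keys[start:stop]


# Spatial indexes from Table 1, stored in sorted order as a key-ordered store would.
ROW_KEYS = sorted([
    "6gy8zkk52xv5",
    "6gy8zksunj7w",
    "6gy8zkvuge12",
    "6gy8zm7969wr",
    "6gy8zmnv8bxf",
])

print(prefix_scan(ROW_KEYS, "6gy8zk"))  # the first three markers of Table 1
print(prefix_scan(ROW_KEYS, "6gy8z"))   # all five markers
```

The start and stop keys play the role that start and stop rows would play in a range scan over a key-ordered table.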


Both spatial queries were implemented on the application side. This approach performs the spatial processing in memory, while HBase provides the Map/Reduce scans.

The BUFFER query calculates the smallest bounding box that circumscribes the circular search area and requests only the markers contained in that box. GeoHash provides spatially sorted data storage in HBase; therefore, the search for points within a given bounding box is equivalent to sweeping a contiguous sequence of data. The resulting spatial markers are subjected to a mathematical test that discards those not contained in the circle.

The CONTAINS query works in a similar way. The smallest bounding box that circumscribes the polygon of the "contains" spatial search is calculated and then submitted to HBase through a scan operation. The resulting spatial markers are passed to a contains() function provided by the GeoTools library, which discards every marker not contained in the polygon.

Thus, the approach proposed in this paper encompasses: the use of a NoSQL database in place of an RDBMS; the development of a data schema that follows the constraints of the selected database technology and the application domain; the development of a web service that assumes the role of a DBMS and offers data storage and retrieval services on that database; and the development of a module that provides spatial data support on top of HBase. These features compose the main contribution of this paper, which is to provide a novel storage architecture that employs Big Data technologies to scale large volumes of spatial data in LBSN systems. In the following section, we present and discuss the results of applying our proposal to the Crowd4City system.

In this section, we present a case study carried out to perform a comparative analysis between the NoSQL and SQL storage architectures.
We evaluate whether our NoSQL proposal offers better performance and whether that performance is acceptable from the LBSN users' point of view.

METHODOLOGY

A realistic LBSN usage scenario was defined to evaluate this kind of system when processing a massive volume of data generated by thousands of users living in the city of São Paulo, the most populous Brazilian city, during a specific time period. Simulated data had to be generated. We considered a scenario in which about 10% of São Paulo's population act as users who randomly create several georeferenced markers, in any category, within São Paulo's geographic boundaries, during the entire year of 2013. This percentage of the population active in the LBSN was chosen because LBSNs are currently used by a small portion of the population in large cities. An algorithm was developed to generate the simulated data as if users were acting in the Crowd4City system. The generated data volume comprises about 1.2 million registered users creating about 144 million georeferenced markers over the one-year period. These data were stored in the PostgreSQL/PostGIS and Hadoop/HBase database systems.

While loading the data into HBase, we observed that insertion was slow. This can be explained by the use of the GeoHash algorithm, which performs spatially sorted data insertion; the insertion cost may therefore be high when a large amount of data is sent for insertion at once, as in a batch process. On the other hand, this sorted storage strategy enables efficient spatial data retrieval.

While a single server ran PostgreSQL/PostGIS and stored all the generated data, we created a Hadoop/HBase cluster composed of seven nodes, with three region servers and a two-node replication schema. The server that ran the SQL database was an Intel Core i7 with 36 GB RAM and a 3 TB 7200 RPM HDD, running Windows 7. The cluster hardware setup was as follows:

(1) Two Intel Core i7 workstations, 8 GB RAM, 0.7 TB 7200 RPM HDD, running Debian Linux virtualized over Windows;
(2) One Intel Core i7 workstation, 4 GB RAM, 0.7 TB 7200 RPM HDD, running Debian Linux virtualized over Windows;
(3) One Intel Core2Duo workstation, 2 GB RAM, 160 GB 5400 RPM HDD, running Debian Linux;
(4) Three Intel Dual Core workstations, 1 GB RAM, 160 GB 5400 RPM HDD, running Debian Linux virtualized over Windows.

Hadoop/HBase automatically balanced the stored data over the cluster nodes and managed the data replication. It is important to note the heterogeneity of the cluster nodes: the cluster works well even without an optimal hardware setup.

Using the Crowd4City user interface, we assumed the role of an ordinary user and performed the two available spatial queries: BUFFER and CONTAINS. These queries were performed on both the NoSQL and the SQL databases. The BUFFER query was performed by randomly varying the centroid point within São Paulo's geographic boundaries and the radius, so that the circle area ranged approximately between 0.5 and 300 square kilometers.
Five variations of this query were defined in total. As with the BUFFER query, five variations of the CONTAINS query were defined by varying the centroid point and the polygon shape and area. The centroid points were randomly selected, and the polygons had geographical areas ranging from 0.5 to 700 square kilometers within São Paulo's geographic boundaries. The queries were run separately and at different times for each database solution, so that the response times of one were not affected by the other. Furthermore, each variation of each query was repeated 30 times, so that the analysis could be based on the average response times of each database solution used in this study.

Figure 4 shows examples of both the BUFFER (a) and CONTAINS (b) spatial queries performed in the Crowd4City user interface. Both spatial queries illustrated in Figure 4 were initiated from the graphical user interface and then forwarded to the corresponding data layer in both the previous and the new data storage approaches (see Figures 1 and 3), as described in the previous section. In the case of the SQL DB, the queries were translated into SQL and delivered to the RDBMS. For the NoSQL DB, the queries were forwarded to the web service, which translated them into Java code using the HBase API and the extensions developed for both spatial queries.
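The measurement protocol described above (query variations defined by area, each repeated 30 times and averaged) can be sketched as follows. This is a hedged illustration rather than the authors' test harness: `placeholder_query` is a hypothetical stand-in for a real BUFFER or CONTAINS request, and the conversion from circle area to radius simply applies A = πr².

```python
import math
import statistics
import time


def circle_radius_km(area_km2):
    """Radius of a circle with the given area (A = pi * r^2)."""
    return math.sqrt(area_km2 / math.pi)


def average_response_time(run_query, repetitions=30):
    """Run `run_query` repeatedly and return (mean seconds, list of samples)."""
    samples = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), samples


def placeholder_query():
    """Hypothetical stand-in for a real BUFFER or CONTAINS request."""
    return sum(range(10_000))


# Radii implied by the smallest and largest BUFFER areas reported in the text.
small_radius = circle_radius_km(0.5)    # about 0.40 km
large_radius = circle_radius_km(300.0)  # about 9.77 km

mean_seconds, samples = average_response_time(placeholder_query)
print(f"radii: {small_radius:.2f} km to {large_radius:.2f} km")
print(f"mean over {len(samples)} runs: {mean_seconds:.6f} s")
```

In other words, the BUFFER variations span search radii from roughly 400 meters to almost 10 kilometers.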

Figure 4: The Crowd4City user interface illustrating the spatial queries definition: a) BUFFER query from a map point; b) CONTAINS query from a polygon drawn by user

Frame 1 presents an example, in SQL, of the BUFFER query performed in the RDBMS solution. Frame 2 presents an example, in SQL, of the CONTAINS query, also performed in the RDBMS solution. It is important to highlight that both SQL queries depend on the PostGIS spatial extension for PostgreSQL.

SELECT title, description, timestamp, ST_AsText(geom), ...
FROM markers
WHERE ST_DWithin(
    ST_GeomFromText('POINT($LATLNG)', 4326),
    geography(geom),
    $RADIUS
)

Frame 1: Example of the BUFFER query performed by the SQL-based solution

The BUFFER query (as shown in Frame 1) is performed by the ST_DWithin() PostGIS function, which is used to calculate the buffer formed by both user-defined centroid point and radius, respectively represented by $LATLNG and $RADIUS variables on the shown SQL code. The CONTAINS query (as shown in Frame 2) is performed by the ST_Contains() PostGIS function, which is used to filter the spatial markers by a user-defined polygon, returning only what is spatially contained in such polygon. All the points that form the search polygon are encapsulated in a comma-separated string available in the $POINTS_LIST variable.

SELECT title, description, timestamp, ST_AsText(geom), ...
FROM markers
WHERE ST_Contains(
    ST_GeomFromText('POLYGON(($POINTS_LIST))', 4326),
    geom
)

Frame 2: Example of the CONTAINS query performed by the SQL-based solution
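As an illustration of how the placeholders in Frames 1 and 2 might be filled, the sketch below builds the WKT strings for a point and for a polygon from coordinate pairs. The helper names are hypothetical and the original system may assemble these strings differently. Two details are worth noting: WKT lists coordinates as "x y" (longitude then latitude for SRID 4326), and a WKT polygon ring must be closed, that is, its first point must be repeated at the end.

```python
from typing import List, Tuple

Coordinate = Tuple[float, float]  # (x, y), i.e. (longitude, latitude) for SRID 4326


def point_wkt(coord: Coordinate) -> str:
    """Build the WKT text for a single point."""
    x, y = coord
    return f"POINT({x} {y})"


def polygon_wkt(ring: List[Coordinate]) -> str:
    """Build the WKT text for a polygon, closing the ring if needed."""
    if len(ring) < 3:
        raise ValueError("a polygon needs at least three vertices")
    # WKT requires the first vertex to be repeated as the last one.
    closed = ring if ring[0] == ring[-1] else ring + [ring[0]]
    points_list = ", ".join(f"{x} {y}" for x, y in closed)
    return f"POLYGON(({points_list}))"
```

In practice, passing the resulting string to the database as a bound query parameter, rather than interpolating it into the SQL text, avoids quoting problems and SQL injection.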

RESULTS

Figure 5 shows two charts, each presenting the results of one of the queries in terms of average response time and explored geographical area.

Figure 5: Comparative charts of the response times of the BUFFER and CONTAINS spatial queries

The results presented in Figure 5 show that our new NoSQL-based approach performed substantially better than the RDBMS-based solution. For the BUFFER query, the response times of the NoSQL-based solution were about 35% of those of the SQL-based solution. This is a good result considering the limited hardware setup used in the current study. Furthermore, the geographical area of the circle formed in this type of query influenced the results of both solutions.

For the CONTAINS query, the difference between the response times of the two solutions was larger than that observed for the BUFFER query, as can also be seen in Figure 5, and our proposed approach again performed better. The response times of the NoSQL-based approach were about 8% of those of the SQL-based approach, which is another good result given our limited hardware setup. As in the first query type, the geographical area of the polygon influenced the results of both solutions, and the response times of the SQL-based solution proved even more sensitive to variation in geographical area than in the previous query. It is important to highlight the decrease in response times when the geographical area exceeds 300 km². This happened because no new spatial markers are found when the area grows beyond that point.
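The relative response times reported for the two queries (about 35% and about 8%) can be restated as speedup factors with a short calculation. The function name below is hypothetical, and the inputs are the approximate ratios quoted in the text, so the outputs are equally approximate.

```python
def speedup(time_ratio: float) -> float:
    """Speedup factor given new_time / old_time (e.g. 0.35 for 35%)."""
    if time_ratio <= 0:
        raise ValueError("time_ratio must be positive")
    return 1.0 / time_ratio


buffer_speedup = speedup(0.35)    # roughly 2.9 times faster
contains_speedup = speedup(0.08)  # 12.5 times faster

print(f"BUFFER: {buffer_speedup:.1f}x, CONTAINS: {contains_speedup:.1f}x")
```

Equivalently, these ratios correspond to response time reductions of about 65% for the BUFFER query and about 92% for the CONTAINS query.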

Focusing on the results for the CONTAINS spatial query, Figure 6 shows a chart comparing the variation of response times of each approach in relation to the variation of edge points used to form the polygons of the query.

Figure 6: Comparative chart of the response times of the CONTAINS query in relation to the number of edge points required to form the polygons

The chart in Figure 6 shows that the response time of both solutions tends to grow as the number of edge points of the polygon increases. However, the curve is much steeper for the SQL-based solution, which shows that its response time is much more sensitive to variations in the number of edge points. Considering both the geographical area and the number of points forming the polygon, the presented charts show that the proposed NoSQL-based approach tends to perform much better than the SQL-based one for the CONTAINS query.

Looking at the result data more closely, Figure 7 shows two box plots illustrating the variation of the response times of the BUFFER spatial query for both solutions. The variation of the response times is very low for PostgreSQL, remaining very close to the average. The HBase solution, on the other hand, presented a variation of about 30 seconds and some outliers.

Figure 7: Box plots illustrating the variation between response times of the BUFFER spatial query for both solutions

Figure 8: Box plots illustrating the variation between response times of the CONTAINS spatial query for both solutions
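The statistical comparison in this section relies on 95% confidence intervals that do not overlap. The sketch below shows one common way such an interval could be computed from 30 repetitions (a Student's t interval for the mean) and how the overlap check works on the intervals reported at the end of this section. The text does not state which method was actually used, so the t-based formula is an assumption, and the function names are hypothetical.

```python
import math
import statistics
from typing import List, Tuple

# Two-sided 95% Student's t critical value for 29 degrees of freedom (n = 30).
T_CRIT_95_DF29 = 2.045

Interval = Tuple[float, float]


def mean_ci_95(samples: List[float]) -> Interval:
    """95% confidence interval for the mean of exactly 30 samples."""
    if len(samples) != 30:
        raise ValueError("the hard-coded critical value assumes 30 samples")
    mean = statistics.mean(samples)
    half_width = T_CRIT_95_DF29 * statistics.stdev(samples) / math.sqrt(len(samples))
    return (mean - half_width, mean + half_width)


def intervals_overlap(a: Interval, b: Interval) -> bool:
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Confidence intervals reported in the text, in seconds.
REPORTED = {
    "BUFFER": {"PostgreSQL": (478, 759), "HBase": (89, 253)},
    "CONTAINS": {"PostgreSQL": (2090, 3561), "HBase": (87, 247)},
}

for query, ci in REPORTED.items():
    print(query, "intervals overlap:", intervals_overlap(ci["PostgreSQL"], ci["HBase"]))
```

For both queries the check reports no overlap, which is the condition the analysis uses to claim a difference at the 95% confidence level.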

Figure 8 shows two box plots illustrating the variation of the response times of the CONTAINS spatial query for both solutions. The charts show that the overall response time variation was about 25 seconds. Although the overall variation was larger for the HBase solution, its box plot contains outliers; if those outliers are removed and only the boxes are considered, the time variation for the HBase solution decreases by about 10 seconds.

From Figures 7 and 8 we conclude that the variation in the response time of each query ranges from 10 to 30 seconds. This variation does not significantly affect the charts presented in Figures 5 and 6, which were based on average response times. The confidence intervals for the BUFFER query were (in seconds): PostgreSQL [478, 759] and HBase [89, 253]. The confidence intervals for the CONTAINS query were (in seconds): PostgreSQL [2090, 3561] and HBase [87, 247]. Since the confidence intervals of PostgreSQL and HBase do not intersect for either query, we can state with 95% confidence that HBase performs better in both spatial queries.

We finish this analysis with the conclusion that, even with a simple cluster setup, the results are promising. Our solution based on Hadoop/HBase achieves better response times than the solution based on PostgreSQL/PostGIS for the two types of spatial query exploited by the Crowd4City LBSN.

CONCLUSIONS

We proposed a novel data storage approach for an LBSN. We focused on performance issues concerning the Crowd4City system, a specific LBSN. We studied the previous storage architecture of that system and concluded that its response times would not be acceptable from the user's point of view. Although we chose this particular LBSN, our goal was to provide a novel approach applicable to any LBSN system.

The proposed approach is based on NoSQL technology, which has shown promise for the storage and management of huge volumes of data. We developed a web service that assumes the role of a DBMS and offers spatial data storage and retrieval services based on Hadoop and HBase. This web service was connected to the Crowd4City system, replacing its previous data repository based on PostgreSQL/PostGIS.

We performed a comparative analysis of the novel and the previous approaches concerning the performance of the Crowd4City system. The analysis showed that the novel approach performs much better than the previous one in terms of response time, even on a limited cluster hardware setup. However, for response times to be competitive in a user-sensitive LBSN environment, the achieved response times still need to be improved. One direction for future work is to carry out a study on a better cluster setup, to verify whether the proposed solution can deliver acceptable response times to end users.

Further future work concerns the challenge of spatiotemporal data management and its overall complexity. We intend to improve our solution by incorporating other types of spatial operation, such as set, metric, and topological operations, as well as temporal operations. We would also like to perform experiments applying the Map/Reduce approach to the implementation of both spatial queries on HBase in order to reduce response time.