Performance Comparisons of Spatial Data Processing Techniques for a Large Scale Mobile Phone Dataset

Apichon Witayangkurn
Teerayut Horanont
Ryosuke Shibasaki
Department of Civil Engineering The University of Tokyo Komaba, Tokyo 153-8505, JAPAN
Institute of Industrial Science The University of Tokyo Komaba, Tokyo 153-8505, JAPAN
Center for Spatial Information Science The University of Tokyo Kashiwa-shi, Chiba 277-8568, JAPAN
[email protected]
[email protected]
[email protected]
ABSTRACT
Mobile technology, especially the mobile phone, is extremely popular nowadays. The growing number of mobile users and the availability of GPS-embedded mobile phones generate large volumes of GPS trajectories that can be used in research areas such as human mobility and transportation planning. However, handling such a large-scale dataset is a significant challenge, particularly in the spatial analysis domain. In this paper, we explore suitable ways to extract the geo-location of GPS coordinates that achieve large-scale support, fast processing, and easy scalability in both storage and computation speed. Geo-locations are cities, zones, or any points of interest. Our dataset consists of the GPS trajectories of 1.5 million individual mobile phone users in Japan accumulated over one year, approximately 9.2 billion records in total. We therefore conducted performance comparisons of various methods for processing spatial data, particularly for a huge dataset. First, we processed the data on PostgreSQL with PostGIS, the traditional approach to spatial data processing. Second, we used a Java application with a spatial library called the Java Topology Suite (JTS). Third, we evaluated the Hadoop cloud computing platform, focusing on Hive on top of Hadoop for its SQL-like query support. However, Hadoop/Hive does not currently support spatial queries; hence, we propose a solution to enable spatial support on Hive. The results show that Hadoop/Hive with spatial support performed best among the evaluated methods for large-scale processing. In addition, we recommend techniques in Hadoop/Hive for processing different types of spatial data.
Categories and Subject Descriptors H.2.8 [Database Management]: Database Applications – Spatial databases and GIS. H.3.4 [Information Storage and Retrieval]: Systems and Software – Distributed systems, Performance evaluation (efficiency and effectiveness).
General Terms Performance, Design, Experimentation
Keywords GPS, Mobile Phone, Spatial Query, Cloud Computing, Hadoop
1. INTRODUCTION
With the advancement of mobile phone technology, most mobile devices are now embedded with GPS and Wi-Fi modules, the so-called smartphones. GPS-enabled devices allow users to acquire their own location and report it to servers. By accumulating a large amount of mobile phone data, we can mine and discover much useful information. In recent years, a number of researchers have worked on the GPS trajectory data of mobile users in various fields, such as studies of human behavior, people mobility, and transportation planning. Abnormal event detection is one of our main targets for analyzing mobile trajectory data. However, to detect events, moving patterns and density patterns have to be derived, which requires the ability to process a very large-scale spatial dataset. To achieve this target, several issues need to be addressed at an early stage. The first is large-scale data processing, because the current dataset is about 600 GB in size and grows day by day. The second is spatial data processing support, since our data are GPS data and a geo-location extraction method must be applied to the dataset. The third is scalability in terms of both data storage and processing time. Therefore, in this paper, we focus on finding suitable techniques for processing such a large-scale spatial dataset by comparing the performance of three different techniques on the mobile dataset: PostgreSQL with PostGIS, a spatial library-based application, and a cloud computing platform. Regarding the cloud platform, we evaluated the Hadoop cloud computing platform, particularly Hive, since Hadoop is well known for large-scale processing and there are various research achievements on spatial processing with Hadoop. However, at the Hadoop level, users are mostly required to deal with MapReduce programming, which could limit the growth of the user community.
Therefore, we concentrated on Hive, a data warehouse service built on top of Hadoop, which provides users with SQL-like queries that are more familiar than MapReduce. Nevertheless, Hive does not currently support spatial queries; hence, we introduce several techniques to enable spatial query support in Hive. Using the Java Topology Suite (JTS), a spatial function library, together with User-Defined Functions (UDFs) in Hive, Hadoop/Hive can support spatial functions and process spatial data, for example extracting the city a point belongs to using city polygon data.
The rest of this paper is organized as follows. Section 2 describes related work. Section 3 explains the evaluated spatial data processing techniques. Section 4 presents the experimental results. Finally, Section 5 draws our conclusions and outlines future work.
2. RELATED WORK
GPS trajectories represent the location history of users, and over the past years they have been widely used in different fields of study, such as understanding people's movement patterns [1], predicting the movement of people [2], detecting transportation modes [3], and mining locations and travel sequences for travel recommendations [4]. The target of our research is to detect abnormal events from the GPS trajectories of a large mobile phone dataset. Extracting geo-locations from GPS points is an essential step toward understanding where people are and where they belong. For example, we want to know the number of people (represented by trajectories) in each city, which city they go to, and during which period of time. Such questions can be answered with spatial function processing. A database with spatial support is a well-known system for spatial data processing; PostgreSQL with PostGIS is widely used since it is open-source software with full spatial support [5]. The Java Topology Suite (JTS) is a spatial function library developed in native Java [6]; applications enhanced with this library are able to process spatial functions. For large-scale data processing, a cloud computing platform is an option for handling data in the range of terabytes to petabytes. Cloud technology provides computation, software, data access, and storage services without requiring knowledge of the physical location of the resources. It involves dynamically scalable and virtualized resources and is normally characterized by the following properties: cost, location independence, reliability, scalability, performance, and security [7][8]. In this research, we focus on one cloud computing technology called Hadoop. Hadoop is an open-source, large-scale distributed data processing framework mainly designed to work on commodity hardware [9], meaning it does not require high-performance server-class hardware.
Hadoop is used at Facebook to support an ever-increasing amount of data while remaining flexible enough to scale up in a cost-effective manner [10]. Increasing system performance and storage is done simply by adding new nodes without code modification, which is why Google, Yahoo, and Facebook use it as a backend system [11]. Zhang et al. [12][13] described how spatial queries can be expressed with Hadoop and MapReduce, covering not only spatial query evaluation but also spatial joins with MapReduce. Hive is a data warehouse running on top of Hadoop that serves data analysis and data querying through an SQL-like language called HiveQL [14][15]. Hive allows users familiar with SQL to easily understand and query the data, but it does not natively support spatial queries. Although there is an attempt to process spatial queries using MapReduce and Hive [16], it relies on proprietary software and still requires MapReduce for the spatial processing, which is a limitation for other users and researchers. We therefore propose methods to enable spatial support in Hive using an open-source spatial library, without MapReduce programming.
3. SPATIAL DATA PROCESSING
The processing requirement is based on a mobile phone dataset collected from about 1.5 million mobile users in Japan over a one-year period. The total number of records is 9.2 billion, about 600 GB in size. The data are kept in CSV format, one file per day. Initially, we want to extract the geo-location of each GPS point. Our locations of interest cover three different levels of spatial coverage: prefectures, cities, and unified 500 by 500 meter grids. The coverage files used in this process are geometry data of Japan in Shapefile format. To process the data, four important issues need to be addressed: large-scale support, fast processing, spatial support, and scalability in terms of data size and processing speed. Therefore, we performed a performance evaluation and comparison of several processing techniques: a spatial-enabled database system, a spatial library-based application, and a cloud computing platform.
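Since the exact definition of the unified 500 m grid is outside the scope of this paper, the following minimal Java sketch shows one way a GPS point could be snapped to a 500 by 500 meter cell id, assuming a simple equirectangular approximation; the constants and the "row_col" id format are illustrative assumptions, not the scheme used in our pipeline.

```java
// Illustrative sketch only: snaps a GPS coordinate to a 500 m x 500 m grid
// cell id. Assumes a simple equirectangular approximation; a production
// grid would follow a fixed national mesh definition instead.
class GridCell {
    static final double CELL_M = 500.0;
    static final double M_PER_DEG_LAT = 111_320.0; // approx. metres per degree of latitude

    // Returns a cell id of the form "row_col" for the given coordinate.
    static String cellId(double lat, double lon) {
        // Metres per degree of longitude shrink with latitude.
        double mPerDegLon = M_PER_DEG_LAT * Math.cos(Math.toRadians(lat));
        long row = (long) Math.floor(lat * M_PER_DEG_LAT / CELL_M);
        long col = (long) Math.floor(lon * mPerDegLon / CELL_M);
        return row + "_" + col;
    }

    public static void main(String[] args) {
        // Two nearby points in Tokyo map to grid cells.
        System.out.println(cellId(35.6895, 139.6917));
        System.out.println(cellId(35.7000, 139.6917));
    }
}
```

Attaching such a cell id to every point reduces the later aggregation steps to a simple GROUP BY on the id column.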
3.1 Spatial-Enabled Database
PostgreSQL was chosen for our evaluation because it is an open-source database system that fully supports spatial functions through the PostGIS module. Since our dataset was in CSV format, several steps were needed before we could run spatial queries, as shown in figure 1. For importing data, the COPY command was used for bulk import without constraint checking; in addition, table partitioning by date was applied, along with configuration tuning, to increase performance. Disk usage was approximately 1.2 TB, twice the original data size.
Figure 1. Pre-processing step for spatial query in database.
In the next step, we created a geometry column and a spatial index on each daily table and, unexpectedly, this consumed a great deal of time on our 8-core Xeon server: creating the geometry and spatial index for one day of data took approximately 20 hours. By extrapolation, processing all the data would take about half a year, and the original 600 GB of data would grow to 3 TB. For geo-location extraction, we used spatial functions in PostGIS, namely ST_Within, which tests whether one geometry is fully within another; in our case, it attaches a geo-location id to each individual GPS point of the mobile dataset.
3.2 Spatial Library-Based Application
In this method, we developed an application in Java and used a spatial library to process geometry data. The Java Topology Suite (JTS) is an open-source Java library that provides an API for spatial predicates and functions. JTS implements geometries and functions based on the standard specification defined by the Open Geospatial Consortium (OGC), which is widely used in geospatial applications. It also supports spatial indexes, which speed up geometry lookups. Figure 2 shows the flow diagram of the application. First, the Shape data of prefectures, cities, and 500 m grids were exported to Well-Known Text (WKT) format for easy loading into the application's lookup table. Instead of a hash table, an SR-tree spatial index [17] was used to accelerate the search over large polygon sets, such as the 500 meter grid, which contains about 1.5 million grid polygons. Special classes such as PreparedGeometryFactory and IndexedPointInAreaLocator were employed instead of standard geometry objects, resulting in significantly faster polygon lookup.
Figure 2. Flow diagram of an application.
The WKT versions of the Shape files were used as input for building the SR-tree index. Input files containing GPS points with latitude and longitude were loaded, and a geometry was created for each individual point. In the geometry-finding step, each point was looked up in the SR-tree to find the nearest polygons; since the SR-tree search is sphere/rectangle based, multiple candidates can be returned when the inputs are multi-polygons or irregularly shaped. To resolve this, the contains function was applied to the candidate polygons to find the exact polygon containing each point. The final output, the id and name of the polygon each point fell inside, was saved to CSV files.
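The two-phase lookup described above, a coarse index search followed by an exact containment test, can be sketched without the JTS dependency as follows. In this illustrative sketch a simple bounding-box filter stands in for the SR-tree and a ray-casting test stands in for JTS's contains; the actual application uses the JTS classes named earlier.

```java
import java.util.*;

// Sketch of the two-phase polygon lookup: phase 1 is a cheap bounding-box
// filter (standing in for the spatial index), phase 2 is an exact
// point-in-polygon test by ray casting (standing in for JTS contains()).
class PolygonLookup {
    static class Polygon {
        final String id;
        final double[] xs, ys;           // ring vertices: (xs[i], ys[i])
        final double minX, maxX, minY, maxY;
        Polygon(String id, double[] xs, double[] ys) {
            this.id = id; this.xs = xs; this.ys = ys;
            double a = Double.MAX_VALUE, b = -Double.MAX_VALUE;
            double c = Double.MAX_VALUE, d = -Double.MAX_VALUE;
            for (double x : xs) { a = Math.min(a, x); b = Math.max(b, x); }
            for (double y : ys) { c = Math.min(c, y); d = Math.max(d, y); }
            minX = a; maxX = b; minY = c; maxY = d;
        }
        boolean bboxContains(double x, double y) {
            return x >= minX && x <= maxX && y >= minY && y <= maxY;
        }
        boolean contains(double x, double y) {   // ray casting
            boolean in = false;
            for (int i = 0, j = xs.length - 1; i < xs.length; j = i++) {
                if ((ys[i] > y) != (ys[j] > y)
                        && x < (xs[j] - xs[i]) * (y - ys[i]) / (ys[j] - ys[i]) + xs[i]) {
                    in = !in;
                }
            }
            return in;
        }
    }

    // Returns the id of the first polygon containing the point, or null.
    static String locate(List<Polygon> polys, double x, double y) {
        for (Polygon p : polys) {
            if (p.bboxContains(x, y) && p.contains(x, y)) return p.id;
        }
        return null;
    }
}
```

The bounding-box pass rejects most polygons cheaply, so the expensive exact test runs only on the few candidates, which is exactly why the indexed JTS classes outperform a naive scan.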
3.3 Cloud Computing Platform
Hadoop is an open-source cloud computing software framework for data-intensive, distributed applications. There are many services and frameworks under the Hadoop umbrella; in this research, we focused on the Hadoop Distributed File System (HDFS) and Hive. To set up Hadoop in fully distributed mode, five components must run: NameNode, DataNodes, Secondary NameNode (SNN), JobTracker, and TaskTrackers. The NameNode is the bookkeeper of HDFS; it keeps track of how files are broken into blocks, which nodes store those blocks, and the overall health of the distributed file system. DataNodes are the workhorses of the file system: they store and retrieve blocks when told to (by clients or the NameNode) and periodically report the lists of blocks they store back to the NameNode. The Secondary NameNode (SNN) is an assistant daemon that monitors the state of the cluster HDFS and snapshots the NameNode to help minimize downtime and data loss. The JobTracker is the liaison between an application and Hadoop: once code is submitted to the cluster, the JobTracker determines the execution plan, decides which files to process, assigns nodes to different tasks, and monitors all running tasks. TaskTrackers execute the individual tasks that the JobTracker assigns and manage their execution on each slave node. Figure 3 shows our Hadoop cluster. It consists of five computers with the same specification: Xeon 2.6 GHz, 8 GB memory, and 2x2 TB disks. One computer runs as the NameNode and the others as DataNodes and TaskTrackers. In total, the cluster has 32 cores, 32 GB of memory, and 16 TB of storage, and can run up to 28 tasks at the same time.
Figure 3. Testing systems for Hadoop Cluster.
Hive is a data warehousing package built on top of Hadoop. It targets users who are familiar and comfortable with SQL for ad-hoc querying, summarization, and data analysis, and it provides a Web GUI and JDBC for issuing queries in an SQL-like language called HiveQL [9]. Hadoop/Hive does not support spatial queries; however, Hive allows developers to create User-Defined Functions (UDFs) implementing any function a user requires, and since Hive itself is developed in Java, UDFs are also developed in Java. We employed the spatial library (JTS) used in section 3.2 to build spatial functions as UDFs. We also tested several query methods, including "JOIN" and "LATERAL VIEW", to increase query performance. With these components combined, Hive can perform spatial processing while fully utilizing the cloud computing platform.
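As a rough illustration of the UDF approach, the following Java sketch mirrors the core of an sp_within-style function that returns its matches as a JSON string, which Hive's json_tuple UDTF can later split into columns (see section 3.3.2). The lookup functions and field names here are hypothetical placeholders; the real UDF would extend Hive's UDF class and resolve the ids with JTS geometries.

```java
import java.util.function.BiFunction;

// Sketch of the core logic behind an "sp_within"-style Hive UDF. The Hive
// wrapper class and the JTS-based polygon lookup are omitted; hypothetical
// coordinate-to-id resolvers stand in for them. The result is a JSON string
// so that json_tuple can project the values as separate columns.
class SpWithinSketch {
    final BiFunction<Double, Double, String> prefLookup; // (lat, lon) -> prefecture id
    final BiFunction<Double, Double, String> cityLookup; // (lat, lon) -> city id

    SpWithinSketch(BiFunction<Double, Double, String> pref,
                   BiFunction<Double, Double, String> city) {
        this.prefLookup = pref;
        this.cityLookup = city;
    }

    // Plays the role of the UDF's evaluate(): one JSON string per input point.
    String evaluate(double lat, double lon) {
        String pref = prefLookup.apply(lat, lon);
        String city = cityLookup.apply(lat, lon);
        return String.format("{\"pref_id\":\"%s\",\"city_id\":\"%s\"}", pref, city);
    }
}
```

Returning one JSON string per point keeps the UDF a scalar function, which is what allows the query to avoid a join entirely.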
3.3.1 Join Methods
Hive supports several types of join, such as the basic join, common join, and map join. We evaluated each of them for its ability to express spatial queries as well as for its performance. The test query joins two tables, one containing the input data and the other the prefectures. The "sp_within" function is our spatial user-defined function for finding a point in a geometry; it returns true if a matching polygon is found.
3.3.1.1 Basic Join
This is the standard join method, which places the join condition in the WHERE clause as shown in figure 4. With this method there is no join optimization: the two tables are first combined in a Cartesian join and all joined rows are sent to the filter conditions in the reduce tasks; hence, a great deal of the workload falls on the reduce tasks.
Figure 4. Basic Join.
3.3.1.2 Common Join For common join, join clause is put in join section and optimization process will be done before passing intermediate data to reduce tasks; however, at the moment, it supports only
equal join and does not support Theta-join (unequal-join) as shown an example query in figure 5. Therefore, this method could not be used for our purpose.
Figure 5. Common Join.

3.3.1.3 Map Join
In contrast to the common join, in which both tables are read, joined, and written to an intermediate file in the mapper stage, a map join loads one of the join tables, small enough to fit into memory, into every mapper, so the join is completed entirely in the mapper stage. Hadoop's Distributed Cache is used to make the small table accessible to all mappers. A map join therefore consists of mapper tasks only, with no reduce tasks, which greatly improves performance because there are normally far fewer reduce tasks than mapper tasks; some queries, such as those containing "order by," result in just one reduce task, which would severely degrade the performance of the overall process. An example of a map join query is shown in figure 6; to enable map joins, the "hive.auto.convert.join" parameter in Hive must be set to true.
Figure 6. Map Join.

3.3.2 Lateral View
The previously described methods use a JOIN to process data, and with a large dataset the number of intermediate records grows quickly because of the cross join. For example, if the input data have 100,000 records and the city data have 1,500 records, 100,000 * 1,500 = 150,000,000 intermediate records must be produced, and every one of them has to be examined with the spatial function. The Lateral View method uses a different concept: instead of joining two tables, it performs a lookup. Only the input table is processed with the spatial function, which we customized to read the same WKT version of the Shape files used in section 3.2 from the Hadoop distributed cache. Lateral View is used in conjunction with a user-defined table-generating function (UDTF) to generate a virtual table with supplied row and column aliases, something a standard database cannot do. In the example shown in figure 7, the left-side table has 2 records, and the Lateral View output becomes 5 records by extracting the elements of an array column.
Figure 7. Lateral View.
For our case, we used Lateral View together with the "json_tuple" UDTF introduced in Hive 0.7, which extracts a tuple of values from a JSON string and exposes them as additional columns. We developed an "sp_within" UDF that returns multiple values in JSON format; with "json_tuple", those values are projected as new columns, as shown in figure 8. Figure 9 shows an example query using this method.
Figure 8. An example of Lateral View with json_tuple.
Figure 9. A query example of Lateral View.

4. EXPERIMENTS
In this section, we first present the experimental setup. Second, we describe the evaluation approaches. Third, the experimental results are presented with some discussion.
4.1 Experimental Setup
As mentioned in the previous section, we evaluated three approaches: a database system, a spatial library-based application, and a cloud computing platform. The system used for the database and the library-based application was an 8-core Xeon 2.66 GHz machine with 8 GB RAM and a 2 TB disk running CentOS 6.0 64-bit; PostgreSQL 9.0.6 with PostGIS 1.5.3 was installed on it. For Hadoop, we used a cluster of five nodes with the same specification, except that each node had two 2 TB disks to increase I/O performance, and a Gigabit switch connected the cluster nodes. One computer served as the master node and the other four handled task processing. In total there were 32 processing cores; however, we set the number of concurrent tasks to 7 per node because one core was reserved for other processes; hence, up to 28 concurrent tasks (4 nodes * 7 cores) could run. The versions used were Hadoop 0.20.2, Hive 0.8.0, and JTS 1.12. We used a GPS dataset collected from the mobile phones of about 1.5 million users across Japan over a one-year period: 9.2 billion records in total, 600 GB in size. Each day contained about 22 million records, 1.5 GB on average. Figure 10 depicts the distribution of the GPS data; the density of the point cloud directly reflects the size of each city. Considering privacy issues, we used these datasets in anonymized form.
Table 2. Processing time on spatial query for one day data

Method                                    | No. of processed records per second | Total Time (Mins)
Database (PostgreSQL)                     |                                 227 | 1,269
Spatial library-based application         |                             171,490 | 1.7
Hive using Map Join (6 tasks, Normal)     |                               8,148 | 35
Hive using Map Join (22 tasks)            |                              21,361 | 13.5
Hive using Lateral View (6 tasks, Normal) |                             173,205 | 1.6
Hive using Lateral View (22 tasks)        |                             288,675 | 1
Figure 10. Data distribution in Japan
4.2 Evaluation Approaches
We separated the evaluation into two parts, preparation time and spatial computing time, since some approaches required a long time for importing and loading data before they were ready for spatial queries. Spatial computing time is the time used to execute one spatial query; for this experiment we used the "Within" function, which finds the point geometries lying inside a polygon. For the dataset, we first started with one day of data stored in a single CSV file, one point per line. We then increased the number of files to measure the performance and the effect of larger data sizes.
4.3 Results
Table 2 shows the processing time of a spatial query on one day of data. The database was not comparable with the other two methods. The spatial library-based application achieved a processing speed of 171,490 records per second, slightly less than Hive with Lateral View. When the Hive configuration was adjusted to run 22 tasks rather than the normal setting (6 tasks), the processing speed increased to 288,675 records per second (about a 66% improvement). Map Join reached only about 4% of the Lateral View processing speed. Nevertheless, Map Join is a Cartesian join operation, meaning the total number of records it has to process is far larger than with the Lateral View method, which processes just one table.
Table 1 shows the preparation time of each method for one day of data.

Table 1. Preparation time of all methods for one day data

Method                            | Task                                               | Time (Sec)
Database (PostgreSQL)             | Import data, create geometry, create spatial index | 12,073 (3.3 hrs.)
Spatial library-based application | -                                                  | 0
Cloud platform (Hadoop/Hive)      | Import data, convert to binary file                | 78
In the case of the database, we turned off "auto vacuum" to increase overall performance; otherwise, vacuuming runs alongside the import process and results in very poor performance. Importing one day of data took 3.3 hours, most of which (3.2 hours) was spent creating the spatial index. By extrapolation, it would take about 50 days to process the whole dataset. Hadoop took only 78 seconds to load one day of data and convert it to sequence file format (binary), and about 8 hours for the whole dataset. Hive can process data both as CSV text files and as binary files; however, in our tests, sequence files were processed much faster than text files.
Figure 10. Processing time at different numbers of records.
In addition to the previous tests, we increased the amount of data by processing one day (22 million records), two days (44 million), and five days (100 million). The results are shown in figure 10. The Hive method beat the other methods, and notably, Hive processed one day and two days of data in the same amount of time because it still had spare task capacity available for further processing. After evaluating all of the methods mentioned above, we decided to process the whole dataset with Hadoop/Hive. The task was to find the prefecture and city in which each GPS point was located. As shown in figure 11, processing all the data with Hadoop/Hive used 2,275 mapper tasks and 618 reduce tasks; the total processing time was about 17 hours, with 9,201 million records processed. The result was stored in a new table in Hive.
Figure 11. Processing Result from Hive Interface.

5. CONCLUSION AND FUTURE WORK
We conducted a performance evaluation of three different techniques for large-scale spatial data processing: a database system, a spatial library-based application, and a cloud computing platform. Various additional methods were applied to increase the performance of each technique, such as database tuning, spatial indexing, parameter tuning in Hadoop/Hive, and multiple types of join methods in Hive. We also introduced new techniques that allow spatial data processing on Hive by combining the Java Topology Suite (JTS), User-Defined Functions (UDFs), Join, and Lateral View, and we evaluated the performance of several types of Join and of Lateral View. The results show that the cloud computing technique, Hadoop/Hive, outperforms the other evaluated techniques in terms of both data scalability and processing time. Hadoop/Hive offers several advantages, including large-scale data support, operation on commodity hardware, fault tolerance, fast processing, and scalability: the overall performance of the system, in both data storage and processing speed, can be increased simply by adding new nodes to the cluster. For spatial processing on Hive, the Lateral View method obtained the best performance of all methods, and Map Join performed best among the join methods. Regarding the choice between these two, if the examined polygons do not overlap, the Lateral View method is recommended; if some of them overlap, Map Join is more suitable. For example, when the compared data are point geometries and the examined polygons are city boundaries, there is no overlap among city areas, so the Lateral View method is preferable. In addition, the Hadoop/Hive spatial support proposed in this work makes it possible to conduct spatial research on a whole spatial dataset, rather than on a sample or a part of it due to performance limitations.
Given the promising results so far, in the future we want to focus further on the cloud computing platform, especially Hadoop and Hive, to make it fully support spatial queries so that advanced analysis on large-scale datasets can be performed smoothly. Using hardware acceleration such as GPGPU to boost the performance of Hadoop is also one of our future targets. In the meantime, the processed output of the whole dataset from this experiment will serve other analyses, such as people mobility, area-based population density, and abnormal event detection from the mobile dataset. We will also explore how to deliver online services for mobile data processing that can serve both research and commercial uses.

6. ACKNOWLEDGMENTS
The work described in this paper was conducted at the Shibasaki Laboratory with an agreement from Zenrin Data Com to use the mobile phone dataset of its personal navigation service users for this research. This work was supported by the GRENE (Environmental Information) project of MEXT (the Ministry of Education, Culture, Sports, Science and Technology).

7. REFERENCES
[1] Liao, L., et al. 2005. Building Personal Maps from GPS Data. In Proceedings of IJCAI MOO 2005, Springer Press (2005): 249-265.
[2] Ashbrook, D., and Starner, T. 2003. Using GPS to learn significant locations and predict movement across multiple users. Personal and Ubiquitous Computing 7(5), 275-286.
[3] Zheng, Y., et al. 2008. Learning transportation mode from raw GPS data for geographic applications on the Web. In Proceedings of WWW 2008, (Beijing, China, April 2008), ACM Press: 247-256.
[4] Zheng, Y., et al. 2009. Mining interesting locations and travel sequences from GPS trajectories. In Proceedings of WWW 2009, (Madrid, Spain, April 2009), ACM Press: 791-800.
[5] PostGIS: http://postgis.refractions.net/
[6] Java Topology Suite: http://tsusiatsoftware.net/jts/main.html
[7] Yang, J. and Wu, S. 2010. Studies on Application of Cloud Computing Techniques in GIS. In Proceedings of IGASS 2010, (China, 2010), pp. 492-495.
[8] Buyya, R., et al. 2008. Cloud Computing and Emerging IT Platforms: Vision, Hype, and Reality for Delivering Computing as the 5th Utility. Future Generation Computer Systems 25(6), pp. 599-616.
[9] Hadoop Project: http://hadoop.apache.org/
[10] Thusoo, A., et al. 2010. Data warehousing and analytics infrastructure at Facebook. In Proceedings of ACM SIGMOD 2010, pp. 1013-1020.
[11] Lam, C. 2011. Hadoop in Action. Manning Publications, pp. 17-19.
[12] Zhang, S., et al. 2009. Spatial queries evaluation with MapReduce. In Proceedings of the International Conference on Grid and Cooperative Computing 2009, IEEE Computer Society (2009), pp. 287-292.
[13] Zhang, S., et al. 2009. SJMR: Parallelizing spatial join with MapReduce on clusters. CLUSTER (2009), pp. 1-8.
[14] Hive Project: http://hive.apache.org/
[15] Thusoo, A., et al. 2010. Hive - a petabyte scale data warehouse using Hadoop. In Proceedings of ICDE 2010, pp. 996-1005.
[16] Wang, F., et al. 2011. Hadoop-GIS: A High Performance Query System for Analytical Medical Imaging with MapReduce. Technical Report, Center for Comprehensive Informatics, Emory University.
[17] Katayama, N. and Satoh, S. 1997. The SR-tree: An Index Structure for High-Dimensional Nearest Neighbor Queries. In Proceedings of ACM SIGMOD 1997, (Arizona, USA, 1997), pp. 369-380.