Mining Location Information from Users' Spatio-temporal Data

Sage Jenson∗, Majerle Reeves†, Marcello Tomasini‡ and Ronaldo Menezes‡
∗ Department of Computer Science & Department of Mathematics, Oberlin College, [email protected]
† Department of Mechanical Engineering & Department of Mathematics, California State University-Fresno, [email protected]
‡ BioComplex Laboratory, School of Computing, Florida Institute of Technology, [email protected], [email protected]

Abstract—The availability of large localized datasets from popular social-networking sites such as Foursquare and Twitter has enabled the expansion of the field of human mobility analysis. Extracting location information from user trajectories helps to increase the understanding of human mobility and is important for improving public health, city planning, and crime prevention. In this work, we introduce a framework to automatically discover and classify locations based solely on geo-temporal data, without relying on contextual data such as explicit user check-ins. We collected a dataset of 3 million geo-tagged tweets from the New York City area to test our framework. By using a combination of clustering algorithms to accurately pinpoint location positions (from a cloud of points representing tweets), and popular time/day histograms as features to train the classifier, we were able to reach up to 53% average accuracy for location type classification.

Index Terms—Human Mobility, LBSNs, Data Mining, Location Classification, Crowd Sensing

I. INTRODUCTION

The understanding of human mobility patterns is integral to the advancement of many fields including, but not limited to, public health [1], city planning [2], and economic forecasting [3]. The study of human mobility has experienced massive growth due to the recent availability of cell phone Call Detail Records (CDRs) and large localized datasets from social-networking sites. These datasets provide large amounts of mobility data, thus allowing the study of many aspects of human dynamics [4]. Yet, when looking at geo-tagged datasets, we are faced with several challenges. CDR data is usually limited to cellphone tower identifiers and timestamps, and therefore lacks any contextual information, while data from Location Based Social Networks (LBSNs) is sparse and often not publicly available. Furthermore, GPS traces present an unexpected problem arising from the high resolution of the data. The GPS information collected from several users creates a cloud of points representing the movements between specific locations (e.g., a bar, a coffee shop); however, many different GPS coordinates exist within what we would call a single location. Hence, we

have to provide a mechanism to define and exactly pinpoint the locations from this high-resolution cloud of GPS points. Finally, if we do not have any contextual information, such as the location type, we need to extract such information from the GPS points. In this work, we introduce a framework to augment mobility-related data by relying solely on geographic coordinates, not necessarily originating from check-ins, and associated timestamps. We seek first to identify the positions of the locations using a clustering algorithm, and second to classify the locations based on their category (e.g., restaurant, office, night club). In order to classify the locations, we extract features based on hourly and daily visits to each location. Because the data we collected is unlabeled, we use Foursquare check-in data, the Google Places API, and the content/text of the tweets to label the training data for the classifier. Additionally, we use a small subset of the tweets containing data generated by Foursquare and a semi-supervised learning approach that spreads the labels to points for which we do not have labels [5] to validate a Support Vector Machine (SVM) classifier, resulting in an average classification accuracy of 53% over 7 location categories. Finally, we use spectral clustering [6] to analyze the feature space. We then perform a text analysis of the content of the tweets within these clusters to reveal common themes or events associated with the locations, such as museums, 9/11, the People's Climate March, and Derek Jeter's last game with the Yankees. These themes range from specific events to more general activities, depending on the size of the cluster.
The remainder of the paper is organized as follows: Section II provides a short review of human mobility, location-based social networks, and location mining; Section III presents the methods for data collection and processing; Section IV describes the findings of this work; last, Section V emphasizes the significance and impact of the proposed solution.

II. BACKGROUND AND MOTIVATION

The study of human mobility has witnessed dramatic growth since the seminal work of Brockmann et al. [7] in 2006;

the authors revolutionized the modeling of human mobility by attempting to study individual mobility instead of population patterns. Their approach was founded on the analysis of the movement of bank notes as a proxy for human mobility. When large phone Call Detail Records (CDRs) became available, researchers like Gonzales et al. [8] and Song et al. [9] were able to identify and model patterns of human mobility, such as a regular schedule at very few locations and the fact that movement is usually contained to a small specific area. More recent models further extend this research by studying the frequency with which people visit locations [10], how recently people have visited locations [11], motifs in human movements [12], and the connection between mobility and individuals' social networks [13], [14]. Lately, human mobility research has started to focus on classifying people into groups based on the intricacies of their movement. Pappalardo et al. [15] classified people based on differences in their radius of gyration, while Xiao et al. [16] used the history of places visited by a user to group users based on a similarity metric. In [17], the authors used information related to the activity performed in each place as the purpose for the movement, to augment the ability to detect urban mobility patterns and anomalies. Finally, Wu et al. [18] worked on modeling the intra-urban mobility of people by using an activity-based framework. At the same time, the increasing use of smart phones and social media has created a large pool of data that scientists can study. Data from Location Based Social Networks (LBSNs) has the advantage, when compared to CDRs, of having contextual information associated with geographical position, and therefore allows the extraction of more information related to the users.
Scientists have been able to identify user information such as locations relevant to individual users [19], and to give recommendations on future locations to visit based on location history [20], [21]. LBSN and CDR data can also reveal information that is not related solely to mobility. Lian et al. [22] extracted location names from the current location, time, and check-in histories of the users. Silva et al. [23] used Twitter data linked to Foursquare to find connections between different types of locations based on user movement. Liu et al. [24] used tweet content and user history information to infer location type. Ye et al. [25] used a combination of explicit patterns of individual places, such as the number of visitors and the distribution of check-ins over time, and implicit relationships among similar places to annotate places (add tags). Wakamiya et al. [26] were able to classify urban characteristics using geo-tagged tweets and their timestamps. Yuan et al. [27] used human mobility and points of interest to classify different sections of Beijing. Shaw et al. [28] used millions of Foursquare check-ins and associated metadata to develop a search algorithm to associate a noisy GPS user location with a point of interest. Check-in data on LBSNs have also allowed scientists to study the connections between social aspects and human mobility. For example, people who are popular on social media have a larger radius of gyration [29], and people who are more socially connected are more likely to have friends that are farther away [30].

CDR and LBSN data complement each other. On one hand, CDR data lacks useful contextual information; on the other hand, LBSN data tends to suffer from sparsity and is often not publicly available. More importantly, it is quite rare to have both types of data available at the same time (for the same location). Therefore, there is a need to augment the data before analysis, either by extracting contextual information or by inferring missing data. Our work aims to introduce a framework to augment mobility-related data by relying solely on geographic coordinates of social media posts (not solely explicit check-ins) and associated timestamps to extract useful information regarding the locations visited by the users. Our approach is quite general and is not restricted to any specific source of data. The framework we propose is a useful tool to provide contextual information for datasets that are either not rich in information or relatively sparse/incomplete with respect to individual trajectories.

III. METHODS AND TECHNICAL SOLUTIONS

The dataset we use in this work contains one month of geo-tagged tweets (from August 28th to September 29th, 2014) that were collected using the Twitter stream API. There were initially 3,372,000 tweets collected from 186,964 users in the New York City area. However, only tweets from inside the Manhattan area were considered because of the high amount of activity in that area. This step removed 63.9% of the tweets and 41.5% of the users, resulting in 1,217,229 tweets from 109,305 users. We then applied a filtering step to remove non-human Twitter users. First, users that moved faster than 80 m/s were removed; 80 m/s has been considered a relatively safe threshold for users' movement speed. Second, we removed users with suspiciously high activity: all users with more than 500 tweets in a month were removed.
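The two user-level filters described above can be sketched as follows. This is only a sketch: the function names and the assumption that each user's tweets arrive as time-sorted (timestamp in seconds, latitude, longitude) triples are ours, not from the paper.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    R = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))

def keep_user(tweets, max_speed=80.0, max_tweets=500):
    """tweets: list of (timestamp_s, lat, lon) sorted by time for one user.

    Returns False if the user looks non-human: too many tweets in the
    period, or an implied speed between consecutive tweets above 80 m/s.
    """
    if len(tweets) > max_tweets:
        return False  # suspiciously high activity, likely a bot
    for (t1, la1, lo1), (t2, la2, lo2) in zip(tweets, tweets[1:]):
        dt = t2 - t1
        if dt > 0 and haversine_m(la1, lo1, la2, lo2) / dt > max_speed:
            return False  # moved faster than the 80 m/s threshold
    return True
```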
Although we understand that it may be possible for a user to tweet that many times, we did not want to run the risk of including "robots" in our dataset. The final dataset, D1, contains 1,022,286 tweets from 108,341 users.

A. Tweet Location Clustering

We transformed the geographical coordinates from spherical to planar by using the Mercator projection [31]. The Mercator projection increasingly distorts area at higher latitudes, yet conserves relative distances as long as these distances are suitably small (less than one degree of latitude/longitude). For the Manhattan bounding box, the width (w), height (h), and diagonal (diag) distance errors were found to be w ≈ 6.09%, h ≈ 6.23%, and diag ≈ 6.09%, respectively. This distance error was calculated by comparing the haversine distance [32] between the real-world coordinates to the Euclidean distance between the projected points. The points were clustered into locations using several clustering algorithms: k-means, mean shift [33], and Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) [34]. We tested k-means with k ∈ [5,000, 50,000] with a step increment of 5,000; mean shift with a bandwidth value b from ≈ 1 m to ≈ 110 m and

noise classification both enabled and disabled; and HDBSCAN with the minimum cluster size m from 2 to 45. Several methods were used to evaluate the quality of the clustering algorithms as a function of their specific parameters, including: the Bayesian Information Criterion (BIC) [35], the silhouette score [36], the elbow rule [37], the number of clusters found, the number of unclustered points, and visual inspection. This allowed us to identify the proper parameters for each clustering algorithm and to identify where they agreed. The optimal parameters according to the aforementioned metrics are k = 25,000 for k-means, b ∈ [22.2, 44.5] for mean shift, and m ∈ [7, 10] for HDBSCAN. A comparison of the results showed that the optimal number of clusters from the k-means analysis is in agreement with the mean shift and HDBSCAN results. Specifically, for mean shift a bandwidth b = 44.5 resulted in 28,663 clusters, while HDBSCAN with a minimum cluster size of m = 7 yielded 26,093 clusters. We decided to use the clustering performed by HDBSCAN due to the advantages of this algorithm over the others (e.g., it allows for non-round clusters, clusters with different densities, and noise classification). We can confidently use HDBSCAN to produce a robust clustering by using the aforementioned parameters.

B. Feature Extraction for Location Classification

The HDBSCAN clustering identified 26,093 locations. In order to classify these locations, we extracted feature vectors associated with the temporal aspect of visitation by the users. The feature we used is a popular-times histogram. However, since the clustering was performed using HDBSCAN with a minimum cluster size of seven, some of the 24-hour histograms can potentially have only seven data points. Two solutions were implemented to combat this sparsity: first, we applied a one-dimensional Gaussian filter [38] to smooth and fill in areas of the histogram (Figure 1).
Second, we created a popular days of the week histogram to improve the power of the features to discriminate among various locations (Figure 2). The feature vectors generated this way show that some locations have very specific week/weekend behavior. Once the feature vectors were extracted and normalized, they showed very characteristic temporal patterns associated with the type of the locations (Figure 3). After normalization, we labeled the data using Foursquare data. We queried the Foursquare dataset provided by Yang et al. [39] and labeled every location generated by the clustering of Twitter data within 20 meters of a Foursquare check-in with the Foursquare location type label. We used twenty meters as the radius because, even though many points remained unclassified, expanding the radius would lead to incorrect labels and introduce noise into the data. The process generated 3,959 labeled locations and 174 categories. In order to improve the labeling accuracy and help the training of the classifier, we removed underrepresented categories (< 2% of the total dataset) and combined similar categories (e.g., Food and Drink Shop, Mexican Restaurant, American Restaurant, Bar)

into larger categories. This reduced the number of labeled locations to 1,619 and the number of categories to 9. We also filtered the locations based on their properties. We improved accuracy by only testing robust locations, that is, those that have enough data produced by several different users but not too much data, as there is the possibility that multiple locations were clustered into one. We filtered based on several criteria: number of users per location, maximum cluster size, and minimum cluster size. We required a minimum number of users to remove private residences, imposed a maximum cluster size to eliminate densely packed areas that may be skyscrapers with locations stacked on separate floors (these aggregated locations are hard to classify), and removed clusters below a certain size to increase the data points in each histogram. The ideal parameters for each of these filters were found to be a minimum of 10 users, a maximum cluster size of 80 points, and a minimum cluster size of 24 points. The filtering left 326 locations and 6 categories. The filtering removed many locations (almost 80% when removing unclassified points); therefore, to confirm that our approach is robust, we used a second dataset of labels gathered by querying the Google Places API to find the nearest locations within 20 meters of each of the locations we identified. Google Places is more detailed and leaves only one third of the points unclassified. However, the Google results did lead to some issues, because each location returns a maximum of ten labels and each label is a list of locations from most specific to least specific (e.g., bakery, store, food or furniture store, clothing store, home goods store, store). This means that, depending on whether the most or least specific label is used, some categories are subsets or supersets of the others (e.g., bakery and clothing store are both subsets of store).
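The 20-meter nearest-neighbor label matching used for both the Foursquare and the Google Places labels can be sketched with scikit-learn's BallTree, which supports the haversine metric on (lat, lon) pairs in radians. The function name and argument layout are illustrative:

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in meters

def label_locations(loc_coords, poi_coords, poi_labels, radius_m=20.0):
    """Assign each clustered location the label of the nearest POI within radius_m.

    loc_coords, poi_coords: arrays of (lat, lon) rows in degrees.
    Returns a list with a label, or None, per location.
    """
    # BallTree with the haversine metric expects coordinates in radians
    tree = BallTree(np.radians(poi_coords), metric="haversine")
    dist, idx = tree.query(np.radians(loc_coords), k=1)
    dist_m = dist[:, 0] * EARTH_RADIUS_M  # angular distance -> meters
    return [poi_labels[i] if d <= radius_m else None
            for d, i in zip(dist_m, idx[:, 0])]
```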
For this reason, we first tried using the SVM classifier with the most general result to reduce the number of categories and group similar categories. We filtered the data similarly to the filtering of the Foursquare data, by removing underrepresented categories, enforcing a minimum number of users, and setting an upper and lower bound on cluster size. The ideal parameters were found to be a minimum of 20 users per location and cluster sizes between 36 and 80.

C. Extraction of Events and Activities

We examined the structure of the feature space using spectral clustering. The idea is to use spectral clustering as a way to learn the "labels" while solving a manifold learning problem on the feature vector space [6]. In order to learn the labels, we run a textual analysis on the content of the tweets in each cluster. The topic extracted from the text of the tweets represents the category of the locations in the cluster and is assigned as the cluster label. The text content of the tweets needs to be treated with the usual text processing techniques. First, all punctuation and capitalization were removed from the tweets. This analysis is only interested in words within the tweet that have semantic


Figure 1. (a) The histogram of popular times for a location. (b) We applied a one dimension Gaussian filter to smooth the histogram in (a) as an attempt to combat the sparsity of the data for some of the locations.
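The smoothing in Figure 1 can be sketched as follows. The choice of σ and the circular (wrap) boundary, which treats hour 23 as adjacent to hour 0, are our assumptions, not stated in the paper:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def popular_times(hours, sigma=1.0):
    """Build a smoothed, normalized 24-bin popular-times histogram.

    hours: iterable of tweet hours (0-23) for one location.
    """
    hist = np.bincount(np.asarray(hours) % 24, minlength=24).astype(float)
    # wrap-around smoothing so late-night activity bleeds across midnight
    smooth = gaussian_filter1d(hist, sigma=sigma, mode="wrap")
    return smooth / smooth.sum()
```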

Figure 2. Example of popular days of the week histogram (Sunday–Saturday; y-axis: normed number of tweets) for a place that is frequented especially over the weekend. The popularity of a location during the days of the week is combined with the popularity of the location during the time of the day. The resulting feature vectors have 31 (24+7) components.
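The 31-component feature vector described in the caption of Figure 2 can be sketched as follows; normalizing the hourly and the daily histograms independently before concatenation is our assumption:

```python
import numpy as np

def feature_vector(timestamps):
    """31-component feature vector: 24 hourly bins + 7 day-of-week bins.

    timestamps: iterable of (hour, weekday) pairs, weekday 0=Sunday .. 6=Saturday.
    """
    hours = np.bincount([h for h, _ in timestamps], minlength=24).astype(float)
    days = np.bincount([d for _, d in timestamps], minlength=7).astype(float)
    if hours.sum() > 0:
        hours /= hours.sum()  # normalize each histogram independently
        days /= days.sum()
    return np.concatenate([hours, days])  # shape (31,)
```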

relevance; thus, we removed common stop-words from the tweets using NLTK [40]. We also removed words that lacked thematic significance and were very common in the dataset, namely the set {'new', 'york', 'ny', 'newyork', 'city', 'nyc'}. We then examined the frequency of specific words within each cluster to give insight into the clustering.

IV. EMPIRICAL EVALUATION

We studied the performance of an SVM classifier after applying several filters to the data and applying the labels from either Foursquare (Table I) or Google (Table II). While filtering improves classification performance, it should be noted that each filtering step significantly reduces the number of valid labeled locations available to train and test the classifier. Therefore, there is the possibility that the increase in performance is driven by the fact that the remaining locations may be easier to classify. Also, the classification accuracy achieved using Google Places labels does not get nearly as high as that achieved with Foursquare labels. There were several problems with using Google data. The

main problem is that, because Manhattan is such a dense city, searching for location types within a 20-meter radius of a location returns up to 10 candidates. This means that we were unable to select the correct location, since 20 meters is within the range of GPS accuracy. Since we lacked ground truth data, the amount of noise in the labels used to train the classifier and the limited number of points prevented the SVM classifier from learning effectively. To overcome the limited amount of labeled data, we took a new approach based on a semi-supervised learning algorithm. By using a small portion of the collected tweets that were actually posted by the Foursquare app with a location in the text of the tweet, and by using keyword searches, we were able to confirm 556 locations. The location categories have largely varying weights: the "food/bar" and "coffee" labels comprise over 70% of the confirmed labels (Table III), which is taken into account by weighting the classes to remove any bias. This distribution also confirms the intuition that people mostly check in at places of leisure. We used the label propagation algorithm to spread the labels to the rest of the unlabeled locations using a k-nearest neighbors approach [5]. An SVM classifier trained on these newly labeled locations improved the average performance to 53% accuracy, 59% precision, and 54% recall. However, some classes are remarkably harder to classify correctly than others (Figure 4). The textual analysis on the content of the tweets aggregated in each cluster revealed a surprisingly strong structure that enabled us to extract the topic representing the locations in the cluster. By using BIC and the silhouette score, we identified the optimal number of clusters as k = 15. A simple frequency analysis of the common text in each cluster reveals a pattern.
There are two types of locations: smaller clusters that correspond to specific events and larger clusters that correspond to more general types of locations (Table IV). Within the smaller clusters, there exists a 9/11 cluster (note that we gathered information in September of 2014), a Derek

Table I
FOURSQUARE SVM CLASSIFICATION PERFORMANCE AFTER EACH STEP OF FILTERING.

Metric     | 2% Filter | Consolidate | > 10 Users | < 80 Points | > 24 Points
Accuracy   | 25%       | 40%         | 40%        | 45%         | 49%
Precision  | 31%       | 49%         | 52%        | 53%         | 54%
Recall     | 25%       | 40%         | 41%        | 43%         | 43%
Locations  | 1,619     | 1,619       | 803        | 755         | 326
Categories | 174       | 9           | 6          | 6           | 6

Table II
GOOGLE SVM CLASSIFICATION PERFORMANCE AFTER EACH STEP OF FILTERING.

Metric     | 2% Filter | Remove Store | > 20 Users | < 80 Clusters | > 36 Clusters
Accuracy   | 15%       | 17%          | 20%        | 23%           | 26%
Precision  | 25%       | 31%          | 38%        | 39%           | 44%
Recall     | 14%       | 17%          | 21%        | 21%           | 26%
Locations  | 13,595    | 10,384       | 1,138      | 957           | 487
Categories | 9         | 8            | 8          | 8             | 8

Table III
BREAKDOWN OF CONFIRMED FOURSQUARE LABELS.

Category       | Size (%)
food/bar       | 35.4
coffee         | 35.1
transportation | 11.3
park           | 6.7
gym            | 5.8
museum         | 2.9
office         | 2.9

Table IV
SIZES AND NAMES OF PROMINENT TOPICS IN THE CLUSTERS.

Topic                 | Size (%)
Times Square          | 28.4
NYFW I                | 16.9
NYFW II               | 13.0
Museum                | 8.5
Madison Square Garden | 8.3
Yankees               | 7.9
Unknown               | 6.3
Night Party           | 3.6
Class                 | 2.2
Jeter                 | 1.5
Climate March         | 1.2
9/11                  | 0.5
Smorgasburg Flea      | 0.5
Soulcycle             | 0.5
Salsa                 | 0.5

Jeter's last game cluster (Jeter was an acclaimed Yankees baseball player who played his last game during our data collection period), a People's Climate March cluster, and a Smorgasburg Flea Market cluster. These clusters have clear keywords relating to the event, as well as a single large peak on the day of the week when the event took place (Figure 5). There were two other small clusters, Soulcycle and Salsa, with distinct temporal peaks (6 am and 3 am, respectively). Each large cluster shows a distinct temporal signature (Figure 6). The "Night Party" category is, intuitively, far above average popularity between midnight and 5 am, while the "Museum" category is above average in the late afternoon. Unlike the smaller categories, not all of the locations in these clusters are actually specific to the most popular words. However, the name of the cluster can give information as to the typical behavior of such a location; e.g., the "Madison Square Garden" category contains locations that are popular in the late evening, and the "Class" category contains locations with peaks around 9 am.

V. SIGNIFICANCE AND IMPACT

The method we introduced attempts to automatically discover and classify locations based on geo-temporal data without relying on contextual information. Popular time and days

histograms proved to be powerful features to classify user locations. This was also shown by the spectral clustering of the feature space, which clustered locations into small clusters when they were related to specific events (e.g., a Yankees game, a popular salsa night), while large clusters were bound to general activities (e.g., visiting a museum, visiting Times Square). The benefit of our approach is that, by taking advantage of the ubiquity of smartphones with GPS capability and the "wisdom of crowds", users' locations do not need to be known a priori but can be inferred by analyzing the clusters of coordinate points generated by the activity of the users themselves. The presented approach to classifying users' locations provides a robust framework to analyze geo-temporal data and can further improve the understanding of human activity and mobility. This is in turn of great importance to public health, city planning, and crime reduction; for instance, there is a lot

Figure 4 (rendered here as a table). Confusion matrix for an SVM classifier trained on the propagated labels; rows are true labels and columns are predicted labels. Some categories are problematic to classify due to either a lack of training data or because the types of places are often co-located.

True \ Predicted | coffee | food/bar | gym | museum | office | park | transportation
coffee           | 73%    | 14%      | 1%  | 0%     | 2%     | 1%   | 7%
food/bar         | 20%    | 71%      | 0%  | 0%     | 0%     | 1%   | 5%
gym              | 42%    | 29%      | 7%  | 0%     | 2%     | 6%   | 10%
museum           | 26%    | 10%      | 1%  | 35%    | 4%     | 12%  | 8%
office           | 53%    | 8%       | 1%  | 0%     | 30%    | 2%   | 3%
park             | 15%    | 15%      | 0%  | 1%     | 0%     | 55%  | 12%
transportation   | 29%    | 15%      | 1%  | 0%     | 2%     | 5%   | 44%

Figure 3. The normalized feature vectors, aggregated per location category ((a) popular times; (b) popular days; legend: Coffee Shop, Food, Gym / Fitness Center, Hotel, Office, Travel), show that each category has a specific temporal pattern. For instance, office is very popular at around 8 am and during weekdays, but not at night or on weekends.

more information in knowing that people tend to go from a bar to a restaurant than in knowing, more generally, that they went from point A to point B (with unclassified data). The regularities in human movement based on classified information are a lot more powerful than regularities in unclassified data. Furthermore, the geo-mapping of locations can be exploited by companies that want to provide location-based services (e.g., Yelp, TripAdvisor). Currently, these companies rely on expensive local inspection, on convincing businesses to subscribe to their paid services, or on user reporting. It is quite common for users to be asked to select the specific location they are in if they want to "check in" using these location-based services. Our classification can make this task irrelevant, given that the location can be classified later; that is, users will not need to select where they are because this can be automated. Our approach would also provide a more unified framework for location classification; while there has been some data sharing between companies, such data has become an increasing competitive advantage, and therefore is


Figure 5. The small clusters in the feature space (9/11, Jeter, Flea, Climate; x-axis: day of the week, Su–Sa; y-axis: average day feature) are associated with specific events. These clusters have clear keywords relating to the event, as well as a single large peak on the day of the week when the event took place.

considered extremely valuable. A general method such as the one presented has the potential to drastically reduce the cost of these services while simultaneously improving the quality of the service.

ACKNOWLEDGMENTS

This material is based in part on work supported by the National Science Foundation under Grant No. 1560345. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. Ronaldo Menezes is partially funded by the Army Research Office (ARO) contract No. W9111NF-17-1-0127.


Figure 6. The larger clusters in the feature space (NYFW II, NYFW I, Night Party, Class, Yankees, Madison S.G., Unknown, Times Square, Museum; x-axis: time in hours; y-axis: average hour feature) have distinct hourly peaks associated with the typical popular times for such activities.

REFERENCES

[1] I. M. Longini, M. E. Halloran, A. Nizam, and Y. Yang, “Containing pandemic influenza with antiviral agents,” American Journal of Epidemiology, vol. 159, no. 7, pp. 623–633, 2004.
[2] Y. Yang, D. Gerstle, P. Widhalm, D. Bauer, and M. Gonzalez, “Potential of low-frequency automated vehicle location data for monitoring and control of bus performance,” Transportation Research Record: Journal of the Transportation Research Board, no. 2351, pp. 54–64, 2013.
[3] X. Gabaix, P. Gopikrishnan, V. Plerou, and H. E. Stanley, “A theory of power-law distributions in financial market fluctuations,” Nature, vol. 423, no. 6937, pp. 267–270, 2003.
[4] E. Cho, S. A. Myers, and J. Leskovec, “Friendship and mobility: user movement in location-based social networks,” in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2011, pp. 1082–1090.
[5] Y. Bengio, O. Delalleau, and N. Le Roux, “Label propagation and quadratic criterion,” Semi-Supervised Learning, vol. 10, 2006.
[6] U. von Luxburg, “A tutorial on spectral clustering,” Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.
[7] D. Brockmann, L. Hufnagel, and T. Geisel, “The scaling laws of human travel,” Nature, vol. 439, no. 7075, pp. 462–465, 2006.
[8] M. C. González, C. A. Hidalgo, and A.-L. Barabási, “Understanding individual human mobility patterns,” Nature, vol. 453, no. 7196, pp. 779–782, 2008.
[9] C. Song, Z. Qu, N. Blumm, and A.-L. Barabási, “Limits of predictability in human mobility,” Science, vol. 327, no. 5968, pp. 1018–1021, 2010.
[10] C. Song, T. Koren, P. Wang, and A.-L. Barabási, “Modelling the scaling properties of human mobility,” Nature Physics, vol. 6, no. 10, pp. 818–823, 2010.
[11] H. Barbosa, F. B. de Lima-Neto, A. Evsukoff, and R. Menezes, “The effect of recency to human mobility,” EPJ Data Science, vol. 4, no. 1, pp. 1–14, 2015.
[12] C. M. Schneider, V. Belik, T. Couronné, Z. Smoreda, and M. C. González, “Unravelling daily human mobility motifs,” Journal of The Royal Society Interface, vol. 10, no. 84, p. 20130246, 2013.
[13] D. Karamshuk, C. Boldrini, M. Conti, and A. Passarella, “Human mobility models for opportunistic networks,” IEEE Communications Magazine, vol. 49, no. 12, pp. 157–165, 2011.
[14] J. L. Toole, C. Herrera-Yagüe, C. M. Schneider, and M. C. González, “Coupling human mobility and social ties,” Journal of The Royal Society Interface, vol. 12, no. 105, p. 20141128, 2015.
[15] L. Pappalardo, F. Simini, S. Rinzivillo, D. Pedreschi, F. Giannotti, and A.-L. Barabási, “Returners and explorers dichotomy in human mobility,” Nature Communications, vol. 6, 2015.
[16] X. Xiao, Y. Zheng, Q. Luo, and X. Xie, “Inferring social ties between users with human location history,” Journal of Ambient Intelligence and Humanized Computing, vol. 5, no. 1, pp. 3–19, 2014.
[17] L. Gabrielli, S. Rinzivillo, F. Ronzano, and D. Villatoro, “From tweets to semantic trajectories: mining anomalous urban mobility patterns,” in Citizen in Sensor Networks. Springer, 2014, pp. 26–35.
[18] L. Wu, Y. Zhi, Z. Sui, and Y. Liu, “Intra-urban human mobility and activity transition: evidence from social media check-in data,” PLoS ONE, vol. 9, no. 5, p. e97010, 2014.
[19] M. Mamei, M. Colonna, and M. Galassi, “Automatic identification of relevant places from cellular network data,” Pervasive and Mobile Computing, 2016.
[20] Y. Zheng and X. Xie, “Learning location correlation from GPS trajectories,” in 2010 Eleventh International Conference on Mobile Data Management (MDM). IEEE, 2010, pp. 27–32.
[21] D. Yang, D. Zhang, Z. Yu, and Z. Wang, “A sentiment-enhanced personalized location recommendation system,” in Proceedings of the 24th ACM Conference on Hypertext and Social Media. ACM, 2013, pp. 119–128.
[22] D. Lian and X. Xie, “Learning location naming from user check-in histories,” in Proceedings of the 19th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. ACM, 2011, pp. 112–121.
[23] T. H. Silva, P. O. Vaz de Melo, J. M. Almeida, J. Salles, and A. A. Loureiro, “Revealing the city that we cannot see,” ACM Transactions on Internet Technology (TOIT), vol. 14, no. 4, p. 26, 2014.
[24] H. Liu, B. Luo, and D. Lee, “Location type classification using tweet content,” in 2012 11th International Conference on Machine Learning and Applications (ICMLA), vol. 1. IEEE, 2012, pp. 232–237.
[25] M. Ye, D. Shou, W.-C. Lee, P. Yin, and K. Janowicz, “On the semantic annotation of places in location-based social networks,” in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2011, pp. 520–528.
[26] S. Wakamiya, R. Lee, and K. Sumiya, “Crowd-based urban characterization: extracting crowd behavioral patterns in urban areas from Twitter,” in Proceedings of the 3rd ACM SIGSPATIAL International Workshop on Location-Based Social Networks. ACM, 2011, pp. 77–84.
[27] J. Yuan, Y. Zheng, and X. Xie, “Discovering regions of different functions in a city using human mobility and POIs,” in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2012, pp. 186–194.
[28] B. Shaw, J. Shea, S. Sinha, and A. Hogue, “Learning to rank for spatiotemporal search,” in Proceedings of the Sixth ACM International Conference on Web Search and Data Mining. ACM, 2013, pp. 717–726.
[29] Z. Cheng, J. Caverlee, K. Lee, and D. Z. Sui, “Exploring millions of footprints in location sharing services,” ICWSM, vol. 2011, pp. 81–88, 2011.
[30] S. Scellato, A. Noulas, R. Lambiotte, and C. Mascolo, “Socio-spatial properties of online location-based social networks,” ICWSM, vol. 11, pp. 329–336, 2011.
[31] J. P. Snyder, Map Projections: A Working Manual. USGPO, 1987, no. 1395.
[32] C. Robusto, “The cosine-haversine formula,” The American Mathematical Monthly, vol. 64, no. 1, pp. 38–40, 1957.
[33] D. Comaniciu and P. Meer, “Mean shift: a robust approach toward feature space analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 603–619, 2002.
[34] R. J. Campello, D. Moulavi, and J. Sander, “Density-based clustering based on hierarchical density estimates,” in Advances in Knowledge Discovery and Data Mining. Springer, 2013, pp. 160–172.
[35] C. Fraley and A. E. Raftery, “How many clusters? Which clustering method? Answers via model-based cluster analysis,” The Computer Journal, vol. 41, no. 8, pp. 578–588, 1998.
[36] P. J. Rousseeuw, “Silhouettes: a graphical aid to the interpretation and validation of cluster analysis,” Journal of Computational and Applied Mathematics, vol. 20, pp. 53–65, 1987.
[37] T. M. Kodinariya and P. R. Makwana, “Review on determining number of cluster in k-means clustering,” International Journal, vol. 1, no. 6, pp. 90–95, 2013.
[38] A. Buades, B. Coll, and J.-M. Morel, “A review of image denoising algorithms, with a new one,” Multiscale Modeling & Simulation, vol. 4, no.
2, pp. 490–530, 2005. [39] D. Yang, D. Zhang, V. W. Zheng, and Z. Yu, “Modeling user activity preference by leveraging user spatial temporal characteristics in lbsns,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 45, no. 1, pp. 129–142, 2015. [40] J. Perkins, Python text processing with NLTK 2.0 cookbook. Packt Publishing Ltd, 2010.
