2013 IEEE International Conference on Green Computing and Communications and IEEE Internet of Things and IEEE Cyber, Physical and Social Computing
Understanding Human Dynamics of Check-in Behavior in LBSNs
Yun Feng, Zhiwen Yu, Xinjiang Lu
School of Computer Science, Northwestern Polytechnical University, Xi’an, P. R. China
[email protected], [email protected], [email protected]
Jilei Tian
Nokia Mobile Phone Services, Beijing, P. R. China
[email protected]
Abstract—With the growing popularity and pervasive use of sensor-embedded smart phones, location-based social network services (LBSNs) have become widely used in recent years. In this paper, we investigate the human dynamics of check-in data crawled from Jie Pang, a well-known Chinese LBSN service. We study the interval time and jump size (i.e., distance) between consecutive check-ins at both the population level and the individual level. We find that both the interval time and the jump size follow a Weibull distribution rather than a power law distribution at the population level. At the individual level, for the top 10,000 most active users, 9406 individuals follow a power law distribution and only 594 follow a Weibull distribution in interval time, whereas 5096 individuals follow a Weibull distribution and 4904 follow a power law distribution in jump size. In addition, we analyze check-in behavior across genders and cities. Our experimental results show that users in Shanghai are more active than users from other cities and that females are more active than males in terms of check-in service.
Keywords-human dynamics; location-based social networks (LBSN); power law distribution; Weibull distribution
978-0-7695-5046-6/13 $26.00 © 2013 IEEE DOI 10.1109/GreenCom-iThings-CPSCom.2013.160
I. INTRODUCTION
Understanding the statistical patterns and dynamics of human behavior plays an important role in many areas, such as urban planning [1], public health [2], traffic control [3], and the spread of biological viruses [4, 5]. However, it is challenging to collect large-scale data that record human behavior. With the growing popularity and pervasive use of sensor-embedded smart phones, location-based social network services (LBSNs) such as Foursquare [6], Bliin [7], and Jie Pang [8] have become widely used in recent years. These services not only allow users to explore location-aware information, but also provide interfaces to write reviews and share locations and experiences with others. The locations uploaded by users, each consisting of a geographic coordinate and a timestamp, compose spatial-temporal trajectories of their daily lives. We regard them as “digital footprints” [9]. These digital footprints allow us to explore user behavior, for example, how often people make check-ins and how far they travel.
In this paper, we aim at understanding the human dynamics of online check-in behavior based on a dataset crawled from Jie Pang, a well-known LBSN service in China. The main contributions are as follows: (1) we use the radius of gyration to represent users’ travel distance and find that user mobility shows a high degree of locality, with longer travel distances than reported in an existing study [10]; (2) we analyze check-in behavior by gender and across four distinct cities. The results reveal that female users are more active than male users and that the more prosperous the city is, the more active its users are; (3) we carry out spatial and temporal analysis at the population level as well as the individual level. The results demonstrate that the spatial and temporal distributions at the population level both follow a Weibull distribution, but not all individuals do. Some of them follow a power law distribution.
The remainder of this paper is organized as follows. We discuss related work in Section 2. The dataset used in this study is described in Section 3. Section 4 presents the analysis of the human dynamics of check-in behavior at both the population level and the individual level. Finally, we conclude the paper in Section 5.
II. RELATED WORK
A number of studies over the past years have analyzed human dynamics and mobility patterns using data extracted directly from the Web [11], business transactions [12], online games [13], etc. Dezsö et al. [11] divided a major news portal network into stable nodes and news documents and investigated the dynamics of visitation of the two node types. Gao et al. [12] captured the creation times of purchase orders to an individual vendor, as well as to all vendors, and investigated their dynamics by applying logarithmic binning to the construction of distribution plots. Henderson et al. [13] analyzed three features of data gathered by regularly polling 2256 game servers located all over the Internet: the number of players in a game, the inter-arrival times between players, and the length of a player's session. All of these works primarily focus on the temporal characteristics of human behavior, whereas our work also takes spatial features into account.
Mobile phone data has been widely used to understand human dynamics in recent years [14]. Researchers use mobile phone communication data (call logs and text messages) to represent human travel data, and regard the location of the tower routing the communication as the user’s location. Barabasi et al. [15] mined individual human mobility patterns by studying the trajectories of mobile phone users and found that human trajectories show a high degree of temporal and spatial regularity. In their research, human trajectories are extracted from cell IDs, whereas we use the real location data uploaded by users themselves to analyze their travel patterns.
Recently, researchers have started to use location-based social network data, e.g., from Foursquare, to investigate human dynamics. Noulas et al. [16] presented a large-scale study of the geo-temporal dynamics of collective user activity on Foursquare. Our work not only considers collective user activity on an LBSN, but also takes individual user activity into account. Furthermore, we group users according to their cities and gender, and study the temporal and spatial features of the different groups.
III. DATA
Figure 1. The distribution of check-in number of ordered individuals
The data used in this paper are collected from Jie Pang, a location-based social network service in China that allows users to share their locations, write tips, upload photos, etc. Jie Pang has released client applications for smart phones. These client applications can use GPS and other technologies for automatic location sensing. Users can also use the “check in” feature to update their current locations. We used the open API [17] provided by Jie Pang to collect users’ check-in data. The dataset used in this paper was collected over two months, from Apr. 2012 to Jun. 2012. We randomly selected a set of 89,936 anonymous users from those who made at least one check-in on Jie Pang between their registration time and our collection time. The dataset includes a total of 2,407,647 records, and each record contains a user ID, a check-in time, and a geographic coordinate (latitude / longitude).
The overview of the dataset is presented in TABLE I. The average number of entries per user is only 26.77. We also examine the percentage of entries contributed by the top 20% most active users: they account for about 80% of all check-in entries, which is consistent with the Pareto law. Each of these top users makes approximately 105.3 check-ins on average, which is much larger than 26.77. We order the individuals according to their number of entries. The distribution of entries among the ordered users is illustrated in Fig. 1 and follows a power law. Specifically, the figure reveals that the most active user in our dataset makes about 3200 check-ins and that only about 5% of users make more than 100 check-ins, whereas the majority of users are rather inactive.
Within the collected dataset, the four cities with the most check-ins are Shanghai, Beijing, Guangzhou, and Hangzhou. The numbers of users and check-in entries in the four cities are listed in TABLE II. The number of check-ins in Shanghai is much larger than in the other three cities. Furthermore, we list the numbers of users and check-in entries by gender in TABLE III. Most people prefer not to indicate their gender. Among those who did, females are more active than males.
TABLE I. DATASET OVERVIEW
Number of entries                              2,407,647
Number of users                                89,936
Average number of entries per user             26.77
Average number of entries of top 20% users     105.3
Percentage of entries of top 20% users         78.7%
TABLE II. THE NUMBER OF CHECK-IN ENTRIES OF DIFFERENT CITIES
City         Number of users    Number of entries
Shanghai     48,843             1,454,515
Beijing      12,195             315,787
Guangzhou    2,010              42,759
Hangzhou     1,880              38,607
TABLE III. THE NUMBER OF CHECK-IN ENTRIES OF DIFFERENT GENDER
Gender       Number of users    Number of entries
Female       18,084             772,843
Male         13,933             447,164
Unknown      57,919             1,187,640
IV. HUMAN DYNAMICS OF CHECK-IN BEHAVIOR
In this section, we analyze spatial and temporal patterns of human check-in behavior at both the population level and the individual level.
A. Travel Distance
Travel distance refers to the typical distance covered by an individual during daily life. We adopt the radius of gyration [15] to represent the moving range of users. Human check-in behavior is highly heterogeneous: some users rarely share their locations, while others prefer to record their life trajectories. To investigate whether this heterogeneity affects the distribution of travel distance, we divide users into four groups based on their total number of check-in records.
Figure 2. Travel distance distribution of groups with different check-in num.
Figure 3. Cumulative distribution of users covering different travel distances.
Figure 4. Interval time distribution at population level and its fits.
Figure 5. Interval time distribution of groups with different check-in num.
Fig. 2 presents the distribution of the typical distance covered by the four groups. For each group, we measured the probabilities of different distances. The groups have similar distributions, which means that the heterogeneity does not affect users’ travel distance. Most individuals’ daily activity is confined to a limited neighborhood, while a small number of users cover hundreds of kilometers. Fig. 3 shows that more than 70% of users have a travel distance within 25 km. The distribution of individual travel distance thus shows a high degree of locality in user mobility, which agrees with earlier results on the geographical distribution of mobile phone communication [10]. However, our result indicates longer travel distances than those reported in [10].
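To make the travel-distance measure concrete, the following is a minimal sketch of a per-user radius of gyration computed from latitude/longitude check-ins. It assumes the common definition (root mean square distance of a user's points from their center of mass), approximates the center of mass by the arithmetic mean of the coordinates, and measures distances with the haversine formula. The function names and the Earth radius constant are illustrative choices, not taken from the paper.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # min() guards against tiny floating-point overshoot above 1.0
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def radius_of_gyration_km(checkins):
    """RMS distance (km) of one user's check-ins from their mean position.

    `checkins` is a non-empty list of (latitude, longitude) pairs in degrees.
    The center of mass is approximated by the coordinate mean, which is
    reasonable for city-scale data away from the poles and the date line.
    """
    n = len(checkins)
    if n == 0:
        raise ValueError("at least one check-in is required")
    c_lat = sum(lat for lat, _ in checkins) / n
    c_lon = sum(lon for _, lon in checkins) / n
    mean_sq = sum(haversine_km(lat, lon, c_lat, c_lon) ** 2 for lat, lon in checkins) / n
    return math.sqrt(mean_sq)
```

Calling `radius_of_gyration_km` once per user and then histogramming the results by activity group would give distributions of the kind discussed for Fig. 2 and Fig. 3, under the assumptions stated above.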
B. Human Dynamics at Population Level
1) Interval Time
From the temporal aspect, we study the characteristics of the interval time between successive check-in records. We plot the interval time distribution of all individuals in Fig. 4. The interval time distribution at the population level is clearly heavy-tailed. To identify the specific distribution, we hypothesize that the interval time follows either a Weibull distribution [18] or a power law distribution. We then apply the least-squares method to estimate the power law exponent and the Weibull parameters, which gives complete fits of the duration distribution curve in Fig. 4. Finally, we use the Kolmogorov-Smirnov (KS) test [15], which is widely used to evaluate whether a distribution function fits real data, to measure the goodness of fit.
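The fitting procedure described above can be sketched as follows. The paper does not state how the least-squares fit was set up, so this example assumes one common approach: linearize each candidate's cumulative distribution, fit a straight line with `numpy.polyfit`, and then compare the two candidates by the KS statistic (the maximum gap between the empirical and fitted CDFs). All function names are illustrative, and the power law is modeled here as a Pareto distribution with a fitted lower cutoff.

```python
import numpy as np


def empirical_cdf(samples):
    """Sorted samples and plotting positions (i - 0.5) / n, strictly inside (0, 1)."""
    x = np.sort(np.asarray(samples, dtype=float))
    f = (np.arange(1, x.size + 1) - 0.5) / x.size
    return x, f


def fit_weibull_ls(x, f):
    """Least squares on ln(-ln(1 - F)) = k * ln(x) - k * ln(lam)."""
    k, c = np.polyfit(np.log(x), np.log(-np.log(1.0 - f)), 1)
    return k, np.exp(-c / k)


def fit_power_law_ls(x, f):
    """Least squares on ln(1 - F) = -a * ln(x) + a * ln(x_min) (Pareto tail)."""
    slope, intercept = np.polyfit(np.log(x), np.log(1.0 - f), 1)
    a = -slope
    return a, np.exp(intercept / a)


def weibull_cdf(x, k, lam):
    return 1.0 - np.exp(-((x / lam) ** k))


def power_law_cdf(x, a, x_min):
    # Clip so that values below the fitted cutoff map to probability 0
    return np.clip(1.0 - (x / x_min) ** (-a), 0.0, 1.0)


def ks_statistic(model_cdf_at_sorted_x):
    """Maximum distance between the empirical step CDF and the model CDF."""
    cdf = np.asarray(model_cdf_at_sorted_x)
    n = cdf.size
    upper = np.arange(1, n + 1) / n   # empirical CDF just after each point
    lower = np.arange(0, n) / n       # empirical CDF just before each point
    return max(np.max(np.abs(upper - cdf)), np.max(np.abs(cdf - lower)))


def compare_fits(samples):
    """Return (ks_weibull, ks_power_law) for positive-valued samples."""
    x, f = empirical_cdf(samples)
    k, lam = fit_weibull_ls(x, f)
    a, x_min = fit_power_law_ls(x, f)
    return ks_statistic(weibull_cdf(x, k, lam)), ks_statistic(power_law_cdf(x, a, x_min))
```

For a list of positive interval times, `compare_fits(intervals)` returns the two KS values, and the candidate with the smaller value is taken as the better fit. A maximum-likelihood fit would be a reasonable alternative to the linearized least squares assumed here.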
Of two candidate distribution functions, the one with the smaller KS value fits better [18]. The solid line in Fig. 4 presents a Weibull fit with [...]. We get a KS value of 0.18 for the Weibull fit and 1.73 for the power law fit. This means that, over all individuals, the interval time distribution is fit better by the Weibull distribution than by the power law.
To characterize the duration distribution of human check-in behavior, we divide users into five groups according to their total number of check-in records. Fig. 5 presents the probability that the interval time between two consecutive check-in records is T for each group. Users with fewer check-in records tend to wait longer before the next check-in. We also present the number of cases in which the interval time between two consecutive check-in activities is T for three groups in Fig. 6. For all groups, this count shows a damped oscillation with a period of 24 hours.
Figure 6. Statistical distribution of interval time of different groups.
Figure 7. Interval time distribution of users in different cities.
Figure 8. Interval time distribution of users with different gender.
Figure 9. Aggregated jump size distribution and its fits
We also investigate the check-in characteristics of users in different cities and of different gender. The users are grouped based on their registered city and gender. Fig. 7 illustrates the interval time distribution of check-in behavior of users who live in Shanghai, Beijing, Guangzhou, and Hangzhou. The shape parameters of these four distributions are 0.4408, 0.4384, 0.4192, and 0.4052, respectively. From this we conclude that users in Guangzhou and Hangzhou tend to wait longer before the next check-in than users in Beijing and Shanghai. Users in Shanghai tend to wait the shortest time among the four cities. We believe this phenomenon can reflect the city size and
Figure 10. Aggregated jump size distribution of users in different cities.
Figure 12. KS test value for each of the 10,000 individuals.
Figure 11. Aggregated jump size distribution of users with different gender.
Figure 13. Interval time distributions of three randomly chosen individuals.
distribution. We assume that it is a power law distribution or a Weibull distribution. The solid line in Fig. 9 shows that the aggregated jump size over all users can be fit by a Weibull [...]