DOI 10.1111/tgis.12266
SHORT TECHNICAL NOTE
Big location-based social media messages from China’s Sina Weibo network: Collection, storage, visualization, and potential ways of analysis
Michael Jendryke1 | Timo Balz1,2 | Mingsheng Liao1,2
1 State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, 129 Luoyu Road, 430076 Wuhan, China
2 Collaborative Innovation Center for Geospatial Technology, 129 Luoyu Road, Wuhan 430079, China
Correspondence
Timo Balz, State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, 129 Luoyu Road, 430076 Wuhan, China
[email protected],
[email protected]
Abstract
China’s social media platform, Sina Weibo, like Twitter, hosts a considerable amount of big data: messages, comments, pictures. Collecting and analyzing information from this treasury of human behavior data is a challenge, although the message exchange on the network is readable by everyone through the web or app interface. The official Application Programming Interface (API) is the gateway to access and download public content from Sina Weibo and is used to collect messages for all mainland China. The nearby_timeline() request is used to harvest only messages with associated location information. This technical note serves as a reference for researchers who do not speak Mandarin but want to collect data from this rich source of information. Ways of data visualization are presented as a point cloud, density per areal unit, or clustered using Density-Based Spatial Clustering of Applications with Noise (DBSCAN). The relation of messages to census information is also given.
KEYWORDS
big data, data collection, location based services, Social Media
Trans in GIS. 2017;1–10
wileyonlinelibrary.com/journal/tgis
© 2017 John Wiley & Sons Ltd

1 | INTRODUCTION
Sina Weibo is the largest social media platform in China. It has an asymmetric user relationship structure and is completely open to everyone. The principle is that users can write, publish and comment on messages that are publicly readable by everyone; everyone can follow everyone else without the consent of the other party, and everyone is able to comment on any public message that is posted on the network. The transparency and accessibility of location-based information make the network a treasury of data and an opportunity for researchers to analyze and interpret voluntarily shared information from a large group of people. Users interact on the network to exchange short messages of up to 140 Chinese characters, which is more content than Twitter allows on its network. The messages cover news, celebrities, friends and social interaction in general, but the network is also used by organizations to distribute information and by companies for advertisement and marketing. Information about the user composition and usage of Sina Weibo is published in Fu and Chau (2013).
In this short technical note, we are interested in the collection, storage, visualization and possibilities of analysis of messages with associated metadata: the GPS location and timestamp of the message. The message text itself is also collected, but not further interpreted in this research. We introduce Sina Weibo as a source of information with a focus on the message collection process. This data source may be interesting to people dealing with (big) spatial data from location-based services. One (of many) possible applications is to look at the spatiotemporal distribution of social media messages to gain information about human activity patterns in urban areas. A certain number of messages has to be present in an area over time to make a meaningful statement about variations in urban activity. By collecting only messages with location information, we retrieve a small subset of the entire message exchange on the network. The presented social media dataset also raises problems of visualization, which will be addressed as well.
2 | BACKGROUND
2.1 | Sina Weibo and Twitter
Sina Weibo ranks very high for social media users in China and Chinese-speaking countries (Guan et al., 2014). However, the body of research is rather small compared to Twitter. The research is often focused on the semantic analysis of message content and its spatial distribution (Zook & Poorthuis, 2014). Twitter is very similar to Sina Weibo; a detailed comparison is provided in Chen, Zhang, Lin, and Lv (2011). While the functionality of Sina Weibo (e.g. a user can direct messages to another @user, use #hashtags, set the privacy status, and attach location information) is similar to that of Twitter, it has received less attention from the research community. A search for the term “Sina Weibo” on sciencedirect.com (in August 2016) returned 246 published articles, compared to 11,371 for “Twitter”.
The location information that is retrieved through the API is presented as a coordinate pair in decimal degrees with five decimal places. This information stems from the GPS location of the mobile device. However, a user could potentially select an already existing location from a drop-down list, which may not be his/her actual location. We have to accept this special case as a source of error. Messages from users who opt out of attaching location information will not be collected using the nearby_timeline() API request.
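To give a sense of what five decimal places mean on the ground, the following sketch converts one step in the last decimal place into meters. It assumes a spherical Earth with roughly 111,320 m per degree of latitude, which is a common approximation and not a value taken from the API documentation; the function name is purely illustrative.

```python
import math

# Assumption: spherical approximation, about 111,320 m per degree of latitude.
METERS_PER_DEGREE_LAT = 111_320.0


def coordinate_resolution_m(lat_deg, decimals=5):
    """Ground distance of one step in the last decimal place, as
    (north-south, east-west) in meters at the given latitude."""
    step_deg = 10.0 ** -decimals
    north_south = step_deg * METERS_PER_DEGREE_LAT
    # Meridians converge towards the poles, so east-west steps shrink.
    east_west = north_south * math.cos(math.radians(lat_deg))
    return north_south, east_west


ns, ew = coordinate_resolution_m(39.9)  # roughly the latitude of Beijing
print(f"{ns:.2f} m north-south, {ew:.2f} m east-west")
```

Under this assumption the coordinate resolution is on the order of one meter, which is likely finer than the positioning error of most mobile devices, so the rounding of the coordinates should not be the limiting factor for positional accuracy.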
2.2 | Critiques about Social Media Research
Before starting to work with data from social media networks, one should be aware of the data limitations, the critique, and the fallacies that can arise when dealing with such data. In a recent Science article, Ruths and Pfeffer (2014) list aspects of social media research that should be considered, e.g. the comparison of two similar networks to overcome biases that occur when using one network alone. Another example is research that focuses on the semantics of messages to predict voting behavior (Cohen & Ruths, 2013). Caution is necessary when collecting, analyzing and interpreting social media data.
Twitter data has been used in multiple studies as the main source of social media information, possibly due to its accessibility, e.g. to study the distribution of different beer brands in the United States (Zook & Poorthuis, 2014). Web search data has been used in a similar way, for the prediction of consumer behavior (Goel, Hofman, Lahaie, Pennock, & Watts, 2010) and by Google to analyze and predict flu trends with its Google Flu Trends algorithm (GFT), which was critiqued by Lazer, Kennedy, King, and Vespignani (2014) because it predicted twice as many doctor visits as actually happened.
Compared to census data, the strength of this kind of data is the high positional accuracy associated with a precise timestamp. This indicates an activity of a person at the time of sending a message. Tapping into the message content itself, this data offers a source of information where the users become citizens as sensors (Goodchild, 2007), for example to extract potential information about air quality (Kay, Zhao, & Sui, 2015). On the downside, it should be mentioned that only a subset of the population is sampled, since not everyone uses the network. A technology-friendly environment with easy access to the internet is necessary to post a message. Additionally, it is not clear to what extent Sina Weibo is filtering message availability (e.g. messages from VIP users).
2.3 | Other data sources
Social media data is often a proxy for data that is otherwise not accessible or not collected at all. In this research, we focus on the collection of location information rather than, e.g. friendship connections (small worlds) or mobility patterns. A very similar dataset would be mobile phone connections (Calabrese, Diao, Di Lorenzo, Ferreira, & Ratti, 2013; Reades, Calabrese, Sevtsuk, & Ratti, 2007), which also show the current activity of a person at a certain location with a specified timestamp. However, this kind of data is very difficult to obtain from a mobile phone network provider and is possibly collected without the consent of the user.
3 | THE SINA WEIBO API
Sina Weibo is accessible in two ways: as a regular user via the web interface or the application for iOS or Android, or as a developer who can interact directly with the database of Sina Weibo through various Software Development Kits (SDKs) and programming languages. This means that a developer can send (post()) and retrieve (get()) data from the platform without using the web interface or mobile phone application. A query is sent to the Application Programming Interface (API) and the result is returned as a text string.
A developer account is an opt-in option for regular users. Additional information about the user is needed and the purpose for using the API has to be stated. After that, a user has the option to create up to 10 so-called app keys with their associated app secrets to access the regular API queries. A developer can also gain access to special API queries with a paid Very Important Person (VIP) account. Here, we only use the free-of-charge possibilities with their associated limitations. For example, the number of requests per hour and app key is limited to 150. This is a severe limitation that does not allow one to interact with the API too frequently, since overuse incurs a time penalty from Sina Weibo. In addition, the maximum number of returned database entries is restricted to 50 in the case of the nearby_timeline() API.
The possibilities of the API are identical to the requests that are sent and received when using, for example, the web interface as a regular user. Messages can be sent and retrieved, follow connections established, pictures posted, comments written and searches performed. A full list of all the post() and get() requests is given in Weibo (2015). The advantage of using the SDKs in conjunction with the app keys to interact with the Sina Weibo database is simply that all of these interactions can be automated. Requests can be timed so that overuse is avoided, and data that is returned can be inserted directly into a local database. The returned text string is a JSON object, which contains more information than is visible on the web interface.
A general disadvantage of using data from the API or reading the web interface is that it is rather unclear which messages are actually returned: it might be possible, even though not documented, that the messages are filtered by Sina Weibo. For example, Sina Weibo could decide to prefer VIP accounts (very important people; essentially paid accounts) and return their messages more often, or messages from certain locations could be chosen more frequently.
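The timing of requests mentioned above could be sketched as a small per-key limiter. This is a minimal illustration in Python, not the authors' C# implementation: the class name is hypothetical, and it assumes the limit of 150 requests is counted over a sliding one-hour window, which the text does not specify.

```python
import time
from collections import deque


class HourlyRateLimiter:
    """Sliding-window limiter: at most `limit` requests per `window_s` seconds.

    Assumption: the quota behaves like a sliding window; the real API may
    instead reset its counter at fixed clock hours.
    """

    def __init__(self, limit=150, window_s=3600.0, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self._sent = deque()  # send times of the requests still in the window

    def seconds_until_free(self):
        """Return 0.0 if a request may be sent now, else the wait in seconds."""
        now = self.clock()
        # Forget requests that have left the sliding window.
        while self._sent and now - self._sent[0] >= self.window_s:
            self._sent.popleft()
        if len(self._sent) < self.limit:
            return 0.0
        return self.window_s - (now - self._sent[0])

    def record(self):
        """Register that one request has just been sent."""
        self._sent.append(self.clock())


limiter = HourlyRateLimiter()        # one limiter per app key
if limiter.seconds_until_free() == 0.0:
    limiter.record()                 # ...and send the request here
```

Keeping one such limiter per app key lets a harvesting program pick whichever key is currently free instead of running into the time penalty.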
4 | HARVESTING SINA WEIBO
In this study we want to collect all messages over mainland China that contain location information using the specific API request nearby_timeline(). This request returns only messages that have location information in the metadata. Certain input parameters are needed to send a request to the API, see Table 1. The mandatory parameters are latitude and longitude in decimal degrees, which define a center coordinate. Together with a search radius they create a circular area (we refer to this area as a [harvesting] field) from which messages within the distance (radius) from the center coordinate are retrieved. Additional parameters are, for example, the start and end time between which a message has been posted (the default is ‘most recent messages’), the search radius (default is 2,000 m, maximum 11,132 m), and options for sorting and formatting the output of the returned text string. A full list of required and optional parameters is given in Table 1 and Sina Weibo (2014).

TABLE 1
Required and optional parameters of the ‘nearby_timeline()’ request

| Attribute | Required | Description |
|---|---|---|
| source | false | Licensing options |
| access_token | false | The OAuth2.0 access token |
| lat | true | Latitude coordinate in decimal degrees |
| long | true | Longitude coordinate in decimal degrees |
| range | false | Search radius (default 2,000 m, max. 11,132 m) |
| starttime | false | UNIX time of first possible message |
| endtime | false | UNIX time of last possible message |
| sort | false | Sorting |
| count | false | Return count (default 20, max. 50 messages) |
| page | false | Paging of results |
| base_app | false | Base app |
| offset | false | Offset |

The complete request (an httpget() request) is formed as shown below, where ACCESS-TOKEN is a code that is retrieved through OAuth2.0 authentication with the app key and secret:
https://api.weibo.com/2/place/nearby_timeline.json?access_token={ACCESS-TOKEN}&lat=30.1234&long=121.5678
A JSON object is returned from Sina Weibo, which has to be deserialized before passing it to the database. A subset of the returned values is shown in Table 2. A total of 38 different attributes are associated with each message. Not only the message text with its unique ID, the time the message was posted and the coordinates of the message are revealed, but also statistics and information about the user’s province and city of origin.

TABLE 2
Subset of the values that are returned by the API

| Group | Attribute | Value | Description |
|---|---|---|---|
| Message details | created_at | Tue May 31 12:00:00 +0800 2016 | Timestamp of message (in UTC) |
| Message details | ID | 123455678909 | Unique ID of the message |
| Message details | text | “hello” | Message text |
| Message details | in_reply_to_status_id | 345634754864 | Message is a reply to the message of this ID |
| Message details | coordinates | [121.5678, 30.1234] | Coordinate pair in decimal degrees |
| Message details | reposts_count | 10 | Number of times the message has been reposted |
| Message details | comments_count | 33 | Number of comments to this message |
| User details | screen_name | Wang | User’s screen name |
| User details | province | 11 | User’s province code |
| User details | city | 5 | User’s city code |
| User details | statuses_count | 100 | User’s total number of messages |
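The request and the deserialization step described in this section could be sketched as follows. This is an illustration in Python rather than the C# SDK used in the study. The parameter names follow Table 1 and the sample values follow Table 2, but the flat layout of the sample record, the "+0800" offset in the timestamp, the longitude-first order of the coordinate pair, and both function names are assumptions; the real response may nest these fields differently.

```python
import json
from datetime import datetime
from urllib.parse import urlencode

BASE_URL = "https://api.weibo.com/2/place/nearby_timeline.json"


def build_request_url(access_token, lat, lon, radius_m=11132, count=50, endtime=None):
    """Assemble the GET URL from the parameters listed in Table 1."""
    params = {
        "access_token": access_token,
        "lat": f"{lat:.5f}",
        "long": f"{lon:.5f}",
        "range": radius_m,   # search radius in meters
        "count": count,      # messages per request (maximum 50)
    }
    if endtime is not None:
        params["endtime"] = int(endtime)  # UNIX time of last possible message
    return f"{BASE_URL}?{urlencode(params)}"


def parse_message(raw):
    """Reduce one deserialized message to the fields kept for analysis."""
    lon, lat = raw["coordinates"]  # assumed order: longitude, latitude
    return {
        "id": int(raw["ID"]),
        "created_at": datetime.strptime(raw["created_at"], "%a %b %d %H:%M:%S %z %Y"),
        "text": raw["text"],
        "lon": float(lon),
        "lat": float(lat),
    }


# Hypothetical record built from the example values in Table 2.
sample = json.loads(
    '{"created_at": "Tue May 31 12:00:00 +0800 2016", "ID": 123455678909,'
    ' "text": "hello", "coordinates": [121.5678, 30.1234]}'
)
message = parse_message(sample)
url = build_request_url("ACCESS-TOKEN", 30.1234, 121.5678)
print(url)
print(message["created_at"].isoformat(), message["lon"], message["lat"])
```

Parsing the timestamp into a timezone-aware value at this stage avoids ambiguity later, when messages are grouped by day or hour.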
FIGURE 1 Mainland China is covered with 49,296 hexagon fields as center coordinates for data collection; here showing the number of times each field has been queried (white: 28-114 times; black: 2,436-3,783 times)
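One way to generate center coordinates whose circular search areas leave no gaps is a hexagonal lattice in which each hexagon is inscribed in its search circle. The sketch below illustrates this idea only. It assumes planar coordinates in meters (e.g. a projected coordinate system) and a rectangular bounding box, and the function name is hypothetical; the note does not describe how the authors' 49,296 centers were actually laid out.

```python
import math


def hex_grid_centers(x_min, y_min, x_max, y_max, radius_m=11132.0):
    """Centers of pointy-top hexagons (circumradius = search radius) that
    tile the bounding box, so the circles around them cover it completely."""
    dx = math.sqrt(3.0) * radius_m  # spacing between centers within a row
    dy = 1.5 * radius_m             # spacing between rows
    centers = []
    row = 0
    while True:
        y = y_min + row * dy
        if y > y_max + radius_m:
            break
        # Every second row is shifted by half a cell to interlock the hexagons.
        x = x_min - (dx / 2.0 if row % 2 else 0.0)
        while x <= x_max + dx / 2.0:
            centers.append((x, y))
            x += dx
        row += 1
    return centers


centers = hex_grid_centers(0.0, 0.0, 100_000.0, 100_000.0)
print(len(centers), "centers for a 100 km x 100 km box")
```

Because every point of a hexagon lies within its circumscribed circle, a request with the maximum radius at each center reaches the whole box; neighboring circles overlap, so duplicate messages have to be filtered by their ID.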
4.1 | Data Collection
To collect data via the API, OAuth2.0 authentication has to be obtained using the app key, which is provided for developers. One app key is limited to 150 requests per hour, with a maximum return of 50 entries per request. Due to this restriction and the huge number of messages that are exchanged on the platform, up-scaling is necessary when the purpose is to collect/harvest all messages over mainland China. An appropriate search algorithm with multiple app keys has to be installed to manage, weight and distribute queries. In total, 20 regular user accounts were created, which results in 20 developer accounts with 10 app keys each. These app keys are then used by a software program, written with the C# SDK provided by XuanChenLin (2013), to generate the httpget() requests, parse the received JSON string object, and insert the data into a Microsoft SQL Server database. The goal is to generate the mentioned API request with its authentication code and the center coordinate, send it to Sina Weibo, and retrieve the result.
With a maximum search radius of 11,132 m, 49,296 center coordinates must be created to cover mainland China, see Figure 1. The grid of center coordinates (we also call them harvesting fields) is converted to hexagons for visualization purposes as shown in Figure 2. The software uses these fields to manage the collecting/harvesting process, making sure that the app keys are effectively employed and weighted towards areas with larger message density. We assume that not every field will return the same number of messages; however, it is unknown in advance which fields (center coordinates for the search) return more messages and which do not, so weighting is needed.

FIGURE 2 Collecting coordinates represented as hexagons over the municipal area of Beijing (122 fields)
FIGURE 3 Histogram of message count per day for the entire time of data collection

For example, a field is harvested more often when two conditions are met:
1. The API returns the maximum number of 50 messages per request; and
2. There are new messages, which are not in the database already.
The message IDs (remember each message has a unique ID) of the current API result are compared with the message IDs that are already in the local database. A particular field is queried less often if no messages are returned. Its priority is lowered, increasing the time before a request is sent again. This allows for a drastic reduction of ineffective API requests and does not require the creation of more developer accounts with app keys. The effectiveness is especially obvious when coloring each hexagon according to the number of times it has been harvested, as shown in Figure 1. Just by visualizing the weighting algorithm that has been implemented, urban areas become visible. It is not causality, but there is a correlation between urban areas and the number of posted Sina Weibo messages.
The optional parameter endtime is also set, to avoid collecting only recent messages. Our experience is that messages of up to six months in the past are obtainable; older messages seem to be archived by Sina Weibo and inaccessible with a free-of-charge account. This is, unfortunately, undocumented as well. After an initial phase where all fields have the same priority and are harvested at least once, the harvesting algorithm continues 24 hours a day. Per day, 150 requests per hour and key × 10 keys per account × 20 accounts × 24 hours = 720,000 requests are made, while randomly selecting an end time between three months in the past and now, with a higher probability of setting the end time to now.
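The weighting described in this section could be sketched as a priority queue in which each field carries its own waiting time. This is a hypothetical illustration, not the authors' algorithm: the halving and doubling factors, the bounds on the waiting time, the probability of choosing "now" as end time, and all function names are assumptions. Only the limit of 50 messages, the two conditions, the three-month window, and the daily request budget come from the text.

```python
import heapq
import random

MAX_COUNT = 50                                # maximum messages per request
REQUESTS_PER_DAY = 150 * 10 * 20 * 24         # per key-hour x keys x accounts x hours
MIN_WAIT_S, MAX_WAIT_S = 60.0, 7 * 86400.0    # assumed bounds on the waiting time


def next_wait(wait_s, returned, new):
    """Shorten the wait for busy fields and lengthen it for empty ones."""
    if returned == MAX_COUNT and new > 0:     # both conditions from the text
        return max(MIN_WAIT_S, wait_s / 2.0)  # assumed factor
    if returned == 0:
        return min(MAX_WAIT_S, wait_s * 2.0)  # assumed factor
    return wait_s


def pick_endtime(now_s, rng, p_now=0.5, lookback_s=90 * 86400):
    """End time 'now' with probability p_now, else uniform in the last ~3 months."""
    if rng.random() < p_now:
        return now_s
    return now_s - rng.uniform(0.0, lookback_s)


def run(fields, fetch, steps, start_wait_s=3600.0):
    """fields: list of field ids; fetch(field) -> (returned, new) message counts.
    Returns how often each field was harvested."""
    queue = [(0.0, field, start_wait_s) for field in fields]  # (due, field, wait)
    heapq.heapify(queue)
    visits = {field: 0 for field in fields}
    for _ in range(steps):
        due, field, wait_s = heapq.heappop(queue)  # field that is due first
        returned, new = fetch(field)
        visits[field] += 1
        wait_s = next_wait(wait_s, returned, new)
        heapq.heappush(queue, (due + wait_s, field, wait_s))
    return visits


def fake_fetch(field):
    """Stand-in for the API call: the urban field is always full, the rural one empty."""
    return (50, 10) if field == "urban" else (0, 0)


print(run(["urban", "rural"], fake_fetch, steps=20))
print(REQUESTS_PER_DAY, "requests per day")
```

With such a scheme the number of visits per field becomes a by-product that can be mapped directly, which is the kind of information shown in Figure 1.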
4.2 | Data Storage
The entire dataset has over 100 million messages, collected between 2013-12-29 and 2015-08-04, stored in an MS SQL Server database. Each collected message has a unique message ID, a timestamp and coordinates associated with the message text and the user information. One server was used to run the harvesting software and a second computer served as a replication instance. Figure 3 shows the time distribution of the collected messages. Due to fluctuations in the availability of the Sina Weibo API, fewer than 100,000 messages per day were collected in some periods, whereas other periods delivered over 700,000 messages per day.
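A storage layer that relies on the unique message ID to reject duplicates could look like the sketch below. It uses Python's built-in sqlite3 module as a stand-in for the MS SQL Server database of the study, and the table layout, column names and function name are illustrative assumptions rather than the authors' schema.

```python
import sqlite3

# Assumed minimal schema; the real database holds many more attributes.
SCHEMA = """
CREATE TABLE IF NOT EXISTS messages (
    id         INTEGER PRIMARY KEY,  -- unique message ID from the API
    created_at TEXT NOT NULL,        -- timestamp as ISO 8601 text
    lon        REAL NOT NULL,
    lat        REAL NOT NULL,
    text       TEXT,
    user_name  TEXT
)
"""


def store(conn, rows):
    """Insert rows, silently skipping IDs that are already stored.
    Returns the number of messages that were actually new."""
    before = conn.total_changes
    conn.executemany(
        "INSERT OR IGNORE INTO messages (id, created_at, lon, lat, text, user_name) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn.total_changes - before


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
first = [(123455678909, "2016-05-31T12:00:00+08:00", 121.5678, 30.1234, "hello", "Wang")]
print(store(conn, first))  # number of new rows in the first batch
print(store(conn, first))  # same batch again: nothing new is stored
```

The returned count of new rows is exactly the quantity needed for the second harvesting condition in Section 4.1, so deduplication and field weighting can share one database round trip.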
5 | RESULTS
5.1 | Data Visualization
Spatiotemporal visualization of big data sources is a research topic in itself; here we are interested in the distribution of messages in an urban area. Figure 4 shows over 8 million points that have been collected over the Beijing municipal area. By looking at the point pattern alone, one can see a concentration of message points in the city center. Variations of message density, or the localized distribution of messages, are not visible in this cartographic style.

FIGURE 4 Visualizing over eight million points that fall exactly within the administrative boundaries of Beijing municipal area

A common approach is to aggregate data to known shapes such as hexagons (Zook & Poorthuis, 2014), which allow the calculation of message density but have no real-world resemblance. An example of Beijing’s message density represented in hexagons is illustrated in Figure 5. The view of message densities in Beijing (Figure 5) can be easily extended to the entire area of mainland China, see Figure 6. This shows a picture very similar to Figure 1: urban areas produce the largest number of messages, in the eastern parts and coastal regions of China, with concentrations in central China as well, where larger message accumulations are visible for cities such as Chengdu, Chongqing or Xi’an.
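The aggregation of points to hexagons could be sketched as follows. The example assigns each point to a pointy-top hexagon using axial coordinates with cube rounding, a standard technique for hexagonal grids, and divides the counts by the hexagon area. It assumes that the points are already in a planar coordinate system in meters; the function names and the cell size are illustrative and are not taken from the study.

```python
import math
from collections import Counter


def hex_cell(x, y, size):
    """Axial (q, r) index of the pointy-top hexagon (circumradius `size`)
    that contains the planar point (x, y)."""
    q = (math.sqrt(3.0) / 3.0 * x - y / 3.0) / size
    r = (2.0 / 3.0 * y) / size
    # Round the fractional cube coordinates (q + r + s = 0) to the nearest cell.
    s = -q - r
    rq, rr, rs = round(q), round(r), round(s)
    dq, dr, ds = abs(rq - q), abs(rr - r), abs(rs - s)
    if dq > dr and dq > ds:
        rq = -rr - rs   # q had the largest rounding error: recompute it
    elif dr > ds:
        rr = -rq - rs   # r had the largest rounding error: recompute it
    return rq, rr


def hex_density(points, size_m):
    """Messages per square kilometer for every hexagon that contains points."""
    area_km2 = 1.5 * math.sqrt(3.0) * size_m ** 2 / 1e6  # area of a regular hexagon
    counts = Counter(hex_cell(x, y, size_m) for x, y in points)
    return {cell: n / area_km2 for cell, n in counts.items()}


points = [(0.0, 0.0), (10.0, 10.0), (math.sqrt(3.0) * 1000.0, 0.0)]
for cell, density in sorted(hex_density(points, 1000.0).items()):
    print(cell, round(density, 3), "messages per km2")
```

Since only a dictionary of occupied cells is kept, the memory needed grows with the number of non-empty hexagons rather than with the extent of the study area.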
FIGURE 5 Message density in Beijing represented with hexagons (white = 1-7 messages per km², black = 268-380 messages per km²)
FIGURE 6 Number of messages collected for each field: minimum 0 (white), and maximum 166,996 (black)

5.2 | Possibilities for Data Analysis
One approach to making large point clouds such as the collected data more comprehensible is clustering. In Figure 7, a subset of the over eight million messages that have been collected for Beijing has been clustered using the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm as proposed by Tran, Drab, and Daszykowski (2013). Essentially, this clustering method groups message points together if the distance between points is not larger than 2 m and the number of messages is at least 50. These parameters are adjustable to achieve fewer clusters, by increasing the distance and decreasing the minimum number of points to form a cluster, or vice versa, to gain more clusters.
Relating the messages to other data sources opens up new possibilities for analysis. In Figure 8, the total number of messages is related to the census information per administrative unit. Here, districts (level 3 administrative units) are used to showcase the ratio of messages per total population, also considering messages from Sina Weibo users that are not necessarily residents of this specific administrative unit: e.g. over two messages per (census) resident are recorded in Lenghu, a remote rural region with a low number of registered residents but which contains the popular tourist destination around Qinghai lake.
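The clustering step can be illustrated with a compact version of the classic DBSCAN algorithm. Note that this is the textbook variant and not the revised algorithm of Tran, Drab, and Daszykowski (2013) used for Figure 7. The sketch assumes planar coordinates in meters and uses a brute-force neighbor search, so it is only suitable for small samples; the function name and test data are illustrative.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Classic DBSCAN. Returns one label per point: 0, 1, ... for clusters
    and -1 for noise. `eps` is the maximum neighbor distance and `min_pts`
    the minimum neighborhood size (including the point itself) of a core point."""
    labels = [None] * len(points)

    def neighbors(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if math.hypot(xi - xj, yi - yj) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue                      # already processed
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE             # may later become a border point
            continue
        cluster += 1                      # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster       # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = neighbors(j)
            if len(more) >= min_pts:      # j is a core point too: keep expanding
                queue.extend(more)
    return labels


sample_points = [(0, 0), (1, 0), (0, 1), (100, 100), (101, 100), (100, 101), (50, 50)]
print(dbscan(sample_points, eps=1.5, min_pts=3))
```

For a sample of the size shown in Figure 7, the quadratic neighbor search would become slow in pure Python, so a spatial index or an established library implementation would be the more practical choice.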
FIGURE 7 Message clusters generated for 20,000 messages in Beijing using the DBSCAN algorithm with a maximum distance of 2 m and a minimum of 50 messages to form a cluster (colored); black points are noise and do not belong to a cluster
FIGURE 8 Number of messages per total population for administrative units at sub-province level (blue = low density, red = high density)

6 | CONCLUSIONS
We have presented a way to access, download, store and analyze data from the Chinese social media network Sina Weibo. While research with data from the popular Twitter network is common, it seems that China’s network has attracted less attention even though a lot of people use it and it encompasses a huge market. Data is available even with free accounts. The collected messages have a timestamp and location information, which allows spatiotemporal analysis of message patterns. We see higher message densities in urban areas where technophile people frequently use the network to exchange information. The collected dataset is a treasury that allows for various kinds of analysis, not only spatial but also semantic.
For researchers who do not speak Chinese, one of the challenges is to understand and navigate through the API documentation and SDKs that are made available by Sina Weibo. Once this difficulty is overcome, establishing a continuous collection process (a server should collect the data 24 hr a day) and effective weighting of the harvesting fields (the center coordinates for the API request) are technical problems that must be addressed.
The data has potential for the geospatial community. Besides the suggested spatial analysis of grouping into hexagons or clustering using DBSCAN, one can imagine a whole set of semantic analyses by looking at the message text itself. Searching for keywords such as ‘air pollution’ or ‘traffic jam’ opens up possibilities for urban land cover applications and event analysis. Another branch is the option to identify friendship networks (small worlds) by following the replies to messages by a certain user.
ACKNOWLEDGEMENTS
This work is financially supported by the National Natural Science Foundation of China (grant nos. 61331016 and 41174120), and the German Academic Exchange Service (DAAD).
REFERENCES
Calabrese, F., Diao, M., Di Lorenzo, G., Ferreira, J., & Ratti, C. (2013). Understanding individual mobility patterns from urban sensing data: A mobile phone trace example. Transportation Research: Part C, Emerging Technologies, 26, 301–313.
Chen, S., Zhang, H., Lin, M., & Lv, S. (2011). Comparison of microblogging service between Sina Weibo and Twitter. In Proceedings of the IEEE International Conference on Computer Science & Network Technology (vol. 4, pp. 2259–2263). Harbin, China: IEEE.
Cohen, R., & Ruths, D. (2013). Classifying political orientation on Twitter: It’s not easy! In Proceedings of the Seventh International AAAI Conference on Weblogs & Social Media (pp. 91–99). Cambridge, MA: AAAI.
Fu, K. W., & Chau, M. (2013). Reality check for the Chinese microblog space: A random sampling approach. PLoS ONE, 8(3), e58356.
Goel, S., Hofman, J. M., Lahaie, S., Pennock, D. M., & Watts, D. J. (2010). Predicting consumer behavior with Web search. Proceedings of the National Academy of Sciences of the United States of America, 107(41), 17486–17490.
Goodchild, M. F. (2007). Citizens as sensors: The world of volunteered geography. GeoJournal, 69, 211–221.
Guan, W., Gao, H., Yang, M., Li, Y., Ma, H., Qian, W., . . ., & Yang, X. (2014). Analyzing user behavior of the micro-blogging website Sina Weibo during hot social events. Physica A: Statistical Mechanics and Its Applications, 395, 340–351.
Kay, S., Zhao, B., & Sui, D. (2015). Can social media clear the air? A case study of the air pollution problem in Chinese cities. The Professional Geographer, 67(3), 351–363.
Lazer, D., Kennedy, R., King, G., & Vespignani, A. (2014). The parable of Google flu: Traps in big data analysis. Science, 343, 1203–1205.
Reades, J., Calabrese, F., Sevtsuk, A., & Ratti, C. (2007). Cellular census: Explorations in urban data collection. IEEE Pervasive Computing, 6(3), 30–38.
Ruths, D., & Pfeffer, J. (2014). Social media for large studies of behavior. Science, 346, 1063.
Sina Weibo. (2014). The “nearby timeline” API request 2/place/nearby timeline - 微博API. Retrieved from http://open.weibo.com/wiki/2/place/nearby_timeline
Tran, T. N., Drab, K., & Daszykowski, M. (2013). Revised DBSCAN algorithm to cluster data with dense adjacent clusters. Chemometrics & Intelligent Laboratory Systems, 120, 92–96.
Weibo, S. (2015). Sina Weibo API capabilities. Retrieved from http://open.weibo.com/wiki/%E5%BE%AE%E5%8D%9AAPI
XuanChenLin. (2013). 新浪微博开放平台API for .Net SDK – Home. Retrieved from http://weibosdk.codeplex.com/
Zook, M., & Poorthuis, A. (2014). Offline brews and online views: Exploring the geography of beer tweets. In M. W. Patterson & N. Hoalst-Pullen (Eds.), The geography of beer: Regions, environments, and societies (pp. 201–209). Berlin, Germany: Springer.
How to cite this article: Jendryke M, Balz T, Liao M. Big location-based social media messages from China’s Sina Weibo network: Collection, storage, visualization, and potential ways of analysis. Trans in GIS. 2017;00:1–10. https://doi.org/10.1111/tgis.12266