Twitter for Public Health: An Open-source Data Solution
Son Nghiem (a), Pratik Mehta (b) and Liang Tao (c)
(a) Centre of National Research on Disability and Rehabilitation Medicine (CONROD), The University of Queensland, Australia; Email: [email protected]
(b) Saama Technologies, California, USA; Email: [email protected]
(c) University of Illinois, Urbana-Champaign, Illinois, USA; Email: taoliang0227
Abstract
This study examines the role that big data can play in improving medical services, in areas such as service quality and operational efficiency, through the analysis of Twitter data. Although big data have been used widely to improve business operational efficiency (e.g., the recommendation systems of Netflix, Amazon and eBay), the use of big data analysis in health care is relatively new. We conduct an open-source project that collects, cleans and presents up-to-date Twitter data for the general public and for those who are interested in analyzing them for public health applications. As a demonstration, we use the sample data to analyze the sentiment of tweets containing health-related keywords, which can serve as a proxy for patients' satisfaction with health services.
Key words: Twitter data analysis, public health, social network.
1 Introduction
Twitter is one of the most popular social networks, with more than one hundred million users worldwide [1]. Twitter data have been analyzed for various applications, including disaster management in the aftermath of the earthquakes in Haiti and Tohoku [2, 3] and the Boston bombing incident [4], and the prediction of stock market prices [5, 6], political election outcomes [7] and the impact of scientific papers [8]. The popularity of Twitter makes it attractive to examine medical-related messages (i.e., tweets), for example, to explore the co-occurrence and correlation within and between factors of interest such as diseases, medications, hospitals and doctors. Twitter has been used by health organizations to promote health literacy [9, 10], to improve health services [11], to examine dialogue on cervical and breast cancer screening [12], and to monitor pandemic outbreaks [13, 14, 15]. However, most previous applications of Twitter analysis in health care are based on a short snapshot of data. This study contributes to the literature and empirics of public health by proposing an open-source solution for obtaining and analyzing Twitter data.
2 Data and Methodology
Twitter messages were obtained using the Twitter Application Programming Interface (API), which allows us to download tweets in JavaScript Object Notation (JSON) format. We focus on collecting data for the United States, as this country accounts for more than 50 per cent of Twitter users worldwide [16]. We use a comprehensive list of words to classify a tweet as medically relevant: 80,000 words for drugs based on the list of the US Food and Drug Administration (FDA), and the list of keywords for diseases from the health care hashtag community (http://www.symplur.com/healthcare-hashtags/). In addition, we use the International Classification of Diseases, Version 10 (ICD-10), which includes more than 10 thousand diseases, to extract medical tweets. The geocode and personal/locational names from tweets are used to identify hospital and doctor names. We collected the sample data over four weeks in June 2013, initially as an effort to complete a project required by the course Introduction to Data Science offered by the University of Washington on coursera.org.
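The keyword-based classification described above can be illustrated with a minimal sketch. It assumes that each downloaded tweet is a JSON object whose message is stored in a `text` field, and it uses a tiny hypothetical keyword set in place of the FDA, hashtag and ICD-10 lists; the function names are illustrative and are not taken from the project code. Only single-word keywords are matched here, so multi-word disease names would need additional handling.

```python
import json
import re

# Hypothetical stand-in for the FDA drug list, the health care hashtag
# list and the ICD-10 disease terms used in the project.
MEDICAL_KEYWORDS = {"prozac", "insulin", "depression", "cancer", "diabetes"}

# Words are runs of lowercase letters, digits or apostrophes, so a hashtag
# such as "#cancer" yields the token "cancer".
TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Split a tweet into lowercase word tokens."""
    return TOKEN_RE.findall(text.lower())


def medical_keywords_in(tweet_json, keywords=MEDICAL_KEYWORDS):
    """Return the sorted medical keywords found in one JSON-encoded tweet."""
    tweet = json.loads(tweet_json)
    tokens = set(tokenize(tweet.get("text", "")))
    return sorted(tokens & keywords)


def is_medical(tweet_json, keywords=MEDICAL_KEYWORDS):
    """A tweet is treated as medically relevant if it contains any keyword."""
    return bool(medical_keywords_in(tweet_json, keywords))


sample = [
    '{"text": "Started Prozac for my depression today"}',
    '{"text": "Great game last night!"}',
    '{"text": "Walking for #cancer research this weekend"}',
]
medical = [line for line in sample if is_medical(line)]
```

A set intersection keeps the check fast even with a keyword list of the size mentioned in the text, since each tweet contributes only a handful of tokens.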
After the course was completed, members of the project continued the data collection and cleaning, and the cleaned data are posted on the project website (http://twitteranalysis.webuda.com/main/) for the public and interested analysts. The data were cleaned using several measures. We excluded non-English tweets because the follow-up analysis, such as the calculation of sentiment scores, uses only English keywords. We excluded tweets in which health keywords were used in non-health contexts (e.g., cancer as a star sign instead of a medical condition). We also removed punctuation characters such as _@:!,.\*()/;? from tweets to minimize the loss of tweets due to unmatched keywords. After cleaning, the sample data set includes 250,000 tweets containing disease or drug
keywords. We use the list of 2477 English keywords constructed by [17], each of which has a sentiment score ranging from -5 for negative words such as bastard and bitch to 5 for positive words such as breathtaking, hurrah and outstanding, to evaluate the mood of each tweet. The sentiment score of keywords that are not in the list of 2477 words is measured as the average sentiment score of the listed keywords in the same tweet. For example, the tweet "doctor gosnell is a bastard murderer!" has an average sentiment score of -3.5 due to the contribution of two negative keywords, bastard (-5) and murderer (-2); this score (-3.5) is assigned to all non-listed words in the tweet: doctor and gosnell. The sentiment score of each non-listed keyword in the sample data is then measured as the average of the scores assigned to it across all tweets in which it appears, so that these scores also range from -5 to 5.
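The scoring procedure can be sketched as follows. This is an illustration under stated assumptions rather than the project's own code: `LEXICON` is a tiny subset standing in for the 2477-word list (the scores for bastard and murderer are those quoted above), `STOPWORDS` is a hypothetical set introduced so that words such as "is" and "a" are not scored, and each non-listed word is counted once per tweet, which the text does not specify.

```python
from collections import defaultdict

# Tiny illustrative subset of the word -> score list.
LEXICON = {"bastard": -5, "murderer": -2, "breathtaking": 5, "outstanding": 5}

# Hypothetical stopword set (an assumption, not described in the text).
STOPWORDS = {"is", "a", "the", "was"}

# Punctuation characters removed during cleaning.
PUNCTUATION = "_@:!,.*()/;?\\"
_TABLE = str.maketrans({ch: " " for ch in PUNCTUATION})


def clean(text):
    """Lowercase a tweet and replace punctuation characters with spaces."""
    return text.lower().translate(_TABLE)


def tweet_score(text, lexicon=LEXICON):
    """Average score of the listed words in a tweet, or None if there are none."""
    scores = [lexicon[w] for w in clean(text).split() if w in lexicon]
    return sum(scores) / len(scores) if scores else None


def keyword_scores(tweets, lexicon=LEXICON, stopwords=STOPWORDS):
    """Average, across tweets, of the tweet score assigned to each non-listed word."""
    assigned = defaultdict(list)
    for text in tweets:
        score = tweet_score(text, lexicon)
        if score is None:
            continue  # no listed word, so there is no score to assign
        for word in set(clean(text).split()):
            if word not in lexicon and word not in stopwords:
                assigned[word].append(score)
    return {word: sum(s) / len(s) for word, s in assigned.items()}


tweets = [
    "doctor gosnell is a bastard murderer!",
    "the doctor was outstanding",
]
scores = keyword_scores(tweets)
```

With these two tweets, gosnell inherits -3.5 from the first tweet only, while doctor averages -3.5 and 5. Because every assigned value is itself an average of scores between -5 and 5, the keyword scores stay within the same range.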
3 A Sample Analysis and Discussion
Results of the sentiment analysis show that the average sentiment scores of keywords related to doctors, surgery and hospital care are mostly negative (see Table 1). One noticeable observation is that the most popular keywords with negative sentiment scores include gosnell, kermit, abortion and philadelphia. These keywords refer to the case of Dr Kermit Gosnell from Philadelphia, who ran an abortion clinic and was sentenced to life in prison without parole for murder. This story is demonstrated clearly in the connections and relative importance of keywords in a keyword map constructed from sample tweets, where keywords about this case have the biggest nodes (footnote 1) and thickest edges (footnote 2) (see Figure 1a). Another possible interpretation is that patients who call doctors or hospitals regarding their pain and sickness are generally not in a happy mood. In contrast, the sample data show overwhelmingly positive sentiment for the keywords health, care, Angelina, Jolie and mastectomy. The social mapping analysis revealed that these tweets refer to the news of actress Angelina Jolie, who underwent a mastectomy to prevent breast cancer (see Figure 1b). In addition, we found that popular keywords such as nurses, bed, medical and soon received positive sentiment. Thus, another possible interpretation of the list of popular keywords with positive sentiment scores is that patients are generally happy once they get to hospital and receive help from nurses, or know that their case will be addressed soon. The co-occurrence of keywords shows a similar picture in our sample data. For example, the keyword that co-occurs most often with doctor is abortion, followed by life, convicted and prison, whilst the keywords that co-occur most often with medical are choice, Angelina, center, Jolie, health and story.
The most frequent keywords associated with hospitals are words for family members (e.g., mom, dad, children, baby) and time (e.g., day, tomorrow
Footnote 1: A bigger node indicates that the keyword links with more other keywords.
Footnote 2: A thicker edge between two keywords indicates that more tweets refer to both keywords. For more detailed discussions of edges and nodes in social mapping, see, for example, [18, 19, 20].
Table 1: Sentiment scores of most frequent keywords

Keywords        Average sentiment score    Keywords      Average sentiment score
kermit          -4.49                      jolie         1.24
gosnell         -4.44                      angelina      1.06
abortion        -4.01                      choice        0.89
philadelphia    -3.99                      mastectomy    0.53
cancer          -1.10                      health        0.43
waiting         -0.69                      nurse         0.28
doctor          -0.66                      medicine      0.22
hours           -0.39                      soon          0.22
patients        -0.37                      medical       0.18
hospital        -0.30                      bed           0.03
appointment     -0.21                      study         0.53
surgery         -0.12                      double        0.56

Note: Only non-listed keywords (i.e., those outside the 2477 words) are presented; both the positive and negative lists are ordered by frequency.
and waiting). Thus, one possible explanation is that these tweets mostly refer to the arrivals of family members at, or their departures from, hospitals. We believe that Twitter data can provide a rich set of information for public health applications; hence, we posted the cleaned data on the project website for the general public and public health analysts. The data are currently organized in both frequency tables and co-occurrence graphs, which associate diseases with symptoms, affected body parts, hospitals, drugs and doctors (footnote 3). We also list top tweets related to selected keywords so that users can find more detailed evidence to support their findings or stories. For example, our sample data show that the drug most frequently associated with depression is Prozac, a popular antidepressant. This finding is well in line with current norms in medical practice, suggesting that people with depression follow medical advice closely. However, the top tweets show that a new treatment for depression, turmeric extract, is also widely discussed among Twitter users (see Figure 2). Detailed tweets also show that the main reason insulin was included in the list of drugs for depression is the popularity of a story about the link between obesity and depression.
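The co-occurrence counts behind the keyword maps and frequency tables in this section could be computed along the following lines. This is a minimal sketch, not the project's implementation: it assumes each tweet has already been reduced to its list of keywords, and the function names and sample data are hypothetical. Edge weights count the tweets that mention both keywords (edge thickness), and the degree counts how many distinct keywords a keyword is linked to (node size).

```python
from collections import Counter
from itertools import combinations


def cooccurrence(tweets_keywords):
    """Count, for each unordered keyword pair, the tweets that mention both."""
    edges = Counter()
    for keywords in tweets_keywords:
        # Sorting gives each pair one canonical order; set() avoids
        # counting a pair twice within the same tweet.
        for a, b in combinations(sorted(set(keywords)), 2):
            edges[(a, b)] += 1
    return edges


def degrees(edges):
    """Number of distinct keywords that each keyword is linked to."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree


def top_cooccurring(edges, keyword, n=3):
    """Keywords that appear most often in the same tweets as `keyword`."""
    related = Counter()
    for (a, b), count in edges.items():
        if a == keyword:
            related[b] += count
        elif b == keyword:
            related[a] += count
    return related.most_common(n)


# Hypothetical keyword lists extracted from four tweets.
tweets_keywords = [
    ["doctor", "abortion", "prison"],
    ["doctor", "abortion"],
    ["medical", "choice", "jolie"],
    ["doctor", "life"],
]
edges = cooccurrence(tweets_keywords)
node_sizes = degrees(edges)
```

In this toy sample, doctor is linked to three other keywords and shares two tweets with abortion, so it would be drawn as the largest node with its thickest edge going to abortion.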
4 Concluding Remarks
Twitter and other social networks offer a rich source of data for various applications such as disaster management and the prediction of stock prices and political election outcomes. This paper presented an open-source project that collects, cleans and displays Twitter data publicly for potential use in public health applications. The open-source nature of this project allows users to download and
Footnote 3: The degree of co-occurrence is presented as the distance between the selected keyword and highly frequent terms (a closer distance indicates a higher frequency).
Figure 1: Keywords map: story of two persons. (a) Kermit Gosnell; (b) Angelina Jolie.
Note: The two maps are constructed from two communities of the most linked keywords. There are other communities in the sample tweets, but we did not include them in the analysis for brevity.
Figure 2: Website example: frequency of drugs associated with depression
Note: This is only a sample screen shot of the website. For more details, visit http://twitteranalysis.webuda.com/main/
modify our code so that the project can provide good data for users on a continual basis. The project will save time and resources for future analysts and provide useful information for the general public about, for example, trends and patterns in public health matters. The project is still in its infancy, but we hope that as more users, volunteers and possibly donors become aware of and interested in the project, it will be improved further with additional features such as data export, interactive maps and more built-in analysis tools.
References
[1] S. J. Sullivan, A. G. Schneiders, C.-W. Cheang, E. Kitto, H. Lee, J. Redhead, S. Ward, O. H. Ahmed, and P. R. McCrory, "'What's happening?' A content analysis of concussion-related traffic on Twitter," British Journal of Sports Medicine, vol. 46, no. 4, pp. 258-263, Mar 2012. [Online]. Available: http://dx.doi.org/10.1136/bjsm.2010.080341
[2] A. Lobb, N. Mock, and P. L. Hutchinson, "Traditional and social media coverage and charitable giving following the 2010 earthquake in Haiti," Prehosp Disaster Med, vol. 27, no. 4, pp. 319-324, Aug 2012. [Online]. Available: http://dx.doi.org/10.1017/S1049023X12000908
[3] J. Umihara and M. Nishikitani, "Emergent use of Twitter in the 2011 Tohoku earthquake," Prehosp Disaster Med, pp. 1-7, Jul 2013. [Online]. Available: http://dx.doi.org/10.1017/S1049023X13008704
[4] C. A. Cassa, R. Chunara, K. Mandl, and J. S. Brownstein, "Twitter as a sentinel in emergency situations: lessons from the Boston marathon explosions," PLoS Curr, vol. 5, 2013. [Online]. Available: http://dx.doi.org/10.1371/currents.dis.ad70cd1c8bc585e9470046cde334ee4b
[5] J. Bollen, H. Mao, and X. Zeng, "Twitter mood predicts the stock market," Journal of Computational Science, vol. 2, no. 1, pp. 1-8, 2011.
[6] X. Zhang, H. Fuehres, and P. A. Gloor, "Predicting stock market indicators through Twitter: 'I hope it is not as bad as I fear'," Procedia-Social and Behavioral Sciences, vol. 26, pp. 55-62, 2011.
[7] J. Borondo, A. J. Morales, J. C. Losada, and R. M. Benito, "Characterizing and modeling an electoral campaign in the context of Twitter: 2011 Spanish presidential election as a case study," Chaos, vol. 22, no. 2, p. 023138, Jun 2012. [Online]. Available: http://dx.doi.org/10.1063/1.4729139
[8] G. Eysenbach, "Can tweets predict citations? Metrics of social impact based on Twitter and correlation with traditional metrics of scientific impact," Journal of Medical Internet Research, vol. 13, no. 4, p. e123, 2011. [Online]. Available: http://dx.doi.org/10.2196/jmir.2012
[9] L. Donelle and R. G. Booth, "Health tweets: an exploration of health promotion on Twitter," Online Journal of Issues in Nursing, vol. 17, no. 3, p. 4, Sep 2012.
[10] H. Park, S. Rodgers, and J. Stemmle, "Analyzing health organizations' use of Twitter for promoting health literacy," J Health Commun, vol. 18, no. 4, pp. 410-425, 2013. [Online]. Available: http://dx.doi.org/10.1080/10810730.2012.727956
[11] I. Masic, S. Sivic, S. Toromanovic, T. Borojevic, and H. Pandza, "Social networks in improvement of health care," Mater Sociomed, vol. 24, no. 1, pp. 48-53, 2012. [Online]. Available: http://dx.doi.org/10.5455/msm.2012.24.48-53
[12] C. R. Lyles, A. Lopez, R. Pasick, and U. Sarkar, "'5 mins of uncomfyness is better than dealing with cancer 4 a lifetime': an exploratory qualitative analysis of cervical and breast cancer screening dialogue on Twitter," Journal of Cancer Education, vol. 28, no. 1, pp. 127-133, Mar 2013. [Online]. Available: http://dx.doi.org/10.1007/s13187-012-0432-2
[13] C. Chew and G. Eysenbach, "Pandemics in the age of Twitter: content analysis of tweets during the 2009 H1N1 outbreak," PLoS One, vol. 5, no. 11, p. e14118, 2010. [Online]. Available: http://dx.doi.org/10.1371/journal.pone.0014118
[14] A. Culotta, "Towards detecting influenza epidemics by analyzing Twitter messages," in Proceedings of the first workshop on social media analytics. ACM, 2010, pp. 115-122.
[15] A. Signorini, A. M. Segre, and P. M. Polgreen, "The use of Twitter to track levels of disease activity and public concern in the U.S. during the influenza A H1N1 pandemic," PLoS One, vol. 6, no. 5, p. e19467, 2011. [Online]. Available: http://dx.doi.org/10.1371/journal.pone.0019467
[16] K. W. Prier, M. S. Smith, C. Giraud-Carrier, and C. L. Hanson, "Identifying health-related topics on Twitter," in Social computing, behavioral-cultural modeling and prediction. Springer, 2011, pp. 18-25.
[17] F. Å. Nielsen, "Afinn," Richard Petersens Plads, Building 321, DK-2800 Kgs. Lyngby, Mar 2011. [Online]. Available: http://www2.imm.dtu.dk/pubdb/p.php?6010
[18] M. Girvan and M. E. Newman, "Community structure in social and biological networks," Proceedings of the National Academy of Sciences, vol. 99, no. 12, pp. 7821-7826, 2002.
[19] D. Liben-Nowell and J. Kleinberg, "The link-prediction problem for social networks," Journal of the American society for information science and technology, vol. 58, no. 7, pp. 1019-1031, 2007.
[20] R. Kumar, J. Novak, and A. Tomkins, "Structure and evolution of online social networks," in Link Mining: Models, Algorithms, and Applications. Springer, 2010, pp. 337-357.