Twitter for Public Health: An Open-source Data Solution


Son Nghiem (a), Pratik Mehta (b) and Liang Tao (c)

(a) Centre of National Research on Disability and Rehabilitation Medicine (CONROD), The University of Queensland, Australia; Email: [email protected]

(b) Saama Technologies, California, USA; Email: [email protected]

(c) University of Illinois, Urbana-Champaign, Illinois, USA; Email: taoliang0227

Abstract

This study examines the role big data plays in improving medical services, in areas such as service quality and operational efficiency, through the analysis of Twitter data. Although big data has been widely used to improve business operational efficiency (e.g., the recommendation systems of Netflix, Amazon and eBay), its use in health care is relatively new. We conduct an open-source project that collects, cleans and presents up-to-date Twitter data for the general public and for those interested in analyzing it for public health applications. As a demonstration, we use the sample data to analyze the sentiment of tweets for health-related keywords, which can proxy patients' satisfaction with health services.

Key words: Twitter data analysis, public health, social network.

1 Introduction

Twitter is one of the most popular social networks, with more than one hundred million users worldwide [1]. Twitter data have been analyzed for various applications, including disaster management in the aftermath of the earthquakes in Haiti and Tohoku [2, 3] and the Boston bombing incident [4], and the prediction of stock market prices [5, 6], political election outcomes [7] and the impact of scientific papers [8]. The popularity of Twitter makes it attractive to examine medically related messages (i.e., tweets), for example, to explore the co-occurrence and correlation within and between factors of interest such as diseases, medications, hospitals and doctors. Twitter has been used by health organizations to promote health literacy [9, 10]; to improve health services [11]; to examine effects on cervical and breast cancer [12]; and to monitor pandemic outbreaks [13, 14, 15]. However, most previous applications of Twitter analysis in health care are based on a short snapshot of data. This study contributes to the literature and empirics of public health by proposing an open-source solution for obtaining and analyzing Twitter data.

2 Data and Methodology

Twitter messages were obtained using the Twitter Application Programming Interface (API), which allows us to download tweets in JavaScript Object Notation (JSON) format. We focus on collecting data for the United States, as this country accounts for more than 50 per cent of Twitter users worldwide [16]. We use a comprehensive list of words to classify a tweet as medically relevant: 80,000 words for drugs, based on the list of the US Food and Drug Administration (FDA), and the list of keywords for diseases from the healthcare hashtag community (http://www.symplur.com/healthcare-hashtags/). In addition, we use the International Classification of Diseases, Version 10 (ICD-10), which includes more than 10 thousand diseases, to extract medical tweets. The geocode and personal/locational names in tweets are used to identify hospital and doctor names. We collected the sample data over four weeks in June 2013, initially as an effort to complete a project required by the course Introduction to Data Science, offered by the University of Washington on coursera.org.
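The keyword-based selection described above can be illustrated with a minimal Python sketch. It is not the project's actual code: the keyword set is a tiny hypothetical stand-in for the FDA, healthcare hashtag and ICD-10 lists, the function names are our own, and we assume each downloaded tweet is a JSON object whose message sits in a `text` field.

```python
import json
import re

# Hypothetical stand-in for the FDA drug list, the healthcare hashtag
# keywords and the ICD-10 disease names (the real lists are far larger).
MEDICAL_KEYWORDS = {"prozac", "insulin", "depression", "cancer", "mastectomy"}

# Lower-case alphanumeric tokens; apostrophes are kept inside words.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Split a tweet into lower-case word tokens."""
    return TOKEN_RE.findall(text.lower())


def is_medical(tweet_json, keywords=MEDICAL_KEYWORDS):
    """Return True if the tweet text contains at least one medical keyword.

    Assumption: the JSON object stores the message in a "text" field.
    """
    tweet = json.loads(tweet_json)
    tokens = set(tokenize(tweet.get("text", "")))
    return bool(tokens & keywords)
```

This sketch matches single words only; multi-word drug or disease names would need phrase matching, and the context filter for non-health uses of a keyword is not covered here.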

After the course was completed, members of the project continued the data collection and cleaning, and the cleaned data are posted on the project website (http://twitteranalysis.webuda.com/main/) for the public and interested analysts.

The data were cleaned using various measures. We excluded non-English tweets because the follow-up analysis, such as the calculation of sentiment scores, uses only English keywords. We excluded tweets in which health keywords were used in non-health contexts (e.g., cancer as a star sign instead of a medical condition). We also removed punctuation characters such as _@:!,.\*()/;?&# from tweets to minimize the loss of tweets due to unmatched keywords. After cleaning, the sample data set includes 250,000 tweets containing disease or drug keywords.

We use the list of 2477 English keywords constructed by [17], each of which has a sentiment score ranging from -5 for negative words such as bastard, nigger and bitch, to 5 for positive words such as breathtaking, hurrah and outstanding, to evaluate the mood of each tweet. The sentiment score of keywords that are not in the list of 2477 words is measured as the average sentiment score of the listed keywords in a tweet. For example, the tweet "doctor gosnell is a bastard murderer!" has an average sentiment score of -3.5 due to the contribution of two negative keywords, bastard (-5) and murderer (-2); this score (-3.5) is assigned to all non-listed words in the tweet: doctor and gosnell. The sentiment score of each non-listed keyword in the sample data is then measured as the average of the scores it receives across all tweets in which it appears, so that these scores also range from -5 to 5.
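As a rough illustration of the cleaning and scoring steps above, the following Python sketch strips punctuation, averages the scores of listed words in each tweet, and propagates that average to non-listed words. The lexicon holds only the two entries from the worked example, the stop-word set is our own assumption (the text does not say how words such as "is" and "a" are handled), and skipping tweets with no listed word is likewise an assumption.

```python
import re
from collections import defaultdict

# Only the two entries from the worked example; the list in [17] has 2477 words.
LEXICON = {"bastard": -5, "murderer": -2}

# Assumption: common function words are ignored when propagating scores.
STOPWORDS = {"is", "a", "an", "the", "of", "to"}

# Punctuation characters named in the text.
PUNCT_RE = re.compile(r"[_@:!,.\\*()/;?&#]")


def clean(text):
    """Lower-case a tweet, replace punctuation with spaces and split into words."""
    return PUNCT_RE.sub(" ", text.lower()).split()


def tweet_score(tokens, lexicon=LEXICON):
    """Average score of the listed words in one tweet, or None if there are none."""
    scores = [lexicon[t] for t in tokens if t in lexicon]
    return sum(scores) / len(scores) if scores else None


def score_unlisted(tweets, lexicon=LEXICON):
    """Average, over all tweets, of the tweet score assigned to each non-listed word."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for text in tweets:
        tokens = clean(text)
        avg = tweet_score(tokens, lexicon)
        if avg is None:
            continue  # assumption: tweets without listed words carry no signal
        for tok in set(tokens):  # count each word once per tweet
            if tok not in lexicon and tok not in STOPWORDS:
                totals[tok] += avg
                counts[tok] += 1
    return {tok: totals[tok] / counts[tok] for tok in totals}
```

Applied to the single example tweet, `score_unlisted(["doctor gosnell is a bastard murderer!"])` assigns -3.5 to both doctor and gosnell, matching the worked example in the text.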

3 A Sample Analysis and Discussions

Results of the sentiment analysis show that the average sentiment scores of keywords related to doctors, surgery and hospital care are mostly negative (see Table 1). One noticeable observation is that the most popular keywords with negative sentiment scores include gosnell, kermit, abortion and philadelphia. These keywords refer to the case of Dr Kermit Gosnell from Philadelphia, who ran an abortion clinic and was sentenced to life in prison without parole for murder. This story is demonstrated clearly in the connections and relative importance of keywords in a keyword map constructed from the sample tweets, where keywords about this case have the biggest nodes¹ and thickest edges² (see Figure 1a). Another possible interpretation is that patients who call doctors or hospitals regarding their pain and sickness are generally not in a happy mood.

In contrast, the sample data show overwhelmingly positive sentiment for the keywords health, care, Angelina, Jolie and mastectomy. The social mapping analysis revealed that these tweets refer to the news of actress Angelina Jolie, who underwent a mastectomy to prevent breast cancer (see Figure 1b). In addition, we found that popular keywords such as nurses, bed, medical and soon received positive sentiment. Thus, another possible interpretation of the list of popular keywords with positive sentiment scores is that patients are generally happy once they get to hospital and receive help from nurses, or know that their case will be addressed soon.

The co-occurrence of keywords shows a similar picture in our sample data. For example, the keyword that co-occurs most often with doctor is abortion, followed by life, convicted and prison, whilst the keywords that co-occur most often with medical are choice, Angelina, center, Jolie, health and story.
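The keyword map and co-occurrence rankings discussed above can be approximated with simple counting. The Python sketch below is only an illustration under our own assumptions: tweets are already tokenized, an edge weight is the number of tweets that mention both keywords, and a node's size is the number of distinct keywords it links to. The function names are hypothetical, and the community detection used to draw the maps is not reproduced.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(tweets_tokens):
    """Count, for every keyword pair, the number of tweets mentioning both."""
    edges = Counter()
    for tokens in tweets_tokens:
        # Sorting gives each unordered pair a single canonical key.
        for a, b in combinations(sorted(set(tokens)), 2):
            edges[(a, b)] += 1
    return edges


def node_degree(edges):
    """Number of distinct keywords each keyword is linked to (node size)."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree


def most_cooccurring(edges, keyword, n=3):
    """The n keywords that share the most tweets with the given keyword."""
    linked = Counter()
    for (a, b), weight in edges.items():
        if a == keyword:
            linked[b] += weight
        elif b == keyword:
            linked[a] += weight
    return linked.most_common(n)
```

For instance, with the made-up token lists `[["doctor", "abortion", "prison"], ["doctor", "abortion"], ["medical", "choice"]]`, the edge between abortion and doctor has weight 2 and abortion is the keyword that co-occurs most often with doctor.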

The most frequent keywords associated with hospitals are terms for family members (e.g., mom, dad, children, baby) and for time (e.g., day, tomorrow

¹ A bigger node indicates that the keyword links with more other keywords.

² A thicker edge between two keywords indicates that more tweets refer to both keywords. For more detailed discussions of edges and nodes in social mapping, see, for example, [18, 19, 20].

Table 1: Sentiment scores of most frequent keywords

| Keywords | Average sentiment score | Keywords | Average sentiment score |
|---|---|---|---|
| kermit | -4.49 | jolie | 1.24 |
| gosnell | -4.44 | angelina | 1.06 |
| abortion | -4.01 | choice | 0.89 |
| philadelphia | -3.99 | mastectomy | 0.53 |
| cancer | -1.10 | health | 0.43 |
| waiting | -0.69 | nurse | 0.28 |
| doctor | -0.66 | medicine | 0.22 |
| hours | -0.39 | soon | 0.22 |
| patients | -0.37 | medical | 0.18 |
| hospital | -0.30 | bed | 0.03 |
| appointment | -0.21 | study | 0.53 |
| surgery | -0.12 | double | 0.56 |

Note: only non-listed keywords (i.e., outside the 2477 words) are presented; both the positive and negative lists are ordered by frequency.

and waiting). Thus, one possible explanation is that these tweets mostly refer to the arrivals and departures of family members to and from hospitals.

We believe that Twitter data can provide a rich set of information for public health applications; hence, we posted the cleaned data on the project website for the general public and public health analysts. The data are currently organized in both frequency tables and co-occurrence graphs, which associate diseases with symptoms, affected body parts, hospitals, drugs and doctors³. We also list top tweets related to selected keywords so that users can draw on more detailed evidence to support their findings and stories. For example, our sample data show that the drug most frequently associated with depression is Prozac, a popular antidepressant. This finding is well in line with current norms in medical practice, suggesting that people with depression follow medical advice closely. However, the top tweets show that a new treatment for depression, turmeric extract, is also widely discussed among Twitter users (see Figure 2). Detailed tweets also show that the main reason insulin was included in the list of drugs for depression is the popularity of a story about the link between obesity and depression.

4 Concluding Remarks

Twitter and other social networks offer a rich source of data for various applications such as disaster management and the prediction of stock prices and political election outcomes. This paper presented an open-source project that collects, cleans and displays Twitter data publicly for potential use in public health applications. The open-source nature of this project allows users to download and

³ The degree of co-occurrence is presented as the distance between the selected keyword and highly frequent terms (a closer distance indicates a higher frequency).

Figure 1: Keywords map: story of two persons. (a) Kermit Gosnell; (b) Angelina Jolie

Note: the two maps are constructed from the two communities of the most linked keywords. There are other communities in the sample tweets, but we did not include them in the analysis for brevity.

Figure 2: Website example: frequency of drugs associated with depression

Note: This is only a sample screenshot of the website. For more details, visit http://twitteranalysis.webuda.com/main/

modify our code so that the project can provide good data for users on a continual basis. The project will save time and resources for future analysts and provide useful information for the general public about, for example, trends and patterns in public health matters. The project is still in its infancy, but we hope that as more users, volunteers and possibly donors become aware of and interested in the project, it will be improved further with additional features such as data export, interactive maps and more built-in analysis tools.

References

[1] S. J. Sullivan, A. G. Schneiders, C.-W. Cheang, E. Kitto, H. Lee, J. Redhead, S. Ward, O. H. Ahmed, and P. R. McCrory, "'What's happening?' A content analysis of concussion-related traffic on Twitter," British Journal of Sports Medicine, vol. 46, no. 4, pp. 258–263, Mar 2012. [Online]. Available: http://dx.doi.org/10.1136/bjsm.2010.080341

[2] A. Lobb, N. Mock, and P. L. Hutchinson, "Traditional and social media coverage and charitable giving following the 2010 earthquake in Haiti," Prehosp Disaster Med, vol. 27, no. 4, pp. 319–324, Aug 2012. [Online]. Available: http://dx.doi.org/10.1017/S1049023X12000908

[3] J. Umihara and M. Nishikitani, "Emergent use of Twitter in the 2011 Tohoku earthquake," Prehosp Disaster Med, pp. 1–7, Jul 2013. [Online]. Available: http://dx.doi.org/10.1017/S1049023X13008704

[4] C. A. Cassa, R. Chunara, K. Mandl, and J. S. Brownstein, "Twitter as a sentinel in emergency situations: lessons from the Boston marathon explosions," PLoS Curr, vol. 5, 2013. [Online]. Available: http://dx.doi.org/10.1371/currents.dis.ad70cd1c8bc585e9470046cde334ee4b

[5] J. Bollen, H. Mao, and X. Zeng, "Twitter mood predicts the stock market," Journal of Computational Science, vol. 2, no. 1, pp. 1–8, 2011.

[6] X. Zhang, H. Fuehres, and P. A. Gloor, "Predicting stock market indicators through Twitter: 'I hope it is not as bad as I fear'," Procedia-Social and Behavioral Sciences, vol. 26, pp. 55–62, 2011.

[7] J. Borondo, A. J. Morales, J. C. Losada, and R. M. Benito, "Characterizing and modeling an electoral campaign in the context of Twitter: 2011 Spanish presidential election as a case study," Chaos, vol. 22, no. 2, p. 023138, Jun 2012. [Online]. Available: http://dx.doi.org/10.1063/1.4729139

[8] G. Eysenbach, "Can tweets predict citations? Metrics of social impact based on Twitter and correlation with traditional metrics of scientific impact," Journal of Medical Internet Research, vol. 13, no. 4, p. e123, 2011. [Online]. Available: http://dx.doi.org/10.2196/jmir.2012

[9] L. Donelle and R. G. Booth, "Health tweets: an exploration of health promotion on Twitter," Online Journal of Issues in Nursing, vol. 17, no. 3, p. 4, Sep 2012.

[10] H. Park, S. Rodgers, and J. Stemmle, "Analyzing health organizations' use of Twitter for promoting health literacy," J Health Commun, vol. 18, no. 4, pp. 410–425, 2013. [Online]. Available: http://dx.doi.org/10.1080/10810730.2012.727956

[11] I. Masic, S. Sivic, S. Toromanovic, T. Borojevic, and H. Pandza, "Social networks in improvement of health care," Mater Sociomed, vol. 24, no. 1, pp. 48–53, 2012. [Online]. Available: http://dx.doi.org/10.5455/msm.2012.24.48-53

[12] C. R. Lyles, A. Lopez, R. Pasick, and U. Sarkar, ""5 mins of uncomfyness is better than dealing with cancer 4 a lifetime": an exploratory qualitative analysis of cervical and breast cancer screening dialogue on Twitter," Journal of Cancer Education, vol. 28, no. 1, pp. 127–133, Mar 2013. [Online]. Available: http://dx.doi.org/10.1007/s13187-012-0432-2

[13] C. Chew and G. Eysenbach, "Pandemics in the age of Twitter: content analysis of tweets during the 2009 H1N1 outbreak," PLoS One, vol. 5, no. 11, p. e14118, 2010. [Online]. Available: http://dx.doi.org/10.1371/journal.pone.0014118

[14] A. Culotta, "Towards detecting influenza epidemics by analyzing Twitter messages," in Proceedings of the First Workshop on Social Media Analytics. ACM, 2010, pp. 115–122.

[15] A. Signorini, A. M. Segre, and P. M. Polgreen, "The use of Twitter to track levels of disease activity and public concern in the U.S. during the influenza A H1N1 pandemic," PLoS One, vol. 6, no. 5, p. e19467, 2011. [Online]. Available: http://dx.doi.org/10.1371/journal.pone.0019467

[16] K. W. Prier, M. S. Smith, C. Giraud-Carrier, and C. L. Hanson, "Identifying health-related topics on Twitter," in Social Computing, Behavioral-Cultural Modeling and Prediction. Springer, 2011, pp. 18–25.

[17] F. Å. Nielsen, "Afinn," Richard Petersens Plads, Building 321, DK-2800 Kgs. Lyngby, Mar 2011. [Online]. Available: http://www2.imm.dtu.dk/pubdb/p.php?6010

[18] M. Girvan and M. E. Newman, "Community structure in social and biological networks," Proceedings of the National Academy of Sciences, vol. 99, no. 12, pp. 7821–7826, 2002.

[19] D. Liben-Nowell and J. Kleinberg, "The link-prediction problem for social networks," Journal of the American Society for Information Science and Technology, vol. 58, no. 7, pp. 1019–1031, 2007.

[20] R. Kumar, J. Novak, and A. Tomkins, "Structure and evolution of online social networks," in Link Mining: Models, Algorithms, and Applications. Springer, 2010, pp. 337–357.
