Validating search protocols for mining of health and disease events on Twitter
Aditya Lia Ramadona1,2*, Lutfan Lazuardi3, Sulistyawati1,4, Anwar Dwi Cahyono5, Åsa Holmner6, Hari Kusnanto3, Joacim Rocklöv1
The International Conference on Public Health (ICPH), Solo, Indonesia; September 14-15, 2016
https://arxiv.org/abs/1608.05910
Introduction
Twitter
• free social networking and microblogging service
• 140-character messages: news, events, personal feelings and experiences, …
• May 2016: 24.34 million active users in Indonesia, ~10% (Statista, 2016)
Twitter offers streams of public data
• might contain health-related information
• can be explored for public health monitoring and surveillance purposes (Paul et al. 2016)
Figure: Indonesia Social Media Trend (Jakpat, 2016)
Introduction
Previous studies
• Signorini et al. 2011: tracking levels of disease activity
• Eichstaedt et al. 2015: predicting heart disease mortality
• Strom et al. 2013: measuring health-related quality of life
• many more…
Methodological challenges
• data and language processing
• model development
Subjects and Methods
Develop groups of words and phrases relevant to disease symptoms and health outcomes in Bahasa Indonesia
[Diagram: historical Twitter (14d); real-time Twitter stream]
Subjects and Methods
Sentiment analysis
• examining a tweet from Twitter feeds
• the decisions were made by people with expert knowledge
• with millions of tweets: time-consuming and inefficient
Replicating expert assessment
• develop a model, interpret results and adjust the model
• make predictions
Results: text analysis
Historical Twitter feeds: 390 tweets
• search query: "rumah OR sakit OR rawat OR inap OR demam OR panas -cuaca OR berdarah OR pendarahan OR tombosit OR badan OR muntah OR badan OR tua OR ':('"
Preprocessing
• removing retweets and eliminating some noise
• removing punctuation, numbers, capitalization, and Bahasa Indonesia stop-words (e.g. kamu and aja)
Example, before and after preprocessing:
[107] "@XYZ kamu izin aja, bilang kamu sakit :(("
[107] "xyz izin bilang sakit"
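The preprocessing steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the stop-word set contains only the two words named on the slide, and the "RT @" retweet test and the ASCII-only character filter are assumptions.

```python
import re

# Assumed, illustrative stop-word list: the slides name only "kamu" and "aja".
STOP_WORDS = {"kamu", "aja"}


def is_retweet(tweet: str) -> bool:
    """Assumption: a retweet is recognised by the conventional 'RT @' prefix."""
    return tweet.startswith("RT @")


def preprocess(tweet: str) -> str:
    """Lowercase, strip punctuation/digits/emoticons, and drop stop-words."""
    text = tweet.lower()
    # Keep only ASCII letters and whitespace (an assumption that suits
    # Bahasa Indonesia, which is written with unaccented Latin letters).
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


print(preprocess("@XYZ kamu izin aja, bilang kamu sakit :(("))
# prints: xyz izin bilang sakit
```

Applied to the example tweet, the sketch reproduces the cleaned text shown on the slide.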
Results: text analysis
1,632 words
• words most highly correlated with sakit (sick, ill, pain):
  hati (0.23) ~ shame, broken heart, …
  rasa (0.13) ~ pain
  perut (0.12) ~ stomach ache
Figure 1. Words that appear more than 10 times
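The slides do not say how the word correlations were computed. One plausible reading, assumed here, is the Pearson correlation between per-tweet counts of two terms. The toy corpus below is invented for illustration and does not reproduce the reported values.

```python
from math import sqrt


def term_vector(docs, term):
    """Count how often `term` occurs in each (already cleaned) tweet."""
    return [doc.split().count(term) for doc in docs]


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant vector has no defined correlation; use 0
    return cov / sqrt(var_x * var_y)


# Toy, made-up corpus (not the study data).
docs = ["sakit hati", "sakit perut", "rasa sakit hati", "makan siang", "izin kerja"]
r = pearson(term_vector(docs, "sakit"), term_vector(docs, "hati"))
print(round(r, 2))
```

With real data, the same function would be applied to the counts of sakit and of every other word across the 390 tweets, and the words then ranked by correlation.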
Results: model development
Predictors
• the 22 most frequent words
• counting the number of predictor words in a tweet
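As a rough sketch of the predictor step, the snippet below counts predictor words in a cleaned tweet, both per word and in total. The slides do not list the 22 predictor words, so `PREDICTOR_WORDS` is a hypothetical subset drawn from the search query, and whether the model used per-word or total counts is not stated.

```python
from collections import Counter

# Hypothetical subset: the actual 22 predictor words are not listed in the slides.
PREDICTOR_WORDS = ("sakit", "demam", "rumah", "rawat", "inap", "muntah")


def predictor_counts(clean_tweet, predictors=PREDICTOR_WORDS):
    """Return one count per predictor word for a preprocessed tweet."""
    counts = Counter(clean_tweet.split())
    return {word: counts[word] for word in predictors}


def total_predictor_count(clean_tweet, predictors=PREDICTOR_WORDS):
    """Return the total number of predictor-word occurrences in the tweet."""
    return sum(predictor_counts(clean_tweet, predictors).values())


features = predictor_counts("xyz izin bilang sakit")
print(features["sakit"], total_predictor_count("xyz izin bilang sakit"))
```

Each tweet then becomes a row of counts that a tree model can split on.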
Classification and Regression Trees model (Breiman et al. 1984)
• rpart package (Therneau et al. 2015)
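To give a feel for what a classification tree does at each node, here is a minimal sketch of a single split search using Gini impurity, a common CART criterion. It is not the rpart implementation and omits recursion, pruning, and multiple predictors. The function names and toy data are illustrative assumptions.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)


def best_split(values, labels):
    """Find the threshold t (rule: value <= t) minimising weighted Gini impurity.

    Returns (threshold, impurity); threshold is None if no split improves
    on the impurity of the unsplit node.
    """
    n = len(values)
    best_threshold, best_impurity = None, gini(labels)
    for t in sorted(set(values))[:-1]:  # the largest value cannot split the node
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best_impurity:
            best_threshold, best_impurity = t, weighted
    return best_threshold, best_impurity


# Toy data: predictor-word counts per tweet and expert labels (1 = health-related).
counts = [0, 0, 1, 2, 3]
expert = [0, 0, 1, 1, 1]
print(best_split(counts, expert))
```

A full tree repeats this search on each resulting subset until a stopping rule is met.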
Results: model development
Historical Twitter feeds: 390 tweets
• 273 tweets (70%): training
• 117 tweets (30%): validating
Twitter stream feeds: 1,145,649 tweets for testing
• Indonesia: between 11°S and 6°N and between 95°E and 141°E
• 7 days: 26 July – 1 August 2016
• 100 of the 6,109 TRUE results
• 100 of the 1,139,540 FALSE results
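The data partitioning above can be illustrated with a short sketch: a 70/30 split of the 390 historical tweets and a bounding-box test for the streamed tweets. The random shuffle, the seed, and the function names are assumptions. The slides do not describe how tweets were assigned to each set or how the geographic filter was applied.

```python
import random


def train_validation_split(items, train_fraction=0.7, seed=0):
    """Shuffle a copy of `items` and split it into training and validation parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed: assumption, for repeatability
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


def in_bounding_box(lat, lon, south=-11.0, north=6.0, west=95.0, east=141.0):
    """True if a coordinate lies within 11°S to 6°N and 95°E to 141°E."""
    return south <= lat <= north and west <= lon <= east


train, validation = train_validation_split(range(390))
print(len(train), len(validation))   # prints: 273 117
print(in_bounding_box(-7.8, 110.4))  # Yogyakarta; prints: True
print(6109 + 1139540)                # prints: 1145649
```

The last line checks that the TRUE and FALSE results add up to the 1,145,649 streamed tweets.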
Results
Model performance

             AUC    Sensitivity (%)   Specificity (%)   Positive Predictive Value (%)   Negative Predictive Value (%)
Validation   0.82   80.0              84.6              86.7                            77.2
Testing      0.70   42.0              98.0              95.5                            62.8
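The four percentages in the table follow from a 2×2 confusion matrix. The slides do not report the matrices themselves. The counts below are back-calculated assumptions that are consistent with the reported figures and with the sample sizes (117 validation tweets, 200 expert-checked test tweets), and may not match the authors' actual tallies.

```python
def diagnostic_metrics(tp, fn, fp, tn):
    """Return sensitivity, specificity, PPV and NPV as percentages (1 decimal)."""
    return {
        "sensitivity": round(100 * tp / (tp + fn), 1),
        "specificity": round(100 * tn / (tn + fp), 1),
        "ppv": round(100 * tp / (tp + fp), 1),
        "npv": round(100 * tn / (tn + fn), 1),
    }


# Assumed counts, inferred from the table rather than taken from the slides.
validation = diagnostic_metrics(tp=52, fn=13, fp=8, tn=44)  # 117 tweets
testing = diagnostic_metrics(tp=42, fn=58, fp=2, tn=98)     # 200 tweets

print(validation)
print(testing)
```

Under these assumed counts, the drop in sensitivity from validation to testing comes from 58 of 100 expert-positive tweets being missed, while only 2 of 100 expert-negative tweets were flagged.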
Limitations + Challenges = Future Work
Team members involved
• academics, health workers
Twitter users
• telecommunications infrastructure
• characteristics of people
Methods
• data: streaming (Indonesia, 7d/24h ~ 1.5GB in csv format)
• model: CART, RandomForest, GBM, …
Summary
Monitoring of public sentiment on Twitter + contextual knowledge
• a nearly real-time proxy for health-related indicators
Models do not replace expert judgment
• accurately analyze small amounts of information (tweets)
• improve and refine the model
• bias and emotion: integrate assessments of many experts
1 Department of Public Health and Clinical Medicine, Epidemiology and Global Health, Umeå University
2 Center for Environmental Studies, Universitas Gadjah Mada
3 Department of Public Health, Faculty of Medicine, Universitas Gadjah Mada
4 Department of Public Health, Universitas Ahmad Dahlan
5 District Health Office, Yogyakarta
6 Department of Radiation Sciences, Umeå University
* [email protected]