This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JBHI.2016.2539972, IEEE Journal of Biomedical and Health Informatics
Assessing anti-depressants using intelligent data monitoring and mining of online fora
Altug Akay, Member, IEEE, Andrei Dragomir, Member, IEEE, Bjorn-Erik Erlandsson, Senior Member, IEEE, Metin Akay, Fellow, IEEE
Abstract— Depression is a global health concern. Social networks allow the affected population to share their experiences. These experiences, when mined, extracted, and analyzed, can be converted into either warnings to recall drugs (dangerous side effects) or service improvements (interventions, treatment options), based on observations derived from user behavior in depression-related social networks. Our aim was to develop a weighted network model to represent user activity on social health networks. This enabled us to accurately represent user interactions by relying on the data’s semantic content. Our three-step method uses the weighted network model to represent users’ activity, and network clustering and module analysis to characterize user interactions and extract further knowledge from users’ posts. The network’s topological properties reflect user activity, such as the posts’ general topic and timing, while weighted edges reflect the posts’ semantic content and the similarities among posts. The result, a synthesis of word frequency data, statistical analysis of module content, and the modeled health network’s properties, has allowed us to gain insight into consumer sentiment on antidepressants. This approach will allow all parties to participate in improving future health solutions for patients suffering from depression.
Keywords: depression, social media, network analysis, user sentiment, semantic analysis, online fora, data mining
A. Akay is with the School of Technology and Health, KTH Royal Institute
A. Dragomir, PhD., is with the Department of Biomedical Engineering, University of Houston, Houston, TX, 77204-5060, USA (e-mail: [email protected]).
B-E Erlandsson, PhD., is with the School of Technology and Health, KTH Royal Institute of Technology, Huddinge, SE-141 52 Sweden (e-mail: [email protected]).
M Akay, PhD., is with the Department of Biomedical Engineering, University of Houston, Houston, TX, 77204-5060, USA (e-mail: [email protected]).
I. INTRODUCTION
DEPRESSION is the leading cause of disability worldwide and a major contributor to the global burden of disease, affecting more than 350 million people [1][2]. Untreated depression has been linked to problems ranging from stroke to coronary artery disease [3], two of the top ten leading causes of death in the world in 2012 [4]. Fewer than half of those affected receive proper attention and treatment, in both under-developed and industrialized countries [5]. Forums and social media websites dedicated to depression have recently sprung up for patients and healthcare workers to share their experiences, from managing depression in their daily lives to
their reactions to anti-depressants. Such voluminous information offers patients, healthcare organizations, and industry opportunities to improve solutions through intelligent data mining, extraction, and analysis.
A social media network is a virtual networking environment composed of nodes and edges. Its contents can be modeled and extracted using computational tools that map trends, formulate predictions, and assess user relationships. A graph can represent the information visually, and a socio-matrix can represent a social network's structure. Topological parameters such as node degrees and network densities can elucidate specific dynamics within a network, and specific algorithms can map underlying information-rich structures (such as clusters). Identifying these clusters enables node- (or cluster-) centered information mining. Such data can help healthcare organizations, physicians, staff, and patients to improve services based on feedback from 'smart' data mining of health-specific social media sites.
Social media data collection differs drastically from traditional social science methods in that the former enables fast, sometimes even real-time, quantitative data monitoring. The latter requires time-consuming methods and yields less quantitative data that must be periodically updated. Several methods have been developed to address the problem of data collection, extraction, and analysis [6-16]. Other methods were used in different real-world scenarios [17-29]. The bulk of the methods used in collecting data from social media networks are lexicon-based, supervised classification, and concept extraction methods [30-33]. Other methods include graph-based analysis [34], text-based analysis derived from a medical corpus [35], and topic-model statistical analysis [36].
Zhao et al. [37] use text-based analysis (length of posts, frequency of certain words) and sentiment analysis to identify influential users in online cancer survival communities. In [38], the authors use a supervised analysis framework based on sets of unweighted graph-based and semantic-based features to identify leaders in online health communities. Current depression-related studies based on social media data use text-based mining and association rules to detect causality in psychiatric texts [31]. Others use text mining and various emotion-based features, combined with online behavior features (such as time of posting, number of followers and replies), to derive depression indices, which may be used to
2168-2194 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
predict depression [39-40]. In contrast, our approach combines weighted network models (to represent user activity), module analysis (describing user interaction), and topological analysis (user activity) with sentiment and text analysis to gain a greater understanding of user sentiment on anti-depressants, identify influential users, and raise potential flags on drug side effects.
We used a network-based approach for modeling cancer-related forum posts in our previous work [41]. While relatively successful at identifying modules of densely interconnected users, it employed a network modeling approach based solely on user forum interactions. Our current study adds more semantic context to our network model by considering, in the model-building step, the information contained within the posts. We added weights to the network edges that reflect the semantic similarity between the edge-connected nodes. The edge weights themselves result from an initial pre-processing step using k-means clustering. We improved the analysis of the retrieved network modules by using statistical testing, based primarily on the hypergeometric test, which enabled us to find significantly over-represented terms in certain modules. We also accounted for how user ranks and response times to any given post affected the social media network’s internal dynamics. We then identified influential users using the Hyperlink-Induced Topic Search (HITS) algorithm.
II. METHODS
A. Initial Data Search and Collection
The first step was to search for the most sought-after forums dedicated to depression. Our final list, shown in the chart below, is in descending order. Thus, depressionforums.org was our choice.
Depression Forums
depressionforums.org
forums.psychcentral.com/depression/
healingwell.com/community
psychforums.com/depression
depression-understood.org/mainforum
We chose the forum depressionforums.org as our main source of information based on our findings for this list (the number of users, posts, and threads).
B. Text Mining, Preprocessing, and Tagging
A data collection, analysis, and processing tree was developed in Rapidminer (www.rapidminer.com) to discover the most frequent words (positive, negative, and side effects) and to find their term frequency-inverse document frequency (TF-IDF) scores within each post. Figure 1 shows the data collection and processing tree. The dataset was uploaded (‘Read Excel’) and processed (‘Process Documents to Data’) using sub-components (‘Extract Content’, ‘Tokenize’, ‘Transform Cases’, ‘Filter Stopwords’, and ‘Filter Tokens’, respectively) that filtered excess noise (misspelled words, common stop words, etc.) to ensure measurable variable uniformity. The result (‘Processed Data’) contained the final word list, with each word assigned a specific TF-IDF score.
Fig. 1. The processing tree in Rapidminer to ascertain the TF-IDF scores of words in the data
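We assigned every word a specific score using the following formula:
\[ \mathrm{score}_{t,d} = \begin{cases} \left(\log tf_{t,d} + 1\right) \log\dfrac{n}{x_t}, & \text{if } tf_{t,d} \geq 1 \\ 0, & \text{otherwise} \end{cases} \tag{1} \]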
tf_{t,d} is the frequency of word t in document d, n is the total number of documents in the entire collection, and x_t is the number of documents in which t is present. TF-IDF is a widely used standard frequency measure [42].
We analyzed the text containing the highest TF-IDF scores using a modified NLTK toolkit [43] within MATLAB to ensure that the context of the post was properly recognized (the tone and negativity of posts, when combined with tagged words, remained negative). A similar approach was used in [44]. The NLTK toolkit helped ensure that the context and tone of the posts were reflected in the automated analysis. We then used the General Architecture for Text Engineering (GATE) [45] as a secondary toolkit for natural language processing. This allowed us to tag the terms in each post. We then combined the tagged terms from NLTK and GATE with the National Institutes of Health's Unified Medical Language System (UMLS) [46] and the Diagnostic and Statistical Manual of Mental Disorders (DSM) [47], which allowed us to tag medical terms ranging from medications to side effects. We used dictionaries such as Merriam-Webster (www.merriam-webster.com) and the Thesaurus Synonym Database (www.language-databases.com) to merge similar words in our databases, and manually checked the results against physical dictionaries.
Our finalized wordlist was created using the above methods and consisted of words frequently used (n>10) in the forum. These included negative and positive words, drug names, and common side effects identified via mapping to UMLS. These terms are displayed in Tables 1, 2 and 3 below.
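The scoring in Eq. 1 can be illustrated with a minimal Python sketch. It is not the Rapidminer pipeline used in the study: the function name `tfidf_vectors`, the whitespace tokenizer, and the use of the natural logarithm are assumptions made only for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(posts, wordlist):
    """Return one TF-IDF vector per post, with one entry per wordlist term (Eq. 1).

    Assumptions (not from the paper): whitespace tokenization, lower-casing,
    and natural logarithms.
    """
    n = len(posts)                      # n: number of documents (posts)
    vocab = set(wordlist)
    # Term counts per post, restricted to wordlist terms.
    counts = [Counter(tok for tok in post.lower().split() if tok in vocab)
              for post in posts]
    # x_t: number of posts in which term t is present.
    doc_freq = Counter(term for c in counts for term in c)
    vectors = []
    for c in counts:
        vectors.append([
            (math.log(c[t]) + 1) * math.log(n / doc_freq[t]) if c[t] >= 1 else 0.0
            for t in wordlist
        ])
    return vectors

# Hypothetical toy posts, for illustration only.
posts = ["nausea nausea headache", "calm", "headache calm calm"]
vectors = tfidf_vectors(posts, ["nausea", "headache", "calm"])
```

Each returned vector has one entry per wordlist term and is zero wherever the term is absent from the post, which is the form a post takes when it is treated as a numerical vector of wordlist scores.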
TABLE 1
The Final Positive and Negative Wordlist
Positive: Accept, Achieve, Affection, Affirmative, Agree, Alleviate, Ameliorate, Amenable, Assure, Bless, Bliss, Brilliant, Calm, Care, Content, Cuddle, Dear, Easy, Easily, Effective, Empathy, Empower, Enable, Encourage, Energize, Enjoy, Enlighten, Enthusiastic, Euphoric, Excellent, Exciting, Favorable, Good, Gratitude, Great, Happy, Harmless, Heal, Helpful, Hope, Improve, Kind, Laugh, Love, Merry, Miracle, Peace, Positive, Prominent, Relieve, Stable, Thanks, Understand, Warm, Wonderful
Negative: Difficult, Dislike, Abusing, Evil, Gloom, Afflicted, Afraid, Aggravation, Aggression, Agony, Alarmed, Anger, Anxiety, Ashamed, Attack, Awful, Backlash, Bad, Badgering, Ballistic, Baneful, Banished, Destroy, Barbaric, Bashing, Horrible, Beating, Berating, Bitter, Bothered, Break, Broken, Brutal, Burden, Catastrophic, Terrible, Clumsy, Complain, Condescending, Contempt, Morose, Couldn't, Coward, Cranky, Creepy, Criticize, Crooked, Cry, Cursed, Cynical, Damage, Dangerous, Degenerate, Notorious, Violent
TABLE 2
The Final Side-Effect Wordlist
Side Effects: Anxiety, Apathy, Anorgasmia, Chills, Constipation, Confusion, Depression, Dizziness, Depersonalization, Fainting, Headache, Hypertension, Insomnia, Nausea, Nervousness, Palpitations, Sweating, Tremor, Vomiting
TABLE 3
The Final Drug Wordlist
Drugs
Amitriptyline
Benzodiazepine (Bromazepam, Rivotril, Linotril, Clonotril, Klonopin)
Bupropion (Wellbutrin)
Carbamazepine (Tegretol)
Citalopram (Celexa, Cipramil)
Chlorpromazine (Thorazine, Largactil, Megaphen)
Benzatropine (Cogentin)
Cyclobenzaprine (Amrix, Flexeril, Fexmid)
Duloxetine (Cymbalta)
Valproate Semisodium (Depakote)
Desipramine (Norpramin, Pertofrane)
Dextroamphetamine (Dexedrine, Dextrostat)
Dosulepin (Prothiaden, Dothep, Thaden, Dopress)
Desvenlafaxine (Pristiq)
Escitalopram (Lexapro, Cipralex)
Reboxetine (Edronax, Prolift)
Fluvoxamine (Faverin, Fevarin, Floxyfral, Luvox)
Fluoxetine (Prozac, Sarafem)
Gabapentin (Neurontin)
Imipramine (Tofranil)
Lofepramine (Lomont, Emdalen, Gamanil, Tymelyt)
Lorazepam (Ativan, Orfidal)
Mirtazapine (Avanza, Axit, Mirtaz, Mirtazon, Remeron, Zispin)
Nefazodone (Dutonin, Nefadar, Serzone)
Nortriptyline (Sensoval, Aventyl, Pamelor, Norpress, Allegron, Noritren, Nortrilen)
Olanzapine (Zyprexa, Zypadhera, Lanzek)
Promazine (Sparine)
Promethazine (Phenergan, Promethegan, Romergan, Fargan, Farganesse, Fenazin, Prothiazine, Avomine, Atosil, Receptozine, Lergigan, Pipolphen, Sominex)
Quetiapine (Seroquel)
Risperidone (Risperdal)
Sertraline (Zoloft, Lustral)
Trazodone (Depyrel, Desyrel, Mesyrel, Molipaxin, Oleptro, Trazodil, Trazorel, Trialodine, Trittico)
Tryptophan (Tryptan)
Alprazolam (Xanax)
We represented the data with a set of vectors containing the TF-IDF scores of the wordlist terms: each post was converted into a numerical vector whose non-zero entries are the TF-IDF scores, computed with Eq. 1, of the wordlist terms present in that post, while all remaining entries were set to 0. Concurrently, we created a database that stored, for each post, the terms having a non-zero TF-IDF score (i.e. the wordlist terms present in the respective post). This database was used for term enrichment analysis, as described in Section IIF, to characterize the modules retrieved from our network-based modeling.
C. Semantic Similarity of User Postings
The TF-IDF scores in each post were built from a representative word set present throughout the forum and reflect the post's semantic content. Therefore, we viewed a TF-IDF vector as the semantic profile of each post. Consequently, various similarity measures can be derived to reflect how close the semantic profiles of two posts are, e.g. Euclidean distance or correlation. Additionally, clustering analysis can be performed to identify groups of similar semantic profiles. We used k-means clustering [48] to roughly group the semantic profiles of all posts from our forum, as a preprocessing step necessary for the network-based modeling.
D. Network-based Modeling of Forum Postings
Forum posting activity, consisting of threads containing thousands of postings and replies, was modeled as a large user-centric network. The modeling approach aimed at reflecting user interactions while simultaneously considering the posts’ semantic content. The nodes in our network correspond to forum users, and the connecting directed edges correspond to two different types of interactions: direct and context interactions. Direct interactions correspond to direct user-to-user replies using the forum’s ‘Reply’ option.
These interactions were modeled with bi-directional edges connecting the two corresponding nodes. This allowed us to model the mutual exchange of information between a poster and a direct replier. Context interactions reflect users posting within a specific thread (threads are topic-specific, and thread semantic content is homogeneous). Therefore, unidirectional edges were used to connect thread initiators to all other users posting within a specific thread. This allowed us
to model the information transfer from thread initiators to users posting within the respective thread. Figure 2 shows an example of our thread-based network modeling approach.
Fig. 2. The network model we considered: nodes represent users/posts and the edges represent information transferred among users.
In Figure 2, Node #1 is the thread initiator, and as such there are directional edges linking this node to all other nodes within this specific thread. Node #5 is a direct reply to the thread initiator, and as such it is linked to the thread initiator with a bidirectional connection. Similarly, Node #3 directly replies to Node #2, which is reflected by the bidirectional edge linking the respective nodes.
We then added weights to the edges in our network by using the semantic profiles corresponding to each forum post and the cluster centroids resulting from clustering the semantic profiles with the k-means method (Section IIC). Specifically, the weight w(x, y) of each edge connecting two nodes x and y in our network is computed by taking into consideration the clusters to which their corresponding semantic profiles belong and the distance between these clusters’ centroids C_x and C_y:
w(x, y) = [expression illegible in the source] (2)
where ||·|| is a distance metric (the Euclidean distance in our study). The constants η1 and η2 are introduced to provide additional confidence to the factors of the weight function. We chose η2 > η1 to emphasize the distance between cluster centroids. This minimized potential noise issues that may arise from the semantic profiles. Similar network edge weighting strategies have been used in genetic network modeling [49]. r_x and r_y denote the forum user ranks of users x and y, coded numerically, and reflect the users' forum posting activity (with lower values corresponding to new users and higher values to experienced users). The approach we introduce aims at adding more weight to edges connecting experienced users to beginners, thus reflecting the ensuing information transfer. Note that, in cases where two nodes (users) are connected by more than one interaction (forum reply), the weight of the corresponding inter-connecting edge is computed as the mean of the weights corresponding to each interaction.
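The construction described in this section can be sketched in a few lines of Python. This is an illustrative sketch only: `build_network`, the thread and post record layout, and the `weight_fn` callable (a stand-in for the edge weight of Eq. 2, whose exact form is not reproduced here) are hypothetical names. Treating a context interaction and a direct reply from the same post as two separate interactions is also an assumption.

```python
from collections import defaultdict

def build_network(threads, weight_fn):
    """Build a directed weighted user network from forum threads.

    threads:   iterable of (initiator, posts); each post is a dict with an
               'author' key and an optional 'reply_to' key (hypothetical layout).
    weight_fn: callable (source, target, post) -> float, a placeholder for
               the per-interaction edge weight.
    Returns {(source, target): mean weight over all interactions}.
    """
    samples = defaultdict(list)
    for initiator, posts in threads:
        for post in posts:
            author = post["author"]
            reply_to = post.get("reply_to")
            if author != initiator:
                # Context interaction: unidirectional edge initiator -> poster.
                samples[(initiator, author)].append(weight_fn(initiator, author, post))
            if reply_to is not None and reply_to != author:
                # Direct interaction: edges in both directions.
                samples[(author, reply_to)].append(weight_fn(author, reply_to, post))
                samples[(reply_to, author)].append(weight_fn(reply_to, author, post))
    # Several interactions between the same pair are averaged.
    return {edge: sum(ws) / len(ws) for edge, ws in samples.items()}

# Toy thread loosely mirroring Figure 2: u1 starts the thread, u3 replies
# directly to u2, and u5 replies directly to u1. Constant weights are used
# only as placeholders.
thread = ("u1", [{"author": "u2"},
                 {"author": "u3", "reply_to": "u2"},
                 {"author": "u5", "reply_to": "u1"}])
network = build_network([thread], lambda source, target, post: 1.0)
```

Keeping the weight function as a parameter separates the topology (who is linked to whom) from the semantic weighting, so a different weighting scheme can be swapped in without touching the edge construction.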
E. Identifying sub-graphs (Network Modules)
Our modeling framework has consequently converted the forum posts into a large directional weighted network containing a number of densely connected units (or modules) (see Figure 3B). The sequence is as follows:
A. Language Processing Block. First, the posts collected from the forum via Rapidminer are pre-processed using the NLTK Toolbox and GATE (Step A1) and transformed into a wordlist (Step A2). At this step, direct mapping to UMLS is used to identify words representing medical terms and depression-related side effects, and the Thesaurus Synonyms Database and Merriam-Webster are used for synonym matching. Based on the two wordlists, forum posts are transformed into numerical vectors containing word-frequency-based TF-IDF scores, which are subsequently clustered using the k-means method. Additionally, a database consisting of all wordlist terms found in every post is created (Step A3).
B. Network Processing Block. In parallel, forum posts and replies are modeled as a weighted directed network (Step B1). The obtained network is further refined to identify communities/modules of highly interacting users, based on the MCSD method [50] (Step B2). Finally, the network modules are analyzed to identify statistically significant terms over-represented in modules, find influential users, and highlight side effects intensively discussed within the modules (Step B3).
Finding groups of nodes that share similar properties is an important step in network analysis, as identifying such modules can provide crucial information about the underlying structure of the network and its functioning [51]. These modules are more densely connected internally (within the unit) than externally (outside the unit). Additionally, identifying modules and their boundaries allows classification of individual nodes and the role they play in module control and stability, based on local topological properties.
We chose a multi-scale method that uses local and global criteria for identifying the modules, while maximizing a partition quality measure called stability [51]. In the context of network partitioning, stability is used both as a quality measure for evaluating a partitioning scheme and as an optimization function [51][52]. Under our current framework for detecting network modules, the network is considered a Markov chain, where each node represents a state and each edge a possible state transition. Markov time is used as a resolution parameter when creating various partitions. The method starts with an initial number of modules equal to the number of nodes in the network and subsequently merges initial modules into larger ones. A greedy algorithm is used for optimizing stability [51]. Several partitioning schemes were obtained depending on the range of scales employed by the method, with the optimal partitioning having the largest stability. Identification of the modules accounts for the largest part of the computational cost of our approach, as the algorithm is implemented with a complexity of O(n(m+ln2(n))), where n is the number of nodes in the network and m the number of edges.
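To make the role of Markov time concrete, the sketch below evaluates the stability of a given partition on a small graph. It is a simplified illustration, not the multi-scale method used in the study: it assumes an undirected, connected graph (the forum network is directed and weighted), a continuous-time random walk, and the usual definition of stability as the clustered autocovariance of that walk. The function name `markov_stability` is hypothetical, and no greedy optimization is performed.

```python
import numpy as np

def markov_stability(adj, labels, t):
    """Stability of a partition at Markov time t (undirected-graph sketch).

    adj:    symmetric adjacency matrix with no isolated nodes (assumption).
    labels: module label of each node.
    t:      Markov time, acting as the resolution parameter.
    """
    adj = np.asarray(adj, dtype=float)
    labels = np.asarray(labels)
    deg = adj.sum(axis=1)
    pi = deg / deg.sum()                       # stationary distribution
    # Transition matrix exp(-t (I - D^-1 A)), computed through the
    # symmetric normalised Laplacian so that eigh can be used.
    d_isqrt = 1.0 / np.sqrt(deg)
    lap = np.eye(len(deg)) - d_isqrt[:, None] * adj * d_isqrt[None, :]
    vals, vecs = np.linalg.eigh(lap)
    exp_sym = (vecs * np.exp(-t * vals)) @ vecs.T
    trans = d_isqrt[:, None] * exp_sym * np.sqrt(deg)[None, :]
    # Autocovariance of the walk, summed within modules.
    indicator = (labels[:, None] == np.unique(labels)[None, :]).astype(float)
    autocov = pi[:, None] * trans - np.outer(pi, pi)
    return float(np.trace(indicator.T @ autocov @ indicator))

# Two triangles joined by a single edge (toy example).
adj = [[0, 1, 1, 0, 0, 0],
       [1, 0, 1, 0, 0, 0],
       [1, 1, 0, 1, 0, 0],
       [0, 0, 1, 0, 1, 1],
       [0, 0, 0, 1, 0, 1],
       [0, 0, 0, 1, 1, 0]]
for t in (0.0, 0.5, 1.0, 2.0):
    singles = markov_stability(adj, range(6), t)
    triangles = markov_stability(adj, [0, 0, 0, 1, 1, 1], t)
    print(f"t={t}: singletons={singles:.3f}, two triangles={triangles:.3f}")
```

At t = 0 the walk has not moved, so the partition into single nodes scores highest; as t grows, coarser partitions become competitive, which is how the Markov time acts as a resolution parameter.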
F. Modules’ Terms Enrichment Analysis
We then proceeded to find terms that are significantly over-represented within each of the modules found by the network-partitioning algorithm (see Section IIE), using a database containing the terms annotating each post (see Section IIB) and the hypergeometric enrichment test, which is based on the hypergeometric distribution [54]. Hypergeometric testing can evaluate whether a particular set of terms is represented more than expected by chance within a population sample (in our case a module) when compared to the total population (in our case all the forum postings), given that the term set is sampled without replacement from the finite population of forum posts.
Once modules are detected, and based on the terms annotating all nodes of the respective module, we retrieve the total set of terms present within each module. Note that, when defining the set of terms annotating a node, all posts of that specific node and their corresponding terms are considered. Subsequently, for each term in the module, we compute a probability as follows: if N denotes the population size of the forum (total number of posts) and a total of M posts are annotated with that specific term, the probability of drawing by chance k or more posts annotated with that term within a module is:
\[ p = 1 - \sum_{i=0}^{k-1} \frac{\binom{M}{i}\binom{N-M}{n-i}}{\binom{N}{n}} \tag{3} \]
where n is the number of posts found in the module. We set a confidence level for each term at 99% (p-value < [...] >10 appearances in the posts and denoted side effects, drugs, as well as positive and negative terms. Our data was then transformed into a numerical matrix (7726 x 277) containing the TF-IDF scores for all forum posts. For the k-means analysis step, we used k=20 clusters for the initial rough clustering of the TF-IDF-derived semantic profiles. This value was determined by finding the minimum value of the Davies-Bouldin index, which corresponds to an optimal clustering [58]. In Eq. 2, η1 and η2 were chosen as 0.4 and 0.6, respectively.
Our network modeling approach yielded an initial loosely connected network, linking all users within the forum. Subsequent module identification using the methods described in Section IIE yielded an optimal network partitioning containing fourteen densely connected modules. We varied our scale parameter within the interval t ∈ [0,2] in 0.1 increments, as suggested by [58][59]. Varying the scale parameter resulted in a set of partitions ranging from modules containing single individual users (for scale parameter t = 0) to large modules (for values of t close to the upper limit of the interval). The optimal partition (maximizing the stability-based quality measure) was obtained for t=1. Figure 4 shows three of the fourteen modules identified.
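As an illustration of the enrichment test of Section IIF (Eq. 3), the following minimal Python sketch computes the upper-tail hypergeometric probability with exact integer arithmetic. The function name `enrichment_p_value` and the example counts for M, n and k are hypothetical; only N = 7726 (the number of posts in the TF-IDF matrix) comes from the text.

```python
from math import comb

def enrichment_p_value(N, M, n, k):
    """Probability of observing k or more annotated posts in a module (Eq. 3).

    N: total number of posts in the forum.
    M: number of posts annotated with the term.
    n: number of posts in the module.
    k: number of posts in the module annotated with the term (k <= n).
    """
    total = comb(N, n)
    # Probability mass of drawing fewer than k annotated posts.
    below_k = sum(comb(M, i) * comb(N - M, n - i) for i in range(k))
    return 1 - below_k / total

# Hypothetical counts: a term annotating 120 of 7726 posts appears in
# 15 of the 300 posts of a module.
p = enrichment_p_value(N=7726, M=120, n=300, k=15)
is_enriched = p < 0.01   # 99% confidence level, as used in the study
```

Because `math.comb` works on arbitrary-precision integers, the sum is exact until the final division, which avoids the underflow that a naive floating-point evaluation of the binomial coefficients could produce at this population size.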
Once modules were identified, we further characterized the modules' content using the term enrichment analysis described in Methods Section IIF. Supplementary Table 1 presents the terms identified as significantly enriching the modules. To estimate the users' general opinion within a module, we defined a measure that quantifies the ratio of positive to negative terms enriching the module:
\[ r_i = \frac{p_i / p_t}{n_i / n_t} \tag{4} \]
where p_i is the number of positive terms enriching module i, p_t is the total number of positive terms in the wordlist, n_i is the number of negative terms enriching module i, and n_t is the total number of negative terms in the wordlist. Based on this measure, we characterized modules as expressing predominantly positive mood/opinions when their corresponding r_i measure was greater than 0.5 and predominantly negative when r_i < 0.5. Out of the 14 modules identified, 5 belong to the positive class, while 6 modules belong to the negative class. The remaining 3 modules were enriched with neither negative nor positive terms. Interestingly, we observed that positive class modules had significantly higher average user ranks (3.62±0.19) than negative class modules (2.66±0.24), p < [...] > 0.05 for terms ramble, horrendous, duloxetine and cyclobenzaprine). Modules 3 and 4 were not significantly enriched with any term in the original network and preserved this characteristic in the reduced wordlist networks. Additionally, it must be noted that in both the 267- and 240-wordlist networks, modules preserved their original class (Modules 1,2,5,9 and
Fig. 5. Module content analysis using the Jaccard similarity index. A: the network modules obtained using the 267-wordlist dataset were compared to the original network modules; and B: network modules obtained using the 240-wordlist dataset were compared to the original network modules.
13 were classified as positive and Modules 6,7,8,11,12 and 14 were classified as negative).
IV. DISCUSSION
A forum focused on depression was transformed into a series of scored vectors to measure sentiment on drugs used to treat depression, using positive and negative terms alongside drug names and side effect terms identified from UMLS. Our methods sought out consumer sentiment on drug-based depression treatment by modeling the forum and the exchange of information between users through network-based modeling and analysis. In order to extract the most salient and relevant text features, we represented the information posted on the forum using term frequencies, one of the commonly used pre-processing steps in text mining approaches [37],[40],[63]. Subsequent network-based modeling yielded interesting insights into the underlying information exchange among users. To this end, we proposed a novel metric for adding semantic content to the edges of the modeled network.
Our analytical methods were able to reveal the side effects of anti-depressants in greater detail. A search through the medical literature backed our analysis results, which associated specific side effects with major anti-depressants (duloxetine, citalopram, cyclobenzaprine, lorazepam, mirtazapine, chlorpromazine, promethazine and venlor). The analysis has additionally revealed that predominantly positive opinion modules contained users with significantly higher user ranks than negative modules. A reading of the posts has revealed that the extent of positive opinion linked to advanced ranks is directly tied to the users' experience not only with the drug itself, but also in managing and (in some cases) overcoming the more extreme bouts of depression. These users shared their experiences extensively with the whole forum, particularly with newcomers who were desperate for guidance and support. Additionally, both the average node degree and weighted node degrees were found to be significantly higher (thus
reflecting more densely connected structures and especially better connected users) in the positive modules. This highlights the fact that users in these modules seek information through more sources and tend to interact more than their peers in negative modules. A reading of the posts connected with the larger weighted degrees revealed a similarity in the choice of topics discussed. Further reading of the original posts and responses revealed more comprehensive discussions of topics.
Future solutions will require more advanced social media platform analysis. The starting point is an analysis of user content that intelligently maps and translates complex posts into readable formats, enabling faster analysis and response so that solutions can be implemented more quickly. The next stage is studying user relationships in social media platforms, specifically time stamps (response time of posts) and the formation (or dissolution) of ties, friend lists, and 'likes' of specific content. The internal network dynamics can be tracked continuously, over a long period of time, through responses to news, events, and updates. Combined, these analyses can examine the factors of a specific disease and lead to improved solutions, from lifestyle changes to 'smart' drug development. Social media platform users can implement these solutions, leaving further room for feedback, analysis, and improvement.
Additionally, we plan on expanding the exploratory analysis performed in the current study by incorporating a framework to quantify misdetections. Our current method can be modified to automatically detect signs of severe depression in users’ forum activity patterns by incorporating specific terms into the wordlist creation step and the subsequent statistical analysis of the network modules.
A way of achieving this would be to define a depression index for each user’s post based on reduced sets of words or n-grams from the initial wordlist that are strongly indicative of depressive behavior, as in [59]. This index, in conjunction with the topological features described in
our present work, as well as other features such as time of posting (e.g., the depression literature indicates that users showing signs of depression tend to be more active at night), could constitute an efficient framework for establishing users’ severe depressive behavior.
As has been the case with other text mining studies of healthcare-related data, the performance of the current approach depends on identifying a reasonable threshold for the number of occurrences of the terms constituting the forum posts’ semantic profiles. This is usually an empirical process; in the current study, the limit on the minimum number of occurrences was based on our previous studies [39],[41]. Choosing a higher threshold may result in loss of relevant information, while allowing terms with fewer occurrences to be represented in the semantic profiles may introduce unwanted noise. Our analysis indicates that reducing the number of terms may result in the collapse of some of the modules and/or re-assignment of nodes to other modules. However, the network modeling approach we devised induces a ‘hard-wiring’ of the network edges to a certain extent, thus diminishing the effects of reducing the feature set.
This study, like the vast majority of studies on online social influence, is limited to revealing interpersonal influence exerted through interactions in the public online space. Certain online forums may offer users the possibility of direct contact via direct messaging, or users may exchange email addresses and contact each other privately. These variables cannot be reflected by studies based on direct observational data; however, we believe private interactions lead to limited sentiment influence, while the network-based modeling we chose accurately reflects the online social context conducive to larger-scale influence.
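The trade-off behind the minimum-occurrence threshold discussed above can be illustrated with a minimal sketch. This is not the pipeline used in this study: the function name `build_semantic_profiles`, the whitespace tokenization, the three toy posts, and the threshold values are all assumptions made purely for illustration.

```python
from collections import Counter


def build_semantic_profiles(posts, min_occurrences):
    """Build term-frequency profiles, keeping only sufficiently frequent terms.

    Hypothetical helper: a term enters the vocabulary only if it occurs at
    least `min_occurrences` times across all posts.
    """
    # Naive tokenization (assumption): lowercase and split on whitespace.
    tokenized = [post.lower().split() for post in posts]

    # Count every term over the whole collection of posts.
    totals = Counter(term for tokens in tokenized for term in tokens)

    # Apply the occurrence threshold to obtain the retained vocabulary.
    vocabulary = sorted(t for t, n in totals.items() if n >= min_occurrences)

    # Each post becomes a vector of term counts over the retained vocabulary.
    profiles = []
    for tokens in tokenized:
        counts = Counter(tokens)
        profiles.append([counts[t] for t in vocabulary])
    return vocabulary, profiles


# Toy posts invented for illustration only (not forum data).
posts = [
    "citalopram helped my mood but caused nausea",
    "nausea and insomnia on citalopram",
    "mirtazapine improved my sleep",
]

vocab_low, profiles_low = build_semantic_profiles(posts, min_occurrences=1)
vocab_high, profiles_high = build_semantic_profiles(posts, min_occurrences=2)

print(len(vocab_low), vocab_high)
print(profiles_high)
```

With these toy posts, raising `min_occurrences` from 1 to 2 shrinks the vocabulary from 13 terms to 3, and the third post keeps only a single non-zero entry. This mirrors, on a very small scale, how a higher threshold can discard relevant information while a lower one admits rare and potentially noisy terms.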
To conclude, we believe the use of intelligent data mining tools is an opportunity to greatly improve the quality of healthcare by consumers, healthcare workers, and the industry while reducing costs.
REFERENCES
[1] World Health Organization. (2015, Jan, 5). Depression Fact Sheet No. 369. Available (http://www.who.int/mediacentre/factsheets/fs369/en/)
[2] A.J. Ferrari, F.J. Charlson, R.E. Norman, S.B. Patten, G. Freedman, C.J.L. Murray, T. Vos, H.A. Whiteford, “Burden of Depressive Disorders by Country, Sex, Age, and Year: Findings from the Global Burden of Disease Study 2010,” PLOS Medicine, Nov. 2013.
[3] WebMD. (2015, Jan, 6). Depression Health Center: Untreated Depression. Available (http://www.webmd.com/depression/guide/untreated-depression-effects)
[4] World Health Organization. (2015, Jan, 7). The Top 10 causes of death. Available (who.int/mediacentre/factsheets/fs310/en/)
[5] Ibid.
[6] L. Getoor and C. Diehl, “Link mining: a survey,” SIGKDD Explor. Newsl., vol. 7, pp. 3-12, Dec. 2005.
[7] Q. Lu and L. Getoor, “Link-based Classification,” in Proc. of the 20th Int. Conf. on Machine Learning (ICML), Washington, D.C., 2003, pp. 496-503.
[8] A. Ng, A. Zheng, and M. Jordan, “Stable algorithms for link analysis,” in Proc. of the SIGIR Conf. on Information Retrieval, New Orleans, Louisiana, 2001, pp. 258-266.
[9] B. Taskar, M. Wong, P. Abbeel, and D. Koller, “Link Prediction in Relational Data,” in Advances in Neural Information Processing Systems (NIPS), Vancouver, B.C., 2003.
[10] D. Liben-Nowell and J.M. Kleinberg, "The link prediction problem for social networks," Journal of the American Society for Information Science and Technology, vol. 57, pp. 556-559, May 2007.
[11] Z. Lacroix, H. Murthy, F. Naumann, and L. Raschid, “Links and Paths through Life Sciences data sources,” in Proc. of the 1st Int. Workshop on Data Integration in the Life Sciences (DILS), Leipzig, Germany, 2004, pp. 203-211.
[12] J. Noessner, M. Niepert, C. Meilicke, and H. Stuckenschmidt, “Leveraging Terminological Structure for Object Reconciliation,” in The Semantic Web: Research and Applications, Heidelberg, Berlin: Springer, 2010, pp. 334-348.
[13] M.E.J. Newman, “Detecting community structure in networks,” European Physical Journal, vol. 38, pp. 321-330, March 2004.
[14] J. Huan and J. Prins, “Efficient Mining of Frequent Subgraphs in the Presence of Isomorphism,” in Proc. of the 3rd IEEE Int. Conf. on Data Mining (ICDM’03), Melbourne, Florida, 2003, pp. 549-552.
[15] D. Hand, “Principles of Data Mining,” Drug Safety, vol. 30, pp. 621-622, July 2007.
[16] J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd ed. Burlington, Mass: Morgan Kaufmann, 2006.
[17] C. Corley, D. Cook, A. Mikler, and K. Singh, "Text and Structural Data Mining of Influenza Mentions in Web and Social Media," Int. J. Environ. Res. Public Health, vol. 7, pp. 596-615, Feb. 2010.
[18] S.R. Das and M.Y. Chen, “Yahoo! for Amazon: Sentiment extraction from small talk on the Web,” Management Science, vol. 53, pp. 1375-1388, Sept. 2007.
[19] E. Riloff, “Little words can make a big difference for text classification,” in 18th Annu. Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 1995, Seattle, Washington, pp. 130-136.
[20] W. Yih, P.H. Chang, and W. Kim, “Mining Online Deal Forums for Hot Deals,” in WI’04 Proc. of the 2004 IEEE/WIC/ACM Int. Conf. on Web Intelligence, 2004, Beijing, China, pp. 384-390.
[21] B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up? Sentiment Classification Using Machine Learning Techniques,” in EMNLP’02 Proc. of the ACL-02 Conf. on Empirical Methods in Natural Language Processing, Philadelphia, PA, 2002, pp. 79-86.
[22] X. Feng, A. Cai, K. Dong, W. Chaing, M. Feng, et al., “Assessing Pancreatic Cancer Risk Associated with Dipeptidyl Peptidase 4 Inhibitors: Data Mining of FDA Adverse Event Reporting System (FAERS),” J Pharmacovigilance, vol. 1, July 2013.
[23] K.Y. Chan, C.K. Kwong, and T.C. Wong, “Modeling customer satisfaction for product development using genetic programming,” Journal of Engineering Design, vol. 22, no. 1, pp. 56-68, Jan. 2011.
[24] I. Frommholz and M. Lechtenfeld, “Determining the Polarity of Postings for Discussion Search,” in LWA 2008-Workshop-Woche: Lernen, Wissen & Adaptivität, Proc., 2008, Würzburg, Germany, pp. 49-56.
[25] J. Schectman. (2013, May, 1). Glaxo Mined Online Parent Discussion Boards For Vaccine Worries [Online]. Available (http://blogs.wsj.com/cio/2013/05/01/glaxo-mined-online-parent-discussion-boards-for-vaccine-worries/)
[26] R. McBride. (2012, August, 1). Merck to Draw on Social Network for Psoriasis Patients [Online]. Available (http://www.fiercebiotechit.com/story/merck-draw-social-network-psoriasis-patients/2012-08-13)
[27] Five ways a Boston Children’s Hospital spin-off is using social media for public health (http://mobihealthnews.com/42072/five-ways-a-boston-childrens-hospital-spin-off-is-using-social-media-for-public-health/)
[28] GSK, Merck use social media to learn how patients use drugs outside the lab (http://mobihealthnews.com/47284/gsk-merck-use-social-media-to-learn-how-patients-use-drugs-outside-the-lab)
[29] Yih W, Qazvinian V. Measuring word relatedness using heterogenous vector space models. In: The 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT-2012), Montreal, Canada; 2012.
[30] Sarker, A., Ginn, R., Nikfarjam, A., O’Connor, K., Smith, K., Jayaraman, S., Upadhaya T., Gonzalez, G., “Utilizing Social Media Data for Pharmacovigilance: A Review,” Journal of Biomedical Informatics, vol. 54, pp. 202-212, 2015.
[31] Wu JL, Yu LC, Chang PC., “Detecting causality from online psychiatric texts using inter-sentential language patterns,” BMC Medical Informatics and Decision Making, vol. 12, 2012.
[32] Yu LC, Chan CL, Lin CC, Lin IC. Mining association language patterns using a distributional semantic model for negative life event classification. Journal of Biomedical Informatics, vol. 44, pp. 509-18, 2011.
[33] Dumontier M, Villanueva-Rosales N. Towards pharmacogenomics knowledge discovery with the semantic web. Briefings in Bioinformatics, vol. 2, pp. 153-63, 2009.
[34] Moradi, F., Eklund, A.M., Kokkinakis, D., Olovsson, T., Tsifas, P., “A Graph-Based Analysis of Medical Queries of a Swedish Health Care Portal,” In: Proceedings of the 5th International Workshop on Health Text Mining and Information Analysis (Louhi) @ EACL 2014, pages 2-10, Gothenburg, Sweden, April 26-30 2014.
[35] Bedmar, I.S., Revert, R., Martinez, P., “Detecting drugs and adverse events from Spanish health social media streams,” In: Proceedings of the 5th International Workshop on Health Text Mining and Information Analysis (Louhi) @ EACL 2014, pages 106-115, Gothenburg, Sweden, April 26-30 2014.
[36] Paul, M., Dredze, M., “Discovering Health Topics in Social Media Using Topic Models,” PLOS One, vol. 9, e103408, 2014.
[37] Zhao, K., Yen, J., Greer, G., et al., “Finding influential users of online health communities: a new metric based on sentiment influence,” J Am Med Inform Assoc, vol. 21, pp. 212-218, 2014.
[38] Zhao, K., Qiu, B., Caragea, C., Wu, D., Mitra, P., Yen, J., Greer, G., Portier, K., “Identifying Leaders in an Online Cancer Survivor Community,” Proceedings of the 21st Workshop on Information Technologies and Systems, WITS 2011.
[39] De Choudhury, Munmun, Scott Counts, and Eric Horvitz. "Social media as a measurement tool of depression in populations." Proceedings of the 5th Annual ACM Web Science Conference. ACM, 2013.
[40] De Choudhury, Munmun, Michael Gamon, Scott Counts, and Eric Horvitz. "Predicting Depression via Social Media." In ICWSM. 2013.
[41] Akay, Altug, Andrei Dragomir, and B. Erlandsson. "Network-Based Modeling and Intelligent Data Mining of Social Media for Improving Care." IEEE Journal of Biomedical and Health Informatics, vol. 1, pp. 210-218, Jan. 2015.
[42] I. Mierswa, M. Wurst, W. Michael, R. Klinkenberg, M. Scholz, and T. Euler, “YALE: Rapid Prototyping for Complex Data Mining Tasks,” in Proc. of the 12th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD-06), 2006, Philadelphia, PA, pp. 935-940.
[43] S. Bird, E. Klein, and E. Loper, Natural Language Processing with Python. Sebastopol, CA: O’Reilly Media, 2009, pp. 504.
[44] P. Soucy and G.W. Mineau, “Beyond TFIDF Scoring for Text Categorization in the Vector Space Model,” IJCAI'05 Proc. of the 19th Int. Joint Conf. on Artificial Intelligence, 2005, Edinburgh, Scotland, UK, pp. 1130-1135.
[45] https://gate.ac.uk/
[46] http://www.nlm.nih.gov/research/umls/
[47] American Psychiatric Association. (2013). Diagnostic and statistical manual of mental disorders (5th ed.). Washington, DC: Author.
[48] Kanungo, Tapas, et al. "An efficient k-means clustering algorithm: Analysis and implementation." Pattern Analysis and Machine Intelligence, IEEE Transactions on 24.7 (2002): 881-892.
[49] Maraziotis, Ioannis A., Konstantina Dimitrakopoulou, and Anastasios Bezerianos. "Growing functional modules from a seed protein via integration of protein interaction and gene expression data." BMC Bioinformatics 8.1 (2007): 408.
[50] E. Le Martelot and C. Hankin, “Multi-Scale Community Detection using Stability as Optimisation Criterion in a Greedy Algorithm,” 2011 Int. Conf. Knowledge Discovery and Information Retrieval (KDIR 2011), Paris, October, pp. 216-225. SciTePress.
[51] Fortunato, Santo. "Community detection in graphs." Physics Reports 486.3 (2010): 75-174.
[52] Lambiotte, Renaud. "Multi-scale modularity in complex networks." Modeling and optimization in mobile, ad hoc and wireless networks (WiOpt), 2010 Proceedings of the 8th International Symposium on. IEEE, 2010.
[53] Ibid.
[54] Tsoi, Lam C., et al. "Text-mining approach to evaluate terms for ontology development." Journal of Biomedical Informatics 42.5 (2009): 824-830.
[55] Kleinberg, Jon M. "Authoritative sources in a hyperlinked environment." Journal of the ACM (JACM) 46.5 (1999): 604-632.
[56] Pujol, Josep M., Ramon Sangüesa, and Jordi Delgado. "Extracting reputation in multi agent systems by means of social network topology." Proceedings of the first international joint conference on Autonomous agents and multiagent systems: part 1. ACM, 2002.
[57] Ibid.
[58] D.L. Davies, D.W. Bouldin. “A Cluster Separation Measure,” IEEE Transactions on Pattern Analysis and Machine Intelligence. PAMI-1 (2): 224-227.
[59] Bremner, J. Douglas. "Structural changes in the brain in depression and relationship to symptom recurrence." CNS Spectrums 7.02 (2002): 129-139.
[60] Saha, Kumar B., Stephanie Sampson, and Rashid U. Zaman. "Chlorpromazine versus atypical antipsychotic drugs for schizophrenia." The Cochrane Library (2013).
[61] Dorevitch, Abraham, and Hillel Davis. "Fluvoxamine-associated sexual dysfunction." Annals of Pharmacotherapy 28.7-8 (1994): 872-874.
[62] http://www.drugs.com/sfx/phenergan-side-effects.html
[63] Cambria E, Schuller B, Xia Y, and Havasi C. "New avenues in opinion mining and sentiment analysis." IEEE Intelligent Systems 2 (2013): 15-21.
Altug Akay, MA, received his BA from the University of Massachusetts at Amherst in 2005 in the US, and his MA from Dartmouth College in the US in 2007. He then worked at Siemens Healthcare in Erlangen, Germany under the supervision of Dr. Gudrun Zahlmann. He then worked at Spaulding Rehabilitation Hospital under the supervision of Dr. Paolo Bonato and Dr. Utkan Demirci of Harvard Medical School. He is currently pursuing his PhD in the School of Technology and Health at KTH Royal Institute of Technology in Stockholm. His current interests are medical engineering and public health.
Andrei Dragomir, PhD, received the BS degree in electrical engineering at the Politehnica University of Bucharest, Romania and the MS and PhD degrees in biomedical engineering at the University of Patras in Greece. He was a postdoctoral research associate at the Arizona State University, Tempe, Arizona from 2006-2010. He is currently an Assistant Professor at the University of Houston. His
research interests are biocomplexity, neural engineering, pattern recognition, machine learning and bioinformatics.
Björn-Erik Erlandsson, PhD, received his PhD from Chalmers University of Technology, Gothenburg, Sweden, in the field of applied electronics. He is currently Senior Advisor and Professor at the School of Technology and Health, KTH Royal Institute of Technology, Stockholm, with responsibility for Technology and Quality in Health Care. He has experience in research and development, quality management, and regulatory affairs in the international medical device industry (Siemens and Nobel Industries). He has worked as Quality Manager in the pacemaker industry, has been Director of Medical Informatics and Technology at the University Hospital of Northern Sweden, Umeå, and at Akademiska Hospital, Uppsala, and has also been a professor of Biomedical Engineering. In his role as director for technical operations at the university hospitals, he has also been heavily involved in investment issues and investment management, and has chaired the investment planning groups. He is also involved in standardization work in medical technology and medical informatics, as chairman of SIS/TK334, and as Chairman of the Joint Working Group in Software and Medical Devices (SAMD) at CEN/CENELEC during 2010 and Co-chairman during 2011. He has served as Chairman of the Medical Society's Division of Medical Informatics for two years, as a member of the Swedish Council on Health Technology Assessment, SBU Alertråd, and as Coordinator of EU policies at the Royal Institute of Technology.