Network-Based Modeling and Intelligent Data Mining of Social Media ...

2 downloads 272 Views 408KB Size Report
and network-based properties. Our approach can expand research into intelligently mining social media data for consumer opinion of various treatments to ...
210

IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. 19, NO. 1, JANUARY 2015

Network-Based Modeling and Intelligent Data Mining of Social Media for Improving Care Altug Akay, Member, IEEE, Andrei Dragomir, Member, IEEE, and Bj¨orn-Erik Erlandsson, Senior Member, IEEE

Abstract—Intelligently extracting knowledge from social media has recently attracted great interest from the Biomedical and Health Informatics community to simultaneously improve healthcare outcomes and reduce costs using consumer-generated opinion. We propose a two-step analysis framework that focuses on positive and negative sentiment, as well as the side effects of treatment, in users’ forum posts, and identifies user communities (modules) and influential users for the purpose of ascertaining user opinion of cancer treatment. We used a self-organizing map to analyze word frequency data derived from users’ forum posts. We then introduced a novel network-based approach for modeling users’ forum interactions and employed a network partitioning method based on optimizing a stability quality measure. This allowed us to determine consumer opinion and identify influential users within the retrieved modules using information derived from both word-frequency data and network-based properties. Our approach can expand research into intelligently mining social media data for consumer opinion of various treatments to provide rapid, up-to-date information for the pharmaceutical industry, hospitals, and medical staff, on the effectiveness (or ineffectiveness) of future treatments. Index Terms—Data mining, complex networks, neural networks, semantic web, social computing.

I. INTRODUCTION OCIAL media is providing limitless opportunities for patients to discuss their experiences with drugs and devices, and for companies to receive feedback on their products and services [1]–[3]. Pharmaceutical companies are prioritizing social network monitoring within their IT departments, creating an opportunity for rapid dissemination and feedback of products and services to optimize and enhance delivery, increase turnover and profit, and reduce costs [4]. Social media data harvesting for bio-surveillance have also been reported [5]. Social media enables a virtual networking environment. Modeling social media using available network modeling and computational tools is one way of extracting knowledge and trends from the information ‘cloud:’ a social network is a structure made of nodes and edges that connect nodes in various relationships. Graphical representation is the most common method to visually represent the information. Network modeling could

S

Manuscript received January 24, 2014; revised May 4, 2014 and June 19, 2014; accepted June 30, 2014. Date of publication July 10, 2014; date of current version December 30, 2014. A. Akay and B.-E. Erlandsson are with the School of Technology and Health, Royal Institute of Technology, Stockholm SE-141 52, Sweden (e-mail: [email protected]; [email protected]). A. Dragomir, is with the Department of Biomedical Engineering, University of Houston, Houston, TX 77204–5060 USA (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/JBHI.2014.2336251

also be used for studying the simulation of network properties and its internal dynamics. A sociomatrix can be used to construct representations of a social network structure. Node degree, network density, and other large-scale parameters can derive information about the importance of certain entities within the network. Such communities are clusters or modules. Specific algorithms can perform network-clustering, one of the fundamental tasks in network analysis. Detecting particular user communities requires identifying specific, networked nodes that will allow information extraction. Healthcare providers could use patient opinion to improve their services. Physicians could collect feedback from other doctors and patients to improve their treatment recommendations and results. Patients could use other consumers’ knowledge in making better-informed healthcare decisions. The nature of social networks makes data collection difficult. Several methods have been employed, such as link mining [6], classification through links [7], predictions based on objects [8], links [9], existence [10], estimation [11], object [12], group [13], and subgroup detection [14], and mining the data [15], [16]. Link prediction, viral marketing, online discussion groups (and rankings) allow for the development of solutions based on user feedback. Traditional social sciences use surveys and involve subjects in the data collection process, resulting in small sample sizes per study. With social media, more content is readily available, particularly when combined with web-crawling and scraping software that would allow real-time monitoring of changes within the network. Previous studies used technical solutions to extract user sentiment on influenza [17], technology stocks [18], context and sentence structure [19], online shopping [20], multiple classifications [21], government health monitoring [22], specific terms relating to consumer satisfaction [23], polarity of newspaper articles [24], and assessment of user satisfaction from companies [25], [26]. Despite the extensive literature, none have identified influential users, and how forum relationships affect network dynamics. In the first stage of our current study, we employ exploratory analysis using the self-organizing maps (SOMs) to assess correlations between user posts and positive or negative opinion on the drug. In the second stage, we model the users and their posts using a network-based approach. We build on our previous study [27] and use an enhanced method for identifying user communities (modules) and influential users therein. The current approach effectively searches for potential levels of organization (scales) within the networks and uncovers dense modules

2168-2194 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications standards/publications/rights/index.html for more information.

AKAY et al.: NETWORK-BASED MODELING AND INTELLIGENT DATA MINING OF SOCIAL MEDIA FOR IMPROVING CARE

Fig. 1. Processing tree in Rapidminer to ascertain the TF-IDF scores of words in the data

using a partition stability quality measure [28]. The approach enables us to find the optimal network partition. We subsequently enrich the retrieved modules with word frequency information from module-contained users posts to derive local and global measures of users opinion and raise flag on potential side effects of Erlotinib, a drug used in the treatment of one of the most prevalent cancers: lung cancer [29].

211

loaded the data into the first component (‘Read Excel’). The uploaded data was then processed in the second component (‘Process Documents to Data’) using several subcomponents (‘Extract Content’, ‘Tokenize’, ‘Transform Cases’, ‘Filter Stopwords’, ‘Filter Tokens,’ respectively) that filtered excess noise (misspelled words, common stop words, etc.) to ensure a uniform set of variables that can be measured. The final component (‘Processed Data’) contained the final word list, with each word containing a specific TF-IDF score. We then assigned weights for each of the words found in the user posts using with the following formula:  log tfi,d + 1) log xnt if tf ≥ 1 t,d weighti,d otherwise 0 in which tfi,d represents the word frequency (t) in the document (d), n represents the number of documents within the entire collection, and xt represents the number of documents where t occurs [30]. C. Cataloging and Tagging Text Data

II. METHODS A. Initial Data Search and Collection We first searched for the most popular cancer message boards. We initially focused on the number of posts on lung cancer. The chart below gives the number of posts of lung cancer per forum: Forums Cancer-forums.net cancerforums.net forums.stupidcancer.org csn.cancer.org/forum

Posts on Lung Cancer 36 051 34 328 17 7959

We chose lung cancer because, according to the most recent statistics, it is the most commonly diagnosed cancer in the world for both sexes [30], and the second most prevalent cancer in the US between both sexes [31], [32]. We then compiled a list of drugs used by lung cancer patients to ascertain which drug was the most discussed in the forums. The drug Erlotinib (trade name Tarceva) was the most frequently discussed drug in the message boards. A further search revealed that Cancerforums.net, despite having slightly fewer posts on lung cancer, had more posts dedicated to Erlotinib than the other three message boards mentioned above. Next, we performed a search of the drug, using both the trade name (Tarceva) and drug name (Erlotinib). The trade name garnered more results (498) compared to the drug name (66). The search using the trade name returned 920 posts, from 2009 to the present date. B. Initial Text Mining and Preprocessing A Rapidminer (www.rapidminer.com) [33] data collection and processing tree was developed to look for the most common positive and negative words, and their term-frequency-inverse document frequency (TF-IDF) scores within each post. Fig. 1 shows the data collection and processing tree. We initially up-

Text data containing the highest TF-IDF scores were tagged with a modified NLTK toolkit (http://www.nltk.org/) [34] using MATLAB to ensure that they reflected the negativity of a negative word and the positivity of a positive word in context. This approach was used before using negative tags on positive words [35]. We added a positive tag on negative words. We used the NLTK toolkit for the analysis, and classification, of words to match their exact meanings within the contextual settings. For example, the context should be considered in phrases such as ‘I do not feel great’ so that the term ‘great’ would be adequately tagged as a negative one (in our case it is tagged as ‘great_n’ before it is returned to its specific position). Das and Chen used a similar approach in classifying words [18]. We went one step further and considered positive tag on negative words. A sentence that states ‘No side effects so I am happy!’ resulted in the word ‘No’ being tagged as ‘No_p’ (reflecting its positive context) before it is returned to its specific position. These tagged words were thus reclassified based on the context of the post. We then reduced the number of similar words, both manually (checking the words using online dictionaries such as Merriam-Webster (http://www.merriam-webster.com/), and automatically (synonym database software such as the Thesaurus Synonym Database (http://www.language-databases.com/) and Google’s synonym search finder. Our final wordlist was pruned using the aforementioned methods, with the results displayed in Table I, with the division of both positive and negative words. We eliminated each word that appeared less than ten times. This allowed us to achieve a uniform set of measurements while eliminating statistically insignificant outliers. The end result was a modified wordlist of 110 words (55 positive words and 55 negative words) shown in Table I. In a parallel procedure, we automatically browsed the user posts to look for side effects of Erlotinib. To this goal, we used the National Library of Medicine’s Medical Subject Heading (MeSH), which is a controlled vocabulary

212

IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. 19, NO. 1, JANUARY 2015

TABLE I FINAL POSTANALYSIS WORDLIST Positive Agree Appreciate Beneficial Benefit Comfort Comfortable Ease Easier Effective Enjoy Favorable Favorably Feasible Good Grateful Great Greater Greatest Greatly Help Helped Helpful Helping Helps Honest Honestly Hope Hoped Hopeful Hopefully Hopes Hoping Importance Important Importantly Impresses Improve Improved Improvement Improves Improving Inspiration Like Love Loved Positive Right Success Successful Support Thank Thanks Useful Well Wonderful

Negative Bad Cannot Concern Concerns Damage Dangerous Depression Didn Died Difficult Discomfort Don’t Doubt Error Failure Fear Hard Hasn Hate Hurt Impossible Isn Lack Limited Lose Loss Lost Miss Nasty Nausea Negative No Not Pain Painful Poor Problem Problems Sad Sacred Scary Severe Sorry Sucks Suffer Suffering Terrible Unable Unfortunately Wasn Weak Worried Worse Worst Wrong

(http://www.nlm.nih.gov/mesh/) that consists of a hierarchy of descriptors and qualifiers that are used to annotate medical terms. A custom designed program was used to map words in the forum to the MeSH database. A list of words present in forum posts that were associated to treatment side effects was thus compiled. This was done by selecting the words simultaneously annotated with a specific list of qualifiers in MeSH (CI – chemically induced; CO – complications; DI – diagnosis; PA – pathology, and PP – physiopathology). We then compared the

TABLE II FINAL SIDE EFFECTS WORDLIST Acne Cachexia Headaches Itching Lesion Pneumonia Rash Tremor Weakness Vomiting

Fig 2. Thread model where nodes represent users/posts and the edges represent information transferred among users.

full list of side effect words with the results that were fed into the Rapidminer processing tree: we kept the side effect words with the highest TF-IDF scores (ensuring that each word appeared at least ten times in the forum posts). Table II shows the final wordlist of the side effects. We subjected the initial side effect wordlist with the same methods that were used in Table I. After these preprocessing steps, our forum data was represented as two sets of vectors containing the TF-IDF scores of the words in the two wordlist. Namely, each user post in the forum was thus transformed into a vector of 110 variables representing the TF-IDF scores of positive and negative words, and a 10 variable vector containing the TF-IDF scores corresponding to the side effect terms (see Fig. 2, steps A1-A3). D. Consumer Sentiment Using a SOM For this part of the analysis, all posts were manually labeled according to the general user opinion observed within the post as positive and negative before feeding the collected data for exploratory analysis via SOMs. The manual labeling allowed us to use this as a method of results validation. SOMs are neural networks that produce low-dimensional representation of high-dimensional data [33]. Within this network, a layer represents output space with each neuron assigned a specific weight. The weight values reflect on the cluster content. The SOM displays the data to the network, bringing together similar data weights to similar neurons.

AKAY et al.: NETWORK-BASED MODELING AND INTELLIGENT DATA MINING OF SOCIAL MEDIA FOR IMPROVING CARE

The benefits and capabilities have been demonstrated where despite the reduction of the space size, the information, and identification schema of the clusters remained the same [36]. When new data is fed into the network, the closest weights matching the data change to reflect the new data. The neurons farther from the new data rarely change. This process continues until data is no longer fed, resulting in a two-dimensional map. The SOM toolbox (www.cis.hut.fi/projects/somtoolbox) [37] was used and the SOM was fed with our first wordlist (see Table I) TF-IDF vectors. The purpose was to assess the existence of clusters in the data and how the SOM weights of these clusters would correlate to positive and negative opinion. The SOM was trained using various map sizes, using quantization and topographic errors as validation measures. The former is the result of the average distance between every input vector and its best matching neuron (BMN), in addition to measuring how the trained map fits into the input data [33]. The latter uses the structure of the map to preserve its topology by representing its accuracy: it is calculated using the proportion of the weights for the first and second BMNs are farther than required for measuring the topology. The best map size was based on the minimum values of the quantization (0.24) and topographic (10−5 ) errors. The wordlist data was mapped and the emerging weights were analyzed for positive and negative variable correlations of the wordlist. Words of no interest, and groups containing three or fewer words, were eliminated. Subgroups were visually identified and analyzed for further information on the consumer opinion of Erlotinib. E. Modeling Forum Postings Using Network Analysis Discovering influential users was the next step in our analysis. To this goal, we built networks from forum posts and their replies, while accounting for content-based grouping of posts resulting from the existing forum threads. Networks are composed of nodes and their connections: they are either nondirectional (a connection between two nodes without a direction) or directional (a connection with an origin and an end). The nodal degree of the latter measures the number of connections from the origin to the destination. Four node types have been identified [38] within a network: Isolated, transmitter, receptor, and carrier. The network’s density measures the current number of connections. The network-based analysis is widely used in social network analysis based on its ability to both model and analyze intersocial dynamics. We devised a directional network model due to the nature of the forum under scrutiny (multiple threads with multiple thread initiators) and its internal dynamics among the members (members reply to thread initiators as well as to other users). Fig. 2 describes the approach we chose to build our network, which shows how each posting-reply pair is modeled. Based on the nature of the forum, all of the posters within each thread are context posters for the thread initiator (e.g., Node 1 is the thread initiator in Fig. 2 and Nodes 2, 3, 4, and 5 in represent context posters). Thus, all of the posters receive an incoming edge from the thread initiator. Some context posters respond

213

Fig. 3. Diagram describing the framework of our network-based analysis. First, the posts collected from the forum via Rapidminer are preprocessed using the NTLK Toolbox (Step A1) and transformed into two wordlists (Step A2). For this step, direct mapping to the MeSH vocabulary is used to identify words representing side-effects Based on the two wordlists, forum posts are transformed into numerical vectors containing word-frequency based TF-IDF scores (Step A3). In parallel, forum posts and replies are modeled as a directed network (Step B1). Obtained network is further refined to identify communities/modules of highly interacting users, based on the MCSD method [28] (Step B2). Finally, the two wordlist vectors datasets (their info reflecting the forum information content) are overlaid onto the network modules to identify influential users and highlight side-effects intensively discussed within the modules, respectively (Step B3).

directly to another poster, using the forum option ‘Reply.’ We used bidirectional edges to reflect the ensuing information transfer from the poster to the replier and vice versa (in Fig. 2, Node 5 is a direct replier to Node 4, as is Node 3 to Node 2). This user-interaction model allowed us to build a network that reflects faithfully the information content of the forum. F. Identifying Subgraphs Our modeling framework has consequently converted the forum posts into several large directional networks containing a number of densely connected units (or modules) (see Fig. 3, step A1). These modules have the characteristic that they are more densely connected internally (within the unit) than externally (outside the unit). We chose a multiscale method that uses local and global criteria for identifying the modules, while maximizing a partition quality measure called stability [28]. The stability measure considers the network as a Markov chain, with nodes representing states and edges being possible transitions among these states. In [28], the authors proposed an approach in which transition probabilities for a random walk of length t (t being the Markov time) enable multiscale analysis. With increasing scale t, larger and larger modules are found. The stability of a walk of length t can be expressed as   1  di dj At i , j − ∗ δ (i, j) (1) QM t = 2m i,j 2m where At is the adjacency matrix, t is the length of the network, m is the number of edges, i and j are nodes, di is node i’s (and j’s) strength, and δ (i,j) function becomes one if one of the nodes belong to the same network and zero if it does not belong to

214

IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. 19, NO. 1, JANUARY 2015

any network. At is computed as follows (in order to account for the random walk): At = D · M t , where M = D−1 · A (D being the diagonal matrix containing the degree vector giving for each node its degree) [28]. The method for identifying the optimal modules is based on alternating local and global criteria that expand modules by adding neighbor nodes, reassigning nodes to different modules, and significantly overlapping modules until no further optimization is feasible, according to (1). The approach follows similar methods presented in [28], [39], and [40]. Several partitioning schemes were obtained pending on the range of scales employed by the method, with the optimal partitioning having the largest stability. We named the modules thus retrieved information modules (see step A2 in Fig. 3). G. Module Average Opinion and User Average Opinion We then proceeded to refine the information modules through feeding them with the information obtained from the forum posts (using the wordlist vectors). In a first step, we aimed at identifying influential users within our networks. Influential users are users which broker most of the information transfer within network modules and whose opinion in terms of positive or negative sentiment towards the treatment is ‘spread’ to the other users within their containing modules. To this goal, we enriched the information modules obtained as described in Section II-F with the TF-IDF scores of the user posts corresponding to the users found in each module. The TF-IDF scores from the wordlist of positive and negative words (see Table I) were used to build two forms of measurement. The global measure (pertaining to the whole information module) is represented by the module average opinion (MAO). It examined the TF-IDF scores of postings matching the nodes in a specific module MAO = 

Sum+ − Sum− . Sumall

Sum+ = xij is the total sum of the TF-IDF scores matching the positive words in the wordlist vectors within the module. The units i represent post index. The unit j represents the wordlist index  (matching the positive words in the list). xij is the total sum of the TF-IDF scores Sum− = matching the negative words in the wordlist vectors within the module. The units i represent post index. The unit j represents the wordlist index the negative words in the list). M  (matching Sumall = N i=1 k =1 xik is the sum of both of the aforementioned sums. The unit k is the index running across variables throughout the entire wordlist. The local measure that illustrates specific user opinion to each node in the module (the user average opinion, or UAO) that examines the TF-IDF scores to the average of the collected posts of the user is the following: UAOi =

Sumi + − Sumi − . Sumiall

 Sumi+ = j ∈P xij is the TF-IDF score’s sum matching to positive words for the ith user’s wordlist vector. P is the index set denoting the wordlist’s positive variables.

Fig. 4.

U-Matrix of the posts from the Cancerforums.net forum.

 Sumi− = j ∈N xij is the TF-IDF score’s sum matching to negative words for the ith user’s wordlist vector. N is the index set denoting  the wordlist’s negative variables. Sumall = M j =1 xij is the total of both sums, and j is the index of the whole wordlist. H. Information Brokers Within the Information Modules We first ranked individual nodes in terms of their total number of connecting edges (in and out-degree) to identify influential users within the modules. We then looked nodes in each module based on the following criteria: 1) The nodes have densest degrees within the module (highest number of edges). 2) The UAO scores equate the signs of the MAO of the containing module. The nodes that qualified were dubbed information brokers, based on the aforementioned criteria. Their large nodal degrees ensure increased information transfer compared to other nodes while their matching UAO and MAO scores reflect consistency of positive or negative opinion within the containing module. I. Network-Based Identification of Side Effects In the second step of our network-based analysis, we devised a strategy for identifying potential side effects occurring during the treatment and which user posts on the forum highlight. To this goal, we overlay the TF-IDF scores of the second wordlist (see Table II) onto modules obtained in Section II-F. The TFIDF scores within each module will thus directly reflect how frequent a certain side-effect is mentioned in module posts. Subsequently, a statistical test (such as the t-test for example) can be used to compare the values of the TF-IDF scores within the module to those of the overall forum population and identify variables (side-effects) that have significantly higher scores. Fig. 3 presents a diagram that visually describes the steps in our network-based analysis. III. RESULTS Fig. 4 shows the unified matrix resulting from the SOM analysis for the wordlist vectors corresponding to the positive and negative terms from the message board Cancerforums.net. A subset consisting of 30% of the data was used for training the SOM. We used a 12 × 12 map size with 110 variables corresponding to the positive and negative terms to ascertain the

AKAY et al.: NETWORK-BASED MODELING AND INTELLIGENT DATA MINING OF SOCIAL MEDIA FOR IMPROVING CARE

215

TABLE III USER OPINION OF ERLOTINIB Satisfaction

Dissatisfaction

70 percent

30 percent BREAKDOWN OF USER OPINION Fully Satisfied (23) Full Dissatisfaction (4) Satisfied Despite Side Effects (37) Dissatisfaction because of Side Effects (20) Satisfied Despite Costs (10) Dissatisfaction because of Costs (6)

weight of the words corresponded to the opinion of the drug Erlotinib. As mentioned in Section II, each word from the list appeared more than ten times. This achieved a uniform measurement set while eliminating statistically insignificant outliers. Much of the user’s posts converged on three areas of the map. We checked the respective nodes’ correlation with their weight vectors’ values corresponding to positive or negative words to define the positive and negative areas of the map. The user opinion of Erlotinib was overall satisfactory, with Table III summarizing the satisfaction/dissatisfaction below: According to chart, and from our readings of both the user posts and the SOM, the most pressing concern from both camps was the side effects, which are extensively documented in the medical literature [41]–[46]. The costs of the drug were also another matter of concern (albeit limited). We then proceeded to identify influential users. Our modeling approach yielded initially a single loosely connected network, linking all users within the forum. Subsequent module identification using the methods described in Section II-F yielded an optimal partitioning containing five densely connected module. We varied our scale parameter within the interval t ࢠ [0,2] in 0.1 increments, as suggested by [28]. Varying the scale parameter resulted in a set of partitions ranging from modules based on single individual users (for scale parameter t = 0), to large modules (for values of t close to the upper limit of the interval). The optimal partition (maximizing the quality measure in (1) was obtained for t = 1. On the Cancerforums.net message board, ten users out of the 920 posts were identified as information brokers as shown in Fig. 5(a)–(e) below. Densities of the retrieved modules range from 0.2 to 0.6. These density values were within the observed density values interval (towards the upper limit), when compared to those generally noted in social networks, thus confirming our network modeling approach [47], [48]. Information brokers were identified following the procedure described in Sections II-G–H. Further scrutinizing these users and their containing modules we confirmed their connections were the densest. A thorough reading of these ten users’ posts throughout the threads they started and participated in revealed that they were informative and actively interacting with users across many threads. Other members sought out these ten posters for their wisdom and experience. Their forum ‘behavior’ has confirmed to us that these users were the premier information brokers of the drugs Erlotinib on the Cancerforums.net forum.

Fig. 5. Ten users were identified as information brokers on the Cancerforums.net Forum. Modules in parts a)–e) show where these ten users reside in the forum.

216

IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. 19, NO. 1, JANUARY 2015

TABLE IV SIDE-EFFECT FREQUENCY AND LOCATION IN SELECTED MODULES Module 1 (A)

‘rash’ (p − value < 0.01) ‘itch’ (p − value < 0.05)

Module 2 (B) Module 5 (E)

‘rash’ (p − value < 0.05) ‘rash’ (p − value < 0.01)

In the last part of our analysis, we investigated which modules were significantly involved in discussing specific side effects. As described in Section II-I, retrieved modules were enriched with the TF-IDF scores corresponding to the side-effect wordlist vectors. For each module and each side-effect scores sample, ttests were performed to assess the significant difference between the in-module sample and the overall forum population scores. Rash and itching were identified as the side effect terms with significantly higher scores in Modules 1, 2, and 5 when compared to the overall scores population in the forum, as described in Table IV. This reflects the fact that users grouped within these modules repeatedly discussed these side effects in their posts. This was confirmed by subsequent scrutiny of the respective posts. A literature search confirmed that rash and itching are indeed two of the most common side-effects of Erlotinib with as much as 70% of the patients affected, as indicated by clinical studies. [44]–[46] IV. DISCUSSION We converted a forum focused on oncology into weighted vectors to measure consumer thoughts on the drug Erlotinib using positive and negative terms alongside another list containing the side effects. Our methods were able to investigate positive and negative sentiment on lung cancer treatment using the drug by mapping the large dimensional data onto a lower dimensional space using the SOM. Most of the user data was clustered to the area of the map linked to positive sentiment, thus reflecting the general positive view of the users. Subsequent network based modeling of the forum yielded interesting insights on the underlying information exchange among users. Modules of strongly interacting users were identified using a multiscale community detection method described in [28]. By overlaying these modules with content-based information in the form of word-frequency scores retrieved from user posts, we were able to identify information brokers which seem to play important roles in the shaping the information content of the forum. Additionally, we were able to identify potential side effects consistently discussed by groups of users. Such an approach could be used to raise red flags in future clinical surveillance operations, as well as highlighting various other treatment related issues. The results have opened new possibilities into developing advanced solutions, as well as revealing challenges in developing such solutions. The consensus on Erlotinib depends on individual patient experience. Social media, by its nature, will bring different individuals with different experiences and viewpoints. We sifted

through the data to find positive and negative sentiment, which was later confirmed by research that emerged regarding Erlotinib’s effectiveness and side effects. Future studies will require more up-to-date information for a clearer picture of user feedback on drugs and services. Future solutions will require more advanced detection of intersocial dynamics and its effects on the members: such interests of study may include rankings, ‘likes’ of posts, and friendships. Further emphasis on context posting will require formal language dictionaries that include medical terms for specific diseases, and informal language terms (‘slang’) to clarify posts. Finally, different platforms will allow up-to-date information on the status of the drug in case one social platform ceases to discuss the drug. Another solution can look at multiple wordlists that can include multiple treatments that, when combined with contextual posting and medical lexical dictionaries, can pinpoint the source (or multiple sources) of user satisfaction (or dissatisfaction), which can open the door towards mapping consumer sentiment of multidrug therapies for advanced diseases. The combined solutions can open new avenues of postmarketing surveillance research as companies seek real-time, ‘intelligent’ data of their products and services to remain competitive. This solution can be envisioned on future medical devices that can serve as postmarketing feedback loop that consumers can use to express their satisfaction (or dissatisfaction) directly to the company. The company benefits from real-time feedback that can then be used to assess if there are any problems and rapidly address such problems. Social media can open the door for the health care sector in address cost reduction, product and service optimization, and patient care. REFERENCES [1] A. Ochoa, A. Hernandez, L. Cruz, J. Ponce, F. Montes, L. Li, and L. Janacek. “Artificial societies and social simulation using ant colony, particle swarm optimization and cultural algorithms,” New Achievements in Evolutionary Computation, P. Korosec, Ed. Rijeka, Croatia: InTech, pp. 267–297, 2010. [2] W. Cornell and W. Cornell. (2013). How Data Mining Drives Pharma: Information as a Raw Material and Product. [Online]. Available: http://acswebinars.org/big-data [3] L. Toldo, “Text mining fundamentals for business analytics,” presented at the 11th Annual Text and Social Analytics Summit, Boston, MA, USA, 2013. [4] L. Dunbrack, “Pharma 2.0 – social media and pharmaceutical sales and marketing,” in Proc. Health Ind. Insights, 2010, p. 7. [5] C. Corley, D. Cook, A. Mikler, and K. Singh, “Text and structural data mining of influenza mentions in web and social media,” Int. J. Environ. Res. Public Health, vol. 7, pp. 596–615, Feb. 2010. [6] L. Getoor and C. Diehl, “Link mining: a survey,” SIGKDD Explor. Newsl., vol. 7, pp. 3–12, Dec. 2005. [7] Q. Lu. And and L. Getoor, “Link-based classification,” in Proc. 20th Int. Conf. Mach. Learning, Washington, D.C., USA, 2003, pp. 496–503. [8] A. Ng, A. Zheng, and M. Jordan, “Stable algorithms for link analysis,” in Proc. SIGIR Conf. Inform. Retrieval., New Orleans, LouisianaLO, USA, 2001, pp. 258–266. [9] B. Taskar, M. Wong, P. Abbeel, and D. Koller, “Link prediction in relational data,” in Proc. Adv. Neural Inform. Process. Syst., Vancouver, B.C. Canada, 2003. [10] D. Liben-Nowell and J. M. Kleinberg, “The link prediction problem for social networks,” J. Am. Soc. Inform. Sci. Technol., vol. 57, pp. 556–559, May 2007.

AKAY et al.: NETWORK-BASED MODELING AND INTELLIGENT DATA MINING OF SOCIAL MEDIA FOR IMPROVING CARE

[11] Z. Lacroix, H. Murthy, F. Naumann, and L. Raschid, “Links and paths through life sciences data sources,” in Proc. 1st Int. Workshop Data Integr. Life Sci., Leipzig, Germany, 2004, pp. 203–211. [12] J. Noessner, M. Niepert, C. Meilicke, and H. Stuckenschmidt, “Leveraging terminological structure for object reconciliation,” in The Semantic Web: Research and Applications, Heidelberg, Berlin: Springer, 2010, pp. 334– 348. [13] M. E. J. Newman, “Detecting community structure in networks,” Eur. Phys. J., vol. 38, pp. 321–330, Mar. 2004. [14] J. Huan and J. Prins, “Efficient mining of frequent subgraphs in the presence of isomorphism,” in Proc. 3rd IEEE Int. Conf. Data Mining, Melbourne, Florida.FL, USA, 2003, pp. 549–552. [15] D. Hand, “Principles of data mining,” Drug Safety, vol. 30, pp. 621–622, Jul. 2007. [16] J. Hans and M. Kamber, Data Mining: Concepts and Techniques. 2nd ed. Burlington, MassMA, USA: Morgan Kaufmann, 2006 [17] C. Corley, D. Cook, A. Mikler, and K. Singh, “Text and structural data mining of influenza mentions in web and social media,” Int. J. Environ. Res. Public Health, vol. 7, pp. 596–615, Feb. 2010. [18] S. R. Das and M. Y. Chen, “Yahoo! for Amazon: Sentiment extraction from small talk on the Web,” Manag. Sci., vol. 53, pp. 1375–1388, Sep. 2007. [19] E. Riloff, “Little words can make a big difference for text classification,” in Proc. 18th Annu. Int. ACM SIGIR Conf. Res. Develop. Inform. Retrieval, Seattle, WashingtonWA, USA, 1995, pp. 130–136. [20] W. Yih, P. H. Chang, and W. Kim, “Mining online deal forums for hot deals,” in Proc. IEEE/WIC/ACM Int. Conf. Web Intell., Beijing, China, 2004, pp. 384–390. [21] B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up? Sentiment classification using machine learning techniques,” in Proc. ACL-02 Conf. Empirical Methods Natural Language Process., Philadelphia, PA, USA, 2002, pp. 79–86. [22] X. Feng, A. Cai, K. Dong, W. Chaing, M. Feng, N.S. Bhutada, J. Inciardi, and T. Woldemariam, “Assessing pancreatic cancer risk associated with dipeptidyl peptidase 4 inhibitors: data mining of FDA adverse event reporting system (FAERS),” J. Pharmacovigilance, vol. 1, Jul. 2013. [23] K. Y. Chan, C. K. Kwong, and T. C. Wong, “Modeling customer satisfaction for product development using genetic programming,” J. Eng. Design, vol. 22, no. 1, pp. 56–68, Jan. 2011. [24] I. Frommholz and M. Lechtenfeld, “Determining the polarity of postings for discussion search,” in Proc. Workshop Woche Lernen Wissen Adaptivit¨at., W¨urzburg, Germany, 2008, pp. 49–56. [25] J. Schectman, (2013, May 1). Glaxo Mined Online Parent Discussion Boards For Vaccine Worries. [Online]. Available (http:// blogs.wsj.com/cio/2013/05/01/glaxo-mined-online-parent-discussionboards-for-vaccine-worries/) [26] R. McBride, (2012, Aug. 1). Merck to Draw on Social Network for Psoriasis Patients. [Online]. Available (http://www.fiercebiotechit.com/story/merck-draw-social-networkpsoriasis-patients/2012–08–13) [27] A. Akay, A. Dragomir, and B. E. Erlandsson, “A novel data-mining approach leveraging social media to monitor consumer opinion of sitagliptin,” J. Biomed Health Inform. Vol: PP, Issue: 99. [28] E. Le Martelot and C. Hankin, “Multi-scale community detection using stability as optimisation criterion in a greedy algorithm,” Proceedings of the 2011 International. Conf. erence on Knowledge Discovery and Information Retrieval (KDIR 2011), Paris, France: SciTePress, Oct. 2011, pp. 216–225. [29] National Cancer Institute. (2014, Apr. 21). Erlotinib Hydrochloride. [Online]. Available: http://www.cancer.gov/cancertopics/druginfo/ erlotinibhydrochloride [30] World Cancer Research Fund International. (2013, Dec. 13). Cancer Statistics Worldwide. [Online]. Available: http://www.wcrf.org/ cancer_statistics/world_cancer_statistics.php [31] Centers for Disease Control and Prevention. (2013, Oct. 24). Cancer Data by Demographic. [Online]. Available: http://www.cdc.gov/cancer/ dcpc/data/men.htm [32] Centers for Disease Control and Prevention. (2013, Oct. 24). Cancer Data by Demographic. [Online]. Available: http://www.cdc. gov/cancer/dcpc/data/women.htm [33] I. Mierswa, M. Wurst, W. Michael, R. Klinkenberg, M. Scholz, and T. Euler, “YALE: Rapid prototyping for complex data mining tasks,” in Proc. 12th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Philadelphia, PA, USA, 2006, pp. 935–940.

217

[34] S. Bird, E. Klein, and E. Loper, Natural Language Processing With Python. Sebastopol, CA, USA: O’Reilly Media, 2009, pp. 504 [35] P. Soucy and G. W. Mineau, “Beyond TFIDF weighting for text categorization in the vector space model,” in Proc. 19th Int. Joint Conf. Artificial Intell., Edinburgh, U.K., 2005, pp. 1130–1135. [36] P. Bonato, P. J. Mork, D. M. Sherill, and R. H. Westgaard, “Data mining of motor patterns recorded with wearable technology,” IEEE Eng. Med. Biol. Mag., vol. 22, no. 3, pp. 110–119, May/Jun. 2003. [37] J. Vesanto, J. Himberg, E. Alhoniemi, and J. Parhankangas, “SelfOrganizing Map in MATLAB: The SOM Toolbox,” in Proc. Matlab DSP Conf., Espoo, Finland, 1999, pp. 35–40. [38] S. Wasserman and K. Faust, Social Network Analysis: Methods and Applications. New York, NY, USA: Cambridge University Press, 1994, pp. 825 [39] A. Arenas, A. Fernandez, and S. Gomez, “Analysis of the structure of complex networks at different resolution levels,” New J. Phys., vol. 10, p. 053039, 2008. [40] M. E. J. Newman and M. Girvan, “Finding and evaluating community structure in networks,” Phys. Rev. E, vol. 69, p. 026113, 2004. [41] M. N. Balak, Y. Gong, G. J. Riely, R. Somwar, A. R. Li, M. F. Zakowski, A. Chiang, G. Yang, O. Ouerfelli, M. G. Kris, M. Ladanyi, V. A. Miller, and W. Pao, “Novel D761Y and common secondary T790M mutations in epidermal growth factor receptor-mutant lung adenocarcinomas with acquired resistance to kinase inhibitors,” Clin. Cancer Res., vol. 12, no. 1, pp. 6494–501, 2006. [42] C. H. Yun, K. E. Mengwasser, A. V. Toms, M. S. Woo, H. Greulich, K. K. Wong, M. Meyerson, and M. J. Eck, “The T790M mutation in EGFR kinase causes drug resistance by increasing the affinity for ATP,” Proc. Nat. Acad. Sci., vol. 105, no. 6, pp. 2070–2075, 2008. [43] J. A. Engelman, K. Zejnullahu, T. Mitsudomi, Y. Song, C. Hyland, J. O. Park, N. Lindeman, C. M. Gale, X. Zhao, J. Christensen, T. Kosaka, A. J. Holmes, A. M. Roger, F. Cappuzzo, T. Mok, C. Lee, B. E. Johnson, L. C. Cantley, and P. A. J¨anne, “MET amplification leads to Gefitinib resistance in lung cancer by activating ERBB3 signaling,” Science, vol. 316, no. 5827, pp. 1039–1043, 2007. [44] A. Anna, C. Liza, A. L. C. Agero, S. W. Dusza, C. Benvenuto-Andrade, K. J. Busam, P. Myskowski, and A. C. Halpem, “Dermatologic side effects associated with the epidermal growth factor receptor inhibitors,” J. Am. Acad. Dermatology, vol. 55, pp. 657–670, 2006. [45] R. Esther, E. Roe, M. P. Garcia Muret, E. Marcuello, J. Capdevila, C. Pallares, and A. Alomar, “Description and management of cutaneous side effects during cetuximab or erlotinib treatments: a prospective study of 30 patients,” J. Am. Acad. Dermatology, vol. 55, pp. 429–437, 2006. [46] R. Caroline, C. Robert, J. C. Soria, A. Spatz, A. Le Cesne, D. Malka, P. Pautier, J. Wechsler, C. Lhomme, B. Escudier, V. Boige, J. P. Armand, and T. Le Chevalier, “Cutaneous side-effects of kinase inhibitors and blocking antibodies,” Lancet Oncology, vol. 6, pp. 491–500, 2005. [47] K. Faust, “Very local structure in social networks,” Sociological Methodology, vol. 37, pp. 209–256, Nov. 2007. [48] K. Faust, “Comparing social networks: Size, Density and Local Structure,” Adv. Methodology Statist., vol. 3, no. 2, pp. 185–216, 2006.

Altug Akay (M’11) received the B.A. degree from the University of Massachusetts, Amherst, MA, USA, in 2005, and the M.A. degree from Dartmouth College, Hanover, NH, USA, in 2007. He is currently working toward the Ph.D. degree in the School of Technology and Health at the Royal Institute of Technology, Huddinge, Sweden. Upon graduating from Dartmouth, he worked at Siemens Healthcare, Erlangen, Germany, for one year under the supervision of Dr. Gudrun Zahlmann. He worked at Spaulding Rehabilitation Hospital under the supervision of Dr. Paolo Bonato from 2008 to 2009. He also contributed a book chapter entitled “ß-thalassemia’s social and economic geography: A possible prevention/treatment program to rout “legacy” genetic mutations,” in the book entitled “Mathematical Methods in Scattering Theory and Biomedical Engineering,” (London, U.K.: World Scientific, 2008). His current research interests include public health, health policy, and health technology.

218

IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. 19, NO. 1, JANUARY 2015

Andrei Dragomir (M’09) received the B.S. degree in electrical engineering at the Politehnica University of Bucharest, Bucharest, Romania, in 1999 and the M.S. and Ph.D. degrees in biomedical engineering at the University of Patras, Patras, Greece, in 2002 and 2006 respectively. He was a Postdoctoral Research Associate at the Arizona State University, Tempe, AZ, USA, from 2006 to 2010. He is currently an Assistant Professor at the University of Houston, Houston, TX, USA. His current research interests include biocomplexity, neural engineering, pattern recognition, machine learning, and bioinformatics.

Bj¨orn-Erik Erlandsson (SM’82) received the Ph.D. degree from Chalmers University of Technology, Gothenburg, Sweden, in the field of applied electronics, in 1977. He is currently a Senior Advisor and Professor at the School of Technology and Health, Royal Institute of Technology, Stockholm, Sweden, primarily responsible for technology and quality in health care. His work experience ranges from research and development, quality management, and regulatory affairs in the international medical device industry, Siemens, and Nobel Industries. He worked as a Quality Manager in the pacemaker industry and has been the Director in Medical Informatics and Technology at the University Hospital of Northern Sweden, Ume˚a, Sweden, and Akademiska Hospital, Uppsala, Sweden, and also as a Professor in Biomedical Engineering. In his role as a Director for technical operations at the university hospitals, he has also been heavily involved in investment issues and investment management, and chairman of the investment planning groups. He is also involved in the standardization work in medical technology and medical informatics, chairman of SIS/TK334. Dr. Erlandsson was the Chairman of the Joint Working Group in Software and Medical Devices (SAMD) at CEN / CENELEC during 2010, and the Cochairman during 2011. He was the Chairman of the Medical Society’s Division of Medical Informatics for two years, Member of the Swedish Council on Health Technology Assessment, SBU Alertr˚ad, and Coordinator EU policies at the Royal Institute of Technology’s Life.

Suggest Documents