
IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. 19, NO. 1, JANUARY 2015


A Novel Data-Mining Approach Leveraging Social Media to Monitor Consumer Opinion of Sitagliptin

Altug Akay, Member, IEEE, Andrei Dragomir, Member, IEEE, and Björn-Erik Erlandsson, Senior Member, IEEE

Abstract—A novel data mining method was developed to gauge the experience of the drug Sitagliptin (trade name Januvia) by patients with diabetes mellitus type 2. To this goal, we devised a two-step analysis framework. Initial exploratory analysis using self-organizing maps was performed to determine structures based on user opinions among the forum posts. The results were a compilation of user’s clusters and their correlated (positive or negative) opinion of the drug. Subsequent modeling using network analysis methods was used to determine influential users among the forum members. These findings can open new avenues of research into rapid data collection, feedback, and analysis that can enable improved outcomes and solutions for public health and important feedback for the manufacturer. Index Terms—Data mining, network analysis, self-organizing map, social media.

I. INTRODUCTION

SOCIAL media, ranging from personal messaging to live forums, gives patients extensive opportunities to discuss their experiences with drugs and devices. It gives companies equally extensive opportunities to receive feedback on their products and services [1]–[3]. Pharmaceutical companies already treat social network monitoring as a top priority within their IT departments, which creates an opportunity for rapid dissemination of, and feedback on, products and services in order to optimize and enhance delivery, increase turnover and profit, and reduce costs [4]. Recently, methods for harvesting social media for biosurveillance have also been reported [5]. Social media enables communication, collaboration, information collection, and sharing in the healthcare space. It therefore provides a virtual social networking environment. An appropriate way to extract knowledge and trends from the information “cloud” is to model social media using available network modeling and computational tools (such as network-based analysis methods). Under this paradigm, a social network (Facebook, Twitter, WebMD, etc.) is a structure made of nodes (individuals or organizations) and edges that connect nodes in various relationships such as interests, friendship, kinship, etc. The most

Manuscript received October 19, 2013; accepted December 8, 2013. Date of publication December 23, 2013; date of current version December 30, 2014. A. Akay and B.-E. Erlandsson are with the School of Technology and Health, Royal Institute of Technology, 10044 Stockholm, Sweden (e-mail: [email protected]; [email protected]). A. Dragomir is with the Department of Biomedical Engineering, University of Houston, Houston, TX 77204 USA (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/JBHI.2013.2295834

common way to represent the information is a graph, which is very convenient for visualization. Network modeling may offer an in-depth understanding of social network dynamics. A network model can be used for simulation studies of various network properties, such as how users disseminate information among themselves (e.g., news about a pandemic or a drug’s adverse effects). Another example is studying how certain edges of a network are strengthened and how specific information affects that strengthening (e.g., how user communities evolve based on common interests about specific diseases). Information extracted from social media can be represented as a matrix (called the sociomatrix, or adjacency matrix), from which the network representation is constructed. Social networks are typically very sparse, and this sparsity can be leveraged to analyze the constructed networks efficiently. Node degree, network density, and other large-scale parameters provide information about the importance of certain entities within the network (drug brands, healthcare providers, pharmaceutical companies, or device manufacturers). Networks also contain communities, which are also called clusters or modules. Finding a community in a social network means identifying nodes that interact with each other more frequently than with nodes outside of the group. Specific algorithms perform this network clustering, one of the fundamental tasks in network analysis. Community detection can facilitate the extraction of valuable information for the whole healthcare industry. The pharmaceutical industry could use it to better target its marketing spending. Healthcare providers could better understand the level of patient satisfaction with their services (and minimize adverse events).
Physicians could collect important feedback (stored in the labels characterizing these network modules) from other doctors and patients that would help them in their treatment recommendations and thereby improve treatment results. Finally, patients may evaluate and leverage other consumers’ knowledge to make better-informed healthcare decisions. Social networks are heterogeneous, multirelational, and semistructured, which makes gathering such data difficult. One method of social media data mining is link (relationship) mining, which combines social networks, link analysis, hypertext and web mining, graph mining, relational learning, and inductive logic programming [6]. Link mining involves several tasks: link-based object classification (categorizing objects based on links and attributes) [7], object type prediction (predicting object types based on attributes, links, and the objects linked to them) [8], link type prediction (predicting the purpose of a link based on the objects involved) [9], link existence prediction (predicting the existence of a link) [10], link cardinality estimation (predicting

2168-2194 © 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications standards/publications/rights/index.html for more information.



the number of links (and objects reached) to an object) [11], object reconciliation (determining whether two objects are the same based on their links) [12], group detection (predicting whether a set of objects belongs together) [13], sub-graph detection (discovering sub-graphs within networks) [14], and metadata mining (mining for data about data) [15], [16]. Other examples of mining social networks include link prediction, which uses features intrinsic to the current model of a social network to model future connections within the network. Viral marketing exploits the word-of-mouth effect by measuring the interactions among customers and carefully marketing to the individuals with the most social connections. Newsgroup discussion mining takes advantage of “response” relationships, based on how often people respond to messages they agree (or disagree) with, using graph-partitioning algorithms. Relation selection and extraction in a multirelational network measures and ranks different relations based on user information (acquired through queries). Traditional social sciences use surveys and involve subjects in the data collection process. Limited by this process, the collected data sets are small, typically hundreds of subjects in one study. By contrast, thousands of social media users produce very large amounts of data with rich user interactions. There are two simple ways to extract this information: 1) crawling using site-provided application programming interfaces (APIs) or 2) scraping the needed information from rendered HTML pages. Many social media sites provide APIs: Twitter, Facebook, YouTube, Flickr, etc. We can also follow how the properties of a network change over time, which would be of great interest for public health studies. Past studies have used computational approaches for sentiment extraction of user opinion. Corley et al. analyzed how influenza posts by users in social media correlated with patient reporting data [17].
Das and Chen performed sophisticated sentiment analysis of technology stock market quotes in message boards [18]. Riloff studied the context surrounding words of interest using singular and plural nouns, verb forms, negation, and prepositions [19]. Yih et al. developed a system to intelligently search for the best deals for consumers in an online deal forum [20]. Pang et al. tested three classifiers (Naïve Bayes, maximum entropy, and support vector machines) to ascertain how they performed on sentiment classification [21]. Feng et al. mined data collected from the Food and Drug Administration’s Adverse Event Reporting System [22]. Chan et al. used genetic programming to model consumer satisfaction with interaction terms and higher order terms [23]. Frommholz and Lechtenfeld discussed the polarity of postings using term and context features in news sources [24]. Furthermore, literature searches have revealed that companies have scoured social media platforms to assess user sentiment [25], [26]. While the data-mining literature is extensive, none of these studies has identified influential users or examined how forum relationships affect the opinions and behaviors of other users. The novelty of our approach is threefold. First, in contrast with already published studies, we identified influential users, and to this end, our approach takes into account how forum relationships affect the opinions and behaviors of users. Second, we built on the approach introduced in [18] and automatically tagged both positive and negative words based on the context

of each post. Finally, our approach relies on word frequency statistics and employs an exploratory analysis stage based on self-organizing maps (SOMs), followed by a network analysis stage, based on adapted graph theory methods, that aims at identifying user communities (modules) and potential influential users.

II. METHODS

A. Forum Search

The first step was to find the most popular forum dedicated to diabetes mellitus type 2. We compared the search results of four of the most popular diabetes-related message boards: DiabetesForum.com, Healthboards.com, Foum, and DiabetesDaily.com. A list of drugs and devices used by patients was compiled and searched within the results of each of the message boards; the aim was to ascertain which drugs and/or devices the patients were discussing the most. Sitagliptin was found to be the most discussed drug, based on the large number of posts about it on the message boards. The message board DiabetesDaily.com was chosen because Sitagliptin was discussed there more frequently than on the other message boards. On DiabetesDaily.com, a search using the trade name “Januvia” garnered more results (2600 results) than the drug name “Sitagliptin” (92 results). The search using “Januvia” also returned 713 posts, ranging from 2007 to the present time.

B. Initial Text Mining and Preprocessing

After compiling the posts, we fed the list to a modified decision-making tree in Rapidminer (www.rapidminer.com) [27] to ascertain the most commonly used words. This decision-making tree removed unwanted characters and common stop words. The end result was an initial wordlist containing the forum posts with the term frequency-inverse document frequency (TF-IDF) scores.
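As a rough illustration of this preprocessing step, the sketch below strips unwanted characters, drops stop words, and scores each remaining word with the log-scaled weighting of [28], log(tf + 1) · log(n / x), where tf is the count of the word in a post, n is the number of posts, and x is the number of posts containing the word. It is a minimal Python sketch under stated assumptions, not the Rapidminer process used in the study: the stop-word list, the tokenizer, and the three example posts are hypothetical.

```python
import math
from collections import Counter

# Hypothetical stop-word list for illustration; the study's list is not given.
STOP_WORDS = {"i", "the", "a", "an", "and", "is", "it", "to", "of", "my", "so", "on", "was"}


def tokenize(post):
    """Lowercase a post, replace non-letter characters with spaces, drop stop words."""
    cleaned = "".join(ch if ch.isalpha() else " " for ch in post.lower())
    return [w for w in cleaned.split() if w not in STOP_WORDS]


def tfidf_weights(posts):
    """Return one {word: weight} dict per post.

    weight = log(tf + 1) * log(n / x) for words that occur in the post;
    words that do not occur have no entry (weight 0).
    """
    docs = [Counter(tokenize(p)) for p in posts]
    n = len(docs)
    doc_freq = Counter()
    for counts in docs:
        doc_freq.update(counts.keys())  # each post counts once per word
    return [
        {w: math.log(tf + 1) * math.log(n / doc_freq[w]) for w, tf in counts.items()}
        for counts in docs
    ]


# Made-up posts, not taken from the forum data.
posts = [
    "Januvia works great, my numbers are great.",
    "Januvia gave me nausea and a headache.",
    "No nausea on Januvia so far.",
]
weights = tfidf_weights(posts)
```

In this toy collection a word that occurs in every post (here "januvia") receives weight 0, which is why such words do not help separate positive from negative posts.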
The words were then divided into two categories: “positive” and “negative.” The components of the weight vector of each posting use the TF-IDF scheme

$$
\mathrm{weight}_{t,d} =
\begin{cases}
\log(tf_{t,d} + 1)\,\log\dfrac{n}{x_t}, & \text{if } tf_{t,d} \geq 1 \\
0, & \text{otherwise}
\end{cases}
$$

where $tf_{t,d}$ is the frequency of word $t$ in document $d$, $n$ is the number of documents in the collection, and $x_t$ is the number of documents in which word $t$ occurs [28].

C. Text Tagging and Classification

The words with the highest TF-IDF scores were located in the forum posts and were then tagged, using Python and the NLTK toolkit [29], according to whether, in context, they retained their usual polarity (the negativity of a negative word, the positivity of a positive word). For example, the phrase “I do not feel great” resulted in the word “great” being tagged as “great__n” before it was returned to its position in the post. Das and Chen used a similar approach in classifying words [30]. We went one step further and added a positive tag on negative words. A sentence that states “No side


TABLE I FINAL POST-ANALYSIS WORDLIST

effects so I am happy!” resulted in the word “No” being tagged as “No__p” before it was returned to its position in the post. These tagged words were then reclassified based on the context of the post. The next step was to reduce the number of similar words. This was done both manually [checking the words using online dictionaries such as Merriam-Webster (http://www.merriamwebster.com/)] and automatically [using synonym database software such as Thesaurus Synonym Database (http://www.languagedatabases.com/) and Google’s synonym search (using ‘∼’ after a word)]. Reducing the wordlist with these methods resulted in Table I, which shows the words divided into positive and negative terms. The terms associated with positive or negative meanings are tagged. The selection was based on both the frequency of the words used in the forum and the context in which the words were used in the posts. Furthermore, each word that appeared fewer than ten times was eliminated. This allowed us to achieve a uniform set of measurements while eliminating statistically insignificant outliers. The end result was a modified wordlist of 28 words (14 untagged and tagged positive words, and 14 untagged and tagged negative words), shown in Table I. Before feeding the collected data to the SOM for exploratory analysis, all posts were manually labeled as positive, negative, or neutral according to the general user opinion observed within the post. The manual labels served as a means of validating the results.

D. Self-Organizing Map

The SOM is an artificial neural network used for clustering that produces a low-dimensional representation of high-dimensional data [31]. A neural layer onto which the input data are projected represents the output space, with each neuron corresponding to a cluster and carrying an attached weight vector. The values of the weight vectors reflect the content of the cluster they are attached to.
The SOM presents the available data to the network, linking similar data vectors to the same neurons.


We used the SOM because of its visualization benefits and high-level capabilities, which greatly facilitated the analysis of the high-dimensional data. Bonato et al. have shown how vector quantization algorithms reduce the size of the feature space without losing the information needed to identify clusters in the classification space [32]. During training, each new input vector is presented to the network, which determines the closest weight vector and assigns the data vector to the matching neuron; this neuron (and its neighbors) then undergo an adaptation process that moves their weight vectors toward the input. Neurons farther from the matching neuron adapt their weight vectors to a smaller degree. The process repeats for all input vectors until the convergence criteria are met. The end result is a 2-D map. We took the modified wordlist and fed it into the SOM Toolbox (http://www.cis.hut.fi/projects/somtoolbox/) in MATLAB [33] to see whether specific vectors clustered together based on the words from the wordlist. We trained the SOM with different map sizes and chose the quantization and topographic errors as internal validation measures. The quantization error is computed as the average distance between each input vector and its best matching neuron (BMN) and is a measure of how well the trained map fits the input data [31]. The topographic error considers the map structure and represents the accuracy of the map in preserving topology. It is calculated as the proportion of all data vectors for which the first and second BMNs are not adjacent. The optimum map size was chosen based on the minimum values of the quantization and topographic errors (0.1257 and $10^{-7}$, respectively). The wordlist vectors were mapped onto the SOM, and the emerging clusters were further examined for correlations with the positive or negative variables of the wordlist vectors. Cluster groups containing three or fewer posts, and no words of interest, were eliminated.
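The training loop and the two validation measures described above can be sketched in a few lines of Python. This is an illustrative toy, not the MATLAB SOM Toolbox used in the study: the Gaussian neighborhood, the linear decay schedules, the 8-neighbor definition of adjacency, the 3 × 3 map, and the six 2-D input vectors are all assumptions made for the example.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def two_best(weights, v):
    """Grid positions of the best and second-best matching neurons for v."""
    ranked = sorted(weights, key=lambda pos: dist(weights[pos], v))
    return ranked[0], ranked[1]


def train_som(data, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a small rectangular SOM (illustrative settings, Gaussian neighborhood)."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                   # learning rate decays linearly
        sigma = max(sigma0 * (1.0 - frac), 0.5)   # neighborhood radius shrinks
        for v in rng.sample(data, len(data)):     # present inputs in random order
            bmn, _ = two_best(weights, v)
            for pos, w in weights.items():
                grid_d2 = (pos[0] - bmn[0]) ** 2 + (pos[1] - bmn[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma ** 2))  # farther neurons adapt less
                for k in range(dim):
                    w[k] += lr * h * (v[k] - w[k])
    return weights


def quantization_error(weights, data):
    """Average distance between each input vector and its best matching neuron."""
    return sum(dist(weights[two_best(weights, v)[0]], v) for v in data) / len(data)


def topographic_error(weights, data):
    """Proportion of inputs whose first and second BMNs are not adjacent on the grid."""
    errors = 0
    for v in data:
        (r1, c1), (r2, c2) = two_best(weights, v)
        if max(abs(r1 - r2), abs(c1 - c2)) > 1:   # 8-neighborhood adjacency (assumption)
            errors += 1
    return errors / len(data)


# Toy data: two separated groups of 2-D vectors (not the study's wordlist vectors).
data = [[0.1, 0.1], [0.15, 0.05], [0.05, 0.12], [0.9, 0.95], [0.85, 0.9], [0.95, 0.88]]
som = train_som(data, rows=3, cols=3)
qe = quantization_error(som, data)
te = topographic_error(som, data)
```

Repeating the last three lines for several map sizes and keeping the size with the smallest errors mirrors the selection procedure described in the text.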
The word occurrences were counted in the remaining cluster groups. We then visually identified subgroups within the map (“positive words” and “negative words”) and ascertained which posts gravitated toward which words and whether the map reflected consumer satisfaction (or dissatisfaction) with Sitagliptin.

E. Modeling Forum Postings Using Network Analysis

The next step was to further scrutinize the forum posts with the goal of identifying influential users. To this end, we built networks from forum posts and their replies. Networks consist of nodes and connections. Networks are either nondirectional (a connection joins two nodes without a direction) or directional (a connection runs from a node of origin to a destination node). In a nondirectional network, the nodal degree is the number of connections of a node; in a directional network, it distinguishes the connections leaving a node from the connections arriving at it. Wasserman and Faust [34] identified four types of nodes within a directional network: isolated (no connections to or from other nodes), transmitter (connects to other nodes but receives no connections), receptor (receives connections but does not connect to other nodes), and carrier (both connects to other nodes and receives connections). The density of a network measures the proportion of possible connections that are actually present.
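The four node types and the density measure can be made concrete with a short Python sketch. The node names and edges are hypothetical, and the density is computed with the usual directed definition, the number of directed connections L divided by f(f − 1) for f nodes.

```python
def node_roles(nodes, edges):
    """Classify each node by its in-degree and out-degree."""
    out_deg = {n: 0 for n in nodes}
    in_deg = {n: 0 for n in nodes}
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    roles = {}
    for n in nodes:
        if out_deg[n] == 0 and in_deg[n] == 0:
            roles[n] = "isolated"      # no connections at all
        elif in_deg[n] == 0:
            roles[n] = "transmitter"   # sends but never receives
        elif out_deg[n] == 0:
            roles[n] = "receptor"      # receives but never sends
        else:
            roles[n] = "carrier"       # both sends and receives
    return roles


def directed_density(nodes, edges):
    """Directed density L / (f * (f - 1)): present connections over possible ones."""
    f = len(nodes)
    return len(set(edges)) / (f * (f - 1)) if f > 1 else 0.0


# Hypothetical five-user network.
nodes = ["u1", "u2", "u3", "u4", "u5"]
edges = [("u1", "u2"), ("u2", "u1"), ("u2", "u3"), ("u5", "u3")]
roles = node_roles(nodes, edges)
density = directed_density(nodes, edges)
```

Here u1 and u2 are carriers, u3 is a receptor, u4 is isolated, u5 is a transmitter, and the density is 4 / 20 = 0.2.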


Fig. 1. Nodes represent users/posts and the edges represent information among users.

For directional networks, the density divides the number of directed connections by the maximum possible number of connections:

$$
\Delta = \frac{L}{f(f-1)}
$$

where $L$ is the number of connections and $f$ is the total number of nodes. We used a network-based analysis approach because of its widespread use in social network analysis and the ease with which it allows user interactions and relationships to be studied and modeled. We used the directional network model because of the nature of the forum and its internal dynamics among the members. The approach we chose to build our network is described in Fig. 1, which shows how each posting-reply pair is modeled. We started by creating nodes for posts containing direct replies (responses to previous posts in the forum) and added bidirectional edges connecting these nodes, as described in Fig. 1. We used bidirectional edges in such cases to reflect the ensuing information transfer (from the initial poster to the replier and vice versa, based on the assumption that they both read the initial post and its reply). Following this, we added additional edges to the subsequent posts (coded in green in Fig. 1). These edges are unidirectional, based on the realistic assumption that the subsequent posts continued to discuss the topic thread (initial post). We set a threshold of three on the number of subsequent posts that are considered as influenced by the initial post. This threshold was set based on our empirical observation of posting contents and their timing.

F. Identifying Subgraphs

Our modeling framework consequently converted the forum posts into several large directional networks containing a number of densely connected units (or subnetworks) and unconnected nodes, as shown in Fig. 2. We pruned the initial networks to identify strongly connected components (or information modules). A strongly connected component is defined for directed networks as a subnetwork in which every two nodes u and v are connected to each other by at least two paths (along the connecting edges): one from u to v and one from v to u [35].
The algorithm we used for retrieving these strongly connected components employs a depth-first search approach [36]. Identifying strongly connected components ensures that information transfer within the subnetwork is maximized. Fig. 3 presents the strongly connected component (information module) obtained from the network in Fig. 2.
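The construction and pruning steps can be sketched as follows. This is one possible reading of the rules above, not the authors' implementation. It assumes a thread is an ordered list of posts with a hypothetical `reply_to` field, links direct replies in both directions, and links the initial post one way to at most three subsequent posts. The strongly connected components are found with Kosaraju's two-pass depth-first search, chosen here as one standard DFS-based method.

```python
def build_thread_network(posts, window=3):
    """Build (nodes, edges) for one thread; posts[0] is taken as the initial post."""
    nodes = [p["id"] for p in posts]
    edges = set()
    root, followers = nodes[0], 0
    for post in posts[1:]:
        target = post.get("reply_to")
        if target is not None:
            # Direct reply: information is assumed to flow both ways.
            edges.add((target, post["id"]))
            edges.add((post["id"], target))
        elif followers < window:
            # Subsequent post: one-way influence from the initial post.
            edges.add((root, post["id"]))
            followers += 1
    return nodes, edges


def strongly_connected_components(nodes, edges):
    """Kosaraju's algorithm (iterative DFS); returns a list of node sets."""
    fwd = {n: [] for n in nodes}
    rev = {n: [] for n in nodes}
    for src, dst in edges:
        fwd[src].append(dst)
        rev[dst].append(src)
    # Pass 1: record nodes in order of DFS completion on the original graph.
    order, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(fwd[start]))]
        while stack:
            node, neighbors = stack[-1]
            nxt = next((m for m in neighbors if m not in seen), None)
            if nxt is None:
                order.append(node)
                stack.pop()
            else:
                seen.add(nxt)
                stack.append((nxt, iter(fwd[nxt])))
    # Pass 2: DFS on the reversed graph in decreasing completion order.
    components, assigned = [], set()
    for start in reversed(order):
        if start in assigned:
            continue
        component, stack = set(), [start]
        assigned.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for m in rev[node]:
                if m not in assigned:
                    assigned.add(m)
                    stack.append(m)
        components.append(component)
    return components


# Hypothetical thread: p2 replies to p1, p3 replies to p2, p4..p7 follow without replying.
thread = [{"id": "p1"}, {"id": "p2", "reply_to": "p1"}, {"id": "p3", "reply_to": "p2"},
          {"id": "p4"}, {"id": "p5"}, {"id": "p6"}, {"id": "p7"}]
nodes, edges = build_thread_network(thread)
modules = [c for c in strongly_connected_components(nodes, edges) if len(c) > 1]
```

In this toy thread only the reply chain p1, p2, p3 forms an information module; p4 to p6 receive one-way edges from p1, and p7 falls outside the three-post threshold and stays unconnected.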

Fig. 2. One of the initial networks built from the Diabetes Daily forum. The forum consisted of 711 nodes, 843 edges, and 34 networks containing more than two connected nodes.

Fig. 3. Information module obtained as strongly connected components of larger networks.

G. MAO and User Average Opinion

We further refined the obtained information modules by enriching them with information from the posts via the corresponding wordlist vectors. At this step, we used the TF-IDF scores of the wordlists to derive two measures characterizing user opinion. We first defined a global measure (characterizing the whole information module), the module average opinion (MAO), by examining the TF-IDF scores of all postings corresponding to the nodes within a specific module

$$
\mathrm{MAO} = \frac{\mathrm{Sum}_{+} - \mathrm{Sum}_{-}}{\mathrm{Sum}_{\mathrm{all}}}
$$

where $\mathrm{Sum}_{+} = \sum_{i}\sum_{j} x_{ij}$ is the sum of all TF-IDF scores corresponding to positive variables (words) of the wordlists, for all users within the current module. The index $i$ is the node/post index. The index $j$ is the wordlist index (running over the positive variables of the wordlists). $\mathrm{Sum}_{-} = \sum_{i}\sum_{j} x_{ij}$ is the sum of all TF-IDF scores corresponding to negative variables of the wordlists, for all users within the current module; here $i$ is again the node/post index and $j$ runs over the negative variables of the wordlists. $\mathrm{Sum}_{\mathrm{all}} = \sum_{i=1}^{N}\sum_{k=1}^{M} x_{ik}$ is the total of both sums. The index $k$ runs across all variables of the wordlist.

In a similar manner, we also defined a local measure characterizing user opinion (specific to each node in the module), the user average opinion (UAO), by examining the TF-IDF scores of the post corresponding to the specific node

$$
\mathrm{UAO}_i = \frac{\mathrm{Sum}_{i+} - \mathrm{Sum}_{i-}}{\mathrm{Sum}_{i,\mathrm{all}}}
$$

where $\mathrm{Sum}_{i+} = \sum_{j \in P} x_{ij}$ is the sum of all TF-IDF scores corresponding to positive variables (words) for the $i$th user’s wordlist, and $P$ is the set of indices denoting the positive variables of the wordlist. $\mathrm{Sum}_{i-} = \sum_{j \in N} x_{ij}$ is the sum of all TF-IDF scores corresponding to negative variables (words) for the $i$th user’s wordlist, and $N$ is the set of indices denoting the negative variables of the wordlist (in this formula $N$ denotes an index set, not the upper summation limit used for MAO). $\mathrm{Sum}_{i,\mathrm{all}} = \sum_{j=1}^{M} x_{ij}$ is the total of both sums. The index $j$ runs over the whole wordlist.

H. Information Brokers Within the Information Modules

In order to identify influential users within the modules, we first ranked individual nodes in terms of their total number of connecting edges (in- and out-degree). We then searched for nodes within each module that fulfilled the following criteria:
1) They are influential users (nodes with the largest degrees).
2) Their UAO scores agree in sign with the MAO score of the module (both MAO > 0 and UAO > 0, or both MAO < 0 and UAO < 0).
We named the nodes fulfilling the above criteria information brokers, based on the fact that they possess the highest number of connections in the strongly connected information modules.

III. RESULTS
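The opinion measures of Section II-G and the broker criteria of Section II-H can be illustrated with a small Python sketch. It is a hedged example rather than the study's code. The four-variable wordlist, the TF-IDF vectors, and the module edges are made up, and "largest degrees" is interpreted here as the single maximum total degree within the module, although a top-k cutoff would be an equally plausible reading.

```python
def opinion_score(vectors, positive_idx, negative_idx):
    """(Sum+ - Sum-) / Sum_all pooled over the given TF-IDF vectors.

    With one vector this is a UAO; with all vectors of a module it is the MAO.
    Returns 0.0 when every score is zero, to avoid dividing by zero.
    """
    pos = sum(v[j] for v in vectors for j in positive_idx)
    neg = sum(v[j] for v in vectors for j in negative_idx)
    total = sum(sum(v) for v in vectors)
    return (pos - neg) / total if total else 0.0


def find_brokers(module_nodes, edges, vectors, positive_idx, negative_idx):
    """Return (MAO, brokers): top-degree nodes whose UAO has the same sign as the MAO."""
    members = set(module_nodes)
    degree = {n: 0 for n in module_nodes}
    for src, dst in edges:
        if src in members and dst in members:
            degree[src] += 1   # out-degree contribution
            degree[dst] += 1   # in-degree contribution
    mao = opinion_score([vectors[n] for n in module_nodes], positive_idx, negative_idx)
    top = max(degree.values())
    brokers = []
    for n in module_nodes:
        uao = opinion_score([vectors[n]], positive_idx, negative_idx)
        if degree[n] == top and mao * uao > 0:   # same sign, both nonzero
            brokers.append(n)
    return mao, brokers


# Hypothetical module: variables 0-1 are positive words, 2-3 are negative words.
positive_idx, negative_idx = (0, 1), (2, 3)
vectors = {
    "a": [0.5, 0.5, 0.0, 0.0],   # UAO = 1.0
    "b": [0.4, 0.0, 0.1, 0.0],   # UAO = 0.6
    "c": [0.0, 0.1, 0.3, 0.1],   # UAO = -0.6
}
edges = {("a", "b"), ("b", "a"), ("a", "c"), ("c", "a")}
mao, brokers = find_brokers(["a", "b", "c"], edges, vectors, positive_idx, negative_idx)
```

In this toy module the MAO is (1.5 − 0.5) / 2.0 = 0.5, and node "a" is the only broker because it has the highest total degree and a positive UAO.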

Fig. 4 is the graphical representation of the SOM of the positive and negative word groups in the DiabetesDaily forum. Prior to the final SOM, a subset of the data was used for training the SOM. This was to ensure that the SOM was trained to accurately model a sample of the data prior to receiving the whole dataset. To this end, 30% of the data were selected for training the SOM. We used a 13 × 13 map with the 28 variables from the modified wordlist to ascertain how the weights of the words corresponded to the opinion of the drug Sitagliptin. A criterion for selecting the variables was that each word should appear at least ten times. This allowed us to achieve a uniform set of measurements while eliminating statistically insignificant outliers. The bulk of the users’ posts converged on four points of the map. We checked the correlation of the respective nodes with the values of their weight vectors corresponding to positive or negative words. This is how we defined the positive and negative areas of the map. A picture emerges of user opinion that is roughly divided with regard to satisfaction (or lack thereof) with the drug Sitagliptin. One source of negative opinion is the side effects of the drug. A review of the medical literature confirmed the very same side effects that the users were discussing [37]–[40]. Other sources of negative opinion range from user frustration with the cost of the drug to frustration with the medical community. Positive opinions mainly stemmed from satisfaction among users who switched to the drug based on a recommendation from a physician. The SOM analysis reflected the rough



Fig. 4. Results of the SOM analysis on posts from the DiabetesDaily forum. Top left panel shows the unified distance matrix in which several user clusters can be observed. The rest of the panels display individual word list variables values obtained after SOM map training.

division of user opinion of Sitagliptin on the forum, based on the reasons stated previously. The next step was to identify specific, influential users within the forum. On the DiabetesDaily forum, six users out of the 711 posters were identified as information brokers, as shown in Fig. 5. The figure shows the information modules in which these users reside. The densities of these modules range from 0.25 (for the module containing User #18, top right panel in Fig. 5) to 0.55 (for the module containing User #109, bottom right panel in Fig. 5). These density values lie within, and towards the upper limit of, the interval of density values generally observed in social networks, which supports our network modeling approach [41], [42]. The information brokers (circled in red) were classified as carriers according to the node types of Wasserman and Faust [34]. They received connections from, and connected to, other nodes in the network, and their connections were the densest. The directional nature of the networks represents the level of interaction between the carriers and other users. A thorough reading of the posts of these six users revealed that they were mostly informative, combining information from Internet sources and from personal experience with Sitagliptin. Their knowledge and experience regarding Sitagliptin were positively received and sought after by other members. These users were also active in answering questions that other users (from newcomers to long-time members) had concerning Sitagliptin. Their forum “behavior” confirmed to us that these users

were the premier information brokers for the drug Sitagliptin on the DiabetesDaily forum.

Fig. 5. Six users were identified as information brokers on the DiabetesDaily forum. The modules in which these six users reside are shown in this figure. Network densities for the displayed modules are: 0.40 (top left module), 0.25 (top right), 0.42 (mid left), 0.40 (mid right), 0.42 (bottom left), and 0.55 (bottom right).

IV. DISCUSSION

The goal of this study was to transform the posts of a forum dedicated to diabetes mellitus type 2 into vectors in order to intelligently mine consumer opinion of the drug Sitagliptin. The results open new opportunities, and challenges, for developing more comprehensive solutions in this area. The mixed consensus with regard to Sitagliptin depends on individual patient outcomes. On a social media platform, individuals report different outcomes, depending on various individual factors and circumstances. Despite such factors, we were able to sift through the data and find positive and negative sentiment, which was later confirmed by the research that emerged regarding Sitagliptin’s effectiveness and side effects. Up-to-date information in future studies will provide a much clearer picture of user feedback on drugs and services. A comprehensive solution will require developing a more thorough web of user influence and of how it affects interactions among other users. This will require studying user interactions, friendships with other members, and rankings within the social media platform(s). Additionally, more advanced text mining methods will require incorporating advanced lexical dictionaries and processing methods that look at both the formal definitions of words (specific diseases, treatments, side effects, etc.) and the informal language (slang for specific diseases, treatments, side effects, etc.) used on social media platforms. Finally, a more thorough search across different social platforms will provide up-to-date information on the status of the drug in case one social platform ceases to discuss it (e.g., no new threads posted on the forums about the drug).

In the future, the solution can be added as part of a software package on either the patient’s mobile device or physiological monitoring device. This solution can benefit all of the stakeholders: patients can give valuable feedback on products and services directly to the company, and companies can take advantage of direct and prompt feedback to develop an overall view of the status of a product and/or service and decide which components demand immediate attention and improvement. Social media is becoming an expanding venue for people to express their thoughts and ideas. It represents a gold mine for companies seeking to optimize health delivery, reduce costs, and improve and further develop their products.

REFERENCES

[1] A. Ochoa, A. Hernandez, L. Cruz, J. Ponce, F. Montes, L. Li, and L. Janacek, “Artificial societies and social simulation using ant colony, particle swarm optimization and cultural algorithms,” in New Achievements in Evolutionary Computation, P. Korosec, Ed. Rijeka, Croatia: Intech, 2010, pp. 267–297. [2] W. Cornell and W. Cornell. (2013). How Data Mining Drives Pharma: Information as a Raw Material and Product [Webinar]. [Online]. Available: http://acswebinars.org/big-data [3] L. Toldo, “Text mining fundamentals for business analytics,” presented at the 11th Annu. Text Soc. Analytics Summit, Boston, MA, USA, 2013. [4] L. Dunbrack, “Pharma 2.0—Social media and pharmaceutical sales and marketing,” Health Ind. Insights, p. 7, 2010. [5] C. Corley, D. Cook, A. Mikler, and K. Singh, “Text and structural data mining of influenza mentions in web and social media,” Int. J. Environ. Res. Public Health, vol. 7, pp. 596–615, Feb. 2010. [6] L. Getoor and C. Diehl, “Link mining: A survey,” SIGKDD Explor. Newslett., vol. 7, pp. 3–12, Dec. 2005. [7] Q. Lu and L. Getoor, “Link-based Classification,” in Proc. 20th Int. Conf. Machine Learning, Washington, DC, USA, 2003, pp. 496–503. [8] A. Ng, A. Zheng, and M. Jordan, “Stable algorithms for link analysis,” in Proc. SIGIR Conf. Inform. Retrieval, New Orleans, LA, USA, 2001, pp. 258–266. [9] B. Taskar, M. Wong, P. Abbeel, and D. Koller, “Link prediction in relational data,” presented at the Adv. Neural Inform. Process. Syst., Vancouver, Canada, 2003. [10] D. Liben-Nowell and J. M. Kleinberg, “The link prediction problem for social networks,” J. Amer. Soc. Inform. Sci. Technol., vol. 57, pp. 556–559, May 2007. [11] Z. Lacroix, H. Murthy, F. Naumann, and L. Raschid, “Links and paths through life sciences data sources,” in Proc. 1st Int. Workshop Data Integr. Life Sci., Leipzig, Germany, 2004, pp. 203–211. [12] J. Noessner, M. Niepert, C. Meilicke, and H. 
Stuckenschmidt, “Leveraging terminological structure for object reconciliation,” in The Semantic Web: Research and Applications. Berlin, Germany: Springer, 2010, pp. 334–348. [13] M. E. J. Newman, “Detecting community structure in networks,” Eur. Phys. J., vol. 38, pp. 321–330, Mar. 2004. [14] J. Huan and J. Prins, “Efficient mining of frequent subgraphs in the presence of isomorphism,” in Proc. 3rd IEEE Int. Conf. Data Mining, Melbourne, FL, USA, 2003, pp. 549–552. [15] D. Hand, “Principles of data mining,” Drug Safety, vol. 30, pp. 621–622, Jul. 2007. [16] J. Hans and M. Kamber, Data Mining: Concepts and Techniques, 2nd ed. Burlington, MA, USA: Morgan Kaufmann, 2006. [17] C. Corley, D. Cook, A. Mikler, and K. Singh, “Text and structural data mining of influenza mentions in web and social media,” Int. J. Environ. Res. Public Health, vol. 7, pp. 596–615, Feb. 2010. [18] S. R. Das and M. Y. Chen, “Yahoo! for amazon: Sentiment extraction from small talk on the web,” Manag. Sci., vol. 53, pp. 1375–1388, Sep. 2007. [19] E. Riloff, “Little words can make a big difference for text classification,” in Proc. 18th Annu. Int. ACM SIGIR Conf. Res. Develop. Inform. Retrieval, Seattle, WA, USA, 1995, pp. 130–136. [20] W. Yih, P. H. Chang, and W. Kim, “Mining online deal forums for hot deals,” in Proc. IEEE /WIC /ACM Int. Conf. Web Intell., Beijing, China, 2004, pp. 384–390.

[21] B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up? Sentiment classification using machine learning techniques,” in Proc. Conf. Empirical Methods Natural Language Process., Philadelphia, PA, USA, 2002, pp. 79–86.
[22] X. Feng, A. Cai, K. Dong, W. Chaing, M. Feng, N. S. Bhutada, J. Inciardi, and T. Woldemariam, “Assessing pancreatic cancer risk associated with dipeptidyl peptidase 4 inhibitors: Data mining of FDA adverse event reporting system (FAERS),” J. Pharmacovigilance, vol. 1, Jul. 2013.
[23] K. Y. Chan, C. K. Kwong, and T. C. Wong, “Modeling customer satisfaction for product development using genetic programming,” J. Eng. Design, vol. 22, no. 1, pp. 56–68, Jan. 2011.
[24] I. Frommholz and M. Lechtenfeld, “Determining the polarity of postings for discussion search,” in Proc. Workshop-Woche: Lernen, Wissen Adaptivität, Würzburg, Germany, 2008, pp. 49–56.
[25] J. Schectman. (2013, May 1). Glaxo Mined Online Parent Discussion Boards For Vaccine Worries [Online]. Available: http://blogs.wsj.com/cio/2013/05/01/glaxo-mined-online-parent-discussion-boards-for-vaccine-worries/
[26] R. McBride. (2012, Aug. 1). Merck to Draw on Social Network for Psoriasis Patients [Online]. Available: http://www.fiercebiotechit.com/story/merck-draw-social-network-psoriasis-patients/2012-08-13
[27] I. Mierswa, M. Wurst, W. Michael, R. Klinkenberg, M. Scholz, and T. Euler, “YALE: Rapid prototyping for complex data mining tasks,” in Proc. 12th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Philadelphia, PA, USA, 2006, pp. 935–940.
[28] P. Soucy and G. W. Mineau, “Beyond TFIDF weighting for text categorization in the vector space model,” in Proc. 19th Int. Joint Conf. Artif. Intell., Edinburgh, Scotland, 2005, pp. 1130–1135.
[29] S. Bird, E. Klein, and E. Loper, Natural Language Processing with Python. Sebastopol, CA, USA: O’Reilly Media, 2009, p. 504.
[30] S. R. Das and M. Y. Chen, “Yahoo! for Amazon: Sentiment extraction from small talk on the web,” Manage. Sci., vol. 53, pp. 1375–1388, Sep. 2007.
[31] T. Kohonen, Self-Organizing Maps, 3rd ed. Berlin, Germany: Springer, Dec. 2000.
[32] P. Bonato, P. J. Mork, D. M. Sherill, and R. H. Westgaard, “Data mining of motor patterns recorded with wearable technology,” IEEE Eng. Med. Biol. Mag., vol. 22, no. 3, pp. 110–119, May-Jun. 2003.
[33] J. Vesanto, J. Himberg, E. Alhoniemi, and J. Parhankangas, “Self-organizing map in MATLAB: The SOM toolbox,” in Proc. MATLAB DSP Conf., Espoo, Finland, 1999, pp. 35–40.
[34] S. Wasserman and K. Faust, Social Network Analysis: Methods and Applications. New York, NY, USA: Cambridge Univ. Press, 1994, p. 825.
[35] Ibid.
[36] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 2nd ed. Cambridge, MA, USA: MIT Press, 2001.
[37] A. V. Matveyenko, S. Dry, H. I. Cox, A. Moshtaghian, T. Gurlo, R. Galasso, A. E. Butler, and P. C. Butler, “Beneficial endocrine but adverse exocrine effects of Sitagliptin in the human islet amyloid polypeptide transgenic rat model of type 2 diabetes: Interactions with metformin,” Diabetes, vol. 58, pp. 1604–1615, Jul. 2009.
[38] S. Singh, H. Chang, T. M. Richards, J. P. Weiner, J. M. Clark, and J. B. Segal, “Glucagon-like peptide 1–based therapies and risk of hospitalization for acute pancreatitis in type 2 diabetes mellitus: A population-based matched case-control study,” JAMA Intern. Med., vol. 173, pp. 534–539, Feb. 2013.
[39] M. Elashoff, A. V. Matveyenko, B. Glier, R. Elashoff, and P. C. Butler, “Pancreatitis, pancreatic, and thyroid cancer with glucagon-like peptide-1–based therapies,” Gastroenterology, vol. 141, pp. 150–156, Jul. 2011.
[40] S. Shimoda, S. Iwashita, S. Ichimori, Y. Matsuo, R. Goto, T. Maeda, T. Matsuo, T. Sekigami, J. Kawashima, T. Kondo, T. Matsumura, H. Motoshima, N. Furukawa, K. Nishida, and E. Araki, “Efficacy and safety of Sitagliptin as add-on therapy on glycemic control and blood glucose fluctuation in Japanese type 2 diabetes subjects ongoing with multiple daily insulin injections therapy,” Endocrine J., vol. 60, no. 10, pp. 1207–1214, Aug. 2013.
[41] K. Faust, “Very local structure in social networks,” Sociol. Methodology, vol. 37, pp. 209–256, Nov. 2007.
[42] K. Faust, “Comparing social networks: Size, density and local structure,” Adv. Methodology Statist., vol. 3, no. 2, pp. 185–216, 2006.

Altug Akay (M’13) was born in Istanbul, Turkey, in 1986. He received the B.A. degree from the University of Massachusetts, Amherst, MA, USA, in 2005, and the M.A. degree from Dartmouth College, Hanover, NH, USA. He is currently working toward the Ph.D. degree in the School of Technology and Health, Royal Institute of Technology, Stockholm, Sweden. After graduating from Dartmouth, he worked at Siemens Healthcare, Erlangen, Germany, for one year under the supervision of Dr. Gudrun Zahlmann. He currently works at Spaulding Rehabilitation Hospital under the supervision of Dr. Paolo Bonato. He also contributed a book chapter entitled “β-thalassemia’s social and economic geography: A possible prevention/treatment program to rout ‘legacy’ genetic mutations,” in the book Mathematical Methods in Scattering Theory and Biomedical Engineering (London, U.K.: World Scientific, 2008). His current interests include public health, health policy, and health technology.

Andrei Dragomir (M’13) received the B.S. degree in electrical engineering from the Politehnica University of Bucharest, Bucharest, Romania, and the M.S. and Ph.D. degrees in biomedical engineering from the University of Patras, Patras, Greece. He was a Postdoctoral Research Associate at Arizona State University, Tempe, AZ, USA, from 2006 to 2010. He is currently an Assistant Professor at the University of Houston, Houston, TX, USA. His research interests include biocomplexity, neural engineering, pattern recognition, machine learning, and bioinformatics.

Björn-Erik Erlandsson (SM’13) received the Ph.D. degree in applied electronics from the Chalmers University of Technology, Gothenburg, Sweden. He is currently a Senior Advisor and a Professor at the School of Technology and Health, Royal Institute of Technology, Stockholm, Sweden, with responsibility for technology and quality in health care. He has experience in research and development, quality management, and regulatory affairs from the international medical device industry (Siemens and Nobel Industries). He was a Quality Manager in the pacemaker industry and has been a Director in Medical Informatics and Technology at the University Hospital of Northern Sweden, Umeå, and Akademiska Hospital, Uppsala, and also a Professor in Biomedical Engineering. In his role as a Director for technical operations at the university hospitals, he has also been heavily involved in investment issues and investment management, and has chaired the investment planning groups. He is also involved in the standardization work in medical technology and medical informatics. Dr. Erlandsson was the Chairman of SIS/TK334, the Joint Working Group in Software and Medical Devices at CEN/CENELEC during 2010 and the Cochairman during 2011, and the Chairman of the Medical Society’s Division of Medical Informatics for two years. He is a member of the Swedish Council on Health Technology Assessment, SBU Alertråd, and the Coordinator of EU policies at the Royal Institute of Technology’s Life.