Social Health Data Integration using Semantic Web

Soon Ae Chun
City University of New York College of Staten Island
Staten Island, NY 10314
1-718-982-2931
[email protected]

Bonnie MacKellar
St John’s University
Div. of Computer Science, Mathematics and Science
New York, NY
1-718-990-7452

According to a 2011 survey by the Pew Internet and American Life Project, 59% of all American adults had searched the Internet for health-related information [10]. However, the Internet’s content diversity is often overwhelming. Information is scattered among many sites and available in many formats. In particular, it is very difficult to locate information that is contained within social networking sites and online discussion communities, where patient experiences are shared. These experiences can be valuable sources of knowledge for other patients, government agencies and clinicians. The wealth of experience and data posted by patients can help individual patients decide whether to proceed with or avoid certain procedures or drugs.

ABSTRACT
This project addresses how to link scattered health-related data from different Web communities and provide integrated knowledge of health information. Specifically, we integrate data from social media-based patient communities, curated sites with expert content, and the research community. Our approach is based on medical concept extraction using the Unified Medical Language System (UMLS), Resource Description Framework (RDF) semantic modeling to represent diverse social health and medical experiences, and summarization of integrated health data. A prototype implementation annotates medical terms occurring in blogs with summarized health experience data, medical expert data and medical research data, which enables users such as patients, doctors or other health care providers to have an integrated and linked view of health-related knowledge. Currently, the system integrates information from PatientsLikeMe, WebMD, and PubMed, and can be used to annotate a wide variety of text-based blogs. This system uses ontology-based information extraction and semantic modeling of social health data to integrate informally specified information, which is typical of content written by patients.

Consider a typical user, Mary, who is researching clinical trials for her elderly father, who is suffering from kidney cancer. Mary participates in a kidney cancer support mailing list, and has asked for information on a particular trial. One of the answers looks interesting, but Mary is having trouble understanding the technical vocabulary in the response, which was written by a kidney cancer survivor who has some medical background. Mary also wonders if there are research results for this trial, if other patients have had success with the drug used in the trial, and what side effects were reported.

Categories and Subject Descriptors. J.3 LIFE AND MEDICAL SCIENCES (Health);

Currently, the only way to get further information about a disease or treatments is for people like Mary to initiate a web search to find a definition or go through similar patients’ experiences to get further information and to make a decision. For instance, she would need to navigate over to PubMed and run searches there to find relevant research papers. She might also go to a site like PatientsLikeMe and search the site looking for experiences and statistics on the drugs involved in the trial.

General Terms Data integration, Semantic Web, Social data, Health knowledge

Keywords Social health data, semantic health experience model, semantic integration, health data extraction, health knowledge base

1. INTRODUCTION

A better solution would be an integrated knowledge base system that provides patients and caregivers with aggregated health information from different sources, so they can better understand diagnoses, alternative treatments and side-effects of drugs. It is especially important that the large store of patient-generated content buried in medical social networking and blogging sites be integrated into this knowledge base. Using the integrated health knowledge base, relevant medical terms pertaining to symptoms and to treatments (such as drug names) are highlighted in the text of the mailing list messages Mary is reading, with links that display summarized statistics from PatientsLikeMe and citations from PubMed. This knowledge can provide her with integrated information on the particular topic and help her to make health-related decisions.

The Internet has fundamentally changed the way that patients obtain information related to healthcare and medicine. Patients, who used to rely solely on health care experts for advice and treatments, now seek health information on their own from the Internet. Health and medical information on the Internet is provided by many sources: official hospital and medical association websites, consumer-oriented healthcare portals, national governments, social networking sites, blogs written by both patients and medical professionals, and online discussion boards and mailing lists.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SAC’12, March 25-29, 2012, Riva del Garda, Italy. Copyright 2012 ACM 978-1-4503-0857-1/12/03…$10.00.

Our research goal is to link scattered health-related data from different Web communities, and provide an integrated view of health information, in particular symptoms, treatments, and side-effects, to support health decision making. This work is part of our larger project that aims to develop enhanced search and integration capabilities for health discussion-based sites or health-related social collaboration sites that provide:
• Profile-based search, in which a user’s medical characteristics are used to find information generated by similar patients.
• Related information presented to the user in an integrated, as-needed fashion, also filtered based on the user’s medical characteristics.

There are many research questions that will be addressed by the overall project, including the task of extracting user profiles from scattered text-based posts. In this paper, we address the extraction of health information from diverse sources and its integration into a knowledge base for health decision making. The research challenges include how to recognize and extract health data from unstructured sources such as patients’ social media posts, and how to link and integrate data from disparate sources. We focused on information available on the open Web, i.e., non-password-protected sites, to avoid issues with privacy. Users are assumed to be patients, caregivers, and other people without a medical background who may be searching for information on a health condition.

Our approach utilizes ontology-based information extraction, content organization based on a semantic representation, and summarization of health information. The content extraction component utilizes text extraction and text mining technology to extract health terms from unstructured text. The summarization component aggregates patient-contributed content to show data trends across patient groups. Together, these enable users to search health-related terms and view related information in a summarized manner.

Web2.0 technologies include social networking sites that encourage collaboration and discussion among users, sites that build collective knowledge such as Wikipedia, and mashups, which allow Web applications to be easily combined to create more value. The application of Web2.0 technologies to the field of healthcare is termed Health2.0 or Medicine2.0 [12]. Health2.0 applications allow patients to take charge, both gathering and creating their own information. The positive effect of the proliferation of Health2.0 sites is the increase in medical knowledge on the Internet, much of it created by the patients themselves. There is an important role for web applications which can integrate this scattered information and present it to the user on an as-needed basis. That is the aim of the system described in this paper.

2.2 Characteristics of Health Websites
It is important to note that there are different types of health websites, each with its own set of challenges for finding relevant information. Following is a description of the websites that were used in developing this system.

Curated sites: These are sites that are sponsored by some official entity, often by a government agency, disease-related organization, or medical organization. For example, PubMed.gov [18] is a website maintained by the U.S. National Library of Medicine. It indexes over 21 million citations of published research papers in medicine. It provides keyword search, based on the Medical Subject Headings (MeSH) vocabulary.

Moderated Question and Answer Sites: In this format, patients ask questions about their condition, and a medical specialist answers. Questions and answers are free-form text, like medical mailing lists. However, the format is more constrained, since information is mainly flowing between the questioner and the specialists. An example of such a site is the breast cancer discussion board at WebMD.com [24], which at the time of this project was moderated by a breast cancer specialist.

We developed a prototype system, the Health Knowledge Refinery Portal, which identifies health-related terms occurring in a blog and annotates them with linked and aggregated health knowledge extracted from a number of health-related websites. Users can enter a URL for a health discussion or blog and specify what to analyze, i.e. either symptoms or treatments. All occurrences of either symptom or treatment keywords in the text are then highlighted, and the integrated health knowledge base is used to annotate these terms with integrated health information, such as the past experience of patients and the effects of drugs/treatments observed by them. The summary for any health term occurring at the URL consists of a detailed description of the keyword and statistics featuring percentages of patients taking a particular drug, side effects of the drug, severities of the side effects, reasons for stopping the drug, etc. In addition, information from PubMed and WebMD is linked in the annotation, providing further expert information on the health terms.

Mailing lists and discussion boards: Curated sites and moderated Q&A sites convey content generated by experts to non-experts. At the other end of the spectrum, we find mailing lists and discussion boards, which allow patients to interact with each other and provide information to each other. There are many examples of health-related mailing lists on the Internet, such as the very well-known Association of Cancer Online Resources (ACOR) [15], which is a collection of 159 mailing lists tied to specific forms of cancer. The purpose of these mailing lists is to provide support and information, and often to advocate for research as well. In a heavily-trafficked mailing list, there may be vast stores of patient information, stretching back for years. Unfortunately, it is very difficult to locate and access this information. Many of the websites that host these lists have only rudimentary search capability. In the project described in this paper, we do not search these mailing lists, which are usually password-protected, but instead make it possible for a member to see annotations of related information, tied to words in a posting.

This paper is organized as follows. In Section 2, we present a rationale for the usefulness of this type of aggregated content, and discuss related work. In Section 3, we discuss the RDF-based semantic integration model that facilitates the aggregation of health knowledge and experience, and in Section 4 we present technical details of the prototype system. Section 5 presents conclusions and plans for future research and development.

Healthcare blogs lie at a midpoint between curated sites and the more free-form mailing lists. Blogs may be written by anyone, from patients to noted medical specialists. They have many purposes, from highlighting current research, to chronicling a patient’s journey through the healthcare landscape, to providing official information to patients. They share the unstructured text aspect of mailing lists, but generally do not require that threads be followed in order to locate information.

2. BACKGROUND
2.1 Health2.0
The first generation of the Internet consisted of static pages served to end users on request. As Internet technology evolved, new applications appeared that allowed greater interactivity, personalization, and collaboration between users. The term Web2.0 was first coined by Tim O’Reilly in 2005 [16] to refer to this new generation of Web technologies.


Mapping extracted keywords to the ontology results in a representation of the document in terms of the standard ontology. Ontology-based information extraction [26] approaches include the system designed by Embley [8], which worked with classified ads to sell cars. SOBA [2] crawls through web pages about soccer matches to populate an ontology about soccer. Similarly, Saggion et al. [19] have developed a system that analyzes business documents using linguistic rules and gazetteers, and a business intelligence ontology called MUSING. Another similar system is MAGPIE [5], which uses a named entity recognition engine to extract entities from web documents and then links each entity with an instance or class within an ontology. In the medical domain, TRIES [23] extracts information from radiological reports and populates a specialized radiological ontology with the extracted entities. Serban [20] uses a lightweight medical ontology called DO to aid in extracting and modeling clinical guidelines. There has also been work that uses information extraction from online sources such as blogs and Twitter to gather public health information and provide surveillance [7] [14] [4]. We use a similar approach to extract and map medical terms and information to a medical ontology, improving the way we relate information from different healthcare sites. One of the main contributions of our work is to use ontology-based concept integration to provide aggregated information in a domain where few knowledge aggregation tools exist.

Social networking sites: These are sites that allow for collaborative aggregation of information and patient experiences. They tend to be more structured than mailing lists. On the earlier-mentioned PatientsLikeMe, patient-supplied information is aggregated and provided as visual reports for participants [25][11], creating a valuable repository of medical information.

In the prototype implementation of the Health Knowledge Refinery Portal described here, information extracted from the open parts of WebMD, PubMed, and PatientsLikeMe is used to generate a semantic knowledge base, while a mix of mailing lists and moderated question/answer sites was used to test information annotation.

2.3 Related Research
There are few examples reported in the literature of healthcare-related web applications that aggregate knowledge from other sites. Fernandez-Luque et al. [9] describe a number of applications in a survey paper on web information extraction for personalizing healthcare applications. They note that many of the online applications and data sources in the healthcare field do not provide APIs for extracting information. This may be a reason for the scarcity of such applications.

One of the key problems in integrating health information from different types of websites is the differences in terminology. Research papers are filled with technical jargon, whereas curated sites like WebMD use correct but less formal medical terms. Patient-driven mailing lists and social networks contain a high diversity of terms, with some posts using correct medical terminology and others using less precise terms. On some lists, common abbreviations evolve over time. For example, oncologists use the technical term “no evidence of disease” to mean the state that non-clinicians commonly call “in remission”, but on cancer mailing lists, this term is further abbreviated to “NED”. Smith and Wicks [22] report that 43% of PatientsLikeMe symptom terms either match exactly (24%) or synonymously (19%) with terms in the UMLS Metathesaurus (UMLS is a medical ontology), while more than half of the symptom terms either do not match the UMLS, or are unclassifiable. The variability in social health terms may cause missing semantic representations. Any system that integrates health information must find a way to map differing representations of the same concept to a common representation.
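The mapping from variant surface forms to one common concept can be pictured as a lookup table. The sketch below is only an illustration and is not the method used in the prototype: the `SYNONYMS` table and the `normalize` function are hypothetical names introduced here, and the only entries are the abbreviation example from the text.

```python
# Hypothetical synonym table: surface form (lowercased) -> common concept label.
# The entries restate the "NED" example from the text; a real system would
# draw its mappings from a vocabulary such as the UMLS Metathesaurus.
SYNONYMS = {
    "ned": "no evidence of disease",
    "in remission": "no evidence of disease",
    "no evidence of disease": "no evidence of disease",
}


def normalize(term):
    """Return the common concept for a term, or None if it is unmapped."""
    # Lowercase and collapse runs of whitespace before the lookup.
    key = " ".join(term.lower().split())
    return SYNONYMS.get(key)
```

Calling `normalize("NED")` and `normalize("in remission")` both yield the same concept label, while an unmapped phrase returns `None`. Unmapped terms are the case Smith and Wicks report for more than half of patient-supplied symptom terms.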

3. SEMANTIC APPROACH FOR HEALTH KNOWLEDGE INTEGRATION
Capturing the richness of patient experience as well as the technical detail of medical information from the research world requires a powerful information modeling approach. Our overall approach uses semantic web techniques integrated with a standard medical ontology. An ontology typically consists of classes, properties, relationships between classes (for example, IS-A, PART-OF) and instances. In our model, classes and concepts are derived from central concepts that include disease, symptoms, treatments, and side-effects (see Figure 1).

Data integration refers to the problem of gathering related information from disparate sources and presenting it in a unified way. Two common obstacles are the lack of a schema and semantic heterogeneity. Integrating non-structured web data is challenging due to the lack of a schema. Approaches to integrating this data can be based on information retrieval and extraction algorithms [6]. Semantic heterogeneity refers to differences in the way that concepts are represented in different data sources. These differences pose a big challenge when trying to integrate information based on keywords. To solve this problem, one can use an ontology, which is a standardized representation of a body of knowledge, consisting of concepts and relationships between concepts. The task then becomes one of developing mappings from concepts in a data source to the ontology. There are standard ontologies in many domains; within the medical world, the Unified Medical Language System (UMLS) provides a thesaurus of medical concepts as well as mappings to other controlled medical vocabularies.

Figure 1 Semantic Model for Social Health Entities

In general, medical concepts as used within social networking sites and mailing lists are not as finely classified as those used in clinical professionals’ concept hierarchy. For instance, the term “antibiotics” refers to a kind of drug without further sub-distinctions for most non-clinical people, but professional clinicians may distinguish further into Penicillins, Cephalosporins, Tetracyclines, Erythromycin, Aminoglycosides, Polypeptides and Antifungal Antibiotics. Given this type of concept-based semantic model, we have developed an extraction approach for filling them with the instances from the health and medical sites for the knowledge base.
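One way to reconcile the two levels of granularity is to follow IS-A links upward until a concept that lay users recognize is reached. The following is a minimal sketch under stated assumptions: the `IS_A` table is hand-written from the antibiotic example above, and the table and function names are illustrative rather than part of UMLS or of the prototype.

```python
# Hypothetical IS-A table (child -> parent) built from the antibiotic example.
IS_A = {
    "penicillins": "antibiotics",
    "cephalosporins": "antibiotics",
    "tetracyclines": "antibiotics",
    "antibiotics": "drug",
}


def ancestors(concept):
    """Return the chain of broader concepts, nearest first."""
    chain = []
    seen = {concept}
    while concept in IS_A:
        concept = IS_A[concept]
        if concept in seen:  # guard against accidental cycles in the table
            break
        seen.add(concept)
        chain.append(concept)
    return chain


def is_a(concept, broader):
    """True if `concept` equals `broader` or is a descendant of it."""
    return concept == broader or broader in ancestors(concept)
```

With this table, a clinician's mention of "penicillins" and a patient's mention of "antibiotics" can be grouped under the same broader concept by testing `is_a(term, "antibiotics")`.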

When dealing with unstructured information in a text document, the first problem is to extract information from the document as a set of keywords, and then to map those keywords to the ontology.

To detect the instances for different classes in the ontology, we used UMLS. If a medical/health term matches one in UMLS, then the category of the term is traversed to classify it as an instance of one of the classes. For instance, when “viral hepatitis” is mentioned, one of the UMLS disease classification systems, e.g. ICD-10, is used. When a concept in the ICD-10 system matches the term, the term is classified and described as a disease. If the term is “cryotherapy,” then it is considered to be a treatment/procedure. In the next phase of the project, we plan to improve our text extraction component to map informal patient terms, such as the mailing list abbreviations described earlier, to the ontological classification.

Extracted medical entities from sources can have relationships among themselves. The relationships in the semantic model shown in Fig 1 are used. We represented the extracted instances using the RDF triple model, which consists of subject-predicate-object statements that describe web resources in a standardized manner. For instance, a patient’s profile at PatientsLikeMe (see footnote 1) reports the disease, treatment procedures and prescription drugs taken. This particular web resource for the patient’s health experience can be described in the RDF model, whose snippet is shown below in Fig 2.
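The classification step can be sketched as a lookup from a term to its source vocabulary, followed by a mapping from that vocabulary to a class of the semantic model. Both tables below are hypothetical stand-ins for a UMLS lookup; no UMLS API is called, and the vocabulary name `PROCEDURE_VOCAB` is a placeholder rather than a real vocabulary.

```python
# Hypothetical stand-in for a UMLS lookup: term -> source vocabulary.
TERM_TO_VOCAB = {
    "viral hepatitis": "ICD-10",
    "cryotherapy": "PROCEDURE_VOCAB",  # placeholder name, not a real vocabulary
}

# Assumed mapping from source vocabulary to a class of the semantic model.
VOCAB_TO_CLASS = {
    "ICD-10": "disease",
    "PROCEDURE_VOCAB": "treatment",
}


def classify(term):
    """Return the model class for a term, or None if the term is unknown."""
    vocab = TERM_TO_VOCAB.get(term.strip().lower())
    if vocab is None:
        return None
    return VOCAB_TO_CLASS.get(vocab)
```

A term that is missing from the lookup, such as an informal mailing-list abbreviation, returns `None`, which corresponds to the unmatched terms the text plans to handle in a later phase.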

4. PROTOTYPE: HEALTH KNOWLEDGE PORTAL
4.1 Architecture
The prototype system consists of the following modules:
1) Social Health Knowledge Base: This component contains knowledge, represented as an RDF model, that has been mined from blogs, curated sites, and social networking sites. Currently, knowledge is mined from PatientsLikeMe, PubMed, and WebMD, but more sites can easily be integrated into the system.
2) Blog Extraction Component: After the user enters the URL of a blog or discussion list post, the blog extraction component extracts all the text from the specified URL and passes it to the keyword identification component.
3) Health Keywords Identification component: Once the text is extracted from a designated blog, the Health Keywords Identification component looks for health-related keywords that match keywords stored in the Social Health Knowledge Base.
4) Integrated Health Summary and Linking component: All the matched health-related keywords are marked up with health summary data from the Social Health Knowledge Base and with hyperlinks to WebMD and PubMed medical knowledge relevant to the health keywords. This allows the information to be integrated and tied to the keyword occurrence in the blog text for presentation to the user.
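The flow through modules 2 to 4 can be sketched in a few functions. This is a simplified illustration and not the prototype's code: Python's standard `html.parser` stands in for the text extraction step, the knowledge base is reduced to a plain dictionary from keyword to summary text, and all function names are hypothetical.

```python
import re
from html import escape
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes; script and style contents are not filtered here."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def extract_text(html_source):
    """Blog extraction: strip markup and normalize whitespace."""
    parser = TextExtractor()
    parser.feed(html_source)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def find_keywords(text, knowledge_base):
    """Keyword identification: knowledge-base keywords that occur in the text."""
    return [
        keyword
        for keyword in knowledge_base
        if re.search(r"\b" + re.escape(keyword) + r"\b", text, re.IGNORECASE)
    ]


def annotate(text, knowledge_base):
    """Summary and linking: wrap each match in a link carrying its summary."""
    if not knowledge_base:
        return text
    # Try longer keywords first so a multi-word term wins over its substring.
    keys = sorted(knowledge_base, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b", re.IGNORECASE
    )
    lookup = {k.lower(): v for k, v in knowledge_base.items()}

    def link(match):
        summary = lookup[match.group(0).lower()]
        return '<a title="{}">{}</a>'.format(
            escape(summary, quote=True), match.group(0)
        )

    # A single substitution pass avoids re-matching inside inserted markup.
    return pattern.sub(link, text)
```

Given a dictionary such as `{"fatigue": "placeholder summary"}`, `annotate(extract_text(page), kb)` returns the plain text with each matched keyword wrapped in a link. The real system additionally attaches WebMD and PubMed hyperlinks, which this sketch omits.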

neuroplasticity Omaha,,NE Male52 Epilepsy Right Temporal Lobectomy Extratemporal resection , … Sebril Keppra XR … …….

Figure 3 shows the overall system architecture with the external users.

Figure 2: RDF Model of a patient’s Health Resource (see footnote 1)

In this example, the subject is the patient’s profile web page (see footnote 1), predicates are “the patient has disease, had procedures and took prescription drugs”, and the objects are Epilepsy, a set of procedures called “Right Temporal Lobectomy” and “Extratemporal resection”, and “Sebril” and “Keppra XR”. A concept or URI can link to another web source URI. For example, “Right Temporal Lobectomy” can be linked to a research paper on PubMed, using triple notations . Since the RDF triples asserting the relationships between web resources are used for aggregations and summarizations, searches can then be specified using the semantic query language SPARQL. The search results can help users see the general patient population’s experience.
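The aggregation that such triples make possible can be illustrated without an RDF library. In the sketch below, triples are plain tuples and a small pattern matcher plays the role of a query engine. The predicate names, the `ex:` prefix in the comment, and the second profile are hypothetical; only the first profile's values come from the example in the text.

```python
from collections import Counter

PROFILE_1 = "http://www.patientslikeme.com/patients/view/143051"
PROFILE_2 = "http://example.org/patients/2"  # hypothetical second profile

# (subject, predicate, object) triples; predicate names are assumptions.
TRIPLES = [
    (PROFILE_1, "hasDisease", "Epilepsy"),
    (PROFILE_1, "hadProcedure", "Right Temporal Lobectomy"),
    (PROFILE_1, "hadProcedure", "Extratemporal resection"),
    (PROFILE_1, "tookDrug", "Sebril"),
    (PROFILE_1, "tookDrug", "Keppra XR"),
    (PROFILE_2, "hasDisease", "Epilepsy"),
    (PROFILE_2, "tookDrug", "Keppra XR"),
]


def match(triples, s=None, p=None, o=None):
    """Yield triples matching the pattern; None acts as a wildcard."""
    for triple in triples:
        if ((s is None or triple[0] == s)
                and (p is None or triple[1] == p)
                and (o is None or triple[2] == o)):
            yield triple


def drug_counts(triples, disease):
    """Count how many patients with `disease` report each drug.

    Roughly the tuple-based analogue of a SPARQL query such as:
      SELECT ?drug (COUNT(?patient) AS ?n)
      WHERE { ?patient ex:hasDisease "Epilepsy" . ?patient ex:tookDrug ?drug }
      GROUP BY ?drug
    """
    patients = {s for s, _, _ in match(triples, p="hasDisease", o=disease)}
    counts = Counter()
    for s, _, drug in match(triples, p="tookDrug"):
        if s in patients:
            counts[drug] += 1
    return counts
```

Here `drug_counts(TRIPLES, "Epilepsy")` tallies drugs across every profile that reports the disease, which is the kind of population-level summary the annotations display.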

Figure 3: System Architecture

4.2 Detailed Description
There are effectively two phases in the prototype: a phase in which knowledge is collected from websites such as PubMed and PatientsLikeMe, and a phase in which natural language text from a blog post is analyzed and annotated.

In the following section, we present details of the prototype implementation, which uses our semantic approach to integrate disparate health information.

1 http://www.patientslikeme.com/patients/view/143051

Knowledge collection happens on a scheduled basis in order to keep the knowledge base up to date. Text extraction from PatientsLikeMe is done with Mozenda. This is a tool that extracts pattern-based data from all the pages in a list; it crawls through the website and extracts data according to a pattern specified by the user. For symptom data, text was extracted in the following pattern:

Symptom Name, Number of Patients Suffering, Brief Description of Symptom, Which Drug did Patient Use to Treat the Symptom, Number of Patients Taking the Drug

and for treatment data, this pattern was used:

Drug Name, Category, Number of Patients using the Drug, Brief Description of Drug, Reasons for Taking the Drug, Number of Patients taking the Drug, Side Effects, Number of Patients having Side Effect, Side Effect Severity, Number of Patients having Side Effect Severity, Reasons for Stopping the Drug Consumption, Number of Patients that have Stopped the Consumption

An example of extracted symptom data is as follows:

Rashes (redness swelling),4395,A rash is an area of inflamed and irritated skin. Rashes are characterized by redness itching swelling and various types of skin lesions (e.g. macules papules nodules plaques pustules vesicles wheals). There are many different types of skin rashes.,1,Metronidazole Topical

An example of treatment data is shown below:

Zytram XL (tramadol),Prescription Drug,6,Tramadol is an opioid analgesic used for the relief of moderate to moderately-severe pain. Extended release formulations are indicated for patients requiring around-the-clock management of moderate to moderately-severe pain for an extended period of time.,Fibromyalgia,363,,,,,,

Once extracted, the symptom and treatment keywords are indexed using Lucene. Ontology-based classification is used to incorporate the concepts into the Social Health Knowledge Base. This information is utilized later, when the user-supplied blog text is analyzed.

During the blog analysis phase, the text in a blog post supplied by a user is annotated with information from the knowledge base. First, the plain text is extracted using the jsoup API [13], which removes HTML formatting. The user chooses whether to annotate symptoms or treatments. If the user chooses symptoms, then the symptoms information described in the last section is used to find matches in the extracted text. Each matched word is then highlighted with a hyperlink associated with it. After clicking on the matched symptom, brief details of the selected symptom are displayed along with the aggregated statistics in tabular format. We also provide WebMD and PubMed links for all the matched symptoms. When any user clicks on the WebMD or PubMed link beside the matched symptom, the relevant WebMD or PubMed page is displayed to the user in the default Web browser.
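The prototype reports the raw patient counts collected from PatientsLikeMe as percentages and shows the top five drugs for a symptom. A minimal sketch of that arithmetic follows. It assumes the denominator is the number of patients reporting the symptom, which the text does not state explicitly, and the function name and any drug names passed to it are hypothetical.

```python
def drug_percentages(counts, total_patients, top_n=5):
    """Return (drug, percent) pairs, highest count first, limited to top_n.

    counts: mapping of drug name -> number of patients taking it for a symptom.
    total_patients: assumed denominator (patients reporting the symptom).
    """
    if total_patients <= 0:
        raise ValueError("total_patients must be positive")
    # Sort by descending count; break ties alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        (drug, round(100.0 * n / total_patients, 1))
        for drug, n in ranked[:top_n]
    ]
```

For instance, with placeholder counts of 50 and 25 patients out of 200, the two drugs would be reported as 25.0% and 12.5%. The same function could be reused for side effects, side effect severities, and reasons for stopping a drug by passing the corresponding counts.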

Figure 4: Annotated blog post with summary information displayed

In Figure 4, note that there are PubMed and WebMD links next to the highlighted symptom. If the user clicks on either of these links, information from these sites is displayed. When the user clicks the PubMed link, the publication list related to the symptom word is shown. If the user selects Treatments, then the treatment index is used in a similar fashion to match and then annotate the text.

Since the data collected from PatientsLikeMe only contains raw numbers, we calculated the statistics from the collected data. For example, for symptoms, we calculated the statistics as a percentage of patients that take the particular drug for treating the respective symptom. The top five drug names along with their respective percentages are displayed. This process is repeated for calculating the statistics for side effects of the drug, side effect severity of the drug, and reasons for stopping the drug.

5. CONCLUSION AND FUTURE WORK
In this paper, we presented a Semantic Web technology based approach to integrating and aggregating health information culled from a variety of web resources. Specifically, we used UMLS ontology-based medical term extraction from social health data, and data from medical expert site and research community.

[5] J. Domingue, M. Dzbor, and E. Motta, “MAGPIE: Supporting Browsing and Navigation on the Semantic Web,” The Semantic Web ISWC 2003, Springer Berlin / Heidelberg, 2003, 690-705.
[6] X. Dong and A. Halevy, “Indexing dataspaces,” Proceedings of the 2007 ACM SIGMOD international conference on Management of data, Beijing, China: ACM, 2007, 43-54.
[7] N. Dragu, F. Elkhoury, T. Miyazaki, R.A. Morelli, and N. di Tada, “Ontology-based text mining for predicting disease outbreaks,” Twenty-Third International FLAIRS Conference, 2010.
[8] D. Embley, “Toward semantic understanding: an approach based on information extraction ontologies,” Proceedings of the 15th Australasian Database Conference, vol. 27, 2004, 3-12.
[9] L. Fernandez-Luque, R. Karlsen, and J. Bonander, “Review of extracting information from the social web for health personalization,” Journal of Medical Internet Research, vol. 13, Jan. 2011.
[10] S. Fox, The Social Life of Health Information, 2011, Pew Internet and American Life Project, 2011.
[11] J.H. Frost and M.P. Massagli, “Social uses of personal health information within PatientsLikeMe, an online patient community: What can happen when patients have access to one another’s data,” Journal of Medical Internet Research, vol. 10(3), May 2008.
[12] B. Hughes, I. Joshi, and J. Wareham, “Health 2.0 and Medicine 2.0: Tensions and controversies in the field,” Journal of Medical Internet Research, vol. 10, 2008.
[13] “jsoup: Java HTML Parser.” [Online]. Available: http://jsoup.org/. [Accessed: 07-Sep-2011].
[14] M.N. Kamel Boulos, A.P. Sanfilippo, C.D. Corley, and S. Wheeler, “Social Web mining and exploitation for serious applications: Technosocial Predictive Analytics and related technologies for public health, environmental and national security surveillance,” Computer Methods and Programs in Biomedicine, vol. 100, Oct. 2010, 16-23.
[15] A. Meier, E.J. Lyons, G. Frydman, M. Forlenza, and B.K. Rimer, “How cancer survivors provide support on cancer-related internet mailing lists,” Journal of Medical Internet Research, vol. 9, no. 2.
[16] T. O’Reilly, “What is Web 2.0: Design Patterns and Business Models for the Next Generation of Software,” Communications & Strategies, vol. 1, First Quarter 2007.
[17] PatientsLikeMe.com, http://www.patientslikeme.com/, 2011.
[18] PubMed.gov, http://www.ncbi.nlm.nih.gov/pubmed/, 2011.
[19] H. Saggion, A. Funk, D. Maynard, and K. Bontcheva, “Ontology-based information extraction for business intelligence,” Proceedings of the 6th International Semantic Web and 2nd Asian Conference on Asian Semantic Web Conference, Springer-Verlag, 2007, 843-856.
[20] R. Serban, A. ten Teije, F. van Harmelen, M. Marcos, and C. Polo-Conde, “Extraction and use of linguistic patterns for modeling medical guidelines,” Artificial Intelligence in Medicine, vol. 39, Feb. 2007, 137-149.
[21] D. Skoutas and A. Simitsis, “Designing ETL processes using semantic web technologies,” Proceedings of the 9th ACM International Workshop on Data Warehousing and OLAP, Arlington, Virginia, USA: ACM, 2006, 67-74.
[22] C.A. Smith and P.J. Wicks, “PatientsLikeMe: Consumer Health Vocabulary as a Folksonomy,” AMIA Annual Symposium Proceedings, American Medical Informatics Association, 2008, 682-686.
[23] E. Soysal, I. Cicekli, and N. Baykal, “Design and evaluation of an ontology based information extraction system for radiological reports,” Computers in Biology and Medicine, vol. 40, 900-911.
[24] WebMD.com, Breast cancer discussion board,

and data from medical expert sites and the research community. The extracted medical and health entities are used to describe the patient resource pages as RDF triples. The resulting semantic knowledge base is queried to provide integrated health-related knowledge. We presented a prototype system that uses this approach to annotate blog posts with the aggregated information. Our work can help patients, who often need critical information, to locate and understand relevant information easily.

Our ongoing work aims to expand this system to provide better integration of patient content, as well as profile-based search capabilities. As noted earlier, there is great diversity in the terms used to identify medical concepts in mailing list and social networking posts, since patients write the content. We plan to improve the mapping of terms to the ontology by using GATE [3], a suite of tools that enables sophisticated information extraction from text-based sources. We plan to analyze a mailing list for common synonyms and acronyms, and to use the results of this analysis during information extraction, together with part-of-speech tagging, named entity extraction, a gazetteer, and other techniques. We also plan to extend the number of sites used to populate the knowledge base.

Another issue is information quality. As more sources are aggregated, it will be important for users to be able to judge the quality of information from a given site [1]. In a future iteration of this project, credibility measures could be added to the annotations, using measures such as user votes or most-visited statistics. An important open question is which measure of credibility is appropriate for very specialized types of health information.

The next phase of our project is to use the semantic approach described here to create patient profiles from the stream of posts in a medical mailing list. A semantic representation of a patient profile can improve searches within the mailing list. For example, a search for treatment options initiated by a patient with refractory retinoblastoma might return posts written by other patients with refractory retinoblastoma at the top of the list of results. This will help patients quickly identify the answers most likely to be relevant.
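The profile-based search described above can be illustrated with a minimal sketch. It is not the prototype's implementation: the triples are held in a plain Python list rather than an RDF store, the `ex:` names, the `ex:mentions` property, and the concept labels are hypothetical placeholders (not UMLS identifiers), and ranking by the number of concepts shared with the searcher's profile is only one assumed scoring choice.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # a post identifier (hypothetical "ex:" names)
    predicate: str  # a property name (hypothetical)
    obj: str        # a concept label extracted from the post


# Hypothetical knowledge base: each post is described by the concepts it mentions.
TRIPLES = [
    Triple("ex:post1", "ex:mentions", "retinoblastoma"),
    Triple("ex:post1", "ex:mentions", "refractory"),
    Triple("ex:post1", "ex:mentions", "chemotherapy"),
    Triple("ex:post2", "ex:mentions", "retinoblastoma"),
    Triple("ex:post2", "ex:mentions", "chemotherapy"),
    Triple("ex:post3", "ex:mentions", "chemotherapy"),
]


def concepts_by_post(triples):
    """Group the objects of ex:mentions triples by the post they describe."""
    index = {}
    for t in triples:
        if t.predicate == "ex:mentions":
            index.setdefault(t.subject, set()).add(t.obj)
    return index


def rank_posts(triples, query_concepts, profile_concepts):
    """Return posts that mention every query concept, best profile match first."""
    query = set(query_concepts)
    profile = set(profile_concepts)
    index = concepts_by_post(triples)
    matches = [(post, found) for post, found in index.items() if query <= found]
    # Sort by descending overlap with the profile; break ties by post id.
    matches.sort(key=lambda item: (-len(item[1] & profile), item[0]))
    return [post for post, _ in matches]


# A patient whose profile contains "refractory" and "retinoblastoma"
# searches for posts about chemotherapy.
ranked = rank_posts(TRIPLES, {"chemotherapy"}, {"retinoblastoma", "refractory"})
print(ranked)  # ['ex:post1', 'ex:post2', 'ex:post3']
```

In this sketch all three posts match the query, but the post that shares both profile concepts is listed first, which mirrors the refractory retinoblastoma example. A real system would query the RDF knowledge base directly and would need a more careful similarity measure.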

6. ACKNOWLEDGMENTS This work was supported in part by PSC-CUNY Research Grant 42 awarded to Chun. We thank Shrikant Doiphode and Prof. Geller of the New Jersey Institute of Technology for their help with the prototype system.

7. REFERENCES
[1] A. Amin, H. Cramer, L. Hardman, and V. Evers, “The effects of source credibility ratings in a cultural heritage information aggregator,” Proceedings of the 3rd Workshop on Information Credibility, 2009, 35–42.
[2] P. Buitelaar, P. Cimiano, S. Racioppa, and M. Siegel, “Ontology-based information extraction with SOBA,” Proceedings of the International Conference on Language Resources and Evaluation, 2006, 2321–2324.
[3] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan, “GATE: an architecture for development of robust HLT applications,” Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics, 2002, 168–175.
[4] K. Denecke, A. Stewart, T. Eckmanns, D. Faensen, P. Dolog, and P. Smrz, “The medical ecosystem – personalised event-based surveillance,” Proceedings of the 13th World Congress on Medical and Health Informatics Medinfo, 2010.

http://exchanges.webmd.com/breast-cancer-exchange, 2011.
[25] P. Wicks, M. Massagli, J. Frost, C. Brownstein, S. Okun, T. Vaughan, R. Bradley, and J. Heywood, “Sharing Health Data for Better Outcomes on PatientsLikeMe,” Journal of Medical Internet Research, vol. 12, Jun. 2010.
[26] D.C. Wimalasuriya and D. Dou, “Ontology-based information extraction: An introduction and a survey of current approaches,” Journal of Information Science, vol. 36, 2010, 306.
