SHORT PAPER International Journal of Recent Trends in Engineering, Vol. 1, No. 1, May 2009
Ontology Enhanced Clustering Based Summarization of Medical Documents
A. Kogilavani (1), Dr. P. Balasubramanie (2)
(1) Kongu Engineering College/CSE, Erode, India. Email: [email protected]
(2) Kongu Engineering College/CSE, Erode, India. Email: [email protected]
Abstract: The growing amount of data, the shortage of structured information, and the diversity of information have made information and knowledge management a real challenge. Even though large quantities of data are available, easy access to the required information at the right time and in the most appropriate form is still difficult. The medical domain in particular suffers from information overload, since physicians and researchers in medicine and biology need quick and efficient access to up-to-date information according to their interests and requirements. Methodologies are needed to support users whose knowledge of medical vocabularies is inadequate to find the desired information, and medical experts who search for information outside their field of expertise. To make effective use of the vast amount of biomedical information and to address the information overload problem, the proposed system combines document clustering and text summarization. In the proposed system the user query is revised by mapping it to synonyms and semantically related concepts using the MeSH ontology knowledge source. Based on the revised query, medical documents are retrieved from trustworthy online sources, and those documents are clustered to generate a cluster-wise summary.

Index Terms: query expansion, text summarization, document clustering, feature extraction, summary generation

I. INTRODUCTION
Information plays a vital role in our society. As vast amounts of knowledge are created and made available through the World Wide Web, how to efficiently and effectively allocate and use these precious data becomes critical. In general, a web search engine tries to serve as an information access agent: it retrieves and ranks information according to a user’s query. But current search engines only do shallow string processing because they lack a deep understanding of natural language and human intelligence, and users usually have to go through several pages before they find useful information. When users want information on a very specific topic, they have trouble finding the correct search terms because the professional and lay medical vocabularies are not always well matched. This problem can be solved by mapping the user query to related medical concepts [4]. This type of concept mapping is used for revising the user query based on synonyms, and the revised query is then used to retrieve a relevant set of documents.

Given the diversity of medical information sources, methods must be established that enable users to quickly understand and determine the content of a document. Summarization is one such approach. A summary can be defined as a text that is produced from one or more texts, that conveys important information in the original text, and that is no longer than half of the original text and usually significantly less than that [6]. The main goal of a summary is to present the main ideas of a document in less space. A short summary improves efficiency, since not all documents come with an abstract or summary. Some documents have author-written abstracts that should summarize the main ideas of the article. But in a real-world scenario, information retrieval starts with a user’s query, which is a set of keywords. These keywords may or may not match the main ideas of a document, in which case the author-written abstract will not be a good summary for the user’s query. Hence, the summaries that a user wants need to be generated dynamically based on the query keywords.

To solve this problem, the proposed system builds a document summarization system specialized for the medical domain, which retrieves and summarizes up-to-date medical information from trustworthy online sources according to revised user queries. Consider MEDLINE, the largest biomedical text database, which has more than 20 million articles and to which thousands of articles are added every week. To deal with this kind of text information overload, document clustering is combined with text summarization. Document clustering groups similar documents together, and a text summarization technique is then applied to provide a concise summary by extracting the most important information from each document cluster.
A. Related Work
PERSIVAL (Personalized Retrieval and Summarization of Image, Video And Language) is designed to provide personalized access to a distributed patient care digital library of medical literature and consumer health information [3]. PERSIVAL consists of a user query component to extract important information, a distributed online multimedia search component that finds relevant sources and uses patient information to rerank articles, and a multimedia presentation component to summarize echocardiogram video. However, an interface between these components may be required to feed information from one stage to the next automatically.

PubMed is a service of the National Library of Medicine that includes over 18 million citations from MEDLINE and other life science journals for biomedical articles back to 1948. PubMed includes links to full text articles and other related resources. It is used as a web search engine to retrieve medical articles.

HelpfulMed integrates searching and indexing algorithms, an automatic thesaurus or concept space, and Kohonen-based Self-Organizing Map (SOM) technologies to provide searchers with fine-grained results. It acts as a gateway that serves the information-seeking needs of medical professionals, researchers, and other advanced users. This system increases user satisfaction, but it does not include summary generation.

Corresponding Author
© 2009 ACADEMY PUBLISHER

In this paper, Section II discusses the medical ontology knowledge sources. The components of the proposed system are explained in Section III. Section IV describes the evaluation measures. Section V presents the conclusion.

II. ONTOLOGY KNOWLEDGE SOURCE
An ontology is a formal representation of a set of concepts within a domain and the relationships between those concepts. The medical field contains a vast number of medical terms, and it is necessary to understand the relationships between these terms in order to interpret any medical document. This section provides a brief overview of the ontology knowledge sources available for the medical domain.

A. UMLS
UMLS (Unified Medical Language System) was designed by the National Library of Medicine (NLM) as a medical ontology knowledge base [7] to help a medical information system understand the meanings of concepts and terms, and their relationships, in the biomedicine and health domain. There are three UMLS knowledge sources: the Metathesaurus, the Semantic Network, and the SPECIALIST Lexicon. The Metathesaurus is a multilingual vocabulary database that contains over one million biomedical concepts from over 100 source vocabularies, together with definitions of medical terms, synonyms, abbreviations, and the relationships among them. The Semantic Network defines 135 broad categories and fifty-four relationships between categories for labeling the biomedical domain. The SPECIALIST Lexicon and Lexical Tools provide lexical information and programs for language processing.

B. MeSH
Medical Subject Headings (MeSH), created by the National Library of Medicine, mainly consists of a controlled vocabulary and a MeSH Tree. The controlled vocabulary consists of different types of terms, such as Descriptors, Qualifiers, Publication Types, Geographics, and Entry terms. Descriptor terms are main concepts or main headings. Entry terms are the synonyms or the related terms to descriptors. MeSH descriptors are organized in a MeSH Tree, which can be seen as a MeSH concept hierarchy. In the MeSH Tree there are 15 categories, and each category is further divided into subcategories. For each subcategory, the corresponding descriptors are hierarchically arranged from most general to most specific.

III. PROPOSED SYSTEM STRUCTURE
A. Overview
The proposed system uses a knowledge-based approach that builds a semantic representation for the summarization task using medical ontology knowledge. The system takes the user query and maps it to related medical concepts using the MeSH descriptor ontology knowledge source. This type of concept mapping is used for expanding the user query based on synonyms. The revised user query is then submitted to PubMed to retrieve up-to-date medical information. Fig. 1 shows the structure of the proposed ontology enhanced, clustering based medical information summarization system.

Figure 1. Medical Information System Structure (pipeline stages, in order: User Query, Query Expansion, Documents Retrieval, Documents Clustering, Feature Extraction, Summary Generation)
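The query expansion stage of the proposed system can be illustrated with a minimal Python sketch. It assumes a tiny hand-written vocabulary (`TOY_MESH`) standing in for the MeSH descriptor and entry-term lists; the entries, the function names `expand_query` and `to_boolean_query`, and the OR-joined query string are illustrative assumptions, not the authors' implementation or an extract of MeSH.

```python
# Toy stand-in for the MeSH vocabulary: descriptor -> entry terms (synonyms).
# These entries are illustrative assumptions, not an extract of MeSH.
TOY_MESH = {
    "Migraine Disorders": ["migraine", "migraine headache", "sick headache"],
    "Headache": ["cephalalgia", "head pain"],
}


def expand_query(query, vocabulary):
    """Return the original keyword plus its descriptor and sibling entry terms."""
    keyword = query.strip().lower()
    expanded = [keyword]
    for descriptor, entry_terms in vocabulary.items():
        names = [descriptor.lower()] + [t.lower() for t in entry_terms]
        if keyword in names:
            # The keyword maps to this descriptor: add the descriptor and
            # every entry term that is not already in the expanded list.
            for term in [descriptor] + entry_terms:
                if term.lower() not in [e.lower() for e in expanded]:
                    expanded.append(term)
    return expanded


def to_boolean_query(terms):
    """Join the expanded terms into one OR query string (hypothetical format)."""
    return " OR ".join('"{}"'.format(t) for t in terms)


terms = expand_query("migraine", TOY_MESH)
print(terms)
print(to_boolean_query(terms))
```

With the toy vocabulary above, the single keyword "migraine" grows into four search terms. In a real system the vocabulary would come from the MeSH files and the query string would follow the search engine's own syntax.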
B. Document Clustering
An ontology can improve document clustering performance through its concept hierarchy knowledge. The proposed system first clusters terms by calculating term semantic similarity using the MeSH ontology on the retrieved document sets. Then the documents are mapped to the corresponding term cluster. Finally, term re-weighting is applied to assign more weight to terms that are more semantically similar to each other. An ontology-based similarity measure has some advantages over other measures. First, an ontology is created manually by human experts for a domain and is thus more precise; second, compared to other methods such as latent semantic indexing, it is much more computationally efficient; third, it helps integrate domain knowledge into the data mining process. Comparisons of two terms in a document using ontology information usually exploit the fact that their corresponding concepts within the ontology have properties in the form of attributes, a level of generality or specificity, and relationships with other concepts.

The information content based measure [8] associates probabilities with concepts in the ontology. The probability of a concept C is the ratio freq(C) / freq(Root), and its information content is defined as

$$IC(C) = -\log\left(\frac{\mathrm{freq}(C)}{\mathrm{freq}(\mathrm{Root})}\right) \qquad (1)$$

In Equation (1), freq(C) is the frequency of concept C, and freq(Root) is the frequency of the root concept of the ontology. The frequency count of a concept is the sum of the frequency counts of all the terms that map to the concept. Additionally, the frequency count of every concept includes the frequency counts of its subsumed concepts in an IS-A hierarchy.

(2) If a sentence contains an original keyword, assign a weight of 1 to that keyword. If a sentence contains either a MeSH entry term or a MeSH descriptor term, assign a weight of 0.5 to that keyword.
(3) Add all the weights together to get a score for each sentence.
Finally, sentences with high scores are selected for inclusion in the summary for each cluster.

IV. EVALUATION MEASURES
A. Precision
Let a be the number of matched sentences between the generated summary and the original documents, and let b be the number of sentences in the generated summary. Precision is then defined as

$$P = \frac{a}{b} \qquad (2)$$

B. Recall
Let a be the number of matched sentences between the generated summary and the original documents, and let c be the number of sentences in the original documents. Recall is then defined as

$$R = \frac{a}{c} \qquad (3)$$
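The precision and recall definitions of Section IV (P = a / b and R = a / c) can be checked against the counts reported in Table 2 with a short Python sketch. The paper does not state c, the number of sentences in the original documents; the value 20 used below is an assumption inferred from the first row of Table 2 (recall 0.5 with 10 matched sentences).

```python
def precision(matched, summary_sentences):
    """P = a / b: matched sentences over sentences in the generated summary."""
    return matched / summary_sentences


def recall(matched, original_sentences):
    """R = a / c: matched sentences over sentences in the original documents."""
    return matched / original_sentences


# (scheme, b = sentences in generated summary, a = matched sentences), from Table 2.
TABLE_2_COUNTS = [
    ("Original keywords", 40, 10),
    ("MeSH entry terms", 22, 12),
    ("MeSH descriptors", 10, 6),
]

# Assumption: c is not given in the paper; 20 is inferred from R = 10 / c = 0.5.
ORIGINAL_SENTENCES = 20

results = [
    (scheme, round(precision(a, b), 2), round(recall(a, ORIGINAL_SENTENCES), 2))
    for scheme, b, a in TABLE_2_COUNTS
]
for row in results:
    print(*row)
# Original keywords 0.25 0.5
# MeSH entry terms 0.55 0.6
# MeSH descriptors 0.6 0.3
```

Under the assumed c = 20, all three recall values in Table 2 are reproduced, which suggests that the same set of original sentences was used for every indexing scheme.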
C. Experiment Result
For the experiment, the user query “migraine” was given as input. The query was expanded using MeSH descriptors, and the expanded keywords were given as input to the MEDLINE database. The number of documents retrieved was 2698. The original keywords were mapped to MeSH entry terms and to MeSH descriptors. Clustering was done using the k-means clustering algorithm based on the expanded keywords, MeSH terms, and MeSH descriptors.

TABLE 1. INDEXING USING KEYWORDS, MeSH ENTRY TERMS, MeSH DESCRIPTORS

Indexing Scheme        No. of terms
Original keywords      1189
MeSH entry terms       705
MeSH descriptors       231

Documents are represented using either MeSH entry terms or MeSH descriptor terms [10]. Descriptor terms are main concepts or main headings. Entry terms are the synonyms or the related terms to descriptors. MeSH entry term sets are detected in the retrieved documents using the MeSH ontology, and the entry terms are then replaced with descriptor terms using the MeSH ontology. The process of document clustering consists of the following steps:

(1) The retrieved documents are indexed using MeSH entry terms or MeSH descriptor terms.
(2) Term similarity is calculated using the information content based similarity measure, and a similarity matrix is then constructed for the indexed terms.
(3) k-means clustering is run to cluster the documents.
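Clustering steps (1) and (2) can be sketched in Python as follows. The IS-A hierarchy `PARENT`, the entry-term mapping `ENTRY_TO_DESCRIPTOR`, and the sample documents are toy assumptions rather than real MeSH data. The paper does not name the exact similarity formula, so the sketch assumes a Resnik-style measure (the information content of the most informative common ancestor) as one possible information content based measure. Step (3) would then pass the indexed documents to any k-means implementation.

```python
import math

# Hypothetical toy IS-A hierarchy (child -> parent); not real MeSH tree data.
PARENT = {
    "Headache Disorders": "Root",
    "Migraine Disorders": "Headache Disorders",
    "Tension-Type Headache": "Headache Disorders",
    "Analgesics": "Root",
}
# Hypothetical entry-term -> descriptor mapping used for indexing.
ENTRY_TO_DESCRIPTOR = {
    "migraine": "Migraine Disorders",
    "tension headache": "Tension-Type Headache",
    "painkiller": "Analgesics",
}


def index_document(text):
    """Step (1): replace matched entry terms by descriptors and count them."""
    counts = {}
    lowered = text.lower()
    for entry, descriptor in ENTRY_TO_DESCRIPTOR.items():
        n = lowered.count(entry)
        if n:
            counts[descriptor] = counts.get(descriptor, 0) + n
    return counts


def ancestors(concept):
    """Return the concept followed by all of its ancestors up to Root."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def concept_frequencies(docs):
    """Propagate counts upward so each concept includes its subsumed concepts."""
    freq = {}
    for doc in docs:
        for descriptor, n in index_document(doc).items():
            for concept in ancestors(descriptor):
                freq[concept] = freq.get(concept, 0) + n
    return freq


def information_content(concept, freq):
    """Equation (1): IC(C) = -log(freq(C) / freq(Root))."""
    return -math.log(freq[concept] / freq["Root"])


def resnik_similarity(c1, c2, freq):
    """Assumed measure: IC of the most informative common ancestor."""
    common = set(ancestors(c1)) & set(ancestors(c2))
    return max(information_content(c, freq) for c in common)


docs = [
    "Migraine and migraine aura treated with a painkiller.",
    "Tension headache relief.",
    "Painkiller dosage for migraine.",
]
freq = concept_frequencies(docs)
terms = sorted(c for c in freq if c != "Root")
# Step (2): similarity matrix over the indexed concepts.
matrix = [[resnik_similarity(a, b, freq) for b in terms] for a in terms]
for term, row in zip(terms, matrix):
    print(term, [round(v, 3) for v in row])
```

In this toy data the two headache concepts share the ancestor "Headache Disorders" and so receive a positive similarity, while a headache concept and "Analgesics" meet only at the root and receive a similarity of zero.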
TABLE 2. SUMMARY EVALUATION

Scheme               No. of sentences in     No. of matched     Precision     Recall
                     generated summary       sentences
Original keywords    40                      10                 0.25          0.5
MeSH entry terms     22                      12                 0.55          0.6
MeSH descriptors     10                      6                  0.6           0.3

C. Summary Generation
In each cluster, string matching is applied against the MeSH entry term set or the MeSH descriptor set, and the number of matched keywords in each sentence is counted. To extract the sentences to be included in the summary, the following scores are calculated for each sentence.

(1) Simply count the number of matched original keywords and select the sentences with many matching keywords.
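The sentence scoring used for summary generation (weight 1 for an original keyword, weight 0.5 for a MeSH entry or descriptor term, weights summed per sentence, highest-scoring sentences kept) can be sketched in Python. The weights come from the paper; the sentence splitter, the substring matching, the function names, and the sample text are assumptions made for illustration only.

```python
import re

ORIGINAL_WEIGHT = 1.0  # weight of a matched original query keyword (from the paper)
MESH_WEIGHT = 0.5      # weight of a matched MeSH entry/descriptor term (from the paper)


def score_sentence(sentence, original_keywords, mesh_terms):
    """Sum the weights of all keywords and MeSH terms found in the sentence."""
    lowered = sentence.lower()
    score = 0.0
    for keyword in original_keywords:
        if keyword.lower() in lowered:
            score += ORIGINAL_WEIGHT
    for term in mesh_terms:
        if term.lower() in lowered:
            score += MESH_WEIGHT
    return score


def summarize(text, original_keywords, mesh_terms, max_sentences=2):
    """Keep the highest-scoring sentences, returned in their original order."""
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    ranked = sorted(
        sentences,
        key=lambda s: score_sentence(s, original_keywords, mesh_terms),
        reverse=True,
    )
    chosen = set(ranked[:max_sentences])
    return [s for s in sentences if s in chosen]


text = (
    "Migraine disorders affect many adults. "
    "The clinic opens at nine. "
    "Triptans are used for migraine."
)
summary = summarize(text, ["migraine"], ["Migraine Disorders", "Headache"])
print(summary)
```

Here the first sentence scores 1.5 (keyword plus descriptor), the third scores 1.0, and the second scores 0, so the two-sentence summary keeps the first and third sentences.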
For feature selection, three different scores are used, and features with high scores are selected for inclusion in the generated summary. The evaluation results show that the proposed method using MeSH descriptors produces a better summary than the other two methods in terms of precision (0.6 in Table 2), although its recall (0.3) is the lowest of the three.

V. CONCLUSION
The proposed work presents a MeSH ontology enhanced, cluster based summarization of medical documents that provides the information needed by the user. Ontology knowledge is found to be an efficient way to go beyond mere keyword-based information retrieval methods. The quality of the generated summary is measured using precision and recall.

REFERENCES
[1] Chen H., Lally A., Zhu B., Chau M., “HelpfulMed: Intelligent searching for Medical Information over the Internet”, Journal of the American Society for Information Science and Technology (JASIST), Vol. 54(7), pp. 683-694, 2003.
[2] Illhoi, Xiaohua Hu, Il-Yeol Song, “A coherent graph-based semantic clustering and Summarization approach for Biomedical literature and a new summarization evaluation Method”, First International Workshop on Text Mining in Bioinformatics (TMBio), 2006.
[3] McKeown K., Chang S., Cimino J., Feiner S., Friedman C., Gravano L., Patel V., Hatzivassiloglou, Johnson S., Jordan D., Klavans J., Kushniruk A., Teufel S., “PERSIVAL: A System for Personalized Search and Summarization over Multimedia Healthcare Information”, ACM/IEEE Joint Conference on Digital Libraries, Roanoke, pp. 331-340, 2001.
[4] Ping Chen, Rakesh Verma, “A Query-based Medical Information Summarization System Using Ontology Knowledge”, Proceedings of the 19th IEEE Symposium on Computer-Based Medical Systems (CBMS'06), 2006.
[5] Rakesh Verma, Ping Chen, Wei Lu, “A Semantic Free-text Summarization System Using Ontology Knowledge”, IEEE Transactions on Information Technology in Biomedicine, Vol. 5(4), pp. 261-270, 2007.
[6] Stergos Afantenos, Vangelis Karkaletsis, Panagiotis Stamatopoulos, “Summarization from medical documents: a survey”, Journal of Artificial Intelligence in Medicine, Vol. 33(2), pp. 157-177, 2005.
[7] Unified Medical Language System, available at www.nlm.nih.gov/research/umls.
[8] Xiaodan Zhang, Liping Jing, Xiaohua Hu, Michael Ng, Xiaohua Zhou, “A Comparative Study of Ontology Based Term Similarity Measures on PubMed Document Clustering”, Springer Berlin / Heidelberg, Vol. 4443, pp. 115-126, 2008.
[9] Xiaohua Zhou, Xiaodan Zhang, Xiaohua Hu, “MaxMatcher: Biological Concept Extraction Using Approximate Dictionary Lookup”, Springer-Verlag Berlin Heidelberg, PRICAI, LNAI 4099, pp. 1145-1149, 2006.
[10] Zhang X., Jing L., Hu X., Ng M., Xia J., Zhou X., “Medical document clustering using Ontology Based Term similarity measures”, International Journal of Data Warehousing and Mining (IJDWM), 2008.

A. Kogilavani is a Lecturer in the Department of Computer Science and Engineering at Kongu Engineering College. She is currently working toward the Ph.D. degree in Document Summarization. Her research interests are Natural Language Processing, Computational Linguistics, Knowledge Discovery and Information Retrieval.

Dr. P. Balasubramanie received the Ph.D. degree in Theoretical Computer Science from Anna University in 1996. He is currently a Professor in the Computer Science and Engineering department at Kongu Engineering College. He has authored 6 books and has published in 17 international journals and 13 national journals. He serves on the Editorial Board of the ACCST Research Journal. He was awarded the CSIR JRF award.