A Semantic and Syntactic Text Simplification Tool for Health Content

Sasikiran Kandula1, Dorothy Curtis2, Qing Zeng-Treitler1

1 Department of Biomedical Informatics, University of Utah, Salt Lake City, UT. 2 Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, MA.

Abstract

Text simplification is a challenging NLP task, and it is particularly important in the health domain because most health information requires higher reading skills than an average consumer has. This low readability of health content is largely due to the presence of unfamiliar medical terms/concepts and certain syntactic characteristics, such as excessively complex sentences. In this paper, we discuss a simplification tool that was developed to simplify health information. The tool addresses semantic difficulty by substituting difficult terms with easier synonyms or through the use of hierarchically and/or semantically related terms. The tool also simplifies long sentences by splitting them into shorter grammatical sentences. We used the tool to simplify electronic medical records and journal articles, and the results show that the tool simplifies both document types, though to different degrees. A cloze test on the electronic medical records showed a statistically significant improvement in the cloze score from 35.8% to 43.6%.

Introduction

Text simplification is a challenging natural language processing (NLP) task. Its application to the health domain is of special importance because the majority of health information, including articles targeted at consumers, requires reading skills that an average US adult does not have1. This significantly limits the impact of the available health information on improving health outcomes. Tools to automate the simplification of health content can be instrumental in making health information more accessible to consumers. There has been relatively limited prior research on such tools2. One of the few simplification tools that address this issue is a system developed by Elhadad3 that identifies difficult terms and retrieves definitions for these terms using the Google search engine. Retrieved definitions are provided as links to the terms.
The system was found to improve readers’ comprehension by an average of 1.5 points on a 5-point scale. In a 2007 paper, we described an ad hoc method for simplifying health content4. The method addressed the semantic difficulty of health content that results from the use of terms unfamiliar to an average reader. The method identified difficult terms in the text and
tried to simplify them either by replacing them with easier synonyms or by explaining them using easier related terms. The study reported useful and correct simplifications for 68% of the identified difficult terms. The simplification tool discussed here extends this earlier tool by using a larger set of relationship types that were identified through a more systematic review of human-generated explanations. Additionally, it incorporates a module to address the syntactic difficulty of health texts. Though we recognize that syntactic difficulty can arise for a number of reasons, in this version of the tool we focus only on identifying and simplifying complex and compound sentences, i.e. sentences with at least one dependent clause and one independent clause.

To validate the system, we used two types of health documents: electronic medical records and biomedical journal articles. Through the increasing availability of personal health records and services like PubMed, the general public now has access to both of these types of difficult documents. However, these texts require literacy and numeracy skills much higher than those of an average reader. The readability of electronic medical records is low due to extensive use of medical terms and abbreviations, short ungrammatical sentences and very little cohesion. Journal articles are generally more cohesive, but they use difficult terms and excessively long sentences with complicated syntactic structures that are hard to follow. In the following sections, we describe the different modules of the system, report the results of the tests and discuss their implications.

Background

Identification of difficult terms is the primary step in text simplification. In a previous study5, we proposed a frequency-based technique to estimate the difficulty of terms.
This method is based on the observation that terms that occur more frequently in sources targeted at lay readers, such as Reuters News or MedlinePlus queries, tend to be easier. We have also proposed a contextual network approach that estimates a term’s difficulty based on its usage contexts6. Both methods have been validated using data from actual consumer surveys. A combination of these two techniques can be used to obtain a single familiarity score, in the range of 0 (very hard) to 1 (very easy), as an estimate of a
AMIA 2010 Symposium Proceedings Page - 366
lay reader’s familiarity with a given term. All terms with a familiarity score less than a user-defined threshold can be considered difficult and in need of simplification.

The simplification tool described by Zeng-Treitler et al.4 uses the familiarity score to identify difficult terms. Terms with a familiarity score below a specified threshold were simplified through one of the following:

Synonym replacement. A simple approach to simplifying difficult terms is to replace them with their easier synonyms. Prior studies have shown that there are significant differences between a patient’s and a healthcare professional’s mental model of the medical domain7,8 and that they prefer different terms to describe the same concept. The Open Access and Collaborative Consumer Health Vocabulary (OAC CHV) Initiative9 is an on-going effort to curate consumer-friendly alternatives to difficult medical terms. Difficult terms identified for simplification using their familiarity score can have OAC CHV-preferred synonyms that satisfy the familiarity score threshold criterion. If such synonyms existed, the difficult terms were replaced with their OAC CHV-preferred synonyms.

Explanation Generation. For difficult terms that have no OAC CHV-preferred synonyms, explanations were generated in an ad hoc manner. An explanation is defined as a short phrase that makes the term more understandable; it differs from the term’s definition, which is focused on describing the precise and complete semantics. As such, a definition can use terms more difficult than the original term, but an explanation cannot; a definition may also need to be lengthy, but an explanation should be brief.
For example, a definition (MeSH) for hemochromatosis can be ‘a disorder due to the deposition of hemosiderin in the parenchymal cells, causing tissue damage and dysfunction of the liver, pancreas, heart, and pituitary’ and an explanation (MedlinePlus) can be ‘an inherited disease in which too much iron builds up in the body’. To automatically generate explanations, the tool: (a) generates a pool of terms related to the original term; (b) selects the easiest related term using the familiarity score; and, (c) uses a short phrase to describe the relationship between the difficult term and the selected related term. To generate a pool of related terms, synonymous (SY) and hierarchical relationships (PAR, CHD) defined in the Unified Medical Language System (UMLS) Metathesaurus were used. For example, UMLS defines disordered taste as a synonym of dysgeusia and congenital abnormality as a parent of pulmonary atresia. Related terms that have an OAC CHV alternative were replaced with the corresponding OAC CHV
term. The familiarity score of the related term was calculated and the easiest related term that satisfies the familiarity score threshold criterion was selected. If the selected term is a synonym, it was used to replace the original term. If the selected term is a parent of the original term, the replacement was of the form <term> (a type of <parent>). Similarly, if it is a child, the replacement was of the form <term> (e.g. <child>). Pulmonary atresia (a type of birth defect) and oropharyngeal (e.g. mouth) are examples of generated explanations.

A human expert review found 32% of the explanations to be either not useful or incorrect. One major source of these errors was the use of non-applicable hierarchical relations (for example, ‘tobacco abuse’ as ‘a type of psychiatric problem’). This led us to believe that supplementing the hierarchical relations with other relations might reduce these errors and improve the simplification.

Methods

Building on the functionality described in the previous section, we extended the tool to use additional relationship types for explanation generation and added a new syntactic component.

Semantic Simplification

In a related ongoing study, we manually analyzed explanations contained in a set of 150 human-created diabetes-related documents and identified key relations used to explain five common semantic groups10 of health concepts (Table 1). We found that the relationship types between the difficult term and the easier explanation terms are very often dependent on the semantic type of the difficult term.

Original Term Sem. Group | Connector                   | Explanation Term Sem. Group
Disease name             | a condition affecting       | Anatomical Structure
Anatomical Structure     | a part of                   | Anatomical Structure
Device                   | a device/instrument used in | Procedure
Procedure                | a procedure performed on    | Anatomical Structure
Medication               | can have a tradename of     | Medication

Table 1. Connectors used in explanations generated using semantic type information
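
The explanation-generation steps and the Table 1 connectors can be illustrated with a minimal Python sketch. It is not the authors' implementation. The familiarity scores, the threshold of 0.6, the dictionary names and the function names are all hypothetical. The small lookup tables stand in for the UMLS Metathesaurus relations (SY, PAR, CHD) and the OAC CHV. For brevity the sketch falls back to a semantic connector only when no hierarchical explanation is available, rather than appending one to the other.

```python
from typing import Optional, Tuple

# ASSUMPTION: every score, relation and the threshold below is invented for
# illustration; none are values from the paper, the UMLS or the OAC CHV.
THRESHOLD = 0.6
FAMILIARITY = {
    "pulmonary atresia": 0.20, "congenital abnormality": 0.40,
    "birth defect": 0.85, "oropharyngeal": 0.25, "mouth": 0.95,
    "dysgeusia": 0.10, "disordered taste": 0.70,
    "humerus": 0.30, "arm": 0.97,
}

# Stand-in for UMLS relations: SY = synonym, PAR = parent, CHD = child.
RELATED = {
    "pulmonary atresia": [("PAR", "congenital abnormality")],
    "oropharyngeal": [("CHD", "mouth")],
    "dysgeusia": [("SY", "disordered taste")],
}

# Stand-in for OAC CHV consumer-friendly alternatives.
CHV_PREFERRED = {"congenital abnormality": "birth defect"}

# Connectors from Table 1, keyed by the semantic group of the original term.
CONNECTORS = {
    "Disease name": "a condition affecting",
    "Anatomical Structure": "a part of",
    "Device": "a device/instrument used in",
    "Procedure": "a procedure performed on",
    "Medication": "can have a tradename of",
}

# Hypothetical semantic relations: term -> (its semantic group, related term).
SEMANTIC = {"humerus": ("Anatomical Structure", "arm")}


def easiest_related(term: str, threshold: float) -> Optional[Tuple[float, str, str]]:
    """Steps (a) and (b): pool related terms, keep the easiest one above threshold."""
    best = None
    for relation, related in RELATED.get(term, []):
        related = CHV_PREFERRED.get(related, related)  # swap in CHV alternative
        score = FAMILIARITY.get(related, 0.0)
        if score >= threshold and (best is None or score > best[0]):
            best = (score, relation, related)
    return best


def explain(term: str, threshold: float = THRESHOLD) -> str:
    """Step (c): phrase the relationship; return the term unchanged if nothing fits."""
    if FAMILIARITY.get(term, 0.0) >= threshold:
        return term  # already familiar enough, no simplification needed
    best = easiest_related(term, threshold)
    if best is not None:
        _, relation, related = best
        if relation == "SY":
            return related
        if relation == "PAR":
            return f"{term} (a type of {related})"
        return f"{term} (e.g. {related})"
    if term in SEMANTIC:
        group, related = SEMANTIC[term]
        if FAMILIARITY.get(related, 0.0) >= threshold:
            return f"{term} ({CONNECTORS[group]} {related})"
    return term
```

With these toy tables, explain("pulmonary atresia") yields the hierarchical form and explain("humerus") yields the connector-based form. A real system would replace the dictionaries with lookups against the actual vocabularies.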
For instance, it was found to be common to explain a disease/condition with common symptoms observed or the specific body parts affected. Hence, we modified the explanation generation algorithm to take into consideration the semantic type of the difficult term and to generate semantically influenced
explanations. For example, the condition Endometrial adenocarcinoma is explained as Endometrial adenocarcinoma (a condition affecting female genitalia) and Humerus as Humerus (a part of arm). For difficult terms that have both hierarchical and semantic explanations, the semantic explanation is appended to the hierarchical explanation. For example, polynephritis is explained as a type of infection affecting renal pelvis or urinary tract.

Syntactic Simplification

The syntactic simplification is performed at the sentence level. Sentences longer than 10 words are assumed to require syntactic simplification and are processed through a series of modules (discussed below). At the end of this simplification, the original sentence is either retained unchanged or replaced by two or more shorter, and hence presumably simpler, sentences. The modules include (also see Figure 1):

Part of Speech Tagger. A sentence that meets the length criterion is tokenized and each token is annotated with the appropriate part-of-speech (POS) tag, as required by the simplifier module in the next step. We used open source software from OpenNLP11 to perform these tasks.

Grammar Simplifier. This module breaks down the long sentence into two or more shorter sentences and is based on the simplifier proposed by Siddharthan12. The module works by identifying POS patterns and applying a set of transformational rules to produce shorter sentences. For example, this module splits the sentence “Although the subjects with high CACSs may be at higher risk of a first event, further follow-up data are needed before EBT screening can be recommended for type 1 diabetic patients.” into “The subjects with high CACSs may be at higher risk of a first event.” and “But further follow-up data are needed before EBT screening can be recommended for type 1 diabetic patients.”

Output Validator.
To guard against ungrammatical and fragmented simplifications, the output validator module checks each output sentence produced by the above module against the following conditions: a) sentence too short (word count