Concept Integration from the caTIES to i2b2 Using the UMLS Semantic Network
Vincent Yip
University of Arkansas for Medical Sciences, 4301 W. Markham, Slot 637, Little Rock, AR 72205, USA. 501-428-4432
[email protected]
Umit Topaloglu
University of Arkansas for Medical Sciences, 4301 W. Markham, Slot 637, Little Rock, AR 72205, USA. 501-686-7238
[email protected]
ABSTRACT
A wealth of valuable information is available in plain-text clinical reports, and a variety of Natural Language Processing (NLP) platforms are in place to generate concept codes and mine the reports. The information obtained from the reports has more value if it can be integrated with other clinical and genomic data. Informatics for Integrating Biology and the Bedside (i2b2) is being adopted by many institutions; its open-source, scalable framework supports research on genomic and clinical data. In this study, we show that an existing information extraction system can be integrated with i2b2. To do so, the UMLS semantic network is adopted to map the concept codes generated by the Cancer Text Information Extraction System (caTIES) to i2b2. With the proposed approach, more than 200,000 sample records and 18,000 unique concept codes are made accessible and searchable throughout the i2b2 infrastructure.
1. INTRODUCTION
The diversity of clinical research questions requires the use of all available information. Because massive amounts of medical records are being produced (such as pathology reports, radiology reports, physician notes and dictations, problem lists, and lab results), accessing and reviewing all of this information is time consuming. Furthermore, querying data across disparate repositories is a challenge because different data sources use different encoding standards. To help researchers in longitudinal, retrospective, and other types of studies, some institutions have implemented clinical data warehouses or repositories [1] [2] [3] [4] [5]. A typical warehouse is useful for many purposes (e.g., clinical care and research); however, many lack the ability to search narrative documents, which are the care professional’s interpretation of images, lab results, or other clinical data in unstructured plain English. i2b2 is a National Institutes of Health Roadmap initiative and
aims to provide a software framework that enables clinical researchers to access genomic and clinical findings in a unified environment. It consists of modules (“cells”) within a “hive” that interact through web services and XML messaging. Many institutions are adopting i2b2, and some (e.g., the University of Arkansas for Medical Sciences) have existing NLP tools in place as well. To make the concept codes generated by existing NLP tools usable in the i2b2 framework, this paper proposes a model based on the UMLS semantic network. The proposed infrastructure utilizes the cancer Text Information Extraction System (caTIES) [6], Informatics for Integrating Biology and the Bedside (i2b2) [5], the UMLS semantic network [7], and other standards-based open-source tools. The concept codes are generated by caTIES and made accessible and searchable within i2b2 using the UMLS semantic network. In turn, the i2b2 infrastructure serves as a means to share data both within and outside an institution. Ultimately, the proposed model can be used as a road map for other NLP-based repositories to access and share data through i2b2.
Prior to the mapping process, MirthConnect [8], an open-source interface engine, is used to pull clinical data from various data sources such as the inpatient/outpatient Electronic Medical Record (EMR), the Radiology Information System (RIS), and the Laboratory Information Management System (LIMS). The unstructured free-text data are then processed by caTIES, where the patient records are de-identified and structured concept codes are generated. The results are stored in a SQL Server relational database in which the patient Medical Record Number (MRN) is masked so that a patient’s records can be linked in a de-identified manner. The concept codes and patient information are then mapped to the i2b2 star schema using the UMLS semantic type.
2. BACKGROUND
Various data repositories exist, such as ClinQuery [9], DXtractor [10], CareWeb [11], and i2b2 [5]. More recently, approaches have emerged that use NLP for knowledge discovery and clinical decision support [12]. MediClass [13], MedLEE [14], cTAKES [15], HiTex [16], and caTIES are some of the existing NLP systems. In this study, the concept codes generated by caTIES are mapped to the UMLS semantic network and transferred to i2b2.
caTIES [6] is a cancer Biomedical Informatics Grid (caBIG) compliant application developed to extract coded information
from free-text reports using various natural language processing (NLP) techniques. With the help of publicly available NLP tools, algorithms, and the NCI Metathesaurus [17], caTIES is capable of identifying and indexing concepts from plain-text reports. GATE (General Architecture for Text Engineering) [18], a Java toolkit for NLP, is the main part of the NLP core of caTIES. The NCI Metathesaurus is based on the Unified Medical Language System (UMLS) Metathesaurus [19], with additional cancer-centric vocabulary supplements. In addition, the open-source Harvard Scrubber is bundled with caTIES and is used for de-identification.
Concepts in the UMLS Metathesaurus are categorized consistently into the UMLS semantic network using semantic types. A semantic type is a basic semantic category; a term is assigned to a category based on the inherent or functional properties of its concept. For instance, Neisseria belongs to the semantic type “Bacterium”, and the term Fibroblast falls into the semantic type “Cell” [20]. In addition, semantic relations describe the relationships between semantic types. There are two types of semantic relations: (1) hierarchical relationships and (2) non-hierarchical relationships. The hierarchical relationship, the “isa” semantic relation, constructs the hierarchy of types in the semantic network, while the non-hierarchical relationships include “physically related to,” “spatially related to,” “temporally related to,” “functionally related to,” and “conceptually related to.” In this study, only the “isa” relationship is used, because only the hierarchical structure is needed for the mapping.
Informatics for Integrating Biology and the Bedside (i2b2) is an NIH National Center for Biomedical Computing (NCBC) funded initiative. i2b2 aims to provide cohort identification functionality to end users.
Researchers use i2b2 to verify the existence of a group of patients that fulfills certain criteria. Initially, a user performs analysis on de-identified patient information. If the cohort exists, an identified cohort extraction can be executed with institutional review board (IRB) approval. A user-friendly interface is provided for cohort searching. In addition, the i2b2 software (the hive) consists of a series of independent modules known as “cells”. Web services with XML messaging are used to communicate between the cells and with external applications, which simplifies the interaction between the i2b2 software and other applications. Moreover, plug-ins can be developed to extend the functionality of the web client to perform more sophisticated cohort analysis. These advantages are the reason i2b2 is chosen in this study as the platform for data sharing.
3. METHODOLOGY
As depicted in Figure 1, the main components of this model are the data sources, an interface engine, a text extraction system, a UMLS semantic network mapper, and the i2b2 software.
3.1 Mirth Interface Engine
In this study, Mirth is used to collect three types of clinical reports hourly from different source systems: pathology reports, radiology reports, and transcripts. The pathology and radiology reports are retrieved from the hospital’s Sun Java Composite Application Platform Suite (JCAPS) interface engine. Mirth uses the File Transfer Protocol (FTP) to retrieve the inpatient and outpatient HL7 messages from the EHR system. Mirth also transforms the incoming HL7 reports from the JCAPS engine, which spreads the sections of a report across multiple OBX segments; JavaScript is used to extract these sections and group them into their corresponding OBX segments. Regular expressions are employed to identify the segment headers. Finally, each HL7 message is saved as a message file using Mirth’s built-in functions.
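The section-grouping step can be illustrated with a minimal sketch. It is written in Python rather than the JavaScript that runs inside Mirth, and the header pattern (an upper-case label followed by a colon), the function names, the sample report, and the simplified OBX layout are all assumptions made for illustration, not the channel configuration used in the study.

```python
import re

# Assumed header convention: a line that starts with an upper-case label and a
# colon, e.g. "FINAL DIAGNOSIS:". Real reports may need a different pattern.
HEADER_RE = re.compile(r"^([A-Z][A-Z /]+):\s*(.*)$")


def split_sections(report_text):
    """Return a list of (header, body) pairs in document order."""
    sections = []
    header, body = None, []
    for raw_line in report_text.splitlines():
        line = raw_line.strip()
        match = HEADER_RE.match(line)
        if match:
            # A new header closes the section collected so far.
            if header is not None:
                sections.append((header, " ".join(body).strip()))
            header, body = match.group(1).strip(), [match.group(2)]
        elif header is not None and line:
            body.append(line)
    if header is not None:
        sections.append((header, " ".join(body).strip()))
    return sections


def to_obx_segments(sections):
    """Build one simplified OBX segment per section.

    Layout: OBX|set ID|value type|section header|sub-ID|section text.
    HL7 escaping of special characters is deliberately left out of this sketch.
    """
    return [
        f"OBX|{set_id}|TX|{header}||{text}"
        for set_id, (header, text) in enumerate(sections, start=1)
    ]


# Hypothetical report text used only to exercise the functions.
report = """CLINICAL HISTORY: Left breast mass.
GROSS DESCRIPTION: Received in formalin.
Specimen measures 2 cm.
FINAL DIAGNOSIS: Invasive ductal carcinoma."""

for segment in to_obx_segments(split_sections(report)):
    print(segment)
```

Keeping the header detection in a single regular expression means a new report layout can be supported by changing only the pattern.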
Figure 1 – The model overview. Mirth pulls reports from various data sources. Reports are then transformed into a caTIES-readable format and saved to an HL7 message folder. caTIES picks up the reports from the message folder, de-identifies them, and then performs concept extraction. The results are stored in a MySQL database. Mirth then pulls all the extracted concepts, maps them to the UMLS semantic network, and generates concept paths based on the semantic type of the concepts. Patient demographics are also de-identified and pulled from the hospital’s Admission, Discharge, and Transfer (ADT) system. De-identified reports, concept paths, and patient demographics are transferred to i2b2 using Mirth. The i2b2 web client allows researchers to search and view the de-identified patient information.
3.2 De-identification and Concept Code Extraction with the caTIES
Three pipelines run in caTIES: the HL7 message pipeline (HL7 Pipeline), the de-identification pipeline (DE-ID Pipeline), and the concept coding pipeline (TIE Pipeline). The HL7 Pipeline loads HL7 messages from the designated folder into caTIES. The messages are validated against the HL7 specification, version 2.3.1. Information such as the patient demographics and the message content is extracted and stored in the private database of caTIES. The DE-ID Pipeline further processes each message by removing protected health information (PHI) from the message content. The de-identified information is then transferred into the public database of caTIES. By default, caTIES uses the open-source Harvard Scrubber de-identification tool. The TIE Pipeline sends the previously processed HL7 messages to GATE, where relevant concept codes are extracted using the EVS MMTx service, which queries the NCI Metathesaurus. Each concept consists of six important pieces of information: (1) the concept unique identifier (CUI), (2) the original terms used in the report, (3) the preferred name, (4) the semantic type, (5) the concept group, and (6) the negation status. The concept codes are stored in the compressed GATE format in the caTIES public database. In general, caTIES classifies coded concepts as diagnosis, procedure, general concept, or organ type. A built-in negation engine is employed to recognize negation in plain text, such as “no cancer symptoms”.
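The six pieces of information carried by each concept can be pictured as a small record. The sketch below is a hypothetical Python representation, not the caTIES or GATE storage format; the class name, field names, and the placeholder CUI are assumptions. The helper applies the negation convention described later in the paper, where a negated concept receives the keyword “no” in front of its name and the letter “N” at the end of its concept_cd.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedConcept:
    """Hypothetical record holding the six items listed for each concept."""

    cui: str             # (1) concept unique identifier
    source_term: str     # (2) term as written in the report
    preferred_name: str  # (3) preferred name
    semantic_type: str   # (4) UMLS semantic type
    concept_group: str   # (5) e.g. diagnosis, procedure, organ
    negated: bool        # (6) negation status


def i2b2_identity(concept):
    """Return (concept_cd, display name) for loading into i2b2.

    Negated concepts get an "N" suffix on the code and a "no " prefix on the
    name, so they are stored as concepts separate from the non-negated ones.
    """
    if concept.negated:
        return concept.cui + "N", "no " + concept.preferred_name
    return concept.cui, concept.preferred_name


# "C0000000" is a placeholder, not a real CUI.
finding = ExtractedConcept(
    cui="C0000000",
    source_term="emphysema",
    preferred_name="Emphysema",
    semantic_type="Disease or Syndrome",
    concept_group="diagnosis",
    negated=True,
)
print(i2b2_identity(finding))
```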
Mirth also queries the ADT system for de-identified patient demographics such as race, gender, age, and only the first three digits of the ZIP code.
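As a minimal sketch of this reduction, the function below keeps only the demographic fields named above and truncates the ZIP code to its first three digits. The dictionary keys and the sample record are hypothetical; the actual ADT query and field names are not given in the paper.

```python
def deidentify_demographics(record):
    """Keep race, gender, and age, plus the first three ZIP digits only."""
    zip_digits = "".join(ch for ch in str(record.get("zip", "")) if ch.isdigit())
    return {
        "race": record.get("race"),
        "gender": record.get("gender"),
        "age": record.get("age"),
        # Fewer than three digits means no usable prefix.
        "zip3": zip_digits[:3] if len(zip_digits) >= 3 else None,
    }


# Hypothetical ADT row; every value here is made up for illustration.
adt_row = {
    "mrn": "000000",
    "name": "DOE^JANE",
    "race": "White",
    "gender": "F",
    "age": 54,
    "zip": "72205-1234",
}
print(deidentify_demographics(adt_row))
```

Building the output from an explicit list of allowed fields, rather than deleting known identifiers, means an unexpected identifying field in the source row is dropped by default.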
3.3 The UMLS Semantic Mapper
The UMLS Semantic Mapper maps a concept to the semantic network and creates a hierarchical path for the concept. As mentioned earlier, each concept carries a semantic type, which is used to map it to the UMLS semantic network. First, the mapper searches the semantic network for the semantic type of the concept. The search locates the node in the network that matches the semantic type of the concept; this node becomes the leaf of the path. Then, a path is generated from the root node to that leaf node. This path is called the concept path in this study. In addition, the CUI of a concept is used as the concept identifier in i2b2. The semantic network structure is freely available online. In this study, the SRSTR file from the UMLS semantic network website is used to generate the semantic hierarchy. There are 134 semantic types in the network. Mirth is then used to insert the patient demographics, de-identified reports, and concept paths into the i2b2 star schema.
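The path generation described above can be sketched as follows. This is an illustrative Python version, not the mapper used in the study. It assumes SRSTR rows are pipe-delimited with the child type, the relation, and the parent type in the first three fields, and it formats the result as a backslash-delimited path in the style of i2b2 concept paths. The sample rows are a hand-written excerpt for the example, not a copy of the file.

```python
def load_isa_parents(srstr_lines):
    """Map each semantic type to its parent, using only the 'isa' rows."""
    parents = {}
    for line in srstr_lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) < 3:
            continue
        child, relation, parent = fields[0], fields[1], fields[2]
        # Non-hierarchical relations are ignored; an empty parent marks a root.
        if relation == "isa" and parent:
            parents[child] = parent
    return parents


def concept_path(semantic_type, parents):
    """Walk from the semantic type up to the root and format the path."""
    chain = [semantic_type]
    seen = {semantic_type}
    while chain[-1] in parents:
        parent = parents[chain[-1]]
        if parent in seen:  # guard against a malformed, cyclic hierarchy
            raise ValueError(f"cycle detected at {parent!r}")
        chain.append(parent)
        seen.add(parent)
    return "\\" + "\\".join(reversed(chain)) + "\\"


# Illustrative rows in the assumed child|relation|parent|... layout.
SAMPLE_SRSTR = [
    "Physical Object|isa|Entity|D|",
    "Organism|isa|Physical Object|D|",
    "Bacterium|isa|Organism|D|",
    "Organism|interacts_with|Organism|D|",
]

parents = load_isa_parents(SAMPLE_SRSTR)
print(concept_path("Bacterium", parents))
```

For the sample rows, the call prints `\Entity\Physical Object\Organism\Bacterium\`. Storing only child-to-parent links is enough here because each type in the sample has a single “isa” parent.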
Figure 3 – Cohort analysis using the CUI search.
3.4 The i2b2 Software
More than 200,000 sample records and 18,000 unique concept codes are made accessible and searchable in the i2b2 software. Data can be searched in several ways: (1) by concept keyword (Figure 2), (2) by the CUI of a concept (Figure 3), or (3) by browsing the semantic network (Figure 4). The existence of a patient group that fulfills the search criteria is verified using i2b2 (Figure 5). A patient count and a patient set are available after the search (Figure 6). The patient set is the patient object that is used by the analytical tools. Different plug-ins are available in the i2b2 web client. The demographics plug-in is demonstrated below (Figure 7). This plug-in provides drill-down information for a patient group, for example gender, race, and patient count (Figure 8).
Figure 4 – Cohort analysis using the term navigation.
Figure 5 – The i2b2 query status.
Figure 2 – Cohort analysis using the keyword search.
Figure 6 – The i2b2 query results.
Figure 7 – The i2b2 demographic plug-in.
Figure 8 – The i2b2 demographic plug-in result.
4. RESULTS AND DISCUSSIONS
A concept mapping approach has been developed and implemented to insert concepts from another NLP-based system, caTIES in this study, into i2b2. Several functional demonstrations were shown in the previous section. Several technical questions had to be answered before the model was implemented: (1) which concept groups should be included or excluded, (2) how to incorporate negated concepts, (3) how to deal with synonyms, and (4) how to query identified patient information if necessary.
Identified concepts are grouped by caTIES into six main groups: (1) Diagnosis, (2) Negated Diagnosis, (3) Procedure, (4) Organ, (5) General Concept, and (6) Negated Concept. In the proposed approach, the General Concept group is not used, because it consists mostly of modifiers, for example, left, level, one, and two.
Negated concepts are discovered by caTIES, for instance, “no evidence of Breast Carcinoma Metastatic to the Brain”. In this case, Breast Carcinoma is the negated diagnosis. To allow users to search for negated concepts in i2b2, whenever a concept is marked as negated, the keyword “no” is added to the beginning of the concept name to differentiate it from the non-negated concept. At the same time, the letter “N” is appended to the end of the concept_cd (the unique concept identifier in i2b2). As a result, the negated and the non-negated concepts are treated as two separate concepts. Figure 2 illustrates that when a user searches for Emphysema, the corresponding negated concept (no Emphysema), if any, is also searchable.
The concept dimension of i2b2 can incorporate synonyms of concepts. For instance, Carcinoma of breast is a synonym of Breast Carcinoma. In our model, the terms used in the reports are treated as synonyms of the preferred name of the concept.
There are two mapping tables in i2b2, the patient mapping table and the encounter mapping table, in which the source system identifier is mapped to the internal i2b2 identifier. These two tables make it possible to query identified patient information if necessary.
Currently, the i2b2 web client provides several plug-ins to analyze search results. Figure 8 shows the results from the demographics plug-in. To add more analytical functions to the web client, developers can build additional plug-ins without altering the core of i2b2.
5. ACKNOWLEDGEMENT
This work is supported in part by the National Center for Research Resources (NCRR) 1UL1RR029884 and the Arkansas Breast Cancer Research Program.
6. REFERENCES
[1]. Dyer, Karen, et al. Development of a Universal Connectivity and Data Management System. Critical Care Nursing Quarterly, 2001.
[2]. Szirbik, N., Pelletier, C. and Chaussalet, T. Six methodological steps to build medical data warehouses for research. International Journal of Medical Informatics, pp. 683-691, 2006.
[3]. Sahama, Tony R. and Croll, Peter R. A data warehouse architecture for clinical data warehousing. ACSW '07: Proceedings of the fifth Australasian symposium on ACSW frontiers. Ballarat, Australia: Australian Computer Society, Inc., pp. 227-232, 2007.
[4]. Lyman JA, Scully K, Harrison JH, Jr. The development of health care data warehouses to support data mining. Clin Lab Med 2008 Mar;28(1).
[5]. I2B2: Informatics for Integrating Biology & the Bedside. [Online] http://www.i2b2.org.
[6]. Crowley, Rebecca. caTIES: What Is CATIES? caTIES. [Online] 2 17, 2010. [Cited: 2 22, 2010.] http://caties.cabig.upmc.edu/Wiki.jsp?page=WhatIsCATIES.
[7]. McCray, A. T. An upper-level ontology for the biomedical. Comp Funct Genom 2003; 4: 80–84.
[8]. Mirth Connect. mirthproject. [Online] [Cited: 2 22, 2010.] http://www.mirthcorp.com/community/overview.
[9]. Safran C, Porter D, Lightfoot J, et al. ClinQuery: A system for online searching of data in a teaching hospital. Ann Intern Med 1989;111(9):751–6.
[10]. Nigrin DJ, Kohane IS. Data mining by clinicians. Proc AMIA Symposium 1998, pp. 957–961.
[11]. Halamka JD, Osterland C, Safran C. CareWeb, a web-based medical record for an integrated health care delivery system. Int J Med Inform 1999;54(1):1–8.
[12]. Demner-Fushman D, Chapman WW, McDonald CJ. What can natural language processing do for clinical decision support? J Biomed Inform 2009;42(5):760–72.
[13]. Hazlehurst B, Frost HR, Sittig DF, Stevens VJ. MediClass: a system for detecting and classifying encounter-based clinical events in any electronic medical record. J Am Med Inform Assoc 2005;12(5):517–29.
[14]. Friedman C, Hripcsak G, DuMouchel W, Johnson SB, Clayton PD. Natural language processing in an operational clinical information system. Natural Language Engineering 1995;1:83-108.
[15]. Savova GK, Kipper-Schuler K, Buntrock JD, Chute CG. UIMA-based clinical information extraction system. LREC 2008: Towards enhanced interoperability for large HLT systems: UIMA for NLP, 2008.
[16]. Zeng QT, et al. Extracting principal diagnosis, comorbidity and smoking status for asthma research: evaluation of a natural language processing system. BMC Med Inform Decis Mak. 2006;6:30.
[17]. NCI Metathesaurus. National Cancer Institute. [Online] [Cited: 2 22, 2010.] http://ncim.nci.nih.gov/ncimbrowser.
[18]. GATE: a full-lifecycle open source solution for text processing. GATE. [Online] [Cited: 2 22, 2010.] http://gate.ac.uk/overview.html.
[19]. UMLS Metathesaurus. National Library of Medicine. [Online] [Cited: 2 22, 2010.] http://www.nlm.nih.gov/pubs/factsheets/umlsmeta.html.
[20]. McCray, A. T. The UMLS semantic network. 13th Annual Symposium on Computer Applications in Medical Care; Washington, DC (USA); 5-8 Nov. 1989. pp. 503-507.