Semantic Information Indexing and Retrieval on Patient Medical Data

Rodrigo Bittencourt Cabral1, Rafael Andrade1, Cloves Langendorf Barcellos Junior2, Aldo von Wangenheim1,2
1 Post-Graduate Program in Knowledge Engineering and Management (EGC); 2 Post-Graduate Program in Computer Science (PPGCC); Technological Center (CTC), Federal University of Santa Catarina (UFSC), CEP 88040-970, Florianópolis, SC, Brazil
{cabral}@telemedicina.ufsc.br, {andrade, cloves, awangenh}@inf.ufsc.br

Abstract — The need to search for information is growing in all areas of knowledge. One of the biggest problems today is that traditional search systems are deficient at retrieving information that is relevant to the user. In the medical area, knowledge is dispersed across several different databases, so information retrieval demands much of the professional's time and attention. This paper presents a new model to assist health professionals in searching medical information, in order to speed up medical procedures and make the interaction with the computer more attractive. We describe the proposed architecture of a semantic search and indexing engine for medical and toxicological databases, and we report the first results obtained with a prototype. These results lead us to believe that the proposed architecture can help fill a gap in the medical field and allow more professionals to make use of telemedicine technologies.
I. INTRODUCTION

The need to expand specialized medical services is most evident in the public health sector, mainly due to the relative lack of skilled human resources in this area. Low remuneration, high demand for services and the need to work at several institutions are some of the problems faced by doctors in the Brazilian Public Health System. One way to minimize these problems is the use of computer systems by the medical community. A telemedicine system can make the radiologist's work more attractive by allowing the physician to access patient exams over the Internet and write the report from home, or from anywhere with web access. Unfortunately, several barriers must be overcome before a telemedicine system can be fully accepted. According to the Telemedicine Information Exchange [1], many institutions do not allow doctors from other institutions or other states to contribute to the care of a particular patient, and institutions impose various restrictions on the payment of health professionals. The fear of lawsuits is another concern of physicians regarding telemedicine, although most patient-satisfaction studies show that patients are generally satisfied with care provided at a distance. Access problems and legal issues are not the only challenges for the adoption of telemedicine systems; other technological questions also hinder their use. The absence of a unique patient record is a major problem of the Brazilian Public Health System: each institution has its own way of storing the
patient's information. Thus, one patient may have multiple medical records if he or she is treated in different medical centers. This treatment model lets information become dispersed across the network, and the patient databases do not allow a user to find information in a satisfactory time. To search for a particular patient, the system must look up the requested information in all databases, and to retrieve it quickly and safely the search engine needs a very efficient indexing mechanism. A project developed by the Federal University of Santa Catarina (UFSC), together with the Health Department of Santa Catarina (SES/SC), made the radiologist's working procedures more attractive. The project, called Rede Catarinense de Telemedicina (Santa Catarina Telemedicine Network, RCTM), produced, among other technologies, the telemedicine Portal, which allows medical professionals to access exams sent from various cities throughout the state of Santa Catarina and to write the corresponding reports over the Internet [2]. The objective of this study is to centralize patient information, images, laboratory information, toxicological clinical care, diagnoses and pictures in a central database. Currently our database contains more than 150,000 registered patients and nearly 200,000 stored exams and diagnoses. In addition, the system has a toxicological information database containing articles and other data about previous treatments of certain diseases and intoxications. This large amount of information can support medical care in the future. However, information retrieval in the current system consumes much of the professional's time, because the user must look for information in various locations and, in most cases, the search model is inefficient at obtaining answers.
Moreover, the current search model cannot retrieve information from the diagnoses of patient examinations. One way to solve this problem is to use an information retrieval (IR) mechanism based on semantic dictionaries and medical ontologies. An ontology can be defined as a set of concepts, relationships and rules that govern those relationships; it is a way to represent semantic relations, such as objects and their relationships in a particular domain [3]. The use of IR techniques to analyze, index, extract and rank information aims to retrieve as many relevant documents as possible with the lowest possible error. Combining these two knowledge areas (the Semantic Web and Information Retrieval) can boost the
recovery of relevant documents, considering the diversity of semantic relations contained in a medical ontology. In this paper we study the problem of semantic information retrieval from different databases and from different data formats, such as free text in patient medical diagnoses, scientific papers in PDF format, and XML files [4]. The technique is divided into two components: indexing and searching. The indexing process retrieves information from the different databases, performs semantic analysis, and extracts medical terms that are compared against the medical ontology and stored in an index database. The query process identifies a pattern for classifying the text supplied by the user and creates relationships between two or more search terms (e.g. body parts and symptoms). This process builds a decision tree that is used to expand the search scope and return more relevant information to the user, based on syntactic and semantic analysis of the data. The remainder of this paper is organized as follows: Section II reviews related work; Section III presents the proposed architecture, including the indexing and query models; finally, we present some discussion remarks and future work.
II. RELATED WORK

When referring to semantic search, we face a wide range of related approaches, and in most cases it is possible to identify similarities between them. Some of the approaches that resemble the one adopted in this work are described below. Rocha et al. [5] propose a hybrid search engine for the Semantic Web that combines traditional search methods with spreading-activation techniques applied to a semantic model derived from an ontology. The purpose of that work is to enrich the traditional search process with information extracted from a semantic model of the application. Spreading activation is essentially a concept explorer: the search process starts from an initial set of concepts and, given some restrictions, seeks to recover information closely related to those concepts. The method is used to find related concepts in the ontology from a set of previously specified values. The objective is to extract knowledge from the ontology, assigning a numerical weight to each relationship instance in the model, and to obtain all instances of a concept related to a particular search keyword, even when the keyword itself does not appear in the database. The authors present a generic semantic search model applied to two specific areas. Their results indicate that the hybrid propagation technique outperforms the other propagation techniques studied. However, the model works only for a specific domain and has not been tested in other areas. One weakness of the presented algorithm is the lack of semantic interpretation during spreading activation, which means that some valid inferences are not recognized by the system.

Another proposal evaluated is a study on information retrieval in ontologies using the vector-space model. In that study, Castells et al. [6] present a mechanism that uses semantic annotations, RDQL queries and a ranking module, and, as a differential, adapts the vector-space model to assign weights to keywords through an algorithm that uses the frequency of occurrence of a term in a document to determine its relevance in context. These weighting features are reused in the implementation of our parser module.

Hristidis et al. [7] present a set of challenges for information discovery on Electronic Medical Records (EMR) stored in semi-structured databases. The authors describe the problems of searching XML documents encoded in medical standards for representing clinical data. The biggest problem of traditional search over XML documents is that it retrieves few significant results. A single XML node can contain diverse information, such as patient identifiers, procedure codes, word descriptions [8] or toxicological drug data. When the information is encoded with dictionaries, the system must interpret these codes to retrieve the correct information for the user, without returning redundant or unclear results. In most cases, plain-text descriptions are used to enrich the information in a medical report or to identify a toxicological substance.

III. PROPOSED ARCHITECTURE
The proposal has previously reported satisfactory results. However, the motivation for a parallel study concerns performance. In the approaches above, information travels in both directions each time a query is made. The need for high performance in the medical and toxicological context demanded the development of an architecture that dispenses with this excessive traffic by indexing all the required information only once, at installation time. Thus, we expect to obtain a semantic search that performs well both in structure (precision and recall) and in speed, because the index allows quick and accurate access to the queried information. To build this search engine we can make use of several existing alternatives that assist in the knowledge-recovery steps. However, in the context presented by this study, some modules must be implemented from scratch to ensure the expected result. One of these modules is the parser used to access the ontology and extract its classes, properties and instances.

A. Parser Development

It is this parser, together with the semantic annotation (analyzer) module, that provides the mechanism for indexing the information and a pre-sorting of the results we wish to obtain in the queries. It was developed using the API provided by Protégé, an open-source Java library that gives access to OWL and RDF(S) models and provides a series of classes and methods to read, save, query and manipulate OWL files [8]. Using this API, we developed a recursive algorithm that traverses the ontology classes, describing subclasses and instances as well as the properties of those instances. With this method it is also possible to describe the relationships between instances, which are then used for
integration with the analyzer module.

B. Indexing Model

The first experiment after completing the parser implementation was to export the ontology instances and declare their relationships in a form that could later be imported by the analyzer module. Our proposal makes use of data structures and algorithms such as inverted indexes, parsers, the Levenshtein distance and Cover Density Ranking (CDR).
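The recursive traversal performed by the parser can be illustrated with a minimal sketch. Note that this is not the Protégé-OWL API (which is Java); the nested-dictionary ontology, the class names and the instance names below are all hypothetical, standing in for the classes, subclasses and instances extracted from the real OWL model.

```python
# Illustrative sketch only: a toy nested-dict structure standing in for the
# OWL model exposed by the Protege-OWL API. All names are hypothetical.
def traverse(ontology, class_name, depth=0, out=None):
    """Recursively visit a class, collecting its instances and subclasses."""
    if out is None:
        out = []
    node = ontology[class_name]
    out.append((depth, class_name, node.get("instances", [])))
    for sub in node.get("subclasses", []):
        traverse(ontology, sub, depth + 1, out)
    return out

toy_ontology = {
    "Substance": {"subclasses": ["Drug", "Toxin"], "instances": []},
    "Drug":      {"subclasses": [], "instances": ["dipyrone"]},
    "Toxin":     {"subclasses": [], "instances": ["strychnine"]},
}

for depth, name, instances in traverse(toy_ontology, "Substance"):
    print("  " * depth + name, instances)
```

The flat list returned by the traversal is the kind of export format that can then be handed to the analyzer module for indexing.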
Fig. 1. Proposed Index Structure

Indexing large data volumes requires a structure that remains consistent and extensible while running the complex search queries issued by the proposed analyzer module. To ensure consistency and performance in our index repository, we use an inverted index approach, specifically a generalized inverted index [9]. Most of today's applications are indexed using so-called "direct" indexes, which store pairs of document identifiers and some text content of the document. An inverted index, in contrast, stores pairs of words or fields together with the identifiers of the documents that contain them, so there are as many entries as there are words or fields in a document. The basic structure shown in Fig. 1 covers fields such as a general identifier (id), an identifier of the medical examination (examination) and a medical report (report). The performance gain obtained with the inverted index is due in large part to the choice of B-trees as the data structure for storing the indexes and their identifiers; B-trees also allow a hierarchical arrangement of documents and their identifiers.

C. Query Model

The next step is to build the search engine itself. This step uses a combination of dictionaries, controlled vocabularies, spell checkers and stemming algorithms (stemmers). Most documents processed by our proposal are written in Brazilian Portuguese (pt_BR), but we chose to develop a flexible model to which different languages can be added in the analyzer module. This requires a spell checker for each particular language, so we decided to use Ispell, since it supports multiple languages and has been widely tested and proven effective in several other projects.
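The field-aware inverted index of Fig. 1 can be sketched minimally as follows. The sample documents and terms are hypothetical, and an in-memory hash map is used here only for illustration; the actual repository stores its postings in B-trees, as described above.

```python
from collections import defaultdict

# Minimal sketch of a field-aware inverted index (hypothetical sample
# documents). Keys are (field, term) pairs; values are sets of document ids.
index = defaultdict(set)

def add_document(doc_id, fields):
    """Index every whitespace-separated term of every field of a document."""
    for field, text in fields.items():
        for term in text.lower().split():
            index[(field, term)].add(doc_id)

def search(field, term):
    """Return the sorted ids of all documents containing term in field."""
    return sorted(index[(field.lower(), term.lower())])

add_document(1, {"examination": "thorax xr", "report": "normal findings"})
add_document(2, {"examination": "thorax ct", "report": "nodule in left lung"})

print(search("examination", "thorax"))  # both documents mention "thorax"
print(search("report", "nodule"))       # only the second document
```

Because lookups are keyed by (field, term) rather than by document, a query touches only the postings of the terms it mentions, which is the property that makes the inverted layout attractive for large collections.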
Fig. 2. Query example

The search algorithm itself uses concepts such as: proximity (specify how many words apart two or more terms may be); inclusive range (find all documents whose field values lie between specified upper and lower limits); exclusive range (find all documents whose field values are strictly greater than the lower limit and strictly less than the upper limit); term relevance boost (increase the relevance of documents containing a given term, where a number determines how much more relevant that term is than the other terms of the query); and wildcards (single- and multiple-character placeholders expanded by the analyzer module's iterator). These concepts are made explicit in the query sent to the analyzer module; the user can also define which words or phrases are mandatory in the results and can use Boolean operators to enrich the query. Fig. 2 displays an example of a query submitted to the parser module, where the term "cápsula" must always be included in the final results and its relevance is increased by three units relative to the other query terms. The word "azuis" appears in the query but is not necessarily included in the final results. Finally, the term "vertigem" is mandatory in the final results. This proposal also supports the so-called "fuzzy search", or approximate string matching, which is the technique of finding approximate matches to a pattern in a string. For our query model to support this kind of search, we implemented the Levenshtein distance algorithm. The Levenshtein distance uses insertion, deletion and substitution of single characters to transform one string into another. The algorithm uses an (L1+1) × (L2+1) matrix, where L1 and L2 are the lengths of the two strings, and iterates over the operations described above.
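A minimal sketch of this (L1+1) × (L2+1) dynamic-programming matrix is shown below (Python is used here purely for illustration; it is not the system's implementation language):

```python
def levenshtein(a, b):
    """Fill an (len(a)+1) x (len(b)+1) matrix d, where d[i][j] holds the
    minimum number of insertions, deletions and substitutions needed to
    turn a[:i] into b[:j]; d[-1][-1] is the Levenshtein distance."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                 # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j                 # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

print(levenshtein("capsula", "capsulas"))  # 1 (one insertion)
```

A fuzzy query then accepts an indexed term as a match whenever its distance to the query term falls below a chosen threshold.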
The Levenshtein distance between two strings is then given by the minimum number of operations needed to transform one string into the other [10]. These are some of the techniques proposed in our query model; further improvements, such as Cover Density Ranking (CDR) support and the inclusion of new concepts and techniques, may become part of the final model.

D. Proposed Integration Model

Each module described above has a specific function within the proposed mechanism; their integration is illustrated in Fig. 3. Before the user is able to query the system, the parser module reads the pre-built ontology repository and extracts the ontology classes, subclasses, instances and relationships in a format readable by the analyzer module, which is responsible for indexing. Once the information is indexed, the user can perform queries through the interface, which sends them to the analyzer module to be executed against the index.
Fig. 3. Proposed Architecture

Fig. 4. The search result in the first prototype

IV. DISCUSSION

In this short paper we described a system that enables semantic search based on medical ontologies in specific domains, such as medical and toxicological reports; the architecture and its indexing and query methods were discussed. We introduced an integrated information retrieval (IIR) model, a conceptually new approach to recovering information from patient medical diagnoses and from a toxicological database, based on a medical ontology especially developed for this work. Although our research is still a work in progress, many features have already been implemented. Part of the parser is ready and exports information in the format shown in Fig. 4, and a version of the mechanism that manipulates the contents already allows searches to be made, so that its performance can be evaluated. The searches conducted so far demonstrate the functionality and applicability of the proposed architecture. As an example, we indexed documents with information about fictional medicines and their characteristics, such as indications, contraindications, side effects and a textual summary. In the searches we used the Brazilian Portuguese keywords "cápsulas azuis vertigem", with the intention of describing a symptom and the form in which the drug is available. This query returned results from our product database, as shown in Fig. 4, where "cápsulas azuis" (blue capsules) was described in the summary_description field and "vertigem" (vertigo) was indexed in the possible_symptoms field. One of our major objectives is to retrieve information from multiple distributed databases (such as the clinical toxicology ontology, the scientific article base, the telemedicine diagnosis base and the patient data), indexing these bases with a medical ontology specifically developed for the toxicological information and using DeCS [11] for the patient report base.
The goal is to semantically index the information so that, when the user searches for a particular set of words, the system returns only the information the user has requested, together with synonyms related to the search terms as stored in the ontology. This ontology, in a tree representation, can also help in defining weights for the terms, and the query method should present this and other information to the user. Ranking and ordering the information are further objectives of the query model. We believe that a semantic search system able to interpret both structured and unstructured information bases should provide greater accuracy in the answers given to the user. Thus, we understand that an information retrieval system backed by a semantic database may help fill this gap in the medical field and allow more professionals to use telemedicine technologies.

V. REFERENCES

[1]
Telemedicine Information Exchange. Available at: http://tie.telemed.org/articles/article.asp?path=telemed101&article=tmcoming_nb_tie96.xml. Accessed: 24 Aug. 2009.
[2] J. Wallauer, D. Macedo, R. Andrade, and A. von Wangenheim, "Building a National Telemedicine Network," IT Professional, vol. 10, no. 2, pp. 12-17, Mar./Apr. 2008.
[3] T. Berners-Lee, J. Hendler, and O. Lassila, "The Semantic Web," Scientific American, vol. 284, pp. 28-37, 2001.
[4] K. Petry, L. Tomazella, R. Andrade, and A. von Wangenheim, "Utilização do Padrão HL7 para Interoperabilidade em Sistemas Legados na Área de Saúde," Revista Brasileira de Engenharia Biomédica, vol. 25, pp. 29-40, Apr. 2009.
[5] C. Rocha, D. Schwabe, and M. P. Aragao, "A hybrid approach for searching in the semantic web," in Proceedings of the 13th International Conference on World Wide Web, New York, NY, USA: ACM, 2004.
[6] P. Castells, M. Fernandez, and D. Vallet, "An adaptation of the vector-space model for ontology-based information retrieval," IEEE Transactions on Knowledge and Data Engineering, vol. 19, pp. 261-272, 2007.
[7] V. Hristidis, F. Farfán, J. White, R. Burke, and A. Rossi, "Challenges for Information Discovery on Electronic Medical Records," in Next Generation of Data Mining, CRC Press, 2009.
[8] Protégé-OWL API. Available at: http://protege.stanford.edu/plugins/owl/api/.
[9] GIN - Generalized Inverted Index. 2009. Available at: .
[10] E. Ristad and P. Yianilos, "Learning string-edit distance," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, pp. 522-532, 1998.
[11] DeCS - Descritores em Ciências da Saúde. 2008. Available at: .