Automated mapping of clinical terms into SNOMED-CT. An application ...

This is an accepted manuscript version for an article to be published in the journal Journal of Medical Systems. Copyright to the final published article belongs to Springer Science +Business Media New York. The final publication is available on Springer's website http://link.springer.com/article/10.1007%2Fs10916-0140134-x


Automated mapping of clinical terms into SNOMED-CT. An application to codify procedures in pathology.

J.L. Allones (a), D. Martinez (b), M. Taboada (a)

(a) Department of Electronics and Computer Science, Campus Vida, University of Santiago de Compostela, 15782, Santiago de Compostela, La Coruña, Spain

(b) Department of Applied Physics, Campus Vida, University of Santiago de Compostela, 15782, Santiago de Compostela, La Coruña, Spain

Corresponding author: [email protected] +34 8818 13580


Abstract

Clinical terminologies are considered a key technology for capturing clinical data in a precise and standardized manner, which is critical to accurately exchange information among different applications, medical records and decision support systems. An important step to promote the real use of clinical terminologies, such as SNOMED-CT, is to facilitate the process of finding mappings between local terms of medical records and concepts of terminologies. In this paper, we propose a mapping tool to discover text-to-concept mappings in SNOMED-CT. Name-based techniques were combined with a query expansion system to generate alternative search terms, and with a strategy to analyze and take advantage of the semantic relationships of the SNOMED-CT concepts. The developed tool was evaluated and compared to the search services provided by two SNOMED-CT browsers. Our tool automatically mapped clinical terms from a Spanish glossary of procedures in pathology with 88.0% precision and 51.4% recall, providing a substantial improvement in recall (28% and 60%) over other publicly accessible mapping services. The improvements reached by the mapping tool are encouraging. Our results demonstrate the feasibility of accurately mapping clinical glossaries to SNOMED-CT concepts by means of a combination of structural, query expansion and name-based techniques. We have shown that SNOMED-CT is a rich source of knowledge from which to infer synonyms for the medical domain. Results show that an automated query expansion system partially overcomes the challenge of vocabulary mismatch.

Keywords: SNOMED CT; mapping; ontology matching; query expansion; information retrieval; clinical terminology;


1- Introduction

The use of free text and local terms in electronic health records is widespread and is a source of poor data quality and a barrier to semantic interoperability, computerized clinical decision support and secondary use of data [1]. Capturing data in a precise and standardized manner is critical to integrate and exchange information among different applications, medical records, and decision support systems. Therefore, clinical terminologies are required by information systems to capture, process, store, use, and transfer clinical data in a standard form [2]. Nowadays, the Systematized Nomenclature of Medicine Clinical Terms (SNOMED-CT) is the largest terminology in the biomedical domain. It is intended to provide a comprehensive, multilingual terminology for encoding all aspects of electronic health records, and it is based on a taxonomy of more than 300,000 concepts associated with terms [3]. A literature review of SNOMED-CT use, covering 488 papers from 2001 to 2012, revealed that only a few studies have reached a considerable maturity level of SNOMED-CT use in clinical practice [4]. Therefore, an important step to promote the use of SNOMED-CT in clinical settings is to facilitate the process of finding correspondences (or mappings) between local terms of medical records and SNOMED-CT concepts.

Currently, there are several browsers and services that enable searching for a term in SNOMED-CT. Most of them mainly use conventional name-based techniques, such as exact string match, partial string match (using wildcards) and Boolean queries. These techniques are insufficient for searching a huge terminology such as SNOMED-CT, whose concepts are more granular and detailed than those of any other terminology system [5-7]. To our knowledge, SNOMED-CT search engines do not include linguistic techniques to solve the problem of vocabulary mismatch, or structural techniques to exploit the semantic relationships of the terminologies. In 2008, J. Rogers et al. evaluated the features of 17 SNOMED-CT browsers and concluded that text-to-concept searching functionality in most browsers is impoverished and idiosyncratic [8]. Moreover, P. Ruch et al. [6] and M. Chiang et al. [9] have argued that new search tools are required to improve the quality of SNOMED-CT coding.

In this paper, we propose a new workflow for searching in SNOMED-CT, which combines classical techniques (lexical and name-based) with two novel techniques to discover text-to-concept mappings in the SNOMED-CT terminology. The first technique applies a query expansion system to reformulate and expand the search terms. The second one exploits the SNOMED-CT relationships to obtain additional context for the concepts. The workflow was used to build a mapping tool focused on the SNOMED-CT terminology with two different matching profiles: a fully automatic one, hereafter referred to as SAMT (SNOMED-CT Automatic Mapping Tool), and a semi-automatic one, hereafter referred to as SSMT (SNOMED-CT Semi-Automatic Mapping Tool). SAMT aims to map clinical terms to SNOMED-CT concepts automatically, that is, without the intervention of an expert user, whereas SSMT is intended to aid experts in the mapping process by recommending several SNOMED-CT concepts for a clinical term. Several experiments were conducted with a Spanish glossary of procedures in pathology to assess the developed tool and to compare it with other search services provided by the US National Library of Medicine (NLM) [10] and ITServer browsers [11].

2- Background

2.1. SNOMED-CT

SNOMED-CT is a comprehensive clinical terminology managed by the International Health Terminology Standards Development Organization (IHTSDO) that provides a standard for clinical information. SNOMED-CT is currently available in English, Spanish, Danish and Swedish, with other translations under way. It contains over 300,000 active concepts, each of which has a unique id and several associated descriptions. Each description is a human-readable term that describes the same clinical idea, so the descriptions of a concept can be seen as synonymous terms (see Fig. 1). Moreover, each concept has one or more semantic relationships to other concepts. These relationships can be classified into the following two categories:

● The hierarchical relationships ‘IS A’ relate a concept to its more general concepts.
● The attribute or logical relationships associate two concepts, specifying a characteristic of one of them. There are different kinds of attribute relationships; for example, procedure site, which describes the body site acted on or affected by a procedure, or method, which represents the action being performed to accomplish a procedure (see examples in Fig. 1).

2.2. Matching techniques

Some disciplines (e.g. information retrieval, ontology matching [12] or natural language processing) have been actively working on developing techniques to find mappings between terms and concepts. These techniques can be grouped into three types [13]: name-based, linguistic and structural techniques.

Name-based techniques compare the strings of search terms and concept names in order to find those that are similar. There are many ways to compare strings, depending on whether the strings are viewed as a sequence of letters or as a set of words. Two of the most frequently used methods are edit distance and token-based distance. Edit distance quantifies how similar two strings are by counting the minimum number of operations required to transform one string into the other. Token-based distance considers a string as a set of words and measures how similar two strings are in terms of the number of common words. There are several software packages for computing string distances, such as SimMetrics (footnote 1), SecondString (footnote 2), the Alignment API (footnote 3) and SimPack (footnote 4).

Terminological or linguistic resource-based techniques use resources such as dictionaries, lexicons, and thesauri. These resources provide linguistic relationships (e.g., synonyms, hyponyms) to find correspondences between terms. WordNet (footnote 5) is an electronic lexical database which groups English words into sets of synonyms called synsets. WordNet has been widely used in information retrieval and ontology integration to expand query terms with synonyms, hyponyms and hypernyms [15, 16]. Some approaches, rather than using explicit linguistic relations, have inferred missing synonymy from large concept-oriented metathesauri [17, 18].

Finally, semantic or structure-based techniques use structural properties of ontologies, such as semantic relationships, to obtain more information about the concepts and so help in the matching task [19, 20].

2.3. NLM and ITServer SNOMED-CT browsers

The NLM and ITServer browsers provide text-to-concept search methods that rely mainly on name-based techniques. The NLM browser takes advantage of the UMLS Metathesaurus, which contains information about biomedical concepts and synonymous terms from many controlled vocabularies and classifications, such as SNOMED-CT, MeSH and OMIM [10]. The NLM browser provides several text-to-concept mapping capabilities, two of which are especially appropriate for both English and non-English search terms: Exact Match and Word Match. Exact Match retrieves only concepts that include a synonymous term that exactly matches the search term. Word Match breaks the search term into its component parts, or words, and retrieves all concepts containing any of those words. The ITServer browser provides a service to search for SNOMED-CT concepts from a list of terms in natural language (Spanish and English terms) [11]. This service mainly uses name-based techniques, such as edit distances and token-based distances.

1 http://sourceforge.net/projects/simmetrics/
2 http://secondstring.sourceforge.net/
3 http://alignapi.gforge.inria.fr/
4 https://files.ifi.uzh.ch/ddis/oldweb/ddis/research/simpack/index.html
5 http://wordnet.princeton.edu/

3- Methods

Our mapping tool includes a set of name-based, terminological, structural and disambiguation techniques to search for text-to-concept mappings. Fig. 2 shows the workflow of the tool. First, the search term is normalized. Next, it is expanded with alternative terms. Then, name-based and structure-based techniques take the alternative terms and search for equivalent concepts in the SNOMED-CT terminology. These techniques produce a ranking of candidate concepts, from which a disambiguation step selects the final SNOMED-CT concepts. Sections 3.1-3.5 give an overview of the developed techniques. Section 3.6 then describes the differences between the two profiles created within the tool: SAMT and SSMT.

3.1 Preprocessing search terms and SNOMED-CT descriptions

Before applying the workflow shown in Fig. 2, all descriptions of both the International and Spanish Edition of SNOMED-CT were normalized by means of:

● Tokenization of term names into their constituent words (tokens).
● Normalization of terms, mainly case folding (so that comparisons are case-insensitive) and the elimination of stopwords and special characters. A different list of stopwords was used for each language (English and Spanish).

The name-based and structure-based techniques in Fig. 2 use these normalized SNOMED-CT descriptions instead of the original ones. Search terms are preprocessed following the same procedure as the SNOMED-CT descriptions.
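The normalization steps above can be sketched as follows. This is only an illustrative sketch: the stopword lists, the function name `normalize` and the rule for which characters are removed are assumptions, since the text does not specify the lists or character classes actually used by the tool.

```python
import re

# Illustrative stopword lists (assumed); the real per-language lists are not given in the text.
STOPWORDS = {
    "en": {"of", "the", "a", "an", "and", "in", "to"},
    "es": {"de", "del", "la", "el", "los", "las", "y", "en"},
}

def normalize(term, lang="en"):
    """Lowercase a term, drop punctuation, tokenize it and remove stopwords."""
    lowered = term.lower()
    # Replace every character that is neither a word character nor whitespace with a space.
    cleaned = re.sub(r"[^\w\s]", " ", lowered)
    tokens = cleaned.split()
    return [t for t in tokens if t not in STOPWORDS[lang]]

print(normalize("Excision biopsy of skin (procedure)"))
print(normalize("Biopsia de piel", lang="es"))
```

Applying the same function to both the search term and the SNOMED-CT descriptions keeps the later string comparisons symmetric.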

3.2 Query Expansion

Expert encoders frequently make use of synonyms when the search for a term is unsuccessful [21]. Based on this strategy, we have developed an automatic query expansion technique that reformulates or expands the search term in order to improve retrieval performance. To our knowledge, this technique has not previously been used to search SNOMED-CT. Starting with a search term “T”, the technique first splits the term into its constituent words “W”. Next, it generates a set of synonyms for each word “W” of the term “T” (see section 3.2.1). Finally, the search term is expanded by replacing one or more words with the generated synonyms (see section 3.2.2).

3.2.1 Automatic discovery of synonyms in SNOMED-CT

This process relies on the following assumption: if two multi-word synonymous descriptions in SNOMED-CT share subsequences of words, then the non-common parts might be synonyms. For example, given that “renal biopsy” is a synonym of “kidney biopsy” (see Fig. 1), it is quite likely that “renal” is a synonym of, or a term similar to, “kidney”. For each word “W” of a search term “T”, our approach first extracts pairs of synonymous SNOMED-CT descriptions associated with the same concept, such that one of the descriptions contains “W” and the other one does not. For example, in Fig. 3, for the word “excision”, our approach extracts the pair of descriptions “excision of lung” and “resection of lung”, both belonging to the same concept (conceptId=119746007).

Then, our method applies subsequence analysis to each pair of extracted descriptions. Specifically, it searches for the parts common to both descriptions, discards them and obtains the two non-common strings. If one of the strings is equal to the word “W”, then the other string is a synonym for “W”. Hereafter, we say that a ‘hit’ occurs when the algorithm detects a synonym for the target word by lexically comparing two SNOMED-CT descriptions. Following the example with “excision” and the pair of descriptions “excision of lung” and “resection of lung”, “of lung” is the common part of both descriptions, whereas “excision” and “resection” are the non-common strings (see Fig. 3). Thus, “resection” is a new synonym for “excision”. After processing each pair of synonymous descriptions, the algorithm obtains a ranking of synonyms of “W” sorted by the number of hits, that is, the number of times that they were found as synonyms.
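A minimal sketch of this hit-counting procedure is given below. It approximates the subsequence analysis by stripping the words shared at the start and at the end of both descriptions, which is an assumption; the text does not state how common parts are located. The function names and the concept entries labelled as illustrative are also assumptions.

```python
from collections import Counter
from itertools import permutations

def non_common_parts(desc_a, desc_b):
    """Strip the shared leading and trailing words and return the two remainders."""
    a, b = desc_a.split(), desc_b.split()
    while a and b and a[0] == b[0]:
        a, b = a[1:], b[1:]
    while a and b and a[-1] == b[-1]:
        a, b = a[:-1], b[:-1]
    return " ".join(a), " ".join(b)

def discover_synonyms(word, concepts):
    """Rank candidate synonyms of `word` by their number of hits."""
    hits = Counter()
    for descriptions in concepts.values():
        # Ordered pairs: the first description must contain the word, the second must not.
        for with_w, without_w in permutations(descriptions, 2):
            if word in with_w.split() and word not in without_w.split():
                rest_a, rest_b = non_common_parts(with_w, without_w)
                if rest_a == word and rest_b:
                    hits[rest_b] += 1  # one 'hit' for this candidate synonym
    return hits.most_common()

concepts = {
    "119746007": ["excision of lung", "resection of lung"],  # pair taken from the text
    "illustrative-1": ["renal biopsy", "kidney biopsy"],       # descriptions from the text, id assumed
    "illustrative-2": ["excision of kidney", "resection of kidney", "nephrectomy"],  # assumed
}
print(discover_synonyms("excision", concepts))  # [('resection', 2)]
```

Note that the pair “excision of kidney” / “nephrectomy” produces no hit, because the non-common part of the first description is not the single word “excision”.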

3.2.2 Generation of alternative terms

Two conditions are applied to select the final list of synonyms of each word: (1) a synonym must have at least 5 hits and (2) only the three best synonyms (based on the number of hits) are selected. Condition 1 eliminates some erroneous or very infrequent synonyms detected by the algorithm. Condition 2 keeps only the best synonyms of a word and prevents the number of alternative terms generated in the next step from growing excessively.

Alternative terms are generated by replacing and combining the selected synonyms of each word “W” of the search term “T”. For example, for the search term “cutaneous excision”, the method first obtains the synonyms “topical” and “skin” for “cutaneous”, and “resection”, “removal” and “ostectomy” for “excision”. Then, it substitutes these synonyms into the search term, generating the following alternative terms: AlternativeTerms(“cutaneous excision”) = { cutaneous, skin, topical } x { excision, resection, removal, ostectomy } = { “cutaneous excision”, “cutaneous resection”, “cutaneous removal”, “cutaneous ostectomy”, “skin excision”, “skin resection”, “skin removal”, “skin ostectomy”, “topical excision”, “topical resection”, “topical removal”, “topical ostectomy” }.

The values of the two parameters (minimum number of hits and maximum number of synonyms) were chosen by trial and error, that is, we tried different values and selected those (5 and 3, respectively) that gave the best performance for the mapping tool. We observed that if the method uses synonyms with a lower number of hits, it tends to generate less suitable alternative terms, thereby decreasing the precision of the mapping tool. In contrast, if the minimum is set higher than 5 hits, the method is more prone to lose relevant alternative terms, which could reduce the coverage of the tool.
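The selection conditions and the Cartesian product above can be sketched as follows. The hit counts and the extra candidates “dermal” and “extirpation” are invented for illustration only, and the function names are assumptions; the ranked lists are assumed to be sorted by decreasing number of hits.

```python
from itertools import product

MIN_HITS = 5      # condition 1: minimum number of hits
MAX_SYNONYMS = 3  # condition 2: keep only the three best synonyms

def select_synonyms(ranked):
    """Keep at most MAX_SYNONYMS synonyms that reach MIN_HITS hits."""
    kept = [syn for syn, hits in ranked if hits >= MIN_HITS]
    return kept[:MAX_SYNONYMS]

def alternative_terms(term, ranked_synonyms):
    """Combine each word (or one of its selected synonyms) into alternative terms."""
    options = []
    for word in term.split():
        options.append([word] + select_synonyms(ranked_synonyms.get(word, [])))
    return [" ".join(combo) for combo in product(*options)]

# Invented hit counts, sorted by decreasing hits (assumption).
ranked = {
    "cutaneous": [("skin", 40), ("topical", 12), ("dermal", 2)],
    "excision": [("resection", 90), ("removal", 55), ("ostectomy", 7), ("extirpation", 6)],
}
terms = alternative_terms("cutaneous excision", ranked)
print(len(terms))  # 12
print(terms[:4])
```

With these inputs, “dermal” is dropped by condition 1 and “extirpation” by condition 2, which leaves the same 3 x 4 = 12 alternative terms as in the example above.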

3.3 Name-based techniques

Name-based techniques are applied to find approximate string correspondences between a search term (including its expanded terms) and all SNOMED-CT descriptions. Specifically, the Levenshtein distance and Cosine similarity implementations of the SimMetrics library were used. Levenshtein distance is a type of edit distance for measuring the similarity between two strings, whereas Cosine similarity bases similarity on the number of words common to both strings in relation to the number of words in each string. These techniques yield a score in the range [0, 1] for each SNOMED-CT description. Our system ranks the descriptions by score. For example, table 1 shows the top 2 descriptions for the search term ‘excision biopsy of skin’.
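The two measures can be illustrated with a small self-contained sketch. It does not call SimMetrics: the normalization of the edit distance by the longer string length and the word-count cosine used here are common definitions assumed for illustration, so the scores may differ from those produced by the library (and from those in table 1, which are computed on preprocessed strings).

```python
import math
from collections import Counter

def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def levenshtein_similarity(a, b):
    """Edit distance rescaled to [0, 1] by the longer string length (assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def cosine_similarity(a, b):
    """Cosine of the angle between the word-count vectors of two strings."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

term = "excision biopsy of skin"
for description in ("incision biopsy of skin", "excision biopsy of skin lesion"):
    print(description,
          round(levenshtein_similarity(term, description), 3),
          round(cosine_similarity(term, description), 3))
```

In this sketch the character-level measure favours ‘incision biopsy of skin’ while the word-level measure favours ‘excision biopsy of skin lesion’, which illustrates why a purely lexical ranking can be ambiguous.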


3.4 Structure-based techniques

Name-based techniques produce a ranking of concepts sorted by lexical similarity, but these techniques sometimes return false positives. For example, if we are looking for the term ‘excision biopsy of skin’ in SNOMED-CT, Levenshtein distance suggests the concept ‘incision biopsy of skin’ (conceptId: 282014007), as only two edits are needed to transform the search term into the returned concept. However, this concept may not be the most appropriate semantically. Therefore, we have designed a strategy to prevent these false positives. Our strategy analyzes the best lexical candidate concepts by exploiting their inter-concept logical relationships defined in the SNOMED-CT terminology. These relationships provide additional context or information about the concepts (see Fig. 4). For example, the concept ‘total pneumonectomy’ (conceptId: 49795001) is related to the concept ‘excision - action’ (conceptId: 129304002) through the logical relationship ‘method’ and to the concept ‘lung’ (conceptId: 181216001) through the relationship ‘procedure site - Direct’. These relationships would be useful to suggest the concept ‘total pneumonectomy’ as a candidate mapping for the search term ‘excision of lung’, despite the low lexical similarity. In addition, the relationships can be useful to identify the most relevant parts or substrings of a concept. For example, the concept ‘excision biopsy of skin lesion’ is related to the concept ‘excision biopsy’ (conceptId: 277261002) through the logical relationship ‘method’ and to the concept ‘skin’ (conceptId: 39937001) through the relationship ‘procedure site - Direct’.

Our strategy to take advantage of the SNOMED-CT relationships follows several steps. First, it extracts the best candidates obtained by the name-based techniques (we typically use the top 10 candidates). Next, the SNOMED-CT logical relationships of these concepts are traversed to extract their associated concepts. Finally, the descriptions of these associated concepts are compared to the search term in order to calculate new string similarity metrics. Our approach can use all relationships or it can be configured to use only specific relationships. Considering that our dataset is a glossary of terms for procedures in pathology (see section 4.1 Dataset), our approach was configured (for the experiments) to use two relevant logical relationships for concepts of the SNOMED-CT hierarchy ‘procedure’. These relationships are method and procedure site - Direct, which have already been described in the background (section 2.1). By using these relationships, our tool is able to extract two SNOMED-CT concepts (the action and the body site affected) for each lexical candidate concept. For the experiments, we defined two similarity metrics for checking whether the search term contains the description of the extracted concepts:

● Action similarity is a metric which is equal to 1 if all the words of the concept obtained via ‘method’ are present in the search term. Otherwise, the metric is equal to 0.
● Site similarity is a metric which is equal to 1 if all the words of the concept obtained via ‘procedure site’ are present in the search term. Otherwise, the metric is equal to 0.

As an example of this process, consider the search term ‘excision biopsy of skin’. By using name-based techniques, our tool takes the two best candidate concepts (see table 1). Next, it extracts two concepts for each candidate by using the relationships method and procedure site (see Fig. 5). Then, it checks whether the search term ‘excision biopsy of skin’ contains the description of the extracted concepts, and it assigns new scores for the metrics action and site similarity (see table 2). With these metrics, the candidate ‘excision biopsy of skin lesion’ obtains better scores than ‘incision biopsy of skin’. This helps us to choose ‘excision biopsy of skin lesion’ as the final candidate in a case where the name-based techniques do not agree on the best one.
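The two binary metrics can be sketched as a simple word-containment check. The related concepts of ‘excision biopsy of skin lesion’ are those given in the text; the related concepts listed for ‘incision biopsy of skin’ are assumed for illustration (Fig. 5 is not reproduced here), and the function names are hypothetical.

```python
def contains_all_words(search_term, description):
    """Return 1 if every word of `description` occurs in `search_term`, otherwise 0."""
    search_words = set(search_term.lower().split())
    return int(all(word in search_words for word in description.lower().split()))

def structure_scores(search_term, related):
    """Compute action and site similarity from the concepts reached via 'method' and 'procedure site'."""
    return {
        "action": contains_all_words(search_term, related["method"]),
        "site": contains_all_words(search_term, related["site"]),
    }

candidates = {
    # Related concepts as described in the text.
    "excision biopsy of skin lesion": {"method": "excision biopsy", "site": "skin"},
    # Related concepts assumed for illustration only.
    "incision biopsy of skin": {"method": "incision biopsy", "site": "skin"},
}

for name, related in candidates.items():
    print(name, structure_scores("excision biopsy of skin", related))
```

Under these assumptions the first candidate scores 1 on both metrics, while the second scores 0 on action similarity because ‘incision’ does not occur in the search term.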


3.5 Disambiguation

Two strategies for disambiguating the set of candidates were followed:

● Heuristic rules. Based on practical trial and error, we created some rules to select the final candidates.
● Machine learning (Support Vector Machines). Using an available gold standard (see section 4.1 Dataset) and support vector machines (SVM), the disambiguation step is approached as a binary classification problem. The Kernlab package (footnote 6) of the R language was used for this purpose.

3.6 Differences between SAMT and SSMT

The two profiles created within the mapping tool exploit all the techniques described in sections 3.1-3.5 to find mappings to SNOMED-CT. The main difference between them lies in the settings of the disambiguation technique. SAMT only selects candidates that are very accurate, avoiding false positives. SSMT, in contrast, is intended to achieve high coverage rather than high accuracy, so its selection techniques were adjusted to select the top 5 candidates for each term. These settings are briefly described below.

3.6.1 Heuristic rules for SAMT

● The candidates with some structure-based metric (that is, action and site similarity) equal to 0 are discarded.
● The average score of the name-based metrics (that is, Levenshtein distance and Cosine similarity) is obtained. The candidates with an average lower than 0.95 are also discarded.
● Only the concept with the highest average is selected.
● If two candidates have the same average score, the candidate that did not need query expansion is preferred.
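The SAMT rules can be sketched as a short filter-and-rank function. The dictionary layout of a candidate (keys `levenshtein`, `cosine`, `action`, `site`, `expanded`) and the example scores are assumptions made for illustration.

```python
def select_samt(candidates, threshold=0.95):
    """Return the single accepted candidate, or None when no candidate passes the rules."""
    eligible = []
    for cand in candidates:
        # Rule 1: discard candidates with a structure-based metric equal to 0.
        if cand["action"] == 0 or cand["site"] == 0:
            continue
        # Rule 2: discard candidates whose average name-based score is below the threshold.
        average = (cand["levenshtein"] + cand["cosine"]) / 2
        if average < threshold:
            continue
        eligible.append((average, cand))
    if not eligible:
        return None
    # Rules 3 and 4: highest average first; on ties prefer candidates found
    # without query expansion (False sorts before True).
    eligible.sort(key=lambda item: (-item[0], item[1]["expanded"]))
    return eligible[0][1]

# Invented scores for illustration.
candidates = [
    {"name": "A", "levenshtein": 1.0, "cosine": 1.0, "action": 1, "site": 1, "expanded": True},
    {"name": "B", "levenshtein": 1.0, "cosine": 1.0, "action": 1, "site": 1, "expanded": False},
    {"name": "C", "levenshtein": 0.99, "cosine": 0.99, "action": 0, "site": 1, "expanded": False},
    {"name": "D", "levenshtein": 0.80, "cosine": 0.80, "action": 1, "site": 1, "expanded": False},
]
print(select_samt(candidates)["name"])  # B
```

Returning `None` when nothing passes reflects the conservative intent of the automatic profile: no mapping is proposed rather than a doubtful one.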

3.6.2 Heuristic rules for SSMT

● The average score of the name-based metrics is obtained.
● The candidates with some structure-based metric equal to 0 have their average score decreased by 10%.
● The 5 candidates with the highest average score are selected.
● If two candidates have the same average score, the candidate that did not need query expansion is preferred.
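The SSMT rules differ from the SAMT ones in that candidates are penalized rather than discarded. A minimal sketch follows, again assuming a hypothetical candidate layout and invented scores.

```python
def rank_ssmt(candidates, top_n=5, penalty=0.10):
    """Return the names of the top_n candidates after applying the structural penalty."""
    scored = []
    for cand in candidates:
        average = (cand["levenshtein"] + cand["cosine"]) / 2
        # A structure-based metric equal to 0 lowers the average by 10% instead of discarding.
        if cand["action"] == 0 or cand["site"] == 0:
            average *= 1 - penalty
        scored.append((average, cand))
    # Highest score first; ties are broken in favour of candidates found without query expansion.
    scored.sort(key=lambda item: (-item[0], item[1]["expanded"]))
    return [cand["name"] for _, cand in scored[:top_n]]

# Invented scores for illustration.
candidates = [
    {"name": "d", "levenshtein": 0.8, "cosine": 0.8, "action": 1, "site": 1, "expanded": True},
    {"name": "c", "levenshtein": 0.8, "cosine": 0.8, "action": 1, "site": 1, "expanded": False},
    {"name": "b", "levenshtein": 1.0, "cosine": 1.0, "action": 0, "site": 1, "expanded": False},
    {"name": "a", "levenshtein": 0.98, "cosine": 0.96, "action": 1, "site": 1, "expanded": False},
    {"name": "e", "levenshtein": 0.6, "cosine": 0.6, "action": 1, "site": 1, "expanded": False},
    {"name": "f", "levenshtein": 0.5, "cosine": 0.5, "action": 1, "site": 1, "expanded": False},
]
print(rank_ssmt(candidates))  # ['a', 'b', 'c', 'd', 'e']
```

Candidate `b` has perfect lexical scores but drops below `a` because of the penalty, and `c` precedes `d` because it did not need query expansion.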

3.6.3 SVM for SAMT and SSMT

We used SVM following these steps:

● Create a table by using the similarity metrics between each term of the dataset and each SNOMED-CT concept (see example in table 3). The last column establishes whether experts assigned a mapping (1 or 0).
● Randomly generate a training set and a testing set.
● Train a model using SVM on the training set.
● Classify the testing set using the trained SVM model.
● Evaluate the testing set against the expert mappings.

6 http://cran.r-project.org/web/packages/kernlab/index.html

The training of the model for SAMT was adjusted to significantly penalize false positives, whereas in the training for SSMT, the output decision function was defined to return the 5 candidates with the highest confidence value in order to increase true positives.

4 - Evaluation

The main goal of this evaluation was to test our initial hypothesis, namely that combining name-based techniques with linguistic and structural techniques considerably improves the search in SNOMED-CT compared to name-based search alone. With this objective in mind, the browsers supplied by the NLM and ITServer [10, 11] offered us a benchmark to evaluate the improvement in performance. These two browsers are amongst the most advanced search tools for SNOMED-CT; they support both Spanish and English terms, as they incorporate the International and Spanish Edition of SNOMED-CT; and they provide web services with facilities for automating the search process.

4.1 Dataset

Recently, the Spanish Society of Anatomic Pathology (SEAP) (footnote 7) has published a glossary of Spanish terms for procedures in pathology, including the mappings to SNOMED-CT concepts manually assigned by experts [22]. This dataset can be viewed as a gold standard for evaluating the effectiveness of our tool. For the evaluation, all the terms related to biopsy procedures that were manually mapped to a single SNOMED-CT concept were selected. The nearly 300 selected terms can be viewed in Appendix 1. Table 4 shows some terms and their manual mappings.

4.2 Experiments

Two experiments were designed to evaluate SAMT, SSMT and other search services. The first experiment evaluates the effectiveness of several services in obtaining final mappings for the terms of the dataset, whereas the second one evaluates the performance of the services in suggesting a set of candidate mappings. The benchmark of the first experiment includes:

● The search service Exact Match provided by the NLM browser (see section 2. Background). The search was limited to concepts of the SNOMED-CT hierarchy ‘procedure’.
● The search service provided by ITServer, which returns a ranking of SNOMED-CT concepts that match a given term. The search was also limited to the hierarchy ‘procedure’, and only the concepts with a score higher than 0.85 were selected. This threshold was chosen by trial and error: we tried different values and selected the threshold with the best performance.
● Four settings of SAMT, evaluated in order to analyze the contribution of each technique developed:
○ Setting 1: SAMT using preprocessing (section 3.1), name-based techniques (3.3) and heuristic rules for SAMT (3.6).
○ Setting 2: Setting 1 + query expansion (3.2).
○ Setting 3: Setting 2 + structural techniques (3.4).
○ Setting 4: Same as Setting 3 but using Support Vector Machines (3.6) instead of heuristic rules.

The standard information retrieval evaluation measures (precision, recall and F-measure) were calculated against the expert mappings provided by SEAP. The evaluation measures are defined as follows:

7 https://www.seap.es/

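\[ \mathrm{Precision} = \frac{\#\,\text{correct found mappings}}{\#\,\text{all found mappings}} \]

\[ \mathrm{Recall} = \frac{\#\,\text{correct found mappings}}{\#\,\text{all possible mappings}} \]

\[ F\text{-measure} = (1 + \beta^2) \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{(\beta^2 \cdot \mathrm{precision}) + \mathrm{recall}} \]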
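As a small worked example of the weighted F-measure, the sketch below evaluates it for the precision (88.0%) and recall (51.4%) reported in the abstract, using the value β = 0.7 adopted for this evaluation. The function name is an assumption, and the result is only an illustration of the formula, not a figure quoted from the results tables.

```python
def f_measure(precision, recall, beta=0.7):
    """Weighted F-measure; beta < 1 puts more emphasis on precision than on recall."""
    beta_sq = beta ** 2
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta_sq) * precision * recall / denominator

print(round(f_measure(0.88, 0.514), 3))  # 0.713
```

With β = 1 the same function reduces to the usual harmonic mean of precision and recall.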

The F-measure score can be interpreted as a weighted average of precision and recall. We set β = 0.7 in order to put more emphasis on precision than on recall, because we consider precision to be more important for an automatic mapping task.

For the second experiment, we limited the number of candidate mappings to 5 concepts. The benchmark of this experiment included:

● The search services Exact Match and Word Match provided by the NLM browser. The output of both search types was aggregated. The search was limited to the hierarchy ‘procedure’ and it was set up to return five concepts, prioritizing the output of Exact Match.
● The search service provided by ITServer. The search was limited to the hierarchy ‘procedure’ and the top 5 concepts of the ranking were selected.
● Four settings of SSMT:
○ Setting 1: SSMT using preprocessing (section 3.1), name-based techniques (3.3) and heuristic rules for SSMT (3.6).
○ Setting 2: Setting 1 + query expansion (3.2).
○ Setting 3: Setting 2 + structural techniques (3.4).
○ Setting 4: Same as Setting 3 but using Support Vector Machines (3.6) instead of heuristic rules.

For this experiment we evaluated only the recall. We consider that recall gains importance in this case, since providing the correct mapping within a small set of candidates would save expert coders much time. Precision is less important in this experiment because all approaches are configured to return up to 5 candidate concepts.

To perform both experiments, we randomly split the dataset into 2 sets: 200 terms were used by Setting 4 for learning and 100 terms were used for testing all approaches. We repeated this process 5 times for each experiment to obtain a more reliable evaluation.

5- Results

Table 5 shows the average performance over the 5 runs in terms of recall, precision and F-measure for SAMT and the NLM and ITServer browsers in the first experiment. The mappings obtained during this experiment can be seen in Appendix 2. All approaches (NLM, ITServer and SAMT) achieved high precision, in the range of 84.7% to 91.3%. The recall obtained by the NLM browser was 32.0%. The ITServer browser obtained 40% recall, a 25% improvement in recall over NLM. Setting 3 of SAMT achieved 51.4% recall, a 28% improvement over ITServer. The F-measure achieved by NLM was 54.7%. The ITServer service and setting 3 of SAMT achieved 16% and 30% improvements over NLM, respectively.


A reduction of 6% in precision was obtained using setting 2 of SAMT (which uses query expansion) compared with setting 1. However, an important increase of 20% in recall was achieved using setting 2 over setting 1. The F-measure was also improved by more than 6% using setting 2. Increases of 2.5% in precision and 4% in recall were obtained using setting 3 (which includes the structural techniques) over setting 2. In addition, settings 3 and 4 showed that heuristic rules were able to improve the F-measure over SVM by 8%. In summary, the best setting for SAMT is setting 3, which includes preprocessing, name-based techniques, query expansion, structural techniques, and heuristic rules.

Table 6 shows the average recall over the 5 runs for NLM, ITServer and SSMT in the semi-automatic mapping task (second experiment). The candidate mappings obtained during this experiment can be seen in Appendix 3. The recall achieved by the NLM browser was 41.6%. The ITServer browser obtained 54% recall, a 30% improvement over NLM. Setting 3 of SSMT achieved 71% recall, a 31% improvement over ITServer. An important increase of 21% in recall was achieved using setting 2 of SSMT (which uses query expansion) over setting 1. A slight increase of 3% in recall was obtained using setting 3 (which includes the structural techniques) over setting 2. Again, the results of settings 3 and 4 showed that heuristic rules work better than SVM to select the top 5 candidates. The best setting for SSMT is also setting 3.
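The improvements between tools quoted in this section appear to be relative improvements rather than differences in percentage points. The short check below, a sketch with an assumed helper name, recomputes them from the recall values given in the text.

```python
def relative_improvement(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100

# Recall values (in %) taken from the text.
print(relative_improvement(40.0, 32.0))            # first experiment, ITServer over NLM: 25.0
print(round(relative_improvement(51.4, 40.0), 1))  # first experiment, SAMT over ITServer
print(round(relative_improvement(54.0, 41.6), 1))  # second experiment, ITServer over NLM
print(round(relative_improvement(71.0, 54.0), 1))  # second experiment, SSMT over ITServer
```

The computed values are consistent with the rounded figures reported above (25%, 28%, 30% and 31%).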

6. Discussion 6.1. Comparison of SAMT with other search services In this section, we compare the best setting of SAMT (setting 3) with the NLM and ITServer search services. Table 7 summarizes the techniques used by them. SAMT incorporated three additional techniques: query expansion, structural and disambiguation techniques. In terms of performance in the automatic mapping task, all approaches (NLM, ITServer and SAMT) achieved high precision, which is a key point for an automatic mapping task (see table 5). The NLM’s precision is slightly lower than ITServer’s and SAMT’s, which is surprising as it uses exact matching over a huge set of synonyms included in the UMLS Metathesaurus, whereas both ITServer and SAMT use approximate string matching rather than exact match. Results show that: (1) query expansion is a key technique to improve the SAMT´s recall; (2) although SAMT uses query expansion, it is able to maintain high precision; (3) SAMT’s disambiguate techniques to select the final mappings are working well. Besides SNOMED-CT browsers, there are other works related to our mapping tool, such as [2, 23, 24, 25], although these are more focused on mapping terms from structured clinical models. Sheng Yu's method [23] has applied information retrieval techniques, 8 such as Lucene , to bind clinical terms from archetypes to SNOMED-CT concepts. The evaluation of the method includes an interesting graph in which the recall is shown in relation to the number of candidate concepts suggested by the method [26]. The method got 40%,50% and 55% recall by suggesting 1,5 and 10 candidates per term, respectively. Our mapping tool clearly improved the results of the Sheng Yu's method.

Footnote 8: http://lucene.apache.org/core/index.html

6.2 Query expansion

Our synonym discovery system [footnote 9] infers missing synonyms of words, using as a corpus more than one million concept descriptions of SNOMED-CT. We found that it was able to identify (1) words with identical or very similar meaning in all contexts, such as “total” and “complete” (in Spanish: “total” and “completo”); (2) words with similar meaning in the medical context, such as “limb” and “member” (in Spanish: “extremidad” and “miembro”); and (3) nouns and adjectives pointing to the same concept, such as “stomach” and “gastric” (in Spanish: “estómago” and “gástrico”). Table 8 shows a sample of the synonyms detected by our system. Our mapping tool then expands the search terms by replacing one or more words with the inferred synonyms. These alternative terms contributed significantly to the higher recall of our mapping tool compared to the other evaluated tools. Appendix 4 shows some examples in which the alternative terms were decisive for discovering the correct mappings in SNOMED-CT.

WordNet has been commonly used in information retrieval to expand query terms with synonyms and other linguistic relations [15, 16]. Although WordNet contains a sufficiently wide range of common words, it does not cover special domain vocabularies. We found two potential problems in using WordNet to extract synonyms for the medical domain. Firstly, some medical domain words are not covered by WordNet. Secondly, some words have many senses and synonyms, but only some of them are suitable for the medical context. For example, the word “limb” has 6 senses in WordNet, and only one is suitable for the medical context. Unlike the approaches that use WordNet as a synonym source, our approach automatically discovers new synonyms from SNOMED-CT. As a result, our method finds synonyms that are relevant to the medical context (more specifically, synonyms that are frequent in SNOMED-CT), rather than many generic synonyms.
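The expansion step described above (replacing one or more words of the search term with inferred synonyms) can be illustrated with a minimal sketch. The synonym table below is a hand-written, hypothetical stand-in for the output of the synonym discovery system, using English word pairs mentioned in this paper; the function name, the directionality of the table and the cap on the number of alternatives are assumptions made for illustration, not a description of the actual implementation.

```python
from itertools import product

# Hypothetical word-level synonym table (original word -> inferred synonyms).
SYNONYMS = {
    "total": ["complete"],
    "limb": ["member"],
    "extirpation": ["removal"],
    "resection": ["excision"],
}


def expand_term(term, synonyms, max_alternatives=50):
    """Return alternative terms obtained by replacing one or more words
    of `term` with their synonyms (the original term is excluded)."""
    words = term.lower().split()
    original = " ".join(words)
    # For each position: the word itself followed by its known synonyms.
    options = [[word] + synonyms.get(word, []) for word in words]
    alternatives = []
    for combination in product(*options):
        candidate = " ".join(combination)
        if candidate != original:
            alternatives.append(candidate)
        if len(alternatives) >= max_alternatives:
            break
    return alternatives


alternatives = expand_term("total resection of limb", SYNONYMS)
for alternative in alternatives:
    print(alternative)
```

Each alternative term would then be submitted to the name-based search in the same way as the original term. The number of alternatives grows multiplicatively with the number of replaceable words, which is why the sketch caps it.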

6.3 Error analysis

We carried out a thorough analysis of the errors by checking the SAMT results that did not correspond to the expert mappings. We found that many of the mismatches were actually due to very similar SNOMED-CT concepts. We report three types (or sets) of mismatches. Table 9 shows an example of each type. The columns of Table 9 show (1) the search term, (2) the SNOMED-CT concept returned by SAMT, (3) the expert mapping and (4) a comment about the mismatch.

The first set of mismatches includes SAMT mappings that could be considered correct, and even more accurate than the expert mappings. For example, the search term “Extirpation of major salivary gland” was mapped by the experts to the concept “Excision of salivary gland” and by SAMT to “Excision of major salivary gland” (see Table 9). The automatic mapping “Excision of major salivary gland” is a child concept of “Excision of salivary gland” in the SNOMED-CT hierarchy, and it could be the most accurate mapping. Therefore, these SAMT mappings should be reviewed again by expert encoders, because they could improve the quality of the initial mappings.

In the second set, the expert mappings are more specific than both the search terms and the SAMT mappings. In these cases, the expert encoders assumed that the biopsy procedures were performed on a damaged or injured body part. SAMT was not able to make this assumption, and thus it returned a more general mapping. For example, the search term “Incisional biopsy of breast” was mapped by the experts to the concept “Incisional biopsy of breast mass” and by SAMT to “Incisional biopsy of breast”.

Footnote 9: The system is available at http://snomed-synonym-finder.appspot.com/

The third set of mismatches contains SAMT mappings that are less accurate than the expert mappings. These incorrect mappings arose because the correct concepts and the search terms have different names, and query expansion did not produce a matching alternative term in these cases (see the last row of Table 9). Appendix 5 includes more examples of mismatches between SAMT and the experts.
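The error analysis reported here was done manually. As an illustration only, a first pass over such mismatches could be automated by comparing the two concepts through the is-a hierarchy, as in the following sketch. The concept identifiers are those listed in Table 9; the tiny hierarchy is a toy, and only its first edge is stated in the text (the second edge is an assumption made for the example). The function names and category labels are hypothetical.

```python
# Toy is-a edges (child -> list of parents).
PARENTS = {
    # Excision of major salivary gland -> Excision of salivary gland (stated in the text)
    "234937001": ["71735005"],
    # ASSUMED for illustration: Incisional biopsy of breast mass -> Incisional biopsy of breast
    "28768007": ["237378001"],
}


def ancestors(concept_id, parents):
    """Collect all ancestors of a concept by following is-a edges upwards."""
    seen = set()
    stack = list(parents.get(concept_id, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def classify_mismatch(tool_id, expert_id, parents):
    """Relate the concept returned by the tool to the expert mapping."""
    if tool_id == expert_id:
        return "match"
    if expert_id in ancestors(tool_id, parents):
        return "tool more specific"   # candidate for the first set
    if tool_id in ancestors(expert_id, parents):
        return "tool more general"    # candidate for the second set
    return "unrelated"                # candidate for the third set


print(classify_mismatch("234937001", "71735005", PARENTS))
print(classify_mismatch("237378001", "28768007", PARENTS))
print(classify_mismatch("231559005", "446938009", PARENTS))
```

Such a pass would only suggest a category; deciding whether a more specific automatic mapping is actually preferable still requires review by expert encoders.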

6.4 Applications and scope of the mapping tool

In this work, we have developed a tool to discover text-to-concept mappings in the SNOMED-CT terminology. The mapping tool includes two matching profiles: one fully automatic (SAMT) and one semi-automatic (SSMT). Both profiles share the same goal, namely to facilitate the process of finding mappings between clinical terms and SNOMED-CT concepts. SAMT has been adjusted to return a single relevant concept, so it is better suited to a mapping process in which no experts are available and/or to tasks in which a quick result is required (e.g. real-time applications). SSMT, on the other hand, has been adjusted to suggest 5 concepts for each search term, so it is more appropriate for mapping tasks in which experts are available to review the suggestions and select the most accurate mapping. In the experiments, SAMT achieved moderate recall (51%) and high precision (88%), whereas SSMT achieved substantial coverage (71%) by suggesting 5 candidate concepts. Note that some studies have measured the performance of technicians and physicians in coding tasks with SNOMED-CT and found that both coding performance and inter-coder agreement were imperfect [7, 9]. Young et al. found that users with an intermediate level of coding knowledge correctly mapped only 62% of the terms to SNOMED-CT concepts [7].

Although the tool was evaluated with a glossary of Spanish terms for procedures in pathology, it was developed to search for any term in SNOMED-CT. In fact, it can be used to map both Spanish and English terms, as the descriptions of both the International and the Spanish Edition of SNOMED-CT were normalized and indexed during the initialization phase (see Section 3.1). Note that all the techniques (including the name-based techniques and query expansion) work identically in both languages. The mapping tool can also be configured to search for concepts in certain hierarchies (such as procedures, clinical findings or observable entities) or in the entire SNOMED-CT.
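The difference between the two profiles, and the optional restriction to certain hierarchies, can be sketched as a final selection step over a list of scored candidates. This is a simplified illustration: a single hypothetical score stands in for the heuristic rules or SVM that the tool actually uses to select mappings, and the class, the function, the score values and the third candidate are all invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    concept_id: str
    description: str
    hierarchy: str
    score: float  # hypothetical combined score, higher is better


def select(candidates, k, hierarchies=None):
    """Keep candidates from the allowed hierarchies and return the k best."""
    pool = [c for c in candidates if hierarchies is None or c.hierarchy in hierarchies]
    return sorted(pool, key=lambda c: c.score, reverse=True)[:k]


candidates = [
    Candidate("282014007", "incision biopsy of skin", "procedure", 0.62),
    Candidate("312968005", "excision biopsy of skin lesion", "procedure", 0.88),
    Candidate("hypothetical-1", "placeholder finding", "clinical finding", 0.40),
]

automatic = select(candidates, k=1)                                # SAMT-like: one concept
suggestions = select(candidates, k=5, hierarchies={"procedure"})   # SSMT-like: up to 5

print([c.concept_id for c in automatic])
print([c.concept_id for c in suggestions])
```

In this reading, the two profiles differ mainly in how many candidates are returned, which is consistent with the higher coverage reported for SSMT.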

6.4.1 How applicable would the mapping tool be to other terminologies?

We believe that a framework combining different techniques, as our tool does, is key to obtaining good results in mapping processes. In this sense, the approach used by our tool would be suitable for any terminology. Note, however, that the proposed structural technique is mainly oriented to clinical terminologies with rich axiomatic content, especially those with abundant semantic relationships between their concepts. The proposed query expansion technique, in contrast, could be completely reused to generate alternative terms in other contexts. We consider that the SNOMED-CT descriptions form a great corpus for finding synonyms. This terminology provides good coverage of synonyms in different clinical areas, as it is considered to be the most comprehensive clinical terminology in the world. Consequently, we presume that the alternative terms generated by our query expansion technique (using the SNOMED-CT corpus) can contribute to improving the search in other terminologies.


6.5 Future work

The structural techniques used in our mapping tool must be configured to use specific logical relationships. However, the selection of appropriate relationships depends on the type of search term or entity, and it requires considerable knowledge of the internal structure of SNOMED-CT. In this work, we manually reviewed the SNOMED-CT concepts related to biopsies in order to select the relevant logical relationships in the axiomatic definitions of these concepts. Recently, some papers have proposed automatic approaches to analyze the axiomatic content of SNOMED-CT [27, 28]. For example, the RIO framework, developed by Mikroyannidi et al. [27], detects patterns and repetitive structures in the axioms of similar SNOMED-CT entities. Our mapping tool would benefit from a framework like RIO, because it could provide a quick abstraction of a set of similar concepts (e.g. “biopsy” concepts), which in turn would facilitate and even improve the selection of the axiomatic content (i.e. logical relationships) used by the structural techniques.

Our mapping tool does not exploit post-coordination strategies for SNOMED-CT concepts; our techniques focus on finding a single concept able to fully represent the search term. In future work, we will explore strategies to detect terms that cannot be represented by a single concept, and we will develop techniques to map these terms to post-coordinated expressions of several concepts. In addition, it would be interesting to analyse the recall of SSMT in relation to the number of candidate concepts suggested by the tool, in a similar way to the evaluation of Yu et al. [26].

Moreover, our mapping tool is optimized to return a set of possible matches in SNOMED-CT given an input term, that is, a short text such as “excision biopsy of skin”. In the future, the tool could be adapted to support automated annotation of long input texts (such as abstracts of articles or clinical narratives from a patient record) with SNOMED-CT concepts.

Our query expansion system generates a set of alternative terms, but it does not apply any strategy to estimate the quality or reliability of each of them. In the future, two parameters may be used to estimate it: the number of words substituted in the alternative term and an estimate of the quality of each substituted word. The quality of each word could be measured using its number of hits (the number of times that it was found as a synonym throughout SNOMED-CT). Moreover, our word-level synonym discovery system only looks for one-word to one-word synonyms (e.g. “resection” and “excision”). In the future, one-word to two-word synonyms (e.g. “colon” and “large intestine”) could also be considered. These strategies could improve the generation of alternative terms and thus increase the recall of the mapping tool.

Furthermore, we will try to integrate the mapping tool into a SNOMED-CT browser, which would give the browser better search options. In addition, our mapping tool could take advantage of some features frequently included in browsers, such as hierarchy visualization and navigation.
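One possible way to combine the two parameters proposed above for estimating the reliability of an alternative term (number of substituted words and number of hits of each substituted word) is sketched below. This is not part of the current tool: the hit counts, the smoothing constant and the way the factors are multiplied are all assumptions chosen for illustration.

```python
# Hypothetical hit counts: how often each word pair was found as synonyms.
HITS = {
    ("resection", "excision"): 120,
    ("limb", "member"): 15,
}


def word_reliability(original_word, synonym, hits, smoothing=10.0):
    """Map a hit count to a value in [0, 1); unseen pairs get 0."""
    count = hits.get((original_word, synonym), 0)
    return count / (count + smoothing)


def alternative_quality(original, alternative, hits):
    """Return (number of substituted words, quality estimate) for an
    alternative term produced by one-word to one-word substitutions."""
    original_words = original.lower().split()
    alternative_words = alternative.lower().split()
    if len(original_words) != len(alternative_words):
        raise ValueError("only one-word to one-word substitutions are supported")
    substituted = 0
    quality = 1.0
    for old, new in zip(original_words, alternative_words):
        if old != new:
            substituted += 1
            # Each substitution multiplies in a factor below 1, so terms with
            # more substitutions or rarer synonym pairs receive a lower quality.
            quality *= word_reliability(old, new, hits)
    return substituted, quality


one_swap = alternative_quality("resection of limb", "excision of limb", HITS)
two_swaps = alternative_quality("resection of limb", "excision of member", HITS)
print(one_swap, two_swaps)
```

Alternative terms could then be ranked or filtered by this estimate before being searched, which might limit the loss of precision that query expansion can introduce.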

7. Conclusions

Semantic interoperability should be addressed through the application of shared clinical terminologies. The size and complexity of current terminologies, such as SNOMED-CT, demand the development of advanced computer tools to assist users in accessing, searching and navigating them. In this paper, we developed a tool that uses name-based techniques, query expansion and SNOMED-CT semantic relationships to discover mappings between clinical terms and SNOMED-CT concepts. We found that SNOMED-CT is a great source of knowledge for inferring unknown and unrecorded synonyms from those already existing in SNOMED-CT. The automated query expansion system made it possible to discover mappings between clinical terms and SNOMED-CT concepts that would not otherwise be found using simple string matching. The experiments showed that including query expansion in our tool improved the recall of the automatic mapping between terms and concepts by 20%.

Acknowledgements

The work presented in this paper was carried out within the national project OntoNeuroPhen (FIS2012-PI12/00373), funded by the Instituto de Salud Carlos III.

References

[1] Stroetmann VN, Kalra D, Lewalle P, Rector A, Rodrigues JM, Stroetmann KA. Semantic interoperability for better health and safer healthcare. [accessed July 2014].
[2] Qamar R. Semantic mapping of clinical model data to biomedical terminologies to facilitate interoperability. PhD thesis, University of Manchester; 2008.
[3] SNOMED-CT. Systematized nomenclature of medicine-clinical terms. [accessed July 2014].
[4] Lee DH, De Keizer NF, Lau FY, Cornet R. Literature Review of SNOMED-CT Use. J Am Med Inform Assoc. 2014;21:e11-e19.
[5] De Lusignan S, Chan T, Jones S. Large complex terminologies: more coding choice, but harder to find data - reflections on introduction of SNOMED-CT (Systematized Nomenclature of Medicine - Clinical Terms) as an NHS standard. Inform Prim Care. 2011;19(1):3-5.
[6] Ruch P, Gobeill J, Lovis C, Geissbühler A. Automatic medical encoding with SNOMED categories. BMC Medical Informatics and Decision Making. 2008;8(Suppl 1):S6.
[7] Young S, Hoi H, Hwa K, Sun H, Lee J, Kwan B. Comparison of Knowledge Levels Required for SNOMED-CT Coding of Diagnosis and Operation Names in Clinical Records. Healthcare Informatics Research. 2012;18(3).
[8] Rogers J, Bodenreider O. SNOMED-CT: browsing the browsers. Proceedings of the 3rd International Conference on Knowledge Representation in Medicine (KR-MED 2008). May 31-Jun 2, 2008.
[9] Chiang MF, et al. Reliability of SNOMED-CT Coding by Three Physicians using Two Terminology Browsers. In: American Medical Informatics Association Annual Symposium. 2006. Washington, D.C.
[10] NLM SNOMED-CT Browser. [accessed July 2014].
[11] ITServer - Online SNOMED-CT browser. [accessed July 2014].
[12] Euzenat J, Shvaiko P. Ontology Matching. Springer-Verlag, Heidelberg, DE; 2007.
[13] Taboada M, Lalín R, Martínez D. An automated approach to mapping external terminologies to the UMLS. IEEE Trans Biomed Eng. 2009;56:1598-605.
[14] Jonquet C, LePendu P, Falconer S, Coulet A, Noy NF, Musen MA, Shah NH. NCBO Resource Index: Ontology-based search and mining of biomedical resources. Web Semantics: Science, Services and Agents on the World Wide Web. 2011;9(3):316-324.
[15] Varelas G, Voutsakis E, Raftopoulou P, Petrakis EG, Milios EE. Semantic similarity methods in wordNet and their application to information retrieval on the web. In: Proceedings of the 7th annual ACM international workshop on Web information and data management (WIDM '05). ACM, New York, NY, USA; 2005. pp. 10-16.
[16] Huang K, Geller J, Halper M, Perl Y, Xu J. Using WordNet synonym substitution to enhance UMLS source integration. Artificial Intelligence in Medicine. 2009;46(2):97-109.
[17] Huang KC, Geller G, Halper M, Cimino JJ. Piecewise Synonyms for Enhanced UMLS Source Terminology Integration. In: Proc. AMIA Annual Symp. Chicago, IL; 2007. pp. 339-343.
[18] Hole WT, Srinivasan S. Discovering missed synonymy in a large concept-oriented metathesaurus. Proc AMIA Symp. pp. 354-358.
[19] García M, Allones JL, Hernández D, Taboada-Iglesias MJ. Semantic similarity-based alignment between clinical archetypes and SNOMED-CT: An application to observations. International Journal of Medical Informatics. 2012 Aug;81(8):566-78.
[20] Koopman B, Zuccon G, Nguyen A, Vickers D, Butt L, Bruza P. Exploiting SNOMED-CT Concepts & Relationships for Clinical Information Retrieval: Australian e-Health Research Centre and Queensland University of Technology at the TREC 2012 Medical Track. In: Proceedings of the 21st Text REtrieval Conference (TREC 2012).
[21] Van der Kooij J, Goossen WT, Goossen-Baremans AT, Jong-Fintelman M, Van Beek L. Using SNOMED-CT codes for coding information in electronic health records for stroke patients. Stud Health Technol Inform. 2006;124:815-23.
[22] Spanish Society of Anatomic Pathology. Normalized catalogue for Anatomic Pathology Specimens and Procedures (2011). [accessed July 2014].
[23] Yu S, Damon B, Bisbal J. An Investigation of Semantic Links to Archetypes in an External Clinical Terminology through the Construction of Terminological “Shadows”. In: IADIS, International Association for Development of the Information Society, July 26-28; Freiburg, Germany; 2010.
[24] Lezcano L, Sánchez-Alonso S, Sicilia MA. Associating clinical archetypes through UMLS metathesaurus term clusters. Journal of Medical Systems. 2012;36(3):1249-1258.
[25] Khan WA, Khattak AM, Hussain M, Amin MB, Afzal M, Nugent C, Lee S. An Adaptive Semantic based Mediation System for Data Interoperability among Health Information Systems. Journal of Medical Systems. 2014;38(8):1-18.
[26] Yu S, Berry D, Bisbal J. Performance analysis and assessment of a tf-idf based archetype-SNOMED-CT binding algorithm. In: Computer-Based Medical Systems (CBMS), 2011 24th International Symposium on (pp. 1-6). IEEE.
[27] Mikroyannidi E, Stevens R, Iannone L, Rector A. Analysing Syntactic Regularities and Irregularities in SNOMED-CT. J. Biomedical Semantics. 2012;3:8.
[28] Dentler K, Cornet R. Redundant Elements in SNOMED CT Concept Definitions. In: Artificial Intelligence in Medicine, pp. 186-195. Springer Berlin Heidelberg; 2013.

Figure legends

Fig. 1 Example of SNOMED-CT concepts and relationships
Fig. 2 Search tool workflow
Fig. 3 Automatic discovering of synonyms for ‘excision’ in SNOMED-CT
Fig. 4 Example of SNOMED-CT relationships that provide additional context about the concepts ‘total pneumonectomy’ and ‘excision biopsy of skin lesion’
Fig. 5 Analysis of lexical candidates through SNOMED-CT relationships

Tables

Table 1: Two best candidates for ‘excision biopsy of skin’ according to name-based techniques

| Candidate descriptions (id) | Levenshtein distance | Cosine similarity |
|---|---|---|
| incision biopsy of skin (282014007) | 0.90 | 0.58 |
| excision biopsy of skin lesion (312968005) | 0.74 | 0.8 |


Table 2: Name and structure-based metrics of the best candidates for the term ‘excision biopsy of skin’

| Candidate concept (id) | Levenshtein distance | Cosine similarity | Action similarity | Site similarity |
|---|---|---|---|---|
| incision biopsy of skin (282014007) | 0.90 | 0.58 | 0 | 1 |
| excision biopsy of skin lesion (312968005) | 0.74 | 0.8 | 1 | 1 |

Table 3: Similarity metrics between each term of the dataset and SNOMED-CT concept as well as the validity of the mapping assigned by experts

| Term / Concept | Levenshtein distance | Cosine similarity | Action similarity | Site similarity | Class (correct mapping?) |
|---|---|---|---|---|---|
| Term: excision biopsy of skin; Concept: incision biopsy of skin (282014007) | 0.90 | 0.58 | 0 | 1 | 0 (Positive) |
| Term: excision biopsy of skin; Concept: excision biopsy of skin lesion (312968005) | 0.74 | 0.8 | 1 | 1 | 1 (Negative) |
| termN - conceptN | ... | ... | ... | ... | ... |

Table 4: Example of dataset terms and expert mappings

| Term | ID Concept assigned by expert | Spanish SNOMED-CT Description | English SNOMED-CT Description |
|---|---|---|---|
| Biopsia de miocardio | 387828005 | biopsia de miocardio | myocardial biopsy |
| Biopsia de retina | 172573007 | biopsia de lesión retiniana | biopsy of retinal lesion |
| Mastectomía total | 172043006 | mastectomía simple | simple mastectomy |
| Esofaguectomía parcial | 3980006 | resección subtotal del esófago | subtotal resection of esophagus |


Table 5: Results of the automated mapping task (first experiment)

| Approach | Recall (%) | Precision (%) | F-measure (β = 0.7) |
|---|---|---|---|
| NLM browser (Exact Match) | 32.0 | 84.7 | 54.7 |
| ITServer browser | 40.0 | 90.1 | 63.6 |
| SAMT: Setting 1 | 41.2 | 91.2 | 64.9 |
| SAMT: Setting 2 | 49.4 | 85.8 | 68.9 |
| SAMT: Setting 3 | 51.4 | 88.0 | 71.1 |
| SAMT: Setting 4 | 42.0 | 91.3 | 65.6 |

Table 6: Results of the semi-automatic mapping task (second experiment)

| Approach | Recall (%) |
|---|---|
| NLM browser (Word and Exact Match) | 41.6 |
| ITServer browser | 54 |
| SSMT: Setting 1 | 56.8 |
| SSMT: Setting 2 | 68.8 |
| SSMT: Setting 3 | 71.0 |
| SSMT: Setting 4 | 51.0 |

Table 7: Summary of the techniques used by NLM, ITServer and SAMT for the automatic mapping task

| Approach | Preprocessing of terms | Name-based techniques | Query expansion | Structural techniques | Disambiguation techniques |
|---|---|---|---|---|---|
| NLM browser (Exact Match) | X | Exact Match | X | X | X (it returns a set of candidate mappings) |
| ITServer browser | ✓ | Approximate String Matching | X | X | X (it returns a ranking of candidate mappings) |
| SAMT: Setting 3 | ✓ | Approximate String Matching | ✓ | ✓ | ✓ (heuristic rules and SVM select final mappings) |


Table 8: Sample of synonyms found by our system in SNOMED-CT

| Type of “synonyms” | synonyms (Spanish) | synonyms (English) |
|---|---|---|
| words with very similar meaning in all contexts | total ~ completo; parcial ~ incompleto; bucal ~ oral | total ~ complete; partial ~ incomplete; buccal ~ oral |
| words with similar meaning in the medical context | curetaje ~ raspado; miembro ~ extremidad; extirpación ~ remoción | curettage ~ excision; member ~ limb; extirpation ~ removal |
| nouns and adjectives pointing to the same concept | estómago ~ gástrico; hígado ~ hepático; piel ~ cutáneo | stomach ~ gastric; liver ~ hepatic; skin ~ cutaneous |

Table 9: Examples of mismatches between SAMT and human encoders.

| Search term (in English and Spanish) | Automatic mapping using SAMT (SNOMED-CT concept) | Expert mapping (in English and Spanish) | Comment |
|---|---|---|---|
| Extirpation of major salivary gland / Extirpación de glándula salival mayor | Excision of major salivary gland / Resección de glándula salival mayor (ID = 234937001) | Excision of salivary gland / Resección de glándula salival (ID = 71735005) | This SAMT mapping could be considered correct, and even more accurate than the expert mapping |
| Incisional biopsy of breast / Biopsia por incisión de mama | Incisional biopsy of breast / Biopsia por incisión de mama (ID = 237378001) | Incisional biopsy of breast mass / Biopsia por incisión de tumor mamario (ID = 28768007) | SAMT's mapping is more generic than the expert mapping |
| Biopsy of globe / Biopsia de globo ocular | Biopsy of lesion of globe / Biopsia de lesión del globo ocular (ID = 231559005) | Biopsy of eye proper / Biopsia del ojo propiamente dicho (ID = 446938009) | SAMT's mapping is less accurate than the expert mapping |


Figures

Fig. 1 Example of SNOMED-CT concepts and relationships

Fig. 2 Search tool workflow

Fig. 3 Automatic discovering of synonyms for ‘excision’ in SNOMED-CT

Fig. 4 Example of SNOMED-CT relationships that provide additional context about the concepts ‘total pneumonectomy’ and ‘excision biopsy of skin lesion’

Fig. 5 Analysis of lexical candidates through SNOMED-CT relationships