A Language Classifier that Automatically Divides Medical Documents for Experts and Health Care Consumers

Michael POPRAT a, Kornél MARKÓ b, Udo HAHN a

a Jena University Language & Information Engineering (JULIE) Lab, Jena, Germany
b Medical Informatics Department, Freiburg University Hospital, Germany

Abstract. We propose a pipelined system for the automatic classification of medical documents according to their language (English, Spanish and German) and their target user group (medical experts vs. health care consumers). We use a simple n-gram based categorization model and present experimental results for both classification tasks. We also demonstrate how this methodology can be integrated into a health care document retrieval system.

1. Introduction

The user population of medical document retrieval systems, and the search strategies its members employ, are highly diverse. Not only physicians but also nurses, medical insurance companies and patients increasingly access these text-based resources. Health care consumers (henceforth, HCCs), in particular, spend more and more time gathering information from Web resources in order to better understand their personal health status, sufferings, (in)adequate treatments, medications and other interventions. Although many patients already exhibit a considerable level of expertise concerning their diseases, not every piece of relevant information on the Web is suitable and understandable for non-experts. To help HCCs find appropriate information, and to help medical experts suppress layman-oriented material, we built a text analysis pipeline around a simple but high-performing classification algorithm. It first identifies the natural language in which a document is written and subsequently divides the document space into easily understandable texts and those that require expert-level knowledge for proper understanding. The approach we propose uses only the occurrence frequencies of typical character combinations (n-grams) and thus refrains from heavy linguistic analysis.

2. Methods

In our experiments, we used the off-the-shelf classification software TEXTCAT, which is primarily designed for determining the language of documents. The underlying method is very simple (see also [1]): the most frequent character n-grams extracted from training documents represent a particular category (the category model). In order to

classify a new, unseen document, a ranked n-gram list (the document profile) is extracted from the new document and compared to every category model. The similarity criterion is based on a simple two-step procedure applied to both the category models and the document profiles. First, the document profile is compared pairwise to each model profile. The rank-order statistic measures how far 'out of place' [1] a particular n-gram in the document profile is from its position in the category model. The sum of the out-of-place values over all n-grams is the document's distance from that particular category. Second, the document is assigned to the category to which it has the smallest overall distance.

For the language classification task, we left all parameters of the TEXTCAT tool 'as is', except that we always enforced a non-ambiguous classification decision. We used the language models (i.e., the category models) for English, Spanish and German that come with TEXTCAT. Table 1 depicts samples from category profiles for the language classification task. Columns 2-4 show the models for the different languages considered; column 5 represents a document profile.

Rank   LMEN   LMES   LMGE   DocGE
26     s_     es     ei     b
27     er     _l     k      er_
28     _o     de_    in     _d
29     he_    la     te     er_
30     d_     os     ie     f
31     t_     _de_   b      he
32     the_   _p     t_     te

Table 1: N-gram Based Character Language Models for English (LMEN), Spanish (LMES) and German (LMGE) and for a German Document (DocGE) (Rank 26-32). Bold N-grams Occur Both in the Document Profile and in the Language Models.
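The profile construction and out-of-place comparison described above can be sketched in a few lines. This is a minimal re-implementation of the Cavnar-Trenkle scheme, not the actual TEXTCAT code; the parameter choices (n-grams up to length 5, top 300 ranks) are illustrative assumptions.

```python
from collections import Counter

def ngram_profile(text, max_n=5, top_k=300):
    """Ranked character n-gram profile (most frequent first); words are
    padded with '_', as in the Cavnar-Trenkle method."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common(top_k)]

def out_of_place(doc_profile, category_model):
    """Sum of rank displacements between document profile and category
    model; n-grams missing from the model receive a maximum penalty."""
    model_rank = {gram: i for i, gram in enumerate(category_model)}
    max_penalty = len(category_model)
    return sum(abs(i - model_rank[g]) if g in model_rank else max_penalty
               for i, g in enumerate(doc_profile))

def classify(text, category_models):
    """Assign the text to the category with the smallest overall distance."""
    profile = ngram_profile(text)
    return min(category_models,
               key=lambda cat: out_of_place(profile, category_models[cat]))
```

Given one model per language built from sample text, `classify` returns the language whose ranked n-gram list the unseen document's profile displaces least.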

3. Experimental Setting

In order to evaluate the language and target group classification results, we acquired various text corpora from the Web and from online textbooks (see Table 2). We eliminated style tags, hyperlinks and other markup from the Web documents and obtained plain text files. For English, we used the online version of Merck, a collection of Medline abstracts, and documents from the Internet portals Netdoctor (www.netdoctor.co.uk) and Mayoclinic (www.mayoclinic.com). For Spanish, our corpora consist of the Spanish version of Merck, documents from the Scientific Electronic Library Online (www.scielo.org), the Spanish version of Netdoctor (www.netdoctor.es) and Familydoctor (www.familydoctor.es). Finally, for German, we used the German version of Merck, German abstracts indexed in Medline, and the German version of Netdoctor (www.netdoktor.de) as well as Medicine World Wide (www.m-ww.de).

       Expert               HCC
EN     Merck (48 MB)        Netdoctor (16 MB)
       Medline (65 MB)      Mayoclinic (3.6 MB)
ES     Merck (12 MB)        Netdoctor (6.5 MB)
       Scielo (4.4 MB)      Familydoctor (182 MB)
GE     Merck (48 MB)        Netdoctor (107 MB)
       Medline (21 MB)      MWW (3.6 MB)

Table 2: Survey of the Corpora Collection (Sizes in Parentheses)

Besides the language dimension (English, Spanish, German), we also distinguished between texts written for experts (articles from the Merck text books,

abstracts from Medline, and documents from Scielo) and those written for HCCs (Netdoctor, Mayoclinic, Familydoctor, Medicine World Wide). The document collection was randomly split ten times into two sets of about equal size to obtain a training set and a test set with no overlapping documents (see Figure 1, A). In the training phase, we created target group models (TGMs, Figure 1, B) by extracting and ranking n-grams from typical expert and HCC documents. In contrast to language models, for which a training document size between 21 KB and 132 KB is sufficient [1], the target group models had to cover a broader variety of styles for this coarse-grained classification. We therefore used a larger training set (66 MB–102 MB). In the test phase, all documents to be classified were first mixed. Afterwards, TEXTCAT was used with the language models (LMEN, LMES, LMGE) that come with the software. Thus, we obtained a test set for the second classification task, namely the distinction between expert and HCC documents (Figure 1, C). For this purpose, we also made use of TEXTCAT, though we now relied on the models created in the training phase. The final result is a collection of documents classified according to their language, as well as to the target group they are likely to address (Figure 1, D).

Figure 1: Experimental Setting: Document Processing Pipeline for Testing and Training
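The experimental setup above (ten random 50/50 splits, then a two-stage language and target-group classification of the test half) can be sketched as follows. The function names and the stubbed-in classifiers are hypothetical placeholders, not part of the authors' implementation.

```python
import random

def split_collection(docs, seed):
    """Randomly split the collection into disjoint training and test
    halves of about equal size (one of the 10 repetitions)."""
    rng = random.Random(seed)
    shuffled = docs[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def run_pipeline(docs, train_models, classify_language, classify_target_group):
    """Two-stage pipeline over 10 random splits: train per-language
    target group models, then classify each test document first by
    language and then by target group (expert vs. HCC)."""
    results = []
    for seed in range(10):
        train, test = split_collection(docs, seed)
        tgms = train_models(train)                # per-language expert/HCC models
        for doc in test:
            lang = classify_language(doc)         # step 1: language
            group = classify_target_group(doc, tgms[lang])  # step 2: target group
            results.append((doc, lang, group))
    return results
```

Any concrete classifier (e.g. an n-gram distance classifier as sketched earlier) can be plugged in for the two `classify_*` arguments.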

4. Experimental Results and Discussion

In our experiments, we evaluated whether the automatic procedures for distinguishing between the various languages and the two groups of expertise, viz. experts and HCCs, provided reasonable outcomes. For the language classification task, we achieved very good results (100% for English, 99.6% for Spanish and 95.4% for German; see Table 3). This yields a sound basis for the second test set to be used for the target

group classification. It is conspicuous that the results for the German language identification task are about 5 percentage points below those for English and Spanish. An analysis of the misclassified documents revealed that they were classified as English documents because their predominant language was, surprisingly, English (although they were crawled from www.netdoktor.de). For the target group classification, we classified the documents from the second test set (i.e., after language classification) using TEXTCAT and the language-specific expert and HCC models created in the training phase. Here, we attained overall classification results between 89.4% for German and 95.8% for Spanish, with English (94.9%) close to Spanish in terms of performance (see also Table 4). The accuracy results for the correct classification of HCC documents were in general higher (90.0% for German, 95.0% for English, and 98.1% for Spanish) than those for expert documents (88.9% for German, 93.9% for Spanish, and 94.8% for English). Again, the overall classification results for German documents are about 5 percentage points below those for Spanish and English. The reason might be the target group model for HCC documents (TGM-HCC-GE), which resulted from the noisy document collection we used in the training phase (see above). The comparison of the two classification tasks shows different results. Firstly, language classification yields better results than target group classification, for an obvious reason: guessing a language is simpler with the methods we use, because substantial characteristics of a particular language are mirrored in the respective character set (umlauts in German, accents in Spanish, neither in English). Therefore, the language models have high discriminative power.
By contrast, the distinction between expert and HCC documents is, on the one hand, a less well-defined decision task and, on the other hand, seems to require a deeper text (and content) analysis. Still, our results show that simple methods yield high accuracy figures for various document classification tasks.

English    Spanish    German
100.0%     99.6%      95.4%

Table 3: Accuracy Results for Language Classification (Coverage: 100%)

          English    Spanish    German
Expert    94.8%      93.9%      88.9%
HCC       95.0%      98.1%      90.0%
All       94.9%      95.8%      89.4%

Table 4: Accuracy Results for Expert-HCC Classification (Coverage: 100%)
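The per-class and overall accuracy figures of the kind reported in Tables 3 and 4 can be computed from (gold, predicted) label pairs as below; this is a generic illustrative helper, not the authors' evaluation script.

```python
from collections import defaultdict

def per_class_accuracy(pairs):
    """Accuracy per gold class plus overall accuracy, computed from an
    iterable of (gold, predicted) label pairs."""
    totals, correct = defaultdict(int), defaultdict(int)
    for gold, pred in pairs:
        totals[gold] += 1
        correct[gold] += (gold == pred)
    per_class = {c: correct[c] / totals[c] for c in totals}
    overall = sum(correct.values()) / sum(totals.values())
    return per_class, overall
```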

5. Related Work

Comparing our experiments to others is difficult. First, apart from the simple n-gram approach, there are other classification methods with learning and classification functionality [2,3]. Second, many classification experiments are based on non-medical document sets with a varying number and granularity of categories. Karlgren and Cutting [4], e.g., use the same computational approach and run experiments on the mixed-domain English Brown corpus.¹ For the distinction between informative vs. imaginative texts, they report 96% and 95% accuracy, respectively. For a four-category classification task (press, fiction, non-fiction, and miscellaneous), they achieve between 75% and 95% accuracy. Kessler et al. [5]

¹ http://helmer.aksis.uib.no/icame/brown/bcm.html

employ a 13-category classification system, which achieves classification accuracies ranging between 93% and 100% for the scientific category. Stamatatos et al. [6] perform genre categorization on a four-category system by merely comparing absolute word frequencies in general language with those from four genre-specific corpora, a task for which they achieve around 97% accuracy. The state-of-the-art performance figures for medical text categorization, however, are due to Kornai and Richards [7], who achieve 86% correct classifications of clinical verbatims on a 12-category system. Up until now, almost no effort has been made to classify medical texts with respect to their target group. In a study closest in spirit to our approach and objectives [8], a multi-layered classification pipeline was presented. German documents were first classified into non-medical and medical ones. Then, from the latter set, documents with clinical (e.g., pathology reports) and non-clinical (textbooks) content were determined automatically. Finally, for the non-clinical texts, the target group (experts vs. laymen) was identified, yielding accuracies of 75% and 71.4%, respectively. In the present study, we have shown on a larger corpus that these classification results can be improved and that the proposed method can also be applied to languages other than German.

6. Conclusion, Current and Future Work

Figure 2: Screen Shot from the MorphoSaurus Search Engine (http://www.morphosaurus.net)

We have proposed a 3x2 classification system that relies on n-gram statistics to classify documents according to their language and their target group, viz. medical experts vs. health care consumers. Based on the good results we achieved, we applied it as an add-on to the MorphoSaurus indexing system [9]. We supply a search engine that retrieves documents fulfilling an expert's or an HCC's

information need. We tested this approach using the language and target group models described in the previous sections and added the language and target group information to the document index structure created by the MorphoSaurus system. The screenshot in Figure 2 shows retrieval results for the query "quit smoking", a typical request by health care consumers. Because of the multi-linguality of the MorphoSaurus system, not only English documents can be retrieved but also German and Spanish ones (see [9] for details). Figure 2 shows that for both English and German documents the information about their particular target group (patients as HCCs vs. medical experts) is given. More importantly, this approach also makes it possible to classify documents from text collections that are not part of the training set (here: documents from www.aerzteblatt.de) and, because of their heterogeneous readership, to divide them into their respective target groups. All in all, because of its high performance, the classification pipeline proposed in this paper indeed improves the quality of document retrieval systems used by a multilingual and expertise-wise heterogeneous user community. In our future work, we plan to substantiate this claim by carrying out retrieval experiments in which experts as well as HCCs judge the documents returned in a result set with regard to satisfying their information needs in terms of relevance and comprehensibility.
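Storing the classifier's language and target group decisions alongside each indexed document makes retrieval-time filtering trivial. The following is a minimal sketch with a hypothetical in-memory index; the field names (`lang`, `group`) and documents are illustrative assumptions, not the MorphoSaurus index format.

```python
# Hypothetical index entries: classifier output stored per document.
index = [
    {"id": 1, "text": "How to quit smoking ...", "lang": "en", "group": "hcc"},
    {"id": 2, "text": "Nicotine replacement therapy trial ...", "lang": "en", "group": "expert"},
    {"id": 3, "text": "Mit dem Rauchen aufhören ...", "lang": "de", "group": "hcc"},
]

def filter_hits(hits, lang=None, group=None):
    """Restrict retrieval results to a given language and/or target
    group; None means 'no restriction' on that dimension."""
    return [h for h in hits
            if (lang is None or h["lang"] == lang)
            and (group is None or h["group"] == group)]
```

A consumer-facing front end would call, e.g., `filter_hits(hits, group="hcc")`, while an expert view would request `group="expert"`.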

Acknowledgement

This work was partly supported by Deutsche Forschungsgemeinschaft (DFG), grant KL 640/5-2, and the European Network of Excellence "Semantic Mining" (NoE 507 505).

References

[1] Cavnar WB and Trenkle JM. N-gram based text categorization. In SDAIR'94, pp 161–175. Las Vegas, NV, USA, April 11–13, 1994.
[2] Calvo RA, Lee J-M, and Li X. Managing content with automatic document classification. Journal of Digital Information, 5(2), 2004 (Article No. 282).
[3] Sebastiani F. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47, 2002.
[4] Karlgren J and Cutting D. Recognizing text genres with simple metrics using discriminant analysis. In COLING'94, pp 1071–1075. Kyoto, Japan, August 5–9, 1994.
[5] Kessler B, Nunberg G, and Schütze H. Automatic detection of text genre. In ACL'97/EACL'97, pp 32–38. Madrid, Spain, July 7–12, 1997.
[6] Stamatatos E, Fakotakis N, and Kokkinakis G. Text genre detection using common word frequencies. In COLING 2000, pp 808–814. Saarbrücken, Germany, July 31 – August 4, 2000.
[7] Kornai A and Richards JM. Linear discriminant text classification in high dimension. In Hybrid Information Systems. Proceedings of the 1st International Workshop on Hybrid Intelligent Systems, pp 527–537. Adelaide, Australia, December 11–12, 2001. Heidelberg: Physica, 2002.
[8] Hahn U and Wermter J. Pumping documents through a domain and genre classification pipeline. In LREC 2004, Vol. 3, pp 735–738. Lisbon, Portugal, May 26–28, 2004.
[9] Markó K, Schulz S, and Hahn U. MorphoSaurus: Design and evaluation of an interlingua-based document retrieval engine for the medical domain. Methods Inf Med, 44(4):537–545, 2005.
