2014 IEEE International Conference on Bioinformatics and Biomedicine
Developing a Linguistically Annotated Corpus of Chinese Electronic Medical Records

Zhipeng Jiang, Fangfang Zhao, Yi Guan
Department of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
Email: [email protected]

978-1-4799-5669-2/14/$31.00 ©2014 IEEE

Abstract—The Electronic Medical Record (EMR) is the material basis of smart healthcare, and its automatic analysis depends on natural language processing (NLP) technologies. Syntactic analysis, a basic NLP technology, can be used to convert the free text of EMRs into structured text. However, research on syntactic analysis, and even on Chinese word segmentation and part-of-speech (POS) tagging, for Chinese electronic medical records (CEMRs) is still at a blank stage because of the lack of an annotated CEMR corpus. To resolve this problem, we propose an annotation scheme covering Chinese word segmentation through syntactic analysis, and build the first syntactically annotated corpus of CEMRs. By analyzing the annotated CEMRs, we find that they exhibit stronger grammatical regularity and a particular statistical distribution. These findings are exploited to improve the Stanford parser and to develop a state-of-the-art Chinese word segmentation and POS tagging system for CEMRs. The evaluation results show that statistical machine learning models benefit substantially from the annotated CEMRs.

Keywords—CEMR; Chinese word segmentation; part-of-speech tagging; syntactic analysis

I. INTRODUCTION

The EMR is the storage of all health care data and information in electronic format [1]. It is a kind of semi-structured text containing masses of free text. NLP techniques are regarded as the solution for analyzing and structuring this free text so as to mine medical knowledge. For this purpose, many Medical Language Processing (MLP) systems have appeared, mainly focused on medical information extraction; they were designed for specific applications and achieved satisfactory results on some specific tasks. However, most MLP systems still remain at the rule-based level, and complex, general-purpose systems are hard to build because of the lack of annotated corpora. This problem is more prominent in China, since CEMR-oriented information extraction research has only just started. Chinese word segmentation and POS tagging, the basis of the mainstream Chinese information extraction framework, are still trained on general-domain corpora [2][3][4]. Syntactic analysis has not even been integrated into Chinese medical information extraction systems, despite having been proven to improve the performance of medical information extraction [5][6].

In the general domain, Chinese word segmentation and English POS tagging under the framework of statistical machine learning have achieved high accuracy, close to that of manual annotation (about 98% [7] and 97% [8], respectively). English syntactic analysis reaches an F1 score of only 91% because of its complexity. Chinese POS tagging and syntactic analysis lag about 5-6 percentage points behind English because of the variability of Chinese sentences. Nevertheless, these high-precision models usually require a large-scale annotated corpus for training, so corpus engineering plays an important role in statistical NLP development. The Penn Treebank (PTB) [9] is a classic annotated English general-domain corpus covering technology, journalism, literature and other areas; its guidelines lay the foundation for the development of other annotation guidelines.

Introducing a domain-specific dictionary or annotated corpus is considered a solution to the cross-domain performance penalty, and a domain-specific annotated corpus has proved more effective [10]. Most English annotation guidelines for biomedical corpora derive from the PTB, with varying degrees of modification. Smith [11] generalizes the medical vocabulary to expand the POS set. Pakhomov [12] follows the PTB POS set completely and only specifies some annotation rules for EMRs, such as special symbols, drug names, dosages and foreign words. The GENIA corpus [13], developed in 2003 and consisting of 500 MEDLINE abstracts, is a famous biomedical parsing treebank. It deletes some POS tags that do not appear in the biomedical literature and adds prefix and suffix tags according to the syntax. To extend the PTB guidelines to the clinical domain, 273 clinical notes were annotated with POS tags, shallow parses and named entities to form the Mayo corpus, which the clinical Text Analysis and Knowledge Extraction System (cTAKES) uses for training and evaluation [14]. The MiPACQ clinical corpus [15] is also taken from the Mayo Clinic EMR, but involves more comprehensive annotation, including syntactic annotations, predicate-argument semantic annotations and Unified Medical Language System (UMLS) entity semantic annotations. By training on this corpus, the performance of NLP components on English clinical text is boosted significantly.

In contrast, there is not yet any lexically and syntactically annotated Chinese clinical text. Additionally, the English annotation guidelines for clinical text cannot be used directly because of the language gap. These issues create barriers to the development of MLP in China.

II. METHODS
A. Annotation Scheme

Our corpus is randomly sampled from the CEMRs of a large comprehensive Level-A hospital in China. We have proposed a preprocessing scheme for the particularities of CEMR data acquisition, and an annotation scheme that adapts English biomedical annotation practice to CEMRs [16]. In the guideline revision stage, the PCTB segmentation guidelines, POS tagging guidelines and bracketing guidelines were simplified, supplemented and modified to suit CEMRs [16]. The annotators, graduates with a linguistics background, were trained and consulted doctors while annotating the CEMRs.
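To make the three annotation layers concrete, here is a minimal illustration on a short invented clinical phrase (not taken from the corpus); the tags follow PCTB conventions:

```python
# Illustrative (hypothetical) example of the three annotation layers on the
# phrase "腹部平坦" ("abdomen is flat"), in PCTB-style notation.
segmented  = ["腹部", "平坦"]                      # layer 1: word segmentation
pos_tagged = [("腹部", "NN"), ("平坦", "VA")]      # layer 2: POS tagging
bracketed  = "(IP (NP (NN 腹部)) (VP (VA 平坦)))"  # layer 3: syntactic bracketing

# The layers must stay consistent: the POS layer tags exactly the words
# produced by the segmentation layer.
assert [w for w, _ in pos_tagged] == segmented
```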
B. Annotator Agreement

To guarantee the quality of the corpus, in each round the annotators first annotated the same small corpus independently, at every stage of Chinese word segmentation, POS tagging and syntax annotation. Then, based on the gold corpus produced after discussion, the inter-annotator agreement (IAA) on Chinese word segmentation and POS tagging was computed according to formula (1). EvalB (http://nlp.cs.nyu.edu/evalb/), a well-known parse tree evaluation tool, was used to calculate the IAA on syntax annotation.

    IAA = (# of agreed tags) / (# of all tags)        (1)

Fig. 1. Annotation by the first annotator

Fig. 2. Annotation by the second annotator

TABLE I. AVERAGE IAA IN THE FIRST THREE ITERATIONS

    Annotation layer             Average IAA (%)
    Chinese word segmentation    97.56
    POS tagging                  93.34
    Syntax annotation            91.22

Apart from ambiguities in the guidelines, the major source of disagreement on Chinese word segmentation is the lack of clinical knowledge. Our solutions are to record uncertain cases for consultation with doctors and to segment terminology at a large granularity. For example, the word "上颌窦炎 (maxillary sinusitis)" is not segmented, so as to guarantee its integrity.

Disagreements on syntax annotation are mostly caused by the ellipsis of grammatical constituents. Reassuringly, only a few disagreements relied on clinical knowledge. For example, the first annotator deemed "腹部 (abdomen)" the subject of the whole sentence and annotated it as in Fig. 1, whereas the second annotator considered the clause "无胃肠型及蠕动波 (without gastrointestinal pattern or peristaltic wave)" to have an elided subject, presumably the patient, and annotated it as in Fig. 2. This choice depended entirely on the annotators' background knowledge. Disagreements caused by background knowledge did affect the IAA to some extent, but with increasing iterations the IAAs still trended upward. The averages shown in Table I are satisfactory and even close to those of the English clinical treebank [15]. This implies that annotation experience can make up for the lack of domain knowledge, and that the annotators are able to build a high-quality corpus.

III. RESULTS AND ANALYSIS

A. Corpus Characteristics

We have built a corpus of 2,553 sentences with Chinese word segmentation, POS tags and phrase tags. It is a collection of 138 records comprising discharge summaries and progress notes; 70 are from the department of neurology and 68 from the department of general surgery. Discharge summaries contain the sections Discharge Instructions (DI), Treatment Effects (TE), Discharge Conditions (DC), Treatment Course (TC), Admission Conditions (AC), Clinical Definite Diagnosis (CDD), Clinical Initial Diagnosis (CID), Admitted Diagnosis of Clinic (ADC), Date of Admission/Discharge (DAD) and Patient Information (PI). Progress notes contain the sections Treatment Plan (TP), Differential Diagnosis (DD), Assessment (AS), Characteristics of Cases (COC), Clinical Initial Diagnosis (CID) and Chief Complaint (CC). Each record is segmented in XML according to its section captions and annotated in PCTB format; a sample is shown in Fig. 3. Some basic statistics, covering the distribution of the main POS categories and punctuation, are given in Tables II and III.

From the lexical view, the average sentence length of the CEMRs (25.99 tokens per sentence) is similar to that of the general corpus (PCTB) as a whole, but it differs significantly across EMR sections: the longest sentences, 67.67 tokens on average, occur in the COC section, and the shortest, just 1.0 token per sentence, in the TE section.
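The IAA measure of Eq. (1) is simply token-level agreement between two annotators. A minimal sketch, with invented tag sequences for illustration:

```python
# Sketch of Eq. (1): IAA = (# of agreed tags) / (# of all tags),
# i.e. the fraction of positions where two annotators assign the same tag.
def inter_annotator_agreement(tags_a, tags_b):
    assert len(tags_a) == len(tags_b), "annotations must align token-for-token"
    agreed = sum(1 for a, b in zip(tags_a, tags_b) if a == b)
    return agreed / len(tags_a)

# Invented PCTB-style tag sequences from two annotators (illustrative only).
ann1 = ["NN", "VV", "NN", "PU"]
ann2 = ["NN", "VV", "CD", "PU"]
print(inter_annotator_agreement(ann1, ann2))  # → 0.75
```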
Fig. 3. Example from annotated corpus

TABLE II. THE DISTRIBUTION OF MAIN POS CATEGORIES

    POS tag    Total (%)
    NN         27.41
    PU         21.69
    VV         13.15
    CD          8.07

TABLE III. THE DISTRIBUTION OF MAIN PUNCTUATION

    Punctuation    Total (%)
    Comma          67.55
    Period         14.67
    Colon           5.73
    Quotes          3.59

To further analyze the grammatical phenomena of the CEMRs, we summarized some syntactic characteristics according to their statistical distribution. We compared the average Length-Depth Ratio (LDR) of the parse trees in each section, where the LDR is calculated as:

    LDR = (# of tokens) / (# of layers)        (3)

The results show that the shape of the parse trees varies significantly from section to section. In detail, the parse trees in sections that mainly describe symptoms, such as the AC section (0.43), are flatter and differ from sections that list diseases, such as the CID section (1.94). On the other hand, the proportion of omitted labels is about twice that in the PCTB, meaning that elliptical constructions are used more frequently to form concise expressions. For instance, the sentence "右侧存在中枢性面瘫 (central facial palsy exists on the right)" is simplified to "右侧中枢性面瘫 (right-sided central facial palsy)" by omitting the predicate, and its syntactic category accordingly changes from a simple clause (IP) to a noun phrase (NP). We want to emphasize that this flat structure with ellipsis may make traditional statistical parsing models perform even worse.

B. Development and Evaluation of the POS Tagger and Parser

Before building a POS tagger and parser with the annotated CEMRs, we trained the Stanford POS tagger and parser [17] on the PCTB corpus and tested them on the different CEMR sections. The precision of POS tagging and the F1 measure of parsing, computed with EvalB, served as evaluation metrics. Interestingly, the two metrics do not satisfy the generally assumed positive correlation: they correlate negatively at -0.9 in the CDD, CID, ADC, DAD and PI sections, whereas the correlation in the other sections, such as DI, TE, DC, TC and AC, remains high (0.99). We find that the text of the negatively correlated sections is more structured, with information simply listed and little rich context. This different writing style may cause a different statistical distribution. In the next experiment, we attempt to group the sections by this difference so as to improve the parsing model by enriching the training corpus with data from the same distribution.

For a full parsing model, we conducted an initial study training and testing the Stanford parser on the annotated CEMRs. The annotated CEMRs from the department of neurology were split at random in an 80/20 manner, with 80% of the sentences as the training set and 20% as the test set. We also grouped the training set by correlation to verify the above assumption, and used 20% of the CEMRs from the department of general surgery to test cross-department adaptation.

Table IV presents the details of training models on the different corpora. The results using the CEMRs (CEMRa) are significantly higher than those using the PCTB, even though the number of annotated sentences is small, indicating that in-domain data bring substantial improvement. Grouped training (CEMRg) shows a small improvement in parsing F1 score and even a slight decline in POS tagging accuracy, because the corpus is not large enough and the sizes of the different sections are unbalanced; for example, the DAD section, containing only 9 sentences, obtains a large enhancement (22.09%). Last but not least, the results on test data from the department of general surgery (CEMRc) are quite poor, which shows that corpora from different departments can be regarded as corpora from different domains.

TABLE IV. EVALUATION ON THE PCTB AND CEMR CORPUS

    Evaluation metric           PCTB     CEMRa    CEMRg    CEMRc
    POS tagging accuracy (%)    77.68    93.76    93.59    85.94
    Parsing F1 score (%)        53.58    80.36    80.68    68.64
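The LDR statistic can be sketched as follows. The bracket-reading helper and the example tree are ours for illustration, and we assume every level of bracket nesting counts as one layer:

```python
# Sketch of LDR = (# of tokens) / (# of layers) for one bracketed parse tree.
def ldr(bracketed):
    toks = bracketed.replace("(", " ( ").replace(")", " ) ").split()
    depth = max_depth = n_words = 0
    for i, t in enumerate(toks):
        if t == "(":
            depth += 1
            max_depth = max(max_depth, depth)   # layers = deepest nesting level
        elif t == ")":
            depth -= 1
        elif i + 1 < len(toks) and toks[i + 1] == ")":
            n_words += 1                         # terminal word before a close
    return n_words / max_depth

# Invented PCTB-style tree (illustrative): 2 tokens, 3 layers.
print(ldr("(IP (NP (NN 腹部)) (VP (VA 平坦)))"))  # → 0.666...
```

A flat tree packs many tokens into few layers (high LDR); a deep, nested tree drives the ratio down, which is the contrast the section-by-section comparison measures.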
For Chinese word segmentation and POS tagging (S&T), we took a character-based S&T model [18] as the baseline and built a more effective S&T system. We first proposed new feature templates for CEMRs, and replaced the Viterbi search algorithm with beam search in training. Moreover, the transformation-based error-driven model [19], a rule-based machine learning model, was added as a post-processing step to correct tagging errors; here too, we chose a more efficient training algorithm [20] and proposed new transformation templates for CEMRs.
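The transformation-based post-processing applies learned rules of the form "in context C, change tag X to Y" to the statistical tagger's output. A minimal sketch; the context template (previous tag) and the sample rule are ours for illustration, not the templates proposed in the paper:

```python
# Sketch of transformation-based error-driven post-correction: each rule
# says "if the previous tag is prev_tag and the current tag is old, retag
# the current token as new". Rules are applied in order over the sequence.
def apply_transformations(tagged, rules):
    out = list(tagged)
    for prev_tag, old, new in rules:
        for i in range(1, len(out)):
            word, tag = out[i]
            if tag == old and out[i - 1][1] == prev_tag:
                out[i] = (word, new)
    return out

# Illustrative rule: after a number (CD), retag NN as a measure word (M).
rules = [("CD", "NN", "M")]
tagged = [("3", "CD"), ("次", "NN")]
print(apply_transformations(tagged, rules))  # → [('3', 'CD'), ('次', 'M')]
```

In the Brill-style setup [19], such rules are not hand-written but learned greedily, each chosen to maximally reduce the residual error of the tagger on the training data.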
Finally, we compared our S&T system with related work in Table V. To enlarge the training data, we combined the PCTB corpus with a random 80% of the CEMRs to train all models, and took the remaining 20% as test data. Our system clearly outperforms the other systems; in particular, it surpasses the word-lattice based model [7], a state-of-the-art S&T model in the general domain, by about 4% in both Chinese word segmentation and POS tagging. These encouraging results illustrate that the rule-based method is well suited to CEMRs; in other words, the assumption that CEMRs have stronger grammatical regularity is verified. Compared with the pipeline framework, the joint framework remains superior on CEMRs.

TABLE V. COMPARISON OF OUR S&T SYSTEM AND RELATED WORK

    POS tagger         Chinese word segmentation (F1 score)    POS tagging (F1 score)
    Character-based    90.15%                                  88.73%
    Word-lattice       90.45%                                  89.05%
    Ours, pipeline     84.15%                                  82.11%
    Ours, joint        94.39%                                  93.20%

IV. CONCLUSION

In this paper, we have presented an annotation scheme for CEMRs, including text preprocessing, guideline adaptation and agreement evaluation. To our knowledge, we have built the first CEMR corpus with both lexical and syntactic annotations. Based on this corpus, we summarize the characteristics of CEMRs and group the training data by correlation to improve the performance of the Stanford parser. Furthermore, an accurate S&T system is implemented and gives the best published results on CEMRs. The strong performance of our system implies that the sublanguage domain of clinical text can make rule-based methods thrive again. To resolve the problem of adaptation across departments, we plan to explore semi-automatic annotation, such as active learning, to build a larger and higher-quality CEMR corpus.

ACKNOWLEDGMENT

This work was supported by the Natural Science Foundation of China under grant No. 60975077.

REFERENCES

[1] Terry J. Hannan, "Electronic medical records," Health Informatics: An Overview, 1996, pp. 133-148.
[2] Ye Feng, Chen YingYing, Zhou GenGui, Li HaoWen and Li Ying, "Intelligent Recognition of Named Entity in Electronic Medical Records," Chinese Journal of Biomedical Engineering, vol. 30, no. 2, pp. 256-262, 2011. (in Chinese)
[3] Han X and Ruonan R, "The Method of Medical Named Entity Recognition Based on Semantic Model and Improved SVM-KNN Algorithm," in Proceedings of the 2011 IEEE SKG, 2011, pp. 21-27.
[4] Li Yi, Bao Pengfei and Xue Wanguo, "Research on Information Extraction of Electronic Medical Records in Chinese," Journal of Biomedical Engineering, vol. 27, no. 4, pp. 757-762, 2010. (in Chinese)
[5] B. De Bruijn, C. Cherry, S. Kiritchenko, J. Martin, and X. Zhu, "Machine-learned solutions for three stages of clinical information extraction: the state of the art at i2b2 2010," Journal of the American Medical Informatics Association, vol. 18, no. 5, pp. 557-562, 2011.
[6] P. G. Mutalik, A. Deshpande, and P. M. Nadkarni, "Use of general-purpose negation detection to augment concept indexing of medical documents: a quantitative study using the UMLS," Journal of the American Medical Informatics Association: JAMIA, vol. 8, no. 6, pp. 598-609, 2001.
[7] K. Zhang and M. Sun, "Reduce Meaningless Words for Joint Chinese Word Segmentation and Part-of-speech Tagging," CoRR, 2013.
[8] Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer, "Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network," in Proceedings of HLT-NAACL, 2003, pp. 252-259.
[9] Marcus M P, Marcinkiewicz M A and Santorini B, "Building a large annotated corpus of English: The Penn Treebank," Computational Linguistics, vol. 19, no. 2, pp. 313-330, 1993.
[10] Liu K, Chapman W, Hwa R and Crowley R S, "Heuristic sample selection to minimize reference standard training set for a part-of-speech tagger," Journal of the American Medical Informatics Association: JAMIA, vol. 14, no. 5, pp. 641-650, 2007.
[11] Smith L, Rindflesch T and Wilbur W J, "MedPost: a part-of-speech tagger for bioMedical text," Bioinformatics, vol. 20, no. 14, pp. 2320-2321, 2004.
[12] Pakhomov S V, Coden A and Chute C G, "Developing a corpus of clinical notes manually annotated for part-of-speech," International Journal of Medical Informatics, vol. 75, no. 6, pp. 418-429, 2006.
[13] Kim J D, Ohta T, Tateisi Y and Tsujii J, "GENIA corpus—a semantically annotated corpus for bio-textmining," Bioinformatics, vol. 19, suppl. 1, pp. i180-i182, 2003.
[14] Savova G K, Masanz J J, Ogren P V, Zheng J, Sohn S, Kipper-Schuler K C and Chute C G, "Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications," Journal of the American Medical Informatics Association: JAMIA, vol. 17, no. 5, pp. 507-513, 2010.
[15] Albright D, Lanfranchi A, Fredriksen A, et al., "Towards comprehensive syntactic and semantic annotations of the clinical narrative," Journal of the American Medical Informatics Association: JAMIA, vol. 20, no. 5, pp. 922-930, 2013.
[16] Zhipeng Jiang, Fangfang Zhao, Yi Guan and Jinfeng Yang, "Research on Chinese Electronic Medical Record Oriented Lexical Corpus Annotation," High Technology Letters, vol. 24, no. 6, pp. 609-615, 2014. (in Chinese)
[17] Richard Socher, John Bauer, Christopher D. Manning and Andrew Y. Ng, "Parsing With Compositional Vector Grammars," in Proceedings of ACL, 2013, pp. 455-465.
[18] Nianwen Xue and Libin Shen, "Chinese word segmentation as LMR tagging," in Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, 2003, pp. 176-179.
[19] Brill E, "Transformation-based error-driven learning and natural language processing: a case study in part-of-speech tagging," Computational Linguistics, vol. 21, no. 4, pp. 543-565, 1995.
[20] Zhou Ming, Wu Jin and Huang Chang-Ning, "A fast learning algorithm for part of speech tagging: an improvement on Brill's transformation-based algorithm," Chinese Journal of Computers, vol. 21, no. 4, pp. 358-366, 1998. (in Chinese)