2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)
ML-CNN: a novel deep learning based disease named entity recognition architecture
Zhehuan Zhao1, Zhihao Yang1,*, Ling Luo1, Yin Zhang2,*, Lei Wang2, Hongfei Lin1, Jian Wang1
1 College of Computer Science and Technology, Dalian University of Technology, Dalian, China, 116023
2 Beijing Institute of Health Administration and Medical Information, Beijing, China, 100850
E-mail:* [email protected], [email protected]
Abstract— In this paper, we present a deep learning based disease named entity recognition architecture. First, the word-level embedding, character-level embedding and lexicon feature embedding are concatenated as input. Then multiple convolutional layers are stacked over the input to extract useful features automatically. Finally, a multiple label strategy, introduced here for the first time, is applied to the output layer to capture the correlation information between neighboring labels. Experimental results on both the NCBI and CDR corpora show that ML-CNN achieves state-of-the-art performance.
Keywords—disease; named entity recognition; convolutional neural network; deep learning; multiple label strategy
I. INTRODUCTION
Manual curation of disease names from the literature is expensive, and it is difficult to keep up with the rapidly growing amount of relevant literature. Hence, automatic disease named entity recognition (DNER) is of utmost importance. Recently, deep learning methods have attracted much attention for NER in the general domain, as they can achieve state-of-the-art performance with little feature engineering [1-3]. However, most deep learning methods treat NER as a sentence-level sequence tagging problem, which makes it more complicated than it needs to be. In this paper, a multiple label convolutional neural network (ML-CNN) is proposed. It treats NER as a simple word-level classification problem in which only the context within a fixed-size window around the target word is considered as input.
II. METHOD
In our method, we assume that the context information is sufficient to predict the target word’s label correctly. Therefore, ML-CNN treats NER as a simple word-level classification problem in which only the context within a fixed-size window around the target word is fed into the network. First, each word in the context is represented as a real-valued vector generated by concatenating its word-level embedding, character-level representation and lexicon feature embedding. Then, the context words’ embedding representations are fed into the stacked CNNs. Finally, the multiple label strategy (MLS), introduced here for the first time, is applied to the output layer: it captures the correlation between neighboring labels simply by predicting the labels of the previous and the next words as auxiliary outputs.
III. EXPERIMENTAL RESULTS
We validated ML-CNN on two corpora containing both mention-level and concept-level annotations: the NCBI Disease corpus [4] and the BioCreative V Chemical Disease Relation task (CDR) corpus [5]. ML-CNN achieves state-of-the-art performance on both corpora. Comparisons with other state-of-the-art methods indicate that ML-CNN can learn useful features automatically, since it needs little feature engineering. The ablation tests show that the MLS plays a key role in our method.
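To make the word-level formulation of the Method section concrete, the following minimal Python sketch shows how one labeled sentence could be turned into per-word training examples: a fixed-size context window as input, and the previous, current and next labels as targets in the spirit of MLS. It is an illustration under stated assumptions, not the authors' implementation: the window half-width of 2, the `<PAD>` token, the BIO label scheme, the use of `O` for neighbors outside the sentence, and all function names are hypothetical choices that the paper does not specify. The sentence is a toy example, not corpus data.

```python
from typing import List, Tuple

PAD_TOKEN = "<PAD>"  # hypothetical padding symbol for positions outside the sentence
PAD_LABEL = "O"      # assumption: out-of-sentence neighbors count as non-entity


def context_window(tokens: List[str], index: int, half_width: int) -> List[str]:
    """Return the 2 * half_width + 1 tokens centered on tokens[index], padded at the edges."""
    window = []
    for pos in range(index - half_width, index + half_width + 1):
        window.append(tokens[pos] if 0 <= pos < len(tokens) else PAD_TOKEN)
    return window


def multiple_label_targets(labels: List[str], index: int) -> Tuple[str, str, str]:
    """Return (previous, current, next) labels; the two neighbors are auxiliary targets."""
    prev_label = labels[index - 1] if index > 0 else PAD_LABEL
    next_label = labels[index + 1] if index + 1 < len(labels) else PAD_LABEL
    return prev_label, labels[index], next_label


def build_examples(tokens: List[str], labels: List[str], half_width: int = 2):
    """Turn one labeled sentence into one classification example per word."""
    return [
        (context_window(tokens, i, half_width), multiple_label_targets(labels, i))
        for i in range(len(tokens))
    ]


# Toy sentence with assumed BIO labels.
tokens = ["Mutations", "cause", "cystic", "fibrosis", "."]
labels = ["O", "O", "B-Disease", "I-Disease", "O"]
examples = build_examples(tokens, labels, half_width=2)
for window, targets in examples:
    print(window, targets)
```

Each example is independent of the rest of the sentence, which is what allows the model to be trained as a plain classifier rather than a sequence tagger. In a full model, each window token would then be mapped to the concatenated embedding described above before entering the stacked convolutional layers.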
IV. CONCLUSION
In this paper, we present a novel deep learning based disease NER architecture (ML-CNN). In this architecture, the word-level, character-level and lexicon feature embeddings are concatenated as the input of the CNN model. A CNN-based classifier is then built on these embeddings and used to recognize disease mentions in text. Experimental results on both the NCBI and CDR corpora show that ML-CNN achieves state-of-the-art performance. The main contributions of our work can be summarized as follows:
1. The word-level, character-level and lexicon feature embeddings can be tuned automatically during the training process. Therefore, little feature engineering is needed in ML-CNN.
2. MLS is adopted to capture the correlation information between neighboring labels, and it proved to be effective and efficient.
REFERENCES
[1] Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. “Natural language processing (almost) from scratch.” Journal of Machine Learning Research 12 (2011): 2493-2537.
[2] Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., and Dyer, C. “Neural architectures for named entity recognition.” Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2016.
[3] Sahu, S. K., and Anand, A. “Recurrent neural network models for disease name recognition using domain invariant features.” Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2016.
[4] Doğan, R. I., Leaman, R., and Lu, Z. “NCBI disease corpus: a resource for disease name recognition and concept normalization.” Journal of Biomedical Informatics 47 (2014): 1-10.
[5] Li, J. et al. “BioCreative V CDR task corpus: a resource for chemical disease relation extraction.” Database 2016 (2016): baw068.
978-1-5090-1610-5/16/$31.00 ©2016 IEEE