A Hybrid Approach to Extract and Classify Relation from Biomedical Text

Abdul Wahab Muzaffar1, Farooque Azam1, Usman Qamar1, Shumyla Rasheed Mir2 and Muhammad Latif1
1 Dept. of Computer Engineering, College of E&ME, National University of Sciences and Technology (NUST), H-12, Islamabad, Pakistan
2 Dept. of Computer Science and Software Engineering, University of Hail, Hail, Saudi Arabia

Abstract - Unstructured biomedical text is a key source of knowledge. Information extraction in the biomedical domain is a complex task due to the high volume of data. Manual efforts produce the best results; however, manual curation is nearly impossible at this scale. Thus, there is a need for tools and techniques that extract information from biomedical text automatically. Biomedical text contains relationships between entities that are important to practitioners and researchers, and relation extraction has therefore gained much importance in the biomedical domain. Most existing work relies on rule-based and machine learning techniques; recently the focus has shifted to hybrid approaches, which have shown better results. This research proposes a hybrid rule-based approach to classify relations between biomedical entities. The approach uses noun-verb based rules for the identification of noun and verb phrases, and then uses a support vector machine, a machine learning technique, to classify these relations. Our approach has been validated on a standard biomedical text corpus obtained from MEDLINE 2001, achieving an accuracy of 94.91%.

Keywords: Relation Extraction; NLP; SVM; Classification
1 Introduction
With the huge amount of information hidden in the biomedical field, in the form of publications that are growing exponentially, it is not possible for researchers to keep themselves updated with all developments in a specific field [1, 2]. The emphasis of biomedical research is shifting from individual entities to whole systems, creating a demand for extracting relationships between entities (e.g. protein-protein interactions, disease-gene associations) from biomedical text to produce knowledge [3, 4]. Manual efforts to transform unstructured text into structured form are laborious [5]; automatic techniques for relation extraction are a solution to this problem [6]. Numerous relation extraction techniques for biomedical text have been proposed [8-10]. These techniques can be categorized into four major areas, i.e. co-occurrence based, pattern-based, rule-based, and machine learning (ML) based approaches. Co-occurrence is the simplest technique; it identifies entities that co-occur in a sentence, abstract or document [11]. Pattern-based systems use a set of patterns to extract relations; these patterns can be manually defined or automatically generated. Manually defined patterns require domain experts, are time consuming to build, and have low recall [12]. To increase recall, automatically generated patterns can be used instead, obtained either by bootstrapping [13] or generated directly from corpora [14]. In rule-based systems, a set of rules is built to extract relations [15, 16]; the rules can likewise be manually defined or learned automatically from training data. When annotated biomedical corpora are available, machine learning based approaches become more effective and ubiquitous [17, 18]. Most approaches use supervised learning, in which the relation extraction task is modeled as a classification problem. Broadly, any relation extraction system consists of three generalized modules, i.e. text preprocessing, parsing and relation extraction. In this paper, we propose a hybrid approach to extract relations between diseases and treatments from biomedical text.
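To make the co-occurrence baseline concrete, the following is a minimal sketch (not part of the proposed method) that emits candidate disease-treatment pairs whenever the two entity types appear in the same sentence; the input structure and the entity annotations are hypothetical and shown only for illustration.

from itertools import product

# Hypothetical input: sentences with pre-annotated entity mentions.
sentences = [
    {"text": "Radiotherapy is an effective treatment for early glottic cancer.",
     "entities": {"TREATMENT": ["Radiotherapy"], "DISEASE": ["glottic cancer"]}},
    {"text": "The study enrolled patients with migraine.",
     "entities": {"TREATMENT": [], "DISEASE": ["migraine"]}},
]

def cooccurrence_candidates(sents):
    """Yield (treatment, disease, sentence) triples when both entity types co-occur."""
    for s in sents:
        for treat, dis in product(s["entities"]["TREATMENT"], s["entities"]["DISEASE"]):
            yield treat, dis, s["text"]

for pair in cooccurrence_candidates(sentences):
    print(pair)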
2 Related Work
Research proposed by [19] focused on a hybrid approach to discover semantic relations that occur between diseases and treatments. Cure, Prevent and Side Effect are the semantic relations considered for extraction between the entities (Disease, Treatment). The authors claimed better results compared to previous studies on this topic. The results show different figures for each of the three relations: accuracy for the Cure relation is 95%, the Prevent relation reaches 75% accuracy, and 46% accuracy is claimed for the Side Effect relation. The work in [20] primarily focused on extracting semantic relations from biomedical text. Relations are extracted between
disease and treatment entities only. The authors propose a hybrid approach that combines two different techniques to extract the semantic relations: in the first, relations are extracted by patterns based on human expertise, and in the second, a machine learning technique based on support vector machine classification is used. This hybrid approach mainly relies on manual patterns when few relation examples are available, and on feature-based learning when the number of available relation examples is sufficient. The authors claimed an overall F-measure of 94.07% for the cure, prevent and side effect relations.
3 Proposed Framework
The proposed work is a hybrid approach which uses rule-based and machine learning techniques for the extraction of semantic relations between disease and treatment entities from biomedical text. A high-level view of the proposed approach is given in Fig. 1.

Fig. 1: Proposed Relation Extraction Approach

The framework is divided into five major steps as follows:
• Biomedical Text Corpus
• Preprocessing Phase
• Feature Extraction
• Vector Representation and Class Label
• Relation Classification

The detailed flow of the proposed framework is given in Fig. 2.

Fig. 2: Detailed Proposed Relation Extraction Framework

3.1 Biomedical Text Corpus
We used the standard text corpus obtained from [21]. This corpus contains eight possible types of relationship between TREATMENT and DISEASE. The dataset was collected from MEDLINE 2001 abstracts, and the relations are annotated in sentences taken from titles and abstracts. Table I presents the original dataset as published in previous research, showing the relationships and the number of sentences for each.
TABLE I. ORIGINAL DATASET [21]

Sr #   Relationship                  No of Sentences
1      Cure / Treat for Dis          830
2      Prevent                       63
3      Side Effect                   30
4      Disonly                       629
5      Treatonly                     169
6      Vague                         37
7      To See                        75
8      No Cure / Treat No for Dis    4
9      None                          1818
       Total                         3655
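For the steps that follow it is convenient to hold the corpus as (relation, sentence) pairs. The sketch below assumes the annotated sentences have been exported to a hypothetical tab-separated file (relation label, then sentence text); the file name and format are assumptions for illustration, not the distribution format of [21].

import csv
from collections import Counter

def load_corpus(path="relations.tsv"):
    """Read a hypothetical TSV export: one 'relation<TAB>sentence' pair per line."""
    pairs = []
    with open(path, newline="", encoding="utf-8") as f:
        for relation, sentence in csv.reader(f, delimiter="\t"):
            pairs.append((relation.strip(), sentence.strip()))
    return pairs

# Example: count sentences per relation, as in Table I.
corpus = load_corpus()
print(Counter(rel for rel, _ in corpus))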
3.2 Preprocessing Phase
In this phase the following preprocessing steps have been carried out on the text corpora. We used GATE 7.1 [22, 23] for preprocessing the corpus, together with the ANNIE [23] plug-in of GATE. The following ANNIE modules are used in this research to tag the corpus for further use in the relation extraction phase. This phase is specifically aimed at extracting noun and verb phrases from the corpora in order to provide semantic features to the classifier.

3.2.1 ANNIE English Tokenizer in GATE
The tokenizer splits the whole text into small tokens, i.e. different types of words, punctuation and numbers [24].

3.2.2 ANNIE Sentence Splitter
The sentence splitter splits the text into sentences, which is required for the tagger. It uses a dictionary list of abbreviations to distinguish sentence-ending full stops from other uses of the full-stop token [25].

3.2.3 ANNIE POS Tagger
This module produces a part-of-speech tag as an annotation on each word or symbol in the text. Part-of-speech tags can be verbs, nouns, adverbs or adjectives. The tagger can be customized by changing the rule set given to it.

3.2.4 GATE Morphological Analyzer
The morphological analyzer takes a tokenized GATE document as input. Considering one token at a time together with its part-of-speech tag, it identifies the lemma and affix of the token. These values are then added as features on the Token annotation. The morpher is based on a set of regular expression rules [26]. This module is used to identify the common root of words in the text.
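Outside GATE, the same four preprocessing steps (tokenization, sentence splitting, POS tagging and lemmatization) can be approximated in a few lines. The sketch below uses NLTK purely as an illustration; it is not the ANNIE pipeline used in this work, and it assumes the required NLTK data packages ('punkt', 'averaged_perceptron_tagger', 'wordnet') are installed.

# Illustrative stand-in for the ANNIE preprocessing pipeline (not the GATE code used here).
import nltk
from nltk.stem import WordNetLemmatizer

def preprocess(text):
    lemmatizer = WordNetLemmatizer()
    # Map Penn Treebank tag prefixes to WordNet POS codes (n, v, a, r).
    wn_pos = {"J": "a", "R": "r", "V": "v", "N": "n"}
    sentences = []
    for sent in nltk.sent_tokenize(text):          # sentence splitter
        tokens = nltk.word_tokenize(sent)          # tokenizer
        tagged = nltk.pos_tag(tokens)              # POS tagger
        lemmas = [lemmatizer.lemmatize(tok.lower(), wn_pos.get(tag[0], "n"))
                  for tok, tag in tagged]          # morphological analysis (lemma/root)
        sentences.append(list(zip(tokens, [t for _, t in tagged], lemmas)))
    return sentences

print(preprocess("Radiotherapy was administered to prevent recurrence of the disease."))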
3.2.5 Noun-Verb-Noun Rules
We wrote rules and implemented them in JAPE [27]. These rules extract the noun and verb phrases from the text corpora. The first three rules extract noun phrases based on different criteria given in the existing literature, while the last rule extracts verb phrases. The rules are as under:
Rule: NP1
(
  ({Token.category == "DT"} | {Token.category == "PRP$"})?
  ({Token.category == "RB"} | {Token.category == "RBR"} | {Token.category == "RBS"})*
  ({Token.category == "JJ"} | {Token.category == "JJR"} | {Token.category == "JJS"})*
  ({Token.category == "NN"} | {Token.category == "NNS"})+
):nounPhrase
-->
:nounPhrase.NP = {kind="NP", rule=NP1}

This rule chunks a noun phrase whenever the chunker finds an optional determiner (DT) or possessive pronoun (PRP$), followed by zero or more adverbs (RB, RBR, RBS), followed by zero or more adjectives (JJ, JJR, JJS), followed by one or more singular or plural nouns (NN, NNS).

Rule: NP2
(
  ({Token.category == "NNP"})+
):nounPhrase
-->
:nounPhrase.NP = {kind="NP", rule=NP2}

This rule is a simple one, describing that one or more proper nouns are to be annotated as a noun phrase.

Rule: NP3
(
  ({Token.category == "DT"})?
  ({Token.category == "PRP"})
):nounPhrase
-->
:nounPhrase.NP = {kind="NP", rule=NP3}

This rule describes that a personal pronoun (PRP) can be optionally preceded by a determiner (DT).

Rule: NounandVerb
(
  ({NP})?
  ({Token.category == "VB"} | {Token.category == "VBD"} | {Token.category == "VBG"} | {Token.category == "VBN"} | {Token.category == "VBZ"})+
  ({NP})?
):NVP
-->
:NVP.VP = {kind="VP", rule=VP}

This rule chunks all the verb phrases on the basis of one or more verb forms (VB, VBD, VBG, VBN, VBZ), optionally surrounded by noun phrases.
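For experimentation outside GATE, roughly equivalent chunking can be expressed with NLTK's RegexpParser. This is an illustrative sketch that mirrors the JAPE rules above; it is not the JAPE grammar itself, and it assumes the NLTK tokenizer and tagger data are installed.

import nltk

# Tag patterns loosely mirroring NP1-NP3 and the NounandVerb rule above.
grammar = r"""
  NP: {<DT|PRP\$>?<RB|RBR|RBS>*<JJ|JJR|JJS>*<NN|NNS>+}    # NP1
      {<NNP>+}                                            # NP2
      {<DT>?<PRP>}                                        # NP3
  VP: {<NP>?<VB|VBD|VBG|VBN|VBZ>+<NP>?}                   # NounandVerb
"""
chunker = nltk.RegexpParser(grammar)

tagged = nltk.pos_tag(nltk.word_tokenize(
    "Radiotherapy was administered to prevent recurrence of the disease."))
tree = chunker.parse(tagged)

for subtree in tree.subtrees(lambda t: t.label() in ("NP", "VP")):
    print(subtree.label(), " ".join(tok for tok, _ in subtree.leaves()))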
With the use of the above rules we chunked all the verb phrases and noun phrases in the text. Table II lists the ten most frequent phrases of each kind.

TABLE II. TOP TEN VP AND NP

S/No   VP                    NP
1      be treat              the treatment
2      be perform            the use
3      to evaluate           the effectiveness
4      be administer         a combination
5      to prevent            surgical management
6      to moderate           this retrospective study
7      to determine          the risk
8      should be consider    successful treatment
9      to examine            the purpose
10     be to compare         the efficacy and safety

3.3 Feature Extraction
A feature is anything that can be determined as being either present or absent in an item [34]. Feature extraction tries to find new data rows that can be used in combination to reconstruct rows of the original dataset; rather than belonging to one cluster, each row is created from a combination of the features [28].

3.3.1 Unigram Features on the Basis of POS
We consider unigram features on the basis of part of speech (POS). A unigram consists of a single word labeled with a single attribute; bag-of-word features and non-conjunctive entity attribute features are unigrams [29]. A vocabulary is first built at this stage to refine the unigram features. For unigram feature extraction, we build the vocabulary from the words of four main POS groups, i.e. adjectives (a), adverbs (r), verbs (v) and nouns (n). Instead of the token string of each word, its term root is used, obtained with the morphological analyzer; by considering these term roots the vocabulary size is much reduced. For simplicity and later comparison, we grouped the tagger POS designations into the following equivalent POS classes.

TABLE III. TAGGER POS CONVERSION INTO SIMPLE POS

POS   Tagger's POS
a     JJ, JJR, JJS, JJSS
r     RB, RBR, RBS
v     VB, VBD, VBG, VBN, VBP, VBZ, MD
n     NN, NNP, NNS, NNPS

3.3.2 Filtering on the Basis of Word Size
Words with a length of two characters or less are normally stop words or erroneous tokens, such as "to", "be" and "'s", and create noise in the classification task. We therefore keep only words longer than two characters in our unigram vocabulary.

3.3.3 Phrase Features
As the literature shows that noun phrases and verb phrases are very important for relation classification, we consider these phrasal features along with the unigrams. The verb and noun phrase chunker separates these phrases in the text based on the rules described above, and we build a separate vocabulary for these phrasal features.

3.3.4 Frequency Based Filtering
To obtain the final feature set, both the unigram and phrasal vocabularies are filtered by corpus frequency with a threshold, so that only the important terms are selected as features:

Feature frequency > 3

Table IV shows the total number of features for unigrams, verb phrases and noun phrases in each setup.

TABLE IV. TOTAL NUMBER OF FEATURES IN EACH SETUP

Feature Distribution   Unigram   VP   NP    Total
Setup # 1              1267      31   101   1399
Setup # 2              673       8    37    718
Setup # 3              624       7    36    667
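The vocabulary construction of Sections 3.3.1-3.3.4 can be summarized in a short sketch. The POS grouping, word-size filter (> 2 characters) and frequency filter (> 3) follow the description above, while the input format of (lemma, Penn Treebank tag) pairs is an assumption made for illustration.

from collections import Counter

POS_GROUPS = {  # Table III: Penn Treebank tags grouped into simple POS classes
    "a": {"JJ", "JJR", "JJS", "JJSS"},
    "r": {"RB", "RBR", "RBS"},
    "v": {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "MD"},
    "n": {"NN", "NNP", "NNS", "NNPS"},
}
KEEP_TAGS = set().union(*POS_GROUPS.values())

def build_vocabularies(tagged_corpus, phrases, min_freq=3, min_len=2):
    """tagged_corpus: list of sentences as (lemma, penn_tag) pairs;
    phrases: list of chunked NP/VP strings from the rules above."""
    unigram_counts = Counter(
        lemma.lower()
        for sentence in tagged_corpus
        for lemma, tag in sentence
        if tag in KEEP_TAGS and len(lemma) > min_len       # Sections 3.3.1 and 3.3.2
    )
    phrase_counts = Counter(p.lower() for p in phrases)    # Section 3.3.3
    unigrams = {w for w, c in unigram_counts.items() if c > min_freq}      # Section 3.3.4
    phrase_vocab = {p for p, c in phrase_counts.items() if c > min_freq}
    return unigrams, phrase_vocab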
3.4 Vector Representation and Class Label
This phase mainly focuses on the conversion of text into vectors so that it can later be used for classification. A vector is represented using the features, and one important decision at this stage is the weight given to each feature, since feature weighting affects classification performance, as shown in the literature. The following weighting schemes are commonly used in the literature.

3.4.1 Term Frequency
Term frequency counts how many times a feature actually occurred in the text. It is easy to formulate, but suffers from limited classification performance.

3.4.2 Term Presence
Term presence is a binary representation recording whether or not a feature occurred in a piece of text. Term presence has the following advantages over the other weighting schemes:
• easy formulation of a document or text vector;
• less processing and computation in machine learning tasks;
• better classification results, as evidenced in the literature.

3.4.3 TF-IDF (Term Frequency - Inverse Document Frequency)
This weighting scheme has gained importance for machine learning tasks in recent years. It is the relative weight of a feature combined with its inverse document frequency, and it requires more calculation to form a vector.
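The three weighting schemes can be contrasted in a small sketch. It is purely illustrative (toy vocabulary, plain logarithmic IDF) and not the exact formulation used in the experiments.

import math

def vectorize(doc_tokens, vocabulary, all_docs, scheme="presence"):
    """Return one weight per vocabulary term for a single tokenized document."""
    n_docs = len(all_docs)
    weights = []
    for term in vocabulary:
        tf = doc_tokens.count(term)                       # raw term frequency
        if scheme == "frequency":
            weights.append(tf)
        elif scheme == "presence":
            weights.append(1 if tf > 0 else 0)            # binary term presence
        elif scheme == "tfidf":
            df = sum(1 for d in all_docs if term in d)    # document frequency
            idf = math.log(n_docs / df) if df else 0.0
            weights.append(tf * idf)
    return weights

docs = [["be", "treat", "the", "treatment"], ["to", "prevent", "the", "risk"]]
vocab = ["treat", "prevent", "risk"]
for scheme in ("frequency", "presence", "tfidf"):
    print(scheme, [vectorize(d, vocab, docs, scheme) for d in docs])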
3.4.4 Class Labels
The following class labels are used with the vector representation for each setup in the multi-class and two-class classification experiments (Table V).
TABLE V. CLASS LABELS IN EACH SETUP

Setups                      Class Label: +1              Class Label: -1        Class Label: 0
3-Class Classification
  Setup # 1                 Cure                         Disonly + Treatonly    Vague
  Setup # 2                 Prevent                      Disonly + Treatonly    Vague
  Setup # 3                 Side Effect                  Disonly + Treatonly    Vague
2-Class Classification
  Setup # 4                 Cure                         Disonly + Treatonly    N/A
  Setup # 5                 Prevent                      Disonly + Treatonly    N/A
  Setup # 6                 Side Effect                  Disonly + Treatonly    N/A
  Setup # 7                 Cure, Prevent, Side Effect   -                      -
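A relation label can be mapped to the SVM target for a given setup with a small helper. The mapping below covers only the three-class setups 1-3 as reconstructed in Table V and is illustrative rather than the exact experimental code.

# Illustrative mapping of annotated relations to SVM targets (3-class setups 1-3 of Table V).
SETUPS = {
    1: {"positive": "Cure"},
    2: {"positive": "Prevent"},
    3: {"positive": "Side Effect"},
}

def label_for(relation, setup):
    positive = SETUPS[setup]["positive"]
    if relation == positive:
        return +1
    if relation in ("Disonly", "Treatonly"):
        return -1
    if relation == "Vague":
        return 0
    return None   # sentence not used in this setup

print([label_for(r, 1) for r in ("Cure", "Disonly", "Vague", "Prevent")])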
3.5 Relation Classification
This phase focuses on the classification of the relations that exist between disease and treatment entities in the text corpora. We used SVM to classify the relations in the dataset; our main focus was to extract three main relations, i.e. cure, prevent and side effect.

3.5.1 Classifier: Support Vector Machine
Relation extraction is a text classification problem, and we used support vector machines (SVM) because SVMs have already been shown to yield high accuracy on related tasks such as text categorization [30]. SVMs have also been widely used in the protein-protein interaction (PPI) extraction task and have shown competitive results against other learning methods [31, 32]. For our experiments we used LIBSVM [33], an integrated software package for support vector classification that supports multi-class classification. We used LIBSVM in both settings, i.e. with a linear kernel and with a Radial Basis Function (RBF) kernel. For the linear kernel the best results are obtained at c = 0.5 with the other parameters at their default settings; for the RBF kernel the best results are obtained at g = 0.05 and c = 8.
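As a rough stand-in for the LIBSVM command-line setup, the same classifiers can be instantiated through scikit-learn's SVC, which wraps LIBSVM. The parameter values below are those reported above; the feature matrix X and labels y are assumed to come from the vectorization step and are not defined here.

from sklearn.svm import SVC

# Linear kernel: best results reported at c = 0.5 (other parameters default).
linear_svm = SVC(kernel="linear", C=0.5)

# RBF kernel: best results reported at g = 0.05 and c = 8.
rbf_svm = SVC(kernel="rbf", gamma=0.05, C=8)

# X: feature vectors from Section 3.4, y: class labels from Table V (assumed available).
# linear_svm.fit(X, y); predictions = rbf_svm.fit(X, y).predict(X)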
4 Results and Discussion

4.1 10-fold Cross Validation
The model reported here combines unigram, verb-phrase and noun-phrase features with an SVM classifier. We use cross validation because there is a limited amount of data for training and testing: the roles of the data are swapped, so that data once used for training is later used for testing and vice versa. 10-fold means that we split the data into 10 equal partitions and repeat the process 10 times, ensuring that each partition is used for testing exactly once, i.e. 10% of the data for testing and 90% for training in each iteration. The performance results of the 10 iterations are averaged to obtain the final results.
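A minimal sketch of the 10-fold procedure, assuming the feature matrix X and labels y from the previous steps; stratified folds are used here as a reasonable choice, though the paper does not state whether the folds were stratified.

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

def ten_fold_accuracy(X, y):
    """Average accuracy over 10 folds; each partition serves as the test set once."""
    clf = SVC(kernel="rbf", gamma=0.05, C=8)
    folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_score(clf, X, y, cv=folds, scoring="accuracy")
    return scores.mean()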
Table VI presents a comparison of the accuracy obtained in previous work by [19] and by our proposed approach.

TABLE VI. CLASSIFICATION ACCURACY COMPARISON WITH PREVIOUS APPROACH

Relation      Previous Approach Accuracy, Oana and Diana [19]   Proposed Approach Accuracy (RBF)
Cure          95%                                               94.91%
Prevent       75%                                               92.05%
Side Effect   46%                                               72.33%

As the table shows, our technique gives a major improvement over the previous results in setup 2 and setup 3, and performs about the same in setup 1. Our approach is also very consistent across the three setups: accuracy in setup 1 is almost the same as that of the previous approach, while setups 2 and 3 improve on it by roughly 17 and 26 percentage points, respectively.

4.2 Discussion
Table VI shows the comparison between the previous techniques and our approach on the same data set, produced by [21]. In the current settings our approach gives the best accuracy. Oana Frunza and Diana Inkpen [19] claim even better results for these relations using UMLS. Our results differ in two aspects. Firstly, our assumption was to extract multiple relations per sentence, as done by Asma Ben Abacha and Pierre Zweigenbaum [20], while the objective of the authors in [19] was relation detection under a one-relation-per-sentence assumption. Secondly, our contribution is a hybrid approach which combines a rule-based approach with machine learning. Rule-based techniques have high precision but low recall, while machine learning has low precision when few training examples are available. To increase the recall of a rule-based system, machine learning approaches can be combined with it; to increase the precision of a machine learning approach when the training set is small, rule-based features can be used.
5 Conclusion and Future Work
This research focuses on a hybrid framework for extracting relations between medical entities in text documents/corpora. We mainly investigated the extraction of semantic relations between treatments and diseases. The proposed approach
relies on (i) a rule-based technique and (ii) a supervised learning method with an SVM classifier using a set of lexical and semantic features. We experimented with this approach and compared it with the previous approach [19]. The results obtained by our approach show that it considerably outperforms the previous techniques and provides a good alternative for enhancing the accuracy of relation extraction in the biomedical domain when few training examples are available.
In the future we plan to test our approach with other types of relations and different corpora; we will also work on a multi-stage classifier to enhance the performance of relation extraction.

6 References
[1] Jensen LJ, Saric J, Bork P: Literature mining for the biologist: from information retrieval to biological discovery. Nature Reviews 2006, 7:119-129.
[2] Ananiadou S, Kell DB, Tsujii J-i: Text mining and its potential applications in systems biology. Trends in Biotechnology 2006, 24:571-9.
[3] Chapman WW, Cohen KB: Current issues in biomedical text mining and natural language processing. Journal of Biomedical Informatics 2009, 42:757-9.
[4] Zweigenbaum P, Demner-Fushman D, Yu H, Cohen KB: Frontiers of biomedical text mining: current progress. Briefings in Bioinformatics 2007, 8:358-75.
[5] Cusick ME, Yu H, Smolyar A, Venkatesan K, Carvunis A-R, Simonis N, Rual J-F, Borick H, Braun P, Dreze M, Vandenhaute J, Galli M, Yazaki J, Hill DE, Ecker JR, Roth FP, Vidal M: Literature-curated protein interaction datasets. Nature Methods 2009, 6:39-46.
[6] Erhardt RA-A, Schneider R, Blaschke C: Status of text-mining techniques applied to biomedical text. Drug Discovery Today 2006, 11:315-25.
[7] Zhou D, He Y: Extracting interactions between proteins from the literature. Journal of Biomedical Informatics 2008, 41:393-407.
[8] Airola A, Pyysalo S, Björne J, Pahikkala T, Ginter F, Salakoski T: All-paths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus learning. BMC Bioinformatics 2008, 9 Suppl 11:S2.
[9] Kilicoglu H, Bergler S: Adapting a general semantic interpretation approach to biological event extraction. In Proceedings of the BioNLP Shared Task 2011 Workshop. 2011:173-182.
[10] Baumgartner WA, Cohen KB, Hunter L: An open-source framework for large-scale, flexible evaluation of biomedical text mining systems. Journal of Biomedical Discovery and Collaboration 2008, 3:1.
[11] Garten Y, Coulet A, Altman RB: Recent progress in automatically extracting information from the pharmacogenomic literature. Pharmacogenomics 2010, 11:1467-89.
[12] Hakenberg J: Mining Relations from the Biomedical Literature. PhD thesis, 2009:179.
[13] Wang H-C, Chen Y-H, Kao H-Y, Tsai S-J: Inference of transcriptional regulatory network by bootstrapping patterns. Bioinformatics 2011, 27:1422-8.
[14] Liu H, Komandur R, Verspoor K: From graphs to events: a subgraph matching approach for information extraction from biomedical text. In Proceedings of the BioNLP Shared Task 2011 Workshop. 2011:164-172.
[15] Jang H, Lim J, Lim J-H, Park S-J, Lee K-C, Park S-H: Finding the evidence for protein-protein interactions from PubMed abstracts. Bioinformatics 2006, 22:e220-6.
[16] Ono T, Hishigaki H, Tanigami A, Takagi T: Automated extraction of information on protein-protein interactions from the biological literature. Bioinformatics 2001, 17:155-61.
[17] Kim J-J, Zhang Z, Park JC, Ng S-K: BioContrasts: extracting and exploiting protein-protein contrastive relations from biomedical literature. Bioinformatics 2006, 22:597-605.
[18] Giuliano C, Lavelli A, Romano L: Exploiting shallow linguistic information for relation extraction from biomedical literature. In EACL 2006. 2006:401-408.
[19] Frunza O, Inkpen D: Extraction of disease-treatment semantic relations from biomedical sentences. In Proceedings of the 2010 Workshop on Biomedical Natural Language Processing, ACL 2010. Uppsala, Sweden; 2010:91-98.
[20] Ben Abacha A, Zweigenbaum P: A hybrid approach for the extraction of semantic relations from MEDLINE abstracts. In CICLing 2011, Part II, LNCS 6609. Springer-Verlag Berlin Heidelberg; 2011:139-150.
[21] Rosario B, Hearst MA: Classifying semantic relations in bioscience texts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics. 2004:430.
[22] Cunningham H, Tablan V, Roberts A, Bontcheva K: Getting more out of biomedical documents with GATE's full lifecycle open source text analytics. PLoS Computational Biology 2013, 9(2):e1002854. doi:10.1371/journal.pcbi.1002854. http://tinyurl.com/gate-life-sci/
[23] Cunningham H, et al.: Text Processing with GATE (Version 6). University of Sheffield, Department of Computer Science; 2011. ISBN 0956599311.
[24] https://gate.ac.uk/sale/tao/splitch6.html#x9-1300006.2
[25] https://gate.ac.uk/sale/tao/splitch6.html#x9-1400006.4
[26] https://gate.ac.uk/sale/tao/splitch23.html#x2855200023.12
[27] Cunningham H, Maynard D, Tablan V: JAPE: a Java Annotation Patterns Engine (Second Edition). Technical Report CS-00-10, University of Sheffield, Department of Computer Science; 2000.
[28] Wu L, Neskovic P: Feature extraction for EEG classification: representing electrode outputs as a Markov stochastic process. In European Symposium on Artificial Neural Networks. 2007:567-572.
[29] Jiang J, Zhai C: A systematic exploration of the feature space for relation extraction. In Proceedings of NAACL/HLT. Association for Computational Linguistics; 2007:113-120.
[30] Joachims T: Text categorization with support vector machines: learning with many relevant features. In Proceedings of ECML-98, 10th European Conference on Machine Learning. Springer; 1998:137-142.
[31] Miwa M, Sætre R, Miyao Y, Tsujii J: A rich feature vector for protein-protein interaction extraction from multiple corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics; 2009:121-130.
[32] Kim S, Yoon J, Yang J, Park S: Walk-weighted subsequence kernels for protein-protein interaction extraction. BMC Bioinformatics 2010, 11:107.
[33] Chang C-C, Lin C-J: LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2011, 2(3):27:1-27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.