Effective Adaptation of an HMM-based NER for the Biomedical Domain
Fu Yu
Outline
- Motivation
- Model
- Evaluation
- Tools
- Conclusions
- References
Motivation
- Research in the biomedical domain has grown rapidly
- Better indexing of biomedical articles
- Assisting in relation extraction and term recognition
What do we want?
- Named entities: gene names, protein names, etc.
  - e.g. "p53 protein suppresses mdm2 expression"
  - Named entity recognition labels "p53 protein" as a protein and "mdm2" as a gene
- However, many NER systems do not perform well in the biomedical domain...
Difficulties for Biomedical NER
- Biomedical NEs may be very long
  - e.g. "47 kDa sterol regulatory element binding factor"
- Modifiers often appear before basic NEs
  - e.g. "activated B cell lines"
- Complicated constructions: two or more NEs may share one head noun
  - e.g. "91 and 84 kDa proteins"
- Irregular naming conventions: an entity may be found with various spelling forms
  - e.g. NF-kappaB, NFKappaB, NF-kappa B, (NF)-kappaB, etc.
Difficulties for Biomedical NER (2)
- Cascaded named entities: one NE may be embedded in another NE
  - e.g. "kappa 3 binding factor"
- Abbreviations of named entities
  - e.g. TCEd, IFN, TPA, etc.
- So we can conclude that...
Difficulties for Biomedical NER (3)
- NER for biomedical text is difficult!
Demo of NER in the Biomedical Domain
Model
- HMM-based NER
- Rich Feature Set
- Abbreviation Recognition
- Solution for Cascaded NEs
HMM-based NER
- Hidden Markov models (HMMs), a "powerful tool for representing sequential data," have been successfully applied to:
  - Part-of-speech tagging
    - e.g. "He books tickets"
  - Named entity recognition
    - e.g. "Mips Vice President John Hime"
  - Information extraction
    - e.g. "After lunch meet under the oak tree"
HMM-based NER (2)
- The purpose of the HMM is to find, for a given token sequence O^n = o_1 o_2 ... o_n, the most likely tag sequence C^n = c_1 c_2 ... c_n, i.e. the one that maximizes P(C^n | O^n)
  - In the token sequence O^n, each token is defined as o_i = <f_i, w_i>, where w_i is the word and f_i is the feature set associated with the word
  - In the tag sequence C^n, each tag consists of three parts:
    - Boundary category, which denotes the position of the current word within the NE
    - Entity category, which indicates the NE class
    - Feature set, which will be introduced later
HMM-based NER (3)
- Given a sequence of tokens (observations), "... p53 protein suppresses mdm2 expression ...", and a trained HMM with states such as "protein name", "gene name", and "other":
- Find the most likely state sequence with the Viterbi algorithm: argmax_{C^n} P(C^n | O^n)
- Any words generated by the designated "protein name" state are extracted as a protein name:
  - Protein name: "p53 protein"
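To make the decoding step concrete, here is a minimal Viterbi sketch in Python. It is illustrative only: the states, start/transition/emission probabilities, and the unknown-word floor of 1e-6 are invented toy values, not the trained model or feature-rich tokens of the actual system.

from math import log

states = ["protein", "gene", "other"]

# Hypothetical toy parameters; a real model estimates these from a training corpus.
start_p = {"protein": 0.2, "gene": 0.2, "other": 0.6}
trans_p = {s: {t: (0.6 if s == t else 0.2) for t in states} for s in states}
emit_p = {
    "protein": {"p53": 0.3, "protein": 0.4, "suppresses": 0.01, "mdm2": 0.05, "expression": 0.05},
    "gene":    {"p53": 0.1, "protein": 0.05, "suppresses": 0.01, "mdm2": 0.4, "expression": 0.05},
    "other":   {"p53": 0.01, "protein": 0.05, "suppresses": 0.5, "mdm2": 0.01, "expression": 0.4},
}

def viterbi(tokens):
    """Return the most likely state sequence for the observed tokens."""
    # v[i][s] = best log-probability of any path ending in state s at position i
    v = [{s: log(start_p[s]) + log(emit_p[s].get(tokens[0], 1e-6)) for s in states}]
    back = [{}]
    for i in range(1, len(tokens)):
        v.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: v[i - 1][p] + log(trans_p[p][s]))
            v[i][s] = v[i - 1][prev] + log(trans_p[prev][s]) + log(emit_p[s].get(tokens[i], 1e-6))
            back[i][s] = prev
    # Trace the best path back from the final position.
    path = [max(states, key=lambda s: v[-1][s])]
    for i in range(len(tokens) - 1, 0, -1):
        path.insert(0, back[i][path[0]])
    return path

print(viterbi("p53 protein suppresses mdm2 expression".split()))
# -> ['protein', 'protein', 'other', 'gene', 'other'] with these toy numbers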
Adaptation of an HMM-based NER
- Use a linearly interpolated HMM (in order to reduce the effect of data sparseness)
  - The initial name class probability is calculated with a function f(), estimated with maximum-likelihood counts from the training corpus
  - For other words and name classes, the event counts T() are obtained from the training corpus
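As a rough illustration of the general idea only (a word bigram model with hypothetical weights, not the paper's interpolation of HMM sub-models), a linearly interpolated estimate mixes maximum-likelihood estimates of different orders so that unseen events still receive non-zero probability:

from collections import Counter

def interpolated_bigram(corpus, lambdas=(0.7, 0.25, 0.05)):
    """Mix bigram, unigram, and uniform maximum-likelihood estimates."""
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    vocab_size = len(unigrams)
    total = len(corpus)

    def prob(prev, word):
        p_bi = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
        p_uni = unigrams[word] / total
        p_uniform = 1.0 / vocab_size
        l1, l2, l3 = lambdas
        return l1 * p_bi + l2 * p_uni + l3 * p_uniform

    return prob

tokens = "p53 protein suppresses mdm2 expression".split()
p = interpolated_bigram(tokens)
print(p("p53", "protein"))   # seen bigram: high probability
print(p("protein", "mdm2"))  # unseen bigram: still non-zero thanks to smoothing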
Rich Feature Set
- Simple Deterministic Features (Fsd)
- Morphological Features (Fm)
- Part-of-Speech Features (Fpos)
- Semantic Trigger Features
  - Head Noun Triggers (Fhnt)
  - Special Verb Triggers (Fsvt)
Simple Deterministic Features
- These features capture capitalization, digitalization, and word-formation information
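As a minimal sketch of such features (the class names below are hypothetical, not the system's actual feature inventory), a deterministic feature maps each token to a coarse class based on its surface form:

import re

def simple_deterministic_feature(word):
    """Map a token to a coarse surface-form class (hypothetical class names)."""
    if re.fullmatch(r"\d+", word):
        return "AllDigits"
    if re.fullmatch(r"[A-Z]+", word):
        return "AllCaps"
    if re.fullmatch(r"[A-Z][a-z]+", word):
        return "InitialCap"
    if re.search(r"\d", word) and re.search(r"[A-Za-z]", word):
        return "AlphaDigitMix"   # e.g. "p53", "LTB4"
    if "-" in word:
        return "HasHyphen"       # e.g. "NF-kappaB"
    return "Other"

print(simple_deterministic_feature("p53"))   # AlphaDigitMix
print(simple_deterministic_feature("IFN"))   # AllCaps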
Morphological Features
- Morphological information, such as prefixes/suffixes, is an important cue for named entity recognition
- A statistical method is used to obtain the most frequent candidates; each candidate i is evaluated by a weight Wt_i based on:
  - #IN_i: the number of times prefix/suffix i occurs within NEs
  - #OUT_i: the number of times prefix/suffix i occurs outside of NEs
  - N_i: the total number of occurrences of prefix/suffix i
- The candidates with Wt above a certain threshold (0.7 in our experiments) are selected
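A plausible form of the weight, given the quantities above, is the in-NE fraction Wt_i = #IN_i / N_i; this is an assumption rather than the paper's exact formula. The sketch below computes that assumed quantity and applies the 0.7 threshold (the affix counts are invented for illustration):

def select_affixes(affix_counts, threshold=0.7):
    """Select prefixes/suffixes whose assumed weight Wt_i = #IN_i / N_i exceeds the threshold.

    affix_counts maps each affix to (occurrences inside NEs, occurrences outside NEs).
    """
    selected = {}
    for affix, (inside, outside) in affix_counts.items():
        total = inside + outside          # N_i
        weight = inside / total if total else 0.0
        if weight > threshold:
            selected[affix] = weight
    return selected

# Invented counts, purely for illustration.
counts = {"-ase": (90, 10), "-in": (60, 40), "anti-": (75, 25)}
print(select_affixes(counts))   # {'-ase': 0.9, 'anti-': 0.75}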
Morphological Features (2)
- Calculate the frequency of each prefix/suffix in each NE class
- Group prefixes/suffixes that have a similar distribution among NE classes into one feature
- In total, 37 prefixes and suffixes were selected and grouped into 23 features
Part-of-Speech Features
- In this NER system, each word is assigned a POS feature by the tagger
- Helpful for identifying NE boundaries
Semantic Trigger Features
- Trigger words are key words inside or outside of named entities
- Head Noun Triggers
  - The head noun is the main noun or noun phrase of a compound term and describes its function or property
    - e.g. "B cells" is the head noun of the NE "activated human B cells"
  - The head noun is an important factor for distinguishing NE classes
    - e.g. "IFN-gamma treatment" vs. "IFN-gamma activation sequence"
  - No matter how many similar expressions occur within entities, entity classes are normally determined by their head nouns
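A minimal heuristic sketch of extracting a head noun from an entity (the trailing-noun rule is an assumption made for illustration, not the paper's method):

def head_noun(entity_tokens, pos_tags):
    """Heuristic: take the longest trailing run of noun tokens as the head noun."""
    head = []
    for word, pos in zip(reversed(entity_tokens), reversed(pos_tags)):
        if pos.startswith("NN"):      # NN, NNS, NNP, ...
            head.insert(0, word)
        else:
            break
    return " ".join(head)

print(head_noun(["activated", "human", "B", "cells"],
                ["VBN", "JJ", "NN", "NNS"]))   # -> "B cells"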
Semantic Trigger Features (2)
- Special Verb Triggers
  - Some frequently occurring verbs adjacent to NEs are useful for extracting interactions between NEs
    - e.g. the verb "bind" is often used to indicate an interaction between proteins
  - In this system, the 20 most frequent verbs adjacent to NEs are selected
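A minimal sketch of how such trigger verbs could be collected from an annotated corpus; the (word, POS tag, BIO NE tag) input format and the adjacency test are assumptions made for illustration, not the authors' pipeline:

from collections import Counter

def collect_verb_triggers(sentences, top_n=20):
    """Count verbs that occur immediately before or after an NE span."""
    counts = Counter()
    for sent in sentences:
        for i, (word, pos, ne) in enumerate(sent):
            if not pos.startswith("VB"):
                continue
            prev_ne = sent[i - 1][2] if i > 0 else "O"
            next_ne = sent[i + 1][2] if i + 1 < len(sent) else "O"
            if prev_ne != "O" or next_ne != "O":   # verb is adjacent to an NE
                counts[word.lower()] += 1
    return [verb for verb, _ in counts.most_common(top_n)]

example = [[("IL-2", "NN", "B-protein"), ("binds", "VBZ", "O"),
            ("its", "PRP$", "O"), ("receptor", "NN", "B-protein")]]
print(collect_verb_triggers(example))   # -> ['binds']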
Abbreviation Recognition
- Abbreviations are widely used in biomedical text
- Map each abbreviation to its full form
  - The full form provides more evidence for identification
  - A recognized abbreviation can help with forthcoming occurrences of the same abbreviation
- In practice, an abbreviation and its full form often appear together, with brackets, when they first occur in a biomedical document:
  - full form (abbreviation)
  - abbreviation (full form)
Abbreviation Recognition (2)
- The use of brackets also makes the annotation of NEs more complicated
  - Sometimes the abbreviation and its full form are annotated separately
    - e.g. "human mononuclear leukocytes (hMNL)"
  - Sometimes they are both embedded in a larger entity
    - e.g. "leukotriene B4 (LTB4) generation"
  - Brackets need to be treated specially!
Abbreviation Recognition (3)
- Process:
  1. Remove the bracketed abbreviation from the sentence
  2. Apply the HMM-based recognizer
  3. Restore the brackets
  4. Identify the abbreviation
- e.g. "human mononuclear leukocytes (hMNL)" → "human mononuclear leukocytes" → recognize → restore "(hMNL)" and identify the abbreviation
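A minimal sketch of this bracket-handling process; the regular expression, the recognize callback, and the rule that an abbreviation expands the entity it follows are simplifying assumptions, with the callback standing in for the HMM-based recognizer:

import re

# Matches a parenthesized short form such as "(hMNL)"; a simplification of
# real abbreviation heuristics.
ABBR = re.compile(r"\s*\(([A-Za-z0-9-]{1,10})\)")

def recognize_with_abbreviations(sentence, recognize):
    """Remove bracketed abbreviations, run the NE recognizer on the rest,
    then map each abbreviation to the entity that precedes it."""
    abbrs = ABBR.findall(sentence)
    stripped = ABBR.sub("", sentence)        # 1. remove "(abbr)" spans
    entities = recognize(stripped)           # 2. apply the NE recognizer
    mapping = {}
    for abbr in abbrs:                       # 3./4. restore and identify
        for ent in entities:
            if sentence.find(ent) < sentence.find("(" + abbr + ")"):
                mapping[abbr] = ent          # assume abbr expands the preceding entity
    return entities, mapping

# Toy recognizer standing in for the trained HMM.
toy = lambda s: ["human mononuclear leukocytes"] if "leukocytes" in s else []
print(recognize_with_abbreviations(
    "human mononuclear leukocytes (hMNL) were isolated", toy))
# -> (['human mononuclear leukocytes'], {'hMNL': 'human mononuclear leukocytes'})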
Solution for Cascaded NEs
- 16.57% of NEs in the corpus are annotated as cascaded (one NE embedded in another)
  - e.g. "CIITA mRNA"
- Head noun features are still effective to some extent
  - e.g. in "IgG Fc receptor type IC gene", "receptor" is the head noun of the embedded protein and "gene" is the head noun of the DNA entity; in general, the latter head noun is more important
- Unfortunately, in practice the shorter (embedded) NE is more likely to be identified
  - e.g. "IgG Fc receptor" within "IgG Fc receptor type IC gene"
Solution for Cascaded NEs (2)
- However, people currently care more about the longest named entities
  - The longest NEs are more likely to be the subjects that people want to study
  - The longest NEs keep all the information about the embedded NEs
- A rule-based method is used to solve this problem
Solution for Cascaded NEs (3)
- Four basic types of cascaded NEs, where <NE> denotes an embedded named entity:
  1. <NE> + head noun
  2. <NE> + modifier
  3. <NE> + <NE>
  4. <NE> + word + <NE>
- These cascaded NEs may be generated iteratively
  - e.g. <NE> + modifier + head noun, <NE> + head noun, ...
Solution for Cascaded NEs (4)
- Collect the four basic patterns of cascaded NEs
- Extend the patterns by combining the basic ones iteratively
  - e.g. <NE> + modifier + head noun → NE class; <NE> + modifier → NE class
- 102 rules are incorporated to classify the cascaded NEs
Solution for Cascaded NEs (5)
- Applying the rules in the system:
  - Input: "...a Myc-associated zinc finger protein binding site is one of...", where "Myc-associated zinc finger protein" has been recognized as a protein
  - The system finds that it matches the rule: <protein> + binding + modifier → <DNA>
  - The final result: "...a Myc-associated zinc finger protein binding site is one of...", with the whole phrase "Myc-associated zinc finger protein binding site" recognized as a DNA entity
  - The algorithm is applied iteratively until no new match is found
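A minimal sketch of applying such rules iteratively over a sentence with inline entity tags; the tag markup, the rule format, and the single example rule are assumptions made for illustration and stand in for the 102 rules mentioned above:

import re

# Each rule rewrites an already-recognized entity plus its surrounding words
# into a new, longer entity of a different class.
RULES = [
    # <protein> + "binding" + modifier noun  ->  <DNA>
    (re.compile(r"<protein>(.+?)</protein> binding (site|motif|element)"),
     r"<DNA>\1 binding \2</DNA>"),
]

def apply_cascading_rules(tagged_sentence):
    """Repeatedly apply the rules until no rule produces a new match."""
    changed = True
    while changed:
        changed = False
        for pattern, replacement in RULES:
            new = pattern.sub(replacement, tagged_sentence)
            if new != tagged_sentence:
                tagged_sentence = new
                changed = True
    return tagged_sentence

sentence = "a <protein>Myc-associated zinc finger protein</protein> binding site is one of"
print(apply_cascading_rules(sentence))
# -> "a <DNA>Myc-associated zinc finger protein binding site</DNA> is one of"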
Evaluation of Features
- Adapt an HMM-based NER system using different combinations of features
- The present or past participle form of some special verbs often acts like an adjective inside biomedical NEs
  - e.g. "IL10-inhibited lymphocytes"
Evaluation of Solutions for Abbreviations & Cascaded NEs
- The abbreviation recognition method and the rule-based method both improve performance
Some Other Tools
- ABNER: extracts protein, DNA, RNA, cell line, and cell type names
- Yagi: extracts only gene names; a sibling tool of ABNER
Demo of ABNER
Performance

Tool   | Algorithm  | Precision | Recall
-------|------------|-----------|-------
ABNER  | CRFs model | 69.9%     | 72%
Yagi   | CRFs model | 75%       | 65%
Conclusions
- There is still a lot of room for improvement. However, with existing extractors we can begin high-level text mining work.
- As soon as a better extractor is constructed, we can plug it in easily.
References
- D. Shen, J. Zhang, G. D. Zhou, J. Su, and C. L. Tan. 2003. Effective Adaptation of a Hidden Markov Model-based Named Entity Recognizer for Biomedical Domain. In Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine.
- B. Settles. 2004. Biomedical Named Entity Recognition Using Conditional Random Fields and Rich Feature Sets. In Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA), Geneva, Switzerland, pages 104-107.
- N. Collier, C. Nobata, and J. Tsujii. 2000. Extracting the Names of Genes and Gene Products with a Hidden Markov Model. In Proceedings of COLING 2000, pages 201-207.
The End
Thank you! && Happy Christmas!