Incorporating Dictionary Features into Conditional Random Fields for Gene/Protein Named Entity Recognition

Hongfei Lin, Yanpeng Li, and Zhihao Yang

Department of Computer Science and Engineering, Dalian University of Technology, Dalian, China 116024
{hflin,yangzh}@dlut.edu.cn, lyp [email protected]

Abstract. Biomedical Named Entity Recognition (BioNER) is an important preliminary step for biomedical text mining. Previous researchers built dictionaries of gene/protein names from online databases and incorporated them into machine learning models as features, but the effects were very limited. This paper gives a quality assessment of four dictionaries derived from online resources and investigates the impact of two factors (i.e., dictionary coverage and noisy terms) that may lead to the poor performance of dictionary features. Experiments compare the performance of the external dictionaries with that of a dictionary derived from the GENETAG corpus, using Conditional Random Fields (CRFs) with dictionary features. We also observe the respective impacts on long names and short names. The results show that low coverage of long names and noise among short names are the main problems of current online resources, and that a high-quality dictionary can substantially improve the accuracy of BioNER.

Keywords: BioNER, dictionary feature, CRF.

1 Introduction

The biomedical literature is expanding at an exponential rate. Biomedical text mining [1] can aid information seekers who aim to find knowledge in terabyte-scale text collections. Biomedical Named Entity Recognition (BioNER) is the preliminary step of biomedical text mining, but its performance is far below that in the general domain. The best Named Entity Recognition (NER) systems on newswire articles can achieve an F-score over 95% [2][3], while the state-of-the-art performance of BioNER is only between 75% and 85% [1], varying with datasets and evaluation measures. In the JNLPBA 2004 task [4], systems were required to recognize five classes of named entities and the evaluation followed an exact matching criterion; the top system [5] obtained an F-score of 72.6% using a


combination of HMM and SVM models plus post-processing with manual rules. BioCreative 2004 Task 1A [6] is to identify entities of a single type, tagged as “NEWGENE” or “NEWGENE1”, and its evaluation measure is an F-score under relaxed matching, where a name can have several acceptable representations. Finkel et al. [7] used an MEMM [8] model with carefully designed features plus post-processing of abbreviations and mismatched brackets. Their system obtained an F-score of 83.2%, the best in BioCreative 2004 Task 1A.

Why is this task so difficult? Liu et al. [9] built a large gene/protein database, BioThesaurus, from online resources. It was reported to contain over 2.6 million names (2.1 million normalized names) covering more than 1.8 million UniProtKB (http://www.pir.uniprot.org/database/knowledgebase.shtml) entries. By contrast, the total number of gene/protein names in the BioCreative 2004 training corpus is less than 10,000 before normalization. Thus the number of unknown names is at least hundreds of times larger than the names in a closed dictionary, and many long names have extremely complex structures and many variants. According to the report of the JNLPBA 2004 task [4], most errors occurred at the boundaries of long names. In addition, there are also many errors in single-token names. Some of these words have the shape of common English words, which makes them difficult to identify by orthographic features, and many are acronyms of gene/protein names that may be confused with other chemicals. Without dictionary knowledge, it is extremely difficult to distinguish them from common English words and other named entities using only shallow parsing information.

Dictionary-based systems suffer from low coverage and noisy terms. Currently no dictionary contains all gene/protein names mentioned in the literature, and most long names cannot be found in any dictionary. Since these dictionaries are automatically generated, they tend to bring in large quantities of noise, which leads to low precision. Rule-based methods are an effective way to recognize unknown words, but it is difficult to enumerate all the rules needed to model the structure of biomedical named entities, so they are usually applied in a post-processing stage. Machine learning methods are more robust and can give better answers according to the context. Discriminative models with the structure of Markov networks, such as CRFs [10] and MEMMs [8], achieve state-of-the-art performance on this task. But many systems using these models face a puzzling problem: performance improves little, or even decreases, when external dictionary information is incorporated as features. In the JNLPBA 2004 task, Zhou et al. [5] used SwissProt as external dictionary features, which improved the F-score by 1.2 percent, while in the BioCreative 2004 task their performance decreased by 4 percent. Finkel et al. [7] used a lexicon of 1,731,581 entries built from LocusLink, Gene Ontology and internal resources, but the improvement in F-score was less than 1 percent. Settles [11] used a dictionary of five classes of entities, but the overall F-score decreased.

In the following sections, we attempt to find the reasons for this phenomenon. Section 2 describes the implementation of our baseline tagger. Section 3 presents what



dictionaries are used and how they are incorporated as features. Section 4 and Section 5 present the experiments and discuss the results.

2 Baseline Tagger

2.1 Conditional Random Fields

A Conditional Random Field (CRF) [10] is a discriminative probabilistic model with the structure of a Markov network. Our experiments use linear-chain CRFs, where given an input sequence o and a state sequence s, the conditional probability P(s|o) is defined as follows:

P(s|o) = \frac{1}{Z_o} \exp\Big( \sum_i \sum_k \lambda_k f_k(s_{i-1}, s_i, o, i) \Big)    (1)

where Z_o is a normalization factor over all state sequences, f_k(s_{i-1}, s_i, o, i) is a feature function, and \lambda_k is the feature's weight. s_i and s_{i-1} refer to the current state and the previous state respectively. The training process finds the weights that maximize the log-likelihood of all instances in the training data:

LL = \sum_j \log P(s_j|o_j) - \sum_k \frac{\lambda_k^2}{2\sigma^2}    (2)

where the second term in Formula (2) is a spherical Gaussian prior over feature weights. Once these weights are found, a new unlabeled sequence can be labeled using a modified Viterbi algorithm [10]. CRFs have several advantages for labeling sequence data. Discriminative training makes it possible to incorporate rich features that may overlap with each other, and the Markov network structure captures contextual information by modeling the probability of state transitions. Automatic feature conjunction can also be used to enhance performance.
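As an illustrative sketch (the paper does not specify its CRF implementation), a linear-chain CRF of this kind can be trained with the sklearn-crfsuite package, whose c2 parameter plays the role of the Gaussian prior in Formula (2). The toy sentence, feature names and labels below are hypothetical.

```python
# Hypothetical sketch: training a linear-chain CRF with overlapping
# token features using sklearn-crfsuite (an illustrative toolkit choice).
import sklearn_crfsuite

# One sentence: each token is a dict of overlapping features,
# including a dictionary-match feature as described in Section 3.3.
X_train = [[
    {"word.lower": "the", "pos": "DT", "dict": "O"},
    {"word.lower": "p53", "pos": "NN", "dict": "B-GENE"},
    {"word.lower": "gene", "pos": "NN", "dict": "E-GENE"},
]]
y_train = [["O", "B-GENE", "I-GENE"]]

# c2 is the L2 regularization weight, i.e. the spherical Gaussian
# prior term (the second term in Formula (2)).
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c2=1.0, max_iterations=100)
crf.fit(X_train, y_train)
print(crf.predict(X_train))  # e.g. [['O', 'B-GENE', 'I-GENE']]
```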

2.2 Implementation

Our baseline BioNER tagger is derived from Settles' system [11], a CRF-based tagger with a variety of orthographic features and feature conjunctions over a window of [-1, 1]. The performance of this system is close to the best score in the JNLPBA task, and it uses no external resources or manual rules for post-processing. Details of feature selection and other configuration can be found in Settles' paper [11]. Our system makes a small modification by adding POS tags and chunking tags as features, because these features are very important in BioNER and were chosen by many systems in the task. In this way, it produces results comparable with other systems. The tagger is trained on the GENETAG corpus [12], the training and test data of BioCreative 2004 Task 1A, and achieves an F-score of 79.8% on the test set using relaxed matching and 71.5% using exact matching.
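A sketch of the kind of token features described above — orthographic shape, POS and chunk tags, plus conjunctions over the [-1, 1] window — is shown below. The feature names are our own illustration, not the exact feature set of Settles [11] or of our tagger.

```python
import re

def token_features(tokens, pos_tags, chunk_tags, i):
    """Illustrative token feature extractor: orthographic features,
    POS/chunk tags, and [-1, 1] window conjunctions (assumed names)."""
    w = tokens[i]
    feats = {
        "word.lower": w.lower(),
        "word.isupper": w.isupper(),
        "word.istitle": w.istitle(),
        "word.hasdigit": any(c.isdigit() for c in w),
        "word.hasdash": "-" in w,
        # collapse the word to a shape pattern, e.g. "Abeta42" -> "Aaaaa00"
        "word.shape": re.sub(r"[0-9]", "0",
                      re.sub(r"[a-z]", "a",
                      re.sub(r"[A-Z]", "A", w))),
        "pos": pos_tags[i],
        "chunk": chunk_tags[i],
    }
    # conjunctions with neighbouring tokens (window [-1, 1])
    if i > 0:
        feats["-1:word.lower"] = tokens[i - 1].lower()
        feats["-1:pos"] = pos_tags[i - 1]
    if i < len(tokens) - 1:
        feats["+1:word.lower"] = tokens[i + 1].lower()
        feats["+1:pos"] = pos_tags[i + 1]
    return feats
```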

3 Dictionary Features

3.1 Dictionaries

There are many online databases (e.g., LocusLink, EntrezGene) built to help biomedical researchers. Many previous researchers [5][7][11][13] built gene/protein lists by extracting names from these databases. In this work, we investigated the quality of four external dictionaries derived from the following databases: LocusLink (ftp://ftp.ncbi.nlm.nih.gov/refseq/LocusLink/), EntrezGene (ftp://ftp.ncbi.nih.gov/gene/), BioThesaurus (ftp://ftp.pir.georgetown.edu/databases/iprolink/) and the ABGene lexicon [14] (ftp://ftp.ncbi.nlm.nih.gov/pub/tanabe/).

LocusLink. It presents information on official nomenclature, aliases, sequence accessions, phenotypes, EC numbers, MIM numbers, UniGene clusters, homology, map locations, and related web sites. It organizes information around genes to generate a central hub for accessing gene-specific information for fruit fly, human, mouse, rat and zebrafish.

EntrezGene. It supplies key connections among map, sequence, expression, structure, function, citation, and homology data. Unique identifiers are assigned to genes with defining sequences, genes with known map positions, and genes inferred from phenotypic information.

BioThesaurus. It combines all UniProtKB protein entries and multiple resources based on database cross-references in iProClass (http://pir.georgetown.edu/iproclass/). It was reported to have a coverage of 94% based on the gene names of BioCreative 2004 Task 1B.

ABGene lexicon. Tanabe et al. [14] used ABGene [15] to generate a large gene and protein lexicon of names found in the MEDLINE database. Their approach yielded a final set of 1,145,913 gene names. In their experiments, assessment of a random sample determined the precision to be approximately 82%, and comparison with a gold standard gave an estimated coverage of 61% for exact matches and 88% for partial matches.

GENETAG (internal dictionary). It includes all the named entities labeled in the training and test sets of the GENETAG corpus, except the alternative answers (the “Correct” files). Note that this dictionary contains all names in the gold standard, so models trained on these features will have low generalization ability on real data; it is used only to estimate the upper bound of dictionary features.

For LocusLink and EntrezGene, the fields we used are the same as Cohen [13]. For BioThesaurus we extracted names from the second column. We noted that there are a lot of single characters and common English words in these dictionaries


such as “a”, “the”, and “is”. We therefore used a stopword list of 500 common English words as a filter to remove these obvious noisy terms. In addition, we obtained a dictionary of 6,914,651 entries by combining the four external dictionaries. To the best of our knowledge, this is the largest gene/protein name dictionary that has been used in BioNER. Table 1 shows the total number of terms in each dictionary.

Table 1. Total number of terms in each dictionary

Dictionary       LocusLink  EntrezGene  BioThesaurus  ABGene     GENETAG
Number of terms  437,219    2,503,038   4,958,804     1,145,913  13,139

3.2 Matching Schemes

Once a dictionary is obtained, the next step is to choose a proper matching scheme. We present three matching schemes to estimate the quality of dictionaries and to generate dictionary features.

Method 1. Uppercase letters are converted to lowercase and hyphens are replaced by white spaces.

Method 2. Every name is converted into its normalized form by the following steps.

1. Words are broken into token units. Four classes are defined for each character: uppercase letters, lowercase letters, digits and others. If adjacent characters are in different classes, the word is split into two token units at this point (e.g., “Abeta42 gene” becomes “A beta 42 gene”).
2. Non-alphabetic and non-digit characters are replaced by white spaces. Every digit unit is converted into a single label “[D]”. All Greek letters and Roman numerals are replaced by “[G]” and “[R]” (e.g., “A beta 42 gene” becomes “A [G] [D] gene”).
3. Uppercase letters are converted to lowercase (e.g., “a [g] [d] gene”).
4. White spaces are removed (e.g., “a[g][d]gene”).

Method 3. Names are broken into token units as in step 1 of Method 2. Then the last unit is removed, and the remaining part is converted into the normalized form described above (e.g., “Abeta42 gene” becomes “a[g][d]”). A sketch of these normalization schemes is given below.
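The following is a minimal sketch of Method 2 and Method 3 as we read them from the description above; the Greek/Roman unit lists are illustrative placeholders, since the paper does not enumerate them.

```python
import re

# Illustrative unit lists; the paper's actual lists are not given.
GREEK = {"alpha", "beta", "gamma", "delta", "epsilon", "kappa", "sigma"}
ROMAN = {"ii", "iii", "iv", "vi", "vii", "viii", "ix"}

def split_units(word):
    # Step 1: split at transitions between uppercase letters, lowercase
    # letters, digits and other characters ("Abeta42" -> A / beta / 42).
    return re.findall(r"[A-Z]+|[a-z]+|[0-9]+|[^A-Za-z0-9]+", word)

def normalize(name, drop_last_unit=False):
    """Method 2 (and Method 3 when drop_last_unit=True)."""
    units = [u for w in name.split() for u in split_units(w)]
    if drop_last_unit:
        units = units[:-1]          # Method 3: remove the last token unit
    out = []
    for u in units:
        if u.isdigit():
            out.append("[D]")       # Step 2: digit units
        elif u.lower() in GREEK:
            out.append("[G]")       # Step 2: Greek letters
        elif u.lower() in ROMAN:
            out.append("[R]")       # Step 2: Roman numerals
        elif u.isalpha():
            out.append(u.lower())   # Step 3: lowercase
        # other units become white spaces and are then removed (Step 4)
    return "".join(out)

# normalize("Abeta42 gene")                      -> "a[g][d]gene"
# normalize("Abeta42 gene", drop_last_unit=True) -> "a[g][d]"
```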

3.3 Incorporating Dictionary Features

Dictionary features used in the experiments are generated by a sliding window based on the maximum matching scheme, where the size of the window can extend to the length of the complete sentence. If a name is found in the dictionary, the beginning of the window moves to the next token. Three tags are used to label the features of each token: “B-GENE”, “I-GENE” and “E-GENE”, referring to a token at the beginning, inside and end of a name found in the dictionary, respectively. A minimal sketch of this procedure follows.
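In the sketch below, treating a single-token match as “B-GENE” and the overwrite behaviour for overlapping matches are our assumptions; the paper does not spell these out.

```python
def dictionary_feature_tags(tokens, dictionary):
    """Sliding-window maximum matching: the window starts at each token
    and extends up to the end of the sentence; the longest dictionary
    match wins, then the window start moves to the next token."""
    n = len(tokens)
    tags = ["O"] * n                       # "O" for tokens outside any match
    for i in range(n):
        # try the longest span first (maximum matching)
        for j in range(n, i, -1):
            if " ".join(tokens[i:j]) in dictionary:
                tags[i] = "B-GENE"
                for k in range(i + 1, j - 1):
                    tags[k] = "I-GENE"
                if j - i > 1:
                    tags[j - 1] = "E-GENE"
                break
    return tags

# Example with a hypothetical dictionary entry:
# dictionary_feature_tags("the p53 tumor suppressor".split(),
#                         {"p53 tumor suppressor"})
# -> ["O", "B-GENE", "I-GENE", "E-GENE"]
```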

4 Experiments

4.1 Dictionary Coverage

In our experiments, the dictionary coverage is defined as follows:

Coverage = \frac{N_{TP}}{N_{AP}} \times 100\%    (3)

where N_TP is the number of names found in both the external dictionary and the GENETAG dictionary, and N_AP is the total number of names in the GENETAG dictionary. We used the matching schemes introduced in Section 3.2 to evaluate this figure. We also investigated the dictionary coverage on single-token names, double-token names and multi-token names. This experiment is denoted Experiment 1, and the results are shown in Table 2.
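For concreteness, Formula (3) can be computed as sketched below; counting unique normalized names (rather than mention tokens) is our assumption.

```python
def dictionary_coverage(external_names, gold_names, normalize_fn=str.lower):
    """Sketch of Formula (3): the percentage of gold-standard names found
    in the external dictionary under a given matching scheme
    (normalize_fn would implement Method 1, 2 or 3)."""
    ext = {normalize_fn(n) for n in external_names}
    gold = {normalize_fn(n) for n in gold_names}
    n_tp = sum(1 for name in gold if name in ext)   # N_TP
    return 100.0 * n_tp / len(gold)                 # N_AP = |gold|

# dictionary_coverage({"p53"}, {"P53", "BRCA1"})  -> 50.0
```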

Table 2. Dictionary coverage estimated by different matching schemes

Dictionary    Matching  Coverage        Coverage        Coverage       Coverage
              scheme    (single-token)  (double-token)  (multi-token)  (all)
LocusLink     Method1   49.5            10.5             7.1           23.6
LocusLink     Method2   60.7            14.3             9.9           29.8
LocusLink     Method3   79.8            64.4            24.1           56.1
EntrezGene    Method1   66.1            13.1             8.0           30.8
EntrezGene    Method2   74.0            17.0            10.9           35.8
EntrezGene    Method3   86.9            72.9            25.9           61.7
BioThesaurus  Method1   75.9            20.8            11.9           39.7
BioThesaurus  Method2   83.6            26.1            16.5           43.9
BioThesaurus  Method3   90.1            77.7            31.6           66.3
ABGene        Method1   38.5            48.5            38.7           41.4
ABGene        Method2   54.2            54.2            42.7           50.2
ABGene        Method3   91.2            88.6            57.4           78.7
All           Method1   82.3            52.6            41.9           59.8
All           Method2   91.3            64.4            49.3           69
All           Method3   95              91.7            61             82.3

4.2 Impact of Coverage and Noises

We investigated the impact of coverage and noise in these dictionaries by incorporating dictionary information into the baseline tagger as features (Section 3.3) and then comparing performance on the test set of the GENETAG corpus for single-token names, double-token names and multi-token names. This experiment is divided into four parts:

Experiment 2. We compared the performance of the external dictionaries and the internal dictionary using the same matching scheme, Method1 (Section 3.2). The evaluation measures are precision, recall, relaxed-matching F-score and exact-


matching F-score. Precision, recall and the relaxed-matching F-score are the same as in BioCreative 2004 Task 1A, and the exact-matching F-score is the same as in the JNLPBA task. The results are shown in Table 3.

Table 3. Performances of dictionary features using Method1

Dictionary                         Precision  Recall  F-score (relax)  F-score (exact)
Baseline (no dictionary features)  80.7       79.0    79.8             71.5
LocusLink                          83.2       77      80               71.6
EntrezGene                         82.1       77.5    79.7             71.3
BioThesaurus                       84.3       77.9    81               72.4
ABGene                             82.3       77.6    79.9             71.9
All external                       84.4       78.7    81.5             73.1
GENETAG                            96         98.1    97               96.5
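As a side note, the exact-matching F-score used in Table 3 can be computed from entity spans as in the minimal sketch below (span representation is our assumption).

```python
def span_f1(gold_spans, pred_spans):
    """Exact-matching F-score sketch: a predicted entity counts only if
    its span matches the gold standard exactly; spans are assumed to be
    (start, end) tuples over token positions."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# span_f1({(1, 3), (7, 8)}, {(1, 3), (5, 6)})  -> 0.5
```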

Experiment 3. In this experiment, the performance of different matching schemes was compared using features from the combined external dictionary. The results are shown in Table 4.

Table 4. Performance of different matching schemes

Matching scheme  Precision  Recall  F-score (relax)  F-score (exact)
Method1          84.4       78.7    81.5             73.1
Method2          83.7       78.6    81.1             72.6
Method3          81.3       78.9    80.1             72.3

Experiment 4. We assume there is little noise in the internal dictionary and that noise is introduced mainly by the external resources. Two dictionaries were used to evaluate this impact: one is the internal dictionary, and the other combines the internal dictionary with all the external dictionaries. The coverage of both dictionaries is 100%, so the difference in performance reflects the impact of noise. We also investigated the different impact on short names and long names by selecting dictionary features based on single-token (Fs), double-token (Fd) and multi-token (Fm) names. The results are shown in Table 5.

Table 5. Impact of noises on F-score

Dictionary        F-score (Fs)  F-score (Fd+Fm)  F-score (Fs+Fd+Fm)
GENETAG           86            89.7             97
GENETAG+External  82.9 (-3.6%)  87.3 (-2.7%)     90.1 (-7.1%)

Experiment 5. New dictionaries were generated by mixing various proportions of the internal dictionary with the external dictionaries. In this way we were able to “control” the coverage and noise of these dictionaries, thus obtaining more meaningful observations for investigating the relationship between these factors and for prediction. In this experiment we plot two curves: one reflects the performance of the internal dictionary, which we assume has little noise, and the other reflects the performance of the mixed dictionary described above, where the impact of noise is the most serious among our available resources. The results are shown in Fig. 1.
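The exact mixing procedure is not fully specified; a sketch under the assumption that a random fraction of the internal dictionary is unioned with the external entries could look like this (function and parameter names are hypothetical).

```python
import random

def mixed_dictionary(internal, external, proportion, seed=0):
    """Assumed reading of Experiment 5: sample a given proportion of the
    internal (GENETAG) dictionary and combine it with the external
    dictionaries; the sampling details are our assumption."""
    rng = random.Random(seed)            # fixed seed for reproducibility
    names = sorted(internal)
    sample = rng.sample(names, int(proportion * len(names)))
    return set(sample) | set(external)
```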

Fig. 1. Relationships between dictionary coverage, noises and BioNER performances

5 Results and Discussion

5.1 Experiment 1

From Table 2, it can be seen that the overall coverage of the four dictionaries is quite low. No single dictionary obtains a score over 50% under the near-exact matching scheme (Method1). LocusLink, which was used by many previous researchers, has a coverage of only 23.6%. EntrezGene's coverage is about 7 percentage points higher than LocusLink's and 9 points lower than BioThesaurus's. The coverage of multi-token names in the former three dictionaries is also clearly lower than that of single-token names. For all four dictionaries, Method2 and Method3 both improve the coverage greatly, especially Method3, which improves the overall scores by over 20 percentage points on average. We examined the intermediate results and found that most of these gains came from removing head nouns such as “gene”, “protein” and “receptor”. For example, “fibroblast growth factor receptor 2 gene” was not found in the dictionary, but “fibroblast growth factor receptor 2” could be found by removing the last token. However, this method is certain to introduce many false positives. For example, the term “gene E75” was converted to “gene”.


The ABGene lexicon is somewhat different: it has a much higher coverage of multi-token names, and a higher overall score, than any other dictionary. One explanation is that names in this dictionary were extracted directly from MEDLINE records, and our test data, the GENETAG corpus, is part of MEDLINE abstracts. However, its generation is an automatic procedure, and a number of names in MEDLINE records still fall outside the dictionary.

5.2 Experiment 2

Table 3 shows the impact on F-score of dictionaries with different coverages. It can be seen that the performance of all the external dictionaries differs little from the baseline, and dictionaries with higher coverage generally have slightly better F-scores, which is very similar to the experimental results of previous researchers [5][7][11]. This indicates that dictionary features generated by exact matching help little when dictionary coverage is low. Combining the four dictionaries yields another slight improvement. The results produced by the internal dictionary features show that, given a dictionary of sufficiently high quality, the system can obtain an F-score of 97%, which is similar to the best NER performance in general domains. This is, however, an upper bound, since we used the gold standard as part of the dictionary, and such a dictionary is difficult to obtain in practice. Still, this score is much higher than the performance of the baseline tagger on the training data itself (91%), and we did not use the full text of the test data, only its named entities. When analyzing the errors, we found that most occurred at single-token words, and some can only be resolved with semantic information. This is a good indicator that “enough” context information has been learned from the training data, and that further improvement can come only from enlarging the dictionary.

5.3 Experiment 3

Table 4 shows the performance of the different matching schemes. It can be seen that Method1 is better than Method2 and Method3, although it has the lowest dictionary coverage. This indicates that fuzzy matching such as Method2 and Method3 introduces many false positive instances, which have negative effects on precision and on the training procedure.

5.4 Experiment 4

It can be seen from Table 5 that the overall performance decreases substantially when noise is added by mixing the internal and external dictionaries. When using only single-token features, the F-score decreased by 3.6%, which indicates that even with a dictionary of 100% coverage the score improves little over the baseline (79.8% to 82.9%). For double-token and multi-token features it decreased by 2.7%. This indicates that the impact of noise is more serious on short names than on long names, because the probability of spuriously matching a long name is small, and surface word features within a long name help reduce the impact of noise. When analyzing the errors, we find that some of the “noises” are common English words (e.g., “pigs”, “damage”), which are real noise. Some are acronyms (e.g., “MG”, “IMM”) which may or may not be gene/protein names depending on the context; without contextual information they mislead the training procedure, causing the CRF to reduce the weights of the dictionary features, which leads to low recall. Some are general biomedical terms such as “lung”, “cell” and “cDNA”. Others are named entities of other substances (e.g., “Grf10”, “xth11”) that affect both recall and precision. Besides, there are many cases of alternative name boundaries. For example, in the training corpus “goat alpha s1-casein transcription unit” is labeled as a gene name, while only “alpha s1-casein” can be found in the dictionary. In another case, “type 1C-terminal peroxisomal targeting signal ( PTS1 )” is in the dictionary while “type 1C-terminal peroxisomal targeting signal” is the gold standard. As a result, the weights of the dictionary features (e.g., “I-GENE”, “E-GENE”) for predicting a gene name are reduced.

5.5 Experiment 5

Fig. 1 shows the relationships between dictionary coverage, noise and BioNER performance. The upper curve reflects the performance of features derived from various proportions of the internal dictionary, which we assume has no noise. The other curve reflects the performance of features derived from the mixed dictionary, which combines various proportions of the internal dictionary with the large external dictionary of 6,914,651 entries. In general, the F-score increases with dictionary coverage and is also seriously affected by the noisy terms introduced by external resources. From this graph, we can predict that, by improving the quality of dictionaries, the performance of a BioNER system using the current method will vary between the two curves. This conclusion is encouraging, since the room for improvement is large. We also compared the best result of our experiments with the top system in BioCreative 2004 Task 1A (Table 6). Note that our experiments used no manual rules for post-processing.

Table 6. Comparison with the top system of BioCreative 2004

System              Precision  Recall  F-score
Top in BioCreative  82.8       83.5    83.2
Method1             84.4       78.7    81.5
Baseline            80.7       79.0    79.8

6 Conclusions and Future Work

In our experiment, we built several gene/protein dictionaries using online resources and investigated the impact of dictionary coverage and noises on the


performance of a CRF-based BioNER tagger with dictionary features. The results show that low coverage of long names and noises of short names are the main problems of these external dictionaries derived from online resources. The features based on exact matching and maximum matching is an effective way to reduce the impact of noises on long names. In addition, fuzzy matching like Method2 and Method3 can substantially improve the dictionary coverage, but with little help to the overall performance of the tagger. The possible reason is that names of different length were treated equally and we should develop more variable features in the next step. Also an internal dictionary derived form the training and test data was built to estimate the upper bound of performance of dictionary features and its relationship with dictionary coverage and noises. Experiment results also show that a high quality dictionary can substantially improve the performance of a BioNER system, but the qualities of current online resources have a distance from that. We assume there is still a large space for BioNER to improve using machine learning models with dictionary features. For single-token names, the problem is to reduce the noises, and for multi-token names the most important thing is to increase the dictionary coverage by building high quality dictionaries automatically or developing proper fuzzy matching schemes. Also, this strategy is very efficient, because building dictionary is once laborious work, and time-consuming methods (e.g., SVM [5] and web [7]) used in building dictionaries will be much more efficient than that used in the tagging procedure. Furthermore, the current evaluation of BioNER performance is a relatively ambiguous problem. In the task of BioCreative 2004, both in training data and test data there are alternative gold standards for the same gene/protein name, and in GENIA corpus and JNLPBA 2004 shared task, the answer is unique, so that a large number of errors occurred at boundaries. It indicates that the tasks need agreement in both development and evaluation stages, and the evaluation metric need to improve. Named entity recognition is the preliminary step for advanced text mining or information retrieval, so evaluation binding with next step application will be more practical. For example, the evaluation of NER and NE normalizations should be combined since these two procedures are often joined together, and the more valuable result is the normalized gene ID in database, such as SwissProt ID. Acknowledgments. This work is supported by grant from the Natural Science Foundation of China (No.60373095 and 60673039) and the National High Tech Research and Development Plan of China (2006AA01Z151).

References

1. Cohen, A.M., Hersh, W.R.: A survey of current work in biomedical text mining. Briefings in Bioinformatics 6(1), 57–71 (2005)
2. Bikel, D., Schwartz, R., Weischedel, R.: An algorithm that learns what's in a name. Machine Learning 34, 211–231 (1997)
3. Tjong Kim Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In: Proceedings of the Seventh Conference on Natural Language Learning (CoNLL-2003), pp. 142–147 (2003)
4. Kim, J.D., Tomoko, O., Yoshimasa, T., et al.: Introduction to the Bio-Entity Recognition Task at JNLPBA. In: Proceedings of the International Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA-2004), pp. 70–75 (2004)
5. Zhou, G., Su, J.: Exploring Deep Knowledge Resources in Biomedical Name Recognition. In: Proceedings of the Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA-2004), pp. 96–99 (2004)
6. Hirschman, L., Yeh, A., Blaschke, C., Valencia, A.: Overview of BioCreAtIvE: critical assessment of information extraction for biology. BMC Bioinformatics 6(1), S1 (2005)
7. Finkel, J., Dingare, S., Manning, C.D.: Exploring the boundaries: gene and protein identification in biomedical text. BMC Bioinformatics 6(1), S5 (2005)
8. McCallum, A., Freitag, D., Pereira, F.: Maximum entropy Markov models for information extraction and segmentation. In: Proceedings of the Seventeenth International Conference on Machine Learning, pp. 591–598. Morgan Kaufmann, San Francisco (2000)
9. Liu, H., Hu, Z., Torii, M., Wu, C., Friedman, C.: Quantitative Assessment of Dictionary-based Protein Named Entity Tagging. Journal of the American Medical Informatics Association 13(5), 497–507 (2006)
10. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proceedings of the International Conference on Machine Learning, pp. 282–289. Morgan Kaufmann, San Francisco (2001)
11. Settles, B.: Biomedical Named Entity Recognition Using Conditional Random Fields and Novel Feature Sets. In: Proceedings of the Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA-2004), pp. 104–107 (2004)
12. Tanabe, L., Xie, N., Thom, L.H., Matten, W., Wilbur, W.J.: GENETAG: a tagged corpus for gene/protein named entity recognition. BMC Bioinformatics 6(1) (2005)
13. Cohen, A.M.: Unsupervised gene/protein entity normalization using automatically extracted dictionaries. In: Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics, Proceedings of the BioLINK 2005 Workshop, Detroit, MI: Association for Computational Linguistics, pp. 17–24 (2005)
14. Tanabe, L., Wilbur, W.J.: Generation of a Large Gene/Protein Lexicon by Morphological Pattern Analysis. Journal of Bioinformatics and Computational Biology 1(4), 611–626 (2004)
15. Tanabe, L., Wilbur, W.J.: Tagging gene and protein names in biomedical text. Bioinformatics 18(8), 1124–1132 (2002)
