A Distributional Semantics Approach to Simultaneous Recognition of Multiple Classes of Named Entities

Siddhartha Jonnalagadda1, Robert Leaman1, Trevor Cohen2, and Graciela Gonzalez1

1 Arizona State University, USA
2 The University of Texas Health Science Center at Houston, USA

[email protected], [email protected], [email protected], [email protected]

Abstract. Named Entity Recognition and Classification has been studied for the last two decades. Because semantic features require a large amount of training time and are slow at inference, existing tools apply features and rules mainly at the word level, or rely on lexicons. Recent advances in distributional semantics allow us to efficiently create paradigmatic models that encode word order. We used the permutation-based variant of the Random Indexing model proposed by Sahlgren et al. to create a scalable and efficient system that simultaneously recognizes multiple classes of entities mentioned in natural language. The system is validated on the GENIA corpus, which provides annotations for 46 biomedical entity classes and supports nested entities. Using distributional semantics features only, it achieves an overall micro-averaged F-measure of 67.3% based on fragment matching, with performance ranging from 7.4% for "DNA substructure" to 80.7% for "Bio-entity". Keywords: Distributional, Semantics, Multiple, Named, Entity, Recognition, Classification, GENIA, Biomedical.

1 Introduction

The problem of Named Entity Recognition and Classification (NERC) has been studied for almost two decades [24], and there has been significant progress in the field. While earlier attempts were almost all dictionary- or rule-based systems, most modern systems use supervised machine learning, whereby a system is trained to recognize named entity mentions in text based on specific (and numerous) features associated with the mentions, which the system learns from annotated corpora. Machine-learning-based methods are therefore highly dependent not only on the specific technique or implementation details, but also on the features used. Most contemporary high-performing tools use non-semantic features such as parts of speech, lemmata, regular expressions, prefixes, and n-grams. The high computational cost associated with deep syntactic and semantic features has largely restricted NERC systems to orthographic, morphological and shallow syntactic features. Another common limitation of NERC systems based on machine-learning techniques such as conditional random fields is the significant computational cost of training on a large, richly annotated corpus like GENIA. Conditional random fields have time complexity O(t*S^2*k*n) for training and O(S^2*n) for decoding [17, 23], where:

A. Gelbukh (Ed.): CICLing 2010, LNCS 6008, pp. 224–235, 2010. © Springer-Verlag Berlin Heidelberg 2010

t is the number of training instances,
S is the number of states (linear in the number of entity classes and exponential in the order of the model),
k is the number of training iterations performed, and
n is the training instance length.

While such probabilistic graphical models have also been used for multi-class NERC [7, 22, 29, 31], they are typically trained on fewer than six entity classes and are not particularly computationally efficient. In contrast, our system has time complexity O(t*S^0) for training and O(S^0) for decoding; that is, its cost is independent of the number of entity classes. Distributional semantics is an emerging field concerned with the automatic estimation of the quantitative relatedness between words, and between passages, based on the distribution of words in a corpus. These estimates of relatedness have been shown to correspond well with human judgment in a number of evaluations, and have proved useful in many applications [2]. Random Indexing [14], a recently developed scalable method of distributional semantics, enables the processing of larger corpora than was possible with previous methods. In this paper, we present and evaluate an initial application that explores the use of distributional semantics for simultaneously recognizing and classifying all the named entities present in the GENIA corpus, which could represent a more elegant solution to the problem of multi-class NERC.

2 Background

Semantic features of varying degrees of sophistication have been used previously in systems such as ABNER [29] and the joint parser and NER tool developed at Stanford by Finkel [7]. However, their use has not resulted in any improvement in precision or recall. ABNER, a pioneering system for biomedical NERC using conditional random fields, uses list look-up techniques based on 17 dictionaries that map individual tokens to their semantic types. The dictionaries include some entered by hand (Greek letters, amino acids, chemical elements, known viruses, plus abbreviations of all of these) and others corresponding to genes, chromosome locations, proteins, and cell lines. These dictionaries were built carefully using sound algorithmic techniques. However, adding these semantic features to the existing word-level features actually had a deleterious effect, decreasing the F-measure by 0.3%. Finkel's tool uses the distributional similarity model built by Clark [3] in 2000 to determine the cluster to which a particular token belongs. The clusters were built a priori from the British National Corpus and the English Gigaword corpus. The major limitations of this approach are that Clark's model uses only the adjacent tokens to calculate distributional similarity, and that the ambiguity in the semantic type of a token depending on the larger context is not taken into consideration. It is also reported that, because only 200 clusters could be found, inference was slower and there was no improvement in performance. On the other hand, most state-of-the-art NERC systems, such as BANNER, do not use any semantic features, including distributional semantic features, for want of evidence of scalability and impact on performance [19]. The main contribution of this paper is a framework that readily adapts distributional semantic features for NERC, evaluated on a corpus with multiple classes and nested entities.

2.1 Distributional Semantics

Methods of distributional semantics can be classified broadly as either probabilistic or geometric. Probabilistic models view documents as mixtures of topics, allowing terms to be represented according to the probability of their being encountered during the discussion of a particular topic. Geometric models, of which Random Indexing is an exemplar, represent terms as vectors in a multi-dimensional space whose dimensions are derived from the distribution of terms across defined contexts, which may be entire documents, regions within documents, or grammatical relations. For example, Latent Semantic Analysis (LSA) [18] uses the entire document as the context, generating a term-document matrix in which each cell holds the number of times a term occurs in a document. The Hyperspace Analog to Language (HAL) model [20], by contrast, uses the words surrounding the target term as the context, generating a term-term matrix that records the number of times a given term occurs in the neighborhood of every other term. Schütze's Wordspace [28] instead uses around 1000 frequently-occurring four-grams within a sliding window as contexts, resulting in a term-by-four-gram matrix. In general, the magnitude of a term vector depends on the frequency of occurrence of the term in the corpus, and its direction depends on the term's relationship with the chosen basis vectors.

Random Indexing: Most distributional semantics models incur a high computational and storage cost when building or modifying the model, because of the large number of dimensions involved when a large corpus is modeled. While dimensionality reduction techniques such as Singular Value Decomposition (SVD) can generate a reduced-dimensional approximation of a term-by-context matrix, this compression comes at considerable computational cost: the time complexity of SVD with standard algorithms is essentially cubic [4].
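To make the term-by-term style of context concrete, a HAL-style co-occurrence matrix can be sketched in a few lines of Python. This is an illustrative toy, not the implementation used in this work; the function name and window size are our own.

```python
from collections import defaultdict

def term_term_matrix(sentences, window=2):
    """Build a HAL-style term-by-term matrix: counts[t][c] is the number
    of times term c appears within `window` positions of term t."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, t in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[t][tokens[j]] += 1
    return counts

m = term_term_matrix([["p53", "suppresses", "tumor", "growth"]], window=1)
```

Each row of such a matrix is a term vector; the difficulty the next paragraphs address is that, for a real corpus, the number of columns grows with the vocabulary.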
Recently, Random Indexing [14] emerged as a promising alternative to SVD for the dimension-reduction step in the generation of term-by-context vectors. Random Indexing and other similar methods are motivated by the Johnson–Lindenstrauss lemma [12], which states that the distances between points in a vector space will be approximately preserved if the points are projected into a reduced-dimensional subspace of sufficient dimensionality. While this procedure requires a fraction of the RAM and processing power of SVD, it produces term–term associations [14] of similar accuracy to those produced by SVD-based Latent Semantic Analysis. Random Indexing avoids the need to construct, and subsequently reduce the dimensions of, a term-by-context matrix by generating a reduced-dimensional matrix directly. This is accomplished by assigning to each context a sparse elemental vector with the dimensionality (on the order of 1000) of the reduced-dimensional space to be generated. These vectors consist mostly of zeros, but a small number (on the order of 10) of +1 and -1 values are randomly distributed across the vector. Given the many possible placements of a small number of +1's and -1's in a high-dimensional space, it is likely that most of the assigned index vectors will be close-to-orthogonal (almost perpendicular) to one another. Consequently, rather than constructing a full term-by-context matrix in which each context is represented as an independent dimension, a reduced-dimensional matrix is constructed in which each context is represented as a close-to-independent vector. Term vectors are then
generated as the linear sum of the sparse elemental context vectors of each context in which they occur, weighted according to frequency. The method scales linearly with the size of the corpus and is consequently much faster than previous methods (processing, for example, the entire MEDLINE corpus in around 30 minutes), allowing rapid prototyping of semantic spaces for experimental purposes. In addition, Random Indexing implementations such as the Semantic Vectors package used in this research [33] tend to support both term-by-document and sliding-window-based indexes, allowing these types of indexing procedures to be compared on particular tasks. Random Indexing also efficiently integrates new documents into an existing semantic space, allowing for the implementation of efficient NERC systems.

Paradigmatic vs. syntagmatic relations: Recent research in distributional semantics has explored the differences between the relations extracted depending on the type of context used to build a model [26]. As defined by de Saussure [27], there are two types of relationship between words: syntagmatic and paradigmatic. If two words co-occur significantly in passages or sentences, they are in a syntagmatic relationship. Examples include terms that frequently occur in succession, such as p53 and tumor, APOE and AD, and poliomyelitis and leg. If two words can substitute for each other in a sentence while maintaining the integrity of its syntactic structure, they are in a paradigmatic relationship. Examples include p53 and gata1, AD and SDAT, and poliomyelitis and polio. Since words in a paradigmatic relationship do not occur together in the same context, extracting such a relationship typically requires second-order analysis, while first-order analysis is sufficient to extract syntagmatic relationships.
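The elemental-vector construction described above can be sketched as follows. This is a minimal illustration in Python with NumPy; the dimensionalities, function names, and the choice of neighboring terms as contexts are our own, not the paper's code.

```python
import numpy as np

def elemental_vector(dim, nonzero, rng):
    """Sparse random index vector: mostly zeros, with a few randomly
    placed +1/-1 entries; in high dimensions such vectors are very
    likely to be close-to-orthogonal to one another."""
    v = np.zeros(dim)
    idx = rng.choice(dim, size=nonzero, replace=False)
    v[idx] = rng.choice([-1.0, 1.0], size=nonzero)
    return v

def build_term_vectors(sentences, dim=200, nonzero=10, window=5, seed=0):
    """Each term's semantic vector is the sum of the elemental vectors
    of the contexts (here: its neighboring terms) in which it occurs."""
    rng = np.random.default_rng(seed)
    elemental, semantic = {}, {}
    def ev(t):
        if t not in elemental:
            elemental[t] = elemental_vector(dim, nonzero, rng)
        return elemental[t]
    for tokens in sentences:
        for i, t in enumerate(tokens):
            sv = semantic.setdefault(t, np.zeros(dim))
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    sv += ev(tokens[j])
    return semantic
```

Terms that occur in the same contexts, which is the paradigmatic signal discussed above, accumulate the same elemental vectors and therefore end up with similar semantic vectors.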
Sahlgren argues that using a small sliding window rather than an entire document as the context is better suited to extracting paradigmatic relations, and supports this argument with empirical results [25]. The NERC task involves finding words that could conceivably replace the token we want to label without disturbing the syntactic structure. In scientific language, however, domain semantics also determine which terms can replace one another [11]. We have chosen to model paradigmatic relationships using the vector-coordinate-permutation model introduced by Sahlgren, Holst and Kanerva [25], as the relations captured by this method have been observed to emphasize terms of a similar semantic class [34, 2].

Encoding word order using permutation: In addition to providing a paradigmatic model, Sahlgren's permutation-based method encodes word order, thus accounting for the sequential structure of language: the position of a word signals its grammatical role and hence its meaning. This method is an alternative to the convolution and superposition operations used by BEAGLE [13] to encode word-order information in word spaces; permutation of vector coordinates is a computationally light alternative to BEAGLE's convolution operation. To achieve this, Sahlgren et al. use Random Indexing [14] to generate context vectors for each term, and replace the convolution operator with a permutation, or shuffling, of coordinates: all of the non-zero values of a sparse elemental vector are shifted left or right according to the relative position of the terms. In this way, a different close-to-orthogonal elemental vector is generated for each term depending on its position within the sliding window. A
semantic term vector for each term is then generated as the linear sum of the permuted elemental vectors of the terms co-occurring with it in a sliding window. The permutation function is reversible, allowing the construction of order-based queries. Permutation-based indexing is supported in the Semantic Vectors package (see below), and is described in further detail in Sahlgren, Holst and Kanerva [25].

2.2 Semantic Vectors System

Semantic Vectors (http://semanticvectors.googlecode.com) is a scalable open-source package written in Java that depends only on Apache Lucene. The package can be used to create distributional semantic vectors from corpora and to perform various mathematical operations on them, including scalar products and cosine similarity, normalization, tensor operations (inner and outer product, sum, normalization), convolution products, and orthogonalization routines for vector negation and disjunction between term and document vectors.

Apache Lucene: Apache Lucene (http://lucene.apache.org/) is a powerful and widely used open-source library, which we use for tokenization and indexing to extract the relative positions of terms from the corpus. The positions of terms within each document are input to the Semantic Vectors package to create a reduced-dimensional approximation of a position-dependent term-by-term matrix. Lucene builds an index over all the documents to be searched, and a count of the tokens in each document is stored in the term-document matrix. We use the tokenization methods provided by Lucene's StandardAnalyzer class so that tokens from sentences in the test set are standardized in the same way as those from sentences in the training set. The tokenization rules are described in the Lucene documentation at http://lucene.apache.org/java/2_3_0/api/org/apache/lucene/analysis/standard/StandardTokenizer.html.
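The permutation-based order encoding described above can be sketched as follows. This is our own toy illustration, not the Semantic Vectors implementation: np.roll stands in for the coordinate permutation, and the vectors are tiny rather than sparse 200-dimensional ones.

```python
import numpy as np

def order_encoded_vector(elemental, tokens, pos, window=5):
    """Sum the neighbors' elemental vectors, rotating each one's
    coordinates by the neighbor's relative position. Rolling by -k
    undoes rolling by +k, so the permutation is reversible, which is
    what makes order-based queries possible."""
    dim = len(next(iter(elemental.values())))
    sv = np.zeros(dim)
    for j in range(max(0, pos - window), min(len(tokens), pos + window + 1)):
        if j != pos:
            sv += np.roll(elemental[tokens[j]], j - pos)
    return sv
```

Because each neighbor contributes a differently permuted vector, reversing the order of the neighbors changes the resulting semantic vector, so word order is no longer thrown away.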
2.3 GENIA Corpus

To the best of our knowledge, the GENIA corpus [15, 16] is the most complex corpus used to evaluate NERC systems, with around 100,000 annotations for 47 biologically relevant categories drawn from 2000 PubMed abstracts comprising more than 400,000 words. Roughly 17% of the entities are embedded within another entity. Because of the limitations discussed in the introduction, no existing framework recognizes and classifies all of these entities at the same time.

3 Methods

The architecture is a two-stage pipeline, as shown in Figure 1. The entire corpus is broken into more than 18,000 documents, each of which contains a unique sentence of the GENIA corpus. A Lucene index is built over this set of documents, and the term and document vectors are built using the Semantic Vectors package. We used Sahlgren's permutation-based model [25], with the dimensionality of the reduced-dimensional space set to 200, to produce the random index vectors. We selected a sliding window spanning the five tokens before and after the target token.

Fig. 1. System Architecture

The corpus is divided into two halves: one half is the training set and the other half is the test set. The Lucene tokenizer breaks each sentence into tokens, and the SimFind algorithm is used to find the token in the training set that is most similar to the target token; the entity class of that token is then assigned to the target token. SimFind therefore takes the surrounding context into consideration when determining the semantic type of each token, whereas previous methods considered the semantic type of a token independent of its context. In this research, we use the estimates of similarity provided by Random Indexing for two purposes. First, as token labels are context-dependent, we find the 100 training-set sentences most similar to the vector sum of the terms of the target sentence. Next, we find the first token from these similar sentences that is the same as, or similar to, the target token. The SimFind algorithm thus takes into account all the other tokens present in the sentence, and it does not assume that the target token is present in the training set. Pseudocode for the algorithm is given in Table 1, and the complete source code with documentation is publicly available at http://www.public.asu.edu/~sjonnal3/SV_NER_src.zip. To improve the efficiency of SimFind, we use the list of 421 stop words created by Fox from the Brown corpus [9]; these stop words were selected to be maximally efficient and effective in filtering semantically neutral words. There are several options for the labeling model. The simplest is the IO model, which indicates whether a token is inside an entity or outside an entity; this is the model we employ in this work. Another possible model is IOB, where each token is labeled as either

Table 1. SimFind algorithm

SimFind(targetToken, line){
  List simSentences = getSimilarSentences(line, 100);
  List goldenTokenLabels = getTokenLabels(simSentences);
  STEP1:
  FOREACH (goldenTokenLabel IN goldenTokenLabels)
    IF (goldenTokenLabel has targetToken as token)
      return goldenTokenLabel;
  STEP2:
  IF (targetToken IN STOPLIST)
    return (targetToken, outside);
  terms = 1;
  STEP3:
  terms *= 10;
  (equivTokens, simIndex) = getSimWords(targetToken, terms);
  FOREACH (equivToken IN equivTokens)
    FOREACH (goldenTokenLabel IN goldenTokenLabels)
      IF (goldenTokenLabel has equivToken as token)
        return goldenTokenLabel;
  IF (simIndex > 0.5) goto STEP3;
  return (targetToken, outside);
}

The SimFind function is the core method: for each token in the target sentence, it retrieves the training sentences that share the same context as the target sentence. The algorithm first checks for the earliest appearance of the target token in the set of similar sentences, arranged in order of similarity. The next step would be to search for tokens similar to the target token; however, to minimize the total time taken, we first eliminate tokens that appear too frequently in common English and are hence highly unlikely to be part of a biomedical entity.

getSimilarSentences(line, numberOfResults){
  break line into tokens using the Lucene tokenizer;
  form the query vector as the sum of the token vectors;
  search for similar documents among the random index vectors;
  set the number of results to numberOfResults;
  listOfSimilarSentences = the training-set sentences
    corresponding to these documents;
  return listOfSimilarSentences;
}

getSimWords(targetToken, count){
  form the query vector from targetToken;
  search for similar terms among the random index vectors;
  set the number of results to count;
  return list of similar terms;
}

getTokenLabels(simSentences){
  for each token in simSentences
    find its label from the XML annotation
    and add it to listOfTokenLabels;
  return listOfTokenLabels;
}

The getSimilarSentences function finds the specified number of training-set sentences that are most similar to the vector sum of the terms of a given test-set sentence.

The getSimWords function fetches the tokens in the corpus that are most similar to a given token.

The getTokenLabels function is used to get the semantic type of the tokens in an annotated sentence.
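Putting the pieces together, the SimFind procedure can be approximated in Python as follows. This is a simplified, self-contained reimplementation for illustration only: the data structures, the cosine ranking, and the "O" label for outside tokens are our assumptions, not the released Java/Semantic Vectors code.

```python
import numpy as np

def cosine(u, v):
    n = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v) / n if n else 0.0

def sim_find(target, sent_tokens, term_vecs, train_sents, stoplist,
             top_n=100, n_similar=10):
    """train_sents: list of (tokens, {token: label}) pairs.
    Step 1: rank training sentences by similarity to the target
    sentence (cosine between summed term vectors) and return the label
    of the earliest exact occurrence of the target token.
    Step 2: stop words are assumed to lie outside any entity.
    Step 3: otherwise fall back to distributionally similar terms."""
    dim = len(next(iter(term_vecs.values())))
    zero = np.zeros(dim)
    sent_vec = lambda toks: sum((term_vecs.get(t, zero) for t in toks),
                                np.zeros(dim))
    qv = sent_vec(sent_tokens)
    ranked = sorted(train_sents, key=lambda s: -cosine(qv, sent_vec(s[0])))[:top_n]
    for _, labels in ranked:                          # STEP 1: exact match
        if target in labels:
            return labels[target]
    if target in stoplist:                            # STEP 2: stop words
        return "O"
    tv = term_vecs.get(target, zero)                  # STEP 3: similar terms
    for cand in sorted(term_vecs, key=lambda t: -cosine(tv, term_vecs[t]))[:n_similar]:
        for _, labels in ranked:
            if cand in labels:
                return labels[cand]
    return "O"
```

The fallback in step 3 is what lets the method label tokens never seen in the training set, as long as a distributionally similar token was seen there.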

beginning of an entity, inside an entity, or outside an entity. There are also systems using the IOBEW model, which additionally labels the end of an entity and one-word entities. In a recent evaluation of BANNER [19], an NERC tool that used a corpus annotated with biomedical entities to recognize gene mentions, the difference between the performances of these three labeling models was found to be less than 1%. Each token can belong to multiple semantic types, as GENIA annotates nested entities. Since there are 36 entity classes at the leaf level [15], there are 2^36 possible label types with the IO model.
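For concreteness, the IO scheme we use can be contrasted with IOB on a toy annotation (the example tokens and spans are our own, not from GENIA):

```python
def to_io(tokens, spans):
    """spans: (start, end) token-index pairs marking entities.
    IO labeling marks each token as I (inside) or O (outside)."""
    labels = ["O"] * len(tokens)
    for s, e in spans:
        for i in range(s, e):
            labels[i] = "I"
    return labels

def to_iob(tokens, spans):
    """IOB additionally distinguishes the B(eginning) of each entity."""
    labels = ["O"] * len(tokens)
    for s, e in spans:
        labels[s] = "B"
        for i in range(s + 1, e):
            labels[i] = "I"
    return labels
```

With nested entities, each token carries one such flag per entity class rather than a single flag, which is where the 2^36 label combinations come from.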

4 Results

NERC systems are typically evaluated using exact matching, which requires that both the left and right boundaries match exactly. For many applications, however, determining the exact boundary is not necessary; it is sufficient to determine whether the sentence contains an entity of the specified type, along with its approximate location. More realistic matching techniques, such as core-term matching and fragment matching, have therefore recently become prominent [32]. In core-term matching, the system's annotated named entity must contain a core term of the gold-standard named entity. This requires every annotation in the corpus to also identify the core term; in a corpus like GENIA, with around 100,000 entities, this would require an excessive amount of annotation effort. In fragment matching, each token is treated separately. This measures what fraction of each entity is matched, and is thus more realistic than conventional exact matching or loose partial matching. Since 5x2 cross-validation has been shown [5] to be statistically more powerful than 10x1 validation, we evaluate using 5x2 cross-validation. Table 2 presents the precision, recall and F-score achieved by our system on all the entities annotated in the GENIA corpus, except biologically irrelevant entities such as Protein N/A and DNA N/A and those with insufficient data. We also provide the counts of true positives, false positives, and false negatives in each case. For most of these entities, we are among the first to use GENIA for evaluation; our results therefore also serve as a baseline for NERC systems subsequently evaluated on the GENIA corpus. In addition, for each entity we calculate the F-score of a system that randomly assigns positive or negative labels in proportion to the actual numbers of true and false cases.
If a corpus has t tokens belonging to a particular entity class and f tokens not belonging to it, a system that randomly assigns tokens to that class in proportion to the known proportion of positives and negatives achieves precision and recall both approximating t/(t+f). The F-score of such a random system is therefore also approximately t/(t+f), which serves as a quantitative estimate of the difficulty of the NERC task for that entity class. This quantity is labeled Random F-score in Table 2. The entities in Table 2 are arranged in descending order of our system's F-score. It is encouraging that more than half of the entity classes achieve an F-score greater than 50% based on distributional semantics features alone, and that the differences between the F-score and the Random F-score are large. The system also achieves a respectable overall micro-averaged F-score of 67.3%, calculated by summing the true positives, false positives and false negatives of the individual entity classes. It took around 5 minutes to build the semantic vectors from the documents of the GENIA corpus and around 3 hours to produce results for the test set, which comprises more than 9000 sentences. This suggests that the framework is scalable and could have a significant impact on the precision and recall of a more complex system.
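The micro-averaged score and the random baseline are straightforward to compute from the counts in Table 2; a quick check in Python, using the overall TP/FP/FN reported there:

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from token-level (fragment-matching) counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

def random_f(t, f):
    """Expected F-score of a proportional random assigner:
    precision ~ recall ~ t/(t+f), so F ~ t/(t+f) as well."""
    return t / (t + f)

# Micro-averaging: sum TP, FP, FN over all classes, then compute once.
p, r, f1 = prf(342330, 174180, 157909)
print(round(100 * p, 1), round(100 * r, 1), round(100 * f1, 1))  # 66.3 68.4 67.3
```

These reproduce the overall precision, recall and F-score reported in the last row of Table 2.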

Table 2. Results for the GENIA entities

Entity                       Precision (%)  Recall (%)  F-score (%)      TP      FP      FN  Random F-score
Bio-entity                            78.9        82.5         80.7   60479   16131   12868           26.22
Substance                             77.0        79.6         78.3   46587   13796   11976           20.92
Organic compound                      77.0        79.5         78.2   46244   13792   11944           20.82
Compound                              77.0        79.5         78.2   46382   13822   11980           20.82
Amino acid                            69.4        71.1         70.3   27331   11917   11091           13.69
Protein                               69.2        71.0         70.1   26692   11864   10890           13.37
Lipid                                 66.1        67.0         66.5    1243     637     618            0.66
Virus                                 65.6        67.3         66.4    1641     862     797            0.86
Source                                61.4        66.2         63.7   10434    6554    5326            5.62
Atom                                  62.0        60.2         61.1     150      92      99            0.10
Nucleotide                            57.0        64.4         60.5     114      86      63            0.05
Other organic compound                62.1        58.9         60.5    2105    1285    1470            1.28
Protein molecule                      60.9        59.4         60.1   13194    8456    9014            7.87
Organism                              59.6        58.8         59.2    2085    1412    1460            1.23
Amino acid monomer                    61.2        53.1         56.9     256     162     225            0.15
Mono Cell                             69.4        45.9         55.3     100      44     118            0.10
Inorganic                             57.1        53.3         55.1      97      73      85            0.66
Natural source                        52.0        57.6         54.6    6017    5560    4427            3.76
Carbohydrate                          63.2        45.7         53.1      43      25      51            0.05
Nucleic acid                          51.0        54.2         52.6    9181    8803    7752            6.05
DNA                                   48.3        52.6         50.4    7829    8366    7051            5.31
DNA domain or region                  44.4        48.5         46.4    5889    7362    6253            4.35
Cell type                             42.7        50.7         46.3    3046    4089    2968            2.14
Cell line                             44.0        44.9         44.5    2375    3022    2912            1.87
Artificial source                     43.9        44.3         44.1    2442    3118    3074            1.98
RNA                                   47.0        41.2         43.9     707     797    1011            0.61
Body part                             39.6        45.0         42.1     148     226     181            0.10
Other name                            42.6        40.0         41.3   11591   15645   17367           10.31
Protein domain or region              41.9        38.8         40.2     606     842     958            0.56
Protein complex                       40.4        40.1         40.2    1509    2226    2256            1.33
Protein family or group               34.0        39.8         36.7    3761    7289    5697            3.36
Peptide                               41.9        32.7         36.7     149     207     307            0.15
RNA molecule                          36.5        36.7         36.6     453     783     777            0.40
Multi Cell                            36.5        34.7         35.6     315     547     593            0.30
Polynucleotide                        44.9        27.0         33.7      62      76     168            0.10
Protein subunit                       31.2        31.1         31.2     379     834     838            0.40
DNA molecule                          24.6        22.6         23.6     174     533     597            0.30
Tissue                                22.8        23.7         23.3     151     510     486            0.20
RNA family or group                   28.3        15.7         20.2      67     170     360            0.15
Protein substructure                  12.2        16.5         14.0      21     151     106            0.05
DNA family or group                   12.8        14.5         13.6     270    1844    1588            0.66
DNA substructure                       6.1         9.3          7.4      11     170     107            0.05
Overall score                         66.3        68.4         67.3  342330  174180  157909

Fig. 2. Depiction of which entities cause confusion for each entity. Each dotted arrow points from a leaf-level entity class (at the tail of the arrow) to the biologically-relevant leaf-level entity class (at the head of the arrow) that causes it the most confusion, with the corresponding confusion percentage shown below its name.

There have been several attempts [1, 8, 10, 21, 30, 35, 36] to use machine learning to find nested entities in entity-rich text like the GENIA corpus. As discussed in Section 2, these systems limit themselves to fewer than six entity classes at a time due to computational cost. Since our framework also recognizes nested entities, we believe it can be used to provide features that can be computed quickly, replacing features with slower inference. We analyzed the errors made by our system by characterizing the confusion between entity classes. An entity class A is said to confuse entity class B if and only if at least one of the false positives of B actually belongs to A, or at least one of the false negatives of B was assigned by the system to A. The confusion percentage of entity class A relative to entity class B is defined as the percentage of B's errors in which A confuses B, for a given corpus and a given cross-fold validation. Such knowledge helps in discovering, refining or validating relationships between entity classes and in creating more meaningful ontologies. Information on which entity classes damage the results of a target entity class is valuable for creating more efficient and powerful rules or features. For example, 34% of the mistakes in classifying "RNA
domain or region" were caused by "DNA domain or region"; 44% of the mistakes in classifying "Protein complex" were caused by "Protein molecule"; and 23% of the mistakes in classifying "Lipid" were caused by "Protein molecule". In a significant number of cases, most of the confusion was caused by immediate siblings, as would be expected, but there were many exceptions, for example "RNA domain or region" with "DNA domain or region", "Lipid" with "Protein molecule", and "DNA domain or region" with "Protein family or group". This reflects both the ambiguity inherent in natural language and the fact that, while the GENIA ontology reflects a consideration of the major properties of an entity, the local context of a mention may be more indicative of a single property that may be shared with entities which are otherwise significantly different.
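The confusion percentage just defined can be sketched in code as follows. The data layout is hypothetical: parallel lists of gold and predicted label sets per token, which accommodates the nested (multi-label) annotations.

```python
def confusion_percentage(a, b, gold, pred):
    """Percentage of B's errors in which A is implicated: a false
    positive of B whose gold labels include A, or a false negative of B
    whose predicted labels include A. gold/pred: per-token label sets."""
    errors = confused = 0
    for g, p in zip(gold, pred):
        fp_b = b in p and b not in g
        fn_b = b in g and b not in p
        if fp_b or fn_b:
            errors += 1
            if (fp_b and a in g) or (fn_b and a in p):
                confused += 1
    return 100.0 * confused / errors if errors else 0.0
```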

5 Conclusion

We have presented a scalable, efficient and accurate system that uses distributional semantic vectors to recognize all the entity classes annotated in a corpus. Our system is validated on the GENIA corpus, which has 46 entity classes with annotations that support nested entities, and achieves an overall micro-averaged F-score of 67.3% using fragment matching. In the future, we plan to present a machine-learning-based system that uses distributional semantic features in addition to the currently available features.

References

1. Byrne, K.: Nested Named Entity Recognition in Historical Archive Text. In: Proceedings of the International Conference on Semantic Computing (2007)
2. Cohen, T., Widdows, D.: Empirical Distributional Semantics: Methods and Biomedical Applications. Journal of Biomedical Informatics 42 (2009)
3. Clark, A.: Inducing Syntactic Categories by Context Distribution Clustering. In: Proceedings of the 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning (2000)
4. Trefethen, L.N., Bau, D.: Numerical Linear Algebra. Society for Industrial and Applied Mathematics, Philadelphia (1997)
5. Dietterich, T.G.: Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation 10 (1998)
6. Eddy, S.R.: Hidden Markov Models. Current Opinion in Structural Biology 6 (1996)
7. Finkel, J.R., Manning, C.D.: Joint Parsing and Named Entity Recognition. In: Proceedings of NAACL HLT (2009)
8. Finkel, J.R., Manning, C.D.: Nested Named Entity Recognition. In: Proceedings of EMNLP (2009)
9. Fox, C.: A Stop List for General Text. ACM SIGIR Forum 24 (1989)
10. Gu, B.: Recognizing Nested Named Entities in GENIA Corpus. In: Proceedings of the Workshop on Linking Natural Language Processing and Biology: Towards Deeper Biological Literature Analysis (2006)
11. Harris, Z.S.: The Structure of Science Information. Journal of Biomedical Informatics 35 (2002)
12. Johnson, W.B., Lindenstrauss, J.: Extensions of Lipschitz Mappings into a Hilbert Space. Contemporary Mathematics 26 (1984)
13. Jones, M.N., Mewhort, D.J.K.: Representing Word Meaning and Order Information in a Composite Holographic Lexicon. Psychological Review 114 (2007)
14. Kanerva, P., Kristofersson, J., Holst, A.: Random Indexing of Text Samples for Latent Semantic Analysis. In: Proceedings of the 22nd Annual Conference of the Cognitive Science Society (2000)
15. Kim, J.D., Ohta, T., Tateisi, Y., et al.: GENIA Corpus - a Semantically Annotated Corpus for Bio-Textmining. Bioinformatics 19 (2003)
16. Kim, J.D., Ohta, T., Tsujii, J.: Corpus Annotation for Mining Biomedical Events from Literature. BMC Bioinformatics 9 (2008)
17. Lafferty, J., McCallum, A., Pereira, F.: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In: Proceedings of ICML (2001)
18. Landauer, T.K., Dumais, S.T.: A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge. Psychological Review 104, 211-240 (1997)
19. Leaman, R., Gonzalez, G.: BANNER: An Executable Survey of Advances in Biomedical Named Entity Recognition. In: Proceedings of PSB (2008)
20. Lund, K., Burgess, C.: Hyperspace Analog to Language (HAL): A General Model of Semantic Representation. Language and Cognitive Processes (1996)
21. Màrquez, L., Villarejo, L., Martí, M.A., et al.: SemEval-2007 Task 09: Multilevel Semantic Annotation of Catalan and Spanish. In: Proceedings of the 4th International Workshop on Semantic Evaluations (2007)
22. McCallum, A., Li, W.: Early Results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons. In: Proceedings of CoNLL (2003)
23. McDonald, R., Pereira, F.: Identifying Gene and Protein Mentions in Text Using Conditional Random Fields. BMC Bioinformatics (2005)
24. Rau, L.F.: Extracting Company Names from Text. In: Proceedings of the IEEE Conference on Artificial Intelligence Applications (1991)
25. Sahlgren, M., Holst, A., Kanerva, P.: Permutations as a Means to Encode Order in Word Space. In: Proceedings of CogSci (2008)
26. Sahlgren, M.: The Word-Space Model. Doctoral Dissertation in Computational Linguistics, Stockholm University (2006)
27. de Saussure, F., Bally, C., Séchehaye, A., et al.: Cours de linguistique générale. Payot, Paris (1922)
28. Schütze, H.: Automatic Word Sense Discrimination. Computational Linguistics 24, 97-123 (1998)
29. Settles, B.: ABNER: An Open Source Tool for Automatically Tagging Genes, Proteins and Other Entity Names in Text. Bioinformatics 21 (2005)
30. Shen, D., Zhang, J., Zhou, G., et al.: Effective Adaptation of a Hidden Markov Model-Based Named Entity Recognizer for Biomedical Domain. In: Proceedings of ACL (2003)
31. Song, Y., Kim, E., Lee, G.G., et al.: POSBIOTM-NER in the Shared Task of BioNLP/NLPBA 2004. In: Proceedings of IJNLPBA (2004)
32. Tsai, R.T., Wu, S.H., Chou, W.C., et al.: Various Criteria in the Evaluation of Biomedical Named Entity Recognition. BMC Bioinformatics 7 (2006)
33. Widdows, D., Ferraro, K.: Semantic Vectors: A Scalable Open Source Package and Online Technology Management Application. In: Proceedings of LREC (2008)
34. Widdows, D., Cohen, T.: Semantic Vector Combinations and the Synoptic Gospels. In: Proceedings of the Third Quantum Interaction Symposium (2009)
35. Zhou, G., Zhang, J., Su, J., et al.: Recognizing Names in Biomedical Texts: A Machine Learning Approach. Bioinformatics 20 (2004)
36. Zhou, G.D.: Recognizing Names in Biomedical Texts Using Mutual Information Independence Model and SVM Plus Sigmoid. International Journal of Medical Informatics 75 (2006)
