A Case Restoration Approach to Named Entity Tagging in Degraded Documents

Rohini K. Srihari
Cymfony Inc. / State University of New York at Buffalo
[email protected]

Cheng Niu, Wei Li, Jihong Ding
Cymfony Inc.
chengniu,wei,[email protected]

Abstract

This paper describes a novel approach to named entity (NE) tagging in degraded documents. NE tagging is the process of identifying salient text strings in unstructured text corresponding to names of people, places, organizations, times/dates, etc. Although NE tagging is typically part of a larger information extraction process, it has other applications as well, such as improving search in an information retrieval system and post-processing the results of an OCR system. We focus on degraded documents, i.e. case-insensitive documents that lack orthographic information. Examples include the output of speech recognition systems, as well as e-mail. The traditional approach involves retraining an NE tagger on degraded text, a cumbersome operation. This paper describes an approach whereby text is first "restored" to its implicit case-sensitive form and subsequently processed by the original NE tagger. Results show that this new approach leads to far less performance loss in NE tagging of degraded documents.

1. Introduction

This paper focuses on the importance of named entity tagging in degraded text. Named entity (NE) tagging [1,10] is a key component of an information extraction (IE) system. NE tagging is the process of identifying salient entities in a document, such as the names of people, places and organizations, as well as dates/times and monetary expressions. For example, consider the sentence George W. Bush plans to invade Iraq by February at a cost of $51 billion. In this sentence, an NE tagger should tag George W. Bush as a person, Iraq as a country name, February as a date, and $51 billion as a monetary amount. Of course, further steps are required, such as normalizing the date to an actual calendar date, but such problems are beyond the focus of this paper. The set of tags varies depending on the domain; a biomedical domain may require tagging of proteins and chemical compounds.

NE tagging is typically the first step in an IE system designed to extract entities, relationships between entities, and key events in which entities participate. The Message Understanding Conference (MUC) [3] has helped define standards for IE. NE tagging can also play an important role in other tasks. An information retrieval (IR) application may take advantage of NE tagging to improve the precision of search, simply by using it as a sophisticated method of selecting index terms. An optical character recognition (OCR) application may use an NE tagger as a post-processing module in an effort to identify potential errors and prompt a user for manual correction; proper names are typically a major source of such errors. Finally, NE tagging is a vital component of question-answering applications [11], which are more advanced information retrieval systems.

This paper focuses on NE tagging in degraded documents. Degradation in this case refers to the lack of orthographic information. Documents that are all lowercase or all uppercase are the best examples of such degradation. Orthographic (or case) information is a key feature used in NE tagging; the absence of this feature typically results in poorer NE tagger performance. Electronic corpora used by the intelligence agencies, such as the Foreign Broadcast Information Service (FBIS), are typically all uppercase. NE tagging is a critical first step in automatically extracting relationships/events from such corpora, and hence a method of overcoming the problems due to case insensitivity is required. The output of speech recognition systems is another example of such degradation: it consists of all uppercase text with virtually no punctuation. E-mail is also a source of degradation, since users tend to type in all lowercase for convenience. With respect to OCR, source documents that are entirely uppercase (such as typewritten manuscripts) are defined as degraded. Results of OCR on mixed case could also be considered degraded, depending on how often uppercase and lowercase characters are confused.

The usual approach to the degradation problem is to train on a corpus of degraded text [5,6,8]. However, this approach has several drawbacks, including the need for multiple versions of an NE tagger. Furthermore, it does not help in cases where further IE modules, such as relationship and event detection, need to be invoked. In this paper, we propose a unique case restoration approach to the problem. This consists of using a statistical model to "restore" a degraded document to its implicit mixed-case mode. Once this is accomplished, the standard NE tagger can be applied. As mentioned previously, the case restoration module may also be applied to OCR documents in order to "restore" selective text to mixed case. This approach has several advantages, including: (i) the ability to use one NE tagger on multiple types of documents, (ii) the ease of obtaining training data for the case restoration approach versus annotated data for the retraining approach, and (iii) the ability to apply further, more sophisticated IE modules to the case-restored text. Results show that this approach incurs a loss in NE performance of only 2% on degraded documents, compared to 6% using the retraining approach. Section 2 discusses the NE tagger in more detail. Section 3 presents the case restoration algorithm. Section 4 discusses testing and benchmarking of this approach, followed by conclusions.

1. This work was supported in part by SBIR grant F30602-02-C-0156 from the Air Force Research Laboratory (AFRL)/IFEA.

Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR 2003) 0-7695-1960-1/03 $17.00 © 2003 IEEE

2. Named Entity Tagger

Traditionally, NE taggers have followed one of two major approaches: a purely statistical approach, or a purely grammatical or pattern-based approach. We have pioneered a hybrid approach to NE tagging in which both models are effectively combined [10]. Considerable research has gone into the proper sequencing of the various stages of NE tagging such that all evidence, even potentially conflicting evidence, is incorporated. The structure of the NE tagger system is shown in Figure 1. The Local Pattern Matching module contains pattern match rules for time, date, percentage, and monetary expressions. These tags include the standard MUC tags, as well as several new sub-tags such as age, duration, measurement, address, e-mail, and URL. The pattern matcher consists of CymPL grammar rules that operate on a token list data structure reflecting the various levels of natural language processing. CymPL grammars are compiled into a special type of finite state automaton that extends tree-walking automata. Subsequent modules are focused on tagging location, person, organization, product and event names.

[Figure 1 depicts the processing pipeline:
1. Local Pattern Matching (CymPL): temporal and numerical expressions
2. Lexicon Recognition
3. Keyword-driven NE Tagging (Lexicon Agent)
4. NE Super-type Tagging (MaxEnt HMM): person, location, organization, product, and event
5. NE Sub-type Tagging (MaxEnt HMM): sub-types for person, location, organization, product, and event
6. SVO Parsing (CymPL): syntactic groups and SVO dependency
7. SVO-supported NE Tagging (Lexicon Agent): NE error correction based on parsing structures
Each stage may produce intermediate or final results.]

Figure 1: Architecture of Hybrid NE Tagger

Local Pattern Matching is most effective for temporal (time, date, month, year, duration, etc.), numerical (money, percentage, measurements such as weight, length, etc.) and contact (address, e-mail, phone, URL, etc.) expressions, which are fairly predictable. The keyword-driven NE tagger is powerful in resolving ambiguous names. It does so by calling grammars from inside a lexicon and by leveraging Lexicon Agent global capabilities in co-occurrence checking, thereby selectively propagating NE tags to all occurrences of the phrase. The core module of the system is a maximum entropy [7] based Hidden Markov Model (HMM) which generates proper name super-types, namely person, organization, location, product, and named event. The subsequent module utilizes results from super-type tagging in order to derive sub-types, e.g. city, airport, and government from the location super-type tag. This two-step modeling ensures that NE performance for super-type tagging will not degrade even when the size of the training corpus is insufficient for accurate sub-type tagging. The Subject-Verb-Object (SVO) Parsing module decodes the logical dependency relationships between linguistic units, such as the logical subject, logical object and logical complement of a verb. Active sentences such as John loves Mary and passive sentences such as Mary is loved by John are parsed into the same underlying logical structure, i.e. S-V-O: John-love-Mary. Such SVO structures enable the checking of structural constraints in support of NE disambiguation, which goes beyond simple


contextual checking for a linear string or n-gram model. SVO-supported NE tagging can leverage structural constraints in confirming or revising NE results. For example, the rule object-of('hire') → Person can tag Kraft as a person even in cases where there is a long-distance dependency between Kraft and the key verb hire, as in Kraft, previously trained at MIT, has recently been hired. Multiple levels of NE modules can be added to the system's pipeline architecture. This permits a gradual refinement of NE results based on the depth of information extraction. For example, an NE tagging module based on Lexicon Agents has been added after SVO semantic parsing. Each level resolves ambiguity to the extent possible given the evidence available at that level. A weighting scheme, which is critical in coordinating multiple levels of NE evidence, has been implemented to generate the final NE tag and associated confidence. This method of NE tagging using "deferred decisions" is unique to our NE tagger.
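The structural rule above can be mimicked in a few lines. The sketch below assumes SVO triples have already been produced by a parser; the rule table and function names are hypothetical illustrations, not the system's Lexicon Agent API:

```python
# Hypothetical SVO-supported correction rules: the logical object of certain
# verbs must be a person, regardless of linear distance in the surface string.
OBJECT_RULES = {"hire": "PERSON", "marry": "PERSON"}

def svo_correct(triples, ne_tags):
    """Revise NE tags using (subject, verb, object) dependency triples.

    triples: list of (subject, verb, object) strings from an SVO parser
    ne_tags: dict mapping entity string -> current NE tag
    """
    revised = dict(ne_tags)
    for subject, verb, obj in triples:
        required = OBJECT_RULES.get(verb)
        if required is not None:
            revised[obj] = required  # structural evidence overrides the n-gram guess
    return revised
```

For the paper's example, the triple ("they", "hire", "Kraft") retags Kraft from, say, ORGANIZATION to PERSON even when the verb and its logical object are far apart in the surface string.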

3. Case Restoration Approach

The case restoration approach is based on the theory of a source-channel model [9]. It is assumed that the original text, which is case sensitive, goes through a noisy channel (such as a speech recognizer) which degrades the text. The source-channel model provides a theoretical and practical way of recovering the original text based on observations of the corrupted (case-insensitive) text. Case restoration is, by nature, a problem at the lexical level; syntactic structures seem to be of no particular help. In [12], both N-gram context and long-distance co-occurrence evidence were used in order to achieve the best performance in accent restoration, and a similar result was predicted for case restoration. However, we observe that the majority of case restoration is captured by N-grams alone, so a simple bi-gram Hidden Markov Model was selected as the language model to capture the phenomena. For further performance enhancement, we may include co-occurrence evidence in the future. Currently, the model is trained on a normal, case-sensitive text file in the chosen domain.

Three orthographic tags are defined in this model: (i) initial uppercase followed by lowercase, (ii) all lowercase, and (iii) all uppercase. Given a word sequence W = w_0 w_1 w_2 ... w_n, the goal of the case restoration task is to find the optimal tag sequence T = t_0 t_1 t_2 ... t_n which maximizes the conditional probability Pr(T | W). By Bayes' rule, this is equivalent to maximizing the joint probability Pr(W, T). This joint probability can be computed by the bi-gram HMM as follows:

    Pr(T, W) = ∏_i Pr(w_i, t_i | w_{i-1}, t_{i-1})

and the backoff model is as follows:

    Pr(w_i, t_i | w_{i-1}, t_{i-1}) = λ_1 P_0(w_i, t_i | w_{i-1}, t_{i-1}) + (1 - λ_1) Pr(w_i | t_i, t_{i-1}) Pr(t_i | w_{i-1}, t_{i-1})
    Pr(w_i | t_i, t_{i-1}) = λ_2 P_0(w_i | t_i, t_{i-1}) + (1 - λ_2) Pr(w_i | t_i)
    Pr(t_i | w_{i-1}, t_{i-1}) = λ_3 P_0(t_i | w_{i-1}, t_{i-1}) + (1 - λ_3) Pr(t_i | w_{i-1})
    Pr(w_i | t_i) = λ_4 P_0(w_i | t_i) + (1 - λ_4) (1/V)
    Pr(t_i | w_{i-1}) = λ_5 P_0(t_i | w_{i-1}) + (1 - λ_5) P_0(t_i)

where V denotes the size of the vocabulary, the backoff coefficients λ are determined using the Witten-Bell smoothing algorithm, and the quantities P_0(w_i, t_i | w_{i-1}, t_{i-1}), P_0(w_i | t_i, t_{i-1}), P_0(t_i | w_{i-1}, t_{i-1}), P_0(w_i | t_i), and P_0(t_i | w_{i-1}) are maximum likelihood estimates.

4. Testing

A corpus of 965KB was used for testing NE tagging performance (Table 1 and Table 2), using an automatic scorer following MUC standards. Each table shows precision, recall and f-measure (the harmonic mean of precision and recall) for various categories of NE tags. Table 1 shows the baseline performance obtained on a case-sensitive corpus. For the purposes of evaluating the case restoration approach, this should be considered the gold standard. All subsequent degradation is reported with respect to the overall 89.4% f-measure in Table 1. Table 2 shows the results obtained for the same corpus transformed to all uppercase, with case restoration applied. The overall f-measure for the case-restored corpus is only 2% less than that for the original case-sensitive corpus. This is the least degradation of NE performance reported in the literature.
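To make the Section 3 model concrete, here is a minimal, self-contained sketch of HMM-based case restoration. It keeps the three orthographic tags and the Viterbi search over tag bigrams, but replaces the paper's Witten-Bell backoff with a crude count floor; the start state and smoothing constants are assumptions, so this is an illustration of the idea rather than the actual system:

```python
import math
from collections import defaultdict

TAGS = ["CAP", "LOW", "UPP"]  # initial-uppercase, all-lowercase, all-uppercase

def tag_of(word):
    """Orthographic tag of a word in case-sensitive training text."""
    if len(word) > 1 and word.isupper():
        return "UPP"
    if word[0].isupper():
        return "CAP"
    return "LOW"

def apply_tag(word, tag):
    """Re-case a lowercased word according to its predicted tag."""
    return {"CAP": word.capitalize(), "LOW": word.lower(), "UPP": word.upper()}[tag]

def train(sentences):
    """Collect tag-transition and word-emission counts from normal text."""
    trans = defaultdict(lambda: defaultdict(int))
    emit = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        prev = "LOW"  # assumed start state
        for word in sent.split():
            tag = tag_of(word)
            trans[prev][tag] += 1
            emit[tag][word.lower()] += 1
            prev = tag
    return trans, emit

def _logp(counts, key, floor=0.1, slots=50):
    # crude floor smoothing in place of the paper's Witten-Bell backoff
    return math.log((counts.get(key, 0) + floor) / (sum(counts.values()) + floor * slots))

def restore_case(text, trans, emit):
    """Viterbi-decode the most likely tag sequence and re-case the text."""
    words = text.lower().split()
    best = {t: (_logp(trans["LOW"], t) + _logp(emit[t], words[0]), [t]) for t in TAGS}
    for word in words[1:]:
        best = {t: max((best[p][0] + _logp(trans[p], t) + _logp(emit[t], word),
                        best[p][1] + [t]) for p in TAGS)
                for t in TAGS}
    _, path = max(best.values())
    return " ".join(apply_tag(w, t) for w, t in zip(words, path))
```

Trained on a handful of mixed-case sentences, the decoder recovers capitalization for words it has seen, e.g. restoring "john smith lives in new york" to "John Smith lives in New York"; a production model would add the full backoff chain and a much larger training corpus.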


Table 1: Named Entity Tagging for Case Sensitive Input (Baseline)

Type          Precision  Recall  F-Measure
TIME          79.3%      83.0%   81.1%
DATE          91.1%      93.2%   92.2%
MONEY         81.7%      93.0%   87.0%
PERCENT       98.8%      96.8%   97.8%
LOCATION      85.7%      87.8%   86.7%
ORGANIZATION  89.0%      87.7%   88.3%
PERSON        92.3%      93.1%   92.7%
Overall       89.1%      89.7%   89.4%

Table 2: Named Entity Tagging for Case Insensitive Input, Using Case Restoration

Type          Precision  Recall  F-Measure
TIME          78.4%      82.1%   80.2%
DATE          91.0%      93.1%   92.0%
MONEY         81.6%      92.7%   86.8%
PERCENT       98.8%      96.8%   97.8%
LOCATION      84.5%      87.7%   86.1%
ORGANIZATION  84.4%      83.7%   84.1%
PERSON        91.2%      91.5%   91.3%
Overall       86.8%      87.9%   87.3%

Without case information, the NE statistical model has to rely mainly on keyword-based features, which calls for a much larger training corpus. This is the knowledge bottleneck for all NE systems adopting the feature exclusion approach, because manual annotation of a large corpus is expensive, slow and error-prone. In order to overcome this bottleneck, [2] proposed augmenting the NE training corpus with machine-tagged case-sensitive documents. This approach still requires retraining of the NE module, but it improves the model due to the increased training size, and it reported better performance than previous feature exclusion efforts, with only 3-4% performance degradation. However, due to the noise introduced by tagging errors, the training corpus can only be augmented by a small fraction (1/8 to 1/5) with positive effect, so the knowledge bottleneck is still present. We also ran a baseline benchmark (not involving case restoration), shown in Table 3, and a further benchmark using the simple retraining approach, shown in Table 4. A comparison of Table 3 with Table 1 shows that NE performance degrades by 51% without case restoration, compared to only about 2% with it.

Table 3: Named Entity Tagging for Case Insensitive Input, Using Baseline System

Type          Precision  Recall  F-Measure
TIME          54.8%      53.8%   54.3%
DATE          86.1%      86.3%   86.2%
MONEY         65.3%      73.8%   69.3%
PERCENT       93.4%      90.0%   91.7%
LOCATION      87.8%      13.2%   23.0%
ORGANIZATION  37.6%      0.7%    1.4%
PERSON        44.7%      1.2%    2.3%
Overall       82.8%      25.0%   38.4%

A comparison of Table 4 and Table 1 shows that the NE performance degradation using the retraining approach is 6.26%, which is still larger than that of the case restoration approach.

Table 4: Named Entity Tagging for Case Insensitive Input, Using Retraining

Type          Precision  Recall  F-Measure
TIME          74.5%      74.5%   74.5%
DATE          90.7%      91.4%   91.1%
MONEY         80.5%      92.1%   85.9%
PERCENT       98.8%      96.8%   97.8%
LOCATION      89.6%      84.2%   86.8%
ORGANIZATION  75.6%      62.8%   68.6%
PERSON        84.2%      87.1%   85.7%
Overall       85.8%      80.7%   83.1%

Finally, we also carried out some initial experiments applying the NE tagger to a corpus of documents generated by an OCR system. Precision ranged from a high of 96% on date/time NEs to a low of ~40% for organization names. Proper name strings had a precision of 86%, which is promising in terms of using this technique for semi-automated OCR post-processing.2 Organization names (ORG) tend to be the most domain specific, and many of the spurious ORG tags were caused by OCR errors. Recall, on the other hand, was fairly high (above 90%) for all NE categories except the PERSON category. We are in the process of experimenting with case restoration on this test set, and initial results have been promising. For example, the following reflects erroneous output from the NE tagger based on the original OCR output: 16 East 16 th Street Ne w York. Due to OCR errors in recognizing the "w" as an uppercase "w", the location tag has been split into two location tags. By applying case restoration to the OCR output first, we obtain the following correct output: 16 East 16 th Street NE W York.

2. Not all tagged persons, organizations, and locations are proper names. For example, "Lon" is tagged as a person, though it is not a proper name. "M.D." is tagged as an organization, though not a proper name.
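As a quick arithmetic check on the benchmark tables above, the reported f-measures follow from the harmonic-mean formula F = 2PR/(P + R); applying it to the overall rows of Tables 1 and 2 reproduces the roughly 2% degradation quoted for case restoration:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall, as used in the MUC-style scorer."""
    return 2 * precision * recall / (precision + recall)

# Overall precision/recall from Table 1 (baseline) and Table 2 (case restored)
baseline = f_measure(89.1, 89.7)   # ~89.4, matching Table 1
restored = f_measure(86.8, 87.9)   # ~87.3, matching Table 2
degradation = baseline - restored  # ~2.1 points of f-measure
```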

5. Conclusions

This paper has discussed a case restoration approach to the problem of NE tagging on degraded documents. Applications of NE tagging beyond information extraction, e.g. OCR post-processing, have also been outlined. The results show that the case restoration approach has superior performance to the retraining approach; several additional benefits of this approach have also been discussed. Future work will involve more benchmarking in the OCR (including handwritten text) domain. We also intend to apply this technique to "restore" case in e-mail, followed by NE tagging. We plan to extend the case restoration module to include general language restoration. For example, abbreviations in e-mail should be restored to their proper lexical forms, and punctuation should be restored wherever possible. The latter is especially necessary in spoken document databases representing the output of a speech recognizer.

6. References

[1] Bikel, Daniel M., Richard Schwartz, & Ralph M. Weischedel. 1999. An Algorithm that Learns What's in a Name. Machine Learning, Vol. 1,3, pp. 211-231.

[2] Chieu, H.L. & H.T. Ng. 2002. Teaching a Weaker Classifier: Named Entity Recognition on Upper Case Text. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia.

[3] Chinchor, N. & E. Marsh. 1998. MUC-7 Information Extraction Task Definition (version 5.1). Proceedings of MUC-7.

[4] Krupka, G.R. & K. Hausman. 1998. IsoQuest Inc.: Description of the NetOwl (TM) Extractor System as Used for MUC-7. Proceedings of MUC-7.

[5] Kubala, Francis, Richard Schwartz, Rebecca Stone & Ralph Weischedel. 1998. Named Entity Extraction from Speech. Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop.

[6] Palmer, David D., Mari Ostendorf, & John D. Burger. 2000. Robust Information Extraction from Automatically Generated Speech Transcriptions. Speech Communications, Vol. 32, pp. 95-109.

[7] Ratnaparkhi, A. 1998. Maximum Entropy Models for Natural Language Ambiguity Resolution. PhD thesis, University of Pennsylvania.

[8] Robinson, Patricia, et al. 1999. Overview: Information Extraction from Broadcast News. Proceedings of the DARPA Broadcast News Workshop, Herndon, Virginia.

[9] Roukos, Salim. 1996. Language Representation. In Survey of the State of the Art in Human Language Technology, National Science Foundation; also available at http://cslu.cse.ogi.edu/HLTsurvey/HLTsurvey.html.

[10] Srihari, R.K., C. Niu, & W. Li. 2000. A Hybrid Approach for Named Entity and Sub-Type Tagging. Proceedings of ANLP 2000, Seattle, pp. 247-254.

[11] Srihari, R. & W. Li. 2000. A Question Answering System Supported by Information Extraction. Proceedings of ANLP 2000, Seattle, pp. 166-172.

[12] Yarowsky, David. 1994. A Comparison of Corpus-Based Techniques for Restoring Accents in Spanish and French Text. 2nd Workshop on Very Large Corpora, pp. 319-324.
