Domain Adaptive Information Extraction From Text
Robert Arens
Advisor: Marc Light

An information extraction system is designed to operate over a specific domain, and cannot perform well on a new domain without being adapted to it. We will investigate the problem of adapting information extraction systems to new domains by first defining the task of information extraction and giving an example of an information extraction system. We will then outline the modules comprising a typical system designed to solve the task and comment on their impact on domain adaptation. Previous work on adapting systems to new domains will be reviewed, along with work by the author. We will conclude with open problems and future areas of research in this field.

1. Information Extraction

1.1 Defining Information Extraction

Information Extraction (IE) can be generally thought of as the creation of a structured representation of specific facts from unstructured or semi-structured natural language text documents (Grishman 1997, Freitag 1998). We say "generally" because the definition of IE changes subtly from author to author. The Message Understanding Conferences (MUC) (Chinchor 1998) define the IE task as extracting information for a set of predefined fields in a template corresponding to some desired information relationship from a set of documents. Cowie and Wilks (2000) define it as selectively structuring and combining data found in one or more texts. Stevenson (2004) defines it simply as the completion of two subtasks: named-entity (NE) tagging and relation extraction. Given a sentence like "The leftist guerrillas bombed the neighboring American and French embassies in Lima.", these three definitions might result in systems which extract very different information.

A system based on the MUC definition might define a template with three fields: an occurrence of a bombing, who did it, and who the victim was, with each field having one slot to fill. This would generate two pieces of information: guerrillas bombed the American embassy, and guerrillas bombed the French embassy. A new template containing a fourth field for where the bombing took place would let us know these attacks took place in Lima. Cowie and Wilks might find the data "guerrillas", "American embassy", "French embassy", and "Lima", and use the terms "attacked" and "neighboring" to combine it. It could be structured any number of ways: guerrillas made attacks in Lima, guerrillas attacked the American embassy, the American embassy is next to the French embassy, etc. Stevenson would find the same data as Cowie and Wilks as named entities, and put them in relationships: guerrillas and Lima are in the relationship "bombing activity", France and Lima are in the relationship "embassy in", etc.

Regardless of definition, the motivation for IE systems is concrete: given vast and expanding sets of text containing information useful to some person or task, and finite human resources to get at that information, technologies to bridge the gap are necessary. For example, the MEDLINE database of medical article references and abstracts contains over 13 million entries, with up to 3,500 entries added per day (MEDLINE Fact Sheet 2005). The technologies used to bridge this IE gap are largely based in the fields of natural language processing (NLP) and machine learning (ML). NLP technologies attempt to understand and generate natural human language (Jurafsky and Martin 2000); ML allows systems to improve automatically with experience (Mitchell 1997). A field similar to IE, and also based largely on both NLP and ML, is information retrieval (IR), which deals with the representation, storage, organization of, and access to information items (Baeza-Yates and Ribeiro-Neto 1999).
IR is helpful but not sufficient for filling the IE gap. While a good IR system will aid in finding documents relevant to the information desired, human resources will still be consumed in manually searching these relevant documents. For example, a biologist might try to find out about proteins which interact with each other. A query to MEDLINE for "protein interaction" returns over 180,000 entries, and there is no guarantee that all the abstracts are relevant or that all relevant documents have been found. IR systems will help ensure that more correct documents are found, but going through these documents will no doubt be a time-consuming task. Blaschke and Valencia (2002) introduced an IE system, SUISEKI, which does just this. Though the system is not perfect, it is a good starting point, and far preferable to reading 180,000 entries manually.

1.2 SUISEKI: A prototypical IE system

The SUISEKI system begins by taking a user query to MEDLINE and retrieving the abstracts returned by the query. The system then separates the abstracts into individual sentences, each considered separately. The sentences are split into their component words (or tokens), which are assigned parts of speech (verb, noun, adjective, etc.). The system then identifies tokens which are names of proteins. It discovers interactions between the proteins using a set of manually created rules, called "frames", corresponding to language constructions commonly used to describe protein interactions. A sentence fragment which contains two or more protein names and is syntactically similar to these frames is likely to describe an interaction between those proteins. The proteins and the words describing their interaction are then extracted for the user.
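The frame-matching core of such a pipeline can be sketched in a few lines. The protein dictionary, the interaction verbs, and the single regular-expression "frame" below are hypothetical stand-ins for SUISEKI's much larger, manually curated resources, not its actual rules.

```python
import re

# Toy protein-name dictionary and one hand-written "frame": a regular
# expression in which two candidate names flank an interaction verb.
PROTEINS = {"RAD51", "BRCA2", "TP53"}

FRAME = re.compile(
    r"\b(?P<p1>\w+)\s+(?:interacts with|binds(?: to)?|activates)\s+(?P<p2>\w+)\b"
)

def extract_interactions(sentence):
    """Return (protein, protein) pairs whose context matches the frame."""
    pairs = []
    for match in FRAME.finditer(sentence):
        p1, p2 = match.group("p1"), match.group("p2")
        if p1 in PROTEINS and p2 in PROTEINS:  # both tokens must be known names
            pairs.append((p1, p2))
    return pairs
```

A sentence such as "BRCA2 binds to RAD51 during repair" matches the frame and yields the pair (BRCA2, RAD51), while "BRCA2 binds to actin" is rejected because only one token is a known protein name.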

1.3 Adapting domains

The domain over which an IE system operates can be thought of as the genre and format of the content in documents from which information will be extracted. Examples include newswire articles, medical journal abstracts, job postings, and seminar announcements. The domain defines and constrains the way text in the documents can be accessed, how facts can be found in the documents, and what methods an IE system can use to extract the desired information. For example, the domain of seminar announcements often presents information in a standard "who, when, where" format, and a simple set of pattern-matching rules might easily extract all pertinent information. The domain of medical abstracts, on the other hand, is likely to contain information expressed in a more complex manner, and pattern-matching rules might be insufficient.

IE systems are necessarily tied to their domains (Cardie 1997, Riloff and Jones 1999). SUISEKI, for example, relies on frames which were created based on language constructions found in MEDLINE abstracts, which are likely to be quite different from those found in job postings. Creating the domain-specific parts of an IE system can require months of work from a domain expert on top of the time taken to design the system itself (Glickman and Jones 1999). It is this need for domain-dependent design that precludes the reuse of IE systems for applications in new domains. Adapting an IE system to a new domain will still require expert knowledge of the new target domain, but how much?
- Can an IE system be designed to require minimal expert knowledge?
- Is there a minimal set of domain-dependent modules in an IE system?
- Is there a way to design an IE system to ensure easy portability to new domains?
This paper will focus on the desire for IE systems which can quickly and easily overcome the difficulties of adapting to new domains. The remainder of this paper will first concern itself with the general design of an IE system, its components, and whether these components are dependent on the domain of the system, followed by some example IE systems. Previous work in domain adaptive IE systems will then be examined, as well as previous work from the author. Finally, open problems in domain adaptive IE will be discussed, along with possible avenues of research into these problems.

2. IE system modules

An IE system is composed of a number of NLP components, or modules, which make up the architecture of the system and support the IE task. Though there is no single standard makeup of an IE system's modules, the following is a typical architecture for an IE system operating on unstructured text (Cowie and Lehnert 1996, Grishman 1997, Glickman and Jones 1999, Neumann and Mazzini 1999).
- Filtering of documents
- Structural processing
-- Tokenization
-- Part of speech (POS) tagging
-- Sentence parsing / syntactic analysis
- Semantic processing
-- Named entity (NE) tagging
-- Relationship extraction

These NLP modules support the IE task by annotating input text for information at the document, structural, or semantic level. The following sections will expand upon these modules, describe the annotation they perform, and comment on how they can or cannot be adapted to new domains.

2.1 Filtering of documents

Before an IE system attempts to extract information from a document, it should know whether or not the document is relevant to its task. This is a practical step for both performance and efficiency; by eliminating irrelevant documents, the system is less likely to extract spurious information, and will not waste time on documents unlikely to produce good information. This module annotates the relevance of input documents to the IE task, an annotation whose form does not change drastically across domains; the module itself, however, is domain dependent by definition, since relevance is defined by the task and domain. This module, arguably, describes a task to be solved by an IR system. Glickman (1999) disagrees with this assessment, stating that filtering is to be done on documents which have already been retrieved by an IR system. The crux of the argument lies in at what stage filtering is done: at the document level, or on text inside the document. There exist IE systems which either explicitly (e.g. Riloff (1996)) or implicitly (e.g. Lutsky (2004)) assume that input documents have already been filtered. These systems do no filtering after this level, so an IR system is all that is required. Other systems (e.g. Marcotte et al. (2001)) compute the relevance of text inside the document before attempting IE, confirming Glickman's stance.
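A minimal filter in the spirit of this module might score documents by counts of task keywords. The keyword set and threshold below are illustrative assumptions; a deployed system would more likely use an IR engine or a trained text classifier.

```python
# Hypothetical task keywords for a protein-interaction extraction task.
KEYWORDS = {"protein", "interaction", "binding", "complex"}

def is_relevant(document, threshold=2):
    """Keep a document if it mentions at least `threshold` task keywords."""
    tokens = (t.strip(".,;:").lower() for t in document.split())
    hits = sum(1 for t in tokens if t in KEYWORDS)
    return hits >= threshold
```

An abstract mentioning proteins and binding complexes passes the filter, while an unrelated newswire sentence does not.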

2.2 Structural processing

These modules deal with processing the structure of the text from input documents. More specifically, this means analyzing the text on the lexical, morphological, and syntactic levels without regard to the meaning of the text. The annotations made by structural processing modules will not change drastically across domains.

2.2.1 Tokenization

Tokenization is the process of breaking documents into the smallest discrete units of information relevant to the task at hand. In text, this usually means breaking the text into individual words. For some domains tokenization must include extra-textual units, such as HTML tags in web pages. With the exception of extra-textual information, tokenization is widely regarded as domain independent (Neumann and Mazzini 1999), and for most domains simple rule-of-thumb tokenization independent of the domain will achieve near-human performance (Glickman 1999). However, this is not always the case. Hirschman et al. (2002) pointed out that biology-specific rules for tokenization should be taken into account in the biomedical text domain due to the peculiarities of gene names. Arens (2004) further reinforced this point by showing that tokenization in this domain was more error-prone inside any named entity, not just gene names.
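The gene-name problem can be illustrated with two toy tokenizers: a naive one that splits on whitespace and hyphens, and a domain-aware one that protects known multi-part names before splitting. The one-entry protected-name list stands in for a real biomedical lexicon.

```python
import re

def naive_tokenize(text):
    """Split on whitespace and hyphens -- a plausible newswire rule of thumb."""
    return [t for t in re.split(r"[\s\-]+", text.strip()) if t]

def bio_tokenize(text, names=("NF-kappa B",)):
    """Protect known multi-part biomedical names, then split as before."""
    protected = {}
    for i, name in enumerate(names):
        key = f"__NAME{i}__"          # placeholder immune to the split pattern
        protected[key] = name
        text = text.replace(name, key)
    tokens = [t for t in re.split(r"[\s\-]+", text.strip()) if t]
    return [protected.get(t, t) for t in tokens]
```

The naive rule shatters "NF-kappa B" into three tokens; the domain-aware rule keeps it whole, which matters for every downstream module.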

2.2.2 POS tagging

This module assigns parts of speech (noun, verb, adjective, etc.) to tokens in the document. POS taggers differ in their implementation, but all require either training on a specific corpus (e.g. Brill (1992)) or knowledge of the distribution of POS tags in a specific corpus (e.g. Cutting et al. (1992)). This module cannot be considered domain independent due to this reliance on previous training, but as with tokenization, POS tagging systems can adapt to most domains with little or no trouble. Note also that some systems, e.g. Bikel (1997), do not use POS tagging, bypassing this module.

2.2.3 Sentence parsing / syntactic analysis

This module acts at the sentence level, and can perform many different kinds of analysis based on the needs of the system. Most systems will require the parsing of noun and verb phrases, as these will most likely contain the pieces of information to be extracted and their relationships. Systems such as Surdeanu et al. (2003) might also require sentence parsing to analyze the predicate-argument structure of the sentence. This module is domain dependent (Briscoe 1996). Syntactic analysis and parsing usually rely on a grammar constructed manually or automatically (Brill 1993) which cannot be generally applied to a new domain. While existing parsers such as FASTUS (Hobbs et al. 1997) and the Link Grammar Parser (Grinberg et al. 1995) work well for domains such as newswire text, they require customization for use in domains such as biomedical text (Ding et al. 2003).

2.3 Semantic processing

Semantic processing of text uses the structure of the text along with domain- and task-specific knowledge to discover meaningful items and relationships in text. Because these modules deal with the meaning of the input texts, their annotations will change drastically across domains.

2.3.1 NE tagging

NE tagging is the task of identifying proper names and quantities of interest in a document (Chinchor 1998). NEs can be names of people, companies, genes, products, etc. Once identified, these entities are given one or more semantic labels based on the type of information to be extracted. This module is critical to the IE process, so much so that it is sometimes considered an IE task in and of itself, as in the MUC tasks. NE tagging relies on linguistic analysis to determine which tokens are likely to be named entities, but also requires domain-specific information to identify which tokens are truly names. Often, the linguistic analysis must also be trained to the domain, as in Tanabe and Wilbur (2002). This module is domain dependent, and no methodology for domain independent NE tagging exists. Section 3 will show, however, that NE taggers can be adapted to new domains with relative ease.

2.3.2 Relationship extraction

Once names have been found, the relationships between them need to be ascertained. This can be done in a number of ways depending on the IE task. Relationships can be found using syntactic relationships involving NEs. For example, Niu et al. (2004) constructed a system which notes that structures such as a place-derived adjective modifying a noun corresponding to an organization indicate the location of that organization, e.g. "Redmond-based Microsoft Corp." A "bag of words" approach can find relationships simply by looking for keywords indicating relationships along with NEs, as in Albert et al. (2003), who found protein interactions using a dictionary of protein names and terms associated with protein interaction, and extracted sentences with a "trioccurrence" of two proteins and one interaction term. Wong (2001) and Ono et al. (1999) used a combination of these methods, looking for instances of trioccurrences in some syntactically relevant structure, e.g. "protein_name interacts with protein_name". As this module is task dependent, speaking of its adaptability to new domains is moot, since moving to a new domain implies a new task.
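The bag-of-words trioccurrence test can be sketched as follows, with toy dictionaries standing in for the curated protein-name and interaction-term resources of a real system such as Albert et al. (2003):

```python
# Illustrative stand-ins for the two dictionaries.
PROTEIN_NAMES = {"RAD51", "BRCA2"}
INTERACTION_TERMS = {"interacts", "binds", "activates", "inhibits"}

def trioccurrences(sentences):
    """Return sentences containing two protein names plus an interaction term."""
    hits = []
    for sentence in sentences:
        tokens = {t.strip(".,;").lower() for t in sentence.split()}
        proteins = {p for p in PROTEIN_NAMES if p.lower() in tokens}
        if len(proteins) >= 2 and tokens & INTERACTION_TERMS:
            hits.append(sentence)
    return hits
```

A sentence mentioning two known proteins and an interaction verb is kept; a sentence mentioning only one protein is discarded, no matter how relevant it looks otherwise.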

2.4 Error propagation

It is important to note that modules in an IE system build upon one another: relationship extraction relies on named entity tagging, which relies on parsing, etc. An error at any module will propagate to other modules, introducing the same, if not more, error into the system. Error in this case refers to both facts that the IE system failed to extract (false negatives) and spurious facts extracted by the system (false positives). This is true in general for IE systems, but is especially important when porting a system to a new domain. Though we have asserted that all modules of an IE system are domain dependent, there have been some caveats; for example, we acknowledge that some authors maintain that tokenization is domain independent. This may be true for the domains that these authors have studied, but the introduction of a domain that is a counterexample, such as medical journal articles, may cause error propagation that severely degrades system performance. An example can be found in the protein interaction task. Suppose one wishes to find proteins which interact with a certain widely studied protein, NF-kappa B. If the tokenization module of an IE system splits this protein into two tokens, "NF-kappa" and "B", it may never be discovered as the protein "NF-kappa B" by a named entity tagger. A bag-of-words relationship extractor (as in Albert (2003)) would then miss sentences containing NF-kappa B, introducing extraction error for every sentence missed from this single systematic tokenization error. Because of this error propagation, it is important that all modules in an IE system be either adapted to a new target domain, or evaluated on the new domain to ensure that any error they introduce is minimal.

3. Adapting Domains

3.1 Previous Work

Previous work in domain adaptive IE systems can be grouped according to how much human input is required to adapt the system to a new domain: supervised systems, unsupervised systems, and bootstrapped systems.

3.1.1 Supervised systems

These methods revolve around finding patterns for extraction from training text that requires some human IE annotation, and are called supervised learning methods due to their reliance on human guidance. Early work in domain adaptation focused on the automatic collection of domain knowledge dictionaries, which could then be used for extraction, or leveraged into other extraction methods. Riloff (1993) introduced AutoSlog, a system to automatically generate a domain-specific dictionary for IE tasks. AutoSlog used a corpus of texts and answer keys from MUC-4 to compare filled templates with the sentences they were taken from. By comparing the linguistic parse of the sentence to a set of linguistic patterns used as a heuristic, AutoSlog built a syntactic pattern to extract noun phrases similar to the slot-fillers from the template. For example, given the slot-filler "bombed", AutoSlog might find a sentence with the phrase "public buildings were bombed" and associate the pattern passive-verb with it, extracting "public buildings" as a dictionary term.

Another domain knowledge building system was developed by Cardie (1993). Cardie's system required the user to create taxonomies of word senses and concept types for a set of training sentences. These taxonomies, along with part of speech tags, were used to create definitions comprising 39 attributes for each instance of a word in the training sentences. Machine learning methods were then used to determine which attributes were most important for the extraction of new definitions, and for the actual extraction of the definitions. A similar technique was used by Ciravegna (2001) in the LearningPinocchio (LP2) system, an adaptive IE application for generic uses (the system has been adapted for use with the seminar announcement corpus (Freitag 1998), the job announcements corpus (Califf 1998), resumés (Ciravegna 2001), etc.). LP2 required a small number of user-annotated texts tagged with items to be extracted, and a gazetteer of semantic categories for extracted items if available. The syntactic (and semantic, if a gazetteer was available) attributes of the words to be extracted and of the surrounding words were then used to induce generalized rules for extraction. An example from Ciravegna (2001) follows. In this example, "lemma" is the stemmed word (i.e. the lemma of "companies" would be "company"), "LexCat" is the part of speech of the word, "case" is the lower/upper case information of the word, "SemCat" is the semantic category of the word found in the gazetteer, and "Tag" is the tag associated with the word, if any.

Fig. 1: Example of rule creation and generalization from Ciravegna (2001)

Nymble, a named entity tagger presented by Bikel et al. (1997), took similar training data, but did not look at the context of the tagged terms. It looked instead at attributes of the tagged terms themselves, such as their case, whether they contained both numbers and letters, whether they were all capital letters, etc., and used these features in a statistical learning method (a Hidden Markov Model) to learn how to tag new text. While not intentionally created to be a domain adaptive system, Nymble learned named entity tagging based on the tags in its training data, allowing it to be ported to any domain with a proper training text.
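Nymble's reliance on surface attributes of the tagged terms can be illustrated with a small feature extractor. The feature inventory below is illustrative rather than Bikel et al.'s exact set.

```python
def shape_features(token):
    """Surface ("word shape") features of a single token."""
    has_digit = any(c.isdigit() for c in token)
    has_alpha = any(c.isalpha() for c in token)
    return {
        "all_caps": token.isupper(),                      # e.g. "IBM"
        "init_cap": token[:1].isupper() and not token.isupper(),  # e.g. "Lima"
        "has_digit": has_digit,                           # e.g. "1997"
        "alphanumeric": has_digit and has_alpha,          # e.g. "p53"
    }
```

Features like these are domain independent in form, which is precisely what let Nymble retrain on any domain for which tagged text exists.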

3.1.2 Towards less supervision

The reliance of supervised systems on human annotation of a training set can be a disadvantage, as the porting of such a system to a new domain can go no faster than the human annotating the training set; Nymble, for example, required over 100,000 words of training data. Unsupervised and weakly-supervised methods requiring less human input have also been developed in response to this information bottleneck, making them more attractive for domain porting. A further refinement of AutoSlog, AutoSlog-TS (Riloff 1996), required only training documents classified as relevant and not relevant. Instead of drawing on filled templates, AutoSlog-TS analyzed the linguistic parse of every sentence from the corpus and proposed extraction patterns for every noun phrase in the sentences; "The World Trade Center was bombed by terrorists", for example, would produce an extraction pattern candidate of the form "<noun phrase> was bombed by <noun phrase>". By statistically computing the likelihood that an extraction pattern would be used in relevant documents, extraction patterns produced by AutoSlog-TS would be retained or discarded to create the dictionary. Kushmerick et al. (1997) introduced an unsupervised system for adaptive information extraction on the Web using a technique known as wrapper induction. Using knowledge of how a web site encodes data inside of HTML tags, such as <li> for a list item or <td>...</td> for data in a table, a set of six rules for finding classes of wrappers encoding information to be extracted were defined. These rules were applied to the HTML encoding of the web site, producing wrapper patterns for the web site content.

AutoSlog-TS had performance comparable to the original AutoSlog system, showing that this weakly supervised system did just as well as its supervised counterpart. However, Cardie (1993) and LP2 far outperformed it on their own domains. Nymble's performance was well over 90%, and is considered to be near human accuracy. Kushmerick et al. (1997) point out that there is a trade-off between learning time and accuracy of wrapper induction; successfully wrapping 70% of sites surveyed requires learning time that grows exponentially as the site gets more complex. The lack of human guidance in systems with less supervision often results in a drop in performance when compared to supervised systems, and the learning time required to counteract this drop can make a system unwieldy to use.
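The simplest wrapper class, a left-right wrapper, can be sketched as follows. Here the delimiters are supplied by hand; learning them from example pages is the part Kushmerick et al.'s induction algorithm automates.

```python
def lr_extract(page, left, right):
    """Extract every string bracketed by the left/right delimiter pair."""
    items, pos = [], 0
    while True:
        start = page.find(left, pos)
        if start == -1:
            break
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            break
        items.append(page[start:end])
        pos = end + len(right)
    return items
```

Applied to an HTML list with delimiters "<li>" and "</li>", the wrapper pulls out each list item in order.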

3.1.3 Bootstrapping

To bridge this gulf between the time-consuming human input required for supervised systems and the possible degradation in performance of unsupervised systems, a compromise can be reached by bootstrapping. This involves giving the system a small amount of human guidance in the form of example extractions, or seeds, to "bootstrap" the learning process. Riloff and Jones (1999) used AutoSlog-TS along with multi-level bootstrapping for dictionary creation. Noting that most IE systems require both a semantic dictionary and a set of extraction patterns, this system created both at once. Input to the system was a set of unannotated training texts and a set of seed words from the user. Extraction pattern candidates from AutoSlog-TS were ranked by relevance to the seed words, and the best pattern was applied to the documents. The second level of bootstrapping occurred when the best extractions from this pattern were added to the seed words, and the process was repeated. Stevenson (2004) used the WordNet ontology (Fellbaum 1998) along with extraction pattern seeds for domain knowledge acquisition. Pattern seeds were constructed as subject-verb-object triples, and used semantic types to define the relations for extraction; e.g. "NAMCOMPANY + elect + NAMPERSON" would extract instances of people being elected the head of a company. Document content was then parsed into subject-verb-object components and then into semantic patterns using WordNet. Patterns from the document that were semantically similar to the seeds presumably extracted similar information, and were therefore incorporated as new patterns for extraction. Niu et al. (2004) used bootstrapping for multiple machine learning systems in their named entity tagger. The user would give the system concept-based seeds, e.g. PERSON: he, she, his, her, etc. The system would then parse a corpus of input documents into binary relationships based on their role in the sentence, e.g. subject-verb, object-verb, noun-modifier, etc. The system learned how to name entities in the corpus by finding seeds in relationships; for example, the phrase "he said" includes the PERSON "he" in a subject-verb relationship with "said", indicating that the subject of the verb "said" might be labeled as a PERSON later on. A second corpus, created from names extracted from the input corpus, was then used as training for a Hidden Markov Model for named entity tagging.

Bootstrapping methods performed well with small sets of seed data. The methods presented above performed below, but comparably to, supervised systems. Bootstrapping seems to be the most attractive method of adapting an IE system to a new domain, as it does not require as much human effort as supervised methods, and does not suffer the same performance degradation as unsupervised or weakly supervised methods.
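The loop shared by these systems, learn patterns from seeds, harvest new seeds with the patterns, repeat, can be sketched in miniature. The single-word "patterns" below are a drastic simplification of the ranked, statistically filtered extraction patterns the real systems use.

```python
def bootstrap(sentences, seeds, rounds=2):
    """Alternate pattern learning and seed harvesting over a toy corpus."""
    names = set(seeds)
    for _ in range(rounds):
        # Pattern learning: record the word that follows a known name.
        patterns = set()
        for sentence in sentences:
            tokens = sentence.split()
            for i, tok in enumerate(tokens[:-1]):
                if tok in names:
                    patterns.add(tokens[i + 1])
        # Harvesting: any word preceding a learned context word becomes a name.
        for sentence in sentences:
            tokens = sentence.split()
            for i, tok in enumerate(tokens[:-1]):
                if tokens[i + 1] in patterns:
                    names.add(tok)
    return names
```

Starting from the single seed "RAD51" in a corpus where both "RAD51" and "BRCA2" precede "interacts", the first round learns the context "interacts" and harvests "BRCA2" as a new name. Without the statistical ranking of the real systems, such a loop also drifts quickly, which is exactly why pattern scoring matters.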

3.2 Personal Work

Our current work has focused on an IE system named PBR, to be incorporated into the Machete project (Bradshaw and Light 2004). Initially patterned after Albert et al. (2003), the system took a user query to the MEDLINE system to find medical journal abstracts in a local database, from which sentences containing a trioccurrence of two protein names and an interaction term would be extracted. The system used dictionaries to find both protein names and interaction terms. Since it was developed as a web-based application, the system was designed to operate efficiently as well as accurately. PBR has since evolved into an IE system suitable for a number of IE tasks in the biomedical domain without sacrificing efficiency or accuracy, and ready for porting into other domains. The first major change in PBR was the introduction of a named entity tagging system, LingPipe (Baldwin and Carpenter 2003), to find protein names rather than using a dictionary. This system was already being used for tokenization and sentence boundary finding. Both tokenization and named entity tagging in LingPipe can operate on many domains: the system reads in a domain model before processing any text, allowing it to be tailored to any domain. We currently have models suitable for the newswire domain as well as the biomedical domain. The LingPipe named entity tagger identifies many names in biomedical text apart from proteins, including viruses, genes, molecules, etc. By allowing users to decide what names to look for, PBR can be used as a customizable online relationship extraction system. We also removed the interaction term dictionary in order to allow users to input their own interaction terms. Instead of looking for trioccurrences of two proteins and an interaction term, the user could look for trioccurrences of a gene, a protein, and an interaction term; two atoms and an interaction term; or an animal, a tissue, and an interaction term. Future versions of the PBR system will allow one or both of the named entities in a trioccurrence to be empty, facilitating IR needs such as browsing abstracts by keywords.

Adapting PBR to a new domain would require the alteration of three systems: document filtering, tokenization, and named entity finding. Document filtering is clearly necessary, as the new domain may not be comprised of medical journal abstracts. This can be accomplished through any number of existing IR techniques or systems, and would need to be tailored to the new domain. Tokenization and named entity finding can be ported in one step by acquiring or generating a new domain model for use in the LingPipe system. We are unsure of how much effort this would require, as we have not attempted it, but the functionality for creating a new model already exists in LingPipe, and this route would most likely be more efficient than implementing new tokenization and named entity modules. Since the interaction terms used to generate trioccurrences are input by the user, we can assume that these will be suitable for the target domain.

4. Open problems and future work

4.1 Porting structural processing modules across domains

We have asserted that all modules of an IE system are domain dependent. When porting an IE system to a new domain, it is necessary to port all modules to the new domain to avoid error propagation from one module to the next. The simplest way to achieve this aim would be to have preexisting modules for every domain available, and simply use these modules to build an IE system whenever one is required. If this were the case, however, domain porting would no longer be necessary. NLP modules can be roughly divided into two types: those whose annotations change little across domains, and those whose annotations change greatly across domains. Modules of the latter type, such as NE taggers and relationship extractors, will have to be ported on a case-by-case basis, since their annotations change based on the domain and task to which they will be applied. Various machine learning methods can be used to port these modules to new domains. Bootstrapping methods, as seen in section 3.1.3, have been used to solve this problem. Another approach, similar to bootstrapping, is active learning, as in Thompson et al. (1999). This method involves the creation of a small number of annotated examples which the learner uses to classify a large number of unannotated examples as informative or uninformative. Informative examples are then annotated by the user, and submitted for retraining.
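The selection step of active learning can be sketched as uncertainty sampling: rank unannotated examples by how close the current classifier's confidence is to chance, and hand the most uncertain ones to the annotator. The `classifier` argument is any callable returning a probability of the positive class; the length-based toy classifier in the test is purely an illustrative assumption.

```python
def select_informative(examples, classifier, k=2):
    """Return the k examples whose predicted probability is closest to 0.5,
    i.e. those the current model is least sure about."""
    return sorted(examples, key=lambda x: abs(classifier(x) - 0.5))[:k]
```

Examples the model already classifies confidently are skipped, so annotation effort is spent only where it is informative.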

Fig. 2: Porting structural processing modules to a new domain, including error correction.

Modules whose annotations do not vary greatly across domains, however, can be ported to a new domain as a unit. Modules of this kind include tokenizers, POS taggers, syntactic analyzers, coreference engines, etc. Porting would be done by first collecting a set of modules from a number of existing IE systems, each operating on one of a number of domains. These modules would then be evaluated on the new domain, and the modules with the best performance would be selected. To improve performance and reduce the number of errors introduced by moving the existing modules to a new domain, correction rules (as in Ciravegna (2001)) would be induced to improve the annotations of each module. Induction of correction rules would be done concurrently with domain evaluation, and the rules would be saved for future use or incorporated into existing modules.
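The evaluate-and-select step might be sketched as follows, with `candidates` mapping hypothetical module names to callables and `gold` holding annotated examples from the new domain; plain accuracy stands in for whatever metric a real evaluation would use.

```python
def select_best_module(candidates, gold):
    """Score each candidate module on gold annotations for the new domain
    and return the name of the best performer."""
    def accuracy(module):
        correct = sum(1 for text, expected in gold if module(text) == expected)
        return correct / len(gold)
    return max(candidates, key=lambda name: accuracy(candidates[name]))
```

The same harness could drive correction-rule induction: disagreements between a module's output and the gold annotations are exactly the cases a correction rule must cover.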

4.2 Evaluation standards

There currently exists no independent standard for evaluating the portability of an IE system. As Lavelli et al. (2004) point out, IE has a long history of evaluation, and we believe that domain adaptive IE should not be left behind. Without some standard of evaluation, it is not possible to compare one system with another. For example, Niu et al. (2004) do not evaluate their system on more than one domain, and give no indication of how much work was involved in adapting it. Ciravegna (2001) evaluates his system on two IE data sets, as do Neumann and Mazzini (2001), but on two different data sets. A meaningful standard methodology for evaluating IE systems ported to new domains must include two key features:

- Evaluation of the system on a number of domains. Popular IE testbed domains, such as job postings and seminar announcements, should be used along with domains that have proven difficult, such as medical journal abstracts.

- An estimate of the amount of work needed to port the system to a new domain. This should include a description of the tasks involved, the degree of domain expertise needed to complete these tasks, and an example of the number of person-hours required to port from one domain to another.

Evaluating IE systems across a range of both well-covered and difficult domains will give a more accurate picture of how an IE system will perform on an unknown domain. Limiting evaluation to a well-covered domain, where IE systems routinely achieve near-human performance, may give unrealistically high results which degrade sharply when the system is applied to a new domain (Arens 2004). Limiting evaluation to difficult domains may result in "nearsighted" systems which do well on these more difficult domains but are unable to perform well on domains which should prove less difficult. Since a domain adaptive system is useful only if it can be adapted to a new domain with a reasonable amount of effort, an estimate of the work required to port to a new domain is necessary for good evaluation. A domain adaptive IE system requiring many hours of annotation work or a high degree of technical knowledge for porting will not be as useful as a simpler bootstrapping system if the user does not have the time or expertise necessary, regardless of how well it performs.
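A minimal harness implementing the proposed methodology might look like the following sketch. The domain names, gold annotations, and porting-hour figures are invented for illustration; a real evaluation would score template slots against a gold standard corpus.

```python
# Sketch: evaluate one IE system on several domains, reporting F1
# alongside the estimated effort required to port to each domain.

def f1(predicted, gold):
    """Balanced F-measure over sets of extracted facts."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    p, r = tp / len(predicted), tp / len(gold)
    return 2 * p * r / (p + r) if p + r else 0.0

def evaluate(system, domains, porting_hours):
    """Run the system on each domain's documents and pair the score
    with the recorded porting effort."""
    report = {}
    for name, (docs, gold) in domains.items():
        report[name] = {"f1": f1(system(docs), gold),
                        "porting_hours": porting_hours[name]}
    return report

# A toy system that only knows the seminar-announcement domain.
toy_system = lambda docs: {"speaker: John Smith", "time: 3pm"}
domains = {"seminars": (["doc1"], {"speaker: John Smith", "time: 3pm"}),
           "medline": (["doc2"], {"interacts(p53, MDM2)"})}
hours = {"seminars": 4, "medline": 40}
report = evaluate(toy_system, domains, hours)
```

The report makes both failure modes discussed above visible at once: perfect performance on the well-covered domain, collapse on the difficult one, and the effort figure that a pure accuracy comparison would hide.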

4.3 Domain clustering

Is it the case that some domains are so similar to one another that little to no adaptation is necessary to port between them? Intuitively, this seems likely; announcements for seminars and local music acts, for example, tend to share the same "who, when, where" format. Clustering a group of domains that share such a similarity into a "super-domain" would aid domain porting if one could define a set of modules which can be applied cross-domain to any domain in the super-domain. This reduced set of modules would allow for greater system reuse when porting to a domain inside the super-domain of an existing system, and would define the set of modules to be changed when porting to a domain outside the super-domain, eliminating the guesswork of choosing which modules to replace and which to retain.
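One simple way to operationalize such clustering is to group domains by lexical overlap. The Jaccard similarity over domain vocabularies used below is an assumed stand-in for whatever domain-similarity measure proves effective; the vocabularies themselves are toy examples.

```python
# Sketch: greedy single-link grouping of domains into "super-domains"
# whenever their vocabulary overlap exceeds a threshold.

def jaccard(a, b):
    """Overlap between two vocabulary sets, in [0, 1]."""
    return len(a & b) / len(a | b)

def cluster_domains(vocab, threshold=0.5):
    """Assign each domain to the first cluster containing a
    sufficiently similar member, else start a new cluster."""
    clusters = []
    for name, words in vocab.items():
        for cluster in clusters:
            if any(jaccard(words, vocab[m]) >= threshold for m in cluster):
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

vocab = {"seminars": {"who", "when", "where", "talk"},
         "concerts": {"who", "when", "where", "band"},
         "medline": {"protein", "gene", "binding"}}
clusters = cluster_domains(vocab)
```

Under this toy measure, seminar and concert announcements fall into one "who, when, where" super-domain while medical abstracts stand apart, matching the intuition in the text.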

    5. Conclusion

Information extraction systems fill the gap between the growing volume of text containing useful information in many domains and the ability of humans to get at that information. This paper has defined the task of information extraction and outlined the modules required to solve it, noting how each affects the task of adapting an IE system to a new domain. Though domain adaptive IE systems are heterogeneous in their design, they can be grouped according to how they adapt into supervised, unsupervised or weakly supervised, and bootstrapped systems. Example systems from each of the three groups were presented, and the three groups were compared. Future work in domain adaptive IE will address open problems such as systematic approaches to evaluating domain adaptive IE systems, finding similarities across domains, and allowing every module in an IE system to adapt to new domains.

    References

Albert, S., S. Gaudan, H. Knigge, A. Raetsch, A. Delgado, B. Huhse, H. Kirsch, M. Albers, D. Rebholz-Schuhmann and M. Koegl. 2003. "Computer-Assisted Generation of a Protein-Interaction Database for Nuclear Receptors". Molecular Endocrinology, 17: 8, pp. 1555-1567.

Arens, R. 2004. "A Preliminary Look into the Use of Named Entity Information for Bioscience Text Tokenization". In Proceedings of HLT/NAACL 2004: Companion Volume.

    Baeza-Yates, R. and B. Ribeiro-Neto. 1998. Modern Information Retrieval. Boston: Addison-Wesley.

Baldwin, B. and B. Carpenter. 2003. Alias-i LingPipe software. http://www.alias-i.com/lingpipe

    Bikel, D., S. Miller, R. Schwartz and R. Weischedel. 1997. "Nymble: a High Performance Learning Namefinder". In Proceedings of the Conference on Applied Natural Language Processing 1997.

    Blaschke, C. and A. Valencia. 2002. "The frame-based module of the SUISEKI information extraction system". IEEE Intelligent Systems, 17, pp. 14-20.

Bradshaw, S. and M. Light. 2004. "Knowledge Management and Text Mining for Bioscience Literature Search". Invited Talk, National Science Foundation Workshop on Insights in Protist Evolutionary Biology.

    Brill, E. 1992. A Corpus-Based Approach to Language Learning. PhD Dissertation, University of Pennsylvania.

    Brill, E. 1993. "Automatic grammar induction and parsing free text: a transformation-based approach". In Proceedings of the ACL 1993.

    Briscoe, T. 1996. "Robust Parsing". In R. Cole, J. Mariani, H. Uszkoreit, A. Zaenen, and V. Zue (eds.) Survey in the State of the Art in Human Language Technology. Cambridge, UK: Cambridge University Press.

    Califf, M. 1998. Relational Learning Techniques for Natural Language Information Extraction. Ph.D. Dissertation, University of Texas at Austin.

    Cardie, C. 1993. "A Case-Based Approach to Knowledge Acquisition for Domain-Specific Sentence Analysis". In Proceedings of the Eleventh National Conference on Artificial Intelligence (AAAI-93).

    Cardie, C. 1997. "Empirical Methods in Information Extraction". AI Magazine, 18: 4, pp. 65-79.

    Chinchor, N. 1998. "Overview of MUC-7/MET-2". In Proceedings of the Seventh Message Understanding Conference (MUC-7).

    Ciravegna, F. 2001. "Adaptive information extraction from text by rule induction and generalisation". In Proceedings of 17th International Joint Conference on Artificial Intelligence (IJCAI-01).

Cowie, J. and W. Lehnert. 1996. "Information Extraction". Communications of the ACM, January, pp. 80-91.

Cowie, J. and Y. Wilks. 2000. "Information Extraction". In R. Dale, H. Moisl and H. Somers (eds.) Handbook of Natural Language Processing. New York: Marcel Dekker.

    Cutting, D., J. Kupiec, J. Pedersen and P. Sibun. 1992. "A Practical Part-of-Speech Tagger". In Proceedings of the Third Conference on Applied Natural Language Processing.

    Ding, J., D. Berleant, J. Xu and A. Fulmer. 2003. "Extracting Biochemical Interactions from MEDLINE Using a Link Grammar Parser". In Proceedings of the 15th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'03).

    Fellbaum, C. 1998. WordNet: An Electronic Lexical Database. Cambridge: MIT Press.

    Freitag, D. 1998. Machine Learning for Information Extraction in Informal Domains. Ph.D. Dissertation, Carnegie Mellon University.

Friedman, C., P. Kra, M. Krauthammer, H. Yu, and A. Rzhetsky. 2001. "GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles". Bioinformatics, 17, S74-82.

    Glickman, O. and R. Jones. 1999. "Examining Machine Learning for Adaptable End-to-End Information Extraction Systems." In AAAI 1999 Workshop on Machine Learning for Information Extraction.

    Grinberg, D., J. Lafferty and D. Sleator. 1995. "A robust parsing algorithm for link grammars". In Proceedings of the Fourth International Workshop on Parsing Technologies.

    Grishman, R. 1997. "Information Extraction: Techniques and Challenges". SCIE, pp. 10–27.

    Hirschman, L., A. Morgan, and A. Yeh. 2002. "Rutabaga by any other name: extracting biological names". Journal of Biomedical Informatics, 35: 4, pp. 247-259.

    Hobbs, J., D. Appelt, J. Bear, D. Israel, M. Kameyama, M. Stickel, and M. Tyson. 1997. "FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text", in E. Roche and Y. Schabes (eds.) Finite State Devices for Natural Language Processing. Cambridge: MIT Press.

    Jurafsky, D. and J. Martin. 2000. Speech and Language Processing. Upper Saddle River, New Jersey: Prentice-Hall.

    Kushmerick, N. 1997. Wrapper Induction for Information Extraction. PhD thesis, University of Washington.

    Lavelli, A., M. E. Califf, F. Ciravegna, D. Freitag, C. Giuliano, N. Kushmerick, and L. Romano. 2004. "IE Evaluation: Criticisms and Recommendations". In Proceedings from ATEM-2004: The AAAI-04 Workshop on Adaptive Text Extraction and Mining.

    Lutsky, P. 2004. "Lexical Semantics Domain Model for Information Extraction". In Proceedings from ATEM-2004: The AAAI-04 Workshop on Adaptive Text Extraction and Mining.

MEDLINE Fact Sheet. 2005. Retrieved March 8, 2005 from http://www.nlm.nih.gov/pubs/factsheets/medline.html

    Marcotte, E, I. Xenarios, and D. Eisenberg. 2001. "Mining literature for protein-protein interactions". Bioinformatics, 17: 4, pp. 359-363.

    Mitchell, T. 1997. Machine Learning. New York: McGraw-Hill.

    Neumann, G. and G. Mazzini. 1998. Domain-Adaptive Information Extraction. Technical report, DFKI, Saarbrucken.

    Niu, C., W. Li and R. Srihari. 2004. "A Bootstrapping Approach to Information Extraction Domain Porting". In Proceedings from ATEM-2004: The AAAI-04 Workshop on Adaptive Text Extraction and Mining.

Ono, T., H. Hishigaki, A. Tanigami, T. Takagi. 1999. "Automatic Extraction of Protein-Protein Interactions from Scientific Literature". In Proceedings of The Tenth Workshop on Genome Informatics.

    Pan, H., L. Zuo, V. Choudhary, Z. Zhang, S. Leow, F. Chong, Y. Huang, V. Ong, B. Mohanty, S. Tan, S. Krishnan, and V. Bajic. 2004. "Dragon TF Association Miner: a system for exploring transcription factor associations through text-mining". Nucleic Acids Research, 32, pp. 230-234.

Riloff, E. 1993. "Automatically Constructing a Dictionary for Information Extraction Tasks". In Proceedings of the Eleventh National Conference on Artificial Intelligence (AAAI-93).

Riloff, E. 1996. "Automatically Generating Extraction Patterns from Untagged Text". In Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI-96).

Riloff, E. and R. Jones. 1999. "Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping". In Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99).

    Stevenson, M. 2004. "An Unsupervised WordNet-based Algorithm for Relation Extraction". In Proceedings of the Fourth International Conference on Language Resources and Evaluation Workshop Beyond Named Entity: Semantic Labelling for NLP tasks.

    Surdeanu, M., S. Harabagiu, J. Williams, and P. Aarseth. 2003. "Using predicate-argument structures for information extraction". In Proceedings of the ACL 2003.

    Tanabe, L. and W. Wilbur. 2002. "Tagging Gene and Protein Names in Full Text Articles". In Proceedings of the Workshop on Natural Language Processing in the Biomedical Domain.

    Thompson, C., M. Califf, and R. Mooney. 1999. "Active Learning for Natural Language Parsing and Information Extraction". In Proceedings of the Sixteenth International Machine Learning Conference.

Wong, L. 2001. "PIES: A Protein Interaction Extraction System". In Proceedings of the Pacific Symposium on Biocomputing.