EACL-2006
11th Conference of the European Chapter of the Association for Computational Linguistics
Proceedings of the Workshop on
Adaptive Text Extraction and Mining (ATEM 2006)
April 4, 2006 Trento, Italy
The conference, the workshop and the tutorials are sponsored by:
Celct c/o BIC, Via dei Solteri, 38 38100 Trento, Italy http://www.celct.it
Xerox Research Centre Europe 6 Chemin de Maupertuis 38240 Meylan, France http://www.xrce.xerox.com
CELI s.r.l. Corso Moncalieri, 21 10131 Torino, Italy http://www.celi.it
Thales 45 rue de Villiers 92526 Neuilly-sur-Seine Cedex, France http://www.thalesgroup.com
EACL-2006 is supported by Trentino S.p.a.
and Metalsistem Group
© April 2006, Association for Computational Linguistics Order copies of ACL proceedings from: Priscilla Rasmussen, Association for Computational Linguistics (ACL), 3 Landmark Center, East Stroudsburg, PA 18301 USA Phone +1-570-476-8006 Fax +1-570-476-0860 E-mail:
[email protected] On-line order form: http://www.aclweb.org/
INTRODUCTION
With the explosive growth of the Web and intranets, the amount of information that is available in unstructured and semi-structured documents keeps increasing at an unprecedented rate. These terabytes of text contain valuable information for virtually every domain of activity, from education to business to counter-terrorism. However, existing tools for accessing and exploiting this data are just not effective enough to satisfy user expectations.
Recent years have brought significant interest and progress in developing techniques for the automatic extraction and mining of information from text. In contrast to the previous generation of extraction systems, which relied on (expensive and brittle) hand-written rules, most recent approaches use machine learning techniques to uncover the structure of text. To further reduce the users' burden, researchers are also investigating a variety of active, semi-supervised, and unsupervised learning algorithms that minimize the amount of labeled documents required for training.
The increasing interest in adaptive extraction and mining from texts is also demonstrated by several recent initiatives:
• The Automatic Content Extraction program, started a few years ago by NIST (www.itl.nist.gov/iad/894.01/tests/ace).
• The PASCAL challenge on Machine Learning for Information Extraction, whose aim was to assess the current state of the art, identify future challenges and foster additional research in the field (nlp.shef.ac.uk/pascal).
• Various initiatives related to the life sciences, such as BioCreAtIvE (Critical Assessment of Information Extraction systems in Biology; www.pdg.cnb.uam.es/BioLINK/BioCreative.eval.html) and the Shared Task proposed at the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (research.nii.ac.jp/~collier/workshops/JNLPBA04st.htm).
Adaptive extraction and mining from texts is an extremely active area of research that lies at the intersection of diverse fields such as information extraction, text mining, machine learning, data mining, link analysis and relationship discovery, information retrieval, natural language processing, information integration, distributed databases, and knowledge capture. Developments in any of these fields have an immediate effect on the others, so it is important to foster the free exchange of ideas among researchers that work on the various aspects of adaptive text extraction and mining. The purpose of this workshop is to bring together researchers and practitioners from all these communities, so that they can discuss recent results and foster new directions of research in the field.
The workshop builds on the success of previous workshops on the same topic at AAAI-1999, ECAI-2000, IJCAI-2001, ECML-2003, and AAAI-2004 (see www.isi.edu/info-agents/RISE/Resources.html for details). The workshop will also serve to follow up ideas discussed at the 2005 Dagstuhl Workshop on Machine Learning for the Semantic Web (www.smi.ucd.ie/Dagstuhl-MLSW), much of which focused on adaptive information extraction.
The program includes nine papers: eight come from Europe (Germany, Italy, Ireland, Spain, Sweden, The Netherlands, United Kingdom) and one from Japan/USA. Each paper was reviewed by at least two reviewers. We thank the reviewers for their cooperation in the reviewing process, especially considering the very short interval between submission and notification.
Fabio Ciravegna, Claudio Giuliano, Nicholas Kushmerick, Alberto Lavelli, Ion Muslea
February 2006
ORGANISING COMMITTEE
Fabio Ciravegna (University of Sheffield, UK) Claudio Giuliano (ITC-irst, Italy) Nicholas Kushmerick (University College Dublin, Ireland) Alberto Lavelli (ITC-irst, Italy) Ion Muslea (Language Weaver, USA)
PROGRAMME COMMITTEE
Mary Elaine Califf (Illinois State University, USA) Fabio Ciravegna (University of Sheffield, UK) Mark Craven (University of Wisconsin, USA) Valter Crescenzi (Universita’ Rome Tre, Italy) Walter Daelemans (University of Antwerp, Belgium) Dayne Freitag (Fair Isaac Corporation, USA) Claudio Giuliano (ITC-irst, Italy) Nicholas Kushmerick (University College Dublin, Ireland) Alberto Lavelli (ITC-irst, Italy) Ion Muslea (Language Weaver, USA) Un Yong Nahm (University of Texas at Austin, USA) Ellen Riloff (University of Utah, USA) Roman Yangarber (University of Helsinki, Finland)
Workshop Program
9:00-10:30 Session 1
  Learning Effective Surface Text Patterns for Information Extraction — Gijs Geleijnse and Jan Korst
  A Hybrid Approach for the Acquisition of Information Extraction Patterns — Mihai Surdeanu, Jordi Turmo and Alicia Ageno
10:30-11:00 Coffee Break
11:00-12:30 Session 2
  An Experimental Study on Boundary Classification Algorithms for Information Extraction using SVM — Jose Iria, Neil Ireson and Fabio Ciravegna
  Simple Information Extraction (SIE): A Portable and Effective IE System — Claudio Giuliano, Alberto Lavelli and Lorenza Romano
  Transductive Pattern Learning for Information Extraction — Brian McLernon and Nicholas Kushmerick
12:30-14:30 Lunch
14:30-16:00 Session 3
  Spotting the 'Odd-one-out': Data-Driven Error Detection and Correction in Textual Databases — Caroline Sporleder, Marieke van Erp, Tijn Porcelijn and Antal van den Bosch
  Recognition of synonyms by a lexical graph — Peter Siniakov
  Active Annotation — Andreas Vlachos
16:00-16:30 Coffee Break
16:30-18:00 Session 4
  Expanding the Recall of Relation Extraction by Bootstrapping — Junji Tomita, Stephen Soderland and Oren Etzioni
  Panel
Table of Contents
Learning Effective Surface Text Patterns for Information Extraction — Gijs Geleijnse and Jan Korst (p. 1)
Simple Information Extraction (SIE): A Portable and Effective IE System — Claudio Giuliano, Alberto Lavelli and Lorenza Romano (p. 9)
An Experimental Study on Boundary Classification Algorithms for Information Extraction using SVM — Jose Iria, Neil Ireson and Fabio Ciravegna (p. 17)
Transductive Pattern Learning for Information Extraction — Brian McLernon and Nicholas Kushmerick (p. 25)
Recognition of synonyms by a lexical graph — Peter Siniakov (p. 32)
Spotting the 'Odd-one-out': Data-Driven Error Detection and Correction in Textual Databases — Caroline Sporleder, Marieke van Erp, Tijn Porcelijn and Antal van den Bosch (p. 40)
A Hybrid Approach for the Acquisition of Information Extraction Patterns — Mihai Surdeanu, Jordi Turmo and Alicia Ageno (p. 48)
Expanding the Recall of Relation Extraction by Bootstrapping — Junji Tomita, Stephen Soderland and Oren Etzioni (p. 56)
Active Annotation — Andreas Vlachos (p. 64)
Learning Effective Surface Text Patterns for Information Extraction Gijs Geleijnse and Jan Korst Philips Research Laboratories Prof. Holstlaan 4, 5656 AA Eindhoven, The Netherlands {gijs.geleijnse,jan.korst}@philips.com
Abstract
We present a novel method to identify effective surface text patterns using an internet search engine. Precision is only one of the criteria to identify the most effective patterns among the candidates found. Another aspect is frequency of occurrence. Also, a pattern has to relate diverse instances if it expresses a non-functional relation. The learned surface text patterns are applied in an ontology population algorithm, which not only learns new instances of classes but also new instance-pairs of relations. We present some first experiments with these methods.
1 Introduction
Ravichandran and Hovy (2002) present a method to automatically learn surface text patterns expressing relations between instances of classes using a search engine. Their method, based on a training set, identifies natural language surface text patterns that express some relation between two instances. For example, "was born in" proved to be a precise pattern expressing the relation between instances Mozart (of class 'person') and 1756 (of class 'year'). We address the issue of learning surface text patterns, since we observed two drawbacks of Ravichandran and Hovy's work with respect to the application of such patterns in a general information extraction setting.
The first drawback is that Ravichandran and Hovy focus on the use of such surface text patterns to answer so-called factoid questions (Voorhees, 2004). They use the assumption that each instance is related by R to exactly one other instance of some class. In a general information extraction setting, we cannot assume that all relations are functional.
The second drawback is that the criterion for selecting patterns, precision, is not the only issue for a pattern to be effective. We call a pattern effective if it links many different instance-pairs in the excerpts found with a search engine. We use an ontology to model the information domain we are interested in. Our goal is to populate an ontology with the information extracted. In an ontology, instances of one class can be related by some relation R to multiple instances of some other class. For example, we can identify the classes 'movie' and 'actor' and the 'acts in' relation, which is a many-to-many relation. In general, multiple actors star in a single movie and a single actor stars in multiple movies.
In this paper we present a domain-independent method to learn effective surface text patterns representing relations. Since not all patterns found are highly usable, we formulate criteria to select the most effective ones. We show how such patterns can be used to populate an ontology. The identification of effective patterns is important, since we want to perform as few queries to a search engine as possible to limit the use of its services.
This paper is organized as follows. After defining the problem (Section 2) and discussing related work (Section 3), we present an algorithm to learn effective surface text patterns in Section 4. We discuss the application of this method in an ontology population algorithm in Section 5. In Section 6, we present some of our early experiments. Sections 7 and 8 handle conclusions and future work.
2 Problem description
We consider two classes cq and ca and the corresponding non-empty sets of instances Iq and Ia. Elements in the sets Iq and Ia are instances of cq and ca respectively, and are known to us beforehand. However, the sets I do not have to be complete, i.e. not all possible instances of the corresponding class have to be in the set I. Moreover, we consider some relation R between these classes and give a non-empty training set of instance-pairs TR = {(x, y) | x ∈ Iq ∧ y ∈ Ia}, which are instance-pairs that are known to be R-related.
Problem: Given the classes cq and ca, the sets of instances Iq and Ia, a relation R and a set of R-related instance-pairs TR, learn effective surface text patterns that express the relation R.
Say, for example, we consider the classes 'author' and 'book title' and the relation 'has written'. We assume that we know some related instance-pairs, e.g. ('Leo Tolstoy', 'War and Peace') and ('Günter Grass', 'Die Blechtrommel'). We then want to find natural language phrases that relate authors to the titles of the books they wrote. Thus, if we query a pattern in combination with the name of an author (e.g. 'Umberto Eco wrote'), we want the search results of this query to contain the books by this author.
The population of an ontology can be seen as a generalization of a question-answering setting. Unlike question-answering, we are interested in finding all possible instance-pairs, not only the pairs with one fixed instance (e.g. all 'author'-'book' pairs instead of only the pairs containing a fixed author). Functional relations in an ontology correspond to factoid questions, e.g. the population of the classes 'person' and 'country' and the 'was born in' relation. Non-functional relations can be used to identify answers to list questions, for example "name all books written by Louis-Ferdinand Céline" or "which countries border Germany?".
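To make the setting concrete, the problem inputs can be written down as a small data structure. The sketch below is purely illustrative; the class and field names are ours and are not part of the paper.

```python
from dataclasses import dataclass, field

@dataclass
class PatternLearningTask:
    """Inputs of the pattern-learning problem: two classes, their (possibly
    incomplete) sets of known instances, and a training set of instance-pairs
    that are known to be R-related."""
    class_q: str                                        # e.g. 'author'
    class_a: str                                        # e.g. 'book title'
    instances_q: set = field(default_factory=set)       # I_q
    instances_a: set = field(default_factory=set)       # I_a
    training_pairs: set = field(default_factory=set)    # T_R, subset of I_q x I_a

task = PatternLearningTask(
    class_q="author",
    class_a="book title",
    instances_q={"Leo Tolstoy", "Günter Grass", "Umberto Eco"},
    instances_a={"War and Peace", "Die Blechtrommel"},
    training_pairs={("Leo Tolstoy", "War and Peace"),
                    ("Günter Grass", "Die Blechtrommel")},
)
```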
3 Related work
Brin identifies the use of patterns in the discovery of relations on the web (Brin, 1998). He describes a website-dependent approach to identify hypertext patterns that express some relation. For each web site, such patterns are learned and explored to identify instances that are similarly related. In (Agichtein and Gravano, 2000), such a system is combined with a named-entity recognizer. In (Craven et al., 2000) an ontology is populated by crawling a website. Based on tagged web pages from other sites, rules are learned to extract information from the website.
Research on named-entity recognition was addressed in the nineties at the Message Understanding Conferences (Chinchor, 1998) and is continued for example in (Zhou and Su, 2002). Automated part-of-speech tagging (Brill, 1992) is a useful technique in term extraction (Frantzi et al., 2000), a domain closely related to named-entity recognition. Here, terms are extracted with a predefined part-of-speech structure, e.g. an adjective-noun combination. In (Nenadić et al., 2002), methods are discussed to extract information from natural language texts with the use of both part-of-speech tags and hyponym patterns.
As referred to in the introduction, Ravichandran and Hovy (2002) present a method to identify surface text patterns using a web search engine. They extract patterns expressing functional relations in a factoid question answering setting. Selection of the extracted patterns is based on the precision of the patterns. For example, if the pattern 'was born in' is identified as a pattern for the pair ('Mozart', 'Salzburg'), they compute precision as the number of excerpts containing 'Mozart was born in Salzburg' divided by the number of excerpts with 'Mozart was born in'.
Information extraction and ontology creation are two closely related fields. For reliable information extraction, we need background information, e.g. an ontology. On the other hand, we need information extraction to generate broad and highly usable ontologies. An overview of ontology learning from text can be found in (Buitelaar et al., 2005). Early work (Hearst, 1998) describes the extraction of text patterns expressing WordNet relations (such as hyponym relations) from some corpus. This work focusses merely on the identification of such text patterns (i.e. phrases containing both instances of some related pair). Patterns found by multiple pairs are suggested to be usable patterns.
KnowItAll is a hybrid named-entity extraction system (Etzioni et al., 2005) that finds lists of instances of some class from the web using a search engine. It combines Hearst patterns and learned
patterns for instances of some class to identify and extract named-entities. Moreover, it uses adaptive wrapper algorithms (Crescenzi and Mecca, 2004) to extract information from html markup such as tables.
Cimiano and Staab (2004) describe a method to use a search engine to verify a hypothesis relation. For example, if we are interested in the 'is a' or hyponym relation and we have a candidate instance pair ('river', 'Nile') for this relation, we can use a search engine to query phrases expressing this relation (e.g. 'rivers such as the Nile'). The number of hits to such queries can then be used as a measure to determine the validity of the hypothesis.
In (Geleijnse and Korst, 2005), a method is described to populate an ontology with the use of queried text patterns. The algorithm presented extracts instances from search results after having submitted a combination of an instance and a pattern as a query to a search engine. The extracted instances from the retrieved excerpts can thereafter be used to formulate new queries – and thus identify and extract other instances.
4 The algorithm
We present an algorithm to learn surface text patterns for relations. We use Google to retrieve such patterns. The algorithm makes use of a training set TR of instance-pairs that are R-related. This training set should be chosen such that the instance-pairs are typical for relation R. We first discover how relation R is expressed in natural language texts on the web (Section 4.1). In Section 4.2 we address the problem of selecting effective patterns from the total set of patterns found.
4.1 Identifying relation patterns
We first generate a list of surface text patterns with the use of the following algorithm. For evaluation purposes, we also compute the frequency of each pattern found.
- Step 1: Formulate queries using an instance-pair (x, y) ∈ TR. Since we are interested in phrases within sentences rather than in keywords or expressions in telegram style that often appear in titles of webpages, we use the allintext: option. This gives us only search results with the queried expression in the bodies of the documents rather than in the titles. We query both allintext:" x * y " and allintext:" y * x ". The * is a regular expression operator accepted by Google. It is a placeholder for zero or more words.
- Step 2: Send the queries to Google and collect the excerpts of the at most 1,000 pages it returns for each query.
- Step 3: Extract all phrases matching the queried expressions and replace both x and y by the names of their classes.
- Step 4: Remove all phrases that are not within one sentence.
- Step 5: Normalize all phrases by removing all mark-up that is ignored by Google. Since Google is case-insensitive and ignores punctuation, double spaces and the like, we translate all phrases found to a normal form: the simplest expression that we can query that leads to the document retrieved.
- Step 6: Update the frequencies of all normalized phrases found.
- Step 7: Repeat the procedure for any unqueried pair (x', y') ∈ TR.
We now have generated a list with relation patterns and their frequencies within the retrieved Google excerpts.
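As an illustration only, Steps 3-6 could be implemented along the following lines; the helper names and the 60-character within-sentence limit are our own assumptions, not details taken from the paper.

```python
import re
from collections import Counter

def normalize(phrase: str) -> str:
    """Normal form as described in Step 5: lower-case, strip punctuation and
    collapse whitespace (Google ignores these when matching a query)."""
    phrase = re.sub(r"[^\w\s]", " ", phrase.lower())
    return re.sub(r"\s+", " ", phrase).strip()

def extract_patterns(excerpts, x, y, class_x, class_y):
    """Turn search-result excerpts for the pair (x, y) into candidate patterns:
    keep the text between the two instances (single sentence only) and replace
    the instances by their class names, counting normalized phrases."""
    counts = Counter()
    for excerpt in excerpts:
        for a, b, ca, cb in ((x, y, class_x, class_y), (y, x, class_y, class_x)):
            m = re.search(re.escape(a) + r"([^.!?]{1,60})" + re.escape(b),
                          excerpt, flags=re.IGNORECASE)
            if m:
                counts[f"{ca} {normalize(m.group(1))} {cb}"] += 1
    return counts

# Counts obtained from different training pairs can simply be added:
# total = extract_patterns(e1, "Leo Tolstoy", "War and Peace", "author", "book") \
#       + extract_patterns(e2, "Günter Grass", "Die Blechtrommel", "author", "book")
```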
4.2 Selecting relation patterns
From the list of relation patterns found, we are interested in the most effective ones. We are not only interested in the most precise ones. For example, the retrieved pattern "född 30 mars 1853 i" proved to be a 100% precise pattern expressing the relation between a person ('Vincent van Gogh') and his place of birth ('Zundert'). Clearly, this rare phrase is unsuited to mine instance-pairs of this relation in general. On the other hand, high frequency of some pattern is no guarantee for effectiveness either. The frequently occurring pattern "was born in London" (found when querying for Thomas Bayes * England) is well-suited to be used to find London-born persons, but in general the pattern is unsuited – since too narrow – to express the relation between a person and his or her country of origin.
Taking these observations into account, we formulate three criteria for selecting effective relation patterns.
1. The patterns should frequently occur on the web, to increase the probability of getting any results when querying the pattern in combination with an instance.
2. The pattern should be precise. When we query a pattern in combination with an instance in Iq, we want to have many search results containing instances from ca.
3. If relation R is not functional, the pattern should be wide-spread, i.e. among the search results when querying a combination of the pattern and an instance in Iq there must be as many distinct R-related instances from ca as possible.
To measure these criteria, we use the following scoring functions for relation patterns s.
1. ffreq(s) = "the number of occurrences of s in the excerpts as found by the algorithm described in the previous subsection".
2. fprec(s) = (Σ_{x ∈ I′q} P(s, x)) / |I′q|, where for instances x ∈ I′q, I′q ⊆ Iq, we calculate P(s, x) = FI(s, x) / FO(s, x), with FI(s, x) the number of Google excerpts, after querying s in combination with x, that contain instances of ca, and FO(s, x) the total number of excerpts found (at most 1,000).
3. fspr(s) = Σ_{x ∈ I′q} B(s, x), where B(s, x) is the number of distinct instances of class ca found after querying pattern s in combination with x.
The larger we choose the test set, the subset I′q of Iq, the more reliable the measures for precision and spreading. However, the number of Google queries increases with the number of patterns found for each instance we add to I′q.
We finally calculate the score of the patterns by multiplying the individual scores:
score(s) = ffreq(s) · fprec(s) · fspr(s)
For efficiency reasons, we only compute the scores of the patterns with the highest frequencies.
The problem remains how to recognize a (possibly multi-word) instance in the Google excerpts. For an ontology alignment setting – where the sets Ia and Iq are not to be expanded – these problems are trivial: we determine whether t ∈ Ia is accompanied by the queried expression. For a setting where the instances of ca are not all known (e.g. it is not likely that we have a complete list of all books written in the world), we solve this problem in two stages. First we identify rules per class to extract candidate instances. Thereafter we use an additional Google query to verify if a candidate is indeed an instance of class ca.
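The scoring step can be summarized in a few lines. In the sketch below the excerpts per test instance are assumed to be already retrieved, and candidate instances are matched only against the already known instances of ca, a simplification of the two-stage check described next; all function and variable names are illustrative.

```python
def pattern_score(freq, excerpts_per_instance, instances_a):
    """Score a pattern as freq x precision x spread, following the definitions above.

    excerpts_per_instance maps each test instance x in I'_q to the list of
    excerpts returned when the pattern is queried together with x (<= 1,000).
    instances_a is the set of known instances of the answer class c_a.
    """
    precisions, spread = [], 0
    for x, excerpts in excerpts_per_instance.items():
        hits = [e for e in excerpts if any(a in e for a in instances_a)]
        precisions.append(len(hits) / len(excerpts) if excerpts else 0.0)  # P(s, x)
        # B(s, x): distinct instances of c_a seen in the excerpts for x
        spread += len({a for a in instances_a if any(a in e for e in excerpts)})
    f_prec = sum(precisions) / len(precisions) if precisions else 0.0
    return freq * f_prec * spread
```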
"cq "cq "cq "cq
(c1 , c2 , ...) and corresponding instance sets (I1 , I2 , ..). On these classes, relations R(i,j) 1 are de£ned, with i and j the index number of the classes. The non-empty sets T(i,j) contain the training set of instance-pairs of the relations R(i,j) . Per instance, we maintain a list of expressions that already have been used as a query. Initially, these are empty. The £rst step of the algorithm is to learn surface text patterns for each relation in O. The following steps of the algorithm are performed until either some stop criterion is reached, or no more new instances and instance-pairs can be found.
including t and" for example t and" like t and" such as t and"
Table 1: Hearst patterns for instance-class relation. where h(p, cq , t) is the number of Google hits for query with pattern p combined with term t and the plural form of the class name cq . The threshold n has to be chosen beforehand. We can do so, by calculating the sum of Google hits for queries with known instances of the class. Based on these £gures, a threshold can be chosen e.g. the minimum of these sums. Note that term t is both preceded and followed by a £xed phrase in the queries. We do so, to guarantee that t is indeed the full term we are interested in. For example, if we had extracted the term ‘Los’ instead of ‘Los Angeles’ as a Californian City, we would falsely identify ‘Los’ as a Californian City, when we do not let ‘Los’ follow by the £xed expression and. The number of Google hits for some expression x is at least the number of Google hits when querying the same expression followed by some expression y. If we identify a term t as being an instance of class ca , we can add this term to the set Ia . However, we cannot relate t to an instance in Iq , since the pattern used to £nd t has not proven to be effective yet (e.g. the pattern could express a different relation between one of the instance-pairs in the training set). We reduce the amount of Google queries by using a list of terms found that do not belong to ca . Terms that occur multiple times in the excerpts can then be checked only once. Moreover, we use the OR-clause to combine the individual queries into one. We then check if the number of hits to this query exceeds the threshold. The amount of Google queries in this phase thus equals the amount of distinct terms extracted.
5
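A minimal sketch of this hit-count based acceptance test follows. The `hit_count` helper stands in for the search-engine query (the OR-clause batching described above is omitted); both helpers and the threshold calibration are illustrative assumptions.

```python
HEARST_PATTERNS = ['"{c} including {t} and"', '"{c} for example {t} and"',
                   '"{c} like {t} and"', '"{c} such as {t} and"']

def accept_instance(term, class_plural, hit_count, threshold):
    """Accept `term` as an instance of the class if the summed number of
    search-engine hits over all hyponym-pattern queries reaches the threshold.
    `hit_count` is assumed to send a phrase query and return the hit count."""
    total = sum(hit_count(p.format(c=class_plural, t=term))
                for p in HEARST_PATTERNS)
    return total >= threshold

def calibrate_threshold(known_instances, class_plural, hit_count):
    """One way to pick n, as suggested above: the minimum summed hit count
    observed for instances that are known to belong to the class."""
    return min(sum(hit_count(p.format(c=class_plural, t=x))
                   for p in HEARST_PATTERNS)
               for x in known_instances)
```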
5 The use of surface text patterns in information extraction
Having a method to identify relation patterns, we now focus on utilizing these patterns in information extraction from texts found by a search engine. We use an ontology to represent the information extracted. Suppose we have an ontology O with classes (c1, c2, ...) and corresponding instance sets (I1, I2, ...). On these classes, relations R(i,j) are defined, with i and j the index numbers of the classes (we assume one relation per pair of classes; another index k in R(i,j,k) could be used to distinguish multiple relations between ci and cj). The non-empty sets T(i,j) contain the training sets of instance-pairs of the relations R(i,j). Per instance, we maintain a list of expressions that have already been used as a query. Initially, these are empty.
The first step of the algorithm is to learn surface text patterns for each relation in O. The following steps of the algorithm are performed until either some stop criterion is reached, or no more new instances and instance-pairs can be found.
- Step 1: Select a relation R(i,j), and an instance v from either Ii or Ij such that there exists at least one pattern expressing R(i,j) that we have not yet queried in combination with v.
- Step 2: Construct queries using the patterns with v and send these queries to Google.
- Step 3: Extract instances from the excerpts.
- Step 4: Add the newly found instances to the corresponding instance set and add the instance-pairs found (thus with v) to T(i,j).
- Step 5: If there exists an instance that we can use to formulate new queries, then repeat the procedure. Else, learn new patterns using the extracted instance-pairs and then repeat the procedure.
Note that instances of class cx learned using the algorithm applied on relation R(x,y) can be used as input for the algorithm applied to some relation R(x,z) to populate the sets Iz and T(x,z).
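The population loop can be outlined as follows. The sketch assumes helper callables for querying the search engine and for extracting verified instances and pairs from excerpts, and it omits the re-learning of patterns in the else-branch of Step 5; none of the names below come from the paper.

```python
def populate(relations, patterns, query, extract, max_queries=1000):
    """Outline of the population loop (Steps 1-5 above). `relations` maps a
    relation name to (instances_i, instances_j, pairs); `patterns` maps it to
    its learned surface patterns. `query` and `extract` stand in for the
    search-engine call and the candidate extraction/verification steps."""
    queried, budget = set(), max_queries
    progress = True
    while progress and budget > 0:
        progress = False
        for rel, (inst_i, inst_j, pairs) in relations.items():
            for v in list(inst_i | inst_j):
                for p in patterns[rel]:
                    if (rel, p, v) in queried or budget == 0:
                        continue
                    queried.add((rel, p, v))
                    budget -= 1
                    for new_inst, new_pair in extract(rel, v, query(p, v)):
                        target = inst_j if v in inst_i else inst_i
                        if new_inst not in target:
                            target.add(new_inst)
                            progress = True   # a new instance enables new queries
                        pairs.add(new_pair)
    return relations
```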
6 Experiments
In this section, we discuss two experiments that we have conducted. The first experiment involves the identification of effective hyponym patterns. The second experiment is an illustration of the application of learned surface text patterns in information extraction.
6.1 Learning effective hyponym patterns
We are interested in whether the effective surface text patterns are indeed intuitive formulations of some relation R. As a test case, we compute the most effective patterns for the hyponym relation using a test set with names of all countries. Our experiment was set up as follows. We collected the complete list of countries in the world from the CIA World Factbook (http://www.cia.gov/cia/publications/factbook). Let Iq be this set of countries, and let Ia be the set { 'countries', 'country' }. The set TR consists of all pairs (a, 'countries') and (a, 'country'), for a ∈ Iq. We apply the surface text pattern learning algorithm on this set TR.
The algorithm identified almost 40,000 patterns. We computed fspr and fprec for the 1,000 most frequently found patterns. In Table 2, we give the 25 most effective patterns found by the algorithm. We consider the patterns in boldface true hyponym patterns. Focussing on these patterns, we observe two groups: 'is a' and Hearst-like patterns.

pattern | freq | prec | spr
(countries) like | 645 | 0.66 | 134
(countries) such as | 537 | 0.54 | 126
is a small (country) | 142 | 0.69 | 110
(country) code for | 342 | 0.36 | 84
(country) map of | 345 | 0.34 | 78
(countries) including | 430 | 0.21 | 93
is the only (country) | 138 | 0.55 | 102
is a (country) | 339 | 0.22 | 99
(country) flag of | 251 | 0.63 | 46
and other (countries) | 279 | 0.34 | 72
and neighboring (countries) | 164 | 0.43 | 92
(country) name republic of | 83 | 0.93 | 76
(country) book of | 59 | 0.77 | 118
is a poor (country) | 63 | 0.73 | 106
is the first (country) | 53 | 0.70 | 112
(countries) except | 146 | 0.37 | 76
(country) code for calling | 157 | 0.95 | 26
is an independent (country) | 62 | 0.55 | 114
and surrounding (countries) | 84 | 0.40 | 107
is one of the poorest (countries) | 61 | 0.75 | 78
and several other (countries) | 65 | 0.59 | 90
among other (countries) | 84 | 0.38 | 97
is a sovereign (country) | 48 | 0.69 | 89
or any other (countries) | 87 | 0.58 | 58
(countries) namely | 58 | 0.44 | 109
Table 2: Learned hyponym patterns and their scores.

The Hearst patterns 'like' and 'such as' prove to be the most effective. This observation is useful when we want to minimize the amount of queries for hyponym patterns. Expressions of properties that hold for each country and only for countries, for example the existence of a country code for dialing, are not trivially identified manually but are useful and reliable patterns. The combination of 'is a', 'is an' or 'is the' with an adjective is a common pattern, occurring 2,400 times in the list. In future work, we plan to identify such adjectives in Google excerpts using a part-of-speech tagger (Brill, 1992).
6.2 Applying learned patterns in information extraction
The Text Retrieval Conference (TREC) question answering track in 2004 contains list questions, for example 'Who are Nirvana's band members?' (Voorhees, 2004). We illustrate the use of our ontology population algorithm in the context of such list-question answering with a small case study. Note that we do not consider the processing of the question itself in this research.
Inspired by one of the questions ('What countries is Burger King located in?'), we are interested in populating an ontology with restaurants and the countries in which they operate. We identify the classes 'country' and 'restaurant' and the relation 'located in' between the classes. We hand the algorithm the instances of 'country', as well as two instances of 'restaurant': 'McDonald's' and 'KFC'. Moreover, we add three instance-pairs of the relation to the algorithm. We use these pairs and a subset Icountry of size eight to compute a ranked list of the patterns.
We extract terms consisting of one up to four capitalized words. In this test we set the threshold for the number of Google results for the queries with the extracted terms to 50. After a small test with names of international restaurant branches, this seemed an appropriate threshold. The algorithm learned, besides a ranked list of 170 surface text patterns (Table 3), a list of 54 instances of restaurant (Table 4). Among these instances are indeed the names of large international chains, Burger King being one of them. Less expected are the names of geographic locations and names of famous cuisines such as 'Chinese' and 'French'. The last category of false instances found that have not been filtered out are a number of very common words (e.g. 'It' and 'There'). We populate the ontology with relations found between Burger King and instances from country using the 20 most effective patterns.
pattern | prec | spr | freq
ca restaurants of cq | 0.24 | 15 | 21
ca restaurants in cq | 0.07 | 19 | 9
ca hamburger chain that occupies villages throughout modern day cq | 1.0 | 1 | 7
ca restaurant in cq | 0.06 | 16 | 6
ca restaurants in the cq | 0.13 | 16 | 2
ca hamburger restaurant in southern cq | 1.0 | 1 | 4
Table 3: Top learned patterns for the restaurant-country (ca - cq) relation.
Chinese, Denny's, Subway, Holywood, HOTEL OR, Japanese, You, World, Leo, These, FELIX, Marks, Friendly, New York, Louis XV, Good, That, Italia, Bank, Pizza Hut, Taco Bell, Wendy's, This, West, BP, Brazil, Victoria, Lyons, Roy, Cities, Harvest, Vienna, Greens, It, Mark, French, Outback Steakhouse, Kentucky Fried Chicken, Continental, Long John Silver's, Burger King, Keg Steakhouse, Outback, San Francisco, New York, Starbucks, California Pizza Kitchen, Emperor, Friday, Montana, Red Lobster, There, Dunkin Donuts, Tim Hortons
Table 4: Learned instances for restaurant.
The algorithm returned 69 instance-pairs with countries related to 'Burger King'. On the Burger King website (http://www.whopper.com) a list of the 65 countries can be found in which the hamburger chain operates. Of these 65 countries, we identified 55. This implies that our results have a precision of 55/69 = 80% and a recall of 55/65 = 85%. Many of the falsely related countries – mostly in eastern Europe – are locations where Burger King is said to have plans to expand its empire.
7 Conclusions
We have presented a novel approach to identify useful surface text patterns for information extraction using an internet search engine. We argued that the selection of patterns has to be based on effectiveness: a pattern has to occur frequently, it has to be precise and it has to be wide-spread if it represents a non-functional relation. These criteria are combined in a scoring function which we use to select the most effective patterns.
The method presented can be used for arbitrary relations, thus also relations that link an instance to multiple other instances. These patterns can be used in information extraction. We combine patterns with an instance and offer such an expression as a query to a search engine. From the excerpts retrieved, we extract instances and simultaneously instance-pairs.
Learning surface text patterns is efficient with respect to the number of queries if we know all instances of the classes concerned. The first part of the algorithm is linear in the size of the training set. Furthermore, we select the n most frequent patterns and perform |Iq| · n queries to compute the score of these n patterns. However, for a setting where Ia is incomplete, we have to perform a check for each unique term identified as a candidate instance in the excerpts found by the |Iq| · n queries. The number of queries, one for each extracted unique candidate instance, thus fully depends on the rules that are used to identify a candidate instance.
We apply the learned patterns in an ontology population algorithm. We combine the learned high quality relation patterns with an instance in a query. In this way we can perform a range of effective queries to find instances of some class and simultaneously find instance-pairs of the relation.
A first experiment, the identification of hyponym patterns, showed that the patterns identified indeed intuitively reflect the relation considered. Moreover, we have generated a ranked list of hyponym patterns. The experiment with the restaurant ontology illustrated that a small training set suffices to learn effective patterns and populate an ontology with good precision and recall. The algorithm performs well with respect to recall of the instances found: many big international restaurant branches were found. The identification of the instances however is open to improvement, since the additional check does not filter out all falsely identified candidate instances.
8 Future work
Currently we check whether an extracted term is indeed an instance of some class by querying hyponym patterns. However, if we find two instances related by some surface text pattern, we always accept these instances as an instance pair. Thus, if we find both 'Mozart was born in Germany' and 'Mozart was born in Austria', both extracted
instance-pairs are added to our ontology. We thus need some post-processing to remove falsely found instance-pairs. When we know that a relation is functional, we can select the most frequently occurring instance-pair. Moreover, the process of identifying an instance in a text needs further research, especially since the method to identify instance-class relations by querying hyponym patterns is not flawless. The challenge thus lies in the area of improving the precision of the output of the ontology population algorithm. With additional filtering techniques and more elaborate identification techniques we expect to be able to improve the precision of the output. We plan to research check functions based on enumerations of candidate instances with known instances of the class. For example, the enumeration 'KFC, Chinese and McDonald's' is not found by Google, whereas 'KFC, Burger King and McDonald's' gives 31 hits.
Our experiment with the extraction of hyponym patterns suggests a ranking of Hearst patterns based on their effectiveness. Knowledge on the effectiveness of each of the Hearst patterns can be utilized to minimize the amount of queries. Finally, we will investigate ways to compare our methods with other systems in a TREC-like setting with the web as a corpus.
Acknowledgments
We thank our colleagues Bart Bakker and Dragan Sekulovski and the anonymous reviewers for their useful comments on earlier versions of this paper.
References
E. Agichtein and L. Gravano. 2000. Snowball: Extracting relations from large plain-text collections. In Proceedings of the Fifth ACM International Conference on Digital Libraries.
E. Brill. 1992. A simple rule-based part-of-speech tagger. In Proceedings of the Third Conference on Applied Natural Language Processing (ANLP'92), pages 152–155, Trento, Italy.
S. Brin. 1998. Extracting patterns and relations from the world wide web. In WebDB Workshop at the Sixth International Conference on Extending Database Technology (EDBT'98).
P. Buitelaar, P. Cimiano, and B. Magnini, editors. 2005. Ontology Learning from Text: Methods, Evaluation and Applications, volume 123 of Frontiers in Artificial Intelligence and Applications. IOS Press.
N. A. Chinchor, editor. 1998. Proceedings of the Seventh Message Understanding Conference (MUC-7). Morgan Kaufmann, Fairfax, Virginia.
P. Cimiano and S. Staab. 2004. Learning by googling. SIGKDD Explorations Newsletter, 6(2):24–33.
M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. 2000. Learning to construct knowledge bases from the World Wide Web. Artificial Intelligence, 118:69–113.
V. Crescenzi and G. Mecca. 2004. Automatic information extraction from large websites. Journal of the ACM, 51(5):731–779.
O. Etzioni, M. J. Cafarella, D. Downey, A. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. 2005. Unsupervised named-entity extraction from the web: An experimental study. Artificial Intelligence, 165(1):91–134.
K. Frantzi, S. Ananiadou, and H. Mima. 2000. Automatic recognition of multi-word terms: the C-value/NC-value method. International Journal on Digital Libraries, 3:115–130.
G. Geleijnse and J. Korst. 2005. Automatic ontology population by googling. In Proceedings of the Seventeenth Belgium-Netherlands Conference on Artificial Intelligence (BNAIC 2005), pages 120–126, Brussels, Belgium.
M. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th Conference on Computational Linguistics, pages 539–545, Morristown, NJ, USA.
M. Hearst. 1998. Automated discovery of WordNet relations. In Christiane Fellbaum, editor, WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.
G. Nenadić, I. Spasić, and S. Ananiadou. 2002. Automatic discovery of term similarities using pattern mining. In Proceedings of the Second International Workshop on Computational Terminology (CompuTerm'02), Taipei, Taiwan.
D. Ravichandran and E. Hovy. 2002. Learning surface text patterns for a question answering system. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002), pages 41–47, Philadelphia, PA.
E. Voorhees. 2004. Overview of the TREC 2004 question answering track. In Proceedings of the 13th Text Retrieval Conference (TREC 2004), Gaithersburg, Maryland.
G. Zhou and J. Su. 2002. Named entity recognition using an HMM-based chunk tagger. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002), pages 473–480, Philadelphia, PA.
Simple Information Extraction (SIE): A Portable and Effective IE System Claudio Giuliano and Alberto Lavelli and Lorenza Romano ITC-irst Via Sommarive, 18 38050, Povo (TN) Italy {giuliano,lavelli,romano}@itc.it
Abstract
This paper describes SIE (Simple Information Extraction), a modular information extraction system designed with the goal of being easily and quickly portable across tasks and domains. SIE is composed of a general purpose machine learning algorithm (SVM) combined with several customizable modules. A crucial role in the architecture is played by Instance Filtering, which makes it possible to increase efficiency without reducing effectiveness. The results obtained by SIE on several standard data sets, representative of different tasks and domains, are reported. The experiments show that SIE achieves performance close to the best systems in all tasks, without using domain-specific knowledge.
1 Introduction
In designing Information Extraction (IE) systems based on supervised machine learning techniques, there is usually a tradeoff between carefully tuning the system to specific tasks and domains and having a "generic" IE system able to obtain good (even if not the topmost) performance when applied to different tasks and domains (requiring a very reduced porting time). Usually, the former alternative is chosen and system performance is often shown only for a very limited number of tasks (sometimes even only for a single task), after a careful tuning. For example, in the Bio-entity Recognition Shared Task at JNLPBA 2004 (Kim et al., 2004) the best performing system obtained a considerable performance improvement adopting domain specific hacks.
A second important issue in designing IE systems concerns the fact that usually IE data sets are highly unbalanced (i.e., the number of positive examples constitutes only a small fraction with respect to the number of negative examples). This fact has important consequences. In some machine learning algorithms the unbalanced distribution of examples can yield a significant loss in classification accuracy. Moreover, very large data sets can be problematic to process due to the complexity of many supervised learning techniques. For example, using kernel methods, such as word sequence and tree kernels, can become prohibitive due to the difficulty of kernel based algorithms, such as Support Vector Machines (SVM) (Cortes and Vapnik, 1995), to scale to large data sets. As a consequence, reducing the number of instances without degrading the prediction accuracy is a crucial issue for applying advanced machine learning techniques in IE, especially in the case of highly unbalanced data sets.
In this paper, we present SIE (Simple Information Extraction), an information extraction system based on a supervised machine learning approach for extracting domain-specific entities from documents. In particular, IE is cast as a classification problem by applying SVM to train a set of classifiers, based on a simple and general-purpose feature representation, for detecting the boundaries of the entities to be extracted. SIE was designed with the goal of being easily and quickly portable across tasks and domains. To support this claim, we conducted a set of experiments on several tasks in different domains and languages. The results show that SIE is competitive with the state-of-the-art systems, and it often outperforms systems customized to a specific domain.
SIE resembles the "Level One" of the ELIE algorithm (Finn and Kushmerick, 2004). However, a key difference between the two algorithms is the capability of SIE to drastically reduce the computation time by exploiting Instance Filtering (Gliozzo et al., 2005a). This characteristic allows scaling from toy problems to real-world data sets, making SIE attractive in applicative fields, such as bioinformatics, where very large amounts of data have to be analyzed.
2 A Simple IE system
SIE has a modular system architecture. It is composed of a general purpose machine learning algorithm combined with several customizable components. The system components are combined in a pipeline, where each module constrains the data structures provided by the previous ones. This modular specification brings significant advantages. Firstly, a modular architecture is simpler to implement. Secondly, it allows us to easily integrate different machine learning algorithms. Finally, it allows, if necessary, fine tuning to a specific task by simply specializing a few modules. Furthermore, it is worth noting that we tested SIE across different domains using the same basic configuration without exploiting any domain specific knowledge, such as gazetteers, or ad-hoc pre/post-processing.
[Figure 1 (architecture diagram): Training Corpus / New Documents → Instance Filtering (filter model) → Feature Extraction (extraction script, lexicon) → Learning Algorithm (data model) / Classification Algorithm → Tag Matcher → Tagged Documents]
Figure 1: The SIE Architecture.
The architecture of the system is shown in Figure 1. The information extraction task is performed in two phases. SIE learns off-line a set of data models from a specified labeled corpus, then the models are applied to tag new documents. In both phases, the Instance Filtering module (Section 3) removes certain tokens from the data set in order to speed up the whole process, while the
Feature Extraction module (Section 4) is used to extract a pre-defined set of features from the tokens. In the training phase, the Learning Module (Section 5) learns two distinct models for each entity, one for the beginning boundary and another for the end boundary (Ciravegna, 2000; Freitag and Kushmerick, 2000). In the recognition phase, as a consequence, the Classification module (Section 5) identifies the entity boundaries as distinct token classifications. A Tag Matcher module (Section 6) is used to match the boundary predictions made by the Classification module. Tasks with multiple entities are considered as multiple independent single-entity extraction tasks (i.e. SIE only extracts one entity at a time).
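The per-entity tagging flow just described (begin and end boundary classifiers plus a matcher) can be sketched as follows; the classifier objects and the `match` function are placeholders for illustration, not the actual SIE or SVM-light interfaces.

```python
def tag_entity(tokens, begin_clf, end_clf, featurize, match):
    """One entity type at a time: score every token as a possible begin or
    end boundary, then let the matcher pair positive predictions into spans."""
    feats = [featurize(tokens, i) for i in range(len(tokens))]
    begin_scores = [begin_clf.decision_function(f) for f in feats]
    end_scores = [end_clf.decision_function(f) for f in feats]
    begins = [i for i, s in enumerate(begin_scores) if s > 0]
    ends = [i for i, s in enumerate(end_scores) if s > 0]
    # match() resolves nested/overlapping candidates (see the Tag Matcher module)
    return match(begins, ends, begin_scores, end_scores)
```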
3 Instance Filtering
The purpose of the Instance Filtering (IF) module is to reduce the data set size and skewness by discarding harmful and superfluous instances without degrading the prediction accuracy. This is a generic module that can be exploited by any supervised system that casts IE as a classification problem. Instance Filtering (Gliozzo et al., 2005a) is based on the assumption that uninformative words are not likely to belong to entities to recognize, their information content being very low. A naive implementation of this assumption consists in filtering out very frequent words in corpora, because they are less likely to be relevant than rare words. However, in IE relevant entities can be composed of more than one token, and in some domains a few of such tokens can be very frequent in the corpus. For example, in the field of bioinformatics, protein names often contain parentheses, whose frequency in the corpus is very high. To deal with this problem, we exploit a set of Instance Filters (called Stop Word Filters), included in a Java tool called jInFil (http://tcc.itc.it/research/textec/tools-resources/jinfil/). These filters perform a "shallow" supervision to identify frequent words that are often marked as positive examples. The resulting filtering algorithm consists of two stages. First, the set of uninformative tokens is identified by training the term filtering algorithm on the training corpus. Second, instances describing "uninformative" tokens are removed from both the training and the test sets. Note that instances are not really removed from the data set, but just
marked as uninformative. In this way the learning algorithm will not learn from these instances, but they will still appear in the feature description of the remaining instances. A Stop Word Filter is fully specified by a list of stop words. To identify such a list, different feature selection methods taken from the text categorization literature can be exploited. In text categorization, feature selection is used to remove non-informative terms from representations of texts. In this sense, IF is closely related to feature selection: in the former non-informative words are removed from the instance set, while in the latter they are removed from the feature set. Below, we describe the different metrics used to collect a stop word list from the training corpora.
Information Content (IC) The most commonly used feature selection metric in text categorization is based on document frequency (i.e., the number of documents in which a term occurs). The basic assumption is that very frequent terms are non-informative for document indexing. The frequency of a term in the corpus is a good indicator of its generality, rather than of its information content. From this point of view, IF consists of removing all tokens with a very low information content (the information content of a word w can be measured by estimating its probability from a corpus by the equation I(w) = −p(w) log p(w)).
Correlation Coefficient (CC) In text categorization the χ² statistic is used to measure the lack of independence between a term and a category (Yang and Pedersen, 1997). The correlation coefficient CC² = χ² of a term with the negative class can be used to find those terms that are less likely to express relevant information in texts.
Odds Ratio (OR) Odds ratio measures the ratio between the odds of a term occurring in the positive class and the odds of a term occurring in the negative class. In text categorization the idea is that the distribution of the features on the relevant documents is different from the distribution on non-relevant documents (Raskutti and Kowalczyk, 2004). Following this assumption, a term is non-informative when its probability of being a negative example is considerably higher than its probability of being a positive example (Gliozzo et al., 2005b).
An Instance Filter is evaluated by using two metrics: the Filtering Rate (ψ), the total percentage of filtered tokens in the data set, and the Positive Filtering Rate (ψ+), the percentage of positive tokens (wrongly) removed. A filter is optimized by maximizing ψ and minimizing ψ+; this allows us to reduce the data set size as much as possible while preserving most of the positive instances. We fixed the accepted level of tolerance on ψ+ and found the maximum ψ by performing 5-fold cross-validation on the training set.
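A rough sketch of such a Stop Word Filter and of the two metrics just defined is given below. The frequency-based candidate selection and the two threshold values are illustrative choices of this sketch, not the jInFil implementation.

```python
from collections import Counter

def learn_stop_words(tokens, labels, n_frequent=200, max_positive_rate=0.01):
    """Shallow Stop Word Filter: take the most frequent words as candidate
    'uninformative' tokens, but keep any word that is often labelled as
    (part of) an entity, in the spirit of the shallow supervision above."""
    freq, pos = Counter(tokens), Counter()
    for tok, lab in zip(tokens, labels):
        if lab != "O":                      # token inside an entity
            pos[tok] += 1
    return {w for w, c in freq.most_common(n_frequent)
            if pos[w] / c <= max_positive_rate}

def filtering_rates(tokens, labels, stop_words):
    """Filtering rate (psi): share of all tokens marked uninformative.
    Positive filtering rate (psi+): share of entity tokens wrongly marked."""
    removed = [lab for tok, lab in zip(tokens, labels) if tok in stop_words]
    psi = len(removed) / len(tokens)
    positives = sum(1 for lab in labels if lab != "O") or 1
    psi_pos = sum(1 for lab in removed if lab != "O") / positives
    return psi, psi_pos
```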
4 Feature Extraction
The Feature Extraction module is used to extract for each input token a pre-defined set of features. As said above, we consider each token an instance to be classified as a specific entity boundary or not. To perform Feature Extraction an application called jFex (http://tcc.itc.it/research/textec/tools-resources/jfex.html) was implemented. jFex generates the features specified by a feature extraction script, indexes them, and returns the example set, as well as the mapping between the features and their indices (lexicon). If specified, it only extracts features for the instances not marked as "uninformative" by instance filtering. jFex is strongly inspired by FEX (Cumby and Yih, 2003), but it introduces several improvements. First of all, it provides an enriched feature extraction language. Secondly, it makes it possible to further extend this language through a Java API, providing a flexible tool to define task specific features. Finally, jFex can output the example set in formats directly usable by LIBSVM (Chang and Lin, 2001), SVMlight (Joachims, 1998) and SNoW (Carlson et al., 1999).
4.1 Corpus Format
The corpus must be prepared in IOBE notation, an extension of the IOB notation. Both notations do not allow nested and overlapping entities. Tokens outside entities are tagged with O, while the first token of an entity is tagged with B-entity-type, the last token is tagged E-entity-type, and all the tokens inside the entity boundaries are tagged with I-entity-type, where entity-type is the type of the marked entity (e.g. protein, person). Beside the tokens and their types, the notation allows to represent general purpose and task-specific annotations defining new columns. Blank
lines can be used to specify sentence or document boundaries. Table 1 shows an example of a prepared corpus. The columns are: the entity-type, the PoS tag, the actual token, the token index, and the output of the instance filter (the "uninformative" tokens are marked with 0), respectively.

-1 inc loc: w [-3, 3]
-1 inc loc: coloc(w,w) [-3, 3]
-1 inc loc: t [-3, 3]
-1 inc loc: coloc(t,t) [-3, 3]
-1 inc loc: sh [-3, 3]
Table 2: The extraction script used in all tasks.
TO VB IN DT NN NN IN NN ( NN NN NN NN )
To investigate whether the tumor expression of Beta-2-Microglobulin ( Beta 2 M )
2.12 2.13 2.14 2.15 2.16 2.17 2.18 2.19 2.20 2.21 2.22 2.22 2.22 2.23
0 0 0 0 1 1 0 1 1 1 1 1 1 1
Table 1: A corpus fragment represented in IOBE notation.
4.2
Extraction Language
As input to the begin and end classifiers, we use a bit-vector representation. Each instance is represented encoding all the following basic features for the actual token and for all the tokens in a context window of fixed size (in the reported experiments, 3 words before and 3 words after the actual token):
5
Learning and Classification Modules
As already said, we approach IE as a classification problem, assigning an appropriate classification label to each token in the data set except for the tokens marked as irrelevant by the instance filter. As learning algorithm we use SVM-light5 . In particular, we identify the boundaries that indicate the beginning and the end of each entity as two distinct classification tasks, following the approach adopted in (Ciravegna, 2000; Freitag and Kushmerick, 2000). All tokens that begin(end) an entity are considered positive instances for the begin(end) classifier, while all the remaining tokens are negative instances. In this way, two distinct models are learned, one for the beginning boundary and another for the end boundary. All the predictions produced by the begin and end classifiers are then paired by the Tag Matcher module. When we have to deal with more than one entity (i.e., with a multi-class problem) we train 2n binary classifiers (where n is the number of entitytypes for the task). Again, all the predictions are paired by the Tag Matcher module.
Token The actual token.
POS The Part of Speech (PoS) of the token.
Token Shapes This feature maps each token into equivalence classes that encode attributes such as capitalization, numerals, single character, and so on.
Bigrams of tokens and PoS tags.
The Feature Extraction language allows to formally encode the above problem description through a script. Table 2 provides the extraction script used in all the tasks (in the JNLPBA shared task we added some orthographic features borrowed from the bioinformatics literature). More details about the Extraction Language are provided in (Cumby and Yih, 2003; Giuliano et al., 2005).
6
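For illustration, the window-based features described above could be generated along these lines; the string encoding and the `token_shape` classes are our own choices, not jFex's output format.

```python
def token_shape(tok):
    """Illustrative shape classes (capitalization, numerals, single char, ...)."""
    if tok.isdigit():      return "NUM"
    if len(tok) == 1:      return "CHAR"
    if tok.isupper():      return "ALLCAPS"
    if tok[0].isupper():   return "CAP"
    return "LOWER"

def features(tokens, pos_tags, i, window=3):
    """Features for token i: tokens, PoS tags, shapes and bigrams in a
    [-3, +3] window, roughly what the extraction script in Table 2 asks for."""
    feats = []
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    for j in range(lo, hi):
        off = j - i
        feats += [f"w[{off}]={tokens[j]}",
                  f"t[{off}]={pos_tags[j]}",
                  f"sh[{off}]={token_shape(tokens[j])}"]
        if j + 1 < hi:  # bigrams of adjacent tokens / tags inside the window
            feats += [f"w[{off},{off+1}]={tokens[j]}_{tokens[j+1]}",
                      f"t[{off},{off+1}]={pos_tags[j]}_{pos_tags[j+1]}"]
    return feats
```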
Tag Matcher
All the positive predictions produced by the begin and end classifiers are paired by the Tag Matcher module. If nested or overlapping entities occur, even if they are of different types, the entity with the highest score is selected. The score of each entity is proportional to the entity length probability (i.e., the probability that an entity has a certain length) and the scores assigned by the classifiers to the boundary predictions. Normalizing the scores makes it possible to consider the score function as a probability distribution. The entity length distribution is estimated from the training set. For example, in the corpus fragment of Table 3 the begin and end classifiers have identified four possible entity boundaries for the speaker of a seminar. In the table, the left column shows the
Table 3: A corpus fragment with multiple predictions.
The       O
speaker   O
will      O
be        O
Mr.       B-speaker   B-speaker (0.23)
John      I-speaker   B-speaker (0.1), E-speaker (0.12)
Smith     E-speaker   E-speaker (0.34)
.         O

Table 4: The length distribution for the entity speaker.
entity len   P(entity len)
1 0.10
2 0.33
3 0.28
4 0.02
5 0.01
... ...
The matching algorithm has to choose among three mutually exclusive candidates: “Mr. John”, “Mr. John Smith” and “John Smith”, with scores 0.23 × 0.12 × 0.33 = 0.009108, 0.23 × 0.34 × 0.28 = 0.021896 and 0.1 × 0.34 × 0.33 = 0.01122, respectively. The length distribution for the entity speaker is shown in Table 4. In this example the matcher, choosing the candidate that maximizes the score function, namely the second one, extracts the actual entity.
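The candidate selection just illustrated can be written down directly. The following hedged sketch (not the SIE implementation) scores every begin/end pair by the product used above and reproduces the worked example of Tables 3 and 4; the position indices and helper names are assumptions for illustration only.

```python
# Score every (begin, end) pair by boundary scores times length probability
# and keep the best candidate, as described for the Tag Matcher above.
def best_entity(begin_preds, end_preds, length_prob):
    """begin_preds / end_preds: lists of (token position, normalized score);
    length_prob: dict mapping entity length (in tokens) to probability."""
    candidates = []
    for b_pos, b_score in begin_preds:
        for e_pos, e_score in end_preds:
            length = e_pos - b_pos + 1
            if length <= 0:
                continue
            score = b_score * e_score * length_prob.get(length, 0.0)
            candidates.append(((b_pos, e_pos), score))
    return max(candidates, key=lambda c: c[1], default=None)

# The Table 3/4 example (token positions 4-6 are "Mr.", "John", "Smith"):
begins = [(4, 0.23), (5, 0.10)]
ends = [(5, 0.12), (6, 0.34)]
lengths = {1: 0.10, 2: 0.33, 3: 0.28, 4: 0.02, 5: 0.01}
print(best_entity(begins, ends, lengths))   # -> ((4, 6), ~0.0219): "Mr. John Smith"
```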
7 Evaluation
In order to demonstrate that SIE is domain and language independent, we tested it on several tasks using exactly the same configuration. The tasks and the experimental settings are described in Section 7.1. The results (Section 7.2) show that the adopted filtering technique drastically decreases the computation time while preserving (and sometimes improving) the overall accuracy of the system.

7.1 The Tasks
SIE was tested on the following IE benchmarks:

JNLPBA Shared Task This shared task (Kim et al., 2004) is an open challenge task proposed at the “International Joint Workshop on Natural Language Processing in Biomedicine and its Applications” (http://research.nii.ac.jp/~collier/workshops/JNLPBA04st.htm). The data set consists of 2,404 MEDLINE abstracts from the GENIA project (Kim et al., 2003), annotated with five entity types: DNA, RNA, protein, cell-line, and cell-type. The GENIA corpus is split into two partitions: training (492,551 tokens) and test (101,039 tokens). The fraction of positive examples with respect to the total number of tokens in the training set varies from 0.2% to 6%.

CoNLL 2002 & 2003 Shared Tasks These shared tasks (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003) concern language-independent named entity recognition (http://www.cnts.ua.ac.be/conll2002/ner/, http://www.cnts.ua.ac.be/conll2003/ner/). Four types of named entities are considered: persons (PER), locations (LOC), organizations (ORG) and names of miscellaneous (MISC) entities that do not belong to the previous three groups. SIE was applied to the Dutch and English data sets. The Dutch corpus is divided into three partitions: training and validation (258,214 tokens in total) and test (73,866 tokens). The fraction of positive examples with respect to the total number of tokens in the training set varies from 1.1% to 2%. The English corpus is divided into three partitions: training and validation (274,585 tokens in total) and test (50,425 tokens). The fraction of positive examples with respect to the total number of tokens in the training set varies from 1.6% to 3.3%.

TERN 2004 The TERN (Time Expression Recognition and Normalization) 2004 Evaluation (http://timex2.mitre.org/tern.html) requires systems to detect and normalize temporal expressions occurring in English text (SIE did not address the normalization part of the task). The TERN corpus is divided into two partitions: training (249,295 tokens) and test (72,667 tokens). The fraction of positive examples with respect to the total number of tokens in the training set is about 2.1%.

Seminar Announcements The Seminar Announcements (SA) collection (Freitag, 1998) consists of 485 electronic bulletin board postings. The purpose of each document in the collection is to announce or relate details of an upcoming talk or seminar. The documents were annotated for four entities: speaker, location, stime, and etime. The corpus is composed of 156,540 tokens. The fraction of positive examples varies from about 1% to about 2%.
The entire document collection is randomly partitioned five times into two sets of equal size, training and test (Lavelli et al., 2004). For each partition, learning is performed on the training set and performance is measured on the corresponding test set. The resulting figures are averaged over the five test partitions.

7.2 Results

The experimental results in terms of filtering rate, recall, precision, F1 and computation time for JNLPBA, CoNLL-2002, CoNLL-2003, TERN and SA are given in Tables 5, 6, 7, 8 and 9 respectively. To show the differences among filtering strategies, for JNLPBA, CoNLL-2002 and TERN 2004 we used the CC, OR and IC filters, while the results for SA and CoNLL-2003 are reported only for the OR filter (which usually produces the best performance). For all filters we report results obtained by setting four different values of the parameter ε, the maximum value allowed for the Filtering Rate of positive examples; ε = 0 means that no filter is used.

Table 5: Filtering Rate, Micro-averaged Recall, Precision, F1 and Time for JNLPBA.
Filter      ε     ψ (train/test)   R     P     F1    T
(none)      0     -                66.4  67.0  66.7  615
CC          1     64.1/62.3        67.5  67.3  67.4  420
CC          2.5   80.1/78.0        66.6  69.1  67.8  226
CC          5     88.9/86.4        64.8  68.1  66.4  109
OR          1     70.7/68.9        68.3  67.3  67.8  308
OR          2.5   81.0/79.1        67.5  68.3  67.9  193
OR          5     87.8/85.6        65.4  68.2  66.8  114
IC          1     37.3/36.9        58.5  65.7  61.9  570
IC          2.5   38.4/38.0        56.9  65.4  60.9  558
IC          5     39.5/38.9        55.6  65.5  60.1  552
Zhou and Su (2004)                 76.0  69.4  72.6  -
baseline                           52.6  43.6  47.7  -

Table 6: Filtering Rate, Micro-averaged Recall, Precision, F1 and total computation time for CoNLL-2002 (Dutch).
Filter      ε     ψ (train/test)   R     P     F1    T
(none)      0     -                73.6  78.7  76.1  134
CC          1     64.4/64.4        71.6  79.9  75.5  70
CC          2.5   75.1/73.3        72.8  80.3  76.4  50
CC          5     88.6/84.2        66.6  64.7  65.6  24
OR          1     71.5/71.6        72.0  78.3  75.0  61
OR          2.5   82.1/80.7        73.6  78.9  76.2  39
OR          5     90.5/86.1        66.8  64.5  65.6  19
IC          1     47.3/47.5        67.0  79.2  72.6  101
IC          2.5   51.3/51.5        65.9  79.3  72.0  95
IC          5     55.7/56.0        63.8  78.9  70.5  89
Carreras et al. (2002)             76.3  77.8  77.1  -
baseline                           45.4  81.3  58.3  -

Table 7: Filtering Rate, Micro-averaged Recall, Precision, F1 and total computation time for CoNLL-2003 (English).
Filter      ε     ψ (train/test)   R     P     F1    T
(none)      0     -                76.7  90.5  83.1  228
OR          1     70.4/83.9        78.2  88.1  82.8  74
OR          2.5   83.6/95.6        76.4  62.6  68.8  33
OR          5     90.5/97.2        75.3  66.5  70.7  14
Florian et al. (2003)              88.5  89.0  88.8  -
baseline                           50.9  71.9  59.6  -
Table 8: Filtering Rate, Micro-averaged Recall, Precision, F1 and total computation time for TERN.
Filter      ε     ψ (train/test)   R     P     F1    T
(none)      0     -                77.9  89.8  83.4  82
CC          1     41.8/41.2        76.6  90.7  83.1  57
CC          2.5   64.5/62.8        60.3  88.6  71.7  41
CC          5     86.9/81.7        59.7  76.0  66.9  14
OR          1     56.4/54.6        77.5  91.1  83.8  48
OR          2.5   69.4/66.7        59.8  88.1  71.2  36
OR          5     82.9/79.0        59.5  88.6  71.2  20
IC          1     17.8/17.4        74.9  91.2  82.3  48
IC          2.5   24.0/23.3        74.8  91.5  82.3  36
IC          5     27.6/27.1        75.0  91.5  82.5  20

The results indicate that both CC and OR exhibit good performance and are far better than IC in all the tasks. For example, on the JNLPBA data set, OR makes it possible to remove more than 70% of the instances while losing less than 1% of the positive examples. These results pinpoint the importance of using a supervised metric to collect stop words. The results also highlight that both CC and OR are robust against overfitting, because the difference between the filtering rates in the training and test sets is minimal. We also report a significant reduction of the data skewness. Table 10 shows that all the IF techniques noticeably reduce the skewness ratio, i.e. the ratio between the number of negative and positive examples, on the JNLPBA data set (we only report results for this data set as it exhibits the highest skewness ratios). As expected, both CC and OR consistently outperform IC. The reported computation time includes the time needed to perform the overall process of training and testing the boundary classifiers for each entity; all the experiments were performed on a dual 1.66 GHz Power Mac G5, and the execution time for filter optimization is not reported because it is negligible. The results indicate that both CC and OR are far superior to IC, allowing a drastic reduction of the time.
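For clarity, the two corpus statistics used throughout this section can be computed as in the following small sketch; the per-token IOBE label list and the 0/1 instance-filter mask (as in Table 1) are assumed inputs, not part of the original system code.

```python
# Assumed inputs: `labels` is the per-token IOBE label list and `mask` the
# instance-filter output (0 = uninformative, 1 = kept), as in Table 1.
def skewness_ratio(labels, entity):
    positives = sum(1 for lab in labels if lab != "O" and lab.endswith(entity))
    negatives = len(labels) - positives
    return negatives / positives if positives else float("inf")

def filtering_rate(mask):
    # Percentage of instances removed by the filter (the psi of Tables 5-9).
    return 100.0 * mask.count(0) / len(mask)
```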
Table 9: Filtering Rate, Micro-averaged Recall, Precision, F1 and total computation time for SA.
Filter      ε     ψ (train/test)   R     P     F1    T
(none)      0     -                81.3  92.5  86.6  179
OR          1     53.6/86.2        81.5  92.1  86.5  91
OR          2.5   69.1/90.8        81.6  90.5  85.9  44
OR          5     74.7/90.8        81.0  85.0  83.0  31

Table 10: Skewness ratio of each entity for JNLPBA.
entity      ε     CC      OR      IC
protein     0     17.1    17.1    17.1
protein     1     7.5     3.8     9.6
protein     2.5   3.0     2.5     9.0
protein     5     1.5     1.4     8.8
DNA         0     59.3    59.3    59.3
DNA         1     26.4    18.5    33.2
DNA         2.5   14.7    12.6    31.7
DNA         5     8.3     8.6     32.4
RNA         0     596.2   596.2   596.2
RNA         1     250.7   253.1   288.4
RNA         2.5   170.4   170.1   274.5
RNA         5     92.4    111.1   280.7
cell type   0     72.9    72.9    72.9
cell type   1     13.8    13.4    43.2
cell type   2.5   6.3     6.5     43.9
cell type   5     3.4     4.4     44.5
cell line   0     146.4   146.4   146.4
cell line   1     40.4    41.6    87.7
cell line   2.5   24.2    25.9    87.5
cell line   5     13.6    14.6    89.6
Supervised IF techniques are thus particularly convenient when dealing with large data sets. For example, using the CC metric the time required by SIE to perform the JNLPBA task is reduced from 615 to 109 minutes (see Table 5). Both OR and CC allow a drastic reduction of the computation time while maintaining the prediction accuracy for small values of ε. Using OR, for example, with ε = 2.5% on JNLPBA, F1 increases from 66.7% to 67.9%. On the contrary, for CoNLL-2002 and TERN, for ε > 2.5% and ε > 1% respectively, the performance of all the filters rapidly declines. The explanation for this behavior is that, for the last two tasks, the difference between the filtering rates on the training and test sets becomes much larger for ε > 2.5% and ε > 1%, respectively. That is, the data skewness changes significantly from the training to the test set. It is not surprising that an extremely aggressive filtering step removes too much of the information available to the classifiers, leading the overall performance to decrease.

SIE achieves results close to the best systems in all tasks. It is worth noting that state-of-the-art IE systems often exploit external, domain-specific information (e.g. gazetteers (Carreras et al., 2002) and lexical resources (Zhou and Su, 2004)), while SIE adopts exactly the same feature set and does not use any external or task-dependent knowledge source.

8 Conclusion and Future Work

The portability, language independence and efficiency of SIE suggest its applicability to practical problems (e.g. the semantic web, information extraction from biological data) in which huge collections of texts have to be processed efficiently. With this in mind, we are pursuing the recognition of bio-entities from several thousand MEDLINE abstracts. In addition, the effectiveness of instance filtering will allow us to experiment with complex kernel methods. For the future, we plan to implement more aggressive instance filtering schemata for Entity Recognition, by performing a deeper semantic analysis of the texts.
For JNLPBA, CoNLL 2002 & 2003 and TERN 2004, the results are obtained using the official evaluation software made available by the organizers of the tasks. Note that the TERN results cannot be disclosed, so no direct comparison can be provided; for the reasons mentioned in (Lavelli et al., 2004), a direct comparison cannot be provided for the Seminar Announcements either.
Acknowledgments SIE was developed in the context of the IST Dot.Kom project (http://www.dot-kom.org), sponsored by the European Commission as part of Framework V (grant IST-2001-34038). Claudio Giuliano and Lorenza Romano have been supported by the ONTOTEXT project, funded by the Autonomous Province of Trento under the FUP-2004 research program.
References

Andrew J. Carlson, Chad M. Cumby, Jeff L. Rosen, and Dan Roth. 1999. SNoW user's guide. Technical Report UIUCDCS-DCS-R-99-210, Department of Computer Science, University of Illinois at Urbana-Champaign, April.

Xavier Carreras, Lluís Márques, and Lluís Padró. 2002. Named entity extraction using AdaBoost. In Proceedings of CoNLL-2002, Taipei, Taiwan.

Chih-Chung Chang and Chih-Jen Lin, 2001. LIBSVM: a library for support vector machines.
Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Fabio Ciravegna. 2000. Learning to tag for information extraction. In F. Ciravegna, R. Basili, and R. Gaizauskas, editors, Proceedings of the ECAI workshop on Machine Learning for Information Extraction, Berlin.

Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning, 20(3):273–297.

Chad Cumby and W. Yih. 2003. FEX user guide. Technical report, Department of Computer Science, University of Illinois at Urbana-Champaign, April.

Aidan Finn and Nicholas Kushmerick. 2004. Multi-level boundary classification for information extraction. In Proceedings of the 15th European Conference on Machine Learning, Pisa, Italy.

Radu Florian, Abe Ittycheriah, Hongyan Jing, and Tong Zhang. 2003. Named entity recognition through classifier combination. In Walter Daelemans and Miles Osborne, editors, Proceedings of CoNLL-2003, pages 168–171, Edmonton, Canada.

Dayne Freitag and Nicholas Kushmerick. 2000. Boosted wrapper induction. In Proceedings of the 17th National Conference on Artificial Intelligence (AAAI 2000), pages 577–583.

Dayne Freitag. 1998. Machine Learning for Information Extraction in Informal Domains. Ph.D. thesis, Carnegie Mellon University.

Claudio Giuliano, Alberto Lavelli, and Lorenza Romano. 2005. Simple information extraction (SIE). Technical report, ITC-irst.

Alfio Massimiliano Gliozzo, Claudio Giuliano, and Raffaella Rinaldi. 2005a. Instance filtering for entity recognition. SIGKDD Explorations (special issue on Text Mining and Natural Language Processing), 7(1):11–18, June.
J. Kim, T. Ohta, Y. Tsuruoka, Y. Tateisi, and N. Collier. 2004. Introduction to the bio-entity recognition task at JNLPBA. In N. Collier, P. Ruch, and A. Nazarenko, editors, Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA-2004), pages 70–75, Geneva, Switzerland, August 28–29.

A. Lavelli, M. Califf, F. Ciravegna, D. Freitag, C. Giuliano, N. Kushmerick, and L. Romano. 2004. IE evaluation: Criticisms and recommendations. In AAAI-04 Workshop on Adaptive Text Extraction and Mining (ATEM-2004), San Jose, California.

Bhavani Raskutti and Adam Kowalczyk. 2004. Extreme re-balancing for SVMs: a case study. SIGKDD Explorations Newsletter, 6(1):60–69.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Walter Daelemans and Miles Osborne, editors, Proceedings of CoNLL-2003, pages 142–147, Edmonton, Canada.

Erik F. Tjong Kim Sang. 2002. Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. In Proceedings of CoNLL-2002, pages 155–158, Taipei, Taiwan.

Yiming Yang and Jan O. Pedersen. 1997. A comparative study on feature selection in text categorization. In Douglas H. Fisher, editor, Proceedings of the 14th International Conference on Machine Learning (ICML-97), pages 412–420, Nashville, US. Morgan Kaufmann Publishers, San Francisco, US.

Guo Dong Zhou and Jian Su. 2004. Exploring deep knowledge resources in biomedical name recognition. In Proceedings of the 2004 Joint Workshop on Natural Language Processing in Biomedicine and its Applications, Geneva, Switzerland.
Alfio Massimiliano Gliozzo, Claudio Giuliano, and Raffaella Rinaldi. 2005b. Instance pruning by filtering uninformative words: an Information Extraction case study. In Proceedings of the Sixth International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2005), Mexico City, Mexico, 13–19 February.

T. Joachims. 1998. Making large-scale support vector machine learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods: Support Vector Machines. MIT Press, Cambridge, MA.

J. Kim, T. Ohta, Y. Tateishi, and J. Tsujii. 2003. GENIA corpus: a semantically annotated corpus for bio-textmining. Bioinformatics, 19(Suppl. 1):180–182.
! " #
"
! "
# $ !
%
!
$
$ &' # $ ! $ # # #
(
! ) ) * ) ! &+ # ,--.) / ! ,--0) 1 ** ! ,--0' & ' & % ,---'! % ( ) & ! ,--2') ! %
& ,--3) + # ,---' !
4
5 * &+ # ,---') % #
% % # $) ) ! /
%
! % $ $
#) ! 7 ) % %
#
%
! ! +) %
!
#
%
! )
# $ ! ) % % % # % 7 #
! % # % $ 7 )
) ! +) %
% ! %
$ )
17
) ! * %! % % #! 2 % % # $) % % #! . $ ! 8 ) $ $ ! 9 ! : ! 8 # % $ ) ; % ! +) %
% # !
$!
4 / $ 5 * >8 8/ % # & ! ,--0'! 1 ** ! &,--0' & ' / ! ! #% * % ! 7 ) % & ' !
%
)
& '(
%
* %
! 8
, & ,--3'! ) * ! * $ * + # &,--.'! *
.!3) ) F ! )
) % * .!3 % ! ) * $ % .!3! $ F) >F) "F) 1F /-)
, *(
% # $! $ $ % A % #G ( #) # % #G # # ! /-/
*(' ,1
D
# #
) 4% %5 #
! $ * % % ! /-2
!
1 ** ! &,--0' % 7 $ % ! % #) $ # % $ * $
19
! % $ * % % ) # %
! $ ) ) 7 ! + & 3HHH'!
2 2-$
.
$ % % # A
&485' &+ 3HH;' % # &4@ +>5' & ! ,--0'! 8 .;0
! % #) ) ! @ +> >8 8/ 4 / $ 5! 33-- % # ) 9-- % % ! + ) ,-- % *
! $ .-- ! 33 % # ) ) ) ) ! $ 0-A0- 8 :0A,0 @ +> ! 2-%
+3 ) % ) !!) +3 I &, $ $ '! 4 5 )
$ 8! + @ +> # 8) #
& % '! % # % * ! @ # % % * ! # % # * $! $ 8) % $
! !
@ +> * ! % ) % # #
8 %
;.!H 52-4 ;3!2 *
H2!3 H-!, 4/-5 F ! H2!9 4/-3 H,!: % 7 # 52-4 ;.!H 9H % - 54-/ ;;!H ;.!0 ! B %) % / ! " $ * ) % 0-
# %
) %
!$ &" $
23
" % @ +> ) /! )
/ 7 % ! 18 %
7 % ! @ %
!
!
"
0;!3
00!0
37-3
"!
99!9
9;!2
34-6
"
65-%
:-!H
:9!;
"
92!;
9,!;
35-2
"
36
00!0
99!H
"
::!.
:-!0
64-)
"
:;!H
:3!H
57-4
"
:,!;
9;!:
62-4
9,!2
33-2
99
!
9-!9
34-$
99!3
,H!:
/)-)
22!3
-
90
92!H
36-3
1 ! " 0-
4
) % $
! $ %
! @ # % % ! %
& # '
) ) $
# ! % #) %
)
$
%
! @ % !
& ) +! ,--3! - ) ) ! ! > 3: ! L 8 ! ) C! % ) L! ,---! 2 3 0 4 # 5! >! +) 8! M #) C! ,--.! 0 - * ! 2 6)! > 30 /) >) ! +) ! 3HH;! 0 7 2 6) 2 8! > ) ! +) ! #) C! ,---! * " ! > 3: C! 8 ! 1 ** ) 8! ! ,--0! 2 6! 9 8 1 $ ) C / > $ ) :&3') ! 333;! /) 8! ! ,--.! -! 0 ! 26 6- ! > . / ?
&/? ,--.') 3900 390;) / ) > ! /) N! ! ! ,--0! 30 * 7 ! 2 6)! > / @ # ! / C ! ! ) C! ! ,--0! 6- 0 7 2 6)! > ,, / ) D ) 1! ) L! ,--0! 9 6) 0 ! > / @) ) 1! ) L! ! ,--2! : ! ! > C//,--2) ! 3;.3;:! ) ! ) +! 3HHH! 0 7 ) ;! 8 2.!
24
Transductive Pattern Learning for Information Extraction Brian McLernon Nicholas Kushmerick School of Computer Science and Informatics University College Dublin, Ireland {brian.mclernon,nick}@ucd.ie
Abstract The requirement for large labelled training corpora is widely recognized as a key bottleneck in the use of learning algorithms for information extraction. We present T PLEX, a semi-supervised learning algorithm for information extraction that can acquire extraction patterns from a small amount of labelled text in conjunction with a large amount of unlabelled text. Compared to previous work, T PLEX has two novel features. First, the algorithm does not require redundancy in the fragments to be extracted, but only redundancy of the extraction patterns themselves. Second, most bootstrapping methods identify the highest quality fragments in the unlabelled data and then assume that they are as reliable as manually labelled data in subsequent iterations. In contrast, T PLEX’s scoring mechanism prevents errors from snowballing by recording the reliability of fragments extracted from unlabelled data. Our experiments with several benchmarks demonstrate that T PLEX is usually competitive with various fully-supervised algorithms when very little labelled training data is available.
1 Introduction
Information extraction is a form of shallow text analysis that involves identifying domain-specific fragments within natural language text. Most recent research has focused on learning algorithms that automatically acquire extraction patterns from manually labelled training data (e.g., (Riloff, 1993; Califf and Mooney, 1999; Soderland, 1999; Freitag and Kushmerick, 2000; Ciravegna, 2001; Finn and Kushmerick, 2004)). This training data takes the form of the original text, annotated with the fragments to be extracted. Due to the expense and tedious nature of this labelling process, it is widely recognized that a key bottleneck in deploying such algorithms is the need to create a sufficiently large training corpus for each new domain. In response to this challenge, many researchers have investigated semi-supervised learning algorithms that learn from a (relatively small) set of labelled texts
in conjunction with a (relatively large) set of unlabelled texts (e.g., (Riloff, 1996; Brin, 1998; Yangarber et al., 2002)). In this paper, we present T PLEX, a semi-supervised algorithm for learning information extraction patterns. The key idea is to exploit the following recursive definitions: good patterns extract good fragments, and good fragments are extracted by good patterns. To operationalize this recursive definition, we initialize the pattern and fragment scores with labelled data, and then iterate until the scores have converged. Most prior semi-supervised approaches to information extraction assume that fragments are essentially named entities, so that there will be many occurrences of any given fragment. For example, for the task of discovering diseases (“influenza”, “Ebola”, etc), prior algorithms assume that each disease will be mentioned many times, and that every occurrence of such a disease in unlabelled text should be extracted. However, it may not be the case that fragments to be extracted occur more than once in the corpus, or that every occurrence of a labelled fragment should be extracted. For example, in the well-known CMU Seminars corpus, any given person usually gives just one seminar, and a fragment such as “3pm” sometimes indicates the start time, other occurrences indicate the end time, and some occurrence should not be extracted at all. Rather than relying on redundancy of the fragments, T PLEX exploits redundancy of the learned extraction patterns. T PLEX is a transductive algorithm (Vapnik, 1998), in that the goal is to perform extraction from a given unlabelled corpus, given a labelled corpus. This is in contrast to the typical machine learning framework, where the goal is a set of extraction patterns (which can of course then be applied to new unlabelled text). As a side-effect, T PLEX does generate a set of extraction patterns which may be a useful in their own right, depending on the application. We have compared T PLEX with various competitors on a variety of real-world extraction tasks. We have observed that T PLEX’s performance matches or exceeds these in several benchmark tasks. The remainder of this paper is organized as follows. After describing related work in more detail (Sec. 2), we describe the T PLEX algorithm (Sec. 3). We then discuss a series of experiments to compare T PLEX with various supervised al-
25
gorithms (Sec. 4). We conclude with a summary of observations made during the evaluation, and a discussion of future work (Sec. 5).
2 Related work
A review of machine learning for information extraction is beyond the scope of this paper; see e.g. (Cardie, 1997; Kushmerick and Thomas, 2003). A number of researchers have previously developed bootstrapping or semi-supervised approaches to information extraction, named entity recognition, and related tasks (Riloff, 1996; Brin, 1998; Riloff and Jones, 1999; Agichtein et al., 2001; Yangarber et al., 2002; Stevenson and Greenwood, 2005; Etzioni et al., 2005). Several approaches for learning from both labeled and unlabeled data have been proposed (Yarowsky, 1995; Blum and Mitchell, 1998; Collins and Singer, 1999) where the unlabeled data is utilised to boost the performance of the algorithm. In (Collins and Singer, 1999) Collins and Singer show that unlabeled data can be used to reduce the level of supervision required for named entity classification. However, their approach is reliant on the presence of redundancy in the named entities to be identified. T PLEX is most closely related to the N OMEN algorithm (Yangarber et al., 2002). N OMEN has a very simple iterative structure: at each step, a very small number of high-quality new fragments are extracted, which are treated in the next step as equivalent to seeds from the labeled documents. N OMEN has a number of parameters which must be carefully tuned to ensure that it does not over-generalise. Erroneous additions to the set of trusted fragments can lead to a snowballing of errors. Also, N OMEN uses a binary scoring mechanism, which works well in dense corpora with substantial redundancy. However, many information extraction tasks feature sparse corpora with little or no redundancy. We have extended N OMEN by allowing it to make finergrained (as opposed to binary) scoring decisions at each iteration. Instead of definitively assigning a position to a given field, we calculate the likelihood that it belongs to the field over multiple iterations.
3 The T PLEX algorithm
The goal of the algorithm is to identify the members of the target fields within unlabelled texts by generalizing from seed examples in labelled training texts. We achieve this by generalizing boundary detecting patterns and scoring them with a recursive scoring function. As shown in Fig. 1, T PLEX bootstraps the learning process from a seed set of labelled examples. The examples are used to populate initial pattern sets for each target field, with patterns that match the start and end positions of the seed fragments. Each pattern is then generalised to produce more patterns, which are in turn
applied to the corpus in order to identify more base patterns. This process iterates until no more patterns can be learned.

Figure 1: An overview of T PLEX. A key idea of the algorithm is the following recursive scoring method: pattern scores are a function of the scores of the positions they extract, and position scores are a function of the scores of the patterns that extract them.

T PLEX employs a recursive scoring metric in which good patterns reinforce good positions, and good positions reinforce good patterns. Specifically, we calculate confidence scores for positions and patterns. Our scoring mechanism calculates the score of a pattern as a function of the scores of the positions that it matches, and the score of a position as a function of the scores of the patterns that extract it. T PLEX is a multi-field extraction algorithm in that it extracts multiple fields simultaneously. By doing this, information learned for one field can be used to constrain patterns learned for others. Specifically, our scoring mechanism ensures that if a pattern scores highly for one field, its score for all other fields is reduced. In the remainder of this section, we describe the algorithm by formalizing the space of learned patterns, and then describing T PLEX's scoring mechanism.

3.1 Boundary detection patterns

T PLEX extracts fragments of text by identifying probable fragment start and end positions, which are then assembled into complete fragments. T PLEX's patterns are therefore boundary detectors which identify one end of a fragment or the other. T PLEX learns patterns to identify the start and end of target occurrences independently of each other. This strategy has previously been employed successfully (Freitag and Kushmerick, 2000; Ciravegna, 2001; Yangarber et al., 2002; Finn and Kushmerick, 2004). T PLEX's boundary detectors are similar to those learned by BWI (Freitag and Kushmerick, 2000). A boundary detector has two parts, a left pattern and a right pattern. Each of these patterns is a sequence of tokens, where each is either a literal or a generalized token. For example, the boundary detector

[will be ][ ]
would correctly find the start of a name in an utterance such as “will be: Dr Robert Boyle” and “will be, Sandra Frederick”, but it will fail to identify the start of the name in “will be Dr. Robert Boyle”. The boundary detectors that find the beginnings of fragments are called the pre-patterns, and the detectors that find the ends of fragments are called the post-patterns.

3.2 Pattern generation

As input, T PLEX requires a set of tagged seed documents for training, and an untagged corpus for learning. The seed documents are used to initialize the pre-pattern and post-pattern sets for each of the target fields. Within the seed documents each occurrence of a fragment belonging to any of the target categories is surrounded by a set of special tags that denote the field to which it belongs. The algorithm parses the seed documents and identifies the tagged fragments in each document. It then generates patterns for the start and end positions of each fragment based on the surrounding tokens. Each pattern varies in length from 2 to n tokens. For a given pattern length ℓ, the patterns can then overlap the position by zero to ℓ tokens. For example, a pre-pattern of length four with an overlap of one will match the three tokens immediately preceding the start position of a fragment, and the one token immediately following that position. In this way, we generate ∑_{i=2}^{n} (i + 1) patterns from each seed position. In our experiments, we set the maximum pattern length to be n = 4.

T PLEX then grows these initial sets for each field by generalizing the initial patterns generated for each seed position. We employ eight different generalization tokens when generalizing the literal tokens of the initial patterns. The wildcard token matches every literal. The second type of generalization matches punctuation such as commas and periods. Similarly, there are generalization tokens that match literals with an initial capital letter, a sequence of digits, a literal consisting of letters followed by digits, and a literal consisting of digits followed by letters. The final two generalisations match literals that appear in a list of first and last names (respectively) taken from US Census data.

All patterns are then applied to the entire corpus, including the seed documents. When a pattern matches a new position, the tokens at that position are converted into a maximally-specialized pattern, which is added to the pattern set. Patterns are repeatedly generalized until only one literal token remains. This whole process iterates until no new patterns are discovered. We do not generalize the new maximally-specialized patterns discovered in the unlabelled data. This ensures that all patterns are closely related to the seed data. (We experimented with generalizing patterns from the unlabelled data, but this rapidly leads to overgeneralization.)

The locations in the corpus where the patterns match are regarded as potential target positions. Pre-patterns indicate potential start positions for target fragments while post-patterns indicate end positions. When all of the patterns have been matched against the corpus, each field will have a corresponding set of potential start and end positions.
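The window/overlap enumeration described above can be illustrated with a small hypothetical sketch (not the authors' implementation); the generalization of literal tokens is omitted here.

```python
# Hypothetical sketch of the pattern-generation enumeration described above.
# `boundary` is an inter-token position (the index of the first token after
# the boundary); generalization of the literals is omitted.
def seed_patterns(tokens, boundary, n=4):
    patterns = []
    for length in range(2, n + 1):              # pattern lengths 2..n
        for overlap in range(0, length + 1):    # overlap the boundary by 0..length tokens
            start = boundary - (length - overlap)
            end = boundary + overlap
            if start < 0 or end > len(tokens):
                continue
            patterns.append(tuple(tokens[start:end]))
    return patterns

# For each seed position this yields sum_{i=2..n} (i + 1) patterns,
# e.g. 3 + 4 + 5 = 12 maximally specific patterns when n = 4.
```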
3.3 Notation & problem statement

Positions are denoted by r, and patterns are denoted by p. Formally, a pattern is equivalent to the set of positions that it extracts. The notation p → r indicates that pattern p matches position r. Fields are denoted by f, and F is the set of all fields. The labelled training data consists of a set of positions R = {. . . , r, . . .}, and a labelling function T : R → F ∪ {X} for each such position. T(r) = f indicates that position r is labelled with field f in the training data. T(r) = X means that r is not labelled in the training data (i.e. r is a negative example for all fields). The unlabelled test data consists of an additional set of positions U. Given this notation, the learning task can be stated concisely as follows: extend the domain of T to U, i.e. generalize from T(r) for r ∈ R, to T(r) for r ∈ U.
3.4 Pattern and position scoring

When the patterns and positions for the fields have been identified we must score them. Below we describe in detail the recursive manner in which we define score_f(r) in terms of score_f(p), and vice versa. Given that definition, we want to find fixed-point values for score_f(p) and score_f(r). To achieve this, we initialize the scores, and then iterate through the scoring process (i.e. calculate scores at step t + 1 from scores at step t). This process repeats until convergence.

Initialization. As the scores of the patterns and positions of a field are recursively dependent, we must assign initial scores to one or the other. Initially the only elements that we can classify with certainty are the seed fragments. We initialise the scoring function by assigning scores to the positions for each of the fields. In this way it is then possible to score the patterns based on these initial scores. From the labelled training data, we derive the prior probability π(f) that a randomly selected position belongs to field f ∈ F: π(f) = |{r ∈ R | T(r) = f}| / |R|. Note that 1 − ∑_f π(f) is simply the prior probability that a randomly selected position should not be extracted at all; typically this value is close to 1. Given the priors π(f), we score each potential position r in field f:

score⁰_f(r) = π(f) if r ∈ U;  1 if r ∈ R ∧ T(r) = f;  0 if r ∈ R ∧ T(r) ≠ f.

The first case handles positions in the unlabelled documents; at this point we don't know anything about them and so fall back to the prior probabilities. The second and third cases handle positions in the seed documents, for which we have complete information.

Iteration. After initializing the scores of the positions, we begin the iterative process of scoring the patterns and the positions. To compute the score of a pattern p for field f we compute a positive score, pos_f(p); a negative score, neg_f(p); and an unknown score, unk(p). pos_f(p) can be thought of as a measure of the benefit of p to f, while neg_f(p) measures the harm of p to f, and unk(p) measures the uncertainty about the field with which p is associated. These quantities are defined as follows. pos_f(p) is the average score for field f of positions extracted by p. We first compute:

pos_f(p) = (1/Z_p) ∑_{p→r} score^t_f(r),

where Z_p = ∑_f ∑_{p→r} score^t_f(r) is a normalizing constant to ensure that ∑_f pos_f(p) = 1. For each field f and pattern p, neg_f(p) is the extent to which p extracts positions whose field is not f: neg_f(p) = 1 − pos_f(p). Finally, unk(p) measures the degree to which p extracts positions whose field is unknown:

unk(p) = (1/|{p → r}|) ∑_{p→r} unk(r),

where unk(r) measures the degree to which position r is unknown. To be completely ignorant of a position's field is to fall back on the prior field probabilities π(f). Therefore, we calculate unk(r) by computing the sum of squared differences between score^t_f(r) and π(f):

unk(r) = 1 − SSD(r)/Z,  where SSD(r) = ∑_f (score^t_f(r) − π(f))²  and  Z = max_r SSD(r).

The normalization constant Z ensures that unk(r) = 0 for the position r whose scores are the most different from the priors, i.e. r is the “least unknown” position. For each field f and pattern p, score^{t+1}_f(p) is defined in terms of pos_f(p), neg_f(p) and unk(p) as follows:

score^{t+1}_f(p) = pos_f(p) / (pos_f(p) + neg_f(p) + unk(p)) · pos_f(p) = pos_f(p)² / (1 + unk(p)).

This definition penalizes patterns that are either inaccurate or have low coverage. Finally, we complete the iterative step by calculating a revised score for each position:

score^{t+1}_f(r) = score^t_f(r) if r ∈ R;  (∑_{p→r} score^t_f(p) − min) / (max − min) if r ∈ U,

where min = min_{f,r} ∑_{p→r} score^t_f(p) and max = max_{f,r} ∑_{p→r} score^t_f(p) are used to normalize the scores, to ensure that the scores of unlabelled positions never exceed the scores of labelled positions. The first case in the function for score^{t+1}_f(r) handles positive and negative seeds (i.e. positions in labelled texts); the second case is for unlabelled positions. We iterate this procedure until the scores of the patterns and positions converge. Specifically, we stop when

∑_f ( ∑_p |score^t_f(p) − score^{t−1}_f(p)|² + ∑_r |score^t_f(r) − score^{t−1}_f(r)|² ) < θ.

In our experiments, we fixed θ = 1.
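The fixed-point computation above can be summarised in a few lines of code. The following is a hedged sketch under the reconstruction of the formulas given here, not the authors' implementation; the input data structures (extracts, prior, initial position scores, the set of labelled positions) are assumed.

```python
from collections import defaultdict

def iterate_scores(extracts, fields, prior, score_r, labelled, theta=1.0):
    """extracts[p]: positions matched by pattern p; prior[f] = pi(f);
    score_r[f][r]: initial position scores; labelled: seed positions R."""
    patterns_of = defaultdict(list)                 # inverse index: r -> patterns
    for p, rs in extracts.items():
        for r in rs:
            patterns_of[r].append(p)
    positions = list(patterns_of)
    score_p = {f: defaultdict(float) for f in fields}

    while True:
        # unk(r) = 1 - SSD(r) / max_r SSD(r)
        ssd = {r: sum((score_r[f][r] - prior[f]) ** 2 for f in fields)
               for r in positions}
        z = max(ssd.values()) or 1.0
        unk_r = {r: 1.0 - ssd[r] / z for r in positions}

        # pattern scores: pos_f(p)^2 / (1 + unk(p))
        new_p = {f: {} for f in fields}
        for p, rs in extracts.items():
            zp = sum(score_r[f][r] for f in fields for r in rs) or 1.0
            unk_p = sum(unk_r[r] for r in rs) / len(rs)
            for f in fields:
                pos = sum(score_r[f][r] for r in rs) / zp
                new_p[f][p] = pos ** 2 / (1.0 + unk_p)

        # position scores: normalised sums of the extracting patterns' scores
        sums = {f: {r: sum(new_p[f][p] for p in patterns_of[r])
                    for r in positions} for f in fields}
        lo = min(min(s.values()) for s in sums.values())
        hi = max(max(s.values()) for s in sums.values())
        new_r = {f: {r: score_r[f][r] if r in labelled
                     else (sums[f][r] - lo) / (hi - lo + 1e-12)
                     for r in positions} for f in fields}

        # convergence test against the threshold theta
        delta = sum((new_p[f][p] - score_p[f][p]) ** 2
                    for f in fields for p in extracts)
        delta += sum((new_r[f][r] - score_r[f][r]) ** 2
                     for f in fields for r in positions)
        score_p, score_r = new_p, new_r
        if delta < theta:
            return score_p, score_r
```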
3.5 Position filtering & fragment identification

Due to the nature of the pattern generation strategy, many more candidate positions will be identified than there are targets in the corpus. Before we can proceed with matching start and end positions to form fragments, we filter the positions to remove the weaker candidates. We rank all of the positions for each field according to their score. We then select positions with a score above a threshold β as potential positions. In this way we reduce the number of candidate positions from tens of thousands to a few hundred. The next step in the process is to identify complete fragments within the corpus by matching pre-positions with post-positions. To do this we compute the length probabilities for the fragments of field f based on the lengths of the seed fragments of f. Suppose that position r1 has been identified as a possible start for field f, and position r2 has been identified as a possible field f end, and let P_f(ℓ) be the fraction of field f seed fragments with length ℓ. Then the fragment e = (r1, r2) is assigned a score score_f(e) = score_f(r1) · score_f(r2) · P_f(r2 − r1 + 1). Despite these measures, overlapping fragments still occur. Since the correct fragments cannot overlap, we know that if two extracted fragments overlap, at least one must be wrong. We resolve overlapping fragments by calculating the set of non-overlapping fragments that maximises the total score while also accounting for the expected rate of occurrence of fragments from each field in a given document. In more detail, let E be the set of all fragments extracted from some particular document D. We are interested in the score of some subset G ⊆ E of D's
fragments. Let score(G) be the chance that G is the correct set of fragments for D. Assuming that the correctness of the fragments can be determined independently given that the correct number of fragments have been identified for each field, score(G) is defined to be zero if there exist (s1, r1), (s2, r2) ∈ G such that (s1, r1) overlaps (s2, r2), and score(G) = ∏_f score(G_f) otherwise, where G_f ⊆ G is the set of fragments in G for field f. The score of G_f = {e1, e2, . . .} is defined as score(G_f) = Pr(|G_f|) · ∏_j score(e_j), where Pr(|G_f|) is the fraction of training documents that have |G_f| instances of field f. It is infeasible to enumerate all subsets G ⊆ E, so we perform a heuristic search. The states in the search space are pairs of the form (G, P), where G is a list of good fragments (i.e. fragments that have been accepted), and P is a list of pending fragments (i.e. fragments that haven't yet been accepted or rejected). The search starts in the state ({}, E), and states of the form (G, {}) are terminal states. The children of state (G, P) are all ways to move a single fragment from P to G. When forming a child's pending set, the moved fragment along with all fragments that it overlaps are removed (meaning that the moved fragment is selected and all fragments with which it overlaps are rejected). More precisely, the children of state (G, P) are:

{ (G′, P′) | e ∈ P ∧ G′ = G ∪ {e} ∧ P′ = {e′ ∈ P | e′ does not overlap e} }.
The search proceeds as follows. We maintain a set S of the best K non-terminal states that are to be expanded, and the best terminal state B encountered so far. Initially, S contains just the initial state. Then, the children of each state in S are generated. If there are no such children, the search halts and B is returned as the best set of fragments. Otherwise, B is updated if appropriate, and S is set to the best K of the new children. Note that K = ∞ corresponds to an exhaustive search. In our experiments, we used K = 5.
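The following hedged sketch illustrates this beam search under simple assumptions: fragments are (start, end) pairs and score_of is an assumed callable implementing the score(G) definition above. It is not the authors' code; in particular, expanding pending fragments in index order is a simplification that avoids regenerating permutations of the same subset.

```python
# Sketch of the overlap-resolving search: states are (accepted, pending)
# pairs; expanding a state moves one pending fragment to the accepted list
# and discards every pending fragment it overlaps.
def overlaps(a, b):
    (s1, e1), (s2, e2) = a, b
    return not (e1 < s2 or e2 < s1)

def select_fragments(extracted, score_of, beam=5):
    best, best_score = [], score_of([])
    frontier = [([], list(extracted))]            # the initial state ({}, E)
    while frontier:
        children = []
        for accepted, pending in frontier:
            for i, frag in enumerate(pending):
                new_accepted = accepted + [frag]
                # keep only later fragments that do not overlap the chosen one
                new_pending = [g for g in pending[i + 1:] if not overlaps(g, frag)]
                children.append((new_accepted, new_pending))
        if not children:
            return best
        for accepted, pending in children:
            if not pending and score_of(accepted) > best_score:
                best, best_score = accepted, score_of(accepted)
        children.sort(key=lambda st: score_of(st[0]), reverse=True)
        frontier = [st for st in children if st[1]][:beam]   # best K non-terminal states
    return best
```

With beam set very large the procedure degenerates into the exhaustive search mentioned above; the experiments in the text use K = 5.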
4 Experiments
To evaluate the performance of our algorithm we conducted experiments with four widely used corpora: the CMU seminar set, the Austin jobs set, the Reuters acquisition set [www.isi.edu/info-agents/RISE], and the MUC-7 named entity corpus [www.ldc.upenn.edu]. We randomly partitioned each of the datasets into two evenly sized subsets. We then used these as labeled and unlabeled sets. In each experiment the algorithm was presented with a document set comprised of the test set and a randomly selected percentage of the documents from the training set. For example, if an experiment involved providing the algorithm with a 5% seed set, then 5% of the documents in the training set (2.5% of the documents in the entire dataset) would be selected at random and used in conjunction with the documents in the test set.
For each training set size, we ran five iterations with a randomly selected subset of the documents used for training. Since we are mainly motivated by scenarios with very little training data, we varied the size of the training set from 1–10% (1–16% for MUC-7NE) of the available documents. Precision, recall and F1 were calculated using the BWI (Freitag and Kushmerick, 2000) scorer. We used all occurrences mode, which records a match in the case where we extract all of the valid fragments in a given document, but we get no credit for partially correct extractions. We compared T PLEX to BWI (Freitag and Kushmerick, 2000), LP2 (Ciravegna, 2001), E LIE (Finn and Kushmerick, 2004), and an approach based on conditional random fields (Lafferty et al., 2001). The data for BWI was obtained using the T IES implementation [tcc.itc.it/research/textec/toolsresources/ties.html]. The data for the LP2 learning curve was obtained from (Ciravegna, 2003). The results for E LIE were generated by the current implementation [http://smi.ucd.ie/aidan/Software.html]. For the CRF results, we used MALLET’s SimpleTagger (McCallum, 2002), with each token encoded with a set of binary features (one for each observed literal, as well as the eight token generalizations). Our results in Fig. 2 indicate that in Acquisitions dataset, our algorithm substantially outperforms the competitors at all points on the learning curve. For the other datasets, the results are mixed. For SA and Jobs, T PLEX is the second best algorithm at the low end of the learning curve, and steadily loses ground as more labelled data is available. T PLEX is the least accurate algorithm for the MUC data. In Sec. 5, we discuss a variety of modifications to the T PLEX algorithm that we anticipate may improve its performance. Finally, the graph in Fig. 3 compares T PLEX for the SA dataset, in two configurations: with a combination of labelled and unlabelled documents as usual, and with only labelled documents. In both instances the algorithm was given the same seed and testing documents. In the first case the algorithm learned patterns using both the labeled and unlabeled documents. However, in the second case, only the labeled documents were used to generate the patterns. These data confirm that T PLEX is indeed able to improve performance from unlabelled data.
5 Discussion
We have described T PLEX, a semi-supervised algorithm for learning information extraction patterns. The key idea is to exploit the following recursive definition: good patterns are those that extract good fragments, and good fragments are those that are extracted by good patterns. This definition allows T PLEX to perform well with very little training data in domains where other approaches that assume fragment redundancy would fail.
Figure 3: F1 averaged across all fields, for the Seminar dataset trained on only labeled data and trained on labeled and unlabeled data
Figure 2: F1 averaged across all fields, for the Seminar (top), Acquisitions (second), Jobs (third) and MUC-NE (bottom) corpora.

Conclusions. From our experiments we have observed that our algorithm is particularly competitive in scenarios where very little labelled training data is available. We contend that this is a result of our algorithm's ability to use the unlabelled test data to validate the patterns learned from the training data. We have also observed that the number of fields that are being extracted in the given domain affects the performance of our algorithm. T PLEX extracts all fields simultaneously and uses the scores from each of the patterns that extract a given position to determine the most likely field for that position. With more fields in the problem domain there is potentially more information on each of the candidate positions to constrain these decisions.
Future work. We are currently extending T PLEX in several directions. First, position filtering is currently performed as a distinct post-processing step. It would be more elegant (and perhaps more effective) to incorporate the filtering heuristics directly into the position scoring mechanism. Second, so far we have focused on a BWI-like pattern language, but we speculate that richer patterns permitting (for example) optional or reordered tokens may well deliver substantial increases in accuracy. We are also exploring ideas for semi-supervised learning from the machine learning community. Specifically, probabilistic finite-state methods such as hidden Markov models and conditional random fields have been shown to be competitive with more traditional pattern-based approaches to information extraction (Fuchun and McCallum, 2004), and these methods can exploit the Expectation Maximization algorithm to learn from a mixture of labelled and unlabelled data (Lafferty et al., 2004). It remains to be seen whether this approach would be effective for information extraction. Another possibility is to explore semi-supervised extensions to boosting (d'Alché Buc et al., 2002). Boosting is a highly effective ensemble learning technique, and BWI uses boosting to tune the weights of the learned patterns, so if we generalize boosting to handle unlabelled data, then the learned weights may well be more effective than those calculated by T PLEX.

Acknowledgements. This research was supported by grants SFI/01/F.1/C015 from Science Foundation Ireland, and N00014-03-1-0274 from the US Office of Naval Research.
References E. Agichtein, L. Gravano, J. Pavel, V. Sokolova, and A. Voskoboynik. 2001. Snowball: A prototype system for extracting relations from large text collections. In Proc. Int. Conf. Management of Data. A. Blum and T. Mitchell. 1998. Combining labeled and unlabeled data with co-training. In Proc. 11th Annual Conference on Computational Learning Theory, pages 92–100. S. Brin. 1998. Extracting patterns and relations from the World Wide Web. In WebDB Workshop at the Int. Conf. Extending Database Technology.
J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. Int. Conf. Machine Learning. J. Lafferty, X. Zhu, and Y. Liu. 2004. Kernel conditional random fields: Representation, clique selection, and semi-supervised learning. In Proc. Int. Conf. Machine Learning. A. McCallum. 2002. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu. E. Riloff and R. Jones. 1999. Learning dictionaries for information extraction by multi-level bootstrapping. In Proc. American Nat. Conf. Artificial Intelligence.
M. E. Califf and R. Mooney. 1999. Relational learning of pattern-match rules for information extraction. In Proc. American Nat. Conf. Artificial Intelligence.
E. Riloff. 1993. Automatically constructing a dictionary for information extraction tasks. In Proc. American Nat. Conf. Artificial Intelligence.
C. Cardie. 1997. Empirical methods in information extraction. AI Magazine, 18(4).
E. Riloff. 1996. Automatically generating extraction patterns from untagged text. In Proc. American Nat. Conf. Artificial Intelligence.
F. Ciravegna. 2001. Adaptive information extraction from text by rule induction and generalisation. In Proc. Int. J. Conf. Artificial Intelligence. F. Ciravegna. 2003. LP2 : Rule induction for information extraction using linguistic constraints. Technical report, Department of Computer Science, University of Sheffield. M. Collins and Y. Singer. 1999. Unsupervised models for named entity classification. In Proc. joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 100–110. F. d’Alch´e Buc, Y. Grandvalet, and C. Ambroise. 2002. Semi-supervised MarginBoost. In Proc. Neural Information Processing Systems.
S. Soderland. 1999. Learning information extraction rules for semi-structured and free text. Machine Learning, 34(1-3):233–272. M. Stevenson and M. Greenwood. 2005. Automatic learning of information extraction patterns. In Proc. 19th Int. J. Conf. on Artificial Intelligence. V. Vapnik. 1998. Statistical learning theory. Wiley. R. Yangarber, W. Lin, and R. Grishman. 2002. Unsupervised learning of generalised names. In Proc. Int. Conf. Computational Linguistics. D. Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In Meeting of the Association for Computational Linguistics, pages 189–196.
O. Etzioni, M. Cafarella, D. Downey, A. Popescu, T. Shaked, S. Soderland, D. Weld, and A. Yates. 2005. Unsupervised named-entity extraction from the Web: An experimental study. Artificial Intelligence. In press. A. Finn and N. Kushmerick. 2004. Multi-level boundary classification for information extraction. In Proc. European Conf. Machine Learning. D. Freitag and N . Kushmerick. 2000. Boosted wrapper induction. In Proc. American Nat. Conf. Artificial Intelligence. P. Fuchun and A. McCallum. 2004. Accurate information extraction from research papers using conditional random fields. In Proc. Human Language Technology Conf. N. Kushmerick and B. Thomas. 2003. Adaptive information extraction: Core technologies for information agents. Lecture Notes in Computer Science, 2586.
31
Recognition of synonyms by a lexical graph Peter Siniakov
[email protected] Database and Information Systems Group, Freie Universit¨at Berlin Takustr. 9, 14195 Berlin, Germany
Abstract Semantic relationships between words comprised by thesauri are essential features for IR, text mining and information extraction systems. This paper introduces a new approach to identification of semantic relations such as synonymy by a lexical graph. The graph is generated from a text corpus by embedding syntactically parsed sentences in the graph structure. The vertices of the graph are lexical items (words), their connection follows the syntactic structure of a sentence. The structure of the graph and distances between vertices can be utilized to define metrics for identification of semantic relations. The approach has been evaluated on a test set of 200 German synonym sets. Influence of size of the text corpus, word generality and frequency has been investigated. Conducted experiments for synonyms demonstrate that the presented methods can be extended to other semantic relations.
1 Introduction
Once predominantly used by human authors to improve their style avoiding repetitions of words or phrases, thesauri now serve as an important source of semantic and lexical information for automatic text processing. The electronic online thesauri such as WordNet (2005) and OpenThesaurus (2005) have been increasingly employed for many IR and NLP problems. However, considerable human effort is required to keep up with the evolving language and many subdomains are not sufficiently covered (Turney, 2001). Many domainspecific words or word senses are not included;
inconsistency and bias are often cited as further major deficiencies of hand-made thesauri (Curran and Moens, 2002), (Senellart and Blondel, 2003). There is a continuous demand for automatic identification of semantic relations and thesaurus generation. Such tools do not only produce thesauri that are more adapted to a particular application in a certain domain, but provide also assistance for lexicographers in manual creation and keeping the hand-written thesauri up to date. Numerous applications in IR (e.g. query expansion) and text mining (identification of relevant content by patterns) underline their usefulness.
2 Related work
Identification of semantic relations has been approached by different communities as a component of a knowledge management system or application of a developed NLP framework. Many approaches are guided by the assumption that similar terms occur in similar context and obtain a context representation of terms as attribute vectors or relation tuples (Curran and Moens, 2002), (Ruge, 1997), (Lin, 1998). A similarity metric defined on the context representations is used to cluster similar terms (e.g. by the nearest neighbor method). The actual definitions of context (whole document (Chen and Lynch, 1992), textual window, some customized syntactic contexts, cf. (Senellart and Blondel, 2003)) and similarity metric (cf. (Manning and Sch¨ utze, 1999), (Curran and Moens, 2002)) are the essential distinguishing features of the approaches. A pattern-based method is proposed by Hearst (Hearst, 1998). Existing relations in the WordNet database are used to discover regular linguistic patterns that are characteristic for these relations. The patterns contain lexical and syntactic elements and are acquired
from a text corpus by identifying common context of word pairs for which a semantic relation holds. Identified patterns are applied to a large text corpus to detect new relations. The method can be enhanced by applying filtering steps and iterating over new found instances (Phillips and Riloff, 2002). Lafourcade and Prince base their approach on reduction of word semantics to conceptual vectors (vector space is spanned by a hierarchy of concepts provided by a thesaurus, (Lafourcade, 2001)). Every term is projected in the vector space and can be expressed by the linear combination of conceptual vectors. The angle between the vectorial representations of two terms is used in calculation of thematic closeness (Lafourcade and Prince, 2001). The approach is more closely related to our approach since it offers a quantitative metric to measure the degree of synonymy between two lexical items. In contrast, Turney (Turney, 2001) tries to solve a quite simpler “TOEFL-like” task of selecting a synonym to a given word from a set of words. Mutual information related to the co-occurrence of two words combined with information retrieval is used to assess the degree of their statistical independency. The least independent word is regarded synonymous. Blondell et al. (Blondel et al., 2004) encode a monolingual dictionary as a graph and identify synonyms by finding subgraphs that are similar to the subgraph corresponding to the queried term. The common evaluation method for similarity metrics is comparing their performance on the same test set with the same context representations with some manually created semantic source as the gold standard (Curran and Moens, 2002). Abstracting from results for concrete test sets, Weeds et al. (2004) try to identify statistical and linguistic properties on that the performance of similarity metrics generally depends. Different bias towards words with high or low frequency is recognized as one reason for the significant variance of k-nearest neighbors sets of different similarity metrics.
3 Construction of the lexical graph
The assumption that similar terms occur in similar context leads most researchers to establish explicit context models (e.g. in the form of vectors or relation tuples). We build an implicit context representation connecting lexical items in a way corresponding to the sentence structure (as opposed to (Blondel et al., 2004), where a term is linked to every word in its definition). The advantage of the graph model is its transitivity: not only terms in the immediate context but also semantically related terms that have a short path to the examined term (but perhaps have never occurred in its immediate context) can contribute to identification of related terms. The similarity metric can be intuitively derived from the distance between the lexical vertices in the graph.
Figure 1: Main steps during graph construction.

To construct the lexical graph, articles from five volumes of two German computer journals have been chunk-parsed and POS-tagged using TreeTagger (2004). To preserve the semantic structure of the sentences during the graph construction, i.e. to connect words that build the actual statement of the sentence, parsed sentences are preprocessed before being inserted in the graph (fig. 1). The punctuation signs and parts of speech that do not carry a self-contained semantics (such as conjunctions, pronouns, articles) are removed in a POS filtering step. Tokenization errors are heuristically removed and the words are replaced by their normal forms (e.g. infinitive form for verbs, nominative singular for nouns). German grammar is characterized by a very frequent use of auxiliary and modal verbs that in most cases immediately precede or follow the semantically related sentence parts such as a direct object or prepositional phrase, while the main verb is often not adjacent to the related parts in a sentence. Since the direct edge between the main verb and non-adjacent related sentence parts cannot be drawn, the
sentence is syntactically reorganized by replacing the modal or auxiliary verbs by the corresponding main verb. Another syntactic rearrangement takes place when detachable prefixes are attached to the corresponding main verb. In German, some verb prefixes are detached and located at the end of the main clause. Since verbs without a prefix have a different meaning, the prefixes have to be attached to the verb stem. The reorganized sentence
can be added to the graph by inserting the normalized words of a sentence as vertices and connecting adjacent words by a directed edge. However, some adjacent words are not semantically related to each other; therefore the lexical graph features two types of edges (see an example in fig. 2). A property edge links the head word of a syntactic chunk (verb or noun phrase) with its modifiers (adverbs or adjectives, respectively) that characterize the head word, and is bidirectional. A sequential edge connects the head words (e.g. main verbs, head nouns) of syntactic chunks, reflecting the "semantic backbone" of the sentence.

Figure 2: An example of a sentence transformed into a lexical graph

The length of an edge represents how strongly two lexical items are related to each other and therefore depends on the frequency of their co-occurrence. It is initialized with a maximum length M. Every time an existing edge is found in the currently processed sentence, its current length CurLen is modified according to CurLen = M / (M/CurLen + 1), so that after n co-occurrences the length equals M/n; hence the length
of an edge is inversely proportional to the frequency of co-occurrence of its endpoints. After all sentences from the text corpus have been added to the lexical graph, vertices (words) with a low frequency (≤ θ) are removed from the graph, primarily to accelerate the distance calculation. Such rarely occurring words are usually proper nouns, abbreviations, typos, etc. Because of their low frequency, semantic relations for these words cannot be identified confidently. Removing such vertices therefore reduces the size of the graph significantly without a performance penalty (the graph generated from five journal volumes contained ca. 300,000 vertices, and 52,191 after frequency filtering with θ = 8). Experimental results even show a slightly better performance on filtered graphs. To preserve the semantic consistency of the graph and compensate for the removal of existing paths, the connections between the predecessors and successors of removed vertices have to be taken into account: the edge length e(p, s) between a predecessor p and a successor s of the removed vertex r can incorporate the length of the path l_prs = length(p, r, s) from p to s through r by calculating the halved harmonic mean:

e(p, s) = (e(p, s) · l_prs) / (e(p, s) + l_prs)

e(p, s) is reduced the more, the smaller length(p, r, s) is; if the two are equal, e(p, s) is half as long after merging. Besides direct edges, an important indication of semantic closeness is the distance, i.e. the length of the shortest path between two vertices. Distances are calculated by the Dijkstra algorithm with an upper threshold Θ. Once the distances from a certain vertex reach the threshold, the calculation for this vertex is aborted and the distances not calculated are considered infinite. Using the threshold reduces the runtime and space considerably, while the semantic relation between vertices with distances > Θ is negligible. The values of M, θ and Θ depend on the particular text corpus and are chosen to keep the size of the graph feasible. θ can be determined experimentally by incrementing it as long as the results on the test set improve. The resulting graph generated from five computer journal volumes with M = 220, θ = 8, Θ = 60000 contained 52,191 vertices, 4,927,365 edges and 376,000,000 distances.
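The construction steps above (edge-length updates, low-frequency pruning with harmonic-mean merging, and thresholded Dijkstra) can be summarised in a short sketch. The following Python fragment is an illustration only, not the authors' implementation; the class and method names (LexicalGraph, add_edge, remove_rare_vertices, distances) are invented for exposition, and the default constants simply mirror the values reported above.

```python
import heapq
from collections import defaultdict

class LexicalGraph:
    """Directed graph whose edge lengths shrink with co-occurrence frequency."""

    def __init__(self, max_len=220):
        self.M = float(max_len)            # maximum (initial) edge length M
        self.edges = defaultdict(dict)     # edges[u][v] = current length

    def add_edge(self, u, v):
        # First co-occurrence gets length M; the n-th shrinks it to roughly M/n.
        if v not in self.edges[u]:
            self.edges[u][v] = self.M
        else:
            cur = self.edges[u][v]
            self.edges[u][v] = self.M / (self.M / cur + 1.0)

    def remove_rare_vertices(self, freq, theta=8):
        # Drop vertices occurring <= theta times; reconnect predecessor and
        # successor with the halved harmonic mean of old edge and detour path.
        rare = {w for w, f in freq.items() if f <= theta}
        for r in rare:
            preds = [p for p in self.edges if r in self.edges[p] and p != r]
            for p in preds:
                for s, e_rs in self.edges[r].items():
                    if s == r or s in rare:
                        continue
                    l_prs = self.edges[p][r] + e_rs
                    old = self.edges[p].get(s)
                    self.edges[p][s] = (old * l_prs / (old + l_prs)
                                        if old is not None else l_prs)
                del self.edges[p][r]
            self.edges.pop(r, None)

    def distances(self, source, big_theta=60000):
        # Dijkstra with an upper threshold; unreached vertices count as infinite.
        dist = {source: 0.0}
        heap = [(0.0, source)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist.get(u, float("inf")) or d > big_theta:
                continue
            for v, w in self.edges[u].items():
                nd = d + w
                if nd <= big_theta and nd < dist.get(v, float("inf")):
                    dist[v] = nd
                    heapq.heappush(heap, (nd, v))
        return dist
```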
4 Identification of synonyms
The lexical graph is conceived as an instrument to identify semantic relations such as synonymy and hypernymy between lexical items represented by its vertices. The main
focus of our research was finding synonyms, although some results can be transferred immediately to the identification of hyponyms. To provide a quantitative measure of synonymy, different similarity metrics were defined on the lexical graph. Given a word, the system uses the metric to calculate the closest vertices to the vertex that represents this word. The result is a ranked list of words sorted by the degree of synonymy in descending order. Every metric sim is normalized to be a probability measure, so that given a vertex vi the value sim(vi, vj) can be interpreted as the probability of vj being a synonym of vi. The normalization is performed for each metric sim by the following functions:

n_min(sim(vi, vj)) = min(sim(vi, v1), ..., sim(vi, vn)) / sim(vi, vj)
for metrics that indicate maximum similarity to a vertex vi by a minimum value, and

n_max(sim(vi, vj)) = sim(vi, vj) / max(sim(vi, v1), ..., sim(vi, vn))

for metrics that indicate maximum similarity to a vertex vi by a maximum value, where v1, ..., vn are the vertices of the graph. In both cases the top-ranked word has the maximum likelihood of 1 of being a synonym of vi. The normalized ranked lists are used for the comparison of different metrics and the evaluation of the approach (see sec. 5). A similarity metric is supposed to assess the semantic similarity between two vertices of the lexical graph. Since the distance metric DistanceM used for the calculation of distances between the vertices in the graph indicates how semantically related two vertices are, it can be used as a similarity metric. As the graph is directed, the distance metric is asymmetric, i.e. the distance from vj to vi does not have to be equal to the distance from vi to vj. The major drawback of DistanceM is that it takes into account only one path between the examined vertices. Even though the shortest path indicates a strong semantic relation between the vertices, it is not sufficient to conclude synonymy, which presupposes similar word senses. Therefore more evidence for a strong semantic relation, with the particular aspect of similar word senses, should be incorporated in the similarity metric. The property neighbors of a vertex vi (adjacent vertices connected with vi by a property edge) play a significant role in characterizing similar senses. If two terms share many characteristic properties, there is strong evidence of their synonymy. A shared property can be regarded as a witness of the similarity of two word senses. There are other potential witnesses, e.g. transitive verbs shared by their direct objects; however, we restricted this investigation to the property neighbors as the most reliable witnesses. A simple method to incorporate the concept of witnesses into the metric is to determine the number of common property neighbors:

NaivePropM(vi, vj) = |prop(vi) ∩ prop(vj)|, where prop(vi) = {vk | e(i, k) is a property edge}

This method disregards, however, the different degrees of correlation between the vertices and their property neighbors, which are reflected by the lengths of the property edges. A property is the more significant, the stronger the correlation between the property and the vertex is, that is, the shorter the property edge is. The degree of synonymy of two terms therefore depends on the number of common properties and on the lengths of the paths between these terms leading through the properties. Analogously to an electric circuit, one can see the single paths through different shared properties as channels in a parallel connection and the path lengths as "synonymy resistances". Since a larger number of channels and smaller single resistances decrease the total resistance (i.e. the evidence of synonymy increases), the idea of the WeiPropM metric is to determine the similarity value analogously to the total resistance in a parallel connection:
WeiPropM(vi, vj) = ( Σ_{k=1}^{n} 1 / length(vi, pk, vj) )^(-1)
where length(vi, pk, vj) = e(vi, pk) + e(pk, vj) is the length of the path from vi to vj through pk, and pk ∈ prop(vi) ∩ prop(vj). Another useful observation is that some properties are more valuable witnesses than others. There are very general properties that are shared by many different terms and
some properties that are characteristic only for certain word senses. Thus the number of property neighbors of a property can be regarded as a measure of its quality (in the sense of characterizing the specific word meaning). WeiPropM integrates the quality of a property by weighting the paths leading through it by the number of its property neighbors:
WeiPropM(vi, vj) = ( Σ_{k=1}^{n} 1 / ((e(vi, pk) + e(pk, vj)) · |prop(pk)|) )^(-1)

where pk ∈ prop(vi) ∩ prop(vj). WeiPropM measures the correlation between two terms based on the path lengths. Frequently occurring words tend to be ranked higher because the property edge lengths indirectly depend on the absolute word frequency: due to the high absolute frequency of such words, the frequency of their co-occurrence with different properties is generally also higher, and the property edges are shorter. To compensate for this deficiency (i.e. to eliminate the bias discussed in (Weeds et al., 2004)), the edge length from a property to the ranked term, e(pk, vj), is weighted by the square root of the term's absolute frequency freq(vj). Using the weighted edge length between the property and the ranked term, we can no longer calculate the path length between vi and vj as the sum length(vi, pk, vj) = e(vi, pk) + e(pk, vj) · √freq(vj), because the multiplied second component significantly outweighs the first summand. A relative path length can be used instead, in which both components are adequately taken into account and added relative to the minimum of the respective component: let min1 be the minimum of e(vi, pk) over pk ∈ prop(vi), and min2 the minimum of e(pk, vj) · √freq(vj) over pk ∈ prop(vi) ∩ prop(vj). The relative path length is then e(vi, pk)/min1 + (e(pk, vj) · √freq(vj))/min2. Further experimental observation suggests that when searching for synonyms of vi, the connection between vi and the property is more significant than the second component of the path, the connection between the property and the ranked term vj. Therefore, when calculating the relative path length, the first component has to be weighted more strongly (the examined ratio was 2:1). The corresponding metric can be defined as follows:

FirstCompM(vi, vj) = ( Σ_{k=1}^{n} 1 / (RelPathLength(k) · √|prop(pk)|) )^(-1)

where RelPathLength(k) = (2/3) · e(vi, pk)/min1 + (1/3) · (e(pk, vj) · √freq(vj))/min2 and pk ∈ prop(vi) ∩ prop(vj).
As opposed to NaivePropM and WeiPropM, FirstCompM is not symmetric because of the emphasis on the first component.
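To make the relationship between the three property-based metrics concrete, the sketch below computes NaivePropM, WeiPropM and FirstCompM for one vertex pair. It is a simplified rendering under stated assumptions, not the authors' code: e_i maps each property neighbour p of vi to the edge length e(vi, p), e_j maps p to e(p, vj), freq_j is the absolute frequency of vj, and prop_size[p] is |prop(p)|.

```python
import math

def naive_prop(prop_i, prop_j):
    """NaivePropM: number of shared property neighbours."""
    return len(set(prop_i) & set(prop_j))

def wei_prop(e_i, e_j, prop_size=None):
    """WeiPropM: inverse of the summed inverse path lengths through shared
    properties (total resistance of a parallel connection). If prop_size is
    given, each path is additionally weighted by |prop(p_k)|."""
    shared = set(e_i) & set(e_j)
    if not shared:
        return float("inf")          # no shared witnesses: maximal 'resistance'
    total = 0.0
    for p in shared:
        length = e_i[p] + e_j[p]
        if prop_size is not None:
            length *= prop_size[p]
        total += 1.0 / length
    return 1.0 / total

def first_comp(e_i, e_j, freq_j, prop_size):
    """FirstCompM: relative path lengths with a 2:1 emphasis on the first
    component and a sqrt(freq) penalty on the second; asymmetric in (vi, vj)."""
    shared = set(e_i) & set(e_j)
    if not shared:
        return float("inf")
    min1 = min(e_i.values())
    min2 = min(e_j[p] * math.sqrt(freq_j) for p in shared)
    total = 0.0
    for p in shared:
        rel = (2.0 / 3.0) * e_i[p] / min1 \
            + (1.0 / 3.0) * e_j[p] * math.sqrt(freq_j) / min2
        total += 1.0 / (rel * math.sqrt(prop_size[p]))
    return 1.0 / total
```

Since smaller values mean stronger evidence of synonymy here, these scores are of the kind that would be normalized with n_min above.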
5 Experiments
For evaluation purposes, a test corpus of 200 synonym sets was prepared by consulting (OpenThesaurus, 2005). The corpus consists of 75 everyday words (e.g. "Präsident" (president), "Eingang" (entrance), "Gruppe" (group)), 60 abstract terms (e.g. "Ursache" (reason), "Element", "Merkmal" (feature)) and 65 domain-specific words (e.g. "Software", "Prozessor" (CPU)). The evaluation strategy is similar to that pursued in (Curran and Moens, 2002). The similarity metrics do not distinguish between different word senses, returning synonyms of all senses of a polysemous word in a single ranked list. Therefore the synonym set of a word in the test corpus is the union of the synonym sets of its senses. To provide a measure of overall performance and to compare the different metrics, a similarity score function (SimS) was defined that assigns a score to a metric for correctly found synonyms among the 25 top-ranked words. The function assigns 25 points to a correctly found top-ranked synonym of vi (SimS(0, vi) = 25) and 1 point to a synonym at rank 25 (SimS(25, vi) = 1). The rank of a synonym is decreased only by false positives that are ranked higher (i.e. each of the correctly identified top n synonyms has rank 0). In order to reward top-ranked synonyms more strongly, the scoring function features a hyperbolic descent. For a synonym of vi with rank x:
SimS(x, vi) = 0, if the word at rank x ∉ synset(vi);
SimS(x, vi) = 24·√26 / ((√26 − 1)·√(x + 1)) + 1 − 24/(√26 − 1), otherwise.
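For concreteness, the scoring scheme can be transcribed directly; the snippet below is an illustrative rendering of the formulas above (the function names and the ranked-list representation are assumptions, not the paper's code).

```python
import math

def sims(rank, in_synset):
    """Hyperbolically decaying score: 25 points at rank 0, 1 point at rank 25."""
    if not in_synset:
        return 0.0
    r26 = math.sqrt(26.0)
    return 24.0 * r26 / ((r26 - 1.0) * math.sqrt(rank + 1)) + 1.0 - 24.0 / (r26 - 1.0)

def total_score(ranked_lists, synsets):
    """Sum SimS over the top 25 candidates of every test word; the rank of a
    candidate counts only the false positives ranked above it."""
    score = 0.0
    for word, candidates in ranked_lists.items():
        gold = synsets[word]
        false_positives = 0
        for cand in candidates[:25]:
            hit = cand in gold
            score += sims(false_positives, hit)
            if not hit:
                false_positives += 1
    return score

# sanity check of the two endpoints of the scoring function
assert abs(sims(0, True) - 25.0) < 1e-9 and abs(sims(25, True) - 1.0) < 1e-9
```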
To compare the performance of different metrics, the SimS values of the top 25 words in the ranked list were summed for each word of the test corpus. The total score of a similarity metric Sim is

Σ_{i=1}^{200} Σ_{j=1}^{25} SimS(rank(RankedList(vi, j)), vi)

where RankedList(vi, j) returns the word at position j of the ranked list produced by Sim for vi, and v1, ..., v200 are the words of the test corpus. In addition, a combined precision and recall measure Π was used to evaluate the ranked lists. Given the word vi, we examined whether the first n words (n = 1, 5, 25, 100) of the ranked list returned by a similarity metric for vi belong to synset(vi) of the test corpus. Π(n) measures precision if n is less than the size of synset(vi), because maximum recall cannot be reached for such n, and recall otherwise, because maximum precision cannot be reached for n > |synset(vi)|. The Π values were averaged over the 200 words. Table 1 presents the results of evaluating the similarity metrics introduced in sec. 4. The results of DistanceM confirm that regarding the distance between two vertices alone is not sufficient to conclude their synonymy. DistanceM finds many related terms, ranking general words with many outgoing and incoming edges higher, but it lacks the features providing the particular evidence of synonymy. NaivePropM is clearly outperformed by both weighted metrics. The improvement relative to DistanceM and the acceptable precision of the top-ranked synonyms (Π(1)) show that considering shared properties is an adequate approach to the recognition of synonyms. Ignoring the strength of the semantic relation indicated by the graph and the quality of properties is the reason for the big gap in the total score and the recall value (Π(100)). Both weighted metrics achieved results comparable with those reported by Curran and Moens (2002) and Turney (2001). The best results of FirstCompM confirm that the criteria identified in sec. 4, such as the generality of a property and abstraction from the absolute word frequency, are relevant for the identification of synonyms. FirstCompM performed particularly better at finding synonyms with a low frequency of occur-
rence. In another set of experiments we investigated the influence of the size of the text corpus (cf. fig. 3). The plausible assumption is that the more text is processed, the better the semantic connections between terms are reflected by the graph, and the more promising the expected results. The fact that the number of vertices does not grow proportionally to the size of the text corpus can be explained by word recurrence and the growing filtering threshold θ. However, the number of edges increases linearly and reflects the improving semantic coverage. As expected, every metric performs considerably better on bigger graphs. While NaivePropM seems to converge after three volumes, both weighted metrics increase strictly monotonically. Hence an improvement of results can be expected on bigger corpora. On the small text corpora the results of the single metrics do not differ significantly, since there is not sufficient semantic information captured by the graph, i.e. the edge and path lengths do not yet fully reflect the semantic relations between the words. The scores of both weighted metrics grow much faster, though, than that of NaivePropM. FirstCompM achieves the highest gradient, demonstrating the biggest potential for leveraging the growing graph to find synonymy.
Figure 3: Influence of the size of the text corpus.

To examine the influence of the word categories, results on the subsets of the test corpus corresponding to each category are compared. All metrics show similar behavior; we therefore restrict the analysis to the Π values of FirstCompM (fig. 4). Synonyms of domain-specific words are recognized better than those of abstract and everyday words: their semantics are better reflected by the technically oriented texts. The Π values for abstract and everyday words are quite similar, except for the high precision of the top-ranked abstract synonyms. Everyday words suffer from the fact that their properties are often too general to characterize them uniquely, which entails a loss of precision. Abstract words can be extremely polysemous and have many subtle aspects that are not sufficiently covered by the texts of computer journals.

Metric    DistanceM   NaivePropM   WeiPropM   FirstCompM
Score     2990.7      6546.3       9411.7     11848
Π(1)      0.20        0.415        0.54       0.575
Π(5)      0.208       0.252        0.351      0.412
Π(25)     0.199       0.271        0.398      0.472
Π(100)    0.38        0.440        0.607      0.637

Table 1: Results of the different metrics on the test corpus
Figure 4: Dependency of Π(n) on word category (results of the FirstCompM metric)

To test whether the metrics perform better for more frequent words, the test set was divided into 9 disjoint frequency clusters (table 2). FirstCompM achieved considerably better results for very frequently occurring words (≥ 4000 occurrences). This indirectly confirms the better results on the bigger text corpora: while low frequency does not exclude random influence, frequent occurrence ensures adequate capture of the word semantics in the graph by inserting and adjusting all relevant property edges. These results do not contradict the conclusion that FirstCompM is not biased towards words of a certain frequency, because the bias mentioned above pertains to the retrieval of synonyms with a certain frequency, whereas in this experiment the performance for different frequencies of the queried words is compared.

Frequency    Words/cluster   Aver. score   Π(1)    Π(5)    Π(25)   Π(100)
9-249        27              53.23         0.556   0.381   0.447   0.561
250-499      25              51.52         0.52    0.392   0.432   0.645
500-999      44              45.80         0.432   0.342   0.446   0.618
1000-1499    30              60.75         0.567   0.395   0.494   0.610
1500-2499    27              56.51         0.667   0.393   0.474   0.690
2500-3999    15              58.75         0.667   0.413   0.419   0.623
4000-5499    11              97.21         0.818   0.600   0.531   0.705
5500-7499    8               106.11        0.75    0.675   0.550   0.642
>7500        13              73.85         0.615   0.503   0.600   0.748

Table 2: Influence of word frequency on the results of the FirstCompM metric
6 Conclusion
We have introduced the lexical graph as an instrument for finding semantic relations between lexical items in natural language corpora. The big advantage of the graph in comparison to other context models is that it captures not only the immediate context but also establishes many transitive connections between related terms. We have verified its effectiveness by searching for synonymy. Different metrics have been defined based on shortest path lengths and shared properties. The similarity metric FirstCompM, which best leverages the graph structure, achieved the best results, confirming the significant role of the number of shared properties, the frequency of their co-occurrence and the degree of their generality for detecting synonymy. The significantly improving results for bigger text corpora and more frequently occurring words are encouraging and promising for the detection of other semantic relations. New methods that make greater use of the graph structure, e.g. considering the lengths and number of short paths between two terms or extending the witness concept to other morphological types, are the subject of further research.
Acknowledgements I would like to thank Heiko Kahmann for his valuable assistance in the implementation and evaluation of the approach. This research is supported by a NaFöG scholarship of the federal state of Berlin.
References

Vincent D. Blondel, Anahí Gajardo, Maureen Heymans, Pierre Senellart, and Paul Van Dooren. 2004. A measure of similarity between graph vertices, with applications to synonym extraction and web searching. SIAM Review, pages 647–666.
Hsinchun Chen and Kevin J. Lynch. 1992. Automatic construction of networks of concepts characterizing document databases. IEEE Transactions on Systems, Man and Cybernetics, 22(5):885–902.
James R. Curran and Marc Moens. 2002. Improvements in automatic thesaurus extraction. In Proceedings of the Workshop of the ACL Special Interest Group on the Lexicon (SIGLEX), pages 59–66. Association for Computational Linguistics.
M. A. Hearst. 1998. Automated discovery of WordNet relations. In C. Fellbaum, editor, WordNet: An Electronic Lexical Database, pages 131–151. MIT Press, Cambridge, MA.
Mathieu Lafourcade and Violaine Prince. 2001. Relative synonymy and conceptual vectors. In Proceedings of the NLPRS, Tokyo, Japan.
Mathieu Lafourcade. 2001. Lexical sorting and lexical transfer by conceptual vectors. In Proceedings of the First International Workshop on MultiMedia Annotation, Tokyo, Japan.
Dekang Lin. 1998. An information-theoretic definition of similarity. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 296–304, Madison, WI.
Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA.
OpenThesaurus. 2005. OpenThesaurus - Deutscher Thesaurus. http://www.openthesaurus.de.
William Phillips and Ellen Riloff. 2002. Exploiting strong syntactic heuristics and co-training to learn semantic lexicons. In Proceedings of the 2002 Conference on Empirical Methods in NLP.
Gerda Ruge. 1997. Automatic detection of thesaurus relations for information retrieval applications. Foundations of Computer Science: Potential - Theory - Cognition, LNCS 1337:499–506.
Pierre P. Senellart and Vincent D. Blondel. 2003. Automatic discovery of similar words. In Michael Berry, editor, Survey of Text Mining: Clustering, Classification, and Retrieval, pages 25–44. Springer Verlag, Berlin.
TreeTagger. 2004. http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/.
Peter D. Turney. 2001. Mining the Web for synonyms: PMI–IR versus LSA on TOEFL. Lecture Notes in Computer Science, 2167:491–502.
Julie Weeds, David Weir, and Diana McCarthy. 2004. Characterising measures of lexical distributional similarity.
WordNet. 2005. http://wordnet.princeton.edu/w3wn.html.
Spotting the ‘Odd-one-out’: Data-Driven Error Detection and Correction in Textual Databases Caroline Sporleder, Marieke van Erp, Tijn Porcelijn and Antal van den Bosch ILK / Language and Information Science Tilburg University, P.O. Box 90153, 5000 LE Tilburg, The Netherlands {C.Sporleder,M.G.J.vanErp,M.Porcelijn,Antal.vdnBosch}@uvt.nl
Abstract We present two methods for semi-automatic detection and correction of errors in textual databases. The first method (horizontal correction) aims at correcting inconsistent values within a database record, while the second (vertical correction) focuses on values which were entered in the wrong column. Both methods are data-driven and language-independent. We utilise supervised machine learning, but the training data is obtained automatically from the database; no manual annotation is required. Our experiments show that a significant proportion of errors can be detected by the two methods. Furthermore, both methods were found to lead to a precision that is high enough to make semi-automatic error correction feasible.
1 Introduction
Over the last decades, more and more information has become available in digital form; a major part of this information is textual. While some textual information is stored in raw or typeset form (i.e., as more or less flat text), a lot is semistructured in databases. A popular example of a textual database is Amazon’s book database,1 which contains fields for “author”, “title”, “publisher”, “summary” etc. Information about collections in the cultural heritage domain is also frequently stored in (semi-)textual databases. Examples of publicly accessible databases of this type are the University of St. Andrews’s photographic 1
1 http://www.amazon.com
collection2 or the Nederlands Soortenregister.3 Such databases are an important resource for researchers in the field, especially if the contents can be systematically searched and queried. However, information retrieval from databases can be adversely affected by errors and inconsistencies in the data. For example, a zoologist interested in finding out about the different biotopes (i.e., habitats) in which a given species was found, might query a zoological specimens database for the content of the BIOTOPE column for all specimens of that species. Whenever information about the biotope was entered in the wrong column, that particular record will not be retrieved by such a query. Similarly, if an entry erroneously lists the wrong species, it will also not be retrieved. Usually it is impossible to avoid errors completely, even in well maintained databases. Errors can arise for a variety of reasons, ranging from technical limitations (e.g., copy-and-paste errors) to different interpretations of what type of information should be entered into different database fields. The latter situation is especially prevalent if the database is maintained by several people. Manual identification and correction of errors is frequently infeasible due to the size of the database. A more realistic approach would be to use automatic means to identify potential errors; these could then be flagged and presented to a human expert, and subsequently corrected manually or semi-automatically. Error detection and correction can be performed as a pre-processing step for information extraction from databases, or it can be interleaved with it. In this paper, we explore whether it is possi2
2 http://special.st-andrews.ac.uk/saspecial/
3 http://www.nederlandsesoorten.nl
ble to detect and correct potential errors in textual databases by applying data-driven clean-up methods which are able to work in the absence of background knowledge (e.g., knowledge about the domain or the structure of the database) and instead rely on the data itself to discover inconsistencies and errors. Ideally, error detection should also be language independent, i.e., require no or few language specific tools, such as part-of-speech taggers or chunkers. Aiming for language independence is motivated by the observation that many databases, especially in the cultural heritage domain, are multi-lingual and contain strings of text in various languages. If textual data-cleaning methods are to be useful for such databases, they should ideally be able to process all text strings, not only those in the majority language. While there has been a significant amount of previous research on identifying and correcting errors in data sets, most methods are not particularly suitable for textual databases (see Section 2). We present two methods which are. Both methods are data-driven and knowledge-lean; errors are identified through comparisons with other database fields. We utilise supervised machine learning, but the training data is derived directly from the database, i.e., no manual annotation of data is necessary. In the first method, the database fields of individual entries are compared, and improbable combinations are flagged as potential errors. Because the focus is on individual entries, i.e., rows in the database, we call this horizontal error correction. The second method aims at a different type of error, namely values which were entered in the wrong column of the database. Potential errors of this type are determined by comparing the content of a database cell to (the cells of) all database columns and determining which column it fits best. Because the focus is on columns, we refer to this method as vertical error correction.
2 Related Work
There is a considerable body of previous work on the generic issue of data cleaning. Much of the research directed specifically at databases focuses on identifying identical records when two databases are merged (Hernández and Stolfo, 1998; Galhardas et al., 1999). This is a non-trivial problem as records of the same objects coming from different sources typically differ in their primary keys. There may also be subtle differences
in other database fields. For example, names may be entered in different formats (e.g., John Smith vs. Smith, J.) or there may be typos which make it difficult to match fields (e.g., John Smith vs. Jon Smith).4 In a wider context, a lot of research has been dedicated to the identification of outliers in datasets. Various strategies have been proposed. The earliest work uses probability distributions to model the data; all instances which deviate too much from the distributions are flagged as outliers (Hawkins, 1980). This approach is called distribution-based. In clustering-based methods, a clustering algorithm is applied to the data and instances which cannot be grouped under any cluster, or clusters which only contain very few instances are assumed to be outliers (e.g., Jiang et al. (2001)). Depth-based methods (e.g., Ruts and Rousseeuw (1996)) use some definition of depth to organise instances in layers in the data space; outliers are assumed to occupy shallow layers. Distance-based methods (Knorr and Ng, 1998) utilise a k-nearest neighbour approach where outliers are defined, for example, as those instances whose distance to their nearest neighbour exceeds a certain threshold. Finally, Marcus and Maletic (2000) propose a method which learns association rules for the data; records that do not conform to any rules are then assumed to be potential outliers. In principle, techniques developed to detect outliers can be applied to databases as well, for instance to identify cell values that are exceptional in the context of other values in a given column, or to identify database entries that seem unlikely compared to other entries. However, most methods are not particularly suited for textual databases. Some approaches only work with numeric data (e.g., distribution-based methods), others can deal with categorical data (e.g., distance-based methods) but treat all database fields as atoms. For databases with free text fields it can be fruitful to look at individual tokens within a text string. For instance, units of measurement (m, ft, etc.) may be very common in one column (such as ALTITUDE) but may indicate an error when they occur in another column (such as COLLECTOR).
4 The problem of whether two proper noun phrases refer to the same entity has also received attention outside the database community (Bagga, 1998).
3 Data
We tested our error correction methods on a database containing information about animal specimens collected by researchers at Naturalis, the Dutch Natural History Museum.5 The database contains 16,870 entries and 35 columns. Each entry provides information about one or several specimens, for example, who collected it, where and when it was found, its position in the zoological taxonomy, the publication which first described and classified the specimen, and so on. Some columns contain fairly free text (e.g., SPECIAL REMARKS), others contain textual content6 of a specific type and in a relatively fixed format, such as proper names (e.g., COLLECTOR or LOCATION), bibliographical information (PUBLICATION), dates (e.g., COLLECTION DATE) or numbers (e.g., REGISTRATION NUMBER). Some database cells are left unfilled; just under 40% of all cells are filled (i.e., 229,430 cells). There is a relatively large variance in the number of different values in each column, ranging from three for CLASS (i.e., Reptilia, Amphibia, and a remark pointing to a taxonomic inconsistency in the entry) to over 2,000 for SPECIAL REMARKS, which is only filled for a minority of the entries. On the other hand, there is also some repetition of cell contents, even for the free text columns, which often contain formulaic expressions. For example, the strings no further data available or (found) dead on road occur repeatedly in the special remarks field. A certain amount of repetition is characteristic for many textual databases, and we exploit this in our error correction methods. While most of the entries are in Dutch or English, the database also contains text strings in several other languages, such as Portuguese or French (and Latin for the taxonomic names). In principle, there is no limit to which languages can occur in the database. For example, the PUBLICATION column often contains text strings (e.g., the title of the publication) in languages other than Dutch or English.
4 Horizontal Error Correction
5 http://www.naturalis.nl
6 We use the term textual content in the widest possible sense, i.e., comprising all character strings, including dates and numbers.

The different fields in a database are often not statistically independent; i.e., for a given entry,
the likelihood of a particular value in one field may be dependent on the values in (some of) the other fields. In our database, for example, there is an interdependency between the LOCATION and the COUNTRY columns: the probability that the COUNTRY column contains the value South Africa increases if the LOCATION column contains the string Tafel Mountain (and vice versa). Similar interdependencies hold between other columns, such as LOCATION and ALTITUDE, or COUNTRY and BIOTOPE, or between the columns encoding a specimen’s position in the zoological taxonomy (e.g., SPECIES and FAMILY). Given enough data, many of these interdependencies can be determined automatically and exploited to identify field values that are likely to be erroneous. This idea bears some similarity to the approach by Marcus and Maletic (2000) who infer association rules for a data set and then look for outliers relative to these rules. However, we do not explicitly infer rules. Instead, we trained TiMBL (Daelemans et al., 2004), a memory-based learner, to predict the value of a field given the values of other fields for the entry. If the predicted value differs from the original value, it is signalled as a potential error to a human annotator. We applied the method to the taxonomic fields (CLASS, ORDER, FAMILY, GENUS, SPECIES and SUB - SPECIES ), because it is possible, albeit somewhat time-consuming, for a non-expert to check the values of these fields against a published zoological taxonomy. We split the data into 80% training set, 10% development set and 10% test set. As not all taxonomic fields are filled for all entries, the exact sizes for each data set differ, depending on which field is to be predicted (see Table 1). We used the development data to set TiMBL’s parameters, such as the number of nearest neighbours to be taken into account or the similarity metric (van den Bosch, 2004). Ideally, one would want to choose the setting which optimised the error detection accuracy. However, this would require manual annotation of the errors in the development set. As this is fairly time consuming, we abstained from it. Instead we chose the parameter setting which maximised the value prediction accuracy for each taxonomic field, i.e. the setting for which the disagreement between the values predicted by TiMBL and the values in the database was smallest. The motivation for this was that a high prediction accuracy will minimise the num-
ber of potential errors that get flagged (i.e., disagreements between TiMBL and the database) and thus, hopefully, lead to a higher error detection precision, i.e., less work for the human annotator who has to check the potential errors.
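The horizontal method can be pictured with a short sketch. This is not the authors' setup, which uses TiMBL with optimised parameters and feature selection; the fragment below substitutes a k-NN classifier from scikit-learn as a stand-in for the memory-based learner, and the record/field representation is an assumption made for illustration.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OneHotEncoder

def flag_horizontal_errors(train_records, check_records, target, feature_fields, k=3):
    """Predict `target` from the other fields and flag disagreements.
    Records are dicts; empty fields are encoded as ''.  Returns a list of
    (record, stored_value, predicted_value) triples for a human annotator."""
    def featurise(recs):
        return [[r.get(f, "") for f in feature_fields] for r in recs]

    train = [r for r in train_records if r.get(target)]
    check = [r for r in check_records if r.get(target)]
    enc = OneHotEncoder(handle_unknown="ignore")
    X_train = enc.fit_transform(featurise(train))
    y_train = [r[target] for r in train]
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)

    preds = clf.predict(enc.transform(featurise(check)))
    return [(r, r[target], p) for r, p in zip(check, preds) if p != r[target]]
```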
are achieved; this applies even to the “free text” fields like SPECIAL REMARKS.
CLASS ORDER
CLASS ORDER FAMILY GENUS SPECIES SUB - SPECIES
training 7,495 7,493 7,425 7,891 7,873 1,949
devel. 937 937 928 986 984 243
test 937 937 928 986 984 243
Table 1: Data set sizes for taxonomic fields We also used the development data to perform some feature selection. We compared (i) using the values of all other fields (for a given entry) as features and (ii) only using the other taxonomic fields plus the author field, which encodes which taxonomist first described the species to which a given specimen belongs.7 The reduced feature set was found to lead to better or equal performance for all taxonomic fields and was thus used in the experiments reported below. For each taxonomic field, we then trained TiMBL on the training set and applied it to the test set, using the optimised parameter settings. Table 2 shows the value prediction accuracies for each taxonomic field and the accuracies achieved by two baseline classifiers: (i) randomly selecting a value from the values found in the training set (random) and (ii) always predicting the (training set) majority value (majority). The prediction accuracies are relatively high, even for the lowest fields in the taxonomy, SPECIES and SUB SPECIES , which should be the most difficult to predict. Hence it is in principle possible to predict the value of a taxonomic field from the values of other fields in the database. To determine whether the taxonomic fields are exceptional in this respect, we also tested how well non-taxonomic fields can be predicted. We found that all fields can be predicted with a relatively high accuracy. The lowest accuracy (63%) is obtained for the BIOTOPE field. For most fields, accuracies of around 70% 7 The author information provides useful cues for the prediction of taxonomic fields because taxonomists often specialise on a particular zoological group. For example, a taxonomist who specialises on Ranidae (frogs) is unlikely to have published a description of a species belonging to Serpentes (snakes).
FAMILY GENUS SPECIES SUB - SPECIES
TiMBL 99.87% 98.29% 98.02% 92.57% 89.93% 95.03%
random 50.00% 1.92% 0.35% 10.00% 0.20% 0.98%
majority 54.98% 18.59% 10.13% 44.76% 7.67% 21.35%
Table 2: Test set prediction accuracies for taxonomic field values (horizontal method) To determine whether this method is suitable for semi-automatic error correction, we looked at the cases in which the value predicted by TiMBL differed from the original value. There are three potential reasons for such a disagreement: (i) the value predicted by TiMBL is wrong, (ii) the value predicted by TiMBL is correct and the original value in the database is wrong, and (iii) both values are correct and the two terms are (zoological) synonyms. For the fields CLASS, ORDER, FAM ILY and GENUS , we checked the values predicted by TiMBL against two published zoological taxonomies8 and counted how many times the predicted value was the correct value. We did not check the two lowest fields (SUB SPECIES and SPECIES ), as the correct values for these fields can only be determined reliably by looking at the specimens themselves, not by looking at the other taxonomic values for an entry. For the evaluation, we focused on error correction rather than error detection, hence cases where both the value predicted by TiMBL and the original value in the database were wrong, were counted as TiMBL errors. Table 3 shows the results (the absolute numbers of database errors, synonyms and TiMBL errors are shown in brackets). It can be seen that TiMBL detects several errors in the database and predicts the correct values for them. It also finds several synonyms. For GENUS, however, the vast majority of disagreements between TiMBL and the database is due to TiMBL errors. This can be explained by the fact that GENUS is relatively low in the taxonomy (directly above SPECIES). As the values of higher fields only provide limited cues 8
We used the ITIS Catalogue of Life (http: //www.species2000.org/2005/search.php) and the EMBL Reptile Database (http://www. embl-heidelberg.de/˜uetz/LivingReptiles. html).
CLASS ORDER FAMILY GENUS
disagreements 2 26 33 135
database errors 50.00% (1) 38.00% (10) 9.09% (3) 5.93% (8)
synonyms 0% (0) 19.00% (5) 36.36% (12) 4.44% (6)
TiMBL errors 50.00% (1) 43.00% (11) 54.55% (18) 89.63% (121)
Table 3: Error correction precision (horizontal method) for the value of a lower field, the lower a field is in the taxonomy the more difficult it is to predict its value accurately. So far we have only looked at the precision of our error detection method (i.e., what proportion of flagged errors are real errors). Error detection recall (i.e., the proportion of real errors that is flagged) is often difficult to determine precisely because this would involve manually checking the dataset (or a significant subset) for errors, which is typically quite time-consuming. However, if errors are identified and corrected semiautomatically, recall is more important than precision; a low precision means more work for the human expert who is checking the potential errors, a low recall, however, means that many errors are not detected at all, which may severely limit the usefulness of the system. To estimate the recall obtained by the horizontal error detection method, we introduced errors artificially and determined what percentage of these artificial errors was detected. For each taxonomic field, we changed the value of 10% of the entries, which were randomly selected. In these entries, the original values were replaced by one of the other attested values for this field. The new value was selected randomly and with uniform probability for all values. Of course, this method can only provide an estimate of the true recall, as it is possible that real errors are distributed differently, e.g., some values may be more easily confused by humans than others. Table 4 shows the results. The estimated recall is fairly high; in all cases above 90%. This suggests that a significant proportion of the errors is detected by our method.
5 Vertical Error Correction
While the horizontal method described in the previous section aimed at correcting values which are inconsistent with the remaining fields of a database entry, vertical error correction is aimed at a different type of error, namely, text strings which were entered in the wrong column of the
CLASS ORDER FAMILY GENUS SPECIES SUB SPECIES
recall 95.56% 96.82% 96.15% 93.09% 96.75% 95.38%
Table 4: Recall for artificially introduced errors (horizontal method) database. For example, in our database, information about the biotope in which a specimen was found may have been entered in the SPECIAL RE MARKS column rather than the BIOTOPE column. Errors of this type are quite frequent. They can be accidental, i.e., the person entering the information inadvertently chose the wrong column, but they can also be due to misinterpretation, e.g., the person entering the information may believe that it fits the SPECIAL REMARKS column better than the BIOTOPE column or they may not know that there is a BIOTOPE column. Some of these errors may also stem from changes in the database structure itself, e.g., maybe the BIOTOPE column was only added after the data was entered.9 Identifying this type of error can be recast as a text classification task: given the content of a cell, i.e., a string of text, the aim is to determine which column the string most likely belongs to. Text strings which are classified as belonging to a different column than they are currently in, represent a potential error. Recasting error detection as a text classification problem allows the use of supervised machine learning methods, as training data (i.e., text strings labelled with the column they belong to) can easily be obtained from the database. We tokenised the text strings in all database fields10 and labelled them with the column they 9 Many databases, especially in the cultural heritage domain, are not designed and maintained by database experts. Over time, such database are likely to evolve and change structurally. In our specimens database, for example, several columns were only added at later stages. 10 We used a rule-based tokeniser for Dutch developed by
occur in. Each string was represented as a vector of 48 features, encoding the (i) string itself and some of its typographical properties (13 features), and (ii) its similarity with each of the 35 columns (in terms of weighted token overlap) (35 features). The typographical properties we encoded were: the number of tokens in the string and whether it contained an initial (i.e., an individual capitalised letter), a number, a unit of measurement (e.g., km), punctuation, an abbreviation, a word (as opposed to only numbers, punctuation etc.), a capitalised word, a non-capitalised word, a short word (< 4 characters), a long word, or a complex word (e.g., containing a hyphen). The similarity between a string, consisting of a set T of tokens t1 . . . tn , and a column colx was defined as: Pn ti × tf idfti ,colx sim(T, colx ) = i=1 |T | where tf idfti colx is the tfidf weight (term frequency - inverse document frequency, cf. (SparckJones, 1972)) of token ti in column colx . This weight encodes how representative a token is of a column. The term frequency, tfti ,colx , of a token ti in column colx is the number of occurrences of ti in colx divided by the number of occurrences of all tokens in colx . The term frequency is 0 if the token does not occur in the column. The inverse document frequency, idfti , of a token ti is the number of all columns in the database divided by the number of columns containing ti . Finally, the tfidf weight for a term ti in column colx is defined as: tf idfti ,colx = tfti ,colx log idfti A high tfidf weight for a given token in a given column means that the token frequently occurs in that column but rarely in other columns, thus the token is a good indicator for that column. Typically tfidf weights are only calculated for content words, however we calculated them for all tokens, partly because the use of stop word lists to filter out function words would have jeopardised the language independence of our method and partly because function words and even punctuation can be very useful for distinguishing different columns. For example, prepositions such as under often indicate BIOTOPE, as in under a stone. Sabine Buchholz. The inclusion of multi-lingual abbreviations in the rule set ensures that this tokeniser is robust enough to also cope with text strings in English and other Western European languages.
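A minimal sketch of the column-similarity feature described above is given below. It is illustrative only and not the authors' implementation: the function names are invented, the tf-idf weighting follows the definitions in the text, and in the real system this score is only one of 48 features fed to TiMBL rather than the classifier itself.

```python
import math
from collections import Counter

def build_tfidf(columns):
    """columns: dict mapping column name -> list of token lists (one per cell).
    Returns tfidf[col][token] with tf relative to the column and idf computed
    as (number of columns) / (number of columns containing the token)."""
    tf, df = {}, Counter()
    for col, cells in columns.items():
        counts = Counter(tok for cell in cells for tok in cell)
        total = sum(counts.values()) or 1
        tf[col] = {tok: c / total for tok, c in counts.items()}
        for tok in counts:
            df[tok] += 1
    n_cols = len(columns)
    return {col: {tok: w * math.log(n_cols / df[tok]) for tok, w in weights.items()}
            for col, weights in tf.items()}

def column_similarity(tokens, col, tfidf):
    """Average tf-idf weight of a cell's tokens in the given column."""
    if not tokens:
        return 0.0
    return sum(tfidf[col].get(tok, 0.0) for tok in tokens) / len(tokens)

def best_column(tokens, tfidf):
    """Column whose vocabulary the cell content fits best."""
    return max(tfidf, key=lambda col: column_similarity(tokens, col, tfidf))
```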
To assign a text string to one of the 35 database columns, we trained TiMBL (Daelemans et al., 2004) on the feature vectors of all other database cells labelled with the column they belong to.11 Cases where the predicted column differed from the current column of the string were recorded as potential errors. We applied the classifier to all filled database cells. For each of the strings identified as potential errors, we checked manually (i) whether this was a real error (i.e., error detection) and (ii) whether the column predicted by the classifier was the correct one (i.e., error correction). While checking for this type of error is much faster than checking for errors in the taxonomic fields, it is sometimes difficult to tell whether a flagged error is a real error. In some cases it is not obvious which column a string belongs to, for example because two columns are very similar in content (such as LO CATION and FINDING PLACE ), in other cases the content of a database field contains several pieces of information which would best be located in different columns. For instance, the string found with broken neck near Karlobag arguably could be split between the SPECIAL REMARKS and the LOCA TION columns. We were conservative in the first case, i.e., we did not count an error as correctly identified if the string could belong to the original column, but we gave the algorithm credit for flagging potential errors where part of the string should be in a different column. The results are shown in the second column (unfiltered) in Table 5. The classifier found 836 potential errors, 148 of these were found to be real errors. For 100 of the correctly identified errors the predicted column was the correct column. Some of the corrected errors can be found in Table 6. Note that the system corrected errors in both English and Dutch text strings without requiring language identification or any language-specific resources (apart from tokenisation). We also calculated the precision of error detection (i.e., the number of real errors divided by the number of flagged errors) and the error correction accuracy (i.e., the number of correctly corrected errors divided by the number correctly identified errors). The error detection precision is relatively low (17.70%). In general a low precision means relatively more work for the human expert check11 We used the default settings (IB1, Weighted Overlap Metric, Information Gain Ratio weighting) and k=3.
string | original column | corrected column
op boom ongeveer 2,5 m boven grond (on a tree about 2.5 m above ground) | SPECIAL REMARKS | BIOTOPE
25 km N.N.W Antalya | SPECIAL REMARKS | LOCATION
1700 M | BIOTOPE | ALTITUDE
gestorven in gevangenschap 23 september 1994 (died in captivity 23 September 1994) | LOCATION | SPECIAL REMARKS
roadside bordering secondary forest | LOCATION | BIOTOPE
Suriname Exp. 1970 (Surinam Expedition 1970) | COLLECTION NUMBER | COLLECTOR
Table 6: Examples of automatically corrected errors (vertical method)
                              unfiltered   filtered
flagged errors                836          262
real errors                   148          67
correctly corrected           100          54
precision error detection     17.70%       25.57%
accuracy error correction     67.57%       80.60%
Table 5: Results automatic error detection and correction for all database fields (vertical method) ing the flagged errors. However, note that the system considerably reduces the number of database fields that have to be checked (i.e., 836 out of 229,430 filled fields). We also found that, for this type of error, error checking can be done relatively quickly even by a non-expert; checking the 836 errors took less than 30 minutes. Furthermore, the correction accuracy is fairly high (67.57%), i.e., for most of the correctly identified errors the correct column is suggested. This means that for most errors the user can simply choose the column suggested by the classifier. In an attempt to increase the detection precision we applied two filters and only flagged errors which passed these filters. First, we filtered out potential errors if the original and the predicted column were of a similar type (e.g., if both contained person names or dates) as we noticed that our method was very prone to misclassifications in these cases.12 For example, if the name M.S. Hoogmoed occurs several times in the COLLEC TOR column and a few times in the DONATOR column, the latter cases are flagged by the system as potential errors. However, it is entirely normal for a person to occur in both the COLLECTOR and the DONATOR column. What is more, it is impossible 12
Note, that this filter requires a (very limited) amount of background knowledge, i.e. knowledge about which columns are of a similar type.
to determine on the basis of the text string M.S. Hoogmoed alone, whether the correct column for this string in a given entry is DONATOR or COL LECTOR or both.13 Secondly, we only flagged errors where the predicted column was empty for the current database entry. If the predicted column is already occupied, the string is unlikely to belong to that column (unless the string in that column is also an error). The third column in Table 5 (filtered) shows the results. It can be seen that detection precision increases to 25.57% and correction precision to 80.60%, however the system also finds noticeably fewer errors (67 vs. 148).
BIOTOPE PUBLICATION SPECIAL REMARKS
Prec. 20.09% 6.90% 16.11%
Rec. 94.00% 100.00% 24.00%
Table 7: Precision and Recall for three free text columns (vertical method) Estimating the error detection recall (i.e., the number of identified errors divided by the overall number of errors in the database) would involve manually identifying all the errors in the database. This was not feasible for the database as a whole. Instead we manually checked three of the free text columns, namely, BIOTOPE, PUB LICATION and SPECIAL REMARKS , for errors and calculated the recall and precision for these. Table 7 shows the results. For BIOTOPE and PUB LICATION the recall is relatively high (94% and 100%, respectively), for SPECIAL REMARKS it is much lower (24%). The low recall for SPECIAL REMARKS is probably due to the fact that this col13 Note, however, that the horizontal error detection method proposed in the previous section might detect an erroneous occurrence of this string (based on the values of other fields in the entry).
46
umn is very heterogeneous, thus it is fairly difficult to find the true errors in it. While the precision is relatively low for all three columns, the number of flagged errors (ranging from 58 for PUBLICA TION to 298 for SPECIAL REMARKS ) is still small enough for manual checking.
6
Conclusion
We have presented two methods for (semi-)automatic error detection and correction in textual databases. The two methods are aimed at different types of errors: horizontal error correction attempts to identify and correct inconsistent values within a database record; vertical error correction is aimed at values which were accidentally entered in the wrong column. Both methods are data-driven and require little or no background knowledge. The methods are also language-independent and can be applied to multi-lingual databases. While we utilise supervised machine learning, no manual annotation of training data is required, as the training set is obtained directly from the database. We tested the two methods on an animal specimens database and found that a significant proportion of errors could be detected: up to 97% for horizontal error detection and up to 100% for vertical error detection. While the error detection precision was fairly low for both methods (up to 55% for the horizontal method and up to 25.57% for the vertical method), the number of potential errors flagged was still sufficiently small to check manually. Furthermore, the automatically predicted correction for an error was often the right one. Hence, it would be feasible to employ the two methods in a semi-automatic error correction set-up where potential errors together with a suggested correction are flagged and presented to a user. As the two error correction methods are to some extent complementary, it would be worthwhile to investigate whether they can be combined. Some errors flagged by the horizontal method will not be detected by the vertical method, for instance, values which are valid in a given column, but inconsistent with the values of other fields. On the other hand, values which were entered in the wrong column should, in theory, also be detected by the horizontal method. For example, if the correct FAM ILY for Rana aurora is Ranidae, it should make no difference whether the (incorrect) value in the FAMILY field is Bufonidae, which is a valid value
for FAMILY but the wrong family for Rana aurora, or Amphibia, which is not a valid value for FAM ILY but the correct CLASS value for Rana aurora; in both cases the error should be detected. Hence, if both methods predict an error in a given field this should increase the likelihood that there is indeed an error. This could be exploited to obtain a higher precision. We plan to experiment with this idea in future research. Acknowledgments The research reported in this paper was funded by NWO (Netherlands Organisation for Scientific Research) and carried out at the Naturalis Research Labs in Leiden. We would like to thank Pim Arntzen and Erik van Nieukerken from Naturalis for guidance and helpful discussions. We are also grateful to two anonymous reviewers for useful comments.
References A. Bagga. 1998. Coreference, Cross-Document Coreference, and Information Extraction Methodologies. Ph.D. thesis, Dept. of Computer Science, Duke University. W. Daelemans, J. Zavrel, K. van der Sloot, A. van den Bosch, 2004. TiMBL: Tilburg Memory Based Learner, version 5.1, Reference Guide, 2004. ILK Research Group Technical Report Series no. 04-02. H. Galhardas, D. Florescu, D. Shasha, E. Simon. 1999. An extensible framework for data cleaning. Technical Report RR-3742, INRIA Technical Report, 1999. D. M. Hawkins. 1980. Identification of outliers. Chapman and Hall, London. M. A. Hern´andez, S. J. Stolfo. 1998. Real-world data is dirty: Data cleansing and the merge/purge problem. Journal of Data Mining and Knowledge Discovery, 2:1–31. M.-F. Jiang, S.-S. Tseng, C.-M. Su. 2001. Two-phase clustering process for outliers detection. Pattern Recognition Letters, 22:691–700. E. M. Knorr, R. T. Ng. 1998. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases (VLDB’98). A. Marcus, J. I. Maletic. 2000. Utilizing association rules for identification of possible errors in data sets. Technical Report TR-CS-00-04, The University of Memphis, Division of Computer Science, 2000. I. Ruts, P. J. Rousseeuw. 1996. Computing depth contours of bivariate point clouds. Computational Statistics and Data Analysis, 23:153–168. K. Sparck-Jones. 1972. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28:11–21. A. van den Bosch. 2004. Wrapped progressive sampling search for optimizing learning algorithm parameters. In Proceedings of the 16th Belgian-Dutch Conference on Artificial Intelligence, 219–226.
A Hybrid Approach for the Acquisition of Information Extraction Patterns Mihai Surdeanu, Jordi Turmo, and Alicia Ageno Technical University of Catalunya Barcelona, Spain {surdeanu,turmo,ageno}@lsi.upc.edu
Abstract
In this paper we present a hybrid approach for the acquisition of syntactico-semantic patterns from raw text. Our approach co-trains a decision list learner whose feature space covers the set of all syntactico-semantic patterns with an Expectation Maximization clustering algorithm that uses the text words as attributes. We show that the combination of the two methods always outperforms the decision list learner alone. Furthermore, using a modular architecture we investigate several algorithms for pattern ranking, the most important component of the decision list learner.
1 Introduction
Traditionally, Information Extraction (IE) identifies domain-specific events, entities, and relations among entities and/or events with the goals of populating relational databases, providing event-level indexing in news stories, feeding link discovery applications, etcetera. By and large the identification and selective extraction of relevant information is built around a set of domain-specific linguistic patterns. For example, for a “financial market change” domain one relevant pattern identifies a financial instrument together with the amount by which it changed. When this pattern is matched on the text “London gold fell $4.70 to $308.35”, a change of $4.70 is detected for the financial instrument “London gold”. Domain-specific patterns are either handcrafted or acquired automatically (Riloff, 1996; Yangarber et al., 2000; Yangarber, 2003; Stevenson and Greenwood, 2005). To minimize annotation costs, some of the latter approaches use lightly
supervised bootstrapping algorithms that require as input only a small set of documents annotated with their corresponding category label. The focus of this paper is to improve such lightly supervised pattern acquisition methods. Moreover, we focus on robust bootstrapping algorithms that can handle real-world document collections, which contain many domains. Although a rich literature covers bootstrapping methods applied to natural language problems (Yarowsky, 1995; Riloff, 1996; Collins and Singer, 1999; Yangarber et al., 2000; Yangarber, 2003; Abney, 2004) several questions remain unanswered when these methods are applied to syntactic or semantic pattern acquisition. In this paper we answer two of these questions: (1) Can pattern acquisition be improved with text categorization techniques? Bootstrapping-based pattern acquisition algorithms can also be regarded as incremental text categorization (TC), since in each iteration documents containing certain patterns are assigned the corresponding category label. Although TC is obviously not the main goal of pattern acquisition methodologies, it is nevertheless an integral part of the learning algorithm: each iteration of the acquisition algorithm depends on the previous assignments of category labels to documents. Hence, if the quality of the TC solution proposed is bad, the quality of the acquired patterns will suffer. Motivated by this observation, we introduce a co-training-based algorithm (Blum and Mitchell, 1998) that uses a text categorization algorithm as reinforcement for pattern acquisition. We show, using both a direct and an indirect evaluation, that the combination of the two methodologies always improves the quality of the acquired patterns.
(2) Which pattern selection strategy is best? While most bootstrapping-based algorithms follow the same framework, they vary significantly in what they consider the most relevant patterns in each bootstrapping iteration. Several approaches have been proposed in the context of word sense disambiguation (Yarowsky, 1995), named entity (NE) classification (Collins and Singer, 1999), pattern acquisition for IE (Riloff, 1996; Yangarber, 2003), or dimensionality reduction for text categorization (TC) (Yang and Pedersen, 1997). However, it is not clear which selection approach is the best for the acquisition of syntactico-semantic patterns. To answer this question, we have implemented a modular pattern acquisition architecture where several of these ranking strategies are implemented and evaluated. The empirical study presented in this paper shows that a strategy previously proposed for feature ranking for NE recognition outperforms algorithms designed specifically for pattern acquisition. The paper is organized as follows: Section 2 introduces the bootstrapping framework used throughout the paper. Section 3 introduces the data collections. Section 4 describes the direct and indirect evaluation procedures. Section 5 introduces a detailed empirical evaluation of the proposed system. Section 6 concludes the paper.
2 The Pattern Acquisition Framework
In this section we introduce a modular pattern acquisition framework that co-trains two different views of the document collection: the first view uses the collection words to train a text categorization algorithm, while the second view bootstraps a decision list learner that uses all syntactico-semantic patterns as features. The rules acquired by the latter algorithm, of the form p → y, where p is a pattern and y is a domain label, are the output of the overall system. The system can be customized with several pattern selection strategies that dramatically influence the quality and order of the acquired rules.
2.1 Co-training Text Categorization and Pattern Acquisition
Given two views of a classification task, co-training (Blum and Mitchell, 1998) bootstraps a separate classifier for each view as follows: (1) it initializes both classifiers with the same small amount of labeled data (i.e. seed documents in our case); (2) it repeatedly trains both classifiers using the currently labeled data; and (3) after each
learning iteration, the two classifiers share all or a subset of the newly labeled examples (documents in our particular case). The intuition is that each classifier provides new, informative labeled data to the other classifier. If the two views are conditionally independent and the two classifiers generally agree on unlabeled data they will have low generalization error. In this paper we focus on a “naive” co-training approach, which trains a different classifier in each iteration and feeds its newly labeled examples to the other classifier. This approach was shown to perform well on real-world natural language problems (Collins and Singer, 1999). Figure 1 illustrates the co-training framework used in this paper. The feature space of the first view contains only lexical information, i.e. the collection words, and uses as classifier Expectation Maximization (EM) (Dempster et al., 1977). EM is actually a class of iterative algorithms that find maximum likelihood estimates of parameters using probabilistic models over incomplete data (e.g. both labeled and unlabeled documents) (Dempster et al., 1977). EM was theoretically proven to converge to a local maximum of the parameters' log likelihood. Furthermore, empirical experiments showed that EM has excellent performance for lightly-supervised text classification (Nigam et al., 2000). The EM algorithm used in this paper estimates its model parameters using the Naive Bayes (NB) assumptions, similarly to (Nigam et al., 2000). From this point further, we refer to this instance of the EM algorithm as NB-EM. The feature space of the second view contains the syntactico-semantic patterns, generated using the procedure detailed in Section 3.2. The second learner is the actual pattern acquisition algorithm implemented as a bootstrapped decision list classifier. The co-training algorithm introduced in this paper interleaves one iteration of the NB-EM algorithm with one iteration of the pattern acquisition algorithm. If one classifier converges faster (e.g. NB-EM typically converges in under 20 iterations, whereas the acquisition algorithm learns new patterns for hundreds of iterations) we continue bootstrapping the other classifier alone.
2.2 The Text Categorization Algorithm
The parameters of the generative NB model, \hat{\theta}, include the probability of seeing a given category,
[Figure 1 flowchart: labeled seed documents and unlabeled documents feed the NB-EM and pattern acquisition components; NB-EM iterations and pattern acquisition iterations alternate until NB-EM converges and pattern acquisition terminates, producing patterns ordered by the ranking method.]
Figure 1: Co-training framework for pattern acquisition.

1. Initialization:
• Initialize the set of labeled examples with n labeled seed documents of the form (d_i, y_i), where y_i is the label of the i-th document d_i. Each document d_i contains a set of patterns {p_i1, p_i2, ..., p_im}.
• Initialize the list of learned rules R = {}.
2. Loop:
• For each label y, select a small set of pattern rules r = p → y, r ∉ R.
• Append all selected rules r to R.
• For all non-seed documents d that contain a pattern in R, set label(d) = arg max_{p,y} strength(p, y).
3. Termination condition:
• Stop if no rules selected or maximum number of iterations reached.
Figure 2: Pattern acquisition meta algorithm.

P(c|\hat{\theta}), and the probability of seeing a word given a category, P(w|c; \hat{\theta}). We calculate both similarly to Nigam (2000). Using these parameters, the word independence assumption typical to the Naive Bayes model, and the Bayes rule, the probability that a document d has a given category c is calculated as:

P(c|d; \hat{\theta}) = \frac{P(c|\hat{\theta}) P(d|c; \hat{\theta})}{P(d|\hat{\theta})}   (1)

= \frac{P(c|\hat{\theta}) \prod_{i=1}^{|d|} P(w_i|c; \hat{\theta})}{\sum_{j=1}^{q} P(c_j|\hat{\theta}) \prod_{i=1}^{|d|} P(w_i|c_j; \hat{\theta})}   (2)
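As a concrete illustration of Equations (1)-(2), the sketch below (our own, not the authors' code) scores one document against every category in log space; the parameter dictionaries are assumed to hold the NB-EM estimates of P(c|θ̂) and P(w|c; θ̂).

import math

def nb_posteriors(doc_words, log_prior, log_word_given_cat):
    # log_prior[c]: log P(c | theta); log_word_given_cat[c][w]: log P(w | c; theta).
    # Words never seen for a category are skipped here; we assume they were
    # smoothed away during parameter estimation.
    log_scores = {}
    for c in log_prior:
        score = log_prior[c]
        for w in doc_words:
            if w in log_word_given_cat[c]:
                score += log_word_given_cat[c][w]
        log_scores[c] = score
    # Normalize with log-sum-exp; the normalizer plays the role of the
    # denominator of Equation (2) and avoids underflow for long documents.
    m = max(log_scores.values())
    z = m + math.log(sum(math.exp(v - m) for v in log_scores.values()))
    return {c: math.exp(v - z) for c, v in log_scores.items()}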
2.3 The Pattern Acquisition Algorithm The lightly-supervised pattern acquisition algorithm iteratively learns domain-specific IE patterns from a small set of labeled documents and a much larger set of unlabeled documents. During each learning iteration, the algorithm acquires a new set of patterns and labels more documents based on the new evidence. The algorithm output is a list R of rules p → y, where p is a pattern in the set of patterns P , and y a category label in Y = {1...k}, k being the number of categories in the document collection. The list of acquired rules R is sorted in descending order of rule importance to guarantee that the most relevant rules are accessed first. This generic bootstrapping algorithm is formalized in Figure 2. Previous studies called the class of algorithms illustrated in Figure 2 “cautious” or “sequential”
because in each iteration they acquire 1 or a small set of rules (Abney, 2004; Collins and Singer, 1999). This strategy stops the algorithm from being over-confident, an important restriction for an algorithm that learns from large amounts of unlabeled data. This approach was empirically shown to perform better than a method that in each iteration acquires all rules that match a certain criterion (e.g. the corresponding rule has a strength over a certain threshold). The key element where most instances of this algorithm vary is the select procedure, which decides which rules are acquired in each iteration. Although several selection strategies have been previously proposed for various NLP problems, to our knowledge no existing study performs an empirical analysis of such strategies in the context of acquisition of IE patterns. For this reason, we implement several selection methods in our system (described in Section 2.4) and evaluate their performance in Section 5. The label of each collection document is given by the strength of its patterns. Similarly to (Collins and Singer, 1999; Yarowsky, 1995), we define the strength of a pattern p in a category y as the precision of p in the set of documents labeled with category y, estimated using Laplace smoothing:

strength(p, y) = \frac{count(p, y) + \epsilon}{count(p) + k\epsilon}   (3)
where count(p, y) is the number of documents labeled y containing pattern p, count(p) is the overall number of labeled documents containing p, and k is the number of domains. For all experiments presented here we used ε = 1. Another point where acquisition algorithms differ is the initialization procedure: some start with a small number of hand-labeled documents (Riloff, 1996), as illustrated in Figure 2, while others start with a set of seed rules (Yangarber et al., 2000; Yangarber, 2003). However, these approaches are conceptually similar: the seed rules are simply used to generate the seed documents. This paper focuses on the framework introduced in Figure 2 for two reasons: (a) “cautious” al-
gorithms were shown to perform best for several NLP problems (including acquisition of IE patterns), and (b) it has nice theoretical properties: Abney (2004) showed that, regardless of the selection procedure, “sequential” bootstrapping algorithms converge to a local minimum of K, where K is an upper bound of the negative log likelihood of the data. Obviously, the quality of the local minimum discovered is highly dependent on the selection procedure, which is why we believe an evaluation of several pattern selection strategies is important.
2.4 Selection Criteria
The pattern selection component, i.e. the select procedure of the algorithm in Figure 2, consists of the following: (a) for each category y all patterns p are sorted in descending order of their scores in the current category, score(p, y), and (b) for each category the top k patterns are selected. For all experiments in this paper we have used k = 3. We provide four different implementations for the pattern scoring function score(p, y) according to four different selection criteria.
Criterion 1: Riloff
This selection criterion was developed specifically for the pattern acquisition task (Riloff, 1996) and has been used in several other pattern acquisition systems (Yangarber et al., 2000; Yangarber, 2003; Stevenson and Greenwood, 2005). The intuition behind it is that a qualitative pattern is yielded by a compromise between pattern precision (which is a good indicator of relevance) and pattern frequency (which is a good indicator of coverage). Furthermore, the criterion considers only patterns that are positively correlated with the corresponding category, i.e. their precision is higher than 50%. The Riloff score of a pattern p in a category y is formalized as:
score(p, y) = \begin{cases} prec(p, y) \log(count(p, y)), & \text{if } prec(p, y) > 0.5 \\ 0, & \text{otherwise} \end{cases}   (4)

prec(p, y) = \frac{count(p, y)}{count(p)}   (5)
where prec(p, y) is the raw precision of pattern p in the set of documents labeled with category y.
Criterion 2: Collins
This criterion was used in a lightly-supervised NE recognizer (Collins and Singer, 1999). Unlike the previous criterion, which combines relevance and frequency in the same scoring function, Collins considers only patterns whose raw precision is
over a hard threshold T and ranks them by their global coverage:

score(p, y) = \begin{cases} count(p), & \text{if } prec(p, y) > T \\ 0, & \text{otherwise} \end{cases}   (6)
Similarly to (Collins and Singer, 1999) we used T = 0.95 for all experiments reported here.
Criterion 3: χ2 (Chi)
The χ2 score measures the lack of independence between a pattern p and a category y. It is computed using a two-way contingency table of p and y, where a is the number of times p and y co-occur, b is the number of times p occurs without y, c is the number of times y occurs without p, and d is the number of times neither p nor y occur. The number of documents in the collection is n. Similarly to the first criterion, we consider only patterns positively correlated with the corresponding category:

score(p, y) = \begin{cases} \chi^2(p, y), & \text{if } prec(p, y) > 0.5 \\ 0, & \text{otherwise} \end{cases}   (7)

\chi^2(p, y) = \frac{n(ad - cb)^2}{(a + c)(b + d)(a + b)(c + d)}   (8)
The χ2 statistic was previously reported to be the best feature selection strategy for text categorization (Yang and Pedersen, 1997).
Criterion 4: Mutual Information (MI)
Mutual information is a well known information theory criterion that measures the independence of two variables, in our case a pattern p and a category y (Yang and Pedersen, 1997). Using the same contingency table introduced above, the MI criterion is estimated as:

score(p, y) = \begin{cases} MI(p, y), & \text{if } prec(p, y) > 0.5 \\ 0, & \text{otherwise} \end{cases}   (9)

MI(p, y) = \log \frac{P(p \wedge y)}{P(p) \times P(y)}   (10)

\approx \log \frac{na}{(a + c)(a + b)}   (11)
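The four criteria can all be computed from the same contingency counts. The following sketch is our own illustration of Equations (4)-(11); the function names, the handling of zero counts, and the uniform signature are our choices.

import math

def prec(a, b):
    # raw precision of pattern p in category y: a = count(p, y), a + b = count(p)
    return a / (a + b) if (a + b) > 0 else 0.0

def riloff_score(a, b, c=0, d=0):
    # Equation (4): precision times log frequency, for positively correlated patterns
    p = prec(a, b)
    return p * math.log(a) if p > 0.5 and a > 0 else 0.0

def collins_score(a, b, c=0, d=0, T=0.95):
    # Equation (6): global coverage count(p), kept only above the precision threshold T
    return (a + b) if prec(a, b) > T else 0.0

def chi2_score(a, b, c, d):
    # Equations (7)-(8): a, b, c, d form the two-way contingency table of p and y
    if prec(a, b) <= 0.5:
        return 0.0
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

def mi_score(a, b, c, d):
    # Equations (9) and (11): the approximation log(n*a / ((a+c)(a+b)))
    if prec(a, b) <= 0.5 or a == 0:
        return 0.0
    n = a + b + c + d
    return math.log(n * a / ((a + c) * (a + b)))

def select_top_k(pattern_counts, scorer, k=3):
    # pattern_counts: {pattern: (a, b, c, d)}; returns the k best patterns of one category
    scored = [(scorer(*counts), p) for p, counts in pattern_counts.items()]
    return [p for s, p in sorted(scored, reverse=True)[:k] if s > 0]

The per-category selection of the top k = 3 patterns described above is mirrored by select_top_k.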
3 The Data
3.1 The Document Collections
For all experiments reported in this paper we used the following three document collections: (a) the AP collection is the Associated Press (year 1999) subset of the AQUAINT collection (LDC catalog number LDC2002T31); (b) the LATIMES collection is the Los Angeles Times subset of the TREC5 collection1; and (c) the REUTERS collection is the by now classic Reuters-21578 text categorization collection2.
1 http://trec.nist.gov/data/docs_eng.html
2 http://trec.nist.gov/data/reuters/reuters.html
Collection   # of docs   # of categories   # of words   # of patterns
AP           5000        7                 24812        140852
LATIMES      5000        8                 29659        69429
REUTERS      9035        10                12905        36608
Table 1: Document collections used in the evaluation

Similarly to previous work, for the REUTERS collection we used the ModApte split and selected the ten most frequent categories (Nigam et al., 2000). Due to memory limitations on our test machines, we reduced the size of the AP and LATIMES collections to their first 5,000 documents (the complete collections contain over 100,000 documents). The collection words were pre-processed as follows: (i) stop words and numbers were discarded; (ii) all words were converted to lower case; and (iii) terms that appear in a single document were removed. Table 1 lists the collection characteristics after pre-processing.
3.2 Pattern Generation
In order to extract the set of patterns available in a document, each collection document undergoes the following processing steps: (a) we recognize and classify named entities3, and (b) we generate full parse trees of all document sentences using a probabilistic context-free parser. Following the above processing steps, we extract Subject-Verb-Object (SVO) tuples using a series of heuristics, e.g.: (a) nouns preceding active verbs are subjects, (b) nouns directly attached to a verb phrase are objects, (c) nouns attached to the verb phrase through a prepositional attachment are indirect objects. Each tuple element is replaced with either its head word, if its head word is not included in a NE, or with the NE category otherwise. For indirect objects we additionally store the accompanying preposition. Lastly, each tuple containing more than two elements is generalized by maintaining only subsets of two and three of its elements and replacing the others with a wildcard. Table 2 lists the patterns extracted from one sample sentence. As Table 2 hints, the system generates a large number of candidate patterns. It is the task of the pattern acquisition algorithm to extract only the relevant ones from this complex search space.
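As an illustration of the generalization step just described (our sketch; the exact filtering of subsets is not fully specified in the text, so we simply keep every two- and three-element subset):

from itertools import combinations

def generalize(tuple_elements):
    # tuple_elements: ordered (slot, value) pairs from one SVO tuple, e.g.
    # [("s", "ORG"), ("v", "beat"), ("o", "ORG"), ("io", "in game")].
    # Dropping the remaining elements plays the role of the wildcard.
    patterns = []
    for size in (2, 3):
        for combo in combinations(tuple_elements, size):
            patterns.append(" ".join("%s(%s)" % (slot, val) for slot, val in combo))
    return patterns

# The tuple for "The Minnesota Vikings beat the Arizona Cardinals in yesterday's game."
print(generalize([("s", "ORG"), ("v", "beat"), ("o", "ORG"), ("io", "in game")]))

For this tuple the sketch produces ten candidates, while Table 2 lists nine of them, so the system presumably applies one additional constraint that we do not reproduce here.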
4 The Evaluation Procedures
4.1 The Indirect Evaluation Procedure
The goal of our evaluation procedure is to measure the quality of the acquired patterns. Intuitively,
3 We identify six categories: persons, locations, organizations, other names, temporal and numerical expressions.
The Minnesota Vikings beat the Arizona Cardinals in yesterday's game.
s(ORG) v(beat)
v(beat) o(ORG)
s(ORG) o(ORG)
v(beat) io(in game)
s(ORG) io(in game)
o(ORG) io(in game)
s(ORG) v(beat) o(ORG)
s(ORG) v(beat) io(in game)
v(beat) o(ORG) io(in game)
Table 2: Patterns extracted from one sample sentence. s stands for subject, v for verb, o for object, and io for indirect object.
the learned patterns should have high coverage and low ambiguity. We indirectly measure the quality of the acquired patterns using a text categorization strategy: we feed the acquired rules to a decision-list classifier, which is then used to classify a new set of documents. The classifier assigns to each document the category label given by the first rule whose pattern matches. Since we expect higher-quality patterns to appear higher in the rule list, the decision-list classifier never changes the category of an already-labeled document. The quality of the generated classification is measured using micro-averaged precision and recall:

P = \frac{\sum_{i=1}^{q} TruePositives_i}{\sum_{i=1}^{q} (TruePositives_i + FalsePositives_i)}   (12)

R = \frac{\sum_{i=1}^{q} TruePositives_i}{\sum_{i=1}^{q} (TruePositives_i + FalseNegatives_i)}   (13)
where q is the number of categories in the document collection. For all experiments and all collections with the exception of REUTERS, which has a standard document split for training and testing, we used 5-fold cross validation: we randomly partitioned the collections into 5 sets of equal sizes, and reserved a different one for testing in each fold. We have chosen this evaluation strategy because this indirect approach was shown to correlate well with a direct evaluation, where the learned patterns were used to customize an IE system (Yangarber et al., 2000). For this reason, much of the following work on pattern acquisition has used this approach as a de facto evaluation standard (Yangarber, 2003; Stevenson and Greenwood, 2005). Furthermore, given the high number of domains and patterns (we evaluate on 25 domains), an evaluation by human experts is extremely costly. Nevertheless, to show that the proposed indirect evaluation correlates well with a direct evaluation, two human experts have evaluated the patterns in several domains. The direct evaluation procedure is described next.
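A compact sketch (ours) of this indirect evaluation: each test document receives the label of the first matching rule in the ranked list, and precision and recall are micro-averaged as in Equations (12)-(13).

def classify(doc_patterns, ranked_rules):
    # ranked_rules: (pattern, category) pairs sorted by decreasing rule quality.
    # Returns the category of the first matching rule, or None (document stays unlabeled).
    for pattern, category in ranked_rules:
        if pattern in doc_patterns:
            return category
    return None

def micro_precision_recall(gold_labels, predicted_labels):
    # Micro-averaging sums true/false positives and false negatives over all categories.
    tp = sum(1 for g, p in zip(gold_labels, predicted_labels) if p is not None and p == g)
    fp = sum(1 for g, p in zip(gold_labels, predicted_labels) if p is not None and p != g)
    fn = sum(1 for g, p in zip(gold_labels, predicted_labels) if p is None or p != g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall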
4.2 The Direct Evaluation Procedure The task of manually deciding whether an acquired pattern is relevant or not for a given domain is not trivial, mainly due to the ambiguity of the patterns. Thus, this process should be carried out by more than one expert, so that the relevance of the ambiguous patterns can be agreed upon. For example, the patterns s(ORG) v(score) o(goal) and s(PER) v(lead) io(with point) are clearly relevant only for the sports domain, whereas the patterns v(sign) io(as agent) and o(title) io(in DATE) might be regarded as relevant for other domains as well. The specific procedure to manually evaluate the patterns is the following: (1) two experts separately evaluate the acquired patterns for the considered domains and collections; and (2) the results of both evaluations are compared. For any disagreement, we have opted for a strict evaluation: all the occurrences of the corresponding pattern are looked up in the collection and, whenever at least one pattern occurrence belongs to a document assigned to a different domain than the domain in question, the pattern will be considered as not relevant. Both the ambiguity and the high number of the extracted patterns have prevented us from performing an exhaustive direct evaluation. For this reason, only the top (most relevant) 100 patterns have been evaluated for one domain per collection. The results are detailed in Section 5.2.
5 Experimental Evaluation
5.1 Indirect Evaluation For a better understanding of the proposed approach we perform an incremental evaluation: first, we evaluate only the various pattern selection criteria described in Section 2.4 by disabling the NB-EM component. Second, using the best selection criteria, we evaluate the complete co-training system. In both experiments we initialize the system with high-precision manually-selected seed rules which yield seed documents with a coverage of 10% of the training partitions. The remaining 90% of the training documents are maintained unlabeled. For all experiments we used a maximum of 400 bootstrapping iterations. The acquired rules are fed to the decision list classifier which assigns category labels to the documents in the test partitions. Evaluation of the pattern selection criteria Figure 3 illustrates the precision/recall charts
of the four algorithms as the number of patterns made available to the decision list classifier increases. All charts show precision/recall points starting after 100 learning iterations with 100-iteration increments. It is immediately obvious that the Collins selection criterion performs significantly better than the other three criteria. For the same recall point, Collins yields a classification model with much higher precision, with differences ranging from 5% in the REUTERS collection to 20% in the AP collection. Theorem 5 in (Abney, 2002) provides a theoretical explanation for these results: if certain independence conditions between the classifier rules are satisfied and the precision of each rule is larger than a threshold T, then the precision of the final classifier is larger than T. Although the rule independence conditions are certainly not satisfied in our real-world evaluation, the above theorem indicates that there is a strong relation between the precision of the classifier rules on labeled data and the precision of the final classifier. Our results provide the empirical proof that controlling the precision of the acquired rules (i.e. the Collins criterion) is important. The Collins criterion controls the recall of the learned model by favoring rules with high frequency in the collection. However, since the other two criteria do not use a high precision threshold, they will acquire more rules, which translates into better recall. For two out of the three collections, Riloff and Chi obtain a slightly better recall, about 2% higher than Collins', albeit with a much lower precision. We do not consider this an important advantage: in the next section we show that co-training with the NB-EM component further boosts the precision and recall of the Collins-based acquisition algorithm. The MI criterion performs the worst of the four evaluated criteria. A clue for this behavior lies in the following equivalent form for MI: MI(p, y) = log P(p|y) − log P(p). This formula indicates that, for patterns with equal conditional probabilities P(p|y), MI assigns higher scores to patterns with lower frequency. This is not the desired behavior in a TC-oriented system.
Evaluation of the co-training system
Figure 4 compares the performance of the stand-alone pattern acquisition algorithm (“bootstrapping”) with the performance of the acquisition algorithm trained in the co-training environ-
Figure 3: Performance of the pattern acquisition algorithm for various pattern selection strategies and multiple collections: (a) AP, (b) LATIMES, and (c) REUTERS
ment (“co-training”). For both setups we used the best pattern selection criterion for pattern acquisition, i.e. the Collins criterion. To put things in perspective, we also depict the performance obtained with a baseline system, i.e. the system configured to use the Riloff pattern selection criterion and without the NB-EM algorithm (“baseline”). To our knowledge, this system, or a variation of it, is the current state-of-the-art in pattern acquisition (Riloff, 1996; Yangarber et al., 2000; Yangarber, 2003; Stevenson and Greenwood, 2005). All algorithms were initialized with the same seed rules and had access to all documents. Figure 4 shows that the quality of the learned patterns always improves if the pattern acquisition algorithm is “reinforced” with EM. For the same recall point, the patterns acquired in the co-training environment yield classification models with precision (generally) much larger than the models generated by the pattern acquisition algorithm alone. When using the same pattern acquisition criterion, e.g. Collins, the differences between the co-training approach and the stand-alone pattern acquisition method (“bootstrapping”) range from 2-3% in the REUTERS collection to 20% in the LATIMES collection. These results support our intuition that the sparse pattern space is insufficient to generate good classification models, which directly influences the quality of all acquired patterns. Furthermore, due to the increased coverage of the lexicalized collection views, the patterns acquired in the co-training setup generally have better recall, up to 11% higher in the LATIMES collection. Lastly, the comparison of our best system (“cotraining”) against the current state-of-the-art (our “baseline”) draws an even more dramatic picture:
Collection   Domain                   Relevant patterns (baseline)   Relevant patterns (co-training)   Initial inter-expert agreement
AP           Sports                   22%                            68%                               84%
LATIMES      Financial                67%                            76%                               70%
REUTERS      Corporate Acquisitions   38%                            46%                               66%
Table 3: Percentage of relevant patterns for one domain per collection by the baseline system (Riloff) and the co-training system.
for the same recall point, the co-training system obtains a precision up to 35% higher for AP and LATIMES, and up to 10% higher for REUTERS. 5.2 Direct Evaluation As stated in Section 4.2, two experts have manually evaluated the top 100 acquired patterns for one different domain in each of the three collections. The three corresponding domains have been selected intending to deal with different degrees of ambiguity, which are reflected in the initial interexpert agreement. Any disagreement between experts is solved using the algorithm introduced in Section 4.2. Table 3 shows the results of this direct evaluation. The co-training approach outperforms the baseline for all three collections. Concretely, improvements of 9% and 8% are achieved for the Financial and the Corporate Acquisitions domains, and 46%, by far the largest difference, is found for the Sports domain in AP. Table 4 lists the top 20 patterns extracted by both approaches in the latter domain. It can be observed that for the baseline, only the top 4 patterns are relevant, the rest being extremely general patterns. On the other hand, the quality of the patterns acquired by our approach is much higher: all the patterns are relevant to the domain, although 7 out of the 20 might be considered ambiguous and according to the criterion defined in Section 4.2 have been evaluated as not relevant.
Figure 4: Comparison of the bootstrapping pattern acquisition algorithm with the co-training approach: (a) AP, (b) LATIMES, and (c) REUTERS

Baseline: s(he) o(game); v(miss) o(game); v(play) o(game); v(play) io(in LOC); v(go) o(be); s(he) v(be); s(that) v(be); s(I) v(be); s(it) v(go) o(be); s(it) v(be); s(I) v(think); s(I) v(know); s(I) v(want); s(there) v(be); s(we) v(do); v(do) o(it); s(it) o(be); s(we) v(are); s(we) v(go); s(PER) o(DATE)
Co-training: v(win) o(title); s(I) v(play); s(he) v(game); s(we) v(play); v(miss) o(game); s(he) v(coach); v(lose) o(game); s(I) o(play); v(make) o(play); v(play) io(in game); v(want) o(play); v(win) o(MISC); s(he) o(player); v(start) o(game); s(PER) o(contract); s(we) o(play); s(team) v(win); v(rush) io(for yard); s(we) o(team); v(win) o(Bowl)
Table 4: Top 20 patterns acquired from the Sports domain by the baseline system (Riloff) and the co-training system for the AP collection. The correct patterns are in bold.
6 Conclusions
This paper introduces a hybrid, lightly-supervised method for the acquisition of syntactico-semantic patterns for Information Extraction. Our approach co-trains a decision list learner whose feature space covers the set of all syntactico-semantic patterns with an Expectation Maximization clustering algorithm that uses the text words as attributes. Furthermore, we customize the decision list learner with up to four criteria for pattern selection, which is the most important component of the acquisition algorithm. For the evaluation of the proposed approach we have used both an indirect evaluation based on Text Categorization and a direct evaluation where human experts evaluated the quality of the generated patterns. Our results indicate that co-training the Expectation Maximization algorithm with the decision list learner tailored to acquire only high precision patterns is by far the best solution. For the same recall point, the proposed method increases the precision of the generated models up
to 35% from the previous state of the art. Furthermore, the combination of the two feature spaces (words and patterns) also increases the coverage of the acquired patterns. The direct evaluation of the acquired patterns by the human experts validates these results.
References
S. Abney. 2002. Bootstrapping. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.
S. Abney. 2004. Understanding the Yarowsky algorithm. Computational Linguistics, 30(3).
A. Blum and T. Mitchell. 1998. Combining labeled and unlabeled data with co-training. In Proceedings of the 11th Annual Conference on Computational Learning Theory.
M. Collins and Y. Singer. 1999. Unsupervised models for named entity classification. In Proceedings of EMNLP.
A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1).
K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. 2000. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3).
E. Riloff. 1996. Automatically generating extraction patterns from untagged text. In Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI-96).
M. Stevenson and M. Greenwood. 2005. A semantic approach to IE pattern induction. In Proceedings of the 43rd Meeting of the Association for Computational Linguistics.
Y. Yang and J. O. Pedersen. 1997. A comparative study on feature selection in text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning.
R. Yangarber, R. Grishman, P. Tapanainen, and S. Huttunen. 2000. Automatic acquisition of domain knowledge for information extraction. In Proceedings of the 18th International Conference on Computational Linguistics (COLING 2000).
R. Yangarber. 2003. Counter-training in discovery of semantic patterns. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL 2003).
D. Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics.
Expanding the Recall of Relation Extraction by Bootstrapping Stephen Soderland Oren Etzioni Department of Computer Science & Engineering University of Washington Seattle, WA 98195-2350 {soderlan,etzioni}@cs.washington.edu
Junji Tomita NTT Cyber Solutions Laboratories, NTT Corporation 1-1 Hikarinooka Yokosuka-Shi, Kanagawa 239-0847, Japan
[email protected]
Abstract
Most work on relation extraction assumes considerable human effort for making an annotated corpus or for knowledge engineering. Generic patterns employed in KnowItAll achieve unsupervised, high-precision extraction, but often result in low recall. This paper compares two bootstrapping methods for expanding recall that start with seeds automatically extracted by KnowItAll. The first method is string pattern learning, which learns string contexts adjacent to a seed tuple. The second method learns less restrictive patterns that include bags of words and relation-specific named entity tags. Both methods improve the recall of the generic pattern method. In particular, the less restrictive pattern learning method can achieve a 250% increase in recall at 0.87 precision, compared to the generic pattern method.
1 Introduction
Relation extraction is the task of extracting tuples of entities that satisfy a given relation from textual documents. Examples of relations include CeoOf(Company, Ceo) and Acquisition(Organization, Organization). There has been much work on relation extraction; most of it employs knowledge engineering or supervised machine learning approaches (Feldman et al., 2002; Zhao and Grishman, 2005). Both approaches are labor intensive. We begin with a baseline information extraction system, KnowItAll (Etzioni et al., 2005), that does unsupervised information extraction at Web scale. KnowItAll uses a set of generic extraction pat-
terns, and automatically instantiates rules by combining these patterns with user-supplied relation labels. For example, KnowItAll has patterns for a generic “of” relation:

NP1 's relation , NP2
NP2 , relation of NP1
where NP1 and NP2 are simple noun phrases that extract values of argument1 and argument2 of a relation, and relation is a user-supplied string associated with the relation. The rules may also constrain NP1 and NP2 to be proper nouns. If a user supplies the relation labels “ceo” and “chief executive officer” for the relation CeoOf(Company, Ceo), KnowItAll inserts these labels into the generic patterns shown above, to create 4 extraction rules:

NP1 's ceo , NP2
NP1 's chief executive officer , NP2
NP2 , ceo of NP1
NP2 , chief executive officer of NP1

The same generic patterns with different labels can also produce extraction rules for a MayorOf relation or an InventorOf relation. These rules have alternating context strings (exact string match) and extraction slots (typically an NP or head of an NP). This can produce rules with high precision, but low recall, due to the wide variety of contexts describing a relation. This paper looks at ways to enhance recall over this baseline system while maintaining high precision. To enhance recall, we employ bootstrapping techniques which start with seed tuples, i.e. the most frequently extracted tuples by the baseline system. The first method represents rules with three context strings of tokens immediately adjacent to the extracted arguments: a left context,
middle context, and right context. These are induced from context strings found adjacent to seed tuples. The second method uses a less restrictive pattern representation such as bag of words, similar to that of SnowBall(Agichtein, 2005). SnowBall is a semi-supervised relation extraction system. The input of Snowball is a few hand labeled correct seed tuples for a relation (e.g. for CeoOf relation). SnowBall clusters the bag of words representations generated from the context strings adjacent to each seed tuple, and generates rules from them. It calculates the confidence of candidate tuples and the rules iteratively by using an EM-algorithm. Because it can extract any tuple whose entities co-occur within a window, the recall can be higher than the string pattern learning method. The main disadvantage of SnowBall or a method which employs less restrictive patterns is that it requires Named Entity Recognizer (NER). We introduce Relation-dependent NER (Relation NER), which trains an off-the-shelf supervised NER based on CRF(Lafferty et al., 2001) with bootstrapping. This learns relation-specific NE tags, and we present a method to use these tags for relation extraction. This paper compares the following two bootstrapping strategies. SPL: a simple string pattern learning method. It learns string patterns adjacent to a seed tuple. LRPL: a less restrictive pattern learning method. It learns a variety of bag of words patterns, after training a Relation NER. Both methods are completely self-supervised extensions to the unsupervised KnowItAll. A user supplies KnowItAll with one or more relation labels to be applied to one or more generic extraction patterns. No further tagging or manual selection of seeds is required. Each of the bootstrapping methods uses seeds that are automatically selected from the output of the baseline KnowItAll system. The results show that both bootstrapping methods improve the recall of the baseline system. The two methods have comparable results, with LRPL outperforms SPL for some relations and SPL outperforms LRPL for other relations. The rest of the paper is organized as follows. Section 2 and 3 describe SPL and LRPL respectively. Section 4 reports on our experiments, and
Sections 5 and 6 describe related work and conclusions.
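As a small illustration of the rule instantiation described above (our sketch; the noun-phrase matching below is a crude regular-expression stand-in for KnowItAll's actual NP recognition, and it assumes whitespace-tokenized text where commas are separate tokens):

import re

GENERIC_PATTERNS = ["NP1 's {label} , NP2", "NP2 , {label} of NP1"]

def instantiate(labels):
    # one concrete extraction rule per (generic pattern, relation label) combination
    return [p.replace("{label}", lab) for p in GENERIC_PATTERNS for lab in labels]

def apply_rule(rule, sentence):
    # stand-in NP recognizer: capitalized token sequences play the role of NP1/NP2;
    # the captured groups are returned in textual order.
    np = r"([A-Z][\w.&-]*(?: [A-Z][\w.&-]*)*)"
    regex = re.escape(rule).replace("NP1", np).replace("NP2", np)
    m = re.search(regex, sentence)
    return m.groups() if m else None

rules = instantiate(["ceo", "chief executive officer"])   # CeoOf(Company, Ceo)
print(apply_rule(rules[2], "Steve Ballmer , ceo of Microsoft , said yesterday ..."))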
2 String Pattern Learning (SPL)
Both SPL and LRPL start with seed tuples that were extracted by the baseline KnowItAll system, with extraction frequency at or above a threshold (set to 2 in these experiments). In these experiments, we downloaded a set of sentences from the Web that contained an occurrence of at least one relation label and used this as our reservoir of unlabeled training and test sentences. We created a set of positive training sentences from those sentences that contained both argument values of a seed tuple. SPL employs a method similar to that of (Downey et al., 2004). It generates candidate extraction rules with a prefix context, a middle context, and a right context. The prefix is a window of zero or more tokens immediately to the left of the extracted argument1, the middle context is all tokens between argument1 and argument2, and the right context is a window of zero or more tokens immediately to the right of argument2. It discards patterns with more than a maximum number of intervening tokens or without a relation label. SPL tabulates the occurrence of such patterns in the set of positive training sentences (all sentences from the reservoir that contain both argument values from a seed tuple in either order), and also tabulates their occurrence in negative training sentences. The negative training sentences are those that have one argument value from a seed tuple and the nearest simple NP in place of the other argument value. This idea is based on that of (Ravichandran and Hovy, 2002) for a QA system. SPL learns a possibly large set of strict extraction rules that have alternating context strings and extraction slots, with no gaps or wildcards in the rules. SPL selects the best patterns as follows:
1. Groups the context strings that have the exact same middle string.
2. Selects the best pattern, i.e. the one with the largest pattern score, for each group of context strings having the same middle string.
3. Selects the patterns whose pattern score is greater than a threshold.

score(p) = \frac{|S_2(p)|}{|S_2(p)| + |S_1(p)| + c}   (1)

where S_2(p) is the set of sentences that match pattern p and include both argument values of a seed tuple, S_1(p) is the set of sentences that match p and include just one argument value of a seed tuple (e.g. just a company or a person for CeoOf), and c is a constant for smoothing.

Figure 1: The architecture of LRPL (Less Restrictive Pattern Learning).
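Returning to SPL's candidate generation (Section 2), here is a sketch (ours) of how left/middle/right context candidates could be collected around the argument values of a seed tuple; the window limits below are illustrative assumptions.

def spl_candidate_contexts(tokens, arg1, arg2, max_side=3, max_middle=6):
    # tokens: a tokenized sentence; arg1, arg2: tokenized argument values of a seed tuple.
    # Returns (left, middle, right) context-string candidates with growing side windows.
    def find(needle):
        for i in range(len(tokens) - len(needle) + 1):
            if tokens[i:i + len(needle)] == needle:
                return i
        return -1

    i, j = find(arg1), find(arg2)
    if i < 0 or j < 0 or i == j:
        return []
    (s1, e1), (s2, e2) = sorted([(i, i + len(arg1)), (j, j + len(arg2))])
    middle = tokens[e1:s2]
    if len(middle) > max_middle:
        return []
    candidates = []
    for left_len in range(max_side + 1):
        for right_len in range(max_side + 1):
            left = tokens[max(0, s1 - left_len):s1]
            right = tokens[e2:e2 + right_len]
            candidates.append((" ".join(left), " ".join(middle), " ".join(right)))
    return candidates

Candidates that share the same middle string would then be grouped and scored against the positive and negative training sentences, as described above.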
3 Less Restrictive Pattern Learning (LRPL)
LRPL uses a more flexible rule representation than SPL. As before, the rules are based on a window of tokens to the left of the first argument, a window of middle tokens, and a window of tokens to the right of the second argument. Rather than using exact string match on a simple sequence of tokens, LRPL uses a combination of bag of words and immediately adjacent token. The left context is based on a window of tokens immediately to the left of argument1. It has two sets of tokens: the token immediately to the left and a bag of words for the remaining tokens. Each of these sets may have zero or more tokens. The middle and right contexts are similarly defined. We call this representation extended bag of words. Here is an example of how LRPL represents the context of a training sentence with window size set to 4.

“Yesterday , <Arg2>Steve Ballmer</Arg2> , the Chief Executive Officer of <Arg1>Microsoft</Arg1> said that he is ...”

order: arg2_arg1
values: Steve Ballmer, Microsoft
L: {yesterday} {,}
M: {,} {chief executive officer the} {of}
R: {said} {he is that}
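The representation in the example above can be reproduced with a few lines of code; this sketch is ours, and the lowercasing and the treatment of a single-token middle context are arbitrary choices.

def extended_bow(tokens, first_span, second_span, window=4):
    # first_span / second_span: (start, end) token offsets of the two arguments,
    # in sentence order.  Each context is split into the token adjacent to an
    # argument and a bag of the remaining tokens, as in the example above.
    (s1, e1), (s2, e2) = sorted([first_span, second_span])
    left = [t.lower() for t in tokens[max(0, s1 - window):s1]]
    middle = [t.lower() for t in tokens[e1:s2]]
    right = [t.lower() for t in tokens[e2:e2 + window]]
    return {
        "L": ({left[-1]} if left else set(), set(left[:-1])),
        "M": ({middle[0]} if middle else set(),
              set(middle[1:-1]),
              {middle[-1]} if middle else set()),
        "R": ({right[0]} if right else set(), set(right[1:])),
    }

Applied to the tokenized example sentence with the spans of "Steve Ballmer" and "Microsoft", this returns the L, M, and R sets shown above.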
Some of the tokens in these bags of words may be dropped in merging this with patterns from
other training sentences. Each rule also has a confidence score, learned from EM-estimation. We experimented with simply using three bags of words as in SnowBall, but found that precision was increased when we distinguished the tokens immediately adjacent to argument values from the other tokens in the left, middle, and right bag of words. Less restrictive patterns require a Named Entity Recognizer (NER), because the patterns can not extract candidate entities by themselves1 . LRPL trains a supervised NER in bootstrapping for extracting candidate entities. Figure 1 overviews LRPL. It consists of two bootstrapping modules: Relation NER and Relation Assessor. LRPL trains the Relational NER from seed tuples provided by the baseline KnowItAll system and unlabeled sentences in the reservoir. Then it does NE tagging on the sentences to learn the less restrictive rules and to extract candidate tuples. The learning and extraction steps at Relation Assessor are similar to that of SnowBall; it generates a set of rules and uses EM-estimation to compute a confidence in each rule. When these rules are applied, the system computes a probability for each tuple based on the rule confidence, the degree of match between a sentence and the rule, and the extraction frequency. 3.1
Relation-dependent Named Entity Recognizer
Relation NER leverages an off-the-shelf supervised NER, based on Conditional Random Fields (CRF). In Figure 1, TrainSentenceGenerator automatically generates training sentences from seeds and unlabeled sentences in the reservoir. TrainEntityRecognizer trains a CRF on the training sentences and then EntityRecognizer applies the trained CRF to all the unlabeled sentences, creating entity annotated sentences. It can extract entities whose type matches an argument type of a particular relation. The type is not explicitly specified by a user, but is automatically determined according to the seed tuples. For example, it can extract 'City' and 'Mayor' type entities for the MayorOf(City, Mayor) relation. We describe CRF in brief, and then how to train it in bootstrapping.

1 Although using all noun phrases in a sentence may be possible, it apparently results in low precision.
3.1.1 Supervised Named Entity Recognizer
Several state-of-the-art supervised NERs are based on a feature-rich probabilistic conditional classifier such as Conditional Random Fields (CRF) for sequential learning tasks (Lafferty et al., 2001; Rosenfeld et al., 2005). The input of CRF is a sequence of features and the output is a tag sequence. In the training phase, a set of feature sequences with their tag sequences is provided, and the output is a model. In the applying phase, given a feature sequence, it outputs a tag sequence by using the model. In the case of NE tagging, given a sequence of tokens, it automatically generates a sequence of feature sets, each set corresponding to a token. It can incorporate any properties that can be represented as a binary feature into the model, such as words, capitalization patterns, part-of-speech tags and the existence of the word in a dictionary. It works quite well on NE tagging tasks (McCallum and Li, 2003).
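The paper trains its CRF with the MinorThird toolkit; purely as an illustration of the same kind of feature-based sequence tagger, here is a sketch using the sklearn-crfsuite package (the package choice, the feature set, and the BIO tag names are our assumptions, not the authors' setup).

import sklearn_crfsuite

def token_features(tokens, i):
    w = tokens[i]
    feats = {"lower": w.lower(), "is_title": w.istitle(),
             "is_upper": w.isupper(), "prefix3": w[:3]}
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].lower()
    if i + 1 < len(tokens):
        feats["next_lower"] = tokens[i + 1].lower()
    return feats

def featurize(sentences):
    return [[token_features(s, i) for i in range(len(s))] for s in sentences]

# Training data: token sequences with relation-specific BIO tags, e.g.
#   ["Yesterday", ",", "Steve", "Ballmer", ",", "CEO", "of", "Microsoft", ...]
#   ["O",         "O", "B-Ceo", "I-Ceo",   "O", "O",   "O", "B-Company", ...]
def train_relation_ner(train_sentences, train_tags):
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                               max_iterations=100, all_possible_transitions=True)
    crf.fit(featurize(train_sentences), train_tags)
    return crf

def tag_sentences(crf, sentences):
    return crf.predict(featurize(sentences))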
3.1.2 How to Train Supervised NER in Bootstrapping
We use bootstrapping to train CRF for relation-specific NE tagging as follows: 1) select the sentences that include all the entity values of a seed tuple, 2) automatically mark the argument values in each sentence, and 3) train CRF on the seed marked sentences. An example of a seed marked sentence is the following:

seed tuple: <Microsoft, Steve Ballmer>
seed marked sentence: "Yesterday, Steve Ballmer, CEO of Microsoft announced that ..."
Because of redundancy, we can expect to generate a fairly large number of seed marked sentences by using a few highly frequent seed tuples. To avoid overfitting on terms from these seed tuples, we substitute the actual argument values with random characters for each training sentence, preserving capitalization patterns and number of characters in each token. 3.2
Relation Assessor
entity annotated sentences, and classifies the contexts into two classes: training contexts (if their entity values and their orders match a seed tuple) and test contexts (otherwise). TrainConfidenceEstimator clusters based on the match score between contexts, and generates a rule from each cluster, that has average vectors over contexts belonging to the cluster. Given a set of generated rules and test contexts , ConfidenceEstimator estimates each tuple confidence in by using an EM algorithm. It also estimates the confidence of the tuples extracted by the baseline system, and outputs the merged result tuples with confidence. We describe the match score calculation method, the EM-algorithm, and the merging method in the following sub sections. 3.2.1 Match Score Calculation The match score (or similarity) of two extended bag of words contexts , is calculated as the linear combination of the cosine values between the corresponding vectors.
(2)
where, is the index of left, middle, or right contexts. is the index of left adjacent, right adjacent, or other tokens. is the weight corresponding to the context vector indexed by and . To achieve high precision, Relation Assessor uses only the entity annotated sentences that have just one entity for each argument (two entities in total) and where those entities co-occur within tokens window, and it uses at most left and right tokens. It discards patterns without a relation label. 3.2.2
EM-estimation for tuple and rule confidence Several rules generated from only positive evidence result in low precision (e.g. rule “of” for MayorOf relation generated from “Rudolph Giuliani of New York”). This problem can be improved by estimating the rule confidence by the following EM-algorithm.
Relation Assessor employs several SnowBall-like techniques including making rules by clustering and EM-estimation for the confidence of the rules and tuples. In Figure 1, ContextRepresentationGenerator generates extended bag of words contexts, from
1. For each in , identifies the best match rule , based on the match score between and each rule . is the th context that includes tuple .
argmax
(3)
2. Initializes seed tuple confidence, for all , where is a seed tuple.
3. Calculates tuple confidence, , and rule confidence, , by using EM-algorithm. E and M stages are iterated several times. E stage:
(5)
(6)
where
left middle right
total 0.2 0.6 0.2
(4)
4.1
adjacency other right 0.067 0.133 0.24 0.12 0.24 0.133 0.067
left context
M stage:
Table 1: Weights corresponding to a context vector ( ).
is a constant for smoothing. This algorithm assigns a high confidence to the rules that frequently co-occur with only high confident tuples. It also assigns a high confidence to the tuples that frequently co-occur with the contexts that match high confidence rules. When it merges the tuples extracted by the baseline system, the algorithm uses the following constant value for any context that matches a baseline pattern.
(7) where denotes the context of tuple that matches a baseline pattern, and is any baseline pattern. With this calculation, the confidence of any tuple extracted by a baseline pattern is always greater than or equal to that of any tuple that is extracted by the learned rules and has the same frequency.
4 Evaluation
The focus of this paper is the comparison between bootstrapping strategies for extraction, i.e., string pattern learning and less restrictive pattern learning with a Relation NER. Therefore, we first compare these two bootstrapping methods with the baseline system. Furthermore, we also compare Relation NER with a generic NER, which is trained on a pre-existing hand-annotated corpus.
4.1 Relation Extraction Task
We compare SPL and LRPL with the baseline system on 5 relations: Acquisition, Merger, CeoOf, MayorOf, and InventorOf. We downloaded from about 100,000 to 220,000 sentences for each of these relations from the Web, which contained a relation label (e.g. “acquisition”, “acquired”, “acquiring” or “merger”, “merged”, “merging”). We used all the tuples that co-occur with baseline patterns at least twice as seeds. The numbers of seeds are between 33 (Acquisition) and 289 (CeoOf). For consistency, SPL employs the same assessment methods as LRPL. It uses the EM algorithm in Section 3.2.2 and merges the tuples extracted by the baseline system. In the EM algorithm, the match score between a learned pattern and a tuple is set to a constant. LRPL uses the MinorThird (Cohen, 2004) implementation of CRF for Relation NER. The features used in the experiments are the lower-case word, capitalization pattern, and part-of-speech tag of the current and ±2 tokens, and the previous state (tag), following (Minkov et al., 2005; Rosenfeld et al., 2005). The parameters used for SPL and LRPL are set experimentally, along with the context weights for LRPL shown in Table 1. Figures 2-6 show the recall-precision curves. We use the number of correct extractions to serve as a surrogate for recall, since computing actual recall would require extensive manual inspection of the large data sets. Compared to the baseline system, both bootstrapping methods increase the number of correct extractions for almost all the relations at around 80% precision. For the MayorOf relation, LRPL achieves a 250% increase in recall at 0.87 precision, while SPL's precision is less than the baseline system's. This is because SPL cannot distinguish correct tuples from the error tuples that
Figure 2: The recall-precision curve of CeoOf relation.
Figure 4: The recall-precision curve of Acquisition relation.
Figure 3: The recall-precision curve of MayorOf relation.
Figure 5: The recall-precision curve of Merger relation.
co-occur with a short strict pattern, and that have a wrong entity type value. An example of the error tuples extracted by SPL is the following:

Learned Pattern: NP1 Mayor NP2
Sentence: "When Lord Mayor Clover Moore spoke,..."
Tuple: <Lord, Clover Moore>
The improvement of Acquisition and Merger relations is small for both methods; the rules learned for Merger and Acquisition made erroneous extractions of mergers of geo-political entities, acquisition of data, ball players, languages or diseases. For InventorOf relation, LRPL does not work well. This is because ‘Invention’ is not a proper noun phrase, but a noun phrase. A noun phrase includes not only nouns, but a particle, a determiner, and adjectives in addition to noncapitalized nouns. Our Relation NER was unable to detect regularities in the capitalization pattern and word length of invention phrases. At around 60% precision, SPL achieves higher recall for CeoOf and MayorOf relations, in con-
trast, LRPL achieves higher recall for Acquisition and Merger. The reason can be that nominal style relations (CeoOf and MayorOf) have a smaller syntactic variety for describing them. Therefore, learned string patterns are enough generic to extract many candidate tuples. 4.2
Entity Recognition Task
Generic types such as person, organization, and location cover many useful relations. One might expect that NER trained for these generic types, can be used for different relations without modifications, instead of creating a Relation NER. To show the effectiveness of Relation NER, we compare Relation NER with a generic NER trained on a pre-existent hand annotated corpus for generic types; we used MUC7 train, dry-run test, and formal-test documents(Table 2) (Chinchor, 1997). We also incorporate the following additional knowledge into the CRF’s features referring to (Minkov et al., 2005; Rosenfeld et al.,
culation is in favor of Relation NER. For fair comparison, we also use the following measure.
Figure 6: The recall-precision curve of InventorOf relation.

Table 2: The number of entities and unique entities in MUC7 corpus. The number of documents is 225.
entity         all    uniq
Organization   3704   993
Person         2120   1088
Location       2912   692
2005): first and last names, city names, corp designators, company words (such as “technology”), and small size lists of person title (such as “mr.”) and capitalized common words (such as “Monday”). The base features for both methods are the same as the ones described in Section 4.1. The ideal entity recognizer for relation extraction is recognizing only entities that have an argument type for a particular relation. Therefore, a generic test set such as MUC7 Named Entity Recognition Task can not be used for our evaluation. We randomly selected 200 test sentences from our dataset that had a pair of correct entities for CeoOf or MayorOf relations, and were not used as training for the Relation NER. We measured the accuracy as follows.
(10) where, is a set of true entities that have a generic type 2 . Table 3 shows that the Relation NER consistently works better than the generic NER, even when additional knowledge much improved the recall. This suggests that training a Relation NER for each particular relation in bootstrapping is better approach than using a NER trained for generic types.
5 Related Work SPL is a similar approach to DIPRE (Brin, 1998) DIPRE uses a pre-defined simple regular expression to identify argument values. Therefore, it can also suffer from the type error problem described above. LRPL avoids this problem by using the Relation NER. LRPL is similar to SnowBall(Agichtein, 2005), which employs a generic NER, and reported that most errors come from NER errors. Because our evaluation showed that Relation NER works better than generic NER, a combination of Relation NER and SnowBall can make a better result in other settings. 3 (Collins and Singer, 1999) and (Jones, 2005) describe self-training and co-training methods for Named Entity Classification. However, the problem of NEC task, where the boundary of entities are given by NP chunker or parser, is different from NE tagging task. Because the boundary of an entity is often different from a NP boundary, the technique can not be used for our purpose; “Microsoft CEO Steve Ballmer” is tagged as a single noun phrase.
(8)
6 Conclusion
(9)
This paper describes two bootstrapping strategies, SPL, which learns simple string patterns, and LRPL, which trains Relation NER and uses it with less restrictive patterns. Evaluations showed both
where, is a set of true entities that have an argument type of a target relation. is a set of entities extracted as an argument. Because Relation NER is trained for argument types (such as ‘Mayor’), and the generic NER is trained for generic types (such as person), this cal-
2
Although can be defined in the same way, we did not use it, because of our purpose and much effort needed for complete annotation for generic types. 3 Of course, further study needed for investigating whether Relation NER works with a smaller number of seeds.
62
Table 3: The argument precision and recall is the average over all arguments for CeoOf, and MayorOf relations. The Location is for MayorOf, Organization is for CeoOf, and person is the average of both relations. Argument Location Organization Person Recall Precision F Precision Precision Precision R-NER 0.650 0.912 0.758 0.922 0.906 0.955 G-NER 0.392 0.663 0.492 0.682 0.790 0.809 G-NER+dic 0.577 0.643 0.606 0.676 0.705 0.842
methods enhance the recall of the baseline system for almost all the relations. For some relations, SPL and LRPL have comparable recall and precision. For InventorOf, where the invention is not a named entity, SPL performed better, because its patterns are based on noun phrases rather than named entities. LRPL works better than SPL for MayorOf relation by avoiding several errors caused by the tuples that co-occur with a short strict context, but have a wrong type entity value. Evaluations also showed that Relation NER works better than the generic NER trained on MUC7 corpus with additional dictionaries.
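As an illustration (not part of the original paper), the set-based measures of Section 4.2 can be computed directly from the entity sets; the Python sketch below assumes the symbols T_arg, T_gen and E introduced above, and the example entities are purely hypothetical.

def argument_scores(t_arg, t_gen, extracted):
    """Sketch of the evaluation measures in Section 4.2.
    t_arg     -- true entities with an argument type of the target relation
    t_gen     -- true entities with the corresponding generic type
    extracted -- entities extracted as an argument
    """
    hits = t_arg & extracted
    arg_precision = len(hits) / len(extracted) if extracted else 0.0   # Eq. (8)
    arg_recall = len(hits) / len(t_arg) if t_arg else 0.0              # Eq. (9)
    # More lenient precision against generic-type ground truth, Eq. (10);
    # the corresponding recall is not used (footnote 2).
    gen_precision = len(t_gen & extracted) / len(extracted) if extracted else 0.0
    return arg_precision, arg_recall, gen_precision

# Hypothetical example for the MayorOf relation:
mayors = {"Greg Nickels", "Antonio Villaraigosa"}        # argument type: Mayor
persons = mayors | {"Steve Ballmer"}                     # generic type: Person
extracted = {"Greg Nickels", "Steve Ballmer"}
print(argument_scores(mayors, persons, extracted))       # (0.5, 0.5, 1.0)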
Acknowledgements

This work was done while the first author was a Visiting Scholar at the University of Washington. The work was carried out at the University's Turing Center and was supported in part by NSF grant IIS-0312988, DARPA contract NBCHD030010, ONR grant N00014-02-1-0324, and a gift from Google. We would like to thank Dr. Eugene Agichtein for informing us of the technical details of SnowBall, and Prof. Ronen Feldman for a helpful discussion.
References

Eugene Agichtein. 2005. Extracting Relations From Large Text Collections. Ph.D. thesis, Columbia University.
Sergey Brin. 1998. Extracting patterns and relations from the World Wide Web. In WebDB Workshop at EDBT'98, pages 172–183, Valencia, Spain.
Nancy Chinchor. 1997. MUC-7 named entity task definition, version 3.5.
William W. Cohen. 2004. Minorthird: Methods for identifying names and ontological relations in text using heuristics for inducing regularities from data.
Michael Collins and Yoram Singer. 1999. Unsupervised models for named entity classification. In EMNLP 99.
Doug Downey, Oren Etzioni, Stephen Soderland, and Daniel S. Weld. 2004. Learning text patterns for web information extraction and assessment. In AAAI 2004 Workshop on ATEM.
Oren Etzioni, Michael Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates. 2005. Unsupervised named-entity extraction from the web: an experimental study. Artif. Intell., 165(1):91–134.
Ronen Feldman, Yonatan Aumann, Michal Finkelstein-Landau, Eyal Hurvitz, Yizhar Regev, and Ariel Yaroshevich. 2002. A comparative study of information extraction strategies. In CICLing, pages 349–359.
Rosie Jones. 2005. Learning to Extract Entities from Labeled and Unlabeled Texts. Ph.D. thesis, CMU-LTI-05-191.
John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML '01, pages 282–289.
Andrew McCallum and Wei Li. 2003. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In CoNLL-2003.
Einat Minkov, Richard C. Wang, and William W. Cohen. 2005. Extracting personal names from email: Applying named entity recognition to informal text. In EMNLP/HLT-2005.
D. Ravichandran and D. Hovy. 2002. Learning surface text patterns for a question answering system. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 41–47, Philadelphia, Pennsylvania.
Binyamin Rosenfeld, Moshe Fresko, and Ronen Feldman. 2005. A systematic comparison of feature-rich probabilistic classifiers for NER tasks. In PKDD, pages 217–227.
Shubin Zhao and Ralph Grishman. 2005. Extracting relations with integrated information using kernel methods. In ACL'05, pages 419–426, June.
Active Annotation
Andreas Vlachos William Gates Building Computer Laboratory University of Cambridge
[email protected]
Abstract
This paper introduces a semi-supervised learning framework for creating training material, namely active annotation. The main intuition is that an unsupervised method is used to annotate the data imperfectly and the errors made are then detected automatically and corrected by a human annotator. We applied active annotation to named entity recognition in the biomedical domain and obtained encouraging results. The main advantages over the popular active learning framework are that no seed annotated data is needed and that the reusability of the data is maintained. In addition to the framework, an efficient uncertainty estimation method for Hidden Markov Models is presented.

1 Introduction

Training material is always an issue when applying machine learning to information extraction tasks. It is generally accepted that increasing the amount of training data improves performance. However, training material comes at a cost, since it requires annotation. As a consequence, when adapting existing methods and techniques to a new domain, researchers and users are faced with an absence of annotated material that could be used for training. A good example is the biomedical domain, which has attracted the attention of the NLP community relatively recently (Kim et al., 2004). Even though there are plenty of biomedical texts, very little annotated material exists, the GENIA corpus (Kim et al., 2003) being one example. A very popular and well-investigated framework for coping with the lack of training material is active learning (Cohn et al., 1995; Seung et al., 1992). It has been applied to various NLP/IE tasks, including named entity recognition (Shen et al., 2004) and parse selection (Baldridge and Osborne, 2004), with rather impressive reductions in the amount of annotated training data required. However, some criticism of active learning has been expressed recently, concerning the reusability of the data (Baldridge and Osborne, 2004). This paper presents a framework for dealing with the lack of training data for NLP tasks. The intuition behind it is that annotated training data is produced by applying an (imperfect) unsupervised method, and the errors inserted in the annotation are then detected automatically and re-annotated by a human annotator. The main difference compared to active learning is that instead of selecting unlabeled instances for annotation, possibly erroneous instances are selected for checking and, if they are indeed erroneous, correction. We will refer to this framework as “active annotation” in the rest of the paper. The structure of this paper is as follows. In Section 2 we describe the software and the dataset used. Section 3 explores the effect of errors in the training data and motivates the active annotation framework.
In Section 4 we describe the framework in detail, while Section 5 presents a method for estimating uncertainty in HMMs. Section 6 presents results from applying active annotation. Section 7 compares the proposed framework to active learning, and Section 8 attempts an analysis of its performance. Finally, Section 9 suggests some future work.
2 Experimental setup

The data used in the experiments that follow are taken from the BioNLP 2004 named entity recognition shared task (Kim et al., 2004). The text passages have been annotated with five classes of entities: “DNA”, “RNA”, “protein”, “cell type” and “cell line”. In our experiments, following the example of Dingare et al. (2004), we simplified the annotation to one entity class, namely “gene”, which includes the DNA, RNA and protein classes. In order to evaluate performance on the task, we used the evaluation script supplied with the data, which computes the F-score (F1 = 2 * Precision * Recall / (Precision + Recall)) for each entity class. It must be noted that all tokens of an entity must be recognized correctly in order to count as a correct prediction; a partially recognized entity counts both as a precision and as a recall error. In all the experiments that follow, the official split of the data into training and test sets was maintained. The named entity recognition system used in our experiments is the open source NLP toolkit Lingpipe.¹ The named entity recognition module is an HMM model using Witten-Bell smoothing. Trained on the data mentioned above, it achieved a 70.06% F-score.

¹ http://www.alias-i.com/lingpipe/

3 Effect of errors

Noise in the training data is a common issue when training machine learning systems for NLP tasks. It can have a significant effect on performance, as pointed out by Dingare et al. (2004), where the performance of the same system on the same task (named entity recognition in the biomedical domain) was lower when using noisier material. The effect of noise in the data used to train machine learning algorithms for NLP tasks has been explored by Osborne (2002), using shallow parsing as the case study and a variety of learners. The impact of different types of noise was explored and learner-specific extensions were proposed to deal with it. In our experiments we explored the effect of noise when training the selected named entity recognition system, keeping in mind that we are going to use an unsupervised method to create the training material. The kind of noise we expect is mislabelled instances. In order to simulate the behavior of a hypothetical unsupervised method, we corrupted the training data artificially using the following models:

• LowRecall: Change tokens labelled as entities to non-entities. It must be noted that in this model, due to the presence of multi-token entities, precision is reduced too, albeit less than recall.

• LowRecall_WholeEntities: Change the labelling of whole entities to non-entities. In this model, precision is kept intact.

• LowPrecision: Change tokens labelled as non-entities to entities.

• Random: Entities and non-entities are changed randomly. This can be viewed alternatively as a random tagger which labels the data with some accuracy.

The level of noise inserted is adjusted by specifying the probability with which a candidate label is changed; a simple simulation of these corruption models is sketched below. In all the experiments in this paper, for a particular model and level of noise, the corruption of the dataset was repeated five times in order to produce more reliable results. In practice, the behavior of an unsupervised method is likely to be a mixture of the above models. However, given that the method would tag the data with a certain performance, we attempted through our experiments to identify which of these (extreme) behaviors would be less harmful.
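A minimal simulation of these corruption models, as referenced above (our illustration; the label names and the token-level encoding of entities are assumptions, not details from the paper):

import random

def corrupt(labels, model, p, entity="gene", outside="O"):
    """Flip gold token labels according to one of the noise models above.
    labels -- list of gold token labels, one per token
    model  -- 'lowrecall', 'lowprecision' or 'random'
    p      -- probability of changing a candidate label
    """
    noisy = []
    for gold in labels:
        if model == "lowrecall" and gold == entity and random.random() < p:
            noisy.append(outside)                      # entity token missed
        elif model == "lowprecision" and gold == outside and random.random() < p:
            noisy.append(entity)                       # spurious entity token
        elif model == "random" and random.random() < p:
            noisy.append(outside if gold == entity else entity)
        else:
            noisy.append(gold)
    return noisy

# The LowRecall_WholeEntities variant would instead flip complete entity
# spans at once, which leaves precision intact.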
In Figure 1, we present graphs showing the effect of noise inserted with the above models. The experimental procedure was to add noise to the training data according to a model, evaluate the performance of the hypothetical tagger that produced it, train Lingpipe on the noisy training data, and evaluate the performance of the latter on the test data. The process was repeated for various levels of noise. In the top graph, the F-score achieved by Lingpipe (F-ling) is plotted against the F-score of the hypothetical tagger (F-tag), while in the bottom graph the F-score achieved by Lingpipe (F-ling) is plotted against the number of erroneous classifications made by the hypothetical tagger.

Figure 1: F-score achieved by Lingpipe plotted against (a) the F-score of the hypothetical tagger in the top graph and (b) the number of errors made by the hypothetical tagger in the bottom graph.

A first observation is that limited noise does not affect the performance significantly, a phenomenon that can be attributed to the capacity of the machine learning method to deal with noise. From the point of view of correcting mistakes in the training data, this suggests that not all mistakes need to be corrected. Another observation is that while the performance for all the models follows similar curves when plotted against the F-score of the hypothetical tagger, the same does not hold when plotted against the number of errors. While this can be attributed to the unbalanced nature of the task (very few entity tokens compared to non-entities), it also suggests that the raw number of errors in the training data is not a good indicator of the performance obtained by training on it. It does, however, represent the effort required to obtain the maximum performance from the data by correcting it.

4 Active Annotation

In this section we present a detailed description of the active annotation framework. Initially, we have a pool of unlabeled data, D, whose instances are annotated using an unsupervised method u, which does not need training data. As expected, a significant number of errors is inserted during this process. A list L is created containing the tokens that have not been checked by a human annotator. Then, a supervised learner s is used to train a model M on this noisy training material. A query module q, which uses the model created by s, decides which instances of D will be selected to be checked for errors by a human annotator. The selected instances are removed from L so that q does not select them again in the future. The learner s is then retrained on this partially corrected training data and the sequence is repeated from the point of applying the query module q. The algorithm is shown in pseudocode in Figure 2.

Figure 2: Active annotation algorithm.
Data D, unsupervised tagger u, supervised learner s, query module q.
Initialization: Apply u to D. Create list of instances L.
Loop:
  Using s, train a model M on D.
  Using q and M, select a batch of instances B to be checked.
  Correct the instances of B in D.
  Remove the instances of B from L.
Repeat until: L is empty or the annotator stops.

Comparing active annotation with active learning, the similarities are apparent. Both frameworks have a loop in which a query module q, using a model produced by the learner, selects instances to be presented to a human annotator. The efficiency of active annotation can be measured in two ways, both of them used in evaluating active learning. The first is the reduction in the number of checked instances needed in order to achieve a certain level of performance. The second is the increase in performance for a fixed number of checked instances. Following the active learning
paradigm, a baseline for active annotation is random selection of the instances to be checked. There are, though, some notable differences. During initialization, an unsupervised method u is required to provide an initial tagging of the data D. This is an important restriction imposed by the lack of any annotated data. Even under this restriction, some options are available, especially for tasks with compiled resources. One option is to use an unsupervised learning algorithm, such as the one presented by Collins and Singer (1999), where a seed set of rules is used to bootstrap a rule-based named entity recognizer. A different approach could be the use of a dictionary-based tagger, as in Morgan et al. (2003). It must be noted that the unsupervised method used to provide the initial tagging does not need to generalize to arbitrary data (a common problem for such methods); it only needs to perform well on the data used during active annotation. Generalization to unseen data is an attribute we hope the supervised learning method s will have after training on the annotated material created with active annotation. The query module q is also different from the corresponding module in active learning. Instead of selecting unlabeled informative instances to be annotated and added to the training data, its purpose is to identify likely errors in the imperfectly labelled training data, so that they can be checked and corrected by the human annotator. In order to perform error detection, we chose to adapt the approach of Nakagawa and Matsumoto (2002), which resembles uncertainty-based sampling for active learning. According to their paradigm, likely errors in the training data are instances that are “hard” for the classifier and inconsistent with the rest of the data. In our case, we used the uncertainty of the classifier as the measure of the “hardness” of an instance. As an indication of inconsistency, we used the disagreement between the label assigned by the classifier and the current label of the instance. Intuitively, if the classifier disagrees with the label of an instance used in its training, this indicates that there have been other, similar instances in the training data that were labelled differently. Returning to the description of active annotation, the query module q ranks the instances in L first by their inconsistency and then by decreasing uncertainty of
the classifier. As a result, instances that are inconsistent with the rest of the data and hard for the classifier are selected first, then those that are inconsistent but easy for the classifier, then those that are consistent but hard for the classifier, and finally the consistent and easy ones. While this method of detecting errors resembles uncertainty sampling, other, quite different approaches could have been used instead. Sjöbergh and Knutsson (2005) inserted artificial errors and trained a classifier to recognize them. Dickinson and Meurers (2003) proposed methods based on n-grams occurring with different labellings in the corpus. Therefore, while it is reasonable to expect some correlation between the selections of active annotation and active learning (hard instances are likely to be erroneously annotated by the unsupervised tagger), the task of selecting hard instances is quite different from detecting errors. The use of the disagreement between taggers for selecting candidates for manual correction is reminiscent of corrected co-training (Pierce and Cardie, 2001). However, the main difference is that corrected co-training results in a manually annotated corpus, while active annotation allows automatically annotated instances to be kept.
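For concreteness, the loop of Figure 2 together with the ranking used by the query module q could be sketched as follows; this is our illustration, and the callables initial_tag, train, predict_with_uncertainty and human_correct stand in for the unsupervised tagger u, the supervised learner s and the human annotator rather than for any actual implementation.

def active_annotation(data, initial_tag, train, predict_with_uncertainty,
                      human_correct, batch_size=2000):
    """Sketch of the active annotation loop of Figure 2."""
    labels = [initial_tag(x) for x in data]        # imperfect initial annotation by u
    unchecked = set(range(len(data)))              # list L of unchecked instances
    while unchecked:
        model = train(data, labels)                # supervised learner s on noisy labels
        ranked = []
        for i in unchecked:
            predicted, uncertainty = predict_with_uncertainty(model, data[i])
            inconsistent = predicted != labels[i]  # disagreement with the current label
            ranked.append((inconsistent, uncertainty, i))
        # Inconsistent and hard first, then inconsistent and easy,
        # then consistent and hard, and finally consistent and easy.
        ranked.sort(key=lambda t: (t[0], t[1]), reverse=True)
        for _, _, i in ranked[:batch_size]:
            labels[i] = human_correct(data[i], labels[i])   # check, and correct if wrong
            unchecked.discard(i)
    return labels

The default batch size of 2000 tokens matches the setting used in the experiments of Section 6.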
5 HMM uncertainty estimation

In order to perform error detection as described in the previous section, we need to obtain uncertainty estimates over each token from the named entity recognition module of Lingpipe. For each token t and possible label l, Lingpipe estimates the following Hidden Markov Model from the training data:

P(t[n], l[n] | l[n−1], t[n−1], t[n−2]) (1)
When annotating a certain text passage, the tokens are fixed and the joint probability of Equation 1 is computed for each possible combination of labels. From Bayes' rule, we obtain:

P(l[n] | t[n], l[n−1], t[n−1], t[n−2]) = P(t[n], l[n] | l[n−1], t[n−1], t[n−2]) / P(t[n] | l[n−1], t[n−1], t[n−2]) (2)
For a fixed token sequence t[n], t[n−1], t[n−2] and previous label l[n−1], the second term of the left part of Equation 2 is a fixed value. Therefore, under these conditions, we can write:

P(l[n] | t[n], l[n−1], t[n−1], t[n−2]) ∝ P(t[n], l[n] | l[n−1], t[n−1], t[n−2]) (3)

From Equation 3 we obtain an approximation of the conditional distribution of the current label (l[n]) given the previous label (l[n−1]) for a fixed sequence of tokens. It must be stressed that the latter restriction is very important: the resulting distribution from Equation 3 cannot be compared across different token sequences. However, for the purpose of computing the uncertainty over a fixed token sequence it is a reasonable approximation. One way to estimate the uncertainty of the classifier is to calculate the conditional entropy of this distribution. The conditional entropy for a distribution P(X|Y) can be computed as:

H[X|Y] = − Σ_y P(Y = y) Σ_x P(X = x | Y = y) log P(X = x | Y = y) (4)

In our case, X is l[n] and Y is l[n−1]. Equation 4 can be interpreted as the weighted sum of the entropies of P(l[n] | l[n−1]) for each value of l[n−1], in our case the weighted sum of the entropies of the distribution of the current label for each possible previous label. The probabilities for each tag (needed for P(l[n−1])) are not calculated directly from the model. P(l[n]) corresponds to P(l[n] | t[n], t[n−1], t[n−2]), but since we are considering a fixed token sequence, we approximate its distribution using the conditional probability P(l[n] | t[n], l[n−1], t[n−1], t[n−2]), marginalizing over l[n−1]. Again, it must be noted that the above calculations are to be used for estimating the uncertainty over a single word. One property of the conditional entropy is that it estimates the uncertainty of the predictions for the current label given knowledge of the previous tag, which is important in our application because we need the uncertainty over each label independently of the rest of the sequence. This is confirmed by the theory, from which we know that for a conditional distribution of X given Y the following equation holds, H[X|Y] = H[X,Y] − H[Y], where H denotes the entropy.

A different way of obtaining uncertainty estimates from HMMs in the framework of active learning has been presented by Scheffer et al. (2001). There, the uncertainty is estimated by the margin between the two most likely predictions that would result in a different current label, explicitly:

M = max_{i,j} P(l[n] = i | l[n−1] = j) − max_{k,l, k≠i} P(l[n] = k | l[n−1] = l) (5)

Intuitively, the margin M is the difference between the two highest scored predictions that disagree. The lower the margin, the higher the uncertainty of the HMM on the token in question. A drawback of this method is that it does not take into account the distribution of the previous label: it is possible that the two highest scored predictions are obtained for two different previous labels, and a highly scored label may be obtained given a very improbable previous label. Finally, an alternative that we did not explore in this work is Field Confidence Estimation (Culotta and McCallum, 2004), which allows the estimation of confidence over sequences of tokens, instead of singleton tokens only. However, in this work confidence estimation over singleton tokens is sufficient.
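As an illustration, the conditional-entropy uncertainty of Equation 4 can be computed from the per-token joint scores of Equation 1. The sketch below assumes those scores are available as a nested dictionary and that an estimate of P(l[n−1]) is supplied; this layout is our simplification, not Lingpipe's interface.

import math

def token_uncertainty(joint, prev_probs):
    """Conditional entropy H[l[n] | l[n-1]] for a single token (Equation 4).
    joint[prev][cur] -- P(t[n], l[n]=cur | l[n-1]=prev, t[n-1], t[n-2])
                        for the fixed token sequence (Equation 1)
    prev_probs[prev] -- estimate of P(l[n-1] = prev)
    """
    entropy = 0.0
    for prev, weight in prev_probs.items():
        scores = joint[prev]
        z = sum(scores.values())            # Equation 3: renormalise per previous label
        for score in scores.values():
            p = score / z
            if p > 0.0:
                entropy -= weight * p * math.log(p)   # weighted sum of per-label entropies
    return entropy

# Hypothetical two-label example (gene vs. O) with two possible previous labels:
joint = {"O": {"O": 0.6, "gene": 0.2}, "gene": {"O": 0.1, "gene": 0.1}}
prev_probs = {"O": 0.8, "gene": 0.2}
print(token_uncertainty(joint, prev_probs))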
6 Experimental Results

In this section we present results from applying active annotation to biomedical named entity recognition. Using the noise models described in Section 3, we corrupted the training data and then, using Lingpipe as the supervised learner, applied the algorithm of Figure 2. The batch of tokens selected to be checked in each round was 2000. As a baseline for comparison we used random selection of tokens to be checked. The results for various noise models and levels are presented in the graphs of Figure 3. In each of these graphs, the performance of Lingpipe trained on the partially corrected material (F-ling) is plotted against the number of checked instances, under the label “entropy”. In all the experiments, active annotation significantly outperformed random selection, with the exception of 50% Random, where the high level of noise (the F-score of the hypothetical tagger that provided the initial data was 0.1) affected the initial
judgements of the query module on which instances should be checked. After having checked some portion of the dataset, though, active annotation started outperforming random selection. In the graphs for the 10% Random, 20% LowRecall_WholeEntities and 50% LowRecall noise models, the curves labelled “margin” show the performance obtained using the uncertainty estimation of Scheffer et al. (2001). Even though active annotation using this method performs better than random selection, active annotation using conditional entropy performs significantly better. These results provide evidence for the theoretical advantages of conditional entropy described earlier. We also ran experiments using pure uncertainty-based sampling (i.e. without checking the consistency of the labels) to select instances to be checked. The corresponding curves appear under the label “uncertainty” for the 20% LowRecall, 50% Random and 20% LowPrecision noise models; the uncertainty was estimated using the method described in Section 5. As expected, uncertainty-based sampling performed reasonably well, better than random selection but worse than using labelling consistency, except for the initial stage of 20% LowPrecision.

Figure 3: F-score achieved by Lingpipe plotted against the number of checked instances for various noise models and levels (panels: Random 10%, Random 50%, LowRecall 20%, LowRecall 50%, LowRecall_WholeEntities 20%, LowPrecision 20%).

7 Active Annotation versus Active Learning
In order to compare active annotation to active learning, we ran active learning experiments using the same dataset and software. The paradigm employed was uncertainty-based sampling, using the uncertainty estimation presented in Section 5. HMMs require annotated sequences of tokens, therefore annotating whole sentences seemed the natural choice, as in (Becker et al., 2005). While token-level selection could be used in combination with EM (as in Scheffer et al. (2001)), a corpus of individual tokens would be very difficult to reuse, since it would be only partially labelled. We employed the two standard options for selecting sentences: selecting the sentences with the highest average uncertainty over their tokens, or selecting the sentence containing the most uncertain token. As the cost metric we used the number of tokens, which allows more straightforward comparison with active annotation. In Figure 4 (left graph), each active learning experiment is started by selecting a random sentence as seed data, repeating the seed selection 5 times; the random selection baseline is repeated 5 times for each seed selection. As in (Becker et al., 2005), selecting the sentences with the highest average uncertainty (ave) performs better than selecting those with the most uncertain token (max). In the right graph, we compare the best active learning method with active annotation. As is apparent, the performance of active annotation is highly dependent on the performance of the unsupervised tagger used to provide the initial annotation of the data. In the graph, we include curves for two of the noise models reported in the previous section, LowRecall 20% and LowRecall 50%, which correspond to tagging performances of 0.66 / 0.69 / 0.67 and 0.33 / 0.43 / 0.37 respectively, in terms of Recall / Precision / F. We consider such tagging performances feasible with a dictionary-based tagger, since Morgan et al. (2003) report performance of
0.88 / 0.78 / 0.83 with such a method. These results demonstrate that active annotation, given a reasonable starting point, can achieve reductions in the annotation cost comparable to those of active learning. Furthermore, active annotation produces an actual corpus, albeit a noisy one. Active learning, as pointed out by Baldridge and Osborne (2004), while reducing the amount of training material needed, selects data that might not be useful for training a different learner. The active annotation framework, by contrast, is likely to preserve correct instances that might not be useful to the machine learning method used to create the corpus but may be beneficial to a different method. Furthermore, producing an actual corpus can be very important when adding new features to the model. In the case of biomedical NER, one could consider adding document-level features, such as whether a token has been seen as part of a gene name earlier in the document. With a corpus constructed using active learning this is not feasible, since it is unlikely that all the sentences of a document are selected for annotation. Also, if one intended to use the same corpus for a different task, such as anaphora resolution, the imperfectly annotated corpus constructed using active annotation can again be used more efficiently than the partially annotated one produced by active learning.

Figure 4: Left, comparison among various active learning methods (ave, max, random). Right, comparison of active learning and active annotation (AA-20%, AA-50%, AL-ave).

8 Selecting errors

In order to investigate further the behavior of active annotation, we evaluated the performance of the trained supervised method against the number of errors corrected by the human annotator. The aim of this experiment was to verify whether the improvement in performance compared to random selection is due to selecting “informative” errors to correct, or due to the efficiency of the error-detection technique.

In Figure 5, we present such graphs for the 10% Random noise model; similar results were obtained with the other noise models. As can be observed in the left graph, the errors corrected initially during random selection are far more informative than those corrected at the early stages of active annotation (labelled “entropy”). The explanation is that, with the error-detection method described in Section 4, the errors detected are those on which the supervised method s disagrees with the training material, which implies that even if such an instance is indeed an error, it did not affect s. Therefore, correcting such errors will not improve the performance significantly. Informative errors are those that s has learnt to reproduce with high certainty. However, such errors are hard to detect because similar attributes are usually exhibited by correctly labelled instances. This can be verified by the curves labelled “reverse” in the graphs of Figure 5, in which the ranking of the instances to be selected was reversed, so that instances where the supervised method agrees confidently with the training material are selected first. The fact that errors with high uncertainty are less informative than those with low uncertainty suggests that active annotation, while related to active learning, is sufficiently different. The right graph suggests that the error-detection performance during active annotation is much better than that of random selection. Therefore, the performance of active annotation could be improved by preserving the high error-detection performance while selecting more informative errors.

Figure 5: Left, F-score achieved by Lingpipe plotted against the number of corrected errors. Right, errors corrected plotted against the number of checked tokens (curves: random, entropy, reverse).
9 Future work

This paper described active annotation, a semi-supervised learning framework that reduces the effort needed to create training material, which is very important in adapting existing trainable methods to new domains. Future work should investigate the applicability of the framework to a variety of NLP/IE tasks and settings. We intend to apply this framework to NER for biomedical literature from the FlyBase project, for which no annotated datasets exist. While we have used the number of instances checked by a human annotator to measure the cost of annotation, this might not be representative of the actual cost: the task of checking and possibly correcting instances differs from annotating them from scratch. In this direction, experiments in realistic conditions with human annotators should be carried out. We also intend to explore the possibility of grouping similar mistakes detected in a round of active annotation, so that the human annotator can correct them with less effort. Finally, alternative error-detection methods should be investigated.

Acknowledgments

The author was funded by BBSRC, grant number 38688. I would like to thank Ted Briscoe and Bob Carpenter for their feedback and comments.

References

J. Baldridge and M. Osborne. 2004. Active learning and the total cost of annotation. In Proceedings of EMNLP 2004, Barcelona, Spain.
M. Becker, B. Hachey, B. Alex, and C. Grover. 2005. Optimising selective sampling for bootstrapping named entity recognition. In Proceedings of the Workshop on Learning with Multiple Views, ICML.
D. A. Cohn, Z. Ghahramani, and M. I. Jordan. 1995. Active learning with statistical models. In Advances in Neural Information Processing Systems, volume 7.
M. Collins and Y. Singer. 1999. Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on EMNLP and VLC.
Aron Culotta and Andrew McCallum. 2004. Confidence estimation for information extraction. In Proceedings of HLT 2004, Boston, MA.
M. Dickinson and W. D. Meurers. 2003. Detecting errors in part-of-speech annotation. In Proceedings of EACL 2003, pages 107–114, Budapest, Hungary.
S. Dingare, J. Finkel, M. Nissim, C. Manning, and C. Grover. 2004. A system for identifying named entities in biomedical text: How results from two evaluations reflect on both the system and the evaluations. In The 2004 BioLink meeting at ISMB.
J. D. Kim, T. Ohta, Y. Tateisi, and J. Tsujii. 2003. GENIA corpus - a semantically annotated corpus for bio-textmining. In ISMB (Supplement of Bioinformatics).
J. Kim, T. Ohta, Y. Tsuruoka, Y. Tateisi, and N. Collier, editors. 2004. Proceedings of JNLPBA, Geneva.
A. Morgan, L. Hirschman, A. Yeh, and M. Colosimo. 2003. Gene name extraction using FlyBase resources. In Proceedings of the ACL 2003 Workshop on NLP in Biomedicine, pages 1–8.
T. Nakagawa and Y. Matsumoto. 2002. Detecting errors in corpora using support vector machines. In Proceedings of COLING 2002.
M. Osborne. 2002. Shallow parsing using noisy and non-stationary training material. J. Mach. Learn. Res., 2:695–719.
D. Pierce and C. Cardie. 2001. Limitations of co-training for natural language learning from large datasets. In Proceedings of EMNLP 2001, pages 1–9.
T. Scheffer, C. Decomain, and S. Wrobel. 2001. Active hidden Markov models for information extraction. Lecture Notes in Computer Science, 2189:309+.
H. S. Seung, M. Opper, and H. Sompolinsky. 1992. Query by committee. In Proceedings of COLT 1992.
D. Shen, J. Zhang, J. Su, G. Zhou, and C. L. Tan. 2004. Multi-criteria-based active learning for named entity recognition. In Proceedings of ACL 2004, Barcelona.
J. Sjöbergh and O. Knutsson. 2005. Faking errors to avoid making errors: Machine learning for error detection in writing. In Proceedings of RANLP 2005.
Author Index Ageno, Alicia. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .48 Ciravegna, Fabio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 Etzioni, Oren . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 Geleijnse, Gijs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 Giuliano, Claudio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 Ireson, Neil . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 Iria, Jose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 Korst, Jan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 Kushmerick, Nicholas . . . . . . . . . . . . . . . . . . . . . . . . . . 25 Lavelli, Alberto . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 McLernon, Brian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 Porcelijn, Tijn . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 Romano, Lorenza . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 Siniakov, Peter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 Soderland, Stephen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 Sporleder, Caroline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 Surdeanu, Mihai . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 Tomita, Junji . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 Turmo, Jordi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 van den Bosch, Antal . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 van Erp, Marieke . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 Vlachos, Andreas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64