Sources of Success for Information Extraction Methods
David Kauchak, Dept. of Computer Science, UC San Diego, La Jolla, CA 92093-0114, [email protected]
Joseph Smarr, Symbolic Systems Program, Stanford University, Stanford, CA 94305, [email protected]
Charles Elkan, Dept. of Computer Science, UC San Diego, La Jolla, CA 92093-0114, [email protected]

Abstract

In this paper, we examine an important recent rule-based information extraction (IE) technique named Boosted Wrapper Induction (BWI), by conducting experiments on a wider variety of tasks than previously studied, including tasks using several collections of natural text documents. We provide a systematic analysis of how each algorithmic component of BWI, in particular boosting, contributes to its success. We show that the benefit of boosting arises from the ability to reweight examples to learn specific rules (resulting in high precision) combined with the ability to continue learning rules after all positive examples have been covered (resulting in high recall). As a quantitative indicator of the regularity of an extraction task, we propose a new measure that we call the SWI ratio. We show that this measure is a good predictor of IE success. Based on these results, we analyze the strengths and limitations of current rule-based IE methods in general. Specifically, we explain limitations in the information made available to these methods, and in the representations they use. We also discuss how confidence values returned during extraction are not true probabilities. In this analysis, we investigate the benefits of including grammatical and semantic information for natural text documents, as well as parse tree and attribute-value information for XML and HTML documents. We show experimentally that incorporating even limited grammatical information can improve the regularity of, and hence performance on, natural text extraction tasks. We conclude with proposals for enriching the representational power of rule-based IE methods to exploit these and other types of regularities.

1 Introduction

Computerized text is abundant. Thousands of new web pages appear every day. News, magazine, and journal articles are constantly being created online. Email has become one of the most popular ways of communicating. All these trends result in an enormous amount of text available in digital form, but these repositories of text are mostly untapped resources of information, and identifying specific desired information in them is a difficult task. Information extraction (IE) is the task of extracting relevant fragments of text from larger documents, to allow the fragments to be processed further in some automated way, for example to answer a user query. Examples of IE tasks include identifying the speaker featured in a talk announcement, finding the proteins mentioned in a biomedical journal article, and extracting the names of credit cards accepted by a restaurant from an online review.

A variety of systems and techniques have been developed to address the information extraction problem. Successful techniques include statistical methods such as n-gram models, hidden Markov models, probabilistic context-free grammars (Califf, 1998), and rule-based methods that employ some form of machine learning. Rule-based methods have been especially popular in recent years. They use a variety of different approaches, but all recognize a number of common key facts. First, creating rules by hand is difficult and time-consuming (Riloff, 1996). For this reason, most systems generate their rules automatically given labeled or partially labeled data. Second, generating a single, general rule for extracting all instances of a given field is often impossible (Muslea et al., 1999). Therefore, most systems attempt to learn a number of rules that together cover the training examples for a field, and then combine these rules in some way.

Some recent techniques for generating rules in the realm of text extraction are called "wrapper induction" methods. These techniques have proved to be fairly successful for IE tasks in their intended domains, which are collections of highly structured documents such as web pages generated from a template script (Muslea et al., 1999; Kushmerick, 2000). However, wrapper induction methods do not extend well to natural language documents because of the specificity of the induced rules.

Recent research on improving the accuracy of weak classifiers using boosting (Schapire, 1999) has led to methods for learning classifiers that focus on examples that previous classifiers have found difficult. The AdaBoost algorithm works by continually reweighting the training examples, and using a base learner to learn a new classifier repeatedly, stopping after a fixed number of iterations. The classifiers learned are then combined by weighted voting, where the weight of each classifier depends on its accuracy on its weighted training set. Boosting has been shown theoretically to converge to an optimal combined classifier, and performs well in practice.

Boosted Wrapper Induction (BWI) is an IE technique that uses the AdaBoost algorithm to generate a general extraction procedure that combines a set of specific wrappers (Freitag et al., 2000). BWI has been shown to perform well on a variety of tasks with moderately structured and highly structured documents, but exactly how boosting contributes to this success has not been investigated. Furthermore, BWI has been proposed as an IE method for unstructured natural language documents, but little has been done to examine its performance in this challenging direction.

In this paper, we investigate the benefit of boosting in BWI and also the performance of BWI on natural text documents. We compare the use of boosting in BWI with the most common approach to combining individual extraction rules: sequential covering. With sequential covering, extraction rules are ordered in some way according to performance on the training set. The best rule is chosen, and all examples that this rule correctly classifies are removed from the training set. A new best rule is learned, and the process is repeated until the entire training set has been covered. For a more detailed description of sequential covering see (Cardie, 2001). Sequential covering has been used in a number of IE systems (Califf, 1998; Clark et al., 1989; Michalski, 1980; Muslea et al., 1999; Quinlan, 1990) because it is simple to implement, tends to generate understandable and intuitive rules, and has achieved good performance.

This paper is divided into a number of sections. In Section 2, we briefly describe BWI and related techniques, providing a formalization of the problem and a review of relevant terminology. In Section 3, we present experimental results comparing these different rule-based IE methods on a wide variety of document collections. In Section 4, we analyze the results of the experiments in detail, with specific emphasis on how boosting affects the performance of BWI, and how performance relates to the regularity of the extraction task. In Section 5, we introduce the SWI ratio as an objective measure of task regularity, and further examine the connection between regularity and performance. In Section 6, we transition to a discussion of rule-based IE methods in general. In Section 7, we point out important limitations of current methods, focusing on what information is used and how it is represented, how results are scored and selected, and the efficiency of training and testing. Finally, in Section 8 we suggest ways to address these limitations, and we provide experimental results that show that including grammatical information in the extraction process can increase exploitable regularity and therefore performance.

2 Overview of algorithms and terminology

In this section we present a brief review of the BWI approach to information extraction, including a precise problem description, a summary of the algorithms used, and important terminology. In particular we present a simplified variant of the BWI algorithm, called SWI, which will be used to analyze BWI and related algorithms.

2.1 Information extraction as a classification task

Most of the material in this section is based on (Freitag et al., 2000). We present an abridged version here for convenience, starting with a review of relevant vocabulary and assumptions. A document is a sequence of tokens. A token is one of three things: an unbroken string of alphanumeric characters, a punctuation character, or a carriage return. The problem of information extraction is to extract subsequences of tokens matching a certain pattern from test documents. To learn the pattern, we reformulate the IE problem as a classification problem. Instead of thinking of the objective as a substring of tokens, we look at the problem as a function over boundaries. A boundary is the space between two tokens. Notice that a boundary is not something that is actually in the text (such as white space), but a notion that arises from the parsing of the text into tokens. We want to learn two classifiers, that is two functions from boundaries to the binary set {0,1}: one function that is 1 iff the boundary is the beginning of a field to be extracted, and one function that is 1 iff the boundary is the end of a field. This transformation from a segmentation problem to a boundary classification problem is also used, for example, by Shinnou (2001) to find word boundaries in Japanese text, since written Japanese does not use spaces.

Each of the two classifiers is represented as a set of boundary detectors, also called just detectors. A detector is a pair of token sequences 〈p, s〉. A detector matches a boundary iff the prefix string of tokens, p, matches the tokens before the boundary and the suffix string of tokens, s, matches the tokens after the boundary. For example, the detector 〈Who:, Dr.〉 matches "Who: Dr. John Smith" at the boundary between the ':' and the 'Dr'. Given beginning and ending classifiers, extraction is performed by identifying the beginning and end of a field and taking the tokens between the two points.

BWI separately learns two sets of boundary detectors for recognizing the alternative beginnings and endings of a target field to extract. At each iteration, BWI learns a new detector, and then uses AdaBoost to reweight all the examples in order to focus on portions of the training set that the current rules do not match well. Once this process has been repeated for a fixed number of iterations, BWI returns the two sets of learned detectors, called fore detectors, F, and aft detectors, A, as well as a histogram H over the length (number of tokens) of the target field.

2.2 Sequential Covering Wrapper Induction (SWI)

SWI uses the same framework as BWI with some modifications

to the choice of individual detectors. The basic algorithm for SWI can be seen in Figure 1. SWI generates F and A independently by calling the GenerateDetectors procedure, which is a sequential covering rule learner. At each iteration through the while loop, a new boundary detector is learned, consisting of a prefix part and a suffix part. As with BWI, the best detector is generated by starting with empty prefix and suffix sequences. Then, the best token is found to add to either the prefix or suffix by exhaustively searching all possible extensions of length L. This process is repeated, adding one token at a time, until the detector being generated no longer improves, as measured by a scoring function. Once the best boundary detector has been found, it is added to the set of detectors to be returned, either F or A. Before continuing, all the positive examples covered by the new rule are removed, leaving only uncovered examples. As soon as the set of detectors covers all positive training examples, this set is returned.

SWI: training sets S and E → two lists of detectors and a length histogram
    F = GenerateDetectors(S)
    A = GenerateDetectors(E)
    H = field length histogram from S and E
    return 〈F, A, H〉

GenerateDetectors: training set Y (S or E) → list of detectors
    prefix pattern p = []
    suffix pattern s = []
    list of detectors d
    while (positive examples are still uncovered)
        〈p, s〉 ← FindBestDetector()
        add 〈p, s〉 to d
        remove positive examples covered by 〈p, s〉 from Y
        p, s = []
    return d

Figure 1: The SWI and GenerateDetectors algorithms.
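To make the pseudocode concrete, the following is a minimal Python sketch of sequential covering over boundary detectors. The detector-matching test follows the definition in Section 2.1; the explicit candidate pool and the greedy scoring are simplified stand-ins for FindBestDetector, which in SWI actually grows the prefix and suffix by searching token extensions of length L.

```python
# Minimal sketch of SWI-style sequential covering over boundary detectors.
# A detector is a (prefix, suffix) pair of token sequences; it matches
# boundary i (the gap before tokens[i]) iff the prefix matches the tokens
# ending at i and the suffix matches the tokens starting at i.

def matches(detector, tokens, i):
    prefix, suffix = detector
    return (tokens[max(0, i - len(prefix)):i] == list(prefix)
            and tokens[i:i + len(suffix)] == list(suffix))

def covered(detector, tokens, boundaries):
    return {i for i in boundaries if matches(detector, tokens, i)}

def generate_detectors(candidates, tokens, positives, negatives):
    """Greedy sequential covering: repeatedly add the detector covering the
    most uncovered positive boundaries and no negative ones (Greedy-SWI)."""
    detectors, uncovered = [], set(positives)
    while uncovered:
        def score(d):
            if covered(d, tokens, negatives):   # any negative match => score 0
                return 0
            return len(covered(d, tokens, uncovered))
        best = max(candidates, key=score)
        if score(best) == 0:                    # no candidate helps; give up
            break
        detectors.append(best)
        uncovered -= covered(best, tokens, uncovered)
    return detectors

tokens = ["Who", ":", "Dr", ".", "John", "Smith"]
# Boundary 2 is the gap between ':' and 'Dr', the start of the speaker field.
dets = generate_detectors([(("Who", ":"), ("Dr", "."))], tokens, {2}, {1, 3})
print(dets)   # [(('Who', ':'), ('Dr', '.'))]
```

The example reproduces the 〈Who:, Dr.〉 detector from Section 2.1: it covers the single positive boundary and no negative ones, so one iteration suffices.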

To clarify, there is a distinction between the actual text, which is a sequence of tokens, and the two functions to be learned, which indicate whether a boundary is the start or end of a field to be extracted. When the SWI algorithm "removes" the boundaries that are matched by a detector, these boundaries are relabeled in the training documents, but the actual sequence of tokens is not changed.

There are two key differences between BWI and SWI. First, as mentioned above, with sequential covering, covered positive training examples are removed. In BWI, examples are reweighted but never relabeled. Relabeling a positive example is equivalent to setting its weight to zero. Second, the scoring functions for the extensions and detectors are slightly different. We investigate two different versions of SWI. The basic SWI algorithm, Greedy-SWI, uses a simple greedy method: the score of an extension or detector is the number of positive examples that it covers if it covers no negative examples, and zero otherwise. So, if a detector misclassifies even one example (i.e. covers a negative example), it is given a score of zero. A second version of SWI, called Root-SWI, uses the same scoring function as BWI. Given W+, the sum of the weights of the positive examples covered by the extension or detector, and W−, the sum of the weights of the negative examples covered, the score is calculated as follows to minimize training error (Cohen et al., 1999):

    score = √W+ − √W−

Notice that for Root-SWI the two sums are the numbers of examples covered, because examples are all given the same weight. The final difference between BWI and SWI is that BWI terminates after a predetermined number of boosting iterations. In contrast, SWI terminates as soon as all of the positive examples have been covered. This turns out to be a particularly important difference between the two methods.
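The contrast between the two scoring functions is easy to state in code. A minimal sketch, assuming the Root-SWI criterion is the √W+ − √W− score of (Cohen et al., 1999) as described above; both functions take the weights of the positive and negative examples a candidate rule covers:

```python
from math import sqrt

def greedy_score(w_pos, w_neg):
    """Greedy-SWI: total covered positive weight, but zero if the rule
    covers any negative example at all."""
    return sum(w_pos) if not w_neg else 0.0

def root_score(w_pos, w_neg):
    """Root-SWI / BWI scoring (Cohen et al., 1999): sqrt(W+) - sqrt(W-),
    where W+ and W- are the total covered positive and negative weight."""
    return sqrt(sum(w_pos)) - sqrt(sum(w_neg))

# A rule covering 9 positives and 1 negative is worthless to Greedy-SWI
# but attractive to Root-SWI; this is why Root-SWI trades some precision
# for recall, as seen in the experiments of Section 3.
print(greedy_score([1] * 9, [1]))   # 0.0
print(root_score([1] * 9, [1]))     # 2.0  (i.e. 3.0 - 1.0)
```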

3 Experiments and results

In this section, we describe the data sets used for our set of experiments, our setup and methods, and the numerical results of our tests. We provide an analysis of the results in the next section.

3.1 Document collections and IE tasks

In order to compare BWI, Greedy-SWI and Root-SWI, we examine the three algorithms on 15 different information extraction tasks using 8 different document collections. Many of these tasks are standard and have been used in testing a variety of IE techniques. The document collections can be categorized into three groups: natural (unstructured) text, partially structured text, and highly structured text. The breadth of these collections allows for a comparison of the algorithms on a wider spectrum of tasks than has previously been studied. The natural text documents are biomedical journal abstracts and contain little obvious formatting information. The partially structured documents resemble natural text, but also contain document-level structure and standardized formatting. These documents consist primarily of natural language, but contain regularities in formatting and annotations. For example, it is common for key fields to be preceded by an identifying label (e.g. "Speaker: Dr. X"), though this is not always the case. The highly structured documents are web pages that have been automatically generated. They contain reliable regularities, including location of content, punctuation and style, and non-natural language formatting such as HTML tags. The partially structured and highly structured document collections were obtained primarily from (RISE, 1998), while the natural text documents were obtained from the National Library of Medicine MEDLINE abstract database (National Library of Medicine, 2001).

We use two collections of natural language documents, both consisting of individual sentences from MEDLINE abstracts. The first collection consists of 555 sentences tagged with the names of proteins and their subcellular locations. These sentences were automatically selected and tagged from larger documents by searching for known proteins and locations contained in the Yeast-Protein-Database (YPD). Because of the automated labeling procedure, the labels are incomplete and sometimes inaccurate. Ray and Craven estimate an error rate in labeling between 10% and 15% (Ray et al., 2001). The second collection consists of 891 sentences, containing genes and their associated diseases, as identified using the Online-Mendelian-Inheritance-In-Man (OMIM) database, and is subject to the same noisy labeling. Both these collections have been used in other research, including (Ray et al., 2001) and (Eliassi-Rad et al., 2001).

It should be noted that the original OMIM and YPD document collections contain many sentences labeled with no training examples of fields to extract, because the collections were created for extracting field pairs only. Therefore, only in cases where both fields in a pair were found was any training label added. Because BWI is designed to extract only single fields, we remove these sentences. Many of them in fact contain examples of desired fields, so when unlabeled, they present incorrect training information. Moreover, the unlabeled documents originally outnumber the labeled documents by almost an order of magnitude. Our restricted sentence collections still pose challenging information extraction tasks, but performance on these collections cannot be compared to performance reported for extracting field pairs in the original collections.

For the partially structured extraction tasks, we use three different document collections. The first two are taken from RISE, and consist of 486 speaker announcements (SA) and 298 Usenet job announcements (Jobs). In the speaker announcement collection, four different fields are extracted: the speaker's name (SA-speaker), the location of the seminar (SA-location), the starting time of the seminar (SA-stime) and the ending time of the seminar (SA-etime). In the job announcement collection, three fields are extracted: the message identifier code (Jobs-id), the name of the company (Jobs-company) and the title of the available position (Jobs-title).

We created the third collection of partially structured documents from 630 MEDLINE article citations. These documents contain full MEDLINE bibliographic data for articles on hip arthroplasty surgery, i.e. author, journal, and publication information, MeSH headings (a standard controlled vocabulary of medical subject keywords), and other related information, including the text of the paper's abstract about 75% of the time. The task is to identify the beginning and end of the abstract, if the citation includes one (AbstractText). The abstracts are generally large bodies of text, but without any consistent beginning or ending marker, and the information that precedes or follows the abstract text varies from citation to citation.

Finally, the highly structured tasks are taken from three different

abstracttextvariesfrom citationtcoitation. Finally, the highly structured tasks are taken from three different

document

collections, alsoiR n ISE: 20web pagescontaining restaurantdes AngelesTimes(

criptions from the Los

LATimes), where the taskito sextract the list ofacceptedcredit

cards

(LATimes-cc);91 web pagescontainingrestaurant reviewsfrom the ZagatGuide AngelesRestaurants(

toLos

Zagat), where the taskito sextractthe restaurant’saddress (

Zagat-

addr);and 1w 0 ebpagescontainingresponses from anautomated stockquote se (QS), where the taskito sextract the date othe f response (

rvice QS-date). Although these

collections contain small a number of documents,the individualdocuments comprise several separate entries,i.e. there are oftenmultipl

typically setockquote responsesor

restaurantreviewsper webpage. 3.2 Experimentaldesign The performance othe f differentrule-based IEmethods onthe tas measuredusingthree different standard metrics:precision, r mean oprecision f andrecall. Each resultpresentedithe s ave

ks describedabove is ecall, and F1, the harmonic rage o10 frandom

75%/25% train/testsplits. For allalgorithms, laookahead value, For the naturalandpartially structuredtexts(

L,of 3tokenswasused.

YPD, OMIM, SA, Jobs,and AbstractText),

the default BWIwildcard setwas used, andfor the web pages (

LATimes, Zagat,and QS),

the defaultlexicalwildcards were alsoused. A graphical sum

maryoour f results canbe

seeniF n igure For 2.complete numericalresults, see Table 1a

the t end othis f paper.

To understandthe differences between BWIand SWI, we performedfour tests. First, we ranBWIwith the same number of rounds ofboos Kushmerick (2000): for the

sets of ting used bF y reitagand

YPD, OMIM, SA, Jobs,and

collections, 500rounds ofboosting were used; for the

AbstractText document LATimes, Zagat,and QS document

collections, 50rounds were used. Next, we ranGreedy-SWIandRoottasksuntilallofthe exampleswere covered. The actualnumber

SWIon the same ofiterations for which

these algorithmsranvaried acrossthe different IE tasks. The

se numbersare presented in

9

[Figure 2 appears here as three bar charts (Highly structured, Partially structured, Natural text) showing precision, recall, and F1 for BWI, Fixed-BWI, Root-SWI, and Greedy-SWI.]

Figure 2: Performance of the four IE methods examined in this paper, separated by overall regularity of document collection, indicated above each chart, averaged over all tasks within the given collections.

Table 2 at the end of the paper. Finally, we ran BWI for the same number of iterations that it took for Root-SWI to complete each task: 323 rounds for YPD-protein, 1 round for QS-date, and so on; this method is called Fixed-BWI.

3.3 Experimental results

To investigate whether BWI outperforms the set-covering approach of SWI, we compare Greedy-SWI and BWI. BWI appears to perform better, since the F1 values for BWI are higher for all the highly structured and natural text tasks studied. Greedy-SWI has slightly higher F1 performance on two of the partially structured tasks, but considerably lower performance on the other three. Generally, Greedy-SWI has slightly higher precision, while BWI tends to have considerably higher recall.

Given these results, the logical next step is determining what part of the success of BWI comes from the difference in scoring functions used by the two algorithms, and what part comes from how many rules are learned. To explore the first of these differences, we compare Greedy-SWI and Root-SWI, which differ only by their scoring function. Greedy-SWI tends to have higher precision while Root-SWI tends to have higher recall. However, they have similar F1 performance, a result that is illustrated further by comparing BWI to Root-SWI: BWI still has higher F1 performance than Root-SWI in all but three partially structured tasks, despite the fact that they use an identical scoring function.

There are only two differences between BWI and Root-SWI. First, Root-SWI removes all positive examples covered after each iteration, whereas BWI merely reweights them. Second, Root-SWI terminates as soon as all positive examples have been covered, whereas BWI continues to boost for a fixed number of iterations. Table 2 shows that Root-SWI always terminates after many fewer iterations than were set for BWI. Freitag et al. (2000) examined the performance of BWI as the number of rounds of boosting increases, and found improvement in many cases even after more than 500 iterations. To investigate this issue further, we run BWI for the same number of iterations as it takes Root-SWI to complete each task; we call this method Fixed-BWI. Recall that the number of rounds Fixed-BWI runs for depends on the extraction task. Here the results vary qualitatively with the level of structure in the domain. In the highly structured tasks, Fixed-BWI and Root-SWI perform nearly identically. In the partially structured tasks, Fixed-BWI tends to exhibit higher precision but lower recall than Root-SWI, resulting in similar F1 values. In the unstructured tasks, Fixed-BWI has considerably higher precision and recall than Root-SWI.
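The three metrics used in these comparisons can be stated compactly; a minimal sketch in terms of true positives (correctly extracted fields), false positives (spurious extractions), and false negatives (missed fields):

```python
def precision(tp, fp):
    # Fraction of extracted fields that are correct.
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    # Fraction of labeled fields that are extracted.
    return tp / (tp + fn) if tp + fn else 0.0

def f1(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0

# e.g. 8 correct extractions, 2 spurious, 4 missed:
print(round(precision(8, 2), 3), round(recall(8, 4), 3), round(f1(8, 2, 4), 3))
# 0.8 0.667 0.727
```

Because F1 is a harmonic mean, it is dominated by the smaller of the two component scores, which is why the precision/recall tradeoffs described above can leave F1 nearly unchanged.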

4 Analysis of experimental results

In this section, we discuss our experimental results from two different angles. We examine the effect of the algorithmic differences between the IE methods studied, and we look at how overall performance is affected by the inherent difficulty of the IE task.

4.1 Why BWI outperforms SWI

The experiments performed yield a detailed understanding of how the algorithmic differences between BWI and SWI allow BWI to achieve consistently higher F1 values than SWI. The most important difference is that BWI uses boosting to learn rules, as opposed to set covering. Boosting has two complementary effects. First, boosting continually reweights all positive and negative examples to focus on the increasingly specific problems that the existing set of rules is unable to solve. This tends to yield high-precision rules, as is clear from the fact that Fixed-BWI consistently has higher precision than Root-SWI, even though they use the same scoring function and learn the same number of rules. While Greedy-SWI also has high precision, this is achieved by using a scoring function that does not permit any negative examples to be covered by any rules. This results in lower recall than with the other three methods, because general rules with wide coverage are hard to learn without covering some negative examples.

etof BWI, SWIislimited in

the total number ofboundarydetectorsitcan learn from gaiventrai

ningsThis et. is

because every time new a detector islearned,allmatching exam

plesare removed,and

trainingstopswhen there are nomore positive exampleslefttoc

over. Since each new

detector mustmatch aleast t one positive example,the number of

boundarydetectorsSWI

canlearn iat smostequalto the number ofpositive examplesin t

he training s(Usually et.

manyfewer detectorsare learnedbecause multiple examplesare

covered bysingle rules.)

Incontrast, BWIreweightsexamplesafter each iteration, butdoe

not s remove them

entirely,so there ino slimitinprinciple tohow manyrulescan

be learned. The abilityto

continue learning rulesmeansthat BWIcan notonly learn tocover all

the positive

examplesinthe training set,butitcanwidenthemarginbetween

positive and negative

examples, learning redundantand overlapping rules,whichtogether better

separate the

positive andnegative examples. 4.2 The consequences of task difficulty Asjustexplained,boostinggivesBWIthe consistent advantage inF1

performance

observed itnhe experimentsof Section However, 3. the difficulty of

the extractiontask

alsohas paronounced effect onthe performance odifferent f IE met

hods. We now turn

to specific a investigationoeach f documentcollection. 4.2.1 Highly structuredIE tasks These tasks are the easiestfor allmethods. It isnocoinci

dence thattheyare often called

“wrapper tasks” alearning s the regularitiesin automaticallyge

12

nerated webpageswas

precisely the problem for which wrapper inductionwasoriginallypro

posed asasolution.

Allmethodshave near-perfectprecision, andSWItendsto cover al

examples l withonly a

few iterations. Surprisingly,SWIdoesnotachieve near-perfectrecall. While t

hese tasksare

highly regular,the regularitieslearnedotnhe firstcouple oiter f

ations are not the only

importantones. Neither changing the scoring function from “greedy” t“o

root” nor

changingthe iteration methodfrom setcovering toboostingfixesthis

problem,asRoot-

SWIand Fixed-BWIhave virtuallyidentical performance tothato

Greedy-SWI. f

However, because BWIcontinues tolearn new andusefulruleseven

after allexamples

are covered, itis able tolearn secondary regularitiesthatinc

rease itsrecallandcapture

the casesthatare missedbsmaller ya setof rules. Thus

BWIachieveshigher recallthan

the other methods onthese extractiontasks. 4.2.2 Partially structuredIE tasks For these tasks, we see the largest variation ipnerformance bet

ween the IE algorithms

investigated. Compared toGreedy-SWI,Root-SWIhas increasedrec

all,decreased

precision, andslightly higher F1. Byallowing rulestocover somenegative

training

examples, more generalrulescan bleearned thathave higher recall,

butalsocover some

negative testexamples, causing lower precision. Root-SWIconsist

entlyterminatesafter

fewer iterationsthan Greedy-SWI, confirmingthatitisindeedc

overingmore examples

withfewer rules. Changing from setcovering tbooostingam as eansoflearningmultiple results in precision/recall a tradeoffwith little change inF

rules

As 1.mentionedearlier,

boostingreweightsthe examplestofocusonthe hardcases. This res very specific,withhigher precision butalso lower recall. Boost slower learning curve, measured inperformance versus number ofrules positive examplesare oftencovered multiple timesbefore allexam covered once. SWIremovesallcoveredexamples,focusingaeach t i

ultsin rulesthatare ingalsoresults in a learned,because pleshave been terationonew an

setofexamples. The consequence ithat s Fixed-BWIcannotlearne

noughrules to

overcome itsbiastowardsprecision itnhe number ofiterationsR

oot-SWItakesto cover

allthe positive examples. However,if enoughiterations ofboosti

ng are used,BWI

13

compensatesfor its slow startby learningenough rulesto ensure

highrecall without

sacrificing highprecision. 4.2.3 UnstructuredtextIE tasks Extensive testing oBWI f on naturaltexthasnotbeenpreviouslydone, though

it wasthe

clear vision oFreitag f andKushmericktpoursue research inthis

direction. As canbe

seeniF n igure 2on ,the natural texttasksallthe IE methods

investigatedhave relatively

low precision, despite the biastowards highprecision othe f boundary dete

ctors. Allthe

methodsalsohave relativelylow recall. Thisismainlydue ttohe

fact thatthe

algorithmsoftenlearnlittle more thanthe specific examples se

en duringtraining, which

usuallydnootappear againitnhe test set. Lookingathe t specific b

oundary detectors

learned,most simplymemorize individuallabeledinstancesof the ta

rgetfield,ignoring

contextaltogether: the fore detectorshave noprefix,andthe targe

field t athe s suffix,and

vice versa for the aftdetectors. Changingthe SWIscoring function

from “greedy” to

“root” resultsin marked a decrease inprecisionwithonly sm a

allincrease irnecall. Even

whennegative examplesare allowed tobceovered,the ruleslearned

are not sufficiently

generaltoincrease recallsignificantly. Unlike otnhe partially andhighly structured IE tasks,Fixed-BWIhas bot precisionandhigher recallthanRoot-SWIon the naturaltexttasks. withboostingiexplainable s aabove. s Boostingalso increases reca

higher Higher precision ll because while

already coveredexamplesare down-weighted during boosting,they are notr entirely. Thus,BWIremainssensitive ttohe performance onall the rulesitlearns. Incontrast,SWIquickly capturesthe regular

emoved trainingexamplesofall ities thatcanbeasily

found,andthenbegins covering positive trainingexamplesone ataime, t i performance onew f rulesonalreadycovered positive examples. The learnsmanyruleswith low recalliscompoundedbtyhe factthat negative examples. Thisbehavior ofSWIisparticularly troublesom labelingsof the collectionsofnaturaltextdocumentsare par

gnoringthe problem thatSWI it never removes beecause the tiallyinaccurate,as

explainedinSection 3Unlike .1. otnhe more structuredtasks,allowing B 500iterationscauses it toperform worse than Fixed-BWI, suggesti iterations resultinoverfitting. Because othe f irregularity of

WIto runfor ng thatthe extra the data, Fixed-BWIruns for

more iterationson naturaltext tasksthan oenasier tasks. The

last roundsofboosting

14

tendtcooncentrate oonnly few a highlyweightedexamples, meaningthat

unreliable

rulesare learned,which are more likelytboaertifacts oft

he training data than true

regularities.
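The reweighting dynamics discussed throughout this section can be illustrated with a simplified AdaBoost-style update. This is a sketch of the standard algorithm, not BWI's exact implementation; the weight update and the formula for alpha follow the usual AdaBoost presentation, given each example's weight and whether the new rule classifies it correctly:

```python
from math import exp, log

def reweight(weights, correct, error):
    """One AdaBoost-style round: down-weight examples the new rule handles
    correctly, up-weight the rest, then renormalize. Covered examples keep
    nonzero weight, so later rounds can still learn from them."""
    alpha = 0.5 * log((1 - error) / error)   # rule's voting weight
    new = [w * exp(-alpha if c else alpha) for w, c in zip(weights, correct)]
    z = sum(new)                             # normalization constant
    return [w / z for w in new]

w = [0.25, 0.25, 0.25, 0.25]
# A rule that gets the first three examples right (weighted error = 0.25):
w = reweight(w, [True, True, True, False], 0.25)
print([round(x, 3) for x in w])   # [0.167, 0.167, 0.167, 0.5]
```

After one round, the misclassified example holds half the total weight, so the next rule is pushed toward it, yet the three covered examples retain enough weight to constrain later rules. Setting their weight to zero instead would reproduce SWI's removal behavior.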

5 Quantifying task regularity

In addition to the observed variations resulting from changes to the scoring function and iteration method of BWI and SWI, it is clear that the inherent difficulty of the extraction task is also a strong predictor of performance. Highly structured documents such as those generated from a database and a template are easily handled by all the algorithms we consider, whereas natural text documents are uniformly more difficult. It is thus interesting to investigate methods for measuring the regularity of an information extraction task numerically, as opposed to subjectively and qualitatively.

5.1 SWI ratio as a measure of task regularity

We propose that a good measure of the regularity of an extraction task is the number of iterations SWI takes to cover all the positive examples in a training set, divided by the total number of positive examples in the training set. We call this measure the SWI ratio, for obvious reasons. In the most regular limit, a single rule would cover an infinite number of documents, and thus the SWI ratio would be 1/∞ = 0. At the other extreme, in a completely irregular set of documents, SWI would have to learn a separate rule for each positive example (i.e. there would be no generalization possible), so for N positive examples, SWI would need N rules, for an SWI ratio of N/N = 1. Since each SWI rule must cover at least one positive example, the number of SWI rules learned for a given document set will always be less than or equal to the total number of positive examples, and thus the SWI ratio will always be between 0 and 1, with a lower value indicating a more regular extraction task.
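Given SWI's output, the ratio itself is a one-line computation; the following sketch just makes the boundary cases explicit:

```python
def swi_ratio(num_rules_learned, num_positive_examples):
    """SWI ratio = (rules SWI needs to cover all positives) / (number of positives).
    Near 0: a few rules generalize over many examples (a regular task).
    Equal to 1: one rule per example, i.e. no generalization (an irregular task)."""
    assert 1 <= num_rules_learned <= num_positive_examples
    return num_rules_learned / num_positive_examples

print(swi_ratio(3, 60))   # 0.05: highly regular
print(swi_ratio(60, 60))  # 1.0: pure memorization
```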

The SWI ratio is a sensible and well-motivated measure of task regularity for several reasons. First, it is a relative measure with respect to the number of training examples, so one can use it to compare the regularity of tasks with small and large training collections of documents. Second, it is simple and objective, because SWI is a straightforward algorithm, with no free parameters to set other than the lookahead for extending boundary-detector rules, which is fixed in all our tests. Finally, the SWI ratio is efficient to compute, and it is practical to run SWI before or while running any other technique.

Figure 3: F1 performance for BWI and Greedy-SWI on the 15 different extraction tasks, plotted versus the SWI ratio of the task, with separations between highly structured, partially structured, and natural text.

5.2 Performance and task regularity

In addition to the clear differences in performance between the three classes of extraction tasks presented in the previous sections, there is also wide variation within each class of tasks. We must be careful not to forget that each value in Figure 2 is an average over several extraction tasks. Using the SWI ratio we can compare algorithms on a per-task basis. This reveals in greater detail both how performance is related to task difficulty, and whether BWI reliably outperforms SWI.

In Figure 3, F1 values for Greedy-SWI and BWI are plotted for each task versus its SWI ratio. As noted in Section 4, BWI performs better than Greedy-SWI on all but two tasks, and the F1 difference on those tasks is small. By plotting the two algorithms' performance versus the SWI ratio, we can examine how the two algorithms succeed as the regularity of the tasks decreases. There is a consistent decline in performance as the tasks decrease in regularity, for both algorithms. Also, there is no clear divergence in performance between the two algorithms as the regularity of the task decreases. So, although BWI appears to reliably outperform SWI, both algorithms are still limited by the regularity of the task to be solved.

6 What rule-based IE methods are learning

Having investigated BWI as an important new IE technique, and having understood its advantages over variant methods, we now turn to rule-based IE methods in general and their performance. Specifically, we investigate the relevant information that these methods do and do not appear to be exploiting during training.

As shown in Section 3 above and in numerous other papers, rule-based systems perform quite well on fairly structured tasks. If the target field to be extracted is canonically preceded or followed by a given set of tokens, or by tokens of a distinct type, represented by the wildcards available, boundary detectors readily learn this context. This is a common occurrence in highly structured and partially structured documents, where fields are often preceded by identifying labels (e.g. "Speaker: Dr. X"), or followed by identifiable pieces of information (e.g. in the Zagat survey, a restaurant address is almost always followed by its telephone number, which is easily distinguished from the rest of the text).

Surprisingly, there is a good deal of this type of regularity to be exploited even in more natural texts. For example, when extracting the locations where proteins are found, "located in" and "localizes to" are common prefixes, and are learned early by BWI. In general, human authors typically provide a given type of information only in a certain context, and specific lead-in words or following words are used repeatedly.

While rule-based IE methods are primarily designed to identify contexts outside target fields, BWI and other methods do also learn about some of the regularities that occur inside the fields being extracted. In the case of BWI, boundary detectors extend into the edge of the target field as well as into the local context, so the fore detectors can learn what the first few tokens of a target field look like, if the field tends to have a regular beginning, and the aft detectors can learn what the last few tokens look like. With short fields, individual boundary detectors often memorize instances of the target field. This happens in the MEDLINE tasks, where specific gene names, protein names, etc. get memorized when their context is not otherwise helpful. However, with longer fields, there is still important information learned. For example, in the MEDLINE citations, the abstract text often starts with phrases like "OBJECTIVE:", or "We investigated…"

In addition to learning the typical starting and ending tokens of a target field, BWI learns a probability distribution for the number of tokens in the field. Specifically, the field length is recorded for each training example, and this histogram is normalized into a probability distribution once training is complete. In BWI, length is the only piece of information that ties fore detectors and aft detectors together. Length information can be extremely useful. When BWI runs on a new test document, it first notes every fore and aft detector that fires, and then pairs them up to find specific token sequences. BWI only keeps a match that is consistent with the length distribution. Specifically, it drops matches whose fore and aft boundaries are farther apart or closer together than seen during training, and given two overlapping matches of equal detector confidence, it prefers the match whose length has been seen more times.
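The pairing-and-filtering step just described can be sketched as follows; the inputs are simplified stand-ins for BWI's detector machinery (token positions where detectors fire, plus the training-length histogram):

```python
from collections import Counter

def pair_matches(fore_positions, aft_positions, length_counts):
    """Pair every firing fore detector with every firing aft detector,
    keep only pairs whose field length was observed in training, and
    rank overlapping candidates by how often that length was seen."""
    candidates = [(f, a) for f in fore_positions for a in aft_positions
                  if (a - f) in length_counts]   # drop lengths never seen
    return sorted(candidates, key=lambda m: length_counts[m[1] - m[0]],
                  reverse=True)

# toy training histogram: fields of length 2 seen 5 times, length 4 once
length_counts = Counter({2: 5, 4: 1})
print(pair_matches([10, 20], [12, 24], length_counts))  # [(10, 12), (20, 24)]
```

Note that a candidate whose length never occurred in training is discarded outright, which is exactly the brittleness discussed in Section 7.1.1.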

In all but the most regular IE tasks, a single extraction rule is insufficient to cover all the positive examples, so it is necessary to learn a set of rules that capture different regularities. An important piece of information to capture, then, is the relative prominence of the different pattern types, to distinguish the general rules from the exceptional cases. In the case of BWI, this information is learned explicitly by assigning each boundary detector a confidence value, which is proportional to the total weight of positive examples covered by the rule during training, and so reflects the generality of the rule. In SWI, rules that cover the most examples are learned first, so there is a ranking of rules.

7 Limitations of BWI: what rule-based IE methods do not learn

BWI and similar methods achieve high performance on many document collections, but there is substantial room for improvement. There are useful pieces of information that most IE methods currently ignore, that would increase the apparent regularity of the extraction task were they exploited. There are also important facts and clues about many extraction tasks that current methods cannot represent, and therefore cannot learn. Finally, there are issues with the speed and efficiency of current methods, as well as how matches are scored and presented, that limit the usefulness of these systems. We investigate each of these areas in more detail in this section.

7.1 Representational expressiveness

BWI learns sets of fore and aft boundary detectors and a histogram of the lengths of the fields. The boundary detectors are short sequences of specific words or wildcards that directly precede or follow the target field or that are found within the target field. As a consequence of this representation, there are valuable types of regularity that cannot be captured, including a detailed model of the field to be extracted, the global location of the field within documents, and structural information such as that exposed by a grammatical parse or an HTML/XML document tree.

7.1.1 Limited model of the content being extracted

Clearly an important clue in finding and extracting a particular fact is recognizing what relevant fragments of text look like. As mentioned in the previous section, BWI can learn the canonical beginnings and endings of a target field, as well as a distribution over the length of the field. However, both these pieces of information are learned in a superficial manner, and much of the regularity of the field is ignored entirely. Currently, BWI normalizes its frequency counts of field lengths in the training data without any generalization. This means that any candidate field to be extracted whose length is not precisely equal to that of at least one target field in the training data will be dismissed, no matter how compelling the boundary detection. This does not cause problems for very short fields, when there are only a few possible lengths, but it is a major problem for fields with a wide variance of lengths.

For example, in the MEDLINE citations task, the abstract texts to be extracted have a mean length of 230 tokens, with a standard deviation of 264 tokens. With only 462 positive examples of abstracts, in 630 citations, many reasonable abstract lengths are never seen in training data. When running our experiments on this document collection, we made BWI ignore all length information. Ignoring length information resulted in F1 increasing from 0.1 to 0.97. The length problem could clearly be remedied by smoothing the distribution somehow, for example by binning the observed lengths prior to normalizing.
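One way the binning idea could be implemented, as a sketch (the bin width here is an arbitrary illustrative choice, not a value from our experiments):

```python
from collections import Counter

def binned_length_distribution(observed_lengths, bin_width=10):
    """Bin observed field lengths before normalizing, so a candidate field
    whose exact length was never seen can still receive nonzero probability
    as long as its length falls in an occupied bin."""
    counts = Counter(length // bin_width for length in observed_lengths)
    total = sum(counts.values())
    return {b: c / total for b, c in counts.items()}

def length_probability(dist, length, bin_width=10):
    return dist.get(length // bin_width, 0.0)

dist = binned_length_distribution([198, 203, 207, 251], bin_width=10)
print(length_probability(dist, 205))  # 0.5: never seen exactly, but its bin was
print(length_probability(dist, 600))  # 0.0: far outside anything observed
```

A candidate of length 205 would be rejected under exact-length matching but survives under binning; lengths far outside the observed range are still dismissed.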

In addition to the length distribution, the only other information BWI learns about the content of the target field comes from boundary detectors that overlap into the field. Often, more information about the canonical form of the target field could be exploited.

There has been a great deal of work on modeling fragments of text to identify them in larger documents, most under the rubric Named Entity Extraction. For example, Baluja (1999) reports impressive performance finding proper names in free text, using only models of the names themselves, based on capitalization, common prefixes and suffixes, and other word-level features. NYMBLE is another notable effort in this field (Bikel, 1997). Combining these field-finding methods with the BWI context-finding methods

should yield superior performance.

7.1.2 Limited expressiveness of boundary detectors

Boundary detectors are designed to capture the flat, nearby context of a field to be extracted, by learning short sequences of surrounding tokens. They are good at capturing regular, sequential information around the field to be extracted, resulting in high precision. However, current boundary detectors are not effective in partially structured and natural texts, where regularities in context are less consistent and reliable. In these domains many detectors only cover one or a few examples, and collectively the detectors have low recall.

A second limitation of existing boundary detectors is that they cannot represent the grammatical structure of sentences, or most structural information present in HTML or XML documents. Boundary detectors can only capture information about the tokens that appear directly before or after a target field in the linear sequence of tokens in a document. BWI and similar methods cannot represent or learn any information about the parent nodes, siblings, or child position in the grammar, or XML or HTML tree, to which target fields implicitly belong. In addition, the boundary detector representation does not allow BWI to take advantage of part-of-speech information, and other similar features that have been shown to contain important sources of regularity.

While context information is often the most relevant for recognizing a field, global information about the relative position of the field within the entire document can also be important. For example, despite the impressive performance of BWI on the MEDLINE abstract detection task, the most obvious clue to finding an abstract in a citation is, "look for a big block of text in the middle of the citation." There is no way to capture this global location knowledge using the existing BWI representations. Additional knowledge that BWI is unable to learn or use includes "the field is in the second sentence of the second paragraph if it is anywhere," "the field is never more than one sentence long," "the field has no line breaks inside it," and so on.

7.2 Scoring of candidate matches

Each match that BWI returns when applied to a test document has an associated confidence score, which is the product of the confidences of the fore and aft boundary detectors, and the learned probability of the field length of the match. These scores are useful for identifying the most important detectors, and also for thresholding to obtain a precision/recall tradeoff.

While the score of a candidate match is correlated with the chance that the proposed match is correct, these scores are not as useful as actual probabilities would be. First, the confidence scores are unbounded positive real numbers, so it is difficult to compare the scores of matches in different document collections, where the typical range of scores is different. For the same reason, it is hard to set absolute confidence thresholds for accepting proposed matches.

Second, it is hard to compare the confidence of full matches and partial matches, because the components of the confidence score have dramatically different ranges. This obscures the "decision boundary" of the learned system, which would otherwise be an important entity for analysis to improve performance. Finally, without true probabilities, it is difficult to use match scores in a larger framework, such as Bayesian evidence combination or decision-theoretic choice of action. The basic problem is that while confidence scores are useful for stating the relative merit of two matches in the same document collection, they are not useful for comparing matches across collections, nor are they useful for providing an absolute measure of merit.

Another problem with current confidence scores is that no distinction is made between a fragment of text that has appeared numerous times in the training set as a negative example, a fragment unlike any seen before, and a fragment where one boundary detector has confidently fired but the other has not. These all have confidence score zero, so there is no way to tell "a confident rejection" from "a near miss."

BWI also implicitly assumes that fields of just one type are extracted from a given document collection. Most documents contain multiple pieces of related information to be extracted, and the position of one field is often indicative of the position of another.

However, BWI can only extract fields one at a time, and ignores the relative position of different fields. For example, in the speaker announcements domain, even though four fields are extracted from each document, the field extractors are all trained and tested independently, and the presence of one field is not used to predict the location of another. It is difficult to extend methods like BWI to extract relationships in text, a task that is often as important as extracting individual fields (Muslea, 1999). Furthermore, it is inconvenient for real-world settings, where one should be able to label and train on a single document at a time, with multiple fields labeled in each document, rather than going through every document in a collection multiple times.

7.3 Efficiency

In addition to precision and recall, the speed and efficiency of training and testing is also a critical issue in making IE systems practical for solving real-world problems.

7.3.1 BWI is slow to train and test

Even on a modern workstation, training BWI on a single field with a few hundred documents can take several hours, and testing can take several minutes. The slowness of training and testing makes using larger document collections, or larger lookahead for detectors, prohibitive, even though both may be necessary to achieve high performance on complex tasks. Slowness also inhibits reliable comparison of different IE methods, and makes BWI unsuitable as the engine of an interactive system based on iterative human labeling during active learning.

There are a number of factors that make BWI slower than may be necessary. The innermost loop of training BWI is finding extensions to boundary detectors. This is done in a brute-force manner, with a specified lookahead parameter, L, and repeated until no better rule can be found. Finding a boundary detector extension is exponential in L, because every combination of tokens and wildcards is enumerated and scored, so even modest lookahead values are prohibitively expensive. A value of L = 3 is normally used to achieve a balance between efficiency and performance. Freitag et al. note that while L = 3 is usually sufficient, for some tasks a value up to L = 8 is required to achieve the best results.

Although testing is considerably faster than training, it too is inefficient. One technique for improving testing speed is to compress the set of boundary detectors learned during training, before applying them to a set of test documents. For example, one could eliminate redundancy by combining duplicate detectors, or eliminating detectors that are logically subsumed by a set of other detectors. This is a practice used by Cohen (1999) in SLIPPER, and by Ciravegna (2001) in (LP)2, for partially structured tasks, but it is doubtful how useful redundancy elimination would be for less regular document collections, which are the ones for which many detectors must be learned. There are also more sophisticated methods to compress a set of rules, many of which are somewhat lossy (Margineantu, 1997; Yarowsky, 1994). For these methods, the tradeoff between speed and accuracy would need to be measured.

7.3.2 BWI is non-incremental

After BWI has been trained on a given set of labeled documents, if more documents are labeled and added, there is currently no way to avoid training on the entire set from scratch again. In other words, there is no way to learn extra information from a few additional documents, even though the majority of what will be learned in the second run will be identical to what was learned before. This is more a problem with boosting than with BWI specifically, because the reweighting that occurs at each round depends on the current performance on all training examples, so adding even a few extra documents can mean that training takes a very different direction. Nevertheless, it is clearly desirable to be able to add documents to a collection as they become labeled, and not to need to

retrain each time.

7.3.3 BWI cannot determine when to stop boosting

While the ability of BWI to boost for a fixed number of rounds instead of terminating as soon as all positive examples are covered is clearly one of its advantages, it is hard to know how many rounds to use. Depending on the difficulty of the task, a given number of rounds may be insufficient to learn all the cases, or it may lead to overfitting, or to redundancy. Ideally, BWI would continue boosting for as long as useful, and then stop when continuing to boost would do more harm than good.

Cohen and Singer address the problem of when to stop boosting directly in the design of SLIPPER by using internal five-fold validation on a held-out part of the training set (Cohen, 1999). For each fold, they first boost up to a specified maximum number of rounds, testing on the held-out data after each round, and then they select the number of rounds that leads to the lowest average error on the held-out set, finally training for that number of rounds on the full training set. This process is reasonable, and sensitive to the difficulty of the task, but it has several drawbacks. First, it makes the already slow training process six times slower, when at least part of the motivation for finding an optimal number of rounds is to accelerate training. Second, the free parameter of the number of boosting rounds is replaced by another free parameter, the maximum number of rounds to try during cross-validation. Setting this value too low will mean that the best number of rounds is never found, while setting it too high will mean wasted effort.
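Schematically, this procedure is cross-validated early stopping. In the sketch below, `train_and_eval` is a placeholder for boosting on one fold and measuring held-out error after each round; it is not SLIPPER's actual interface:

```python
def choose_num_rounds(folds, train_and_eval, max_rounds):
    """For each fold, boost up to max_rounds, recording held-out error after
    every round; return the round count with the lowest average error.
    `train_and_eval(train, heldout, max_rounds)` is assumed to return a
    list of held-out errors, one per boosting round."""
    errors_per_round = [0.0] * max_rounds
    for train, heldout in folds:
        for r, err in enumerate(train_and_eval(train, heldout, max_rounds)):
            errors_per_round[r] += err / len(folds)
    best = min(range(max_rounds), key=lambda r: errors_per_round[r])
    return best + 1  # then retrain for this many rounds on the full set

# toy stand-in: held-out error falls, then rises again as boosting overfits
fake = lambda train, heldout, m: [abs(r - 4) / 10 for r in range(m)]
print(choose_num_rounds([(None, None)] * 5, fake, 10))  # -> 5
```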

It may be possible to use the BWI detector confidences to tell when continuing to boost is futile or harmful. Since detector scores tend to decrease as the number of boosting rounds increases, it may be possible to set an absolute or relative threshold below which to stop boosting. A relative threshold would measure when the curve has flattened out, and new detectors are only picking up exceptional cases one at a time. For example, Ciravegna (2001) prunes rules that cover less than a specified number of examples, because they are unreliable, and too likely to propose spurious matches.

8 Extending rule-based IE methods

In this section, we present several suggestions for advancing research on rule-based IE methods in general. These suggestions are applicable to BWI and to alternative learning algorithms for IE. We concentrate on identifying new sources of information to consider, constructing new representations for handling them, and producing more meaningful output. In some cases, we present preliminary results that corroborate our hypotheses. In other cases, we cite existing work by other researchers that we believe represents steps in the right direction. Our hope is that this section will provide a useful perspective on existing research, as well as inspiration for novel projects.

existing research,aswellasinspirationfor novelprojects. 8.1 Exploiting the grammatical structure osentences f Section7.1.2discusses the importance ousing f grammaticalstruc or hierarchicalstructure inHTML and XML documents. Ray andCr takenanimportant first stepinthis direction bpyreprocessingnatu parser,and thenflatteningthe parser outputby dividing sentences into

24

ture inatural language aven(2001) have raltextwith shallow a typed phrase

segments. The textisthenmarkedupwiththisgrammaticalinform

ation,which iused s

aspartof the informationextractionprocess. RayandCravenuse hi

ddenMarkovmodels,

buttheir technique igenerally s applicable. For example,usingXML ta

gsto represent

these phrase segments,theyconstructsentencesfrom MEDLINE a

rticles such as:

Uba2p is locatedlargely in



the nucleus .

While the parses produced are not perfect, and the flattening often further distorts things, this procedure is fairly consistent in labeling noun, verb, and prepositional phrases. An information extraction system can then learn boundary detectors that include these tags, allowing it, for example, to represent the constraint that proteins tend to be found in noun phrases, and not in verb phrases. Ray and Craven report that their results "suggest that there is value in representing grammatical structure in the HMM architectures, but the Phrase Model [with typed phrase segments] is not definitively more accurate."

accurate.” Inaddition tothe resultspresented inSection3which , use the docum

ent

collections ofRay and Cravenwithoutgrammaticalinformation,we a

lsoobtainedresults

using phrase segmentinformation. We used BWItoextractthe four i

ndividualfields in

the two relations thatRay andCravenstudy (proteins andtheir loca their associated diseases). We ran identical testswithand

lizations, genesand withoutXML tagsas shown

above. Includingthe tagsuniformlyand considerably improvesboth precisiona for allfour extraction tasks (Figure 4In f).act,allfour tasks

nd recall

see double-digitpercentage

increases in precision, recall, andF1, withaverage increasesof

21%, 65%,and 46%

respectively. The factthatprecision improveswhenusing the phrase segmenttagsm BWIis able touse thisinformation torejectpossible fields

eansthat thatitwould otherwise return.

The factthatrecallalso improvessuggeststhat having segment

tagshelpsBWIto find

fieldsthat it wouldotherwise miss. The combination othese f r

esultsissomewhat

surprising. For example,while not being noun a phrase maybheighlycorr

25

elatedwith

Usingtypedphrasesegmenttagsuniformlyimproves performanceonnaturaltextMEDLINEextractiontask

BWI s

1.0

Averageperformanceon4datasets

0.8

0.6 notags tags 0.4

0.2

0.0 Precision

Figure 4:

Recall

F1

Performance oBWI f averagedacross thefour naturaltextext the use otyped f phrasesegments. Means areshownwithstanda

ractiontasks,withandwithout rderror bars.

notbeing protein, a the inverse inot s necessarilythe case, since

there are manynon-

protein nounphrases in MEDLINE articles. We hypothesize thatincludingtypedphrase segmentinformation actually regularizes the extractiontask, enabling rulesto gaingreater pos

itive coverage without

increasingtheir negative coverage. Using the SWIratiodescribed i

S n ection5we , can

testthis hypothesisquantitatively. Figure 5showsthe SWIratios

for allfour extraction

taskswith andwithout phrase segmenttags. Asexpected, there a percentage decreasesinthe SWIratiofor all four tasks, wit

re double-digit hanaverage reductionof

21%. Recallthat laower SWIratio indicates m a ore regular

domain, because imeans t

thatthe same number ofpositive examplescan bpeerfectly covered

with fewer rules.

We conclude thatincludinggrammaticalinformation,evenwithexist

ingrule-basedIE

methods,canbconsiderable ae advantage for both recallandprecisi

on, andiworth s

investigatingim n ore detail. The success ofincludingeven limitedgrammaticalinformation imme raisesthe question owhat f additionalgrammaticalinformationc helpful itmightbe. Itisplausible thatregularitiesinfiel

diately an buesed, and how

cdontext such aargument s

positioniverb an phrase, subject/objectdistinction,and socnanbveal 26

uable. For

Usingtypedphrasesegmenttagsuniformlyincreases regularityonatural f textMEDLINEextractiontasks

the

0.7 0.6

SWI-Ratio

0.5 0.4

notags tags

0.3 0.2 0.1 0 protein

location

gene

disease

Dataset

Figure 5:

Average SWIratio for the four naturalextractiontasks,withand phrasesegments. Meansare shown withstandarderror bars.

withoutthe use otyped f

example,Charniak (2001)hasshownthatprobabilisticallyparsing sente

nces isgreatly

aidedbyconditioning oinnformation about the linguistic head othe f current ifthe headiseveral s tokensawayinthe flatrepresentation.

phrase, even Charniak’s finding is

evidence thatlinguistic headinformationian simportantsource orf

egularity, which is

exactlywhatrule-based IE methodsare designedtoexploit. Unfortunately,the existing formulation oboundary f detectorsis not well equippedtroepresent or handle grammaticalinformation. While inse

rtingXML tagsto

indicate typed phrase segmentsis useful, this approach iunprincipled, s a

it loses s the

distinction between meta-level informationand the text itself. T representationused bbyoundarydetectors should bextendedtoexpressregul implicithigher-levelinformationawell s asin explicittokeni When using typed phrase segments,we can increase performance without simple,flat representationcurrently used. However,this approachc structure explicit, so ican t onlybteaken sofar.

27

he knowledge arities in nformationidocument. na changingthe annotmake recursive

8.2 Handling XML or HTML structure and information

Just as the grammatical structure of a sentence presents opportunities for exploiting structural regularities, the hierarchical structure and attribute information available in an XML or HTML document also contain important clues for locating target fields. However, this information is essentially lost when an XML document is parsed as a linear sequence of tokens instead of using a document object model (DOM). For example, tags that should be treated as single tokens are instead broken into pieces. More problematic, however, is the fact that many tags contain namespace references, and attribute-value pairs inside the starting tag. When using flat tokenization, lookahead becomes a serious problem, and there is no way to intelligently generalize over such tags, e.g. to match a tag with the same name, to require specific attributes and/or values, etc.

Muslea et al. (1999) made some important contributions to solving this problem with STALKER, which constructs an "embedded catalog" for web pages (a conceptual hierarchy of content in a document) that allows it to take advantage of global landmarks for finding and extracting text fields. They accomplish this by use of the "Skip-To" operator, which matches a boundary by ignoring all text until a given sequence of tokens and/or wildcards. While the token sequences learned with Skip-To operators are similar to the boundary detectors used in BWI, they can be combined sequentially to form an extraction rule that relies on several distinct text fragments, which can be separated by arbitrary amounts of intermediate text.

Rules in STALKER are learned by starting with the most simple Skip-To rule that matches a given positive example and greedily adding tokens and new Skip-To operators as necessary to eliminate covering any negative examples. This approach avoids the exponential problem of the exhaustive enumeration of boundary detectors used by BWI. The tradeoff, however, is that the resulting boundary detectors only work for pages with a highly regular and consistent structure, and the token sequences learned tend to be short and specific. Furthermore, using Skip-To only approximates the information available in a true DOM representation.

A second approach to exploiting the structural information embedded in web pages and XML documents is due to Yih (1997), who uses hierarchical document templates for IE in web pages, and finds fields by learning their position within the template's node tree. The results presented are impressive, but currently the document templates must be constructed manually from web pages of interest, because the hierarchies in the templates are more subjective than just the HTML parse. Constructing templates may be less of a problem with XML documents, but only if their tag-based structure corresponds in some relevant way to the content structure that is necessary to exploit for information extraction. Similar results are presented by Liu and colleagues with their "XWRAP", a rule-based learner for web page information extraction that uses heuristics like font size, block positioning of HTML elements, and so on to construct a document hierarchy and extract specific nodes (Liu et al., 2000).
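A minimal sketch of the Skip-To matching idea described above, at the token level and ignoring STALKER's wildcards and its learning procedure:

```python
def skip_to(tokens, landmark, start=0):
    """Ignore tokens until `landmark` (a token sequence) is found;
    return the index just past it, or None if it never occurs."""
    n = len(landmark)
    for i in range(start, len(tokens) - n + 1):
        if tokens[i:i + n] == landmark:
            return i + n
    return None

def apply_rule(tokens, landmarks):
    """Chain Skip-To operators: each landmark may be separated from the
    next by arbitrary amounts of intermediate text."""
    pos = 0
    for lm in landmarks:
        pos = skip_to(tokens, lm, pos)
        if pos is None:
            return None
    return pos  # boundary position after the final landmark

doc = "name : Joe 's Diner phone : 555 - 1212".split()
print(apply_rule(doc, [["phone"], [":"]]))  # boundary just before the number
```

The key contrast with a BWI boundary detector is that the two landmarks here need not be adjacent; any amount of text may separate them.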

8.3 Extending the expressiveness of boundary detectors

One quick solution to the problem of incorporating more grammatical and structural information into the existing rule-based IE framework is the development of a more sophisticated set of wildcards. Currently, wildcards only represent word-level syntactic classes, such as "all caps", "begins with a lower-case letter", "contains only digits", and so on. While these are useful generalizations over matching individual tokens, they are also extremely broad. A potential middle ground consists of wildcards that match words of a given linguistic part of speech (e.g. "noun"), a given semantic class (e.g. "location/place"), or a given lexical feature (e.g. a specific prefix/suffix, word frequency threshold, etc.).

threshold,etc.). Inprinciple,such wildcards canbueseditnhe existingBWIframewor cancapture regularities thatare currentlybeing learnedone ins withmanyothese f new types of wildcards,constructing either pare allowable wordsor saimple online test for inclusion inot s feas

k, and they tance ataime. t However, defined set of ible, andinsteadwneeed

either preprocessingusing existing NLPsystems,asdone bR y ayand grammaticalclassifiers thatcanbaeppliedaneeded. s Clear

Craven,or ly efficiencycanbaen

importantissue,particularlyduring trainingwhengeneralizations

mustbe repeatedly

considered. Encouragingresultsin thisdirectionare already available. F

or example,

2

Ciravegna’s (LP) system useswordmorphologyandpart-of-speech information to generalize the rulesit initially learns(Ciravegna,2001), paroce

29

ss similar tousing lexical

and part-of-speech wildcards. Ciravegna’sresultsare comparabl

tetohose with other 2 also does rule correction andrule

state othe f artmethods,including BWI. While (LP) pruning,Ciravegna saysthat “the use oNLP f for generalization” is

the mostresponsible

for the performance othe f system. Another clever use ogenerali f

zationscan bfeoundin

RAPIER(Califf etal 1999), ., which usesWordNet(Miller,1995)to findhypernyms(a semantic class ofwordstowhichthe target wordbelongs), then use

these s initslearned

extractionrules. Yarowsky (1994)reportsthatincluding semantic wor

cdlasses like

“weekday” and “month” tocover setsoftokensimproves performance f

or lexical

disambiguation, suggesting that there are indeeduseful semantic regular

itiesto be

exploited for IE. HTML and XML wildcards thatallow generalizations suchaany s “fo formattingtag” oany r “table cell with specified a backgroundcolo

ntr” couldalso be

employed. These generalizations shouldnot onlyhelpincrease recall wi

thout sacrificing

precision, butalso help solve the commonproblem thatsmallchangest

web oa page

“break” learned IE detectors,and the system hasto breetrained.

For example, if w a eb

site changesthe fontcolor of paiece oinformation, f but the boundaryde

tector has

learned general a rule thatjustlooks for naon-standard fontcolor

itmay , wellcontinue to

functioncorrectly, while more a specific rule would not. 8.4 Converting BWImatchscoresintoprobabilities While improvingthe accuracy oIE fmethodsisanimportantgoal,progres

will s be

limited initsusefulness by the outputrepresentationused, sothi

is sanother criticaltarget

for further research. Asmentioned inSection 7.2,BWIandother methods probabilitiestothe matchesthey return. Probabilityithe s

do notattach linguafranca for combining

informationprocessing systems,because probabilitieshave both absolut meaning, andbecause there are powerfulmathematicalframeworks

aendrelative for dealing with

them,such aBayesian s evidence combination anddecisiontheory. Thus,it is worth askingthe question, canBWIconfidence scoresbe t into probabilities? Thisquestion really comesintwopieces. Fi

ransformed rst,are BWIconfidence

scoresmeaningfuland consistent? Thatis,doesBWIproduce incorrec highconfidence,or doesitsaccuracyincrease withitsconfidenc

30

matches t with Second, e? can BWI

[Figure 6: curves for precision, recall, F1, and empirical calibrated probability, plotted against confidence thresholds from 0 to 100.]

Figure 6: Precision, recall, and F1 for BWI on the AbstractText task, versus the confidence threshold below which to ignore matches. The empirical probability of match correctness within each confidence interval is also shown.

learn the correlation between confidence scores and the chance of correctness of a match, and thus automatically calibrate its scores into probabilities?

The first question, whether BWI scores are consistent, can be answered empirically, by training and testing BWI on different document collections. Ideally, most incorrect predictions have low confidence compared to correct predictions. Such a distribution is desirable for two reasons. First, it resembles a true probability distribution, where a higher score is correlated with a higher chance of being correct. Second, it allows users to set a confidence threshold above which to accept predictions, and below which to examine predictions further.

Our results suggest that BWI scores are fairly consistent and amenable to threshold setting for tasks with high regularity, but on hard tasks it is difficult to separate the correct and incorrect predictions based only on match confidences. For the AbstractText task, confidence scores range from 0 to 100, but all incorrect predictions have confidence scores below 20 (Figure 6). This means that if BWI ignores predictions below a confidence threshold of 20, it obtains perfect precision. However, this threshold results in only 59% recall, compared with 95% recall with a threshold of 0. A confidence threshold of 10 maximizes F1, because most incorrect answers can be pruned without eliminating correct guesses, as can be seen in Figure 6. The confidence scores are behaving roughly

[Figure 7: curves for precision, recall, F1, and empirical calibrated probability, plotted against confidence thresholds from 0 to 1.]

Figure 7: Precision, recall, and F1 for BWI on the YPD-protein task, versus the confidence threshold below which to ignore matches. The empirical probability of match correctness within each confidence interval is also shown.

as probabilities should, because as confidence increases, so does the fraction of correct guesses. Note that the precision for this task is lower than reported in Table 1, because here we consider all BWI matches, whereas normally BWI eliminates overlapping matches, keeping only the match with higher confidence.

Unfortunately, the correlation between confidence and probability of correctness is weaker for more difficult tasks. On the protein task, the most difficult task as measured both by performance and by SWI ratio, Figure 7 shows that there is no clean way to separate the correct and incorrect guesses. The highest-confidence negative match has score 0.65, but above this threshold BWI only achieves 0.8% recall, because almost all correct matches have confidence scores below 0.35. The highest F1 in this case comes from a confidence threshold of 0, with precision 52% and recall 24%. (These numbers are taken from an individual train/test fold and thus differ slightly from the averages presented in Table 1.) There is no way to improve performance with a non-zero confidence threshold, because there is no smooth transition to a higher density of correct predictions as confidence scores increase.

There is a direct way of calibrating scores into probabilities: bin BWI predictions on a validation set of documents by confidence score, and estimate the probability for each bin as the fraction of correct predictions in the bin. Essentially, if for a given range of confidence scores, say, 75% of predictions are correct, then a prediction on a test document with similar confidence is estimated to have a 75% chance of being correct. Even for difficult tasks for which confidence scores are not consistent, i.e. there is no smooth increase in probability as scores increase, the calibrated probabilities are still meaningful, because they make explicit the uncertainty in predictions.
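The binning procedure just described can be sketched in a few lines of Python (a toy illustration on invented validation data, not the paper's experimental pipeline); the same data also shows the thresholding trade-off discussed above:

```python
def calibrate_by_binning(scored, n_bins=10, max_score=1.0):
    """Estimate P(correct) per confidence bin from (confidence, correct) pairs."""
    bins = [[0, 0] for _ in range(n_bins)]   # [num correct, num total] per bin
    for conf, correct in scored:
        i = min(int(conf / max_score * n_bins), n_bins - 1)
        bins[i][0] += int(correct)
        bins[i][1] += 1
    # A bin with no validation examples yields None (no estimate available).
    return [c / t if t else None for c, t in bins]

def precision_recall_at(scored, threshold, total_positives):
    """Precision and recall if matches below the confidence threshold are dropped."""
    kept = [(conf, ok) for conf, ok in scored if conf >= threshold]
    tp = sum(ok for _, ok in kept)
    precision = tp / len(kept) if kept else 1.0
    recall = tp / total_positives
    return precision, recall

# Invented validation data: (confidence of match, was the match correct).
val = [(0.9, True), (0.8, True), (0.7, True), (0.3, False), (0.2, True), (0.1, False)]
print(calibrate_by_binning(val, n_bins=2))  # low bin ~0.33 correct, high bin 1.0
p, r = precision_recall_at(val, 0.5, total_positives=4)
print(p, r)  # 1.0 0.75
```

A prediction on a test document is then assigned the estimated probability of whichever bin its confidence falls into.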

9 Concluding remarks

Current information extraction methods are able to discover and exploit regularities in text that come from local word ordering, punctuation usage, and distinctive word-level features like capitalization. This information is often sufficient for partially and highly structured documents because their regularity is at the surface level, in the use and ordering of words themselves. However, current IE methods perform considerably worse when dealing with natural text documents, because the regularities in these are less apparent. Nevertheless, there are still important regularities in natural text documents, at the grammatical and semantic levels. We have shown that making explicit even limited grammatical information results in considerably higher precision and recall for information extraction from natural text documents.

All IE methods perform differently on document collections of different regularity, but there has been no quantitative analysis in previous papers of the relationship between document regularity and IE accuracy. We propose the SWI ratio as a quantitative measure of document regularity, and we use it to investigate the relationship between regularity and accuracy in greater detail than previously possible. Since the SWI ratio objectively measures the relative regularity of a document collection, it is suitable for comparison of IE tasks on document collections of varying size and content. While all the IE algorithms we study perform worse on less regular document collections, they still exhibit consistent differences.

Many current rule-based IE methods, including the SWI method proposed in this paper, employ some form of set covering to combine multiple extraction rules. These methods are relatively simple to implement, but they are all fundamentally limited in that they remove positive examples as soon as they are covered once. Boosting overcomes this limitation by downweighting covered

examples instead of removing them. We have shown that BWI exploits this property to learn additional useful rules even after all examples have been covered, and consistently outperforms set covering methods. Reweighting also helps by focusing BWI on learning specific rules for the exceptional cases missed by general rules, resulting in higher precision. The combination of learning accurate rules, and continuing training to broaden coverage, is the source of the success of BWI.

While this paper focuses on rule-based IE methods, and on BWI in particular, many of our observations hold for information extraction as a whole. For example, while hidden Markov models employ a different representation, they tend, like BWI, to learn regularities that are local and flat, concerning word usage, punctuation usage, and other surface patterns. Thus the sections above about sources of information that are currently ignored are relevant to both HMMs and rule-based IE methods. There are many opportunities for improving the usefulness of current information extraction methods. We hope this paper is a contribution in this direction.
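The contrast between set covering and boosting drawn above can be sketched with a generic AdaBoost-style weight update (a standard exponential reweighting, assumed here for illustration rather than taken from the BWI implementation):

```python
import math

def boost_weights(weights, correct, alpha):
    """AdaBoost-style reweighting: downweight examples the new rule handles
    correctly, upweight the ones it misses, then renormalize. Set covering,
    by contrast, would simply delete the covered examples."""
    new = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    z = sum(new)
    return [w / z for w in new]

# Three examples with equal initial weight; a rule covers the first two correctly.
w = [1/3, 1/3, 1/3]
w = boost_weights(w, [True, True, False], alpha=1.0)
# The uncovered example now dominates, so the next round focuses on it,
# but the covered examples keep nonzero weight and can still shape later rules.
print(w)
```

Because covered examples retain some weight, later rounds can keep refining rules for them, which is the mechanism behind the continued learning after full coverage described above.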

Acknowledgements

The authors would like to thank Dayne Freitag for his assistance and for making the BWI code available, Mark Craven for giving us the natural text MEDLINE documents with annotated phrase segments, and MedExpert International, Inc. for its financial support of this research. This work was conducted with support from the California Institute for Telecommunications and Information Technology, Cal-(IT)².

References

Baluja, S., Mittal, V., & Sukthankar, R. (1999). Applying machine learning for high performance named-entity extraction. In Proceedings of the Conference of the Pacific Association for Computational Linguistics, pp. 365-378.

Bikel, D., Miller, S., Schwartz, R., & Weischedel, R. (1997). Nymble: A High-Performance Learning Name-Finder. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pp. 194-201.

Califf, M. E., & Mooney, R. J. (1999). Relational Learning of Pattern-Match Rules for Information Extraction. In Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99), Orlando, FL, pp. 328-334.

Califf, M. (1998). Relational Learning Techniques for Natural Language Information Extraction. Artificial Intelligence 1998, pg. 276. (Can you check this reference? I can't find it.)

Cardie, C. (2001). Rule Induction and Natural Language Applications of Rule Induction. http://www.cs.cornell.edu/Info/People/cardie/tutorial/tutorial.html.

Charniak, E. (2001). Immediate-Head Parsing for Language Models. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics.

Ciravegna, F. (2001). (LP)², an Adaptive Algorithm for Information Extraction from Web-related Texts. In Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI-2001).

Clark, P., and Niblett, T. (1989). The CN2 Induction Algorithm. Machine Learning 3, pp. 261-283.

Cohen, W. W., & Singer, Y. (1999). A Simple, Fast, and Effective Rule Learner. In Proceedings of the Sixteenth National Conference on Artificial Intelligence, 1999.

Eliassi-Rad, T., & Shavlik, J. (2001). A Theory-Refinement Approach to Information Extraction. In Proceedings of the 18th International Conference on Machine Learning (ICML-2001).

Freitag, D., & Kushmerick, N. (2000). Boosted Wrapper Induction. In Proceedings of the Seventeenth National Conference on Artificial Intelligence, pp. 577-583.

Kushmerick, N. (2000). Wrapper Induction: Efficiency and Expressiveness. Artificial Intelligence 2000, pg. 118.

Liu, L., Pu, C., & Han, W. (2000). XWrap: An XML-Enabled Wrapper Construction System for Web Information Sources. In Proceedings of the International Conference on Data Engineering, 2000.

Margineantu, D. D., & Dietterich, T. G. (1997). Pruning adaptive boosting. In Machine Learning: Proceedings of the Fourteenth International Conference, pp. 211-218.

Michalski, S. (1980). Pattern recognition as rule-guided inductive inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2, 349-361.

Miller, G. (1995). WordNet: A lexical database for English. Communications of the ACM 38(11):39-41.

Muslea, I., Minton, S., & Knoblock, C. (1999). A Hierarchical Approach to Wrapper Induction.

Muslea, I. (1999). Extraction Patterns for Information Extraction: A Survey. In Proceedings of the Sixteenth National Conference on Artificial Intelligence, 1999.

National Library of Medicine (2001). The MEDLINE database, 2001. http://www.ncbi.nlm.gov/Pubmed/.

Quinlan, J. R. (1990). Learning logical definitions from relations. Machine Learning, 5:239-266.

Ray, S., & Craven, M. (2001). Representing Sentence Structure in Hidden Markov Models for Information Extraction. In Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI-2001).

Riloff, E. (1996). Automatically Generating Extraction Patterns from Untagged Text. In Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI-96), pp. 1044-1049.

Schapire, R. E. (1999). A Brief Introduction to Boosting. In Proceedings of the 16th International Joint Conference on Artificial Intelligence (IJCAI-1999).

Shinnou, H. (2001). Detection of errors in training data by using a decision list and AdaBoost. In IJCAI-2001 Workshop "Text Learning: Beyond Supervision", pp. 61-65.

Yarowsky, D. (1994). Decision Lists for Lexical Ambiguity Resolution: Application to Accent Restoration in Spanish and French. In Proceedings of the ACL '94, pp. 77-95.

Yih, W-t. (1997). Template-based Information Extraction from Tree-structured HTML Documents. Master's thesis, National Taiwan University.

Data Set       SWI-Ratio   BWI                   Fixed-BWI             Root-SWI              Greedy-SWI
                           Prec.  Rec.   F1      Prec.  Rec.   F1     Prec.  Rec.   F1     Prec.  Rec.   F1
LATimes-cc     0.013       0.996  1.000  0.998   1.000  0.985  0.993  1.000  0.975  0.986  1.000  0.948  0.973
Zagat-addr     0.011       1.000  0.937  0.967   1.000  0.549  0.703  1.000  0.549  0.703  1.000  0.575  0.724
QS-date        0.056       1.000  1.000  1.000   1.000  0.744  0.847  1.000  0.744  0.847  1.000  0.783  0.875
AbstractText   0.149       0.993  0.954  0.973   0.966  0.495  0.654  0.846  0.585  0.685  0.847  0.317  0.447
SA-speaker     0.246       0.791  0.592  0.677   0.887  0.446  0.586  0.777  0.457  0.565  0.904  0.342  0.494
SA-location    0.157       0.854  0.696  0.767   0.927  0.733  0.818  0.800  0.766  0.780  0.924  0.647  0.759
SA-stime       0.098       0.996  0.996  0.996   0.991  0.949  0.969  0.975  0.952  0.964  0.979  0.842  0.902
SA-etime       0.077       0.944  0.949  0.939   0.993  0.818  0.892  0.912  0.793  0.843  0.987  0.813  0.885
Jobs-id        0.068       1.000  1.000  1.000   1.000  0.956  0.978  1.000  0.956  0.978  0.996  0.829  0.902
Jobs-company   0.246       0.884  0.701  0.782   0.955  0.733  0.824  0.794  0.838  0.784  0.904  0.751  0.802
Jobs-title     0.381       0.596  0.432  0.501   0.661  0.479  0.547  0.480  0.658  0.546  0.660  0.477  0.549
YPD-protein    0.651       0.567  0.239  0.335   0.590  0.219  0.319  0.516  0.154  0.228  0.594  0.134  0.218
YPD-location   0.478       0.738  0.446  0.555   0.775  0.418  0.542  0.633  0.246  0.347  0.774  0.240  0.365
OMIM-gene      0.534       0.655  0.368  0.470   0.826  0.480  0.606  0.469  0.249  0.324  0.646  0.199  0.304
OMIM-disease   0.493       0.707  0.428  0.532   0.741  0.411  0.528  0.487  0.251  0.327  0.785  0.241  0.369

Table 1: SWI-Ratio and performance of the four IE methods examined on the 15 extraction tasks.

Data Set       BWI    SWI     Root-SWI
LATimes-cc     50     5.1     3.6
Zagat-addr     50     1.3     1.3
QS-date        50     1.0     1.0
AbstractText   500    51.8    48.4
SA-speaker     500    139.4   110.7
SA-location    500    75.6    66.8
SA-stime       500    72.1    62.0
SA-etime       500    25.0    17.9
Jobs-id        500    15.2    1.0
Jobs-company   500    46.9    46.3
Jobs-title     500    132.1   119.6
YPD-protein    500    333.1   322.5
YPD-location   500    235.0   229.7
OMIM-gene      500    357.0   344.6
OMIM-disease   500    280.1   278.8

Table 2: Number of boosting iterations used by BWI, SWI, and Root-SWI on the 15 data sets.
