Sources of Success for Information Extraction Methods

David Kauchak, Dept. of Computer Science, UC San Diego, La Jolla, CA 92093-0114 ([email protected])
Joseph Smarr, Symbolic Systems Program, Stanford University, Stanford, CA 94305 ([email protected])
Charles Elkan, Dept. of Computer Science, UC San Diego, La Jolla, CA 92093-0114 ([email protected])
Abstract
In this paper, we examine an important recent rule-based information extraction (IE) technique named Boosted Wrapper Induction (BWI), by conducting experiments on a wider variety of tasks than previously studied, including tasks using several collections of natural text documents. We provide a systematic analysis of how each algorithmic component of BWI, in particular boosting, contributes to its success. We show that the benefit of boosting arises from the ability to reweight examples to learn specific rules (resulting in high precision), combined with the ability to continue learning rules after all positive examples have been covered (resulting in high recall). As a quantitative indicator of the regularity of an extraction task, we propose a new measure that we call the SWI ratio. We show that this measure is a good predictor of IE success. Based on these results, we analyze the strengths and limitations of current rule-based IE methods in general. Specifically, we explain limitations in the information made available to these methods, and in the representations they use. We also discuss how confidence values returned during extraction are not true probabilities. In this analysis, we investigate the benefits of including grammatical and semantic information for natural text documents, as well as parse tree and attribute-value information for XML and HTML documents. We show experimentally that incorporating even limited grammatical information can improve the regularity of, and hence performance on, natural text extraction tasks. We conclude with proposals for enriching the representational power of rule-based IE methods to exploit these and other types of regularities.
1 Introduction
Computerized text is abundant. Thousands of new web pages appear every day. News, magazine, and journal articles are constantly being created online. Email has become one of the most popular ways of communicating. All these trends result in an enormous amount of text available in digital form, but these repositories of text are mostly untapped resources of information, and identifying specific desired information in them is a difficult task. Information extraction (IE) is the task of extracting relevant fragments of text from larger documents, to allow the fragments to be processed further in some automated way, for example to answer a user query. Examples of IE tasks include identifying the speaker featured in a talk announcement, finding the proteins mentioned in a biomedical journal article, and extracting the names of credit cards accepted by a restaurant from an online review.

A variety of systems and techniques have been developed to address the information extraction problem. Successful techniques include statistical methods such as n-gram models, hidden Markov models, and probabilistic context-free grammars (Califf, 1998), and rule-based methods that employ some form of machine learning. Rule-based methods have been especially popular in recent years. They use a variety of different approaches, but all recognize a number of common key facts. First, creating rules by hand is difficult and time-consuming (Riloff, 1996). For this reason, most systems generate their rules automatically given labeled or partially labeled data. Second, generating a single, general rule for extracting all instances of a given field is often impossible (Muslea et al., 1999). Therefore, most systems attempt to learn a number of rules that together cover the training examples for a field, and then combine these rules in some way.

Some recent techniques for generating rules in the realm of text extraction are called "wrapper induction" methods. These techniques have proved to be fairly successful for IE tasks in their intended domains, which are collections of highly structured documents such as web pages generated from a template script (Muslea et al., 1999; Kushmerick, 2000). However, wrapper induction methods do not extend well to natural language documents because of the specificity of the induced rules.
Recent research on improving the accuracy of weak classifiers using boosting (Schapire, 1999) has led to methods for learning classifiers that focus on examples that previous classifiers have found difficult. The AdaBoost algorithm works by continually reweighting the training examples and using a base learner to learn a new classifier repeatedly, stopping after a fixed number of iterations. The classifiers learned are then combined by weighted voting, where the weight of each classifier depends on its accuracy on its weighted training set. Boosting has been shown theoretically to converge to an optimal combined classifier, and performs well in practice.

Boosted Wrapper Induction (BWI) is an IE technique that uses the AdaBoost algorithm to generate a general extraction procedure that combines a set of specific wrappers (Freitag and Kushmerick, 2000). BWI has been shown to perform well on a variety of tasks with moderately structured and highly structured documents, but exactly how boosting contributes to this success has not been investigated. Furthermore, BWI has been proposed as an IE method for unstructured natural language documents, but little has been done to examine its performance in this challenging direction.

In this paper, we investigate the benefit of boosting in BWI and also the performance of BWI on natural text documents. We compare the use of boosting in BWI with the most common approach to combining individual extraction rules: sequential covering. With sequential covering, extraction rules are ordered in some way according to performance on the training set. The best rule is chosen, and all examples that this rule correctly classifies are removed from the training set. A new best rule is learned, and the process is repeated until the entire training set has been covered. For a more detailed description of sequential covering see (Cardie, 2001). Sequential covering has been used in a number of IE systems (Califf, 1998; Clark et al., 1989; Michalski, 1980; Muslea et al., 1999; Quinlan, 1990) because it is simple to implement, tends to generate understandable and intuitive rules, and has achieved good performance.

This paper is divided into a number of sections. In Section 2, we briefly describe BWI and related techniques, providing a formalization of the problem and a review of relevant terminology. In Section 3, we present experimental results comparing these different rule-based IE methods on a wide variety of document collections. In Section 4, we analyze the results of the experiments in detail, with specific emphasis on how boosting affects the performance of BWI, and how performance relates to the regularity of the extraction task. In Section 5, we introduce the SWI ratio as an objective measure of task regularity, and further examine the connection between regularity and performance. In Section 6, we transition to a discussion of rule-based IE methods in general. In Section 7, we point out important limitations of current methods, focusing on what information is used and how it is represented, how results are scored and selected, and the efficiency of training and testing. Finally, in Section 8, we suggest ways to address these limitations, and we provide experimental results that show that including grammatical information in the extraction process can increase exploitable regularity and therefore performance.
2 Overview of algorithms and terminology
In this section we present a brief review of the BWI approach to information extraction, including a precise problem description, a summary of the algorithms used, and important terminology. In particular we present a simplified variant of the BWI algorithm, called SWI, which will be used to analyze BWI and related algorithms.

2.1 Information extraction as a classification task
Most of the material in this section is based on (Freitag and Kushmerick, 2000). We present an abridged version here for convenience, starting with a review of relevant vocabulary and assumptions. A document is a sequence of tokens. A token is one of three things: an unbroken string of alphanumeric characters, a punctuation character, or a carriage return. The problem of information extraction is to extract subsequences of tokens matching a certain pattern from test documents. To learn the pattern, we reformulate the IE problem as a classification problem. Instead of thinking of the objective as a substring of tokens, we look at the problem as a function over boundaries. A boundary is the space between two tokens. Notice that a boundary is not something that is actually in the text (such as white space), but a notion that arises from the parsing of the text into tokens. We want to learn two classifiers, that is, two functions from boundaries to the binary set {0, 1}: one function that is 1 iff the boundary is the beginning of a field to be extracted, and one function that is 1 iff the boundary is the end of a field. This transformation from a segmentation problem to a boundary classification problem is also used, for example, by Shinnou (2001) to find word boundaries in Japanese text, since written Japanese does not use spaces.

Each of the two classifiers is represented as a set of boundary detectors, also called just detectors. A detector is a pair of token sequences 〈p, s〉. A detector matches a boundary iff the prefix string of tokens, p, matches the tokens before the boundary and the suffix string of tokens, s, matches the tokens after the boundary. For example, the detector 〈Who:, Dr.〉 matches "Who: Dr. John Smith" at the boundary between the ':' and the 'Dr'. Given beginning and ending classifiers, extraction is performed by identifying the beginning and end of a field and taking the tokens between the two points.

BWI separately learns two sets of boundary detectors for recognizing the alternative beginnings and endings of a target field to extract. At each iteration, BWI learns a new detector, and then uses AdaBoost to reweight all the examples in order to focus on portions of the training set that the current rules do not match well. Once this process has been repeated for a fixed number of iterations, BWI returns the two sets of learned detectors, called fore detectors, F, and aft detectors, A, as well as a histogram H over the length (number of tokens) of the target field.

2.2 Sequential Covering Wrapper Induction (SWI)
SWI uses the same framework as BWI with some modifications to the choice of individual detectors. The basic algorithm for SWI can be seen in Figure 1. SWI generates F and A independently by calling the GenerateDetectors procedure, which is a sequential covering rule learner. At each iteration through the while loop, a new boundary detector is learned, consisting of a prefix part and a suffix part. As with BWI, the best detector is generated by starting with empty prefix and suffix sequences. Then, the best token is found to add to either the prefix or suffix by exhaustively searching all possible extensions of length L. This process is repeated, adding one token at a time, until the detector being generated no longer improves, as measured by a scoring function. Once the best boundary detector has been found, it is added to the set of detectors to be returned, either F or A. Before continuing, all the positive examples covered by the new rule are removed, leaving only uncovered examples. As soon as the set of detectors covers all positive training examples, this set is returned.
SWI:
    input:  training sets S and E
    output: two lists of detectors and a length histogram
    F = GenerateDetectors(S)
    A = GenerateDetectors(E)
    H = field length histogram from S and E
    return 〈F, A, H〉

GenerateDetectors:
    input:  training set Y (S or E)
    output: list of detectors
    prefix pattern p = []
    suffix pattern s = []
    list of detectors d = []
    while (positive examples are still uncovered):
        〈p, s〉 = FindBestDetector()
        add 〈p, s〉 to d
        remove positive examples covered by 〈p, s〉 from Y
        p, s = []
    return d
Figure 1: The SWI and GenerateDetectors algorithms.
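To make the matching and covering operations of Figure 1 concrete, the following is a minimal Python sketch (a hypothetical simplified implementation, not the authors' code). The find_best_detector parameter stands in for the exhaustive search over prefix/suffix extensions described above; any procedure that returns a detector covering at least one uncovered boundary will do.

```python
# Hypothetical sketch of boundary detectors and the sequential covering
# loop of Figure 1. A detector is a pair (prefix, suffix) of token lists;
# it matches boundary position i in a token list iff the prefix matches
# the tokens ending at i and the suffix matches the tokens starting at i.

def matches(detector, tokens, i):
    prefix, suffix = detector
    return (tokens[max(0, i - len(prefix)):i] == prefix
            and tokens[i:i + len(suffix)] == suffix)

def covered(detector, tokens, boundaries):
    """Subset of the given boundary positions that the detector matches."""
    return {i for i in boundaries if matches(detector, tokens, i)}

def generate_detectors(tokens, positives, find_best_detector):
    """Sequential covering: learn detectors until every positive
    boundary in the training text is covered."""
    detectors = []
    uncovered = set(positives)
    while uncovered:
        d = find_best_detector(tokens, uncovered)
        detectors.append(d)
        uncovered -= covered(d, tokens, uncovered)
    return detectors
```

For instance, on the tokens ["Who", ":", "Dr", ".", "Smith"], the detector (["Who", ":"], ["Dr"]) matches the boundary at position 2, between the ':' and the 'Dr', mirroring the 〈Who:, Dr.〉 example of Section 2.1.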
To clarify, there is a distinction between the actual text, which is a sequence of tokens, and the two functions to be learned, which indicate whether a boundary is the start or end of a field to be extracted. When the SWI algorithm "removes" the boundaries that are matched by a detector, these boundaries are relabeled in the training documents, but the actual sequence of tokens is not changed.

There are two key differences between BWI and SWI. First, as mentioned above, with sequential covering, covered positive training examples are removed. In BWI, examples are reweighted but never relabeled. Relabeling a positive example is equivalent to setting its weight to zero. Second, the scoring functions for the extensions and detectors are slightly different. We investigate two different versions of SWI. The basic SWI algorithm uses a simple greedy method. The score of an extension or detector is the number of positive examples that it covers if it covers no negative examples, and zero otherwise. So, if a detector misclassifies even one example (i.e. covers a negative example), it is given a score of zero. We call this version Greedy-SWI. A second version of SWI, called Root-SWI, uses the same scoring function as BWI. Given W+, the sum of the weights of the positive examples covered by the extension or detector, and W−, the sum of the weights of the negative examples covered, the score is calculated to minimize training error (Cohen et al., 1999):

score = √W+ − √W−

Notice that for Root-SWI the two sums are the numbers of examples covered, because examples are all given the same weight. The final difference between BWI and SWI is that BWI terminates after a predetermined number of boosting iterations. In contrast, SWI terminates as soon as all of the positive examples have been covered. This turns out to be a particularly important difference between the two methods.
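The two scoring rules can be sketched as follows (an illustrative Python fragment written from the definitions above, not the authors' implementation):

```python
import math

def greedy_score(pos_covered, neg_covered):
    """Greedy-SWI: number of positive examples covered, but zero if the
    candidate covers even one negative example."""
    return len(pos_covered) if not neg_covered else 0

def root_score(pos_covered, neg_covered, weight):
    """Root-SWI / BWI: sqrt(W+) - sqrt(W-), where W+ and W- are the
    summed weights of the covered positive and negative examples."""
    w_pos = sum(weight[i] for i in pos_covered)
    w_neg = sum(weight[i] for i in neg_covered)
    return math.sqrt(w_pos) - math.sqrt(w_neg)
```

With uniform unit weights, a candidate covering four positives and no negatives receives a root score of 2, while under the greedy rule any candidate touching a single negative example scores zero, however many positives it covers.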
3 Experiments and results
In this section, we describe the data sets used for our experiments, our setup and methods, and the numerical results of our tests. We provide an analysis of the results in the next section.

3.1 Document collections and IE tasks
In order to compare BWI, Greedy-SWI, and Root-SWI, we examine the three algorithms on 15 different information extraction tasks using 8 different document collections. Many of these tasks are standard and have been used in testing a variety of IE techniques. The document collections can be categorized into three groups: natural (unstructured) text, partially structured text, and highly structured text. The breadth of these collections allows for a comparison of the algorithms on a wider spectrum of tasks than has previously been studied. The natural text documents are biomedical journal abstracts and contain little obvious formatting information. The partially structured documents resemble natural text, but also contain document-level structure and standardized formatting. These documents consist primarily of natural language, but contain regularities in formatting and annotations. For example, it is common for key fields to be preceded by an identifying label (e.g. "Speaker: Dr. X"), though this is not always the case. The highly structured documents are web pages that have been automatically generated. They contain reliable regularities, including location of content, punctuation and style, and non-natural language formatting such as HTML tags. The partially structured and highly structured document collections were obtained primarily from RISE (1998), while the natural text documents were obtained from the National Library of Medicine MEDLINE abstract database (National Library of Medicine, 2001).
We use two collections of natural language documents, both consisting of individual sentences from MEDLINE abstracts. The first collection consists of 555 sentences tagged with the names of proteins and their subcellular locations. These sentences were automatically selected and tagged from larger documents by searching for known proteins and locations contained in the Yeast Protein Database (YPD). Because of the automated labeling procedure, the labels are incomplete and sometimes inaccurate. Ray and Craven estimate an error rate in labeling between 10% and 15% (Ray et al., 2001). The second collection consists of 891 sentences containing genes and their associated diseases, as identified using the Online Mendelian Inheritance in Man (OMIM) database, and is subject to the same noisy labeling. Both these collections have been used in other research, including (Ray et al., 2001) and (Eliassi-Rad et al., 2001).

It should be noted that the original OMIM and YPD document collections contain many sentences labeled with no training examples of fields to extract, because the collections were created for extracting field pairs only. Therefore, only in cases where both fields in a pair were found was any training label added. Because BWI is designed to extract only single fields, we remove these sentences. Many of them in fact contain examples of desired fields, so when unlabeled, they present incorrect training information. Moreover, the unlabeled documents originally outnumber the labeled documents by almost an order of magnitude. Our restricted sentence collections still pose challenging information extraction tasks, but performance on these collections cannot be compared to performance reported for extracting field pairs in the original collections.
For the partially structured extraction tasks, we use three different document collections. The first two are taken from RISE, and consist of 486 speaker announcements (SA) and 298 Usenet job announcements (Jobs). In the speaker announcement collection, four different fields are extracted: the speaker's name (SA-speaker), the location of the seminar (SA-location), the starting time of the seminar (SA-stime), and the ending time of the seminar (SA-etime). In the job announcement collection, three fields are extracted: the message identifier code (Jobs-id), the name of the company (Jobs-company), and the title of the available position (Jobs-title). We created the third collection of partially structured documents from 630 MEDLINE article citations. These documents contain full MEDLINE bibliographic data for articles on hip arthroplasty surgery, i.e. author, journal, and publication information, MeSH headings (a standard controlled vocabulary of medical subject keywords), and other related information, including the text of the paper's abstract about 75% of the time. The task is to identify the beginning and end of the abstract, if the citation includes one (AbstractText). The abstracts are generally large bodies of text, but without any consistent beginning or ending marker, and the information that precedes or follows the abstract text varies from citation to citation.

Finally, the highly structured tasks are taken from three different document collections, also in RISE: 20 web pages containing restaurant descriptions from the Los Angeles Times (LATimes), where the task is to extract the list of accepted credit cards (LATimes-cc); 91 web pages containing restaurant reviews from the Zagat Guide to Los Angeles Restaurants (Zagat), where the task is to extract the restaurant's address (Zagat-addr); and 10 web pages containing responses from an automated stock quote service (QS), where the task is to extract the date of the response (QS-date). Although these collections contain a small number of documents, the individual documents typically comprise several separate entries, i.e. there are often multiple stock quote responses or restaurant reviews per web page.

3.2 Experimental design
The performance of the different rule-based IE methods on the tasks described above is measured using three standard metrics: precision, recall, and F1, the harmonic mean of precision and recall. Each result presented is the average of 10 random 75%/25% train/test splits. For all algorithms, a lookahead value, L, of 3 tokens was used. For the natural and partially structured texts (YPD, OMIM, SA, Jobs, and AbstractText), the default BWI wildcard set was used, and for the web pages (LATimes, Zagat, and QS), the default lexical wildcards were also used. A graphical summary of our results can be seen in Figure 2. For complete numerical results, see Table 1 at the end of this paper.
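The metrics can be computed as in the following sketch (a hypothetical helper, assuming each predicted field counts as correct iff it exactly matches a true field):

```python
# Sketch of the evaluation metrics of Section 3.2: precision, recall,
# and F1 (the harmonic mean of precision and recall).

def f1_metrics(true_fields, predicted_fields):
    tp = len(set(true_fields) & set(predicted_fields))  # true positives
    precision = tp / len(predicted_fields) if predicted_fields else 0.0
    recall = tp / len(true_fields) if true_fields else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Because F1 is a harmonic mean, it rewards balanced precision and recall: a method with precision 2/3 and recall 1/2 earns an F1 of 4/7, below the arithmetic mean of the two.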
To understand the differences between BWI and SWI, we performed four sets of tests. First, we ran BWI with the same number of rounds of boosting used by Freitag and Kushmerick (2000): for the YPD, OMIM, SA, Jobs, and AbstractText document collections, 500 rounds of boosting were used; for the LATimes, Zagat, and QS document collections, 50 rounds were used. Next, we ran Greedy-SWI and Root-SWI on the same tasks until all of the examples were covered. The actual number of iterations for which these algorithms ran varied across the different IE tasks. These numbers are presented in Table 2 at the end of the paper. Finally, we ran BWI for the same number of iterations that it took for Root-SWI to complete each task: 323 rounds for YPD-protein, 1 round for QS-date, and so on; this method is called Fixed-BWI.

Figure 2: Performance of the four IE methods examined in this paper (BWI, Fixed-BWI, Root-SWI, and Greedy-SWI), separated by overall regularity of document collection (highly structured, partially structured, and natural text), indicated above each chart, and averaged over all tasks within the given collections; each chart shows precision, recall, and F1.

3.3 Experimental results
To investigate whether BWI outperforms the set-covering approach of SWI, we compare Greedy-SWI and BWI. BWI appears to perform better, since the F1 values for BWI are higher for all the highly structured and natural text tasks studied. Greedy-SWI has slightly higher F1 performance on two of the partially structured tasks, but considerably lower performance on the other three. Generally, Greedy-SWI has slightly higher precision, while BWI tends to have considerably higher recall.

Given these results, the logical next step is determining what part of the success of BWI comes from the difference in scoring functions used by the two algorithms, and what part comes from how many rules are learned. To explore the first of these differences, we compare Greedy-SWI and Root-SWI, which differ only by their scoring function. Greedy-SWI tends to have higher precision while Root-SWI tends to have higher recall. However, they have similar F1 performance, a result that is illustrated further by comparing BWI to Root-SWI: BWI still has higher F1 performance than Root-SWI in all but three partially structured tasks, despite the fact that they use an identical scoring function.

There are only two differences between BWI and Root-SWI. First, Root-SWI removes all positive examples covered after each iteration, whereas BWI merely reweights them. Second, Root-SWI terminates as soon as all positive examples have been covered, whereas BWI continues to boost for a fixed number of iterations. Table 2 shows that Root-SWI always terminates after many fewer iterations than were set for BWI. Freitag and Kushmerick (2000) examined the performance of BWI as the number of rounds of boosting increases, and found improvement in many cases even after more than 500 iterations. To investigate this issue further, we run BWI for the same number of iterations as it takes Root-SWI to complete each task; we call this method Fixed-BWI. Recall that the number of rounds Fixed-BWI runs for depends on the extraction task. Here the results vary qualitatively with the level of structure in the domain. In the highly structured tasks, Fixed-BWI and Root-SWI perform nearly identically. In the partially structured tasks, Fixed-BWI tends to exhibit higher precision but lower recall than Root-SWI, resulting in similar F1 values. In the unstructured tasks, Fixed-BWI has considerably higher precision and recall than Root-SWI.
4 Analysis of experimental results
In this section, we discuss our experimental results from two different angles. We examine the effect of the algorithmic differences between the IE methods studied, and we look at how overall performance is affected by the inherent difficulty of the IE task.

4.1 Why BWI outperforms SWI
The experiments performed yield a detailed understanding of how the algorithmic differences between BWI and SWI allow BWI to achieve consistently higher F1 values than SWI. The most important difference is that BWI uses boosting to learn rules, as opposed to set covering. Boosting has two complementary effects. First, boosting continually reweights all positive and negative examples to focus on the increasingly specific problems that the existing set of rules is unable to solve. This tends to yield high precision rules, as is clear from the fact that Fixed-BWI consistently has higher precision than Root-SWI, even though they use the same scoring function and learn the same number of rules. While Greedy-SWI also has high precision, this is achieved by using a scoring function that does not permit any negative examples to be covered by any rules. This results in lower recall than with the other three methods, because general rules with wide coverage are hard to learn without covering some negative examples.
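The reweighting dynamic described above can be illustrated with a generic AdaBoost-style weight update (a sketch of the standard AdaBoost rule, not necessarily BWI's exact update): examples the current rule handles correctly are down-weighted, so examples it gets wrong carry more of the total weight in later rounds.

```python
import math

def reweight(weights, correct, error):
    """Standard AdaBoost-style update (illustrative sketch).
    weights: current example weights (a distribution);
    correct: per-example booleans for the current rule;
    error:   the rule's weighted error rate, with 0 < error < 0.5."""
    alpha = 0.5 * math.log((1 - error) / error)
    # Shrink weights of correctly handled examples, grow the others.
    new = [w * math.exp(-alpha if ok else alpha)
           for w, ok in zip(weights, correct)]
    total = sum(new)
    return [w / total for w in new]  # renormalize to a distribution
```

A classic property of this update is visible in a small example: after renormalization, the misclassified examples together hold exactly half of the total weight, which is why subsequent rounds concentrate on them.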
Second, boosting allows BWI to continue learning even when the existing set of rules already covers all the positive training examples. Unlike BWI, SWI is limited in the total number of boundary detectors it can learn from a given training set. This is because every time a new detector is learned, all matching examples are removed, and training stops when there are no more positive examples left to cover. Since each new detector must match at least one positive example, the number of boundary detectors SWI can learn is at most equal to the number of positive examples in the training set. (Usually many fewer detectors are learned, because multiple examples are covered by single rules.) In contrast, BWI reweights examples after each iteration but does not remove them entirely, so there is no limit in principle to how many rules can be learned. The ability to continue learning rules means that BWI can not only learn to cover all the positive examples in the training set, but it can widen the margin between positive and negative examples, learning redundant and overlapping rules, which together better separate the positive and negative examples.

4.2 The consequences of task difficulty
As just explained, boosting gives BWI the consistent advantage in F1 performance observed in the experiments of Section 3. However, the difficulty of the extraction task also has a pronounced effect on the performance of different IE methods. We now turn
to a specific investigation of each document collection.

4.2.1 Highly structured IE tasks
These tasks are the easiest for all methods. It is no coincidence that they are often called "wrapper tasks," as learning the regularities in automatically generated web pages was precisely the problem for which wrapper induction was originally proposed as a solution. All methods have near-perfect precision, and SWI tends to cover all examples with only a few iterations.

Surprisingly, SWI does not achieve near-perfect recall. While these tasks are highly regular, the regularities learned on the first couple of iterations are not the only important ones. Neither changing the scoring function from "greedy" to "root" nor changing the iteration method from set covering to boosting fixes this problem, as Root-SWI and Fixed-BWI have virtually identical performance to that of Greedy-SWI. However, because BWI continues to learn new and useful rules even after all examples are covered, it is able to learn secondary regularities that increase its recall and capture the cases that are missed by a smaller set of rules. Thus BWI achieves higher recall than
the other methods on these extraction tasks.

4.2.2 Partially structured IE tasks
For these tasks, we see the largest variation in performance between the IE algorithms investigated. Compared to Greedy-SWI, Root-SWI has increased recall, decreased precision, and slightly higher F1. By allowing rules to cover some negative training examples, more general rules can be learned that have higher recall, but also cover some negative test examples, causing lower precision. Root-SWI consistently terminates after fewer iterations than Greedy-SWI, confirming that it is indeed covering more examples with fewer rules.

Changing from set covering to boosting as a means of learning multiple rules results in a precision/recall tradeoff with little change in F1. As mentioned earlier, boosting reweights the examples to focus on the hard cases. This results in rules that are very specific, with higher precision but also lower recall. Boosting also results in a slower learning curve, measured in performance versus number of rules learned, because positive examples are often covered multiple times before all examples have been covered once. SWI removes all covered examples, focusing at each iteration on a new set of examples. The consequence is that Fixed-BWI cannot learn enough rules to overcome its bias towards precision in the number of iterations Root-SWI takes to cover all the positive examples. However, if enough iterations of boosting are used, BWI compensates for its slow start by learning enough rules to ensure high recall without
sacrificing high precision.

4.2.3 Unstructured text IE tasks
Extensive testing of BWI on natural text has not been previously done, though it was the clear vision of Freitag and Kushmerick to pursue research in this direction. As can be seen in Figure 2, on the natural text tasks all the IE methods investigated have relatively low precision, despite the bias towards high precision of the boundary detectors. All the methods also have relatively low recall. This is mainly due to the fact that the algorithms often learn little more than the specific examples seen during training, which usually do not appear again in the test set. Looking at the specific boundary detectors learned, most simply memorize individual labeled instances of the target field, ignoring context altogether: the fore detectors have no prefix and the target field as the suffix, and vice versa for the aft detectors. Changing the SWI scoring function from "greedy" to "root" results in a marked decrease in precision with only a small increase in recall. Even when negative examples are allowed to be covered, the rules learned are not sufficiently general to increase recall significantly.

Unlike on the partially and highly structured IE tasks, Fixed-BWI has both higher precision and higher recall than Root-SWI on the natural text tasks. Higher precision with boosting is explainable as above. Boosting also increases recall because while already covered examples are down-weighted during boosting, they are not removed entirely. Thus, BWI remains sensitive to the performance on all the training examples of all the rules it learns. In contrast, SWI quickly captures the regularities that can be easily found, and then begins covering positive training examples one at a time, ignoring the performance of new rules on already covered positive examples. The problem that SWI learns many rules with low recall is compounded by the fact that it never removes negative examples. This behavior of SWI is particularly troublesome because the labelings of the collections of natural text documents are partially inaccurate, as explained in Section 3.1. Unlike on the more structured tasks, allowing BWI to run for 500 iterations causes it to perform worse than Fixed-BWI, suggesting that the extra iterations result in overfitting. Because of the irregularity of the data, Fixed-BWI runs for more iterations on natural text tasks than on easier tasks. The last rounds of boosting tend to concentrate on only a few highly weighted examples, meaning that unreliable rules are learned, which are more likely to be artifacts of the training data than true regularities.
5 Quantifying task regularity
In addition to the observed variations resulting from changes to the scoring function and iteration method of BWI and SWI, it is clear that the inherent difficulty of the extraction task is also a strong predictor of performance. Highly structured documents such as those generated from a database and a template are easily handled by all the algorithms we consider, whereas natural text documents are uniformly more difficult. It is thus interesting to investigate methods for measuring the regularity of an information extraction task numerically, as opposed to subjectively and qualitatively.
5.1 SWI ratio as a measure of task regularity

We propose that a good measure of the regularity of an extraction task is the number of iterations SWI takes to cover all the positive examples in a training set, divided by the total number of positive examples in the training set. We call this measure the SWI ratio, for obvious reasons. In the most regular limit, a single rule would cover an infinite number of documents, and thus the SWI ratio would be 1/∞ = 0. At the other extreme, in a completely irregular set of documents, SWI would have to learn a separate rule for each positive example (i.e. there would be no generalization possible), so for N positive examples, SWI would need N rules, for an SWI ratio of N/N = 1. Since each SWI rule must cover at least one positive example, the number of SWI rules learned for a given document set will always be less than or equal to the total number of positive examples, and thus the SWI ratio will always be between 0 and 1, with a lower value indicating a
more regular extraction task.

The SWI ratio is a sensible and well-motivated measure of task regularity for several reasons. First, it is a relative measure with respect to the number of training examples, so one can use it to compare the regularity of tasks with small and large training collections of documents. Second, it is simple and objective, because SWI is a straightforward algorithm, with no free parameters to set other than the lookahead for extending boundary-detector rules, which is fixed in all our tests. Finally, the SWI ratio is efficient to compute, and it is practical to run SWI before or while running any other technique.

Figure 3: F1 performance for BWI and Greedy-SWI on the 15 different extraction tasks, plotted versus the SWI ratio of the task, with separations between highly structured, partially structured, and natural text.

5.2 Performance and task regularity

In addition to the clear differences in performance between the three classes of extraction tasks presented in the previous sections, there is also wide variation within each class of tasks. We must be careful not to forget that each value in Figure 2 is an average over several extraction tasks. Using the SWI ratio we can compare algorithms on a per-task basis. This reveals in greater detail both how performance is related to task difficulty, and whether BWI reliably outperforms SWI.

In Figure 3, F1 values for Greedy-SWI and BWI are plotted for each task versus its SWI ratio. As noted in Section 4, BWI performs better than Greedy-SWI on all but two tasks, and the F1 difference on those tasks is small. By plotting the two algorithms' performance versus the SWI ratio, we can examine how the two algorithms succeed as the regularity of the tasks decreases. There is a consistent decline in performance as the tasks decrease in regularity, for both algorithms. Also, there is no clear divergence in performance between the two algorithms as the regularity of the task decreases. So, although BWI appears to reliably outperform SWI, both algorithms are still limited by the regularity of the task to be solved.
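The SWI ratio is also easy to compute once SWI has run. The sketch below abstracts SWI's covering loop as a greedy set cover over hypothetical rule coverage sets (not the actual boundary-detector learner); the rule names and coverage sets are illustrative only:

```python
def greedy_cover(candidate_rules, positives):
    """Stand-in for SWI's covering loop: repeatedly pick the rule that
    covers the most still-uncovered positive examples, until every
    positive example is covered.
    candidate_rules: dict mapping rule name -> set of covered example ids.
    """
    uncovered = set(positives)
    chosen = []
    while uncovered:
        best = max(candidate_rules,
                   key=lambda r: len(candidate_rules[r] & uncovered))
        if not candidate_rules[best] & uncovered:
            raise ValueError("remaining positives cannot be covered")
        chosen.append(best)
        uncovered -= candidate_rules[best]
    return chosen

def swi_ratio(rules_learned, positives):
    """Rules needed to cover all positives, divided by the number of
    positives: near 0 means very regular, 1 means no generalization."""
    return len(rules_learned) / len(positives)
```

On a toy task with five positives where one rule covers four of them, two rules suffice, giving a ratio of 2/5 = 0.4.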
6 What rule-based IE methods are learning

Having investigated BWI as an important new IE technique, and having understood its advantages over variant methods, we now turn to rule-based IE methods in general and their performance. Specifically, we investigate the relevant information that these methods do and do not appear to be exploiting during training.

As shown in Section 3 above and in numerous other papers, rule-based systems perform quite well on fairly structured tasks. If the target field to be extracted is canonically preceded or followed by a given set of tokens, or by tokens of a distinct type represented by the wildcards available, boundary detectors readily learn this context. This is a common occurrence in highly structured and partially structured documents, where fields are often preceded by identifying labels (e.g. "Speaker: Dr. X"), or followed by identifiable pieces of information (e.g. in the Zagat survey, a restaurant address is almost always followed by its telephone number, which is easily distinguished from the rest of the text).

Surprisingly, there is a good deal of this type of regularity to be exploited even in more natural texts. For example, when extracting the locations where proteins are found, "located in" and "localizes to" are common prefixes, and are learned early by BWI. In general, human authors typically provide a given type of information only in a certain context, and specific lead-in words or following words are used repeatedly.
While rule-based IE methods are primarily designed to identify contexts outside target fields, BWI and other methods do also learn about some of the regularities that occur inside the fields being extracted. In the case of BWI, boundary detectors extend into the edge of the target field as well as into the local context, so the fore detectors can learn what the first few tokens of a target field look like, if the field tends to have a regular beginning, and the aft detectors can learn what the last few tokens look like. With short fields, individual boundary detectors often memorize instances of the target field. This happens in the MEDLINE tasks, where specific gene names, protein names, etc. get memorized when their context is not otherwise helpful. However, with longer fields, there is still important information learned. For example, in the MEDLINE citations, the abstract text often starts with phrases like "OBJECTIVE:", or "We investigated…"
In addition to learning the typical starting and ending tokens of a target field, BWI learns a probability distribution for the number of tokens in the field. Specifically, the field length is recorded for each training example, and this histogram is normalized into a probability distribution once training is complete. In BWI, length is the only piece of information that ties fore detectors and aft detectors together. Length information can be extremely useful. When BWI runs on a new test document, it first notes every fore and aft detector that fires, and then pairs them up to find specific token sequences. BWI only keeps a match that is consistent with the length distribution. Specifically, it drops matches whose fore and aft boundaries are farther apart or closer together than seen during training, and given two overlapping matches of equal detector confidence, it prefers the match whose length has been seen more times.

In all but the most regular IE tasks, a single extraction rule is insufficient to cover all the positive examples, so it is necessary to learn a set of rules that capture different regularities. An important piece of information to capture, then, is the relative prominence of the different pattern types, to distinguish the general rules from the exceptional cases. In the case of BWI, this information is learned explicitly by assigning each boundary detector a confidence value, which is proportional to the total weight of positive examples covered by the rule during training, and so reflects the generality of the rule. In SWI, rules that cover the most examples are learned first, so there is a ranking of rules.
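The pairing-and-filtering procedure described above can be sketched as follows. This is a simplification, not BWI's actual code: detector firings are given as (token position, confidence) pairs, and the length histogram as raw training counts:

```python
def pair_detections(fore_fires, aft_fires, length_counts):
    """Pair every fore firing with every aft firing, keeping only
    candidates whose length was seen during training.
    Returns tuples (start, end, score, times_length_seen)."""
    candidates = []
    for f_pos, f_conf in fore_fires:
        for a_pos, a_conf in aft_fires:
            length = a_pos - f_pos
            seen = length_counts.get(length, 0)
            if length <= 0 or seen == 0:
                continue  # farther apart or closer together than in training
            candidates.append((f_pos, a_pos, f_conf * a_conf, seen))
    return candidates

def prefer(c1, c2):
    """Between two overlapping candidates of equal detector confidence,
    prefer the one whose length was seen more times in training."""
    return c1 if c1[3] >= c2[3] else c2
```

With one fore firing and two aft firings, a candidate whose length never occurred in training is dropped outright, which is exactly the behavior the MEDLINE abstract experiment in Section 7.1.1 runs into.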
7 Limitations of BWI: what rule-based IE methods do not learn

BWI and similar methods achieve high performance on many document collections, but there is substantial room for improvement. There are useful pieces of information that most IE methods currently ignore, that would increase the apparent regularity of the extraction task were they exploited. There are also important facts and clues about many extraction tasks that current methods cannot represent, and therefore cannot learn. Finally, there are issues with the speed and efficiency of current methods, as well as how matches are scored and presented, that limit the usefulness of these systems. We investigate each of these areas in more detail in this section.
7.1 Representational expressiveness

BWI learns sets of fore and aft boundary detectors and a histogram of the lengths of the fields. The boundary detectors are short sequences of specific words or wildcards that directly precede or follow the target field or that are found within the target field. As a consequence of this representation, there are valuable types of regularity that cannot be captured, including a detailed model of the field to be extracted, the global location of the field within documents, and structural information such as that exposed by a grammatical parse or an HTML/XML document tree.

7.1.1 Limited model of the content being extracted

Clearly an important clue in finding and extracting a particular fact
is recognizing what relevant fragments of text look like. As mentioned in the previous section, BWI can learn the canonical beginnings and endings of a target field, as well as a distribution over the length of the field. However, both these pieces of information are learned in a superficial manner, and much of the regularity of the field is ignored entirely. Currently, BWI normalizes its frequency counts of field lengths in the training data without any generalization. This means that any candidate field to be extracted whose length is not precisely equal to that of at least one target field in the training data will be dismissed, no matter how compelling the boundary detection. This does not cause problems for very short fields, when there are only a few possible lengths, but it is a major problem for fields with a wide variance of lengths.

For example, in the MEDLINE citations task, the abstract texts to be extracted have a mean length of 230 tokens, with a standard deviation of 264 tokens. With only 462 positive examples of abstracts, in 630 citations, many reasonable abstract lengths are never seen in training data. When running our experiments on this document collection, we made BWI ignore all length information. Ignoring length information resulted in F1 increasing from 0.1 to 0.97. The length problem could clearly be remedied by smoothing the distribution somehow, for example by binning the observed lengths prior to
normalizing.

In addition to the length distribution, the only other information BWI learns about the content of the target field comes from boundary detectors that overlap into the field. Often, more information about the canonical form of the target field could be exploited.
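One way to implement the binning remedy suggested above (a hypothetical smoothing, not part of BWI as published; the bin width is an assumption to be tuned per task):

```python
from collections import Counter

def binned_length_distribution(training_lengths, bin_width=10):
    """Pool observed field lengths into bins before normalizing, so a
    plausible length that was never seen exactly still gets probability
    mass from its neighbors."""
    counts = Counter(length // bin_width for length in training_lengths)
    total = sum(counts.values())
    return {b: c / total for b, c in counts.items()}

def length_probability(length, distribution, bin_width=10):
    """Probability assigned to a candidate field length: the mass of
    the bin the length falls into (zero for bins never observed)."""
    return distribution.get(length // bin_width, 0.0)
```

A length of 5 that never occurred in training still receives the probability of its bin, instead of causing the candidate match to be dismissed outright.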
There has been a great deal of work on modeling fragments of text to identify them in larger documents, most under the rubric Named Entity Extraction. For example, Baluja (1999) reports impressive performance finding proper names in free text, using only models of the names themselves, based on capitalization, common prefixes and suffixes, and other word-level features. NYMBLE is another notable effort in this field (Bikel, 1997). Combining these field-finding methods with the BWI context-finding methods should yield superior performance.

7.1.2 Limited expressiveness of boundary detectors

Boundary detectors are designed to capture the flat, nearby context of a
field to be extracted, by learning short sequences of surrounding tokens. They are good at capturing regular, sequential information around the field to be extracted, resulting in high precision. However, current boundary detectors are not effective in partially structured and natural texts, where regularities in context are less consistent and reliable. In these domains many detectors only cover one or a few examples, and collectively the detectors have low recall.

A second limitation of existing boundary detectors is that they cannot represent the grammatical structure of sentences, or most structural information present in HTML or XML documents. Boundary detectors can only capture information about the tokens that appear directly before or after a target field in the linear sequence of tokens in a document. BWI and similar methods cannot represent or learn any information about the parent nodes, siblings, or child position in the grammar, or XML or HTML tree, to which target fields implicitly belong. In addition, the boundary detector representation does not allow BWI to take advantage of part-of-speech information, and other similar features that have been shown to contain important sources of regularity.

While context information is often the most relevant for recognizing a field, global information about the relative position of the field within the entire document can also be important. For example, despite the impressive performance of BWI on the MEDLINE abstract detection task, the most obvious clue to finding an abstract in a citation is, "look for a big block of text in the middle of the citation." There is no way to capture this global location knowledge using the existing BWI representations. Additional knowledge that BWI is unable to learn or use includes "the field is in the second sentence of the second paragraph if it is anywhere," "the field is never more than one sentence long," "the field has no line breaks inside it," and so on.
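For illustration, the "big block in the middle" clue can be written down in a few lines, which is exactly the kind of rule the boundary-detector representation cannot express. The block positions and sizes below are hypothetical inputs, not anything BWI computes:

```python
def biggest_middle_block(blocks):
    """blocks: list of (start_fraction, num_tokens), where start_fraction
    is the block's position within the document (0.0 = start, 1.0 = end).
    Return the longest block lying in the middle half of the document."""
    middle = [b for b in blocks if 0.25 <= b[0] <= 0.75]
    return max(middle, key=lambda b: b[1]) if middle else None
```

A long block near the end of the document is ignored in favor of the longest block in the middle, which is how a human would locate a MEDLINE abstract.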
7.2 Scoring of candidate matches

Each match that BWI returns when applied to a test document has an associated confidence score, which is the product of the confidences of the fore and aft boundary detectors, and the learned probability of the field length of the match. These scores are useful for identifying the most important detectors, and also for thresholding to obtain a precision/recall tradeoff.

While the score of a candidate match is correlated with the chance that the proposed match is correct, these scores are not as useful as actual probabilities would be. First, the confidence scores are unbounded positive real numbers, so it is difficult to compare the scores of matches in different document collections, where the typical range of scores is different. For the same reason, it is hard to set absolute confidence thresholds for accepting proposed matches.

Second, it is hard to compare the confidence of full matches and partial matches, because the components of the confidence score have dramatically different ranges. This obscures the "decision boundary" of the learned system, which would otherwise be an important entity for analysis to improve performance. Finally, without true probabilities, it is difficult to use match scores in a larger framework, such as Bayesian evidence combination or decision-theoretic choice of action. The basic problem is that while confidence scores are useful for stating the relative merit of two matches in the same document collection, they are not useful for comparing matches across collections, nor are they useful for providing an absolute measure of merit.

Another problem with current confidence scores is that no distinction is made between a fragment of text that has appeared numerous times in the training set as a negative example, a fragment unlike any seen before, and a fragment where one boundary detector has confidently fired but the other has not. These all have confidence score zero, so there is no way to tell "a confident rejection" from "a near miss."
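The conflation is easy to see once the score is written as the product described above (a schematic of the scoring rule, not BWI's implementation):

```python
def match_score(fore_conf, aft_conf, length_prob):
    """BWI-style match score: a product of unbounded detector
    confidences and a length probability -- not a probability itself."""
    return fore_conf * aft_conf * length_prob

# A near miss (one detector fires strongly, the other not at all) and a
# fragment never seen before both score exactly zero:
near_miss = match_score(7.3, 0.0, 0.2)
never_seen = match_score(0.0, 0.0, 0.0)
```

Since any zero factor annihilates the product, the score carries no information about which factor was responsible, which is precisely the "confident rejection" versus "near miss" problem.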
BWI also implicitly assumes that fields of just one type are extracted from a given document collection. Most documents contain multiple pieces of related information to be extracted, and the position of one field is often indicative of the position of another. However, BWI can only extract fields one at a time, and ignores the relative position of different fields. For example, in the speaker announcements domain, even though four fields are extracted from each document, the field extractors are all trained and tested independently, and the presence of one field is not used to predict the location of another. It is difficult to extend methods like BWI to extract relationships in text, a task that is often as important as extracting individual fields (Muslea, 1999). Furthermore, it is inconvenient for real-world settings, where one should be able to label and train on a single document at a time, with multiple fields labeled in each document, rather than going through every document in a collection multiple times.

7.3 Efficiency

In addition to precision and recall, the speed and efficiency of training and testing is also a critical issue in making IE systems practical for solving real-world problems.
7.3.1 BWI is slow to train and test

Even on a modern workstation, training BWI on a single field with a few hundred documents can take several hours, and testing can take several minutes. The slowness of training and testing makes using larger document collections, or larger lookahead for detectors, prohibitive, even though both may be necessary to achieve high performance on complex tasks. Slowness also inhibits reliable comparison of different IE methods, and makes BWI unsuitable as the engine of an interactive system based on iterative human labeling during active learning.

There are a number of factors that make BWI slower than may be necessary. The innermost loop of training BWI is finding extensions to boundary detectors. This is done in a brute-force manner, with a specified lookahead parameter L, and repeated until no better rule can be found. Finding a boundary detector extension is exponential in L, because every combination of tokens and wildcards is enumerated and scored, so even modest lookahead values are prohibitively expensive. A value of L=3 is normally used to achieve a balance between efficiency and performance. Freitag et al. note that while L=3 is usually sufficient, for some tasks a value of up to L=8 is required to achieve the best results.

Although testing is considerably faster than training, it too is inefficient. One technique for improving testing speed is to compress the set of boundary detectors learned during training, before applying them to a set of test documents. For example, one could eliminate redundancy by combining duplicate detectors, or eliminating detectors that are logically subsumed by a set of other detectors. This is a practice used by Cohen (1999) in SLIPPER, and by Ciravegna (2001) in (LP)², for partially structured tasks, but it is doubtful how useful redundancy elimination would be for less regular document collections, which are the ones for which many detectors must be learned. There are also more sophisticated methods to compress a set of rules, many of which are somewhat lossy (Margineantu, 1997; Yarowsky, 1994). For these methods, the tradeoff
between speed and accuracy would need to be measured.

7.3.2 BWI is non-incremental

After BWI has been trained on a given set of labeled documents, if more documents are labeled and added, there is currently no way to avoid training on the entire set from scratch again. In other words, there is no way to learn extra information from a few additional documents, even though the majority of what will be learned in the second run will be identical to what was learned before. This is more a problem with boosting than with BWI specifically, because the reweighting that occurs at each round depends on the current performance on all training examples, so adding even a few extra documents can mean that training takes a very different direction. Nevertheless, it is clearly desirable to be able to add documents to a collection as they become labeled, and not to need to retrain each time.

7.3.3 BWI cannot determine when to stop boosting

While the ability of BWI to boost for a fixed number of rounds, instead of terminating as soon as all positive examples are covered, is clearly one of its advantages, it is hard to know how many rounds to use. Depending on the difficulty of the task, a given number of rounds may be insufficient to learn all the cases, or it may lead to overfitting, or to redundancy. Ideally, BWI would continue boosting for as long as useful, and then stop when continuing to boost would do more harm than good.

Cohen and Singer address the problem of when to stop boosting directly in the design of SLIPPER by using internal five-fold validation on a held-out part of the training set (Cohen, 1999). For each fold, they first boost up to a specified maximum number of rounds, testing on the held-out data after each round, and then they select the number of rounds that leads to the lowest average error on the held-out set, finally training for that number of rounds on the full training set. This process is reasonable, and sensitive to the difficulty of the task, but it has several drawbacks. First, it makes the already slow training process six times slower, when at least part of the motivation for finding an optimal number of rounds is to accelerate training. Second, the free parameter of the number of boosting rounds is replaced by another free parameter, the maximum number of rounds to try during cross-validation. Setting this value too low will mean that the best number of rounds is never found, while setting it too high will mean wasted effort.

It may be possible to use the BWI detector confidences to tell when continuing to boost is futile or harmful. Since detector scores tend to decrease as the number of boosting rounds increases, it may be possible to set an absolute or relative threshold below which to stop boosting. A relative threshold would measure when the curve has flattened out, and new detectors are only picking up exceptional cases one at a time. For example, Ciravegna (2001) prunes rules that cover less than a specified number of examples, because they are unreliable, and too likely to propose spurious matches.
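A relative stopping threshold of the kind proposed above might look like the following sketch. The window and drop parameters are hypothetical, not values from any published system:

```python
def stop_round(round_confidences, window=3, drop=0.5):
    """Stop boosting once the mean detector confidence over the last
    `window` rounds falls below `drop` times the mean of the `window`
    rounds before that, i.e. the confidence curve has flattened into
    one-off exceptional cases.  Returns the stopping round, or None."""
    for t in range(2 * window, len(round_confidences) + 1):
        earlier = round_confidences[t - 2 * window : t - window]
        recent = round_confidences[t - window : t]
        if sum(recent) / window < drop * (sum(earlier) / window):
            return t
    return None
```

On a confidence curve that plateaus and then collapses, the rule fires shortly after the collapse; on a flat curve it never fires, and boosting continues for the full budget.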
8 Extending rule-based IE methods

In this section, we present several suggestions for advancing research on rule-based IE methods in general. These suggestions are applicable to BWI and to alternative learning algorithms for IE. We concentrate on identifying new sources of information to consider, constructing new representations for handling them, and producing more meaningful output. In some cases, we present preliminary results that corroborate our hypotheses. In other cases, we cite existing work by other researchers that we believe represents steps in the right direction. Our hope is that this section will provide a useful perspective on existing research, as well as inspiration for novel projects.

8.1 Exploiting the grammatical structure of sentences

Section 7.1.2 discusses the importance of using grammatical structure in natural language, or hierarchical structure in HTML and XML documents. Ray and Craven (2001) have taken an important first step in this direction by preprocessing natural text with a shallow parser, and then flattening the parser output by dividing sentences into typed phrase segments. The text is then marked up with this grammatical information, which is used as part of the information extraction process. Ray and Craven use hidden Markov models, but their technique is generally applicable. For example, using XML tags to represent these phrase segments, they construct sentences from MEDLINE articles such as:

    Uba2p is located largely in the nucleus .
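This flattening step can be reproduced in a few lines: each typed phrase segment becomes an opening tag token, its words, and a closing tag token. The segment type names below are illustrative, not necessarily the tag set Ray and Craven used:

```python
def flatten_segments(segments):
    """Turn typed phrase segments into one token stream, inserting the
    segment type as opening and closing tag tokens so a token-based
    learner like BWI can condition on them.
    segments: list of (segment_type, list_of_tokens)."""
    stream = []
    for seg_type, tokens in segments:
        stream.append("<" + seg_type + ">")
        stream.extend(tokens)
        stream.append("</" + seg_type + ">")
    return stream
```

A boundary detector can then include tokens such as "<NP>" in its learned context, which is how the constraint "proteins tend to occur in noun phrases" becomes expressible.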
While the parses produced are not perfect, and the flattening often further distorts things, this procedure is fairly consistent in labeling noun, verb, and prepositional phrases. An information extraction system can then learn boundary detectors that include these tags, allowing it, for example, to represent the constraint that proteins tend to be found in noun phrases, and not in verb phrases. Ray and Craven report that their results "suggest that there is value in representing grammatical structure in the HMM architectures, but the Phrase Model [with typed phrase segments] is not definitively more accurate."

In addition to the results presented in Section 3, which use the document collections of Ray and Craven without grammatical information, we also obtained results using phrase segment information. We used BWI to extract the four individual fields in the two relations that Ray and Craven study (proteins and their localizations, genes and their associated diseases). We ran identical tests with and without XML tags as shown above. Including the tags uniformly and considerably improves both precision and recall for all four extraction tasks (Figure 4). In fact, all four tasks see double-digit percentage increases in precision, recall, and F1, with average increases of 21%, 65%, and 46%
respectively.

The fact that precision improves when using the phrase segment tags means that BWI is able to use this information to reject possible fields that it would otherwise return. The fact that recall also improves suggests that having segment tags helps BWI to find fields that it would otherwise miss. The combination of these results is somewhat surprising. For example, while not being a noun phrase may be highly correlated with not being a protein, the inverse is not necessarily the case, since there are many non-protein noun phrases in MEDLINE articles.

Figure 4: Performance of BWI averaged across the four natural text extraction tasks, with and without the use of typed phrase segments. Means are shown with standard error bars.
itive coverage without
increasingtheir negative coverage. Using the SWIratiodescribed i
S n ection5we , can
testthis hypothesisquantitatively. Figure 5showsthe SWIratios
for allfour extraction
taskswith andwithout phrase segmenttags. Asexpected, there a percentage decreasesinthe SWIratiofor all four tasks, wit
re double-digit hanaverage reductionof
21%. Recallthat laower SWIratio indicates m a ore regular
domain, because imeans t
thatthe same number ofpositive examplescan bpeerfectly covered
with fewer rules.
We conclude thatincludinggrammaticalinformation,evenwithexist
ingrule-basedIE
methods,canbconsiderable ae advantage for both recallandprecisi
on, andiworth s
investigatingim n ore detail. The success ofincludingeven limitedgrammaticalinformation imme raisesthe question owhat f additionalgrammaticalinformationc helpful itmightbe. Itisplausible thatregularitiesinfiel
diately an buesed, and how
cdontext such aargument s
positioniverb an phrase, subject/objectdistinction,and socnanbveal 26
uable. For
Usingtypedphrasesegmenttagsuniformlyincreases regularityonatural f textMEDLINEextractiontasks
the
0.7 0.6
SWI-Ratio
0.5 0.4
notags tags
0.3 0.2 0.1 0 protein
location
gene
disease
Dataset
Figure 5:
Average SWIratio for the four naturalextractiontasks,withand phrasesegments. Meansare shown withstandarderror bars.
withoutthe use otyped f
example, Charniak (2001) has shown that probabilistically parsing sentences is greatly aided by conditioning on information about the linguistic head of the current phrase, even if the head is several tokens away in the flat representation. Charniak's finding is evidence that linguistic head information is an important source of regularity, which is exactly what rule-based IE methods are designed to exploit.

Unfortunately, the existing formulation of boundary detectors is not well equipped to represent or handle grammatical information. While inserting XML tags to indicate typed phrase segments is useful, this approach is unprincipled, as it loses the distinction between meta-level information and the text itself. The knowledge representation used by boundary detectors should be extended to express regularities in implicit higher-level information as well as in explicit token information in a document. When using typed phrase segments, we can increase performance without changing the simple, flat representation currently used. However, this approach cannot make recursive structure explicit, so it can only be taken so far.
8.2 Handling XML or HTML structure and information

Just as the grammatical structure of a sentence presents opportunities for exploiting structural regularities, the hierarchical structure and attribute information available in an XML or HTML document also contain important clues for locating target fields. However, this information is essentially lost when an XML document is parsed as a linear sequence of tokens instead of using a document object model (DOM). For example, tags that should be treated as single tokens are instead broken into pieces. More problematic, however, is the fact that many tags contain namespace references and attribute-value pairs inside the starting tag. When using flat tokenization, lookahead becomes a serious problem, and there is no way to intelligently generalize over such tags, e.g. to match a tag with the same name, to require specific attributes and/or values, etc.
Muslea et al. (1999) made some important contributions to solving this problem with STALKER, which constructs an "embedded catalog" for web pages (a conceptual hierarchy of content in a document) that allows it to take advantage of global landmarks for finding and extracting text fields. They accomplish this by use of the "Skip-To" operator, which matches a boundary by ignoring all text until a given sequence of tokens and/or wildcards. While the token sequences learned with Skip-To operators are similar to the boundary detectors used in BWI, they can be combined sequentially to form an extraction rule that relies on several distinct text fragments, which can be separated by arbitrary amounts of intermediate text.

Rules in STALKER are learned by starting with the most simple Skip-To rule that matches a given positive example and greedily adding tokens and new Skip-To operators as necessary to eliminate covering any negative examples. This approach avoids the exponential problem of the exhaustive enumeration of boundary detectors used by BWI. The tradeoff, however, is that the resulting boundary detectors only work for pages with a highly regular and consistent structure, and the token sequences learned tend to be short and specific. Furthermore, using Skip-To only approximates the information available in a true DOM representation.

A second approach to exploiting the structural information embedded in web pages and XML documents is due to Yih (1997), who uses hierarchical document templates for IE in web pages, and finds fields by learning their position within the template's node tree. The results presented are impressive, but currently the document templates must be constructed manually from web pages of interest, because the hierarchies in the templates are more subjective than just the HTML parse. Constructing templates may be less of a problem with XML documents, but only if their tag-based structure corresponds in some relevant way to the content structure that is necessary to exploit for information extraction. Similar results are presented by Liu and colleagues with their XWRAP, a rule-based learner for web page information extraction that uses heuristics like font size, block positioning of HTML elements, and so on to construct a document hierarchy and extract specific nodes (Liu et al., 2000).
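A first step toward DOM-aware detectors is simply to keep each start tag as one structured token instead of shattering it into pieces. A minimal sketch, under the assumption of simple double-quoted attributes (real HTML parsing is considerably messier):

```python
import re

TAG = re.compile(r'<\s*([A-Za-z][\w:-]*)((?:\s+[\w:-]+\s*=\s*"[^"]*")*)\s*/?>')
ATTR = re.compile(r'([\w:-]+)\s*=\s*"([^"]*)"')

def tag_token(text):
    """Parse a start tag into (name, attributes) so a wildcard could
    match on the tag name alone, or require specific attribute values."""
    m = TAG.match(text)
    if m is None:
        return None
    return m.group(1).lower(), dict(ATTR.findall(m.group(2)))
```

With tags represented this way, a detector could generalize over "any table cell" by matching the name alone, or specialize by constraining attribute values, neither of which is possible over a flat token stream.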
8.3 Extending the expressiveness of boundary detectors

One quick solution to the problem of incorporating more grammatical and structural information into the existing rule-based IE framework is the development of a more sophisticated set of wildcards. Currently, wildcards only represent word-level syntactic classes, such as "all caps", "begins with a lower-case letter", "contains only digits", and so on. While these are useful generalizations over matching individual tokens, they are also extremely broad. A potential middle ground consists of wildcards that match words of a given linguistic part of speech (e.g. "noun"), a given semantic class (e.g. "location/place"), or a given lexical feature (e.g. specific prefix/suffix, word frequency
threshold, etc.).

In principle, such wildcards can be used in the existing BWI framework, and they can capture regularities that are currently being learned one instance at a time. However, with many of these new types of wildcards, constructing either a predefined set of allowable words or a simple online test for inclusion is not feasible, and instead we need either preprocessing using existing NLP systems, as done by Ray and Craven, or grammatical classifiers that can be applied as needed. Clearly efficiency can be an important issue, particularly during training, when generalizations must be repeatedly considered.

Encouraging results in this direction are already available. For example, Ciravegna's (LP)² system uses word morphology and part-of-speech information to generalize the rules it initially learns (Ciravegna, 2001), a process similar to using lexical and part-of-speech wildcards. Ciravegna's results are comparable to those with other state of the art methods, including BWI. While (LP)² also does rule correction and rule pruning, Ciravegna says that "the use of NLP for generalization" is the most responsible for the performance of the system. Another clever use of generalizations can be found in RAPIER (Califf et al., 1999), which uses WordNet (Miller, 1995) to find hypernyms (a semantic class of words to which the target word belongs), then uses these in its learned extraction rules. Yarowsky (1994) reports that including semantic word classes like "weekday" and "month" to cover sets of tokens improves performance for lexical disambiguation, suggesting that there are indeed useful semantic regularities to be exploited for IE.

HTML and XML wildcards that allow generalizations such as any "font or formatting tag" or any "table cell with a specified background color" could also be employed. These generalizations should not only help increase recall without sacrificing precision, but also help solve the common problem that small changes to a web page "break" learned IE detectors, and the system has to be retrained. For example, if a web site changes the font color of a piece of information, but the boundary detector has learned a general rule that just looks for a non-standard font color, it may well continue to function correctly, while a more specific rule would not.

8.4 Converting BWI match scores into probabilities

While improving the accuracy of IE methods is an important goal, progress
will be limited in its usefulness by the output representation used, so this is another critical target for further research. As mentioned in Section 7.2, BWI and other methods do not attach probabilities to the matches they return. Probability is the lingua franca for combining information processing systems, because probabilities have both absolute and relative meaning, and because there are powerful mathematical frameworks for dealing with them, such as Bayesian evidence combination and decision theory.

Thus, it is worth asking the question: can BWI confidence scores be transformed into probabilities? This question really comes in two pieces. First, are BWI confidence scores meaningful and consistent? That is, does BWI produce incorrect matches with high confidence, or does its accuracy increase with its confidence? Second, can BWI
Figure 6: Precision, recall, and F1 for BWI on the AbstractText task, versus the confidence threshold (0 to 100) below which matches are ignored. The empirical probability of match correctness within each confidence interval is also shown.
learn the correlation between confidence scores and the chance of correctness of a match, and thus automatically calibrate its scores into probabilities?

The first question, whether BWI scores are consistent, can be answered empirically, by training and testing BWI on different document collections. Ideally, most incorrect predictions have low confidence compared to correct predictions. Such a distribution is desirable for two reasons. First, it resembles a true probability distribution, where a higher score is correlated with a higher chance of being correct. Second, it allows users to set a confidence threshold above which to accept predictions, and below which to examine predictions further.

Our results suggest that BWI scores are fairly consistent and amenable to threshold setting for tasks with high regularity, but on hard tasks it is difficult to separate the correct and incorrect predictions based only on match confidences. For the AbstractText task, confidence scores range from 0 to 100, but all incorrect predictions have confidence scores below 20 (Figure 6). This means that if BWI ignores predictions below a confidence threshold of 20, it obtains perfect precision. However, this threshold results in only 59% recall, compared with 95% recall at threshold 0. A confidence threshold of 10 maximizes F1, because most incorrect answers can be pruned without eliminating correct guesses, as can be seen in Figure 6. The confidence scores are behaving roughly
Figure 7: Precision, recall, and F1 for BWI on the YPD-protein task, versus the confidence threshold (0 to 1) below which matches are ignored. The empirical probability of match correctness within each confidence interval is also shown.
as probabilities should, because as confidence increases, so does the fraction of correct guesses. Note that the precision for this task is lower than reported in Table 1 because here we consider all BWI matches, whereas normally BWI eliminates overlapping matches, keeping only the match with the higher confidence.

Unfortunately, the correlation between confidence and probability of correctness is weaker for more difficult tasks. On the protein task, the most difficult as measured both by performance and by SWI ratio, Figure 7 shows that there is no clean way to separate the correct and incorrect guesses. The highest-confidence negative match has score 0.65, but above this threshold BWI achieves only 0.8% recall, because almost all correct matches have confidence scores below 0.35. The highest F1 in this case comes from a confidence threshold of 0, with precision 52% and recall 24%. (These numbers are taken from an individual train/test fold and thus differ slightly from the averages presented in Table 1.) There is no way to improve performance with a non-zero confidence threshold, because there is no smooth transition to a higher density of correct predictions as confidence scores increase.

There is a direct way of calibrating scores into probabilities: bin BWI predictions on a validation set of documents by confidence score, and estimate the probability for each bin as the fraction of correct predictions in the bin. Essentially, if for a given range of confidence scores, say, 75% of predictions are correct, then a prediction on a test document with similar confidence is estimated to have a 75% chance of being correct.
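This binning procedure can be sketched in a few lines. The validation scores below are invented for illustration, and a real implementation would choose the number of bins and handle empty bins more carefully.

```python
# Sketch of histogram-binning calibration: estimate P(correct) for each
# confidence bin on a validation set, then look up new scores by bin.
# The validation data here is invented for illustration.

def calibrate(preds, n_bins=5, max_score=1.0):
    """preds: (confidence, correct) pairs from a validation set.
    Returns one estimated probability per bin (None for empty bins)."""
    width = max_score / n_bins
    bins = [[0, 0] for _ in range(n_bins)]  # [num correct, num total]
    for score, correct in preds:
        i = min(int(score / width), n_bins - 1)
        bins[i][0] += int(correct)
        bins[i][1] += 1
    # probability estimate = fraction of correct predictions in the bin
    return [c / t if t else None for c, t in bins]

def calibrated_prob(score, probs, max_score=1.0):
    """Map a test-time confidence score to its bin's probability."""
    i = min(int(score / (max_score / len(probs))), len(probs) - 1)
    return probs[i]

validation = [(0.05, False), (0.12, False), (0.25, True), (0.30, False),
              (0.45, True), (0.85, True), (0.90, True), (0.95, True)]
probs = calibrate(validation)
```

For a test match with confidence 0.88, `calibrated_prob` returns the fraction of correct validation predictions in the top bin, which is exactly the calibrated probability described above.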
Even for difficult tasks for which confidence scores are not consistent, i.e. there is no smooth increase in probability as scores increase, the calibrated probabilities are still meaningful, because they make explicit the uncertainty in predictions.
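The threshold analysis behind Figures 6 and 7 amounts to recomputing precision, recall, and F1 after dropping matches below each threshold. A minimal sketch, with an invented prediction list, might look like:

```python
# Sketch of the confidence-threshold sweep used in Figures 6 and 7.
# preds: (confidence, correct) pairs; n_true: number of true fields in
# the test documents. All numbers here are invented for illustration.

def sweep(preds, n_true, thresholds):
    results = {}
    for t in thresholds:
        kept = [correct for score, correct in preds if score >= t]
        tp = sum(kept)                              # correct matches kept
        prec = tp / len(kept) if kept else 1.0      # precision over kept matches
        rec = tp / n_true                           # recall over all true fields
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        results[t] = (prec, rec, f1)
    return results

preds = [(90, True), (80, True), (40, True), (15, False), (10, True), (5, False)]
res = sweep(preds, n_true=5, thresholds=[0, 20, 50])
```

In this toy data, raising the threshold from 0 to 20 removes both incorrect matches, so precision rises to 1.0 while recall drops from 0.8 to 0.6, the same precision-recall trade-off that the AbstractText results show.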
9 Concluding remarks

Current information extraction methods are able to discover and exploit regularities in text that come from local word ordering, punctuation usage, and distinctive word-level features like capitalization. This information is often sufficient for partially and highly structured documents because their regularity is at the surface level, in the use and ordering of the words themselves. However, current IE methods perform considerably worse when dealing with natural text documents, because the regularities in these are less apparent. Nevertheless, there are still important regularities in natural text documents, at the grammatical and semantic levels. We have shown that making explicit even limited grammatical information results in considerably higher precision and recall for information extraction from natural text documents.

All IE methods perform differently on document collections of different regularity, but there has been no quantitative analysis in previous papers of the relationship between document regularity and IE accuracy. We propose the SWI ratio as a quantitative measure of document regularity, and we use it to investigate the relationship between regularity and accuracy in greater detail than previously possible. Since the SWI ratio objectively measures the relative regularity of a document collection, it is suitable for comparison of IE tasks on document collections of varying size and content.

While all the IE algorithms we study perform worse on less regular document collections, they still exhibit consistent differences. Many current rule-based IE methods, including the SWI method proposed in this paper, employ some form of set covering to combine multiple extraction rules. These methods are relatively simple to implement, but they are all fundamentally limited in that they remove positive examples as soon as they are covered once. Boosting overcomes this limitation by downweighting covered examples instead of removing them. We have shown that BWI exploits this property to learn additional useful rules even after all examples have been covered, and consistently outperforms set covering methods. Reweighting also helps by focusing BWI on learning specific rules for the exceptional cases missed by general rules, resulting in higher precision. The combination of learning accurate rules, and continuing training to broaden coverage, is the source of the success of BWI.

While this paper focuses on rule-based IE methods, and on BWI in particular, many of our observations hold for information extraction as a whole. For example, while hidden Markov models employ a different representation, they tend, like BWI, to learn regularities that are local and flat, concerning word usage, punctuation usage, and other surface patterns. Thus the sections above about sources of information that are currently ignored are relevant to HMMs as well as to rule-based IE methods. There are many opportunities for improving the usefulness of current information extraction methods. We hope this paper is a contribution in this direction.
Acknowledgements

The authors would like to thank Dayne Freitag for his assistance and for making the BWI code available, Mark Craven for giving us the natural text MEDLINE documents with annotated phrase segments, and MedExpert International, Inc. for its financial support of this research. This work was conducted with support from the California Institute for Telecommunications and Information Technology, Cal-(IT)².
References

Baluja, S., Mittal, V., & Sukthankar, R. (1999). Applying machine learning for high performance named-entity extraction. In Proceedings of the Conference of the Pacific Association for Computational Linguistics, pp. 365-378.

Bikel, D., Miller, S., Schwartz, R., & Weischedel, R. (1997). Nymble: A High-Performance Learning Name-Finder. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pp. 194-201.

Califf, M. E., & Mooney, R. J. (1999). Relational Learning of Pattern-Match Rules for Information Extraction. In Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99), Orlando, FL, pp. 328-334.

Califf, M. E. (1998). Relational Learning Techniques for Natural Language Information Extraction. Artificial Intelligence 1998, pg. 276.

Cardie, C. (2001). Rule Induction and Natural Language Applications of Rule Induction. http://www.cs.cornell.edu/Info/People/cardie/tutorial/tutorial.html.

Charniak, E. (2001). Immediate-Head Parsing for Language Models. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics.

Ciravegna, F. (2001). (LP)², an Adaptive Algorithm for Information Extraction from Web-related Texts. In Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI-2001).

Clark, P., & Niblett, T. (1989). The CN2 Induction Algorithm. Machine Learning 3, pp. 261-283.

Cohen, W. W., & Singer, Y. (1999). A Simple, Fast, and Effective Rule Learner. In Proceedings of the Sixteenth National Conference on Artificial Intelligence, 1999.

Eliassi-Rad, T., & Shavlik, J. (2001). A Theory-Refinement Approach to Information Extraction. In Proceedings of the 18th International Conference on Machine Learning (ICML-2001).

Freitag, D., & Kushmerick, N. (2000). Boosted Wrapper Induction. In Proceedings of the Seventeenth National Conference on Artificial Intelligence, pp. 577-583.

Kushmerick, N. (2000). Wrapper Induction: Efficiency and Expressiveness. Artificial Intelligence 2000, pg. 118.

Liu, L., Pu, C., & Han, W. (2000). XWrap: An XML-Enabled Wrapper Construction System for Web Information Sources. In Proceedings of the International Conference on Data Engineering, 2000.

Margineantu, D. D., & Dietterich, T. G. (1997). Pruning adaptive boosting. In Machine Learning: Proceedings of the Fourteenth International Conference, pp. 211-218.

National Library of Medicine. (2001). The MEDLINE database, 2001. http://www.ncbi.nlm.gov/Pubmed/.

Michalski, S. (1980). Pattern recognition as rule-guided inductive inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2, 349-361.

Miller, G. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(11):39-41.

Muslea, I., Minton, S., & Knoblock, C. (1999). A Hierarchical Approach to Wrapper Induction.

Muslea, I. (1999). Extraction Patterns for Information Extraction: A Survey. In Proceedings of the Sixteenth National Conference on Artificial Intelligence, 1999.

Quinlan, J. R. (1990). Learning logical definitions from relations. Machine Learning, 5:239-266.

Ray, S., & Craven, M. (2001). Representing Sentence Structure in Hidden Markov Models for Information Extraction. In Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI-2001).

Riloff, E. (1996). Automatically Generating Extraction Patterns from Untagged Text. In Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI-96), pp. 1044-1049.

Schapire, R. E. (1999). A Brief Introduction to Boosting. In Proceedings of the 16th International Joint Conference on Artificial Intelligence (IJCAI-1999).

Shinnou, H. (2001). Detection of errors in training data by using a decision list and AdaBoost. In IJCAI-2001 Workshop, "Text Learning: Beyond Supervision", pp. 61-65.

Yarowsky, D. (1994). Decision Lists for Lexical Ambiguity Resolution: Application to Accent Restoration in Spanish and French. In Proceedings of the ACL '94, pp. 77-95.

Yih, W-t. (1997). Template-based Information Extraction from Tree-structured HTML Documents. Master's thesis, National Taiwan University.
Data Set       SWI-Ratio       BWI                     Fixed-BWI               Root-SWI                Greedy-SWI
                           Prec.  Recall  F1      Prec.  Recall  F1      Prec.  Recall  F1      Prec.  Recall  F1
LATimes-cc     0.013       0.996  1.000   0.998   1.000  0.985   0.993   1.000  0.975   0.986   1.000  0.948   0.973
Zagat-addr     0.011       1.000  0.937   0.967   1.000  0.549   0.703   1.000  0.549   0.703   1.000  0.575   0.724
QS-date        0.056       1.000  1.000   1.000   1.000  0.744   0.847   1.000  0.744   0.847   1.000  0.783   0.875
AbstractText   0.149       0.993  0.954   0.973   0.966  0.495   0.654   0.846  0.585   0.685   0.847  0.317   0.447
SA-speaker     0.246       0.791  0.592   0.677   0.887  0.446   0.586   0.777  0.457   0.565   0.904  0.342   0.494
SA-location    0.157       0.854  0.696   0.767   0.927  0.733   0.818   0.800  0.766   0.780   0.924  0.647   0.759
SA-stime       0.098       0.996  0.996   0.996   0.991  0.949   0.969   0.975  0.952   0.964   0.979  0.842   0.902
SA-etime       0.077       0.944  0.949   0.939   0.993  0.818   0.892   0.912  0.793   0.843   0.987  0.813   0.885
Jobs-id        0.068       1.000  1.000   1.000   1.000  0.956   0.978   1.000  0.956   0.978   0.996  0.829   0.902
Jobs-company   0.246       0.884  0.701   0.782   0.955  0.733   0.824   0.794  0.838   0.784   0.904  0.751   0.802
Jobs-title     0.381       0.596  0.432   0.501   0.661  0.479   0.547   0.480  0.658   0.546   0.660  0.477   0.549
YPD-protein    0.651       0.567  0.239   0.335   0.590  0.219   0.319   0.516  0.154   0.228   0.594  0.134   0.218
YPD-location   0.478       0.738  0.446   0.555   0.775  0.418   0.542   0.633  0.246   0.347   0.774  0.240   0.365
OMIM-gene      0.534       0.655  0.368   0.470   0.826  0.480   0.606   0.469  0.249   0.324   0.646  0.199   0.304
OMIM-disease   0.493       0.707  0.428   0.532   0.741  0.411   0.528   0.487  0.251   0.327   0.785  0.241   0.369
Table 1: SWI-Ratio and performance of the four IE methods examined on the 15 extraction tasks.

Data set     BWI    SWI     Root-SWI
cc           50     5.1     3.6
addr         50     1.3     1.3
date         50     1.0     1.0
abstract     500    51.8    48.4
speaker      500    139.4   110.7
location     500    75.6    66.8
stime        500    72.1    62.0
etime        500    25.0    17.9
id           500    15.2    1.0
comp.        500    46.9    46.3
title        500    132.1   119.6
protein      500    333.1   322.5
location     500    235.0   229.7
gene         500    357.0   344.6
disease      500    280.1   278.8

Table 2: Number of boosting iterations used by BWI, SWI, and Root-SWI on the 15 data sets.