ABL: Alignment-Based Learning

Menno van Zaanen
School of Computer Studies
University of Leeds
LS2 9JT Leeds, UK
menno@scs.leeds.ac.uk
Abstract

This paper introduces a new type of grammar learning algorithm, inspired by string edit distance (Wagner and Fischer, 1974). The algorithm takes a corpus of flat sentences as input and returns a corpus of labelled, bracketed sentences. The method works on pairs of unstructured sentences that have one or more words in common. When two sentences are divided into parts that are the same in both sentences and parts that are different, this information is used to find parts that are interchangeable. These parts are taken as possible constituents of the same type. After this alignment learning step, the selection learning step selects the most probable constituents from all possible constituents. This method was used to bootstrap structure on the ATIS corpus (Marcus et al., 1993) and on the OVIS¹ corpus (Bonnema et al., 1997). While the results are encouraging (we obtained up to 89.25 % non-crossing brackets precision), this paper will point out some of the shortcomings of our approach and will suggest possible solutions.

1 Introduction
Unsupervised learning of syntactic structure is one of the hardest problems in NLP. Although people are adept at learning grammatical structure, it is difficult to model this process and therefore it is hard to make a computer learn structure.
We do not claim that the algorithm described here models the human process of language learning. Instead, the algorithm should, given unstructured sentences, find the best structure. This means that the algorithm should assign structure to sentences which is similar to the structure people would give to sentences, but not necessarily within the same time or space restrictions.

The algorithm consists of two phases. The first phase is a constituent generator, which generates a motivated set of possible constituents by aligning sentences. The second phase restricts this set by selecting the best constituents from the set.

The rest of this paper is organized as follows. Firstly, we will start by describing previous work in machine learning of language structure and then we will give a description of the ABL algorithm. Next, some results of applying the ABL algorithm to different corpora will be given, followed by a discussion of the algorithm and future research.

¹ Openbaar Vervoer Informatie Systeem (OVIS) stands for Public Transport Information System.
2 Previous Work
Learning methods can be grouped into supervised and unsupervised methods. Supervised methods are initialised with structured input (i.e. structured sentences for grammar learning methods), while unsupervised methods learn by using unstructured data only. In practice, supervised methods outperform unsupervised methods, since they can adapt their output based on the structured examples in the initialisation phase whereas unsupervised methods cannot. However, it is worthwhile to investigate unsupervised grammar learning methods, since "the costs of annotation are prohibitively time and expertise intensive, and the resulting corpora may be too susceptible to restriction to a particular domain, application, or genre" (Kehler and Stolcke, 1999). There have been several approaches to the unsupervised learning of syntactic structures. We will give a short overview here.
Memory based learning (MBL) keeps track of possible contexts and assigns word types based on that information (Daelemans, 1995). Redington et al. (1998) present a method that bootstraps syntactic categories using distributional information, and Magerman and Marcus (1990) describe a method that finds constituent boundaries using mutual information values of the part of speech n-grams within a sentence. Algorithms that use the minimum description length (MDL) principle build grammars that describe the input sentences using the minimal number of bits. This idea stems from information theory. Examples of these systems can be found in (Grünwald, 1994) and (de Marcken, 1996).
What is a family fare
What is the payload of an African Swallow
What is (a family fare)X
What is (the payload of an African Swallow)X

Figure 1: Example bootstrapping structure

For each sentence s1 in the corpus:
    For every other sentence s2 in the corpus:
        Align s1 to s2
        Find the identical and distinct parts between s1 and s2
        Assign non-terminals to the constituents (i.e. distinct parts of s1 and s2)

Figure 2: Alignment learning algorithm
The system by Wolff (1982) performs a heuristic search while creating and merging symbols directed by an evaluation function. Chen (1995) presents a Bayesian grammar induction method, which is followed by a postpass using the inside-outside algorithm (Baker, 1979; Lari and Young, 1990). Most work described here cannot learn complex structures such as recursion, while other systems only use limited context to find constituents. However, the two phases in ABL are closely related to some previous work. The alignment learning phase is effectively a compression technique comparable to MDL or Bayesian grammar induction methods. ABL remembers all possible constituents, building a search space. The selection learning phase searches this space, directed by a probabilistic evaluation function.
3 Algorithm

We will describe an algorithm that learns structure using a corpus of plain (unstructured) sentences. It does not need a structured training set to initialize; all structural information is gathered from the unstructured sentences. The output of the algorithm is a labelled, bracketed version of the input corpus. Although the algorithm does not generate a (context-free) grammar, it is trivial to deduce one from the structured corpus.

The algorithm builds on Harris's idea (1951) that states that constituents of the same type can be replaced by each other. Consider the sentences as shown in figure 1.² The constituents a family fare and the payload of an African Swallow both have the same syntactic type (they are both NPs), so they can be replaced by each other. This means that when the constituent in the first sentence is replaced by the constituent in the second sentence, the result is a valid sentence in the language; it is the second sentence. The main goal of the algorithm is to establish that a family fare and the payload of an African Swallow are constituents and have the same type. This is done by reversing Harris's idea: if (a group of) words can be replaced by each other, they are constituents and have the same type. So the algorithm now has to find groups of words that can be replaced by each other and after replacement still generate valid sentences.

The algorithm consists of two steps:

1. Alignment Learning
2. Selection Learning

3.1 Alignment Learning
The model learns by comparing all sentences in the input corpus to each other in pairs. An overview of the algorithm can be found in figure 2. Aligning sentences results in "linking" identical words in the sentences. Adjacent linked words are then grouped. This process reveals the groups of identical words, but it also uncovers the groups of distinct words in the sentences. In figure 1 What is is the identical part of the sentences and a family fare and the payload of an African Swallow are the distinct parts. The distinct parts are interchangeable, so they are determined to be constituents of the same type. We will now explain the steps in the alignment learning phase in more detail.

² All sentences in the examples can be found in the ATIS corpus.
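To make the control flow of figure 2 concrete, the following sketch gives the outer loop of the alignment learning phase in Python. It is a minimal illustration rather than the original implementation; the helper functions align_words and distinct_groups are assumptions that are sketched in the next subsections, and constituents are represented as (begin, end, type) triples with exclusive end positions.

def alignment_learning(corpus):
    # corpus: list of sentences, each a list of word strings.
    # Returns one set of (begin, end, type) constituents per sentence.
    constituents = [set() for _ in corpus]
    next_type = 0
    for i, s1 in enumerate(corpus):
        for j, s2 in enumerate(corpus):
            if i == j:
                continue
            links = align_words(s1, s2)          # linked identical words
            for (b1, e1), (b2, e2) in distinct_groups(links, len(s1), len(s2)):
                # each pair of distinct parts becomes a pair of possible
                # constituents of the same (new) type
                constituents[i].add((b1, e1, next_type))
                constituents[j].add((b2, e2, next_type))
                next_type += 1
    return constituents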
3.1.1 Edit Distance
To find the identical word groups in the sentences, we use the edit distance algorithm by Wagner and Fischer (1974), which finds the minimum number of edit operations (insertion, deletion and substitution) needed to change one sentence into the other. Identical words in the sentences can be found at places where no edit operation was applied. The instantiation of the algorithm that finds the longest common subsequence in two sentences sometimes "links" words that are too far apart: in figure 3, when besides the occurrences of from, the occurrences of San Francisco or Dallas are linked, this results in unintended constituents. We would rather have the model linking to, resulting in a structure with the noun phrases grouped with the same type correctly. Linking San Francisco or Dallas results in constituents that vary widely in size. This stems from the large distance between the linked words in the first sentence and in the second sentence. This type of alignment can be ruled out by biasing the cost function using distances between words.

from ()1 San Francisco (to Dallas)2
from (Dallas to)1 San Francisco ()2

from (San Francisco to)1 Dallas ()2
from ()1 Dallas (to San Francisco)2

from (San Francisco)1 to (Dallas)2
from (Dallas)1 to (San Francisco)2

Figure 3: Ambiguous alignments
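As an illustration of this step, the longest common subsequence instantiation can be computed with standard dynamic programming; the sketch below returns the positions of linked identical words. It is not the original implementation, and it deliberately omits the distance-biased cost function proposed above, so it can produce exactly the kind of ambiguous alignments shown in figure 3.

def align_words(s1, s2):
    # lcs[i][j] is the length of the longest common subsequence of s1[i:] and s2[j:].
    n, m = len(s1), len(s2)
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if s1[i] == s2[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    # Trace back, collecting the linked (identical) word positions.
    links, i, j = [], 0, 0
    while i < n and j < m:
        if s1[i] == s2[j]:
            links.append((i, j))
            i, j = i + 1, j + 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            i += 1
        else:
            j += 1
    return links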
3.1.2 Grouping
An edit distance algorithm links identical words in two sentences. When adjacent words are linked in both sentences, they can be grouped. A group like this is a part of a sentence that can also be found in the other sentence. (In figure 1, What is is a group like this.) The rest of the sentences can also be grouped. The words in these groups are words that are distinct in the two sentences. When all of these groups from sentence one would be replaced by the respective groups of sentence two, sentence two is generated. (a family fare and the payload of an African Swallow are of this type of group in figure 1.) Each pair of these distinct groups consists of possible constituents of the same type.³ As can be seen in figure 3, it is possible that empty groups can be learned.

³ Since the algorithm does not know any (linguistic) names for the types, the algorithm chooses natural numbers to denote different types.
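Given the links, the grouping step can be sketched as follows: the gaps between linked words (possibly empty on one side) form the pairs of distinct, interchangeable groups. The span representation is an assumption carried over from the sketches above, not the paper's own data structure.

def distinct_groups(links, len1, len2):
    # links: (i, j) positions of linked identical words, in order.
    # Returns pairs of spans ((b1, e1), (b2, e2)) of the distinct parts;
    # end positions are exclusive and a span may be empty on one side.
    if not links:
        return []          # completely distinct sentences: nothing is learned
    pairs = []
    prev1, prev2 = 0, 0
    for i, j in links + [(len1, len2)]:
        if (prev1, prev2) != (i, j):
            pairs.append(((prev1, i), (prev2, j)))
        prev1, prev2 = i + 1, j + 1
    return pairs

For the two sentences of figure 1, align_words links What and is, and distinct_groups then returns the single pair a family fare / the payload of an African Swallow.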
3.1.3 Existing Constituents
At some point it may be possible that the model learns a constituent that was already stored. This may happen when a new sentence is compared to a sentence in the partially structured corpus. In this case, no new type is introduced, but the constituent in the new sentence gets the same type as the constituent in the sentence in the partially structured corpus.

It may even be the case that a partially structured sentence is compared to another partially structured sentence. This occurs when a sentence that contains some structure, which was learned by comparing to a sentence in the partially structured corpus, is compared to another (partially structured) sentence. When the comparison of these two sentences yields a constituent that was already present in both sentences, the types of these constituents are merged. All constituents of these types are updated, so they have the same type.

By merging types of constituents we make the assumption that constituents in a certain context can only have one type. In section 5.2 we discuss the implications of this assumption and propose an alternative approach.
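One way to implement the merging of types is a union-find over type identifiers, so that updating "all constituents of these types" reduces to looking up a canonical type. The class below is a sketch of such bookkeeping, not the original implementation.

class TypeMerger:
    # Union-find over constituent type identifiers.
    def __init__(self):
        self.parent = {}

    def find(self, t):
        # Follow parent pointers to the canonical type (with path compression).
        self.parent.setdefault(t, t)
        while self.parent[t] != t:
            self.parent[t] = self.parent[self.parent[t]]
            t = self.parent[t]
        return t

    def merge(self, t1, t2):
        # Called when a constituent already present in both partially
        # structured sentences carries two different types.
        r1, r2 = self.find(t1), self.find(t2)
        if r1 != r2:
            self.parent[r2] = r1
        return r1

With this bookkeeping, a stored type is always interpreted through find, so merged types are automatically shared by every constituent that carries them.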
3.2 Selection Learning
( Book Delta 128 ) from Dallas to Boston
( Give me ( all flights ) from Dallas to Boston )
Give me ( help on classes )

Figure 4: Overlapping constituents

The first step in the algorithm may at some point generate constituents that overlap with other constituents. In figure 4 Give me all flights from Dallas to Boston receives two overlapping structures. One constituent is learned by comparing against Book Delta 128 from Dallas to Boston and the other (overlapping) constituent is found by aligning with Give me help
on classes.

The solution to this problem has to do with selecting the correct constituents (or at least the better constituents) out of the possible constituents. Selecting constituents can be done in several different ways.

ABL:incr Assume that the first constituent learned is the correct one. This means that when a new constituent overlaps with older constituents, it can be ignored (i.e. it is not stored in the corpus).

ABL:leaf The model computes the probability of a constituent by counting the number of times the particular words of the constituent have occurred in the learned text as a constituent, normalized by the total number of constituents:

P_{leaf}(c) = \frac{|\{c' \in C : yield(c') = yield(c)\}|}{|C|}

where C is the entire set of constituents.

ABL:branch In addition to the words of the sentence delimited by the constituent, the model computes the probability based on the part of the sentence delimited by the words of the constituent and its non-terminal (i.e. a normalised probability of ABL:leaf):

P_{branch}(c \mid root(c) = r) = \frac{|\{c' \in C : yield(c') = yield(c) \wedge root(c') = r\}|}{|\{c'' \in C : root(c'') = r\}|}
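Both probabilities reduce to counting over the set of learned constituents. The sketch below assumes each constituent is represented as a (yield, root) pair, with the yield given as a tuple of words; this representation is ours, not the paper's.

from collections import Counter

def leaf_and_branch_probabilities(constituents):
    # constituents: list of (yield_words, root) pairs drawn from the
    # learned corpus; yield_words is a tuple of words, root a type label.
    total = len(constituents)
    yield_counts = Counter(y for y, _ in constituents)
    root_counts = Counter(r for _, r in constituents)
    pair_counts = Counter(constituents)

    p_leaf = {c: yield_counts[c[0]] / total for c in pair_counts}
    p_branch = {c: pair_counts[c] / root_counts[c[1]] for c in pair_counts}
    return p_leaf, p_branch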
The first method is non-probabilistic and may be applied every time a constituent is found that overlaps with a known constituent (i.e. while learning). The two other methods are probabilistic. The model computes the probability of the constituents and then uses that probability to select constituents with the highest probability. These methods are applied after the alignment learning phase, since more specific information (in the form of better counts) can be found at that time. In section 4 we will evaluate all three methods on the ATIS and OVIS corpus.

3.2.1 Viterbi
Since more than just two constituents can overlap, all possible combinations of overlapping constituents should be considered when computing the best combination of constituents, which is the product of the probabilities of the separate constituents as in SCFGs (cf. (Booth, 1969)). A Viterbi style algorithm optimization (Viterbi, 1967) is used to efficiently select the best combination of constituents. When computing the probability of a combination of constituents, multiplying the separate probabilities of the constituents biases towards a low number of constituents. Therefore, we compute the probability of a set of constituents using a normalized version, the geometric mean⁴, rather than its product (Caraballo and Charniak, 1998).
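To make the selection criterion explicit, the sketch below scores every maximal set of mutually non-crossing constituents of one sentence by its geometric mean and returns the best one. The maximality criterion is an assumption on our side, and the enumeration is exhaustive where the paper's Viterbi-style optimisation selects the best combination far more efficiently.

from itertools import combinations
from math import prod

def crosses(c1, c2):
    # Two bracketed spans cross when they overlap without nesting.
    b1, e1 = c1[0], c1[1]
    b2, e2 = c2[0], c2[1]
    return b1 < b2 < e1 < e2 or b2 < b1 < e2 < e1

def best_combination(constituents):
    # constituents: list of (begin, end, probability) for one sentence.
    best, best_score = [], -1.0
    for size in range(len(constituents), 0, -1):
        for subset in combinations(constituents, size):
            if any(crosses(a, b) for a, b in combinations(subset, 2)):
                continue
            rest = [c for c in constituents if c not in subset]
            # keep only maximal sets: nothing left out could still be added
            if any(all(not crosses(c, s) for s in subset) for c in rest):
                continue
            score = prod(p for _, _, p in subset) ** (1.0 / size)
            if score > best_score:
                best, best_score = list(subset), score
    return best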
4 Results
The three different ABL algorithms and two baseline systems have been tested on the ATIS and OVIS corpora. The ATIS corpus from the Penn Treebank consists of 716 sentences containing 11,777 constituents. The larger OVIS corpus is a Dutch corpus containing sentences on travel information. It consists of exactly 10,000 sentences. We have removed all sentences containing only one word, resulting in a corpus of 6,797 sentences and 48,562 constituents.

The sentences of the corpora are stripped of their structures. These plain sentences are used in the learning algorithms and the resulting structure is compared to the structure of the original corpus. All ABL methods are tested ten times. The ABL:incr method is applied to random orders of the input corpus. The probabilistic ABL methods select constituents at random when different combinations of constituents have the same probability. The results in table 1 show the mean and standard deviations (between brackets).

⁴ The geometric mean of a set of constituents c_1, \ldots, c_n is P(c_1 \wedge \ldots \wedge c_n) = \sqrt[n]{\prod_{i=1}^{n} P(c_i)}.
ATIS          NCBP           NCBR           ZCS
LEFT          32.60          76.82           1.12
RIGHT         82.70          92.91          38.83
ABL:INCR      83.24 (1.17)   87.21 (0.67)   18.56 (2.32)
ABL:LEAF      81.42 (0.11)   86.27 (0.06)   21.63 (0.50)
ABL:BRANCH    85.31 (0.01)   89.31 (0.01)   29.75 (0.00)

OVIS          NCBP           NCBR           ZCS
LEFT          51.23          73.17          25.22
RIGHT         75.85          86.66          48.08
ABL:INCR      88.71 (0.79)   84.36 (1.10)
ABL:LEAF      85.32 (0.02)   79.96 (0.03)   30.87 (0.09)
ABL:BRANCH    89.25 (0.00)   85.04 (0.00)   42.20 (0.01)

Table 1: Results of the ATIS and OVIS corpora
The two baseline systems, left and right, only build left and right branching trees respectively. Three metrics have been computed. NCBP stands for Non-Crossing Brackets Precision, which denotes the percentage of learned constituents that do not overlap with any constituents in the original corpus. NCBR is the Non-Crossing Brackets Recall and shows the percentage of constituents in the original corpus that do not overlap with any constituents in the learned corpus. Finally, ZCS stands for Zero-Crossing Sentences and represents the percentage of sentences that do not have any overlapping constituents.
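These three measures can be computed directly from the span sets of the learned and original trees; a minimal sketch, assuming each parse is given as a collection of (begin, end) spans:

def non_crossing(span, others):
    # True if span does not cross any span in others (nesting is allowed).
    b1, e1 = span
    return all(not (b1 < b2 < e1 < e2 or b2 < b1 < e2 < e1)
               for b2, e2 in others)

def evaluate(learned, original):
    # learned, original: lists of parses (one per sentence), each parse a
    # list of (begin, end) spans. Returns (NCBP, NCBR, ZCS) in percent.
    ncbp_hits = sum(non_crossing(s, op)
                    for lp, op in zip(learned, original) for s in lp)
    ncbr_hits = sum(non_crossing(s, lp)
                    for lp, op in zip(learned, original) for s in op)
    zcs_hits = sum(all(non_crossing(s, op) for s in lp)
                   for lp, op in zip(learned, original))
    ncbp = 100.0 * ncbp_hits / sum(len(p) for p in learned)
    ncbr = 100.0 * ncbr_hits / sum(len(p) for p in original)
    zcs = 100.0 * zcs_hits / len(learned)
    return ncbp, ncbr, zcs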
4.1 Evaluation
The incr model performs quite well considering the fact that it cannot recover from incorrect constituents, with a precision and recall of over 80 %. The order of the sentences, however, is quite important, since the standard deviation of the incr model is quite high (especially with the ZCS, reaching 3.22 % on the OVIS corpus). We expected the probabilistic methods to perform better, but the leaf model performs slightly worse. The ZCS, however, is somewhat better, resulting in 21.63 % on the ATIS corpus. Furthermore, the standard deviations of the leaf model (and of the branch model) are close to 0 %. The statistical methods generate more precise results. The branch model clearly outperforms all other models. Using more specific statistics generates better results. Although the results of the ATIS corpus and OVIS corpus differ, the conclusions that can be reached are similar.
4.2 ABL Compared to Other Methods
It is difficult to compare the results of the ABL model against other methods, since often different corpora or metrics are used. The method described by Pereira and Schabes (1992) comes reasonably close to ours. Their unsupervised method learns structure on plain sentences from the ATIS corpus resulting in 37.35 % precision, while the unsupervised ABL significantly outperforms this method, reaching 85.31 % precision. Only their supervised version results in a slightly higher precision of 90.36 %.

The system that simply builds right branching structures results in 82.70 % precision and 92.91 % recall on the ATIS corpus, where ABL got 85.31 % and 89.31 %. This was expected, since English is a right branching language; a left branching system performed much worse (32.60 % precision and 76.82 % recall). Conversely, right branching would not do very well on a Japanese corpus (a left branching language). Since ABL does not have a preference for direction built in, we expect ABL to perform similarly on a Japanese corpus.
5 Discussion and Future Extensions

5.1 Recursion
All ABL methods described here can learn recursive structures, and such structures have been found when applying ABL to the ATIS and OVIS corpus. As can be seen in figure 5, the learned recursive structure is similar to the original structure. Some structure has been removed to make it easier to see where the recursion occurs. Roughly, recursive structures are built in two steps. First, the algorithm generates the structure with different non-terminals. Then, the two non-terminals are merged as described in section 3.1.3. The merging of the non-terminals may occur anywhere in the corpus, since all merged non-terminals are updated.
learned:   Please explain the (field FLT DAY in the (table)X)X
original:  Please explain (the field FLT DAY in (the table)NP)NP
learned:   Explain classes QW and (QX and (Y)X)X
original:  Explain classes ((QW)NP and (QX)NP and (Y)NP)NP

Figure 5: Recursive structures learned in the ATIS corpus
Show me the ( morning )X flights
Show me the ( nonstop )X flights

Figure 6: Wrong syntactic type
5.2 Wrong Syntactic Type
In section 3.1.3 we made the assumption that a constituent in a certain context can only have one type. This assumption introduces some problems. The sentence John likes visiting relatives illustrates such a problem. The constituent visiting relatives can be a noun phrase or a verb phrase. Another problem is illustrated in figure 6. When applying the ABL learning algorithm to these sentences, it will determine that morning and nonstop are of the same type. Unfortunately, morning is a noun, while nonstop is an adverb.⁵

A future extension will not only look at the type of the constituents, but also at the context of the constituents. In the example, the constituent morning may also take the place of a subject position in other sentences, but the constituent nonstop never will. This information can be used to determine when to merge constituent types, effectively loosening the assumption.
5.3 Weakening Exact Match
When the ABL algorithms try to learn with two completely distinct sentences, nothing can be learned. If we weaken the exact match between words in the alignment step of the algorithm, it is possible to learn structure even with distinct sentences. Instead of linking exactly matching words, the algorithm should match words that are equivalent. An obvious way of implementing this is by making use of equivalence classes. (See for example (Redington et al., 1998).) The idea behind equivalence classes is that words which are closely related are grouped together. A big advantage of equivalence classes is that they can be learned in an unsupervised way, so the resulting algorithm remains unsupervised.

Words that are in the same equivalence class are said to be sufficiently equivalent, so the alignment algorithm may assume they are similar and may thus link them. Now sentences that do not have words in common, but do have words in the same equivalence class in common, can be used to learn structure. When using equivalence classes, more constituents are learned and more terminals in constituents may be seen as similar (according to the equivalence classes). This results in a much richer structured corpus.

⁵ Harris's implication does hold in these sentences: nonstop can also be replaced by for example cheap (another adverb) and morning can be replaced by evening (another noun).
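In terms of the alignment sketch given earlier, the weakened match only changes the word-equality test. The helper below assumes the equivalence classes are supplied as a mapping from words to class identifiers, however they were induced.

def equivalent(w1, w2, word_class):
    # Words match if identical or if both belong to the same equivalence class.
    if w1 == w2:
        return True
    c1, c2 = word_class.get(w1), word_class.get(w2)
    return c1 is not None and c1 == c2

Replacing the test s1[i] == s2[j] in align_words by equivalent(s1[i], s2[j], word_class) then lets sentences without words in common still be aligned through class-mates such as morning and evening.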
5.4 Alternative Statistics
At the moment we have tested two different ways of computing the probability of a constituent: ABL:leaf, which computes the probability of the occurrence of the terminals in a constituent, and ABL:branch, which computes the probability of the occurrence of the terminals together with the root non-terminal in a constituent, based on the learned corpus. Of course, other models can be implemented. One interesting possibility takes a DOP-like approach (Bod, 1998), which also takes into account the inner structure of the constituents.
6 Conclusion
We have introduced a new grammar learning algorithm based on comparing and aligning plain sentences; neither pre-labelled or bracketed sentences, nor pre-tagged sentences are used. It uses distinctions between sentences to find possible constituents and afterwards selects the most probable ones. The output of the algorithm is a structured version of the corpus.

By taking entire sentences into account, the context used by the model is not limited by window size; instead, arbitrarily large contexts are used. Furthermore, the model has the ability to learn recursion.

Three different instances of the algorithm have been applied to two corpora of different size, the ATIS corpus (716 sentences) and the OVIS corpus (6,797 sentences), generating promising results. Although the OVIS corpus is almost ten times the size of the ATIS corpus, these corpora describe a small conceptual domain. We plan to apply the algorithms to larger domain corpora in the near future.
References

J. K. Baker. 1979. Trainable grammars for speech recognition. In J. J. Wolf and D. H. Klatt, editors, Speech Communication Papers for the Ninety-seventh Meeting of the Acoustical Society of America, pages 547-550.

Rens Bod. 1998. Beyond Grammar: An Experience-Based Theory of Language. Stanford, CA: CSLI Publications.

R. Bonnema, R. Bod, and R. Scha. 1997. A DOP model for semantic interpretation. In Proceedings of the Association for Computational Linguistics/European Chapter of the Association for Computational Linguistics, Madrid, pages 159-167. Somerset, NJ: Association for Computational Linguistics.

T. Booth. 1969. Probabilistic representation of formal languages. In Conference Record of 1969 Tenth Annual Symposium on Switching and Automata Theory, pages 74-81.

Sharon A. Caraballo and Eugene Charniak. 1998. New figures of merit for best-first probabilistic chart parsing. Computational Linguistics, 24(2):275-298.

Stanley F. Chen. 1995. Bayesian grammar induction for language modeling. In Proceedings of the Association for Computational Linguistics, pages 228-235.

Walter Daelemans. 1995. Memory-based lexical acquisition and processing. In P. Steffens, editor, Machine Translation and the Lexicon, volume 898 of Lecture Notes in Artificial Intelligence, pages 85-98. Berlin: Springer Verlag.

Carl G. de Marcken. 1996. Unsupervised Language Acquisition. Ph.D. thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, September.

Peter Grünwald. 1994. A minimum description length approach to grammar inference. In G. Scheler, S. Wernter, and E. Riloff, editors, Connectionist, Statistical and Symbolic Approaches to Learning for Natural Language, volume 1004 of Lecture Notes in AI, pages 203-216. Berlin: Springer Verlag.

Zellig Harris. 1951. Methods in Structural Linguistics. Chicago, IL: University of Chicago Press.

Andrew Kehler and Andreas Stolcke. 1999. Preface. In A. Kehler and A. Stolcke, editors, Unsupervised Learning in Natural Language Processing. Association for Computational Linguistics. Proceedings of the workshop.

K. Lari and S. J. Young. 1990. The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech and Language, 4:35-56.

D. Magerman and M. Marcus. 1990. Parsing natural language using mutual information statistics. In Proceedings of the National Conference on Artificial Intelligence, pages 984-989. Cambridge, MA: MIT Press.

M. Marcus, B. Santorini, and M. Marcinkiewicz. 1993. Building a large annotated corpus of English: the Penn treebank. Computational Linguistics, 19(2):313-330.

F. Pereira and Y. Schabes. 1992. Inside-outside reestimation from partially bracketed corpora. In Proceedings of the Association for Computational Linguistics, pages 128-135, Newark, Delaware.

Martin Redington, Nick Chater, and Steven Finch. 1998. Distributional information: A powerful cue for acquiring syntactic categories. Cognitive Science, 22(4):425-469.

A. Viterbi. 1967. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. Institute of Electrical and Electronics Engineers Transactions on Information Theory, 13:260-269.

Robert A. Wagner and Michael J. Fischer. 1974. The string-to-string correction problem. Journal of the Association for Computing Machinery, 21(1):168-173, January.

J. G. Wolff. 1982. Language acquisition, data compression and generalization. Language & Communication, 2:57-89.