Simple Unsupervised Identification of Low-level Constituents

Elias Ponvert, Jason Baldridge and Katrin Erk
Department of Linguistics
The University of Texas at Austin
1 University Station, Austin, Texas 78712
Email: {ponvert,jbaldrid,katrin.erk}@mail.utexas.edu

Abstract—We present an approach to unsupervised partial parsing: the identification of low-level constituents (which we dub clumps) in unannotated text. We begin by showing that CCLParser [1], an unsupervised parsing model, is particularly adept at identifying clumps, and that, surprisingly, building a simple right-branching structure above its clumps actually outperforms the full parser itself. This indicates that much of the CCLParser’s performance comes from good local predictions. Based on this observation, we define a simple bigram model that is competitive with CCLParser for clumping, which further illustrates how important this level of representation is for unsupervised parsing.

I. INTRODUCTION

The aim of unsupervised grammar induction (UGI) is to induce the syntactic structure (both word-word dependencies and phrase structures) of text in the input language without the use of training data with annotated syntactic structures. Klein and Manning [2] established the first UGI methods that actually surpassed a right-branching baseline; more recent models have greatly improved on Klein and Manning's results for word-word dependencies [3], [4]. However, these models generally assume that the gold-standard parts-of-speech are known and may be used to predict higher-level structure. In contrast, Gao et al. [5], [6] and Seginer [1], [7] present grammar induction models which learn from text alone, without requiring part-of-speech annotations.

CCLParser [1] is currently the best-performing model for UGI of phrase structures from raw text. However, as we show, much of the performance of CCLParser is based on quite local structural predictions that identify relatively high-quality low-level constituents, or clumps. In fact, CCLParser has already been used in exactly this capacity: Davidov et al. [8] used local structure predicted by CCLParser as a basis for word-concept acquisition. Furthermore, we find a most surprising result: by extracting the clumps predicted by CCLParser and building a punctuation-obeying right-branching structure above them, we obtain better unlabeled F-scores than using the higher-level structure actually predicted by CCLParser (for English and Chinese). These observations suggest a route to obtaining simpler UGI algorithms that focus exclusively on identifying clumps directly from text, which we term unsupervised partial parsing (UPP).

Partial parsing (also known as chunking or segmentation) is the identification of low-level constituents in text, without identifying hierarchical syntactic structure. We distinguish clumps from chunks: chunks are usually specified to be of a homogeneous phrasal type (e.g. NP chunks), and can include one-word chunks. Clumps are defined structurally, as described below; they must have more than one word, but can include some phrasal heterogeneity not usually seen in chunking datasets, such as to London and for one. There has been plenty of work on supervised partial parsing, especially following the 1999 and 2000 CoNLL shared tasks [9], but we are not aware of work that specifically focuses on unsupervised techniques.

Our repurposing of CCLParser shows that reasonably good clumps can indeed be extracted without annotations. However, CCLParser is a complex algorithm with detailed heuristics guiding its attachment decisions. Looking at it from the perspective of UPP suggests massive simplifications that may retain much of the effect of the model while simultaneously giving us greater insight into why it works as well as it does. In this vein, we define an extremely simple (and extremely fast) method for clumping that is built on just two pieces of information: co-occurrence counts between pairs of words, and co-occurrence counts of words with explicit boundaries such as punctuation and sentence boundaries. This rivals CCLParser at partial parsing and, in combination with right-branching higher-level structure, is competitive with CCLParser on unlabeled PARSEVAL scores, even better in the case of Chinese.

This paper makes the following contributions. We identify a worthwhile but unexplored problem for natural language processing: unsupervised partial parsing. We analyze the performance of CCLParser at this task, and find that it often performs better at this than at identifying higher-level sentence structure, when compared to a simple baseline model. Based on these observations, we propose targeting this level of structure directly, and provide a simple and fast model as a starting point for that project.

In addition to the analysis this work provides into the performance of CCLParser, unsupervised partial parsing has the potential to aid in several semantic computing tasks. One, as mentioned above, is word-concept acquisition [8]. Another is paraphrase acquisition: distributional methods have been successful in identifying single-word synonyms and near-synonyms (e.g. [10]); with an unsupervised partial parser, it would be straightforward to extend this strategy to multi-word units, determining distributional similarity between clumps. Yet another is statistical machine translation (SMT): Carpuat and Diab [11] use multiword expressions (MWEs), which are similar to the kind of local structure we target in this paper, to improve the performance of a widely used phrasal SMT system on an English-to-Arabic translation task. The experiments reported by Carpuat and Diab use gold-standard MWEs as defined by the WordNet lexical database [12]; however, because they use a gold-standard resource, they only have access to MWEs for one of the languages. Unsupervised methods, such as those proposed here, would allow strategies such as those of Carpuat and Diab to be systematically applied to both languages in an SMT system. A fourth application is information retrieval: local models can contribute to more linguistically informed document indexing and query segmentation ([6], [13], [14]). Local structure prediction may well be more robust in identifying useful linguistic units in different document genres, and may be applied consistently to different content types (e.g. headlines, image captions, document bodies and queries). Because we specifically propose unsupervised methods, our models do not rely on linguistic resources like treebanks or language-specific parsers, and thus will remain useful as the non-English Web grows in cultural and economic prominence. Finally, precisely because it is so simple, our proposed model for partial parsing is strikingly fast, and can easily be parallelized and applied to Web-scale datasets.

II. UNSUPERVISED PARTIAL PARSING

A. About

In this work, partial parsing corresponds to the identification of grammatically significant text clumps. By grammatically significant, we mean that the clumps correspond to the lowest level of grammatical structure in a full syntactic analysis. This is best illustrated by considering our evaluation of the task. As evaluation material, we extract clumps from gold-standard tree annotations by identifying the lowest subtrees which (1) span more than one token but (2) do not contain non-trivial (multi-token) subtrees. To illustrate, consider the example in Fig. 1. The lowest subtrees of this tree correspond to the text clumps extracted for our evaluation. Note that clumps are different from standard chunks: they include sequences like (for one), and they always contain at least two words. For example, in Fig. 2, only the phrase (cray research) is extracted as a clump, though the single word all would also constitute a noun phrase chunk. Clumps often correspond to noun phrases, as illustrated in Fig. 3, which shows the distribution of text clumps by phrasal type in the corpora we used in our study: the Wall Street Journal subset of the Penn Treebank-3 (WSJ), the Negra-2 German treebank (Negra), and the Penn Chinese Treebank-5 (CTB) (see Section IV for details and citations).
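To make the extraction criteria concrete, the following is a minimal sketch over Penn-style bracketed trees using NLTK's Tree class. It reflects our reading of criteria (1) and (2) above, not the authors' own extraction code, and the node labels in the example tree are added for illustration (Fig. 1 shows the tree without labels).

    from nltk import Tree

    def extract_clumps(tree):
        """Gold clumps: lowest subtrees that (1) span more than one token
        and (2) contain no subtree spanning more than one token."""
        clumps = []
        for sub in tree.subtrees():
            if len(sub.leaves()) < 2:
                continue  # criterion (1): clumps span at least two tokens
            # criterion (2): every child subtree covers exactly one token
            if all(len(child.leaves()) == 1 for child in sub
                   if isinstance(child, Tree)):
                clumps.append(tuple(sub.leaves()))
        return clumps

    # The example of Fig. 1, with hypothetical node labels added:
    t = Tree.fromstring(
        "(S (NP (NNP mrs.) (NNP ward)) (, ,) (PP (IN for) (CD one)) (, ,)"
        " (VP (VBD was) (VBN relieved)))")
    print(extract_clumps(t))
    # [('mrs.', 'ward'), ('for', 'one'), ('was', 'relieved')]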

Fig. 1. Example tree without node labels, and with single-branch sub-trees collapsed. The clumps extracted from this tree are (mrs. ward), (for one), and (was relieved).



Fig. 2. Example tree, for all came from cray research, with fewer clumps. Only (cray research) is considered a clump.

We see that in the three treebanks, phrases of type NP¹ form the plurality of the clumps extracted for our evaluation. Of course, as we are working from raw text, we do not distinguish different types of clumps in our model or evaluation.

¹ We count NP-like phrase types in the numbers reported in Fig. 3. For WSJ we count {NP, QP, WHNP} as NPs; for Negra we count {NP, CNP}; for CTB we count {DP, NP, DNP, QP}.

B. Motivation

Unsupervised partial parsing is an interesting problem for natural language processing for a number of reasons. First of all, predicting textual structure at this level of representation may well benefit multilingual text-retrieval tasks, wherein likely collocations may be better treated as unique lexical items. This is part of the motivation for our evaluation: we wish to capture relevant non-NP collocations such as for one (Fig. 1), in september (a prepositional phrase) and the company 's (an adjectival phrase, in Penn Treebank annotations). Moreover, tackling low-level structure directly may provide a path forward for research in unsupervised parsing. It is quite likely that current methods in unsupervised parsing perform best at local and low-level structural prediction. We demonstrate this for the unsupervised parser introduced in [1], CCLParser.

C. CCLParser

CCLParser makes a good comparison for our presentation of unsupervised partial parsing, and not just because it is accurate, fast and freely available (though that helps). CCLParser uses a linguistic representation which, in effect, directly represents the level of textual structure targeted by UPP. CCLParser uses a novel representation of phrase structures called Common Cover Links (CCL), based on links between word tokens in a sentence. Though they resemble dependency links, a consistent set of CCLs in fact represents a unique tree structure, or bracketing, over the sentence. For example, the tree in Fig. 4 for the cat saw the red dog run is equivalently represented by the CCL set given there. As Fig. 4 shows, each link is associated with a depth. The interpretation of a link from w_i to w_k is that there is a bracket that contains w_i and w_k, and either w_i and w_k are at the same depth under the lowest common bracket, or w_i is shallower. Note also that two words may be linked to each other, e.g. the cat and the red dog; such mutually linked words are at the same depth under a common parent. CCL parsing is the task of mapping a sequence of tokens to a minimal CCL set.

To learn linkings between words, the model uses distributional information about word co-occurrences to determine the tendency of two words to link. Another key component of the model is a co-occurrence count with boundaries, which keeps track of how often a sentence boundary or phrasal punctuation symbol (see Section IV below) has occurred to the left and to the right of a given word. The decision whether to add a link between two words is based on a comparison of (a) the strength of the preference of the two words to link to each other with (b) the strength of the preference of those words for encountering a boundary. This decision is subject to adjacency constraints dictated by punctuation and the structure built for the sentence thus far. (Note that this exposition greatly simplifies the model; see Seginer [7] for full details.)

Fig. 3. Distribution of extracted text clumps by phrasal type (%):

           NP     PP     VP     AP     Other
  WSJ      81.3    8.3    3.7    --     6.7
  Negra    42.4   36.4    --     6.0   15.2
  CTB      63.0    4.6   24.0    --     8.4

TABLE I
Comparing CCLParser to structures using clumps. PARSEVAL scores for CCLParser and for methods that heuristically build structure above the clumps it predicts (SP: single parent, LB: left-branching, RB: right-branching). RB Baseline is a strong right-branching baseline without clumps, but which also does not branch over phrasal punctuation. These results are on the WSJ subset of the Penn Treebank.

                 Prec   Rec    F1
  RB Baseline    40.8   53.9   46.5
  SP Clumps      61.7   21.8   32.2
  LB Clumps      21.0   25.9   23.2
  RB Clumps      50.8   62.6   56.1
  CCLParser      55.1   52.0   53.5


Fig. 4. Example CCL set for the sentence the cat saw the red dog run, showing the links between word tokens and their associated depths.

D. CCLParser as a clumper


What is most striking about the output of the model is that the doubly linked word sequences seem to correspond to noun phrase chunks. Based on this observation, we extracted all such sequences from the output of CCLParser on the WSJ corpus and evaluated them against the clumps extracted from the gold standard. This produced 67.6% precision, 44.1% recall, and a 53.4% F1-score. This is of course well below the scores in the mid-90s obtained by supervised chunkers [9]. Still, the "clumps" extracted from CCLParser output are reasonably precise. And, of course, they correspond to a low level of syntactic structure, which raises the question: how much of CCLParser's performance is due to getting the clumps right? We tested this by comparing full CCLParser output against different methods for connecting the clumps it predicts: a single spanning parent connecting all clumps, left-branching, and right-branching. Similarly to the full parser, we predict only subtrees that span no punctuation boundaries. The PARSEVAL results, given in Table I, are striking: simply predicting right-branching structures that conform to the clumps and punctuation boundaries improves F1-score by +2.6 over CCLParser itself. In fairness to CCLParser, it learns that the structure above clumps is primarily right-branching for English. Nonetheless, the fact that a uniformly right-branching structure is better than the structure predicted by CCLParser indicates that the real action is in the local structure prediction.



That this is the case is in some ways unsurprising, given the representations used by CCLParser to predict links. The lexical entry for a word in the CCL parser stores linking and boundary preferences for different adjacency positions, basically words observed at different distances from the target. Immediate adjacencies are the most important factor in link creation for CCLParser.

III. A SIMPLE MODEL

Given these observations, we compare CCLParser to a simple model that predicts whether a link should be present between adjacent words, and we consider sequences of linked words to be clumps. Like CCLParser, our model uses only two pieces of information: co-occurrence counts between pairs of words, and co-occurrence counts of words with phrasal punctuation or sentence boundaries. Let S = w_1 ... w_N be a sentence to be analyzed. For any two adjacent words w_n, w_{n+1}, we want the model to decide whether to clump, operationalized as a decision between linking w_n to w_{n+1} and linking w_n to a phrase boundary, written as #. The model simply uses raw bigram counts: posit that two adjacent words w_n, w_{n+1} are in the same clump if

    τ · count(w_n, w_{n+1}) ≥ count(w_n, #) + count(#, w_{n+1})

That is, w_n and w_{n+1} are clumped if they have been observed together (scaled by τ) at least as often as w_n has been observed followed by a boundary plus a boundary followed by w_{n+1}. Note that this model defaults to clumping when all of the relevant counts are 0. That means that it will clump rare or unseen strings, most notably names, like pierre vinken. The model provided here is referred to as CLUMPBG in the results below; results quoted below use τ = 2.
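A minimal sketch of this decision rule follows, assuming input has been pre-processed as in Section IV (lower-cased, with sentence boundaries and phrasal punctuation collapsed into a single boundary symbol); the function and variable names are ours, not from the paper's implementation.

    from collections import Counter

    BOUNDARY = "#"

    def train(segments):
        """Collect raw bigram counts. Each segment is a list of lower-cased
        tokens delimited by sentence boundaries or phrasal punctuation;
        both boundary types are collapsed into the single symbol #."""
        counts = Counter()
        for seg in segments:
            padded = [BOUNDARY] + seg + [BOUNDARY]
            for a, b in zip(padded, padded[1:]):
                counts[a, b] += 1
        return counts

    def clumps(seg, counts, tau=2):
        """Link adjacent words w_n, w_{n+1} whenever
        tau * count(w_n, w_{n+1}) >= count(w_n, #) + count(#, w_{n+1});
        maximal linked runs of length >= 2 are the clumps. Unseen pairs
        default to clumping (0 >= 0), e.g. rare names like pierre vinken."""
        out, run = [], [seg[0]] if seg else []
        for a, b in zip(seg, seg[1:]):
            if tau * counts[a, b] >= counts[a, BOUNDARY] + counts[BOUNDARY, b]:
                run.append(b)
            else:
                if len(run) > 1:
                    out.append(tuple(run))
                run = [b]
        if len(run) > 1:
            out.append(tuple(run))
        return out

Training and prediction are each a single pass over the data, which is what makes the model so fast and, as noted above, easily parallelized.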

IV. DATA

A. Corpora

The results in this paper use the same² treebanks used to evaluate the unsupervised parsers of Klein and Manning [15] and Seginer [1]: for English, the Wall Street Journal subset of the Penn Treebank-3 (WSJ);³ for German, the Negra corpus version 2;⁴ and for Chinese, the Penn Chinese Treebank version 5.0 (CTB).⁵ We used a version of the CTB converted to UTF-8 from the original GB encoding, with a discrepancy of two tokens. In addition to results on the full treebanks (reported as WSJ-all, Negra-all and CTB-all), we report results on the subset of each corpus consisting of sentences with 10 tokens or fewer (WSJ10, Negra10, CTB10). These datasets have been used in other research on unsupervised parsing ([1], [2], [15], [19], [20]). Basic statistics for these corpora are given in Table II.

² With one exception: we use the Penn Treebank-3; earlier research uses Treebank-2.
³ LDC99T42 [16].
⁴ www.coli.uni-saarland.de/projects/sfb378/negra-corpus/ [17].
⁵ LDC2005T01 [18].

TABLE II
Basic corpus statistics: sentences, word types and tokens. These counts reflect the text after it has been pre-processed to ignore punctuation; tokens are simplified via lower-casing.

               Sentences   Types     Tokens
  WSJ10           7 422     9 892     324 867
  WSJ-all        49 208    43 740   1 028 348
  Negra10         7 542    13 090      43 369
  Negra-all      20 602    48 911     303 471
  CTB10           4 626     7 031      24 470
  CTB-all        18 787    37 386     430 971

B. Text conventions

We focus on parser learning from word types alone. Part-of-speech annotations are only used to identify punctuation and the special empty categories which are included in the treebanks. In this, we follow Seginer [7, Section 7.2], who largely follows previous conventions ([2], [15], [19], [20]). Punctuation is usually ignored in unsupervised parser evaluation; Seginer departs from this in one key respect: the use of certain punctuation to help identify phrases. This punctuation, which we call phrasal punctuation, factors significantly in our results. For phrasal punctuation, we use: . ? ! ; , -- 。 、 (the latter two represent the Chinese full stop and enumeration comma). Otherwise, the only adjustment we make to treebank text is conversion to lower-case.

C. Not counting root

We depart from earlier research in unsupervised parsing evaluation by ignoring the tree-root bracket in treebank-based evaluation. Since the standard evaluation is unlabeled precision, recall and F1 score on sub-tree spans (or brackets), counting the full tree provides the evaluation with a free true positive for every tree under consideration. This has the strongest effect on experiments involving sentences of only 10 or 20 tokens or less. We replicate the experiments reported in [1], the best results on the 10-token-or-less subsets of WSJ, Negra and CTB (WSJ10, Negra10 and CTB10). Evaluating with and without counting roots, we find that counting them adds 7263 true positives to PARSEVAL evaluation on WSJ10 (a 37% increase), 6505 true positives on Negra10 (an 86% increase) and 3877 true positives on CTB10 (a 95% increase). Overall, these additional true positives yield a +6.2 change in WSJ10 F1 score, a +15.4 change in Negra10 F1, and a +16.6 change in CTB10 F1. For the purpose of this paper, which is to relate local structure prediction to unsupervised parsing in general, predicting the tree root has little relevance.
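As a concrete rendering of the Section IV-B conventions, here is a hedged pre-processing sketch; the (token, POS) input encoding and the punct_tags parameter are our assumptions, to be supplied per treebank.

    # Phrasal punctuation per Section IV-B; 。 and 、 are the Chinese
    # full stop and enumeration comma.
    PHRASAL_PUNCT = {".", "?", "!", ";", ",", "--", "。", "、"}

    def preprocess(tagged_sent, punct_tags):
        """Lower-case tokens and drop punctuation (identified by POS tag),
        keeping phrasal punctuation, which later delimits segments.
        tagged_sent: list of (token, pos) pairs from a treebank."""
        return [tok.lower() for tok, pos in tagged_sent
                if pos not in punct_tags or tok in PHRASAL_PUNCT]

    def segments(tokens):
        """Split a pre-processed token list at phrasal punctuation,
        yielding the boundary-delimited segments used for counting
        and clumping."""
        seg = []
        for tok in tokens:
            if tok in PHRASAL_PUNCT:
                if seg:
                    yield seg
                seg = []
            else:
                seg.append(tok)
        if seg:
            yield seg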


V. EXPERIMENTS

We evaluate our models as clumpers and as heuristic constituency parsers on the corpora cited above. We use CCLParser as a benchmark for comparison; when it is used as a clumper, we refer to it as CLUMPCCL.

TABLE III
Clumping results, in precision / recall / F1.

               CLUMPBG             CLUMPCCL
  WSJ10        53.1 / 37.9 / 44.2  76.6 / 42.8 / 54.9
  WSJ-all      46.9 / 39.1 / 42.7  67.6 / 44.1 / 53.4
  Negra10      44.2 / 33.4 / 38.0  55.9 / 32.7 / 41.3
  Negra-all    33.3 / 30.3 / 31.7  43.1 / 30.2 / 35.5
  CTB10        43.9 / 34.2 / 38.4  34.3 / 24.3 / 28.5
  CTB-all      40.8 / 30.9 / 35.2  26.0 / 21.0 / 23.3

A. High-level results


1) Clumping: Table III provides a high-level analysis of the performance of the various models on the unsupervised clumping task. For comparison, we use the "gold" clumps extracted from each treebank as described in Section II-A. The numbers quoted here are precision (P), recall (R) and F1 score⁶ on the full clumps identified by each method. This is very similar to unlabeled PARSEVAL, only on the subset of trees corresponding to text clumps. This means that partial clump predictions are counted as wrong: for example, if a system outputs the ( new york ) times when the correct clump is ( the new york times ), then the partially correct clump is considered incorrect. This is in spite of the fact that for our models, as well as for CCLParser, the clumping decision is based on individual bigrams: full clumps are emergent structures over the individual bigram clumps.

2) Clumpers evaluated as heuristic parsers: Following the observation in Section II-D, we consider the extent to which right-branching baselines recover treebank annotations when the output of clumper models is used as the basis for the higher-level structure. Specifically, we consider trees which are simple right-branching by default, but which (1) respect the brackets defined by a model's identified clumps, and (2) do not branch over phrasal punctuation; an example is given in Fig. 6, and a sketch follows below. This heuristic is partially inspired by a baseline for unsupervised parsing proposed by Seginer [7]: right-branching (or left-branching) trees which do not branch over punctuation.⁷ We also provide ORACLE-RB numbers as an upper bound on the performance of the baseline+clumper evaluation: this is the same right-branching tree output, only using the actual clumps extracted from the treebank. The results of these experiments are given in Table IV. The numbers given are unlabeled PARSEVAL: P / R / F1 on finding full brackets. These numbers do not count the root of the tree as a free true positive, as discussed in Section IV-C; also, single-node branches are ignored.
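A minimal sketch of this derivation, reusing the segments() helper sketched in Section IV; the tuple-based tree encoding and the greedy clump matching are our assumptions, not the authors' implementation.

    def rb_tree(units):
        """Right-branching binary tree over units: [a, b, c] -> (a, (b, c)).
        A unit is a single word or a clump (kept intact as a bracket)."""
        return units[0] if len(units) == 1 else (units[0], rb_tree(units[1:]))

    def derive_tree(tokens, clump_set):
        """Heuristic parse: segment at phrasal punctuation, treat each
        predicted clump as an atomic unit, arrange the units of each
        segment right-branching, and attach segments under the root."""
        branches = []
        for seg in segments(tokens):
            units, i = [], 0
            while i < len(seg):
                for c in clump_set:  # greedily match a predicted clump here
                    if tuple(seg[i:i + len(c)]) == c:
                        units.append(c)
                        i += len(c)
                        break
                else:
                    units.append(seg[i])
                    i += 1
            branches.append(rb_tree(units))
        return tuple(branches) if len(branches) > 1 else branches[0]

On the Fig. 6 sentence, the commas yield the segments pierre vinken / 61 years old / will join the board of elsevier, and each segment is folded right-branching around its atomic clumps, matching the derivation shown in the figure.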

Fig. 5. Effects of counting the root bracket in unsupervised parser evaluation. (a) Volume of true positives added to the evaluation: WSJ10, 7263 added to 19687 without root; Negra10, 6505 added to 7569; CTB10, 3877 added to 4064. (b) Change in the precision, recall and F1 metrics (P / R / F1):

                  WSJ10               Negra10             CTB10
  Without root    69.3 / 70.2 / 69.8  35.9 / 55.3 / 43.5  38.8 / 39.3 / 39.0
  With root       75.6 / 76.3 / 76.0  51.0 / 69.7 / 58.9  55.4 / 55.8 / 55.6

B. Analysis

From the basic clumping numbers, we see that CCLParser usually performs well as a clumper, easily outperforming our simple model on WSJ and Negra. The exceptions are the evaluations on the Chinese Treebank. This is possibly due to the fact that Chinese does not have determiners akin to the and a; in parsing English, CCLParser frequently makes use of statistics associated with these terms as a kind of back-off for linking decisions, even when they are not lexically present.

⁶ F1 = 2PR/(P+R), as usual.
⁷ Note that CCLParser itself uses this as a hard constraint, so it also "respects" punctuation in this way.

Fig. 6. Example tree extracted from clumper output, for the clumping: ( pierre vinken ) , ( 61 years ) old , will join ( the board ) ( of elsevier ). Punctuation is not counted in evaluation; however, punctuation is used to segment the tree into branches, which are dominated only by the top (root) node. Otherwise, clumps and words are arranged into a right-branching structure.

TABLE IV
Parsing results (unlabeled PARSEVAL, P / R / F1). BASELINE-RB is the baseline model: right-branching trees that respect punctuation. CLUMPBG-RB and CLUMPCCL-RB are right-branching trees which do not branch over punctuation and respect the clumps found by CLUMPBG and CLUMPCCL, respectively. CCLParser is the output of CCLParser itself. ORACLE-RB is the same right-branching construction using the clumps extracted from the treebank itself.

                 WSJ10               WSJ-all             Negra10             Negra-all           CTB10               CTB-all
  BASELINE-RB    51.5 / 68.4 / 58.7  40.8 / 53.9 / 46.5  20.8 / 44.2 / 28.3  14.1 / 29.4 / 19.1  33.0 / 50.6 / 39.9  31.9 / 37.2 / 34.4
  CLUMPBG-RB     55.4 / 70.2 / 61.9  46.2 / 58.2 / 51.5  28.2 / 55.6 / 37.5  21.0 / 40.2 / 27.6  42.8 / 56.1 / 48.6  38.5 / 42.2 / 40.3
  CLUMPCCL-RB    62.0 / 77.9 / 69.0  50.8 / 62.6 / 56.1  28.1 / 57.8 / 37.8  19.9 / 39.9 / 26.5  35.5 / 49.4 / 41.3  35.4 / 36.8 / 36.1
  CCLParser      69.4 / 70.2 / 69.8  55.1 / 52.0 / 53.5  35.9 / 55.3 / 43.5  27.1 / 39.8 / 32.2  38.9 / 39.3 / 39.1  36.9 / 27.7 / 31.6
  ORACLE-RB      78.0 / 91.7 / 84.3  66.6 / 77.0 / 71.4  49.0 / 88.1 / 63.0  38.3 / 69.8 / 49.4  77.7 / 77.3 / 77.5  57.0 / 62.5 / 59.6

Comparing the text clumping results to the full-parse results in Table IV, we find some symmetry, but not complete harmony. In those cases where CCLParser outperforms our model as a clumper (WSJ and Negra), its combination of clumps and right-branching structure outperforms our model as a parser. Conversely, where our clumping model outperforms CCLParser at the clumping task (CTB), its full-tree output is competitive with (CTB10) or outright better than (CTB-all) the original parser output. Of course, as noted in Section II-D, one of the truly surprising results is that using CLUMPCCL with the heuristic right-branching baseline obtains a higher F-score on WSJ-all than CCLParser itself (56.1 vs 53.5). A similar result holds for CTB10 and CTB-all. In what follows, we look more carefully at these results, and argue that per-bigram precision is an important metric for unsupervised structure identification, which cascades to other metrics for unsupervised parsing. We also consider what seems to be the outlier in our analysis: on both the Negra10 subset and the full Negra corpus, CLUMPCCL-RB's trees are not closer to the treebank annotations than CCLParser's.

1) Relating clumper accuracy to parse accuracy: For CCLParser, at the level of local syntactic structure, multiword structures such as clumps are emergent structures over individual token-bigram decisions. This is true of our model as well. So we consider how the accuracy of the various models on individual bigrams relates to the other metrics under consideration, comparing how CCLParser and our model perform on a per-bigram basis against the other metrics considered. This is motivated by the fact that both models make local clumping decisions at the level of bigrams: clumps, or low-level constituents, are derived from predictions on the bigrams themselves. We find that per-bigram precision correlates strongly with the other evaluation metrics. In Fig. 7, we graph per-bigram precision and per-bigram recall against full clump identification F1 score (the top graphs), as well as against the derived-tree evaluation using clumping predictions (the bottom graphs). Per-bigram precision correlates well with clumping F1 in particular (the top-left graph), while per-bigram recall shows much weaker correlation. A similar pattern holds for per-bigram precision versus recall with the derived-tree evaluation metrics, in spite of the degree of indirection between the individual bigram predictions and the ultimate PARSEVAL results. The graphs in Fig. 7 only compare per-bigram precision and recall to two other metrics: full clump identification F1 score and derived-tree constituent identification F1. To amend this, we performed rank correlation analysis of per-bigram precision and recall against full clump precision, recall and F1 score, as well as against derived-tree constituent identification precision, recall and F1 score, using Pearson's correlation coefficient. Rank correlation is calculated for each evaluation type across all experimental setups (the two models and six datasets). These results are given in Fig. 8. The lesson here is that the models which are more conservative, which predict clumps less frequently but with more confidence, tend to perform relatively well.
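As an illustration, per-bigram scores can be computed directly from the link decisions; the set-of-links encoding here, with each linked adjacent pair identified by (sentence index, position), is our assumption.

    def per_bigram_pr(gold_links, pred_links):
        """Per-bigram precision and recall over adjacent-pair link
        decisions. Each set element identifies one linked adjacent pair,
        e.g. (sentence_id, n) for a link between w_n and w_{n+1}."""
        tp = len(gold_links & pred_links)
        precision = tp / len(pred_links) if pred_links else 0.0
        recall = tp / len(gold_links) if gold_links else 0.0
        return precision, recall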


Fig. 7. Relating per-bigram precision and recall to full-clump F1 score (top graphs) and to derived-tree unlabeled PARSEVAL F1 score (bottom graphs). The per-bigram precision / recall numbers:

              WSJ10        WSJ-all      Negra10      Negra-all    CTB10        CTB-all
  CLUMPBG     75.0 / 50.2  75.7 / 56.3  77.6 / 54.8  67.3 / 57.8  84.7 / 61.4  67.4 / 52.0
  CLUMPCCL    94.2 / 51.5  91.1 / 57.1  83.6 / 43.4  70.0 / 43.5  76.6 / 47.7  61.7 / 55.3


In other words: false positives result in incorrect clumps, which have a cascade-of-error effect on the other metrics; erring on the side of caution turns out to be less harmful.

2) Negra: The outlier in our full-parsing comparison results is Negra: the derived-tree metrics do not correlate with per-bigram accuracy, nor do they seem related to the whole-clump evaluation metrics. The basic theme seems to be: adding right-branching trees to predicted clumps (or even perfect clumps) does not produce trees that correspond to the treebank annotations. There are a couple of reasons for this. First of all, Negra is a difficult dataset, especially when no morphological analysis is employed, as here: the word-type to word-token ratio is higher than in the other datasets (cf. Table II), and yet it is the smallest treebank of the three, counting tokens. Second, it seems that neither a strict left-branching nor a strict right-branching structure is a good match for the gold analyses. The most important factor, however, is the annotation style of the Negra corpus. Negra has both constituent and dependency analyses, and its trees are relatively flat, in particular for long NPs and PPs; our evaluation is only based on the trees. This is indicated in part by the following study. As before, in Section II, we consider other heuristics for projecting structures above the clumps identified by CLUMPCCL.


Fig. 8. Correlation of per-bigram metrics with the other structure identification metrics. We report Pearson's correlation coefficient of per-bigram precision (PBP) and per-bigram recall (PBR) with each metric, across all 12 experimental setups (i.e. the two models and six datasets):

                   Clumping             Derived tree
                   P     R     F1       P     R     F1
  Rank with PBP    .90   .76   .84     .58   .83   .69
  Rank with PBR    -.11  .20   .07     .26   .02   .20

The results are presented in Table V: single-parent (SP) trees (that is, trees whose only subtrees are the identified clumps, plus nodes spanning the words between two boundaries) are almost as good as CCLParser output when compared to the full treebank annotations (Negra-all).

TABLE V
Effectiveness of different heuristics for building trees from clumps predicted by CCLParser on Negra (SP: single parent, RB: right-branching), in P / R / F1.

                 Negra10             Negra-all
  BASELINE-RB    20.8 / 44.2 / 28.3  14.1 / 29.4 / 19.1
  CLUMPCCL-SP    52.9 / 29.6 / 37.9  43.0 / 24.8 / 31.4
  CLUMPCCL-RB    28.1 / 57.8 / 37.8  19.9 / 39.9 / 26.5
  CCLParser      35.9 / 55.3 / 43.5  27.1 / 39.8 / 32.2

Even in the Negra10 experimental setup, SP trees slightly outperform right-branching trees in terms of F1 score, and the precision/recall trade-off associated with the different heuristics is much less severe than in the corresponding study on WSJ (cf. Table I).

VI. RELATED WORK

Several recent papers have made use of CCLParser for NLP tasks: Reichart and Rappoport [21] use clustering to identify labeled constituents in CCLParser output; Reichart and Rappoport [22] present methods to determine the quality of unsupervised parses from CCLParser; and, as mentioned above, Davidov et al. [8] use CCLParser output for unsupervised concept acquisition. In addition to Seginer's, the unsupervised parsing model of Gao and Suzuki [5] is related to our research. Their model also operates only on text without part-of-speech annotations and, like our work, pays particular attention to local syntactic structure; they use a language model to characterize clumps of local structure, which are linked via a dependency model. In the supervised realm, Hollingshead et al. [23] compare full (context-free) parsing with finite-state partial parsing methods. They find that full parsers maintain a number of benefits, in spite of the greater training time and resources necessary: they can train on less data more effectively than chunkers, and are more robust to shifts in textual domain.

VII. CONCLUSION

In this paper, we have introduced the task of unsupervised partial parsing (UPP), defined as the task of identifying clumps. Clumps are lowest-level phrases of length at least two, including noun phrases but also relevant non-NP collocations such as for one. We have argued that unsupervised parsing can be factored into the identification of clumps plus the building of global structure, showing that the performance of a current state-of-the-art unsupervised parser, CCLParser, can in some cases actually be bested by combining the clumps that it proposes with a right-branching structure.

Our research introduces a new perspective on the unsupervised parsing task: getting the local structure of a clause right is often extremely important to getting the whole-sentence parse structure right. Additionally, we have posited that per-bigram precision is an important metric for unsupervised partial parsing.

What this means for future research in unsupervised parsing is that models which are more conservative when predicting local structure have a greater chance of success on other parser evaluation metrics. Additionally, we have introduced an extremely simple and fast model for unsupervised partial parsing that relies solely on counts of word bigrams, along with bigrams of a word and punctuation or sentence boundaries. This in fact mimics the basic functioning of CCLParser for local attachment decisions, and opens up opportunities for modular extension and a probabilistic foundation for such word-based unsupervised parsing models. In particular, it will be interesting to explore methods of generalizing over seen bigrams.

ACKNOWLEDGMENTS

The authors acknowledge the support of a grant from the Morris Memorial Trust Fund of the New York Community Trust. We are also grateful to Taesun Moon and the anonymous reviewers for helpful comments and insight.

REFERENCES

[1] Y. Seginer, "Fast unsupervised incremental parsing," in Proc. of ACL, 2007.
[2] D. Klein and C. Manning, "A generative constituent-context model for improved grammar induction," in Proc. of ACL, 2002.
[3] S. Cohen and N. Smith, "Shared logistic normal distributions for soft parameter tying in unsupervised grammar induction," in Proc. of HLT-NAACL, 2009.
[4] W. Headden III, M. Johnson, and D. McClosky, "Improving unsupervised dependency parsing with richer contexts and smoothing," in Proc. of HLT-NAACL, 2009.
[5] J. Gao and H. Suzuki, "Unsupervised learning of dependency structure for language modeling," in Proc. of ACL, 2003.
[6] J. Gao, J. Nie, G. Wu, and G. Cao, "Dependence language model for information retrieval," in Proc. of SIGIR, 2004.
[7] Y. Seginer, "Learning syntactic structure," Ph.D. dissertation, Univ. of Amsterdam, 2007.
[8] D. Davidov, R. Reichart, and A. Rappoport, "Superior and efficient fully unsupervised pattern-based concept acquisition using an unsupervised parser," in Proc. of CoNLL, 2009.
[9] E. Tjong Kim Sang and S. Buchholz, "Introduction to the CoNLL-2000 shared task: Chunking," in Proc. of CoNLL, 2000.
[10] D. Lin, "Automatic retrieval and clustering of similar words," in Proc. of COLING-ACL, 1998.
[11] M. Carpuat and M. Diab, "Task-based evaluation of multiword expressions: A pilot study in statistical machine translation," in Proc. of HLT-NAACL, 2010.
[12] C. Fellbaum, Ed., WordNet: An Electronic Lexical Database. MIT Press, 1998.
[13] B. Tan and F. Peng, "Unsupervised query segmentation using generative language models and Wikipedia," in Proc. of WWW, 2008.
[14] M. Bendersky, W. Croft, and D. Smith, "Two-stage query segmentation for information retrieval," in Proc. of SIGIR, 2009.
[15] D. Klein and C. Manning, "Corpus-based induction of syntactic structure: Models of dependency and constituency," in Proc. of ACL, 2004.
[16] M. Marcus, B. Santorini, M. Marcinkiewicz, and A. Taylor, Treebank-3, Linguistic Data Consortium, 1999.
[17] W. Skut, B. Krenn, T. Brants, and H. Uszkoreit, "An annotation scheme for free word order languages," in Proc. of ANLP, 1997.
[18] M. Palmer, F. Chiou, N. Xue, and T. Lee, Chinese Treebank 5.0, Linguistic Data Consortium, 2005.
[19] R. Bod, "An all-subtrees approach to unsupervised parsing," in Proc. of ACL-COLING, 2006.
[20] R. Bod, "Unsupervised parsing with U-DOP," in Proc. of CoNLL, 2006.
[21] R. Reichart and A. Rappoport, "Unsupervised induction of labeled parse trees by clustering with syntactic features," in Proc. of COLING, 2008.
[22] R. Reichart and A. Rappoport, "Automatic selection of high quality parses created by a fully unsupervised parser," in Proc. of CoNLL, 2009.
[23] K. Hollingshead, S. Fisher, and B. Roark, "Comparing and combining finite-state and context-free parsers," in Proc. of HLT-EMNLP, 2005.
