Semi-Supervised Training for Statistical Parsing

Semi-Supervised Training for Statistical Parsing
Mark Steedman, Steven Baker, Jeremiah Crim, Stephen Clark, Julia Hockenmaier, Rebecca Hwa, Miles Osborne, Paul Ruhlen, Anoop Sarkar
(Edinburgh, Penn, UMD, JHU, Cornell)
August 21, 2002


Introduction
Mark Steedman


The Team

Name                   E-Mail               Affiliation
Steedman, Mark         [email protected]    University of Edinburgh
Sarkar, Anoop          [email protected]    University of Pennsylvania
Osborne, Miles         [email protected]    University of Edinburgh
Hwa, Rebecca           [email protected]    University of Maryland
Clark, Stephen         [email protected]    University of Edinburgh
Hockenmaier, Julia     [email protected]    University of Edinburgh
Ruhlen, Paul           [email protected]    Johns Hopkins University
Baker, Steven          [email protected]    Cornell University
Crim, Jeremiah         [email protected]    Johns Hopkins University

Associates

Name                   E-Mail               Affiliation
John Henderson         [email protected]    Mitre Corp.
Chris Callison-Burch   [email protected]    Stanford/Edinburgh

Statistical parsing: the State of the Art

• Parsers trained on the Penn Treebank have improved rapidly . . .
• . . . and have been widely and usefully applied.
• Generalizing to other genres of text and other languages is highly desirable.
• But we will probably never get another 1M words of any kind of human-annotated text.
• Automatically parsed text such as the BLLIP Parsed Corpus (30M words) is too noisy.
• Co-training has been shown to support bootstrapping respectable performance from small labeled corpora (Sarkar 2001).


What is Co-training?

• Co-training (Blum and Mitchell 1998) is a weakly supervised method for bootstrapping a model from a (small) seed set of labeled examples, augmented with a much larger set of unlabeled examples, by exploiting redundancy among multiple statistical models.
• A set of models is trained on the same labeled seed material.
• The whole set of models is then run on a larger amount of unlabeled data. Novel parses from any parser that are deemed reliable under some measure are used as additional labeled data for retraining the other models.


What is Co-training? (Contd.)

• Co-training can be thought of as seeking to optimize an objective function that measures the degree of agreement between the predictions for the unlabeled data based on the two views (see below).
• Crucially, co-training is training on others' output, in contrast to self-training, i.e. training on one's own output (cf. Charniak 1997).
• The theory of co-training has been developed in application to classifiers. The present project is an empirical study of its applicability to parsers, following Sarkar (2001).


Potential Payoffs are Large

• Payoff 1:
  – Improved performance of Wide-Coverage Parsers
• Payoff 2:
  – Methods for building Very Large Treebanks (larger and less noisy than BLLIP), useful for numerous speech/language applications
• Payoff 3:
  – Methods for bootstrapping parsers for novel genres and new languages from small labeled datasets


Co-training: Project Details

• Three Distinct Parsers
  – Treebank CFG parsers (Collins 1999, 2000, . . .)
  – LTAG parser (Sarkar)
  – CCG parsers (and CCG supertaggers) (Clark, Hockenmaier)
• Approaches
  – Co-training with supertaggers and parsers (CCG)
  – Co-training with different parsers (CFG and LTAG)
  – Co-training Rerankers (Collins 2000)
• Data:
  – Use labeled WSJ (1M words) and Brown (440K words) Penn Treebank
  – Then use unlabeled North American News Text corpus: 500M words


Workshop Goals

• Identify criteria for Parse Selection that exclude noise and maximize co-training effects;
• Explore the way co-training effects depend on the size of the labeled seed set;
• Explore effectiveness of co-training for porting parsers to new genres by training on the Brown Corpus, co-training on unlabeled WSJ and held-out labeled PT-WSJ section 00, ultimately evaluating on PT-WSJ section 23.
• Explore effectiveness of co-training on parsers trained on all of PT-WSJ sections 2-22, co-trained on unlabeled WSJ and labeled section 00, ultimately evaluating on section 23.


What We Have to Show

• We will show that co-training enhances performance for parsers and taggers trained on small (500-10,000 sentences) amounts of labeled data.
• We will also show that co-training can be used for porting parsers trained on one genre to parse another, without any new human-labeled data at all, improving on the state of the art for this task.
• We will also show that even tiny amounts of human-labelled data for the target genre enhance porting via co-training.
• We will show a number of preliminary results on ways to deliver similar improvements for parsers trained on large (Penn WSJ Treebank) labeled datasets and expressive grammars.
• These include a number of novel methods for Parse Selection.


Outline

• Data Management (5 min, Steven Baker)
• Parse Selection (15 min, Rebecca Hwa)
• Experiments I: Collins-CFG/LTAG (25 min, Anoop Sarkar)
• Experiments II: CCG/CCG-Supertagger (30 min, Stephen Clark)
• Student Presentation: Oracle experiments (10 min, Steven Baker)
• Coffee Break (30 min)
• Student Proposal: Smoothing for Co-training (15 min, Julia Hockenmaier)
• Experiments III: Rerankers (10 min, Jeremiah Crim)
• Student Proposal: Co-training Rerankers (5 min, Jeremiah Crim)
• Conclusions (5 min, Miles Osborne)
• Questions and Discussion (30 min)

Co-training Architecture and Parse Selection
Steve Baker, Rebecca Hwa, and Paul Ruhlen


Co-training: The Algorithm

• Requires:
  – Two learners with different views of the task
  – A Cache Manager (CM) to interface with the disparate learners
  – A small set of labelled seed data and a larger pool of unlabelled data
• Pseudo-Code (sketched below):
  – Init: Train both learners with labelled seed data
  – Loop:
    ∗ CM picks unlabelled data to add to the cache
    ∗ Both learners label the cache
    ∗ CM selects newly labelled data to add to the learners' respective training sets
    ∗ Learners re-train
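A minimal Python sketch of this loop, assuming hypothetical parser objects with train/parse/score methods and a select_fn argument standing in for the cache manager's selection heuristic (all names here are illustrative, not the workshop implementation):

def cotrain(parser_a, parser_b, seed, unlabelled, select_fn,
            iterations=100, cache_size=50):
    """Generic co-training loop: each learner retrains on material
    labelled by the other, as chosen by the cache manager's heuristic."""
    labelled_a, labelled_b = list(seed), list(seed)
    parser_a.train(labelled_a)
    parser_b.train(labelled_b)
    pool = list(unlabelled)

    for _ in range(iterations):
        # Cache manager draws a small cache of unlabelled sentences.
        cache, pool = pool[:cache_size], pool[cache_size:]
        if not cache:
            break

        # Both learners label the cache and score their own output.
        out_a = [(s, parser_a.parse(s), parser_a.score(s)) for s in cache]
        out_b = [(s, parser_b.parse(s), parser_b.score(s)) for s in cache]

        # The selection heuristic picks which newly labelled parses to trust;
        # crucially, each parser is trained on the *other* parser's output.
        labelled_a += select_fn(teacher=out_b, student=out_a)
        labelled_b += select_fn(teacher=out_a, student=out_b)

        parser_a.train(labelled_a)
        parser_b.train(labelled_b)

    return parser_a, parser_b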


Co-training: Theoretical Considerations

• Co-training works (provably) when the views are independent.
  – Blum and Mitchell, 1998
  – Nigam and Ghani, 2000
  – Dasgupta et al., 2001
• To apply theory to practice, we would like to explicitly optimise an objective function that measures the degree of agreement between the predictions for the unlabelled data based on the two views (a toy illustration follows).
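As a toy illustration only (an assumption for exposition, not the workshop's actual measure), cross-view agreement could be estimated as the fraction of unlabelled sentences on which the two parsers return the same analysis:

def agreement_rate(parser_a, parser_b, unlabelled_sentences):
    """Fraction of unlabelled sentences on which the two views agree.
    Parses are compared for exact equality; a finer-grained objective
    would compare labelled constituents."""
    if not unlabelled_sentences:
        return 0.0
    matches = sum(1 for s in unlabelled_sentences
                  if parser_a.parse(s) == parser_b.parse(s))
    return matches / len(unlabelled_sentences)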


Parse Selection: Practical Heuristics

• Computing the objective function is impractical.
• We use parse selection heuristics to determine which labelled examples to add.
  – Each parser assigns a score for every sentence, estimating the goodness of its best parse.
  – The Cache Manager uses these scores to select the newly labelled data for both parsers according to some selection method.


Scoring Phase

• Ideally, we want the parser to return its true accuracy for the parses it produced (e.g., parseval scores).
• In practice, we can approximate these scores with confidence metrics (two are sketched below):
  – log probability of the best parse
  – entropy of the n-best parses (tree-entropy)
  – number of parses, sentence length
  – ...
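Two of these confidence metrics might be sketched as follows, assuming the parser exposes log probabilities for its n-best parses (a simplified illustration, not either parser's actual interface):

import math

def best_parse_log_prob(nbest_log_probs):
    """Score a sentence by the log probability of its best parse."""
    return max(nbest_log_probs)

def tree_entropy(nbest_log_probs):
    """Entropy of the renormalised n-best parse distribution; low entropy
    suggests the parser is confident in its best parse."""
    m = max(nbest_log_probs)
    weights = [math.exp(lp - m) for lp in nbest_log_probs]  # shifted for numerical stability
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0.0)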


Selection Phase

• Want to select training examples for one parser (the student) labelled by the other (the teacher) so as to minimise noise and maximise training utility. The four methods below are sketched in code afterwards.
  – Top-n: Choose the n examples for which the teacher assigned the highest scores.
  – Difference: Choose the examples for which the teacher assigned a higher score than the student, by some threshold.
  – Intersection: Choose the examples that received high scores from the teacher but low scores from the student.
  – Disagreement: Choose the examples for which the two parsers provided different analyses and the teacher assigned a higher score than the student.
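A rough sketch of the four selection methods, assuming each parser's output is summarised as a dict mapping sentences to scores (and, for Disagreement, to parses); the dict representation and thresholds are illustrative assumptions:

def top_n(teacher_scores, n):
    """Top-n: the n examples the teacher scored highest."""
    ranked = sorted(teacher_scores, key=teacher_scores.get, reverse=True)
    return set(ranked[:n])

def difference(teacher_scores, student_scores, threshold):
    """Difference: the teacher's score exceeds the student's by a threshold."""
    return {s for s in teacher_scores
            if teacher_scores[s] - student_scores[s] > threshold}

def intersection(teacher_scores, student_scores, n):
    """Intersection: in the teacher's top n and the student's bottom n."""
    by_teacher = sorted(teacher_scores, key=teacher_scores.get, reverse=True)
    by_student = sorted(student_scores, key=student_scores.get)
    return set(by_teacher[:n]) & set(by_student[:n])

def disagreement(teacher_scores, student_scores, teacher_parses, student_parses):
    """Disagreement: the analyses differ and the teacher is more confident."""
    return {s for s in teacher_scores
            if teacher_parses[s] != student_parses[s]
            and teacher_scores[s] > student_scores[s]}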


Two Pilot Studies

• What scoring function might work well?
  – Experiment with different scoring functions while using the same selection method.
• What selection method might work well?
  – Experiment with different selection methods while using the same scoring function.


Experimental Setup

• Unlabelled training data: derived from Penn Treebank WSJ sections 02-21.
• Cache size: 50 sentences per iteration.
• Development test set: WSJ section 00.
• Evaluation metric: parseval (see the sketch below)
  – Precision (P) = (# of correctly guessed labelled constituents) / (# of guessed labelled constituents)
  – Recall (R) = (# of correctly guessed labelled constituents) / (# of labelled constituents in the gold standard)
  – F2 = (2 × P × R) / (P + R)
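A minimal sketch of these metrics, representing a parse as a collection of labelled constituents (label, start, end); treating the constituents as a set is a simplification of full parseval:

def parseval(guessed, gold):
    """Labelled precision, recall and F-score over constituent tuples."""
    correct = len(set(guessed) & set(gold))
    precision = correct / len(guessed) if guessed else 0.0
    recall = correct / len(gold) if gold else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score

# One of two guessed constituents matches the gold standard: P = R = F = 0.5.
print(parseval([("NP", 0, 2), ("VP", 2, 5)], [("NP", 0, 2), ("VP", 2, 6)]))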

Scoring Function Variation Study

• Setting: Two versions of the same parser trained on different initial labelled seed data (1000-sentence subsets of the Penn Treebank; respectively without conjunctions and without prepositions).
• Scoring functions: parseval, best parse probability, tree-entropy, combination.
• Selection method: intersection, picking 20 sentences per iteration.


Scoring Function Variation Learning Curve

• All practical scoring functions performed similarly.

[Figure: parsing accuracy on test data (F2 score) against number of training sentences (1000-3000) for the Parseval, Probability, Tree Entropy, and Combination scoring functions; accuracies lie roughly between 77 and 83.]

Selection Method Variation Study

• Setting: Two different parsers (CFG and LTAG) trained on the same initial labelled seed data (500 or 1000 sentences from the Penn Treebank).
• Scoring function: parseval.
• Selection methods: Top-n, Difference, Intersection, Disagreement.


Selection Method Variation (500 examples)

• "Smart" selection methods do better than Top-n.
• None is as good as human-annotated data.

[Figure: parsing accuracy on test data (F2) against number of training sentences (600-2000), comparing Hand-Annotated, Top-N, Difference, Intersection, and Disagreement; accuracies lie roughly between 77 and 83.]

Selection Method Variation (1000 examples)

• Similar trend for the experiment with the larger initial labelled set.
• The disparity between the selection methods and the hand-annotated upper bound is bigger.

[Figure: parsing accuracy on test data (F2) against number of training sentences (1000-2000), comparing Hand-Annotated, Top-n, Difference, Intersection, and Disagreement; accuracies lie roughly between 80 and 83.]

Summary

• Use parse selection heuristics to approximate the objective function.
• Scoring function differences are inconclusive: use log probability for the large-scale experiments.
• Selection methods that use scores from both parsers (Difference, Intersection, Disagreement) seem more reliable.


Co-training between Collins-CFG and LTAG

Rebecca Hwa, Miles Osborne and Anoop Sarkar


Collins-CFG and LTAG: Different Views for Co-Training

Collins-CFG:
• Bi-lexical dependencies are between nonterminals
• Can produce novel elementary trees for the LTAG
• Learning curves show convergence on 1M words of labeled data
• When using small amounts of seed data, abstains less often than LTAG

LTAG:
• Bi-lexical dependencies are between elementary trees
• Can produce novel bi-lexical dependencies for Collins-CFG
• Learning curves show LTAG needs relatively more labeled data
• When using small amounts of seed data, abstains more often than Collins-CFG

Experimental comparison between Collins-CFG and LTAG (self-training)

[Figure: F-score against iterations (0-120) for Collins-CFG self-training and LTAG self-training; F-scores lie roughly between 72.5 and 76.5.]

Workshop Goals

• Use co-training to boost performance when faced with small seed data
  → Use small subsets of WSJ labeled data as seed data
• Use co-training to port parsers to new genres
  → Use the Brown corpus as seed data, co-train and test on WSJ
• Use a large set of labeled data and use unlabeled data to improve parsing performance
  → Use the Penn Treebank (40K sentences) as seed data

Experiments on Small Labeled Seed Data

• Motivating the size of the initial seed data set:
• We plotted learning curves, tracking parser accuracy while varying the amount of labeled data.
• Find the "elbow" in the curve where the payoff will occur.
• This was done for both the Collins-CFG and the LTAG parser.

Collins Parser learning curve: Penn Treebank sec. 2-21, in order

[Figure: F-measure against number of training sentences (0 to 40,000). The fitted curve is f = a(1 − b^(x^c)) with a = 0.892732, b = 0.574215, c = 0.20543, where x is the number of training sentences. Predicted performance at 40K sentences = 0.88586; best observed = 0.886645.]

Upper bound (asymptote a) = 89.3%
Projected to a treebank of 80K sentences = 88.9%
Projected to 400K sentences = 89.2%
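The projections quoted above can be read off the fitted curve directly; a small sketch using the functional form and parameters shown on the slide:

def fitted_f(x, a=0.892732, b=0.574215, c=0.20543):
    """Fitted learning curve: predicted F-measure after x training sentences."""
    return a * (1.0 - b ** (x ** c))

print(round(fitted_f(40_000), 4))   # ~0.886, the predicted performance at 40K sentences
print(round(fitted_f(80_000), 4))   # ~0.889, projected for an 80K-sentence treebank
print(round(fitted_f(400_000), 4))  # ~0.892, projected for a 400K-sentence treebank
# The asymptote a (~89.3%) is the upper bound as x grows without limit.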

Workshop Goals

• Use co-training to boost performance when faced with small seed data
  → Use 500 sentences of WSJ labeled data as seed data
  → Compare performance of co-training vs. self-training
• Use co-training to port parsers to new genres
  → Use the Brown corpus as seed data, co-train and test on WSJ
• Use a large set of labeled data and use unlabeled data to improve parsing performance
  → Use the Penn Treebank (40K sentences) as seed data

Co-training compared with Self-training (Collins-CFG with LTAG, initialized with 500 sentences of seed data)

[Figure: F-score against iterations (0-120) for Collins-CFG co-training and Collins-CFG self-training; F-scores lie roughly between 74.5 and 78.5.]

Workshop Goals

• Use co-training to boost performance when faced with small seed data
  → Co-training beats self-training with 500-sentence seed data
  → Compare parse selection methods for different parser views
• Use co-training to port parsers to new genres
  → Use the Brown corpus as seed data, co-train and test on WSJ
• Use a large set of labeled data and use unlabeled data to improve parsing performance
  → Use the Penn Treebank (40K sentences) as seed data

Comparing parse selection methods between Collins-CFG and LTAG

[Figure: Collins-CFG F-score against iterations (0-80) for the "throw all" and "intersection" selection methods; F-scores lie roughly between 75 and 78.]

Comparing parse selection methods between Collins-CFG and LTAG

[Figure: LTAG F-score against iterations (0-60) for the "throw all" and "intersection" selection methods; F-scores lie roughly between 87 and 89.5.]

Workshop Goals

• Use co-training to boost performance when faced with small seed data
  → Co-training beats self-training with 500-sentence seed data
  → Different parse selection methods are better for different parser views
  → Compare performance when the seed data is doubled to 1K sentences
• Use co-training to port parsers to new genres
  → Use the Brown corpus as seed data, co-train and test on WSJ
• Use a large set of labeled data and use unlabeled data to improve parsing performance
  → Use the Penn Treebank (40K sentences) as seed data

Comparison between 500 sentence seed data vs. 1K sentence seed data (Collins-CFG with LTAG)

[Figure: Collins-CFG F-score against iterations (0-120) for 500-sentence co-training ("throw all") and 1K-sentence co-training ("intersection"); F-scores lie roughly between 75 and 80.]

Workshop Goals

• Use co-training to boost performance when faced with small seed data
  → Co-training beats self-training with 500-sentence seed data
  → Different parse selection methods are better for different parser views
  → Co-training still improves performance with 1K-sentence seed data
• Use co-training to port parsers to new genres
  → Use the Brown corpus as seed data, co-train and test on WSJ
• Use a large set of labeled data and use unlabeled data to improve parsing performance
  → Use the Penn Treebank (40K sentences) as seed data

Train on Brown corpus as seed data, co-train and test on WSJ (Collins-CFG with LTAG)

[Figure: F-score against iterations (0-140) for co-training with Brown seed data alone and with Brown seed data plus 100 WSJ sentences; F-scores lie roughly between 76.5 and 80.5.]

Workshop Goals

• Use co-training to boost performance when faced with small seed data
  → Co-training beats self-training with 500-sentence seed data
  → Different parse selection methods are better for different parser views
  → Co-training still improves performance with 1K-sentence seed data
• Use co-training to port parsers to new genres
  → Co-training improves performance significantly when porting from one genre (Brown) to another (WSJ)
• Use a large set of labeled data and use unlabeled data to improve parsing performance
  → Use the Penn Treebank (40K sentences) as seed data

Co-training using a large labeled seed set

• Experiments using 40K Penn Treebank WSJ sentences as seed data for co-training did not produce a positive result.
• Even after adding 260K sentences of unlabeled data, co-training did not significantly improve performance over the baseline.
• However, we plan to do more experiments in the future which leverage more recent work on parse selection and the difference between the Collins-CFG and LTAG views.

Collins Parser learning curve: Penn Treebank sec. 2-21, in order

[Figure: the same learning curve shown earlier, repeated; fitted curve f = a(1 − b^(x^c)) with a = 0.892732, b = 0.574215, c = 0.20543; best observed = 0.886645; upper bound = 89.3%; projected to 80K sentences = 88.9%, to 400K sentences = 89.2%.]

LTAG parser learning curve: fscores for sentences
