Searching Syntactically Annotated Corpora

Intro/Examples Methodology Tools References

Searching Syntactically Annotated Corpora
Michael White and Chris Brew
Department of Linguistics
The Ohio State University

OSU Mini-Institute Corpus-Based Computational Linguistics Day 3

White, Brew

Searching Syntactically Annotated Corpora


[Beaver et al., 2005] [Resnik et al., 2005]

Why Search in Syntactically Annotated Corpora?

To facilitate (e.g.):
Analysis of constructions in corpus-based grammar engineering
Fine-grained error analysis of statistical parsers
Data-driven linguistic research


Data-Driven Linguistic Research? (Really?)

But my intuitions are never wrong!

(You might at least look for counter-examples.)


NP Distribution in Existentials

It is well known that not all NP types are equally felicitous in existential constructions:
(1) There is a problem with this mobile phone.
(6) ??There is the problem (again) with this mobile phone.

The literature on this “definiteness effect” is largely based on constructed English examples. [Beaver et al., 2005] argue, using web and corpus data, that the definiteness effect is not categorical; moreover, consistent with this view, they claim that cross-linguistic variation in such constructions can be accounted for by markedness constraints on subjects.


Ratios of Google Counts for NP Types

Definiteness effect, but not categorical.

For the study of English NP distribution using Google, results are presented for a range of NP types in figure 1.

[Figure 1: English Canonical/Existential ratios. Axis: #canonical/#existential (log scale, 0.01 to 1000000). NP types shown include pronouns, each, most, all/every, the, both, this/these, that/those, a, one, some, something, at least N, many, a few, a couple of, few, numerals, no, and at most N.]

[...] in part to the desire of newspaper writers to appear authoritative about the information they present, an effect which may be obtained if indefinites are given a specific reading. If this reading is more easily obtained in canonical subject position, pivot position will be disfavored for indefinites. However, the effect runs across the board of NP types (though it is particularly strong for indefinites), so there must be other factors, e.g. a desire to keep length down, and so avoid expletives that from an editor’s point of view might appear unnecessary.

Ratios of Annotated Corpora Counts for NP Types

Genre effect too.

[Figure 2: Comparison of C/E ratios by Genre. Series: Switchboard, Brown, WSJ. Axis: Canonical/Existential Ratio (log scale, 0.01 to 100000). NP types: pro (he etc.), name, the, this, that, possessive, number (pro), Det N, number, somebody/one, a few, Det of, some, many, some (pro), a/an, something, no, nothing.]


Checking Experimental Materials

As reported in [Resnik et al., 2005], to avoid a confound in a psycholinguistic study of Principle C effects on sentence processing, Lau et al. needed to verify that the complement clauses in sentences like (18) are almost obligatory:
(18a) It was clear to him_i that John_*i/j should go.
(18b) It was clear to his_i mother that John_i should go.

Using the Linguist’s Search Engine (LSE), they were able to verify that sentences of the form It [aux] [adj] to [NP] virtually always continue with a complement clause in naturally occurring web data. An off-line completion study later confirmed the conclusions of their LSE search.


Recall and Precision Web Corpora


How to Measure Search Quality?

Definition: Recall = (# correct results returned) / (# all desired results)

Definition: Precision = (# correct results returned) / (# all returned results)

(There’s always a tradeoff between recall and precision.)
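The two definitions can be made concrete with a small sketch. It assumes each search result and each desired result can be identified by an ID; the function name and the sentence IDs below are hypothetical and chosen only for illustration.

```python
def precision_recall(returned, desired):
    """Return (precision, recall) for returned hits versus desired hits."""
    returned, desired = set(returned), set(desired)
    correct = returned & desired  # correct results returned
    precision = len(correct) / len(returned) if returned else 0.0
    recall = len(correct) / len(desired) if desired else 0.0
    return precision, recall


# Hypothetical sentence IDs: five sentences really contain the construction,
# and the query returns four sentences, three of which are genuine.
desired = {1, 2, 3, 4, 5}
returned = {2, 3, 5, 8}

precision, recall = precision_recall(returned, desired)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Loosening the query would typically return more of the five desired sentences (higher recall) at the cost of more spurious hits (lower precision).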


Google Searches

Beaver et al. searched for (e.g.): There is the * in

Is this high or low recall? Spelling out entire paradigms can sometimes help, but inevitable approximations lead to low precision.

What to do? Beaver et al. extrapolated from a manually verified sample.
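One way to picture the extrapolation step: check a random sample of hits by hand, take the proportion that are genuine, and scale the raw hit count by it. The sketch below is an illustration under stated assumptions, not the procedure reported by Beaver et al.; the function name, the numbers, and the normal-approximation interval are all assumptions added here.

```python
import math


def extrapolate_hits(total_hits, sample_size, sample_correct, z=1.96):
    """Estimate the number of genuine hits from a manually verified sample.

    Returns (estimate, low, high), where low/high come from a
    normal-approximation interval on the sample proportion (an assumption).
    """
    if not 0 < sample_size <= total_hits:
        raise ValueError("sample_size must be between 1 and total_hits")
    if not 0 <= sample_correct <= sample_size:
        raise ValueError("sample_correct must be between 0 and sample_size")
    p = sample_correct / sample_size              # precision in the sample
    se = math.sqrt(p * (1 - p) / sample_size)     # standard error of p
    low = max(0.0, p - z * se)
    high = min(1.0, p + z * se)
    return total_hits * p, total_hits * low, total_hits * high


# Hypothetical numbers: 10000 raw hits, 200 checked by hand, 150 genuine.
estimate, low, high = extrapolate_hits(10000, 200, 150)
print(f"estimated genuine hits: {estimate:.0f} (roughly {low:.0f} to {high:.0f})")
```

The interval only reflects sampling error in the manual check; it says nothing about how accurate the search engine's reported hit count is.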

N.B.: For bigrams, [Keller and Lapata, 2003] show that web frequencies correlate highly with corpus frequencies, and reliably with plausibility judgments.


Annotated Corpora

Annotated corpora make more precise queries possible without sacrificing recall. Beaver et al. used tgrep2 to search for all NPs in existential constructions.

They also enable repeatable experiments, are carefully collected, avoid potential biases introduced by result rankings, etc.

But data sparsity is still a big issue. (Darn Zipf!)

And annotation errors can get in the way; see [Meurers and Müller, 2007] for discussion.
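To give a feel for what a tree query buys, here is a minimal pure-Python sketch that finds the pivot NP of an existential in Penn Treebank-style bracketings. It is not tgrep2 and not the query Beaver et al. ran; the helper names, the hand-written trees, and the simplified matching rule (a clause with an NP containing an EX tag, plus an NP directly under its VP) are assumptions for illustration.

```python
import re


def parse_tree(text):
    """Parse a bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return node()


def subtrees(tree):
    """Yield the tree and all of its non-leaf descendants."""
    yield tree
    for child in tree[1]:
        if isinstance(child, tuple):
            yield from subtrees(child)


def leaves(tree):
    """Return the words dominated by a node, in order."""
    words = []
    for child in tree[1]:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


def existential_pivots(tree):
    """Yield the words of each NP directly under the VP of a there-clause.

    Simplification: nested VPs (e.g. after a modal) are not searched.
    """
    for label, children in subtrees(tree):
        kids = [c for c in children if isinstance(c, tuple)]
        has_expletive = any(
            k[0].startswith("NP")
            and any(isinstance(g, tuple) and g[0] == "EX" for g in k[1])
            for k in kids
        )
        if not (label.startswith("S") and has_expletive):
            continue
        for k in kids:
            if k[0].startswith("VP"):
                for g in k[1]:
                    if isinstance(g, tuple) and g[0].startswith("NP"):
                        yield leaves(g)


# Hand-written tree for example (1); not taken from any corpus.
tree = parse_tree(
    "(S (NP-SBJ (EX There)) (VP (VBZ is) (NP-PRD (NP (DT a) (NN problem))"
    " (PP (IN with) (NP (DT this) (JJ mobile) (NN phone))))) (. .))"
)
for pivot in existential_pivots(tree):
    print(" ".join(pivot))
```

Because the match is stated over constituents rather than word strings, the whole pivot NP is returned however long it is, which a wildcard web query cannot guarantee.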


POS Tags Syntax Trees Web + Trees

Do Part-of-Speech Tags Suffice?

There are several nice web interfaces, e.g. http://www.americancorpus.org/ (Davies/BYU).

And nltk provides easy-to-learn programmable tools.
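For some questions a pattern over part-of-speech tags may be enough. The sketch below uses only the standard library (not nltk or any web interface) and searches hand-tagged sentences in word/TAG format for expletive there, then a verb, then a determiner. The sentences, the tag pattern, and the function name are hypothetical.

```python
import re

# Hand-tagged examples in word/TAG format (Penn Treebank tags); hypothetical data.
tagged = [
    "There/EX is/VBZ a/DT problem/NN with/IN this/DT mobile/JJ phone/NN ./.",
    "There/EX is/VBZ the/DT problem/NN again/RB ./.",
    "The/DT problem/NN is/VBZ there/RB ./.",
]

# Expletive "there" (EX), a verb tagged VB, VBD, VBZ or VBP, then a determiner.
EXISTENTIAL = re.compile(r"(?:^| )\S+/EX \S+/VB[DZP]? (\S+)/DT ")


def existential_determiners(sentences):
    """Return the determiner that opens the pivot NP in each matching sentence."""
    found = []
    for sentence in sentences:
        match = EXISTENTIAL.search(sentence)
        if match:
            found.append(match.group(1).lower())
    return found


print(existential_determiners(tagged))
```

The pattern separates expletive there (EX) from locative there (RB), which a plain string search cannot do. It still misses pivots that do not start with a determiner or that are separated from the verb by other material, which is where tree-based search becomes useful.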


Tregex/Tsurgeon

Tregex/Tsurgeon [Levy and Andrew, 2006] is a tree matching and editing tool that extends the well-known tgrep2 in several useful ways. Cross-platform Java API, GUI, easy-to-install. Supports tree surgery!


TIGERSearch

TIGERSearch supports creating a query by example. Cross-platform, GUI. Allows discontinuous constituents to be directly represented; especially useful for the TIGER corpus of German.


LSE

The Linguist’s Search Engine (LSE) combines a query-by-example interface with automatic parsing and web crawling. You could create your own automatically parsed corpora from custom web searches, but LSE nicely bundles all the required steps for you (http://lse.umiacs.umd.edu:8080/).


References

Beaver, D., Francez, I., and Levinson, D. (2005). Bad Subject: (Non-)Canonicality and NP Distribution in Existentials. In Proc. Semantics and Linguistic Theory XV.

Keller, F. and Lapata, M. (2003). Using the web to obtain frequencies for unseen bigrams. Computational Linguistics, 29:459–484.

Levy, R. and Andrew, G. (2006). Tregex and Tsurgeon: tools for querying and manipulating tree data structures. In 5th International Conference on Language Resources and Evaluation (LREC 2006).

Meurers, W. D. and Müller, S. (2007). Corpora and syntax. In Lüdeling, A. and Kytö, M., editors, Corpus Linguistics. Mouton de Gruyter, Berlin.

Resnik, P., Elkiss, A., Lau, E., and Taylor, H. (2005). The Web in Theoretical Linguistics Research: Two Case Studies using the Linguist’s Search Engine. In Proc. of the 31st Meeting of the Berkeley Linguistics Society.
