Intro/Examples Methodology Tools References
Searching Syntactically Annotated Corpora
Michael White and Chris Brew
Department of Linguistics, The Ohio State University
OSU Mini-Institute Corpus-Based Computational Linguistics Day 3
[Beaver et al., 2005] [Resnik et al., 2005]
Why Search in Syntactically Annotated Corpora?
To facilitate (e.g.):
- Analysis of constructions in corpus-based grammar engineering
- Fine-grained error analysis of statistical parsers
- Data-driven linguistic research
Data-Driven Linguistic Research? (Really?)
But my intuitions are never wrong!
(You might at least look for counter-examples.)
NP Distribution in Existentials
It is well known that not all NP types are equally felicitous in existential constructions:
(1) There is a problem with this mobile phone.
(6) ??There is the problem (again) with this mobile phone.
The literature on this “definiteness effect” is largely based on constructed English examples.
[Beaver et al., 2005] argue, using web and corpus data, that the definiteness effect is not categorical; moreover, consistent with this view, they claim that cross-linguistic variation in such constructions can be accounted for by markedness constraints on subjects.
Ratios of Google Counts for NP Types
Definiteness effect, but not categorical
For the study of English NP distribution using Google, results are presented for a range of NP types in figure 1.
[Figure 1: English Canonical/Existential ratios, one bar per NP type. Axis: #canonical/#existential (log scale, 0.01 to 1,000,000).]
Ratios of Annotated Corpora Counts for NP Types
Genre effect too
[Figure 2: Comparison of C/E ratios by Genre. Series: Switchboard, Brown, WSJ. Axis: Canonical/Existential Ratio (log scale, 0.01 to 100,000). NP types: pro (he etc.), name, the, this, that, possessive, number (pro), Det N, number, somebody/one, a few, Det of, some, many, some (pro), a/an, something, no, nothing.]
Excerpt from the text accompanying the figure: "... in part to the desire of newspaper writers to appear authoritative about the information they present, an effect which may be obtained if indefinites are given a specific reading. If this reading is more easily obtained in canonical subject position, pivot position will be disfavored for indefinites. However, the effect runs across the board of NP types (though it is particularly strong for indefinites), so there must be other factors, e.g. a desire to keep length down, and so avoid expletives that from an editor’s point of view might appear unnecessary."
Checking Experimental Materials
As reported in [Resnik et al., 2005], to avoid a confound in a psycholinguistic study of Principle C effects on sentence processing, Lau et al. needed to verify that the complement clauses in sentences like (18) are almost obligatory:
(18a) It was clear to him_i that John_*i/j should go.
(18b) It was clear to his_i mother that John_i should go.
Using the Linguist’s Search Engine (LSE), they were able to verify that sentences of the form It [aux] [adj] to [NP] virtually always continue with a complement clause in naturally occurring web data. An off-line completion study later confirmed the conclusions of their LSE search.
Recall and Precision Web Corpora
How to Measure Search Quality?
Definition:
\[ \text{Recall} = \frac{\#\ \text{correct results returned}}{\#\ \text{all desired results}} \]
Definition:
\[ \text{Precision} = \frac{\#\ \text{correct results returned}}{\#\ \text{all returned results}} \]
(There’s always a tradeoff between recall and precision.)
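A minimal sketch of the two definitions in Python, treating search results as sets of sentence IDs. The function name and the example IDs are hypothetical and chosen only for illustration.

```python
def precision_recall(returned, desired):
    """Return (precision, recall) for a set of returned results
    measured against the set of desired results."""
    returned, desired = set(returned), set(desired)
    correct = returned & desired  # correct results returned
    precision = len(correct) / len(returned) if returned else 0.0
    recall = len(correct) / len(desired) if desired else 0.0
    return precision, recall


# Hypothetical sentence IDs, not real corpus data.
returned = {1, 2, 3, 4, 5}       # what the query returned
desired = {2, 3, 5, 8, 9, 10}    # what we actually wanted to find
precision, recall = precision_recall(returned, desired)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Note that recall needs the full set of desired results, which is rarely known for web searches; that is one reason recall is hard to assess there.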
Google Searches
Beaver et al. searched for (e.g.):
    There is the * in
Is this high or low recall?
Spelling out entire paradigms can sometimes help, but inevitable approximations lead to low precision. What to do?
Beaver et al. extrapolated from a manually verified sample.
N.B.: For bigrams, [Keller and Lapata, 2003] show that web frequencies correlate highly with corpus frequencies, and reliably with plausibility judgments.
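One way to read "extrapolated from a manually verified sample" is to scale the raw hit count by the precision observed in a random sample of hits. The sketch below assumes exactly that; it is not Beaver et al.'s actual procedure. The function name, the normal-approximation interval, and all numbers are illustrative assumptions.

```python
import math


def extrapolate_hits(total_hits, sample_size, sample_correct, z=1.96):
    """Estimate the number of genuine hits from a manually checked random sample.

    Returns (estimate, low, high); low/high come from a normal-approximation
    interval on the sample precision (an assumption of this sketch).
    """
    if sample_size <= 0 or not 0 <= sample_correct <= sample_size:
        raise ValueError("need sample_size > 0 and 0 <= sample_correct <= sample_size")
    p = sample_correct / sample_size                 # precision in the sample
    se = math.sqrt(p * (1 - p) / sample_size)        # standard error of p
    low = max(0.0, p - z * se)
    high = min(1.0, p + z * se)
    return total_hits * p, total_hits * low, total_hits * high


# Invented numbers: 10,000 reported hits, 100 checked by hand, 80 genuine.
estimate, low, high = extrapolate_hits(10_000, 100, 80)
print(f"estimated genuine hits: {estimate:.0f} (roughly {low:.0f} to {high:.0f})")
```

The estimate is only as good as the sample: hits taken from the top of a ranked result list are not a random sample.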
Annotated Corpora
Annotated corpora make more precise queries possible without sacrificing recall. Beaver et al. used tgrep2 to search for all NPs in existential constructions.
They also enable repeatable experiments, are carefully collected, avoid potential biases introduced by result rankings, etc.
But data sparsity is still a big issue. (Darn Zipf!)
And annotation errors can get in the way; see [Meurers and Müller, 2007] for discussion.
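To make the idea of a structural query concrete, here is a small pure-Python stand-in for the kind of search a tool like tgrep2 performs: find the NP that follows the verb in clauses whose subject is expletive "there". It is not tgrep2 and not Beaver et al.'s query. It assumes Penn-Treebank-style bracketing in which the expletive is tagged EX inside the subject NP, and all function names are hypothetical.

```python
import re


def parse_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples; leaves are strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1
            label = tokens[pos]
            pos += 1
            children = []
            while tokens[pos] != ")":
                children.append(read())
            pos += 1  # consume ")"
            return (label, children)
        word = tokens[pos]
        pos += 1
        return word

    return read()


def subtrees(node):
    """Yield every non-leaf node, top-down."""
    if isinstance(node, tuple):
        yield node
        for child in node[1]:
            yield from subtrees(child)


def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1] for w in leaves(child)]


def existential_pivots(tree):
    """Return the NPs (as word strings) inside the VP of clauses with an EX subject."""
    pivots = []
    for label, children in subtrees(tree):
        if label.split("-")[0] != "S":
            continue
        kids = [c for c in children if isinstance(c, tuple)]
        has_ex_subject = any(
            c[0].startswith("NP")
            and any(isinstance(g, tuple) and g[0] == "EX" for g in c[1])
            for c in kids
        )
        if not has_ex_subject:
            continue
        for c in kids:
            if c[0].startswith("VP"):
                for g in c[1]:
                    if isinstance(g, tuple) and g[0].startswith("NP"):
                        pivots.append(" ".join(leaves(g)))
    return pivots


existential = parse_tree(
    "(S (NP-SBJ (EX There)) (VP (VBZ is) (NP-PRD (NP (DT a) (NN problem)) "
    "(PP (IN with) (NP (DT this) (JJ mobile) (NN phone))))) (. .))"
)
canonical = parse_tree(
    "(S (NP-SBJ (DT The) (NN problem)) (VP (VBZ is) (ADJP-PRD (JJ serious))) (. .))"
)
print(existential_pivots(existential))
print(existential_pivots(canonical))
```

Because the query is stated over tree structure rather than word strings, it matches the whole pivot NP whatever its length, which is the sense in which annotation can raise precision without giving up recall.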
POS Tags Syntax Trees Web + Trees
Do Part-of-Speech Tags Suffice?
There are several nice web interfaces, e.g. http://www.americancorpus.org/ (Davies/BYU).
And nltk provides easy-to-learn programmable tools.
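As one possible illustration of the question in the slide title, the sketch below searches POS-tagged text for existential "there" followed by a verb and a definite determiner. It is written in plain Python rather than with nltk so that it runs standalone; the (word, tag) pair format, the Penn-style tag names, and the function name are assumptions of this sketch.

```python
def find_there_be_det(tagged_sentence, determiners=("the",)):
    """Return start indices where EX is followed by a verb and one of the given determiners."""
    hits = []
    for i in range(len(tagged_sentence) - 2):
        (_, t1), (_, t2), (w3, t3) = tagged_sentence[i:i + 3]
        if t1 == "EX" and t2.startswith("VB") and t3 == "DT" and w3.lower() in determiners:
            hits.append(i)
    return hits


# Hand-tagged toy sentences, not corpus data.
adjacent = [("There", "EX"), ("is", "VBZ"), ("the", "DT"), ("problem", "NN"),
            ("with", "IN"), ("this", "DT"), ("phone", "NN")]
intervening = [("There", "EX"), ("is", "VBZ"), ("often", "RB"), ("the", "DT"),
               ("problem", "NN")]

print(find_there_be_det(adjacent))      # the fixed three-tag window matches
print(find_there_be_det(intervening))   # the adverb breaks the window, so the hit is missed
```

The second sentence shows where flat tag patterns fall short: any intervening material must be anticipated in the pattern, whereas a tree query can simply ask for the NP inside the VP.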
Tregex/Tsurgeon
Tregex/Tsurgeon [Levy and Andrew, 2006] is a tree matching and editing tool that extends the well-known tgrep2 in several useful ways. Cross-platform Java API, GUI, easy-to-install. Supports tree surgery!
TIGERSearch
TIGERSearch supports creating a query by example. Cross-platform, GUI. Allows discontinuous constituents to be directly represented; especially useful for the TIGER corpus of German.
LSE
The Linguist’s Search Engine (LSE) combines a query-by-example interface with automatic parsing and web crawling. You could create your own automatically parsed corpora from custom web searches, but LSE nicely bundles all the required steps for you (http://lse.umiacs.umd.edu:8080/).
References

Beaver, D., Francez, I., and Levinson, D. (2005). Bad Subject: (Non-)Canonicality and NP Distribution in Existentials. In Proc. Semantics and Linguistic Theory XV.

Keller, F. and Lapata, M. (2003). Using the web to obtain frequencies for unseen bigrams. Computational Linguistics, 29:459–484.

Levy, R. and Andrew, G. (2006). Tregex and Tsurgeon: tools for querying and manipulating tree data structures. In 5th International Conference on Language Resources and Evaluation (LREC 2006).

Meurers, W. D. and Müller, S. (2007). Corpora and syntax. In Lüdeling, A. and Kytö, M., editors, Corpus Linguistics. Mouton de Gruyter, Berlin.

Resnik, P., Elkiss, A., Lau, E., and Taylor, H. (2005). The Web in Theoretical Linguistics Research: Two Case Studies using the Linguist’s Search Engine. In Proc. of the 31st Meeting of the Berkeley Linguistics Society.