Exploring the distribution of animacy: experiments on Norwegian
Lilja Øvrelid
NLP-unit, Dept. of Swedish
Göteborg University
Animacy is an inherent property of the referents of nouns which has been claimed to figure as an influencing factor in a range of different grammatical phenomena in various languages. In recent years several linguistic studies have examined the influence of argument animacy in grammatical phenomena such as differential object marking (Aissen, 2003), the passive construction (Dingare, 2001) and the dative alternation (Bresnan et al., 2005). A variety of languages are sensitive to the dimension of animacy in the expression and interpretation of core syntactic arguments (Lee, 2002; Øvrelid, 2004), either on a categorical level or as a strong statistical tendency.

This talk will report on machine learning experiments aimed at automatically acquiring animacy information for common nouns in Norwegian (Øvrelid, 2006), which show that the animacy of a noun influences its linguistic distribution in such a consistent manner that an automatic classification based on distributional features is worthwhile. By exploiting the strong correlation between the animacy dimension and other linguistic dimensions, new knowledge about the semantic and distributional properties of various constructions may be obtained with very little manual effort. The experiments also raise the question of how the dimension of animacy may be conceptualised and delimited based on distributional evidence from large corpora.

Machine learning experiments are to a large degree dependent on the set of features chosen to represent the data. A key generalisation or tendency observed both in traditional typological linguistics and in the more recent linguistic studies where animacy figures is that prominent grammatical features tend to attract other prominent features (Aissen, 2003); subjects, for instance, will tend to be animate and agentive, whereas objects prototypically are inanimate and themes/patients.
Exceptions to this generalisation express a more marked structure, a property which has consequences, for instance, for the distributional properties of the structure in question. For these experiments, a set of seven morphosyntactic features was selected, features which in various ways approximate the multi-faceted property of animacy. The seven features are presented in Table 1. In particular, these features exploit the strong correlation that animacy has to other linguistic dimensions, such as agentivity and discourse salience.

The experimental methodology is inspired by the experiments on the classification of intransitive verbs presented in Merlo and Stevenson (2001). For a set of forty highly frequent common nouns (20 animate, 20 inanimate), relative frequencies for the different morphosyntactic features described above were computed from the Oslo Corpus, a corpus of approximately 15 million words which has been automatically annotated with a Constraint Grammar tagger [1]. The mean relative frequencies for each class, animate and inanimate, are presented in the first two rows of Table 2. As we can see, quite a few of the features express morphosyntactic cues that are rather rare, and there is also quite a bit of variation in the data (represented by the standard deviation for each class-feature combination).

[1] The corpus is freely available for research purposes; see http://www.hf.uio.no/tekstlab for more information.
Feature  Description (corpus extraction)
SUBJ     Unambiguously tagged subject followed by a finite verb and an (direct) object
OBJ      Unambiguously tagged direct object
GEN      Genitive morphology (-s ending)
PASS     Demoted agent of a passive (complement of by-phrase)
ANAAN    Reference by animate personal pronoun (noun occurs as intransitive subject in a sentence preceding an initial animate pronoun)
ANAIN    Reference by inanimate personal pronoun (noun occurs as intransitive subject in a sentence preceding an initial inanimate pronoun)
REFL     Local reference by reflexive pronoun

Table 1: The seven morphosyntactic features employed to approximate animacy.
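The feature extraction can be illustrated with a minimal sketch. It assumes that a relative frequency is the count of a feature divided by the total number of occurrences of the noun, and that the standard deviation is the sample standard deviation; the text does not spell out either choice. The function names and all counts below are hypothetical and are not taken from the Oslo Corpus.

```python
from statistics import mean, stdev

# Feature names as listed in Table 1.
FEATURES = ["SUBJ", "OBJ", "GEN", "PASS", "ANAAN", "ANAIN", "REFL"]


def relative_frequencies(feature_counts, noun_total):
    """Turn raw feature counts for one noun into relative frequencies.

    Assumption: the denominator is the noun's total corpus frequency.
    """
    if noun_total <= 0:
        raise ValueError("noun_total must be positive")
    return {f: feature_counts.get(f, 0) / noun_total for f in FEATURES}


def class_summary(vectors):
    """Per-feature (mean, sample standard deviation) over the nouns of one class."""
    return {
        f: (mean([v[f] for v in vectors]), stdev([v[f] for v in vectors]))
        for f in FEATURES
    }


# Hypothetical counts for two nouns of the same class (illustration only).
noun_a = relative_frequencies({"SUBJ": 150, "OBJ": 100, "GEN": 40, "REFL": 5}, 1000)
noun_b = relative_frequencies({"SUBJ": 130, "OBJ": 120, "GEN": 50, "REFL": 6}, 1000)

summary = class_summary([noun_a, noun_b])
subj_mean, subj_sd = summary["SUBJ"]
print(f"SUBJ mean={subj_mean:.3f} sd={subj_sd:.3f}")
```

Each noun is thus represented by a seven-dimensional vector, and the class-level means and standard deviations are simple aggregates over those vectors.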
Class   SUBJ            OBJ             GEN             PASS            ANAANIM         ANAINAN         REFL
        Mean    SD      Mean    SD      Mean    SD      Mean    SD      Mean    SD      Mean    SD      Mean    SD
A       0.14    0.05    0.11    0.03    0.04    0.02    0.006   0.005   0.009   0.006   0.003   0.003   0.005   0.0008
I       0.07    0.03    0.23    0.10    0.02    0.03    0.002   0.002   0.003   0.002   0.006   0.003   0.001   0.0008
O       0.20    0.10    0.06    0.03    0.12    0.06    0.012   0.014   0.0009  0.001   0.005   0.002   0.004   0.0017

Table 2: Mean relative frequencies and standard deviation for each class (A(nimate), I(nanimate), O(rganization)) from feature extraction.
This, however, is to be expected, as all the features represent approximations of animacy, gathered from an automatically annotated, possibly quite noisy, corpus. Even so, the mean frequencies for the features all express a clear difference between the two classes in terms of distributional properties; the difference between the mean feature values for the classes ranges from double to five times the lowest class value. And indeed, these distributional differences are strong enough to base automatic classification of unseen nouns on. A classifier trained and tested on the forty nouns employing memory-based learning (Daelemans, 1999) achieves an accuracy of 95.0%, a 90% improvement over the baseline.

These results are interesting also from a more theoretical point of view. The linguistically motivated features chosen to approximate the property of animacy show a significant distributional difference which corresponds to gradient cut-off points on an animacy hierarchy. Under the assumption that animacy is a lexical property, at least at the core oppositions of an animacy hierarchy, the animacy of a noun may provide evidence for a more fine-grained lexical-semantic analysis of various constructions. Due to the strong correlation between animacy and agentivity, an empirical study by way of animacy also sheds light on the thematic role distribution with reference to particular constructions, employing data averaged over a large corpus and hence uncovering distributional preferences.

For instance, the reflexive feature REFL is one of the strongest features for the prediction of animacy in our experiments. Reflexives in Norwegian do not, however, necessarily express an agentive event, and may also be employed in medial constructions. With regard to productivity, however, it seems clear from the present study that reflexives with animate subjects are predominant, hence implying an agentive reading.
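The two figures quoted above for the binary task can be checked with a few lines of arithmetic. The sketch below copies the animate and inanimate class means from Table 2 and computes, per feature, the ratio of the higher to the lower mean. It also computes the relative improvement of 95.0% accuracy over a 50% baseline; the 50% value is an assumption that follows from the balanced 20/20 data set, since the text does not state the baseline for the binary task. The helper names are hypothetical.

```python
# Class means from Table 2. ANAAN and ANAIN are the Table 1 names of the
# two pronoun features (headed ANAANIM and ANAINAN in Table 2).
ANIMATE = {"SUBJ": 0.14, "OBJ": 0.11, "GEN": 0.04, "PASS": 0.006,
           "ANAAN": 0.009, "ANAIN": 0.003, "REFL": 0.005}
INANIMATE = {"SUBJ": 0.07, "OBJ": 0.23, "GEN": 0.02, "PASS": 0.002,
             "ANAAN": 0.003, "ANAIN": 0.006, "REFL": 0.001}


def mean_ratio(feature):
    """Ratio of the higher class mean to the lower one for a feature."""
    a, i = ANIMATE[feature], INANIMATE[feature]
    return max(a, i) / min(a, i)


def relative_improvement(accuracy, baseline):
    """Improvement over the baseline, expressed as a fraction of the baseline."""
    return (accuracy - baseline) / baseline


ratios = {f: round(mean_ratio(f), 2) for f in ANIMATE}
for feature, ratio in ratios.items():
    print(f"{feature}: {ratio:.2f}x")

# Assumed baseline of 0.50 (balanced two-class data, not stated in the text).
improvement = round(relative_improvement(0.95, 0.50), 2)
print(f"relative improvement: {improvement:.0%}")
```

Under these assumptions every ratio falls between 2 and 5, with REFL at the top of the range, and the relative improvement comes out at 90%.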
Machine learning experiments of this type also allow for the category of animacy itself to be explored. Various proposals have been made in the literature for animacy hierarchies (Silverstein, 1976; Comrie, 1989; Yamamoto, 1999). In an annotation study of animacy, Zaenen et al. (2004) propose a main three-way distinction for the category of animacy, where an intermediate category ‘other animates’ includes a rather heterogeneous set of entities: organisations, animals, intelligent machines and vehicles. However, what these seem to have in common is that they may all be construed linguistically as animate beings, even though they, in the real world, are not.

An experiment with a three-way classification task was therefore performed. For this experiment we extended the existing data set of forty nouns with a set of twenty organization nouns, using the same feature set. The relative frequencies obtained for these nouns are presented in the final row of Table 2. The classification performance obtained for the three-way classification task is 88.3%, a clear improvement over a 33.3% baseline and on a par with the results from the binary classification task. This indicates that the distributional properties of the organization nouns set them apart from the clear-cut animate and inanimate nouns. A closer look at the data provides empirical support for the intermediate animacy status of these nouns, showing a more flexible distributional behaviour compatible with both animate and inanimate readings. For instance, the organization nouns exhibit a much higher mean relative frequency for the genitive feature GEN. A small corpus study indicates that the organization nouns may in fact occur with the possessive relations associated with both animate and inanimate nouns, giving them a wider range of available lexical-semantic realizations.

In future work, it would be interesting to extend the current approach to verbs. Knowledge regarding the distribution of argument animacy clearly has bearings on the lexical semantics of various verbs and verb classes. Also, the linguistically motivated feature set selected for this task provides a basis for studying cross-linguistic variation of animacy. Experiments on languages other than Norwegian are therefore a line of research that might be worth pursuing.
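The classification setup used in both experiments can be sketched as a nearest-neighbour classifier, the core idea of memory-based learning. This is only an illustration under several assumptions: the text does not state the distance metric, the number of neighbours, or the evaluation scheme, so Euclidean distance, k = 1 and leave-one-out evaluation are choices made here. The toy vectors use three of the seven features and are invented for the example; they are not the experimental data.

```python
import math
from collections import Counter


def knn_predict(train, query, k=1):
    """Label `query` by majority vote among its k nearest stored examples."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(data, k=1):
    """Classify each example using all remaining examples as memory."""
    correct = 0
    for idx, (vector, label) in enumerate(data):
        memory = data[:idx] + data[idx + 1:]
        if knn_predict(memory, vector, k) == label:
            correct += 1
    return correct / len(data)


# Invented (SUBJ, OBJ, GEN) relative frequencies, for illustration only.
TOY = [
    ((0.15, 0.10, 0.04), "A"), ((0.13, 0.12, 0.05), "A"), ((0.16, 0.09, 0.03), "A"),
    ((0.06, 0.25, 0.02), "I"), ((0.08, 0.22, 0.01), "I"), ((0.07, 0.20, 0.03), "I"),
    ((0.21, 0.06, 0.12), "O"), ((0.19, 0.05, 0.13), "O"), ((0.22, 0.07, 0.10), "O"),
]

accuracy = leave_one_out_accuracy(TOY)
chance = 1 / len({label for _, label in TOY})  # chance level with balanced classes
print(f"accuracy={accuracy:.3f} chance={chance:.3f}")
```

With three equally sized classes the chance level is one third, which matches the 33.3% baseline reported for the three-way task. The accuracy on these well-separated toy vectors says nothing about the reported results.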
References

Judith Aissen. Differential Object Marking: Iconicity vs. economy. Natural Language and Linguistic Theory, 21:435–483, 2003.
Joan Bresnan, Anna Cueni, Tatiana Nikitina, and Harald Baayen. Predicting the dative alternation. To appear in Royal Netherlands Academy of Science Workshop on Foundations of Interpretation proceedings, 2005.
Bernard Comrie. Language Universals and Linguistic Typology. University of Chicago Press, 1989.
Walter Daelemans. Memory-based language processing. Journal for Experimental and Theoretical Artificial Intelligence, special issue, 11(3):287–467, 1999.
Shipra Dingare. The effect of feature hierarchies on frequencies of passivization in English. Master's thesis, Stanford University, August 2001.
Hanjung Lee. Prominence mismatch and markedness reduction in word order. Natural Language and Linguistic Theory, 2002.
Paola Merlo and Suzanne Stevenson. Automatic verb classification based on statistical distributions of argument structure. Computational Linguistics, 27(3):373–408, 2001.
Lilja Øvrelid. Disambiguation of syntactic functions in Norwegian: modeling variation in word order interpretations conditioned by animacy and definiteness. In Fred Karlsson, editor, Proceedings of the 20th Scandinavian Conference of Linguistics, Helsinki, 2004.
Lilja Øvrelid. Towards robust animacy classification using morphosyntactic distributional features. In Proceedings of the EACL 2006 Student Research Workshop, Trento, Italy, 2006.
M. Silverstein. Hierarchy of features and ergativity. In R.M.W. Dixon, editor, Grammatical categories in Australian Languages. Canberra: Australian Institute of Aboriginal Studies, 1976.
Mutsumi Yamamoto. Animacy and Reference: A cognitive approach to corpus linguistics. John Benjamins Publishing Company, 1999.
Annie Zaenen, Jean Carletta, Gregory Garretson, Joan Bresnan, Andrew Koontz-Garboden, Tatiana Nikitina, M. Catherine O'Connor, and Tom Wasow. Animacy encoding in English: why and how. In D. Byron and B. Webber, editors, ACL Workshop on Discourse Annotation, Barcelona, 2004.