Automatic Extraction of Subcategorization Frames for Bulgarian Svetoslav Marinov GSLT, Sweden
[email protected]
Abstract. Knowledge of a verb's valency, or subcategorization, is essential for many NLP tasks. The present paper describes an attempt to learn this kind of information from a corpus of parsed sentences of Bulgarian. Our program acquired subcategorization information for 38 verbs and achieved 87.7% precision and 68.3% recall. We did not use a predefined set of frames but instead induced the frames automatically from a treebank.
1 Introduction
A Subcategorization Frame (SF) is an expression of what kind of, and how many, syntactic arguments a verb takes. The external argument, i.e. the subject of the clause, is usually excluded from the SF, and the emphasis is on the internal ones: objects, infinitives, that-clauses, participle clauses and prepositional phrases. In this paper we present our attempt to extract and learn SFs for certain verbs from a parsed corpus of Bulgarian sentences. The work has been done on treebank data kindly provided by the BulTreeBank project [1]. The paper is organized as follows: Section 2 presents the task and reviews previous attempts and results in the area of automatic SF extraction; Section 3 gives a brief overview of some linguistic phenomena concerning verbs in Bulgarian; Section 4 describes the treebank we have used; Section 5 presents our system for extracting and learning SFs; Section 6 gives the results and Section 7 concludes.
[1] We would like to thank Kiril Simov for letting us use the corpus, which is still under development.
Chapter 1, Proceedings of the Ninth ESSLLI Student Session, Paul Egré and Laura Alonso i Alemany (editors). Copyright © 2004, Svetoslav Marinov.
2 Background
Manning (1993) and Briscoe and Carroll (1997), among others, discuss the need to acquire knowledge about verbs' SFs automatically rather than to create such lexicons manually. The authors agree on several issues: 1) manually compiled verb lexicons (i.e. lexicons of verbs with their respective SFs) are time-consuming to build, demand a lot of human resources and often contain mistakes; 2) such lexicons do not exist for all languages, e.g. Bulgarian; 3) they seldom represent up-to-date language use; 4) they are difficult to update; and 5) they are not coherent. Automatically created lexicons, according to the authors, have the benefit of 1) being based on real data, i.e. text corpora, and 2) being quick to update.

Brent (1991) is one of the earliest references on the acquisition of verbs' SFs. His program works on untagged text from the Wall Street Journal, from which verbs are extracted first. To do this he uses simplistic yet unambiguous cues to detect the verbs, namely the positions of pronouns and proper names, i.e. the so-called "case-marked" elements, which predict that a verb will be in their closest vicinity. After that the SFs are detected and statistically verified. Brent's best result is the discovery of 40% of the verbs taking a direct object, with only a 1.5% error rate; his worst result is the discovery of 3% of the verbs taking an infinitive, with a 3.0% error rate.

Manning (1993) criticizes this approach as a waste of valuable data. He also starts with raw text, but first tags it using Julian Kupiec's stochastic part-of-speech tagger and then sends the output to a finite-state parser in order to extract the SFs. His system, in comparison to Brent's, distinguishes among 19 different SFs, yet still uses a variant of Brent's binomial filtering process to filter the much higher number of false cues (i.e. initially discovered SFs).
Precision and recall were measured using the Oxford Advanced Learner's Dictionary as a benchmark: the precision estimate for 40 verbs was 90% and the recall 43%.

Briscoe and Carroll (1997) presented a complete system that distinguishes among slightly more than 160 different subcategorization classes. The main differences with respect to Manning's approach concern, according to the authors, the use of global instead of strictly local syntactic information and the use of a more complex (linguistically guided) filter on the extracted patterns. The system's output was evaluated by comparing the results for randomly selected verbs both to manual analyses of the verbs in question and to their respective entries in dictionaries containing argument-structure information [2]. Comparing the system's output to the manually made analyses gave a precision of 76.6% and a recall of 43.4%. One weak link of the system, which Briscoe and Carroll themselves point out, is the filtering of patterns extracted from the data, especially for low-frequency SFs. The filtering process is built on the binomial distribution test originally introduced by Brent.

Sarkar and Zeman (2000) take a slightly different approach to extracting SFs. First, they use syntactically annotated data (the Prague Dependency Treebank), and hence their choice of language also differs from the previous approaches, which all deal with English. Czech is a free word-order language and much closer to Bulgarian than English, so the results and techniques discussed by the authors are quite relevant for us. Sarkar and Zeman (2000) concentrate mostly on filtering adjuncts out of the observed frames. They use three different statistical techniques to learn possible SFs for given verbs, namely the likelihood ratio test, t-scores and the binomial distribution, and they check all possible subsets of observed frames in order to find the best match. Their best results were achieved using the binomial distribution: precision was measured at 88% and recall at 74%, and 137 SFs were learned.

The approach we take to SF extraction for Bulgarian is similar to that of Sarkar and Zeman (2000): we do not work with a predefined set of frames (cf. Manning (1993), Briscoe and Carroll (1997)), but rather acquire the frames from a treebank [3]. Statistical processing of the initially discovered frames is almost inevitable [4].
Most of the authors employ similar filtering techniques: log-likelihood ratio tests (Sarkar and Zeman 2000, Gorrell 1999), t-scores (Sarkar and Zeman 2000), the binomial distribution (Sarkar and Zeman 2000, Brent 1994, Manning 1993, Briscoe and Carroll 1997), the EM algorithm (Carroll and Rooth 1998) and clustering algorithms (Basili et al. 1997).
[2] The ANLT and COMLEX dictionaries; see (Briscoe and Carroll 1997).
[3] We thank an anonymous reviewer for noticing the lack of this information in the paper.
[4] In the majority of cases the SF-extraction algorithms also discover false frames/arguments for a given verb. These have to be filtered away, either by using hand-written rules or on a statistical basis.
Of the above methods the binomial distribution has been used most widely and probably most successfully. A problem with it, however, is that it assumes a uniform error likelihood for all verbs, disregarding their rather Zipfian distribution: it ignores the correlation between the conditional distribution of SFs given a predicate and the unconditional distribution independent of any specific predicate (Korhonen 2002). Sarkar and Zeman (2000) also mention this and propose the use of a multinomial distribution.
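Brent's binomial filter, reused in some variant by most of the later systems, can be sketched as follows. This is an illustrative reconstruction, not any of the cited implementations; the error probability p_err and the significance threshold alpha are hypothetical parameters chosen for the example.

```python
from math import comb

def binomial_filter(m, n, p_err=0.05, alpha=0.02):
    """Brent-style binomial hypothesis test (illustrative sketch).

    m:     times the verb was seen with the candidate frame
    n:     total occurrences of the verb
    p_err: assumed probability that a single cue for the frame is spurious
    alpha: significance threshold

    Accept the frame if observing at least m (possibly spurious) cues
    in n occurrences is sufficiently unlikely under chance alone.
    """
    # P(X >= m) for X ~ Binomial(n, p_err)
    p_at_least_m = sum(comb(n, k) * p_err**k * (1 - p_err)**(n - k)
                       for k in range(m, n + 1))
    return p_at_least_m < alpha
```

Under these assumed parameters, a frame cued 4 times in 20 occurrences of a verb passes the test, while a single cue does not; this is exactly the uniform-error assumption criticized above, since p_err does not vary with the verb.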
3 Syntactic particularities of Bulgarian
Bulgarian is a South Slavic language and can be characterized as having a relatively free word order. Although Bulgarian does not have morphological case on nouns, unlike the West and East Slavic languages, it possesses clitic pronouns marked for dative and accusative case. Whenever these clitics are present together with the nouns or noun phrases for direct and indirect object, i.e. the so-called clitic-doubling phenomenon, the full NPs [5] may appear in any order and precede the main verb. The subject NP may appear after the main verb. In addition, the pronominal subject of a clause can be omitted, since the verbal morphology carries the necessary information about person and number, on a par with the pro-drop phenomenon in some Romance languages, e.g. Italian and Spanish. These particularities are illustrated in (1).

(1) a. Az go vidyah.
       I him-ACC saw-1p.sg
       'I saw him.'
    b. Včera gi vidyah az.
       Yesterday them-ACC saw I
       'I saw them yesterday!'
    c. Knigata im ya dadoha na detsata.
       Book them-DAT it-ACC gave-3p.pl to kids
       'They gave the book to the kids.'
    d. Izyadoh az kiflata.
       Ate-1p.sg I bun
       'I ate the bun.'
The above examples clearly show the word orders SOV, OVS, OInd ODir V and VSO, in addition to the default SVO.

[5] Actually, the indirect object will be a PP.
Examples like those in (1) show the difficulties one encounters when designing an automatic SF-learner for Bulgarian.
4 BulTreeBank
For the present purpose we chose to look at data from the BulTreeBank project (Simov et al. 2002). We were offered a preliminary version of the parsed corpus, which consists of 580 sentences, fully parsed in the Head-Driven Phrase Structure Grammar (HPSG) formalism, where each word carries a rich part-of-speech tag. The original XML file was transliterated for ease of processing and viewing under different operating systems. However, since the corpus is still under development, information such as subcategorization is still missing from it: the ARG-ST and COMPS features standard in HPSG, which describe the argument structure of a verb, are not yet present in the annotated text. This renders our task a non-trivial one and our results valuable for expanding the treebank. The information present in the data we work with is just a parse tree with begin- and end-tags for each node, so that a sentence like (2-a) looks like (2-b), which for the sake of visibility is presented as a parse tree in (3):
(2) a. Ne mi e do smyah.
       Not me-REFL is to laughter
       'I'm not in the mood to laugh.'
    b. Ne/T mi/Pp-d1s-t e/Vx---f-r3s do/R smyah/Ncmsi
(3) [S [VPC [V [T Ne] [V [Pron mi] [V e]]] [PP [Prep do] [N smyah]]]]

    (Leaf tags: Ne/T, mi/Pp-d1s-t, e/Vx---f-r3s, do/R, smyah/Ncmsi.)
Beyond the node labels and the POS-tags no other information is present in our training corpus.
5 The Program
In this section we outline the initial implementation of a system for learning SFs for Bulgarian verbs. The system consists of three modules, all implemented in the Oz programming language (van Roy and Haridi 2004). The data was initially preprocessed in order to remove information that was not relevant to the task.

The main module is the SF-extractor. Unlike Briscoe and Carroll (1997), we decided to keep to strictly local syntactic information, i.e. the algorithm searches for possible arguments only within the verb phrase, so that from the parse

    [VPC [V znae/Vpii+f-r3s] [NPA [M dva/Mcm-i] [N ezika/Ncmt]]]
         'knows two languages'

the following information will be extracted:

znae: NPA
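The extraction step can be sketched as follows (a Python reconstruction for illustration; the original module was written in Oz and the tree encoding here is our own). The parse tree is represented as nested (label, children) pairs, and only the sisters of the verb inside its VPC are collected as candidate arguments, reflecting the strictly local strategy described above:

```python
def extract_frames(tree, frames=None):
    """Collect (verb, frame) pairs from a nested (label, children-or-word) tree.

    Only the sisters of a verb leaf inside its VPC are taken as
    candidate arguments (strictly local extraction).
    """
    if frames is None:
        frames = []
    label, children = tree
    if isinstance(children, str):        # leaf node: (POS-tag, word)
        return frames
    if label == 'VPC':
        # the verb is a V-tagged leaf child; everything else is a candidate argument
        verb = next((c[1] for c in children
                     if c[0].startswith('V') and isinstance(c[1], str)), None)
        args = [c[0] for c in children
                if not (c[0].startswith('V') and isinstance(c[1], str))]
        if verb is not None:
            frames.append((verb, args))
    for child in children:
        extract_frames(child, frames)
    return frames

# The example parse "znae dva ezika" ('knows two languages'):
tree = ('VPC', [('V', 'znae'),
                ('NPA', [('M', 'dva'), ('N', 'ezika')])])
print(extract_frames(tree))   # → [('znae', ['NPA'])]
```

A fuller version would also have to unwind the nested V projections produced by clitics and negation, as in parse (3); the sketch handles only the flat case.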
The next module is the lemmatizer. Since no such tool was available to us for Bulgarian [6], at least none that works with non-Cyrillic input, we created our own. This was done in order to merge results found for different forms of the same verb.

The last of the modules is a statistical filter using the binomial log-likelihood ratio test to filter the discovered SFs. Although the literature is not very favourable towards this method, our small and sparse data set did not allow for the use of other statistical techniques [7]. However, our results show that this test is indeed well suited to low-frequency data, as argued already by Dunning (1993). We took the standard log-likelihood ratio test (-2 log λ):

    -2 log λ = 2 (log L(p1, k1, n1) + log L(p2, k2, n2) - log L(p, k1, n1) - log L(p, k2, n2))

where:

    log L(p, k, n) = k log2 p + (n - k) log2 (1 - p)

and:

    p1 = k1/n1,  p2 = k2/n2,  p = (k1 + k2)/(n1 + n2)
In the above equations k1 is the number of times the test verb appears with the test frame [8], n1 - k1 is the number of times that verb appears with any other frame, k2 is the number of times that frame appears with any other verb, and n2 - k2 is the number of times any other verb appears with any other frame. In other words, the log-likelihood ratio test weighs the probability of the verb occurring with a given frame against the overall likelihood of the frame.
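The test transcribes directly into Python. We read the logarithms in the formula as base-2 (the base only rescales the statistic), and the zero-count guard is our addition to keep the sketch runnable on sparse counts where p1 or p2 may be 0 or 1:

```python
from math import log2

def log_l(p, k, n):
    """log L(p, k, n) = k log2 p + (n - k) log2 (1 - p), with 0 * log 0 = 0."""
    term = lambda c, q: c * log2(q) if c else 0.0
    return term(k, p) + term(n - k, 1 - p)

def llr(k1, n1, k2, n2):
    """-2 log lambda for a (verb, frame) pair; counts as defined in the text."""
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)
    return 2 * (log_l(p1, k1, n1) + log_l(p2, k2, n2)
                - log_l(p, k1, n1) - log_l(p, k2, n2))
```

When the verb-specific rate p1 equals the background rate p2 the statistic is 0; the more the two rates diverge, the larger it grows.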
[6] Too late we learned about the existence of a morphological analyzer for Bulgarian, available at http://www.bultreebank.org/Resources.html
[7] We intend, however, to check how other statistical techniques fare with respect to the sparse data set, e.g. the binomial distribution.
[8] Here test verb and test frame mean the discovered verb and its discovered frame.
6 Results
We excluded all verbs (i.e. lemmas) that were seen fewer than 4 times in the corpus; around 90% of the verbs were seen only once. Of the remaining verbs we decided to disregard ditransitives. The threshold was set to -2.0, and this gave us the final result: 130 verb-frame pairs. Since there is no electronic valency dictionary for Bulgarian, nor a printed one, we had to rely on our own judgement, as well as on the corpus, to verify the results. Precision and recall were calculated using the standard formulae, X/(X+Y) for the former and X/(X+Z) for the latter, where X is the number of correctly identified frames (114 in our case), Y is the number of false frames (16) and Z is the number of additional correct frames present in the data but not identified by the algorithm (53). Thus precision was 87.7% and recall 68.3% per frame. The program learned 16 SFs for 38 different verbs. Although we worked with little data, the overall result is comparable to previously described results, especially that of Sarkar and Zeman (2000), since it deals with another Slavic language.
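The evaluation arithmetic can be checked directly from the counts reported above:

```python
def precision_recall(x, y, z):
    """x: correct frames found, y: false frames found, z: correct frames missed."""
    return x / (x + y), x / (x + z)

# Counts reported above: X = 114, Y = 16, Z = 53
p, r = precision_recall(114, 16, 53)
print(f"precision {100*p:.1f}%, recall {100*r:.1f}%")  # precision 87.7%, recall 68.3%
```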
7 Conclusion
In this paper we looked at the area of automatic extraction of SFs for verbs. We reviewed why this task is important and surveyed the previous attempts in the area. Our work differs from the previous ones in several respects. First, we deal with Bulgarian, a Slavic language, whereas most other work has been done on English. We also addressed how and why our work is useful for extending the Bulgarian treebank, and we achieved results matching those of approaches employing the binomial distribution test, i.e. 87.7% precision. This work will be continued in several directions. One is to devise better heuristics for SF extraction (we have to include ditransitives) and to employ a different statistical test, or at least two tests, for filtering the results. We also intend to adapt a tagger for Bulgarian and attempt SF learning from tagged data only.
8 Some Verbs with SFs
chuvam : nil
chuvam : CLCHE
chuvam : Pron
garantiram : N
garantiram : NPA
gledam : CoordP
gledam : NPA
gledam : PP
govorya : nil
govorya : Pron
idvam : nil
idvam : PP
idvam : Pron
imam : N
imam : NPA
iskam : CLDA V
iskam : CLDA VPC
iskam : CLDA VPS
iskam : N
izglezhdam : Adv
izglezhdam : CLCHE
izglezhdam : CLDA VPS
izglezhdam : CoordP
izlizam : Adv
izlizam : CoordP
izlizam : PP
kazvam : CLCHE
kazvam : N
kazvam : Pron
mislya : CLCHE
mislya : PP
mislya : Pron
moga : CLDA V
moga : CLDA VPA
moga : CLDA VPC
moga : CLDA VPS
nosya : NPA
Bibliography

Basili, R., M. T. Pazienza, and M. Vindigni (1997). Corpus-driven unsupervised learning of verb subcategorization frames. In Lecture Notes in Artificial Intelligence, Volume 1321, pp. 159-170.

Brent, M. R. (1991). Automatic acquisition of subcategorization frames from untagged text. In Meeting of the Association for Computational Linguistics, pp. 209-214.

Brent, M. R. (1994). Surface cues and robust inference as a basis for the early acquisition of subcategorization frames. Lingua 92, 433-470.

Briscoe, T. and J. Carroll (1997). Automatic extraction of subcategorization from corpora. In Proceedings of the 5th ACL Conference on Applied Natural Language Processing, Washington, DC.

Carroll, G. and M. Rooth (1998). Valence induction with a head-lexicalized PCFG. In Proceedings of the 3rd Conference on Empirical Methods in Natural Language Processing (EMNLP 3), Granada, Spain.

Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19(1), 61-74.

Gorrell, G. (1999). Acquiring subcategorization from textual corpora. Master's thesis, Cambridge University.

Korhonen, A. (2002). Subcategorization acquisition. Technical report, University of Cambridge, Computer Laboratory.

Manning, C. (1993). Automatic acquisition of a large subcategorization dictionary from corpora. In Proceedings of the 31st ACL, pp. 235-242.

Sarkar, A. and D. Zeman (2000). Automatic extraction of subcategorization frames for Czech. In Proceedings of COLING 2000, Saarbrücken, Germany.

Simov, K., G. Popova, and P. Osenova (2002). HPSG-based syntactic treebank of Bulgarian (BulTreeBank). In A. Wilson, P. Rayson, and T. McEnery (Eds.), A Rainbow of Corpora: Corpus Linguistics and the Languages of the World, pp. 135-142. Lincom Europa, Munich.

van Roy, P. and S. Haridi (2004). Concepts, Techniques and Models of Computer Programming. MIT Press.