Significant Triples: Adjective+Noun+Verb Combinations
H EIKE Z INSMEISTER AND U LRICH H EID Institut für Maschinelle Sprachverarbeitung Universität Stuttgart Azenbergstraße 12, 70174 Stuttgart, Germany {zinsmeis,heid}@ims.uni-stuttgart.de
Abstract
We investigate the identification and, to some extent, the classification of collocational word groups that consist of an adjectival modifier (A), an accusative object noun (N), and a verb (V), by means of parsing a newspaper corpus with a lexicalized probabilistic grammar.1 Data triples are extracted from the resulting Viterbi parses and subsequently evaluated by means of a statistical association measure. The extraction results are then compared to predefined descriptive classes of ANV-triples. We also use a decision tree algorithm to classify part of the data obtained, on the basis of a small set of manually classified examples. Much of the candidate data is lexicographically relevant: the triples include idiomatic combinations (e.g. (sich) eine goldene Nase verdienen, ‘get rich’, lit. ‘earn oneself a golden nose’), combinations of N+V and N+A collocations (e.g. (eine) klare Absage erteilen, ‘refuse resolutely’, lit. ‘give a clear refusal’), alongside cases where N+V or N+A collocations are found in combination with other (not necessarily collocational) context partners. To extract such data from text corpora, a grammar is needed that captures verb+object relations: simple pattern matching on part-of-speech shapes is not sufficient. Statistical tools then allow the data to be ordered in a way useful for subsequent manual selection by lexicographers.
1 This work has been carried out in the context of the Transferbereich 32: Automatische Exzerption, a DFG-funded project aiming at the creation of support tools for the corpus-based updating of printed dictionaries in lexicography, carried out in cooperation with the publishers Langenscheidt KG and Duden BIFAB AG.

1 Introduction
Most work on corpus-based extraction and classification of multiword lexical items has so far concentrated on collocations, i.e. on word pairs with certain properties. Where larger chunks have been analysed, these were mostly multiword prepositions or adverbs (e.g. by means of, cf. Bouma and Villada 2002), groups of verbs, prepositions and nouns (e.g. German zur Sprache kommen, cf. Krenn 2000, Evert and Krenn 2001, etc.) or multiword terms. In this paper, we are interested in triples of open-class words from general language, consisting of a verb, a noun (typically the object of the verb) and an adjective (which modifies the noun). We call these triples ‘ANV-triples’. Many ANV-triples are of lexicographic interest: some of them are idiomatic as such, others are closely related to collocations. Many of them need to be captured in dictionaries, and the tools described in this paper are meant to support lexicographers in identifying and classifying ANV-triples (we do not, however, intend to provide an exhaustive, fully automatic account of ANV-triples). First, we discuss the phenomenon and define five different classes of data (cf. section 2). Then, we introduce a method of acquiring ANV-triples from a German text corpus (section 3), by means of parsing with a lexicalized probabilistic grammar, subsequent data collection, and sorting of the extraction results by means of a statistical association measure (the log-likelihood ratio, section 4). In section 5, we describe the clustering experiments undertaken with the raw output data, and finally, we discuss the current state of our experiments (section 6) and needs for further work.

2 The data
Collocations are binary. The British contextualist tradition (cf. Firth 1957 etc.) understands collocations as binary word groups (e.g. proud + of, pay + attention, etc.), and so does the tradition of pedagogical lexicography (cf. Hausmann (1989) and now Hausmann (2003), Runcie et al. (2002)); the latter restricts collocations to combinations of open-class words and distinguishes bases and collocates (‘Basis’ and ‘Kollokator’, in Hausmann’s terms). We follow this line of thinking, assuming that collocations are habitual combinations in which the collocate cannot easily be substituted, but which are not necessarily all non-compositional. Some of the ANV-triples under analysis are combinations of two collocations with the same base (cf. Heid 1994, p. 231: allgemeine Gültigkeit haben: (allgemein + Gültigkeit) + (Gültigkeit + haben), ‘be generally valid’, lit. ‘general validity have’). On a descriptive level, we distinguish five different types of ANV-triples. The five types are not exclusive; there is a gradient change from one to the other.

i. A+N+V lexically fixed: idiomatic phrases like sich einen schönen Lenz machen, ‘take it easy’ (lit. ‘oneself a nice spring make’), which are typically non-compositional in meaning.

ii. A+N lexically fixed; V compositional: idiomatic or collocative adjective+noun combinations that occur with and without a verb; the A+N combination may be terminologically fixed (as in absolute Mehrheit + erreichen, ‘win an absolute majority’), or it may be a general-language expression in itself, as in schwarze Zahlen schreiben (‘be profitable’, lit. ‘black numbers write’), where schwarze Zahlen is an idiomatic way of expressing the notion of profitability.

iii. N+V lexically fixed; A compositional: combinations in which a random adjective occurs with a noun+verb collocation, like einen neuen Haftbefehl erlassen, ‘issue a new warrant’, in which Haftbefehl erlassen is a collocation which can be modified.

iv. Combination of collocations: N+V lexically fixed; A+N lexically fixed; same N: a combination of two collocations, i.e. of type (ii) and type (iii), as in ein biblisches Alter erreichen, ‘reach a grand old age’ (lit. ‘a biblical age reach’), in which the adjective+noun collocation biblisches Alter ‘grand old age’ interacts with the semantically compositional noun+verb collocation ein Alter (von n Jahren) erreichen, ‘reach an age (of n years)’.

v. Trivial combination: non-collocative (i.e. completely compositional and non-habitual) combinations of adjective, noun, and verb, as in neue Politik fordern, ‘demand new politics’. We thereby ignore the fact that the cooccurrence of an adjectival modifier and the noun it modifies is never completely random but restricted by semantic properties of both elements; the same holds for the cooccurrence of verbs and their accusative objects. ‘Random’ in our sense means that there is no idiomatic interpretation or habitual use of the combination.

3 Corpus-based acquisition
Our goal is to identify significant triples of adjectives, nouns, and verbs, and subsequently to classify them into the five classes described in section 2, above. The basis of this undertaking is the collection of frequency data from a corpus. For this task, we employ linguistic preprocessing by means of a fully-fledged probabilistic grammar that encodes predicate-argument structures and provides full sentence parses (see Schulte im Walde et al. (2001) for a general overview of the grammar model and its use for the extraction of lexical information). The grammar allows us to identify grammatical structures independently of the linear order of the elements. This is especially relevant for a language like German, which allows a relatively free word order; for illustration, see example (1) below. Complex data such as the combination of a verb, its accusative object and an adjectival modifier of the latter cannot be collected satisfactorily by means of shallow parsing methods or bag-of-words approaches, all the more so since the combination includes three parameters that are realized by open word classes. The probabilistic grammar is based on a manually established context-free grammar with feature constraint annotations, such as the specification of the subcategorization frame at all levels of a verbal category. The rule probabilities were learned in unsupervised training on a newspaper corpus of approx. 25 million words by a probabilistic parser (Schmid 2000). The grammar is lexicalized in the course of training, which means that each rule is multiplied by all potential lexical heads. Lexical heads are the lemmas of the syntactic heads of terminal phrases, which are then propagated to nonterminal structures. Lexicalization allows the grammar to learn lexical cooccurrences, i.e. head-head relations between mother nodes and their non-head daughter nodes, e.g.
the relation between the verbal head of a clause and the nominal head of its accusative object. The trained grammar model allows estimated frequencies of pairs of lexical heads to be read off directly (see Schulte im Walde (2003) for an overview of the lexical information that is encoded in the model itself). For triples of lexical heads, this is not manageable, at least not if the lexical heads belong to open-class categories.2 We therefore reparsed the corpus with the trained grammar, using the Viterbi option of the parser, which determines the most probable parse for each sentence (see e.g. Manning and Schütze 1999, 396ff.). The Viterbi parses were then stored for subsequent extraction of the frequency data. Example (1), taken from the newspaper Frankfurter Rundschau, 1992/93, illustrates the word order problems encountered in the data. It includes the idiomatic expression rote Zahlen schreiben, ‘be in the red’, lit. ‘red numbers write’. The accusative object is not adjacent to the verb but separated from it by the adjunct bei ihrem Wirtschaftsergebnis, ‘at its economic result’.

2 Information about closed-class items may be integrated into the grammar categories, which gives indirect access to triple information, e.g. in the case of prepositional objects: the preposition lemma is then added to the category name, like PP.in for PPs headed by in, which leaves the head feature of the PP open for the embedded nominal. This allows the grammar to learn the head-head cooccurrence of a verb and the nominal head of its prepositional object, which is then directly extractable from the grammar model. But the grammar model does not provide more complex lexical dependencies.

Figure 1: Detail of Viterbi parse: Es wurde erwartet, dass die Flughafen Frankfurt AG (FAG) in diesem Jahr zum ersten Mal rote Zahlen bei ihrem Wirtschaftsergebnis schreiben wird.

(1)
Es wurde erwartet, dass die Flughafen Frankfurt AG (FAG) in diesem Jahr zum ersten Mal rote Zahlen bei ihrem Wirtschaftsergebnis schreiben wird.
    rote Zahlen bei ihrem Wirtschaftsergebnis schreiben wird
    red numbers at its economic result write will
‘It is expected that the Flughafen Frankfurt AG (FAG) will be in the red in its economic result this year for the first time.’
Figure 1 illustrates a Viterbi parse. A search routine collects the heads of all accusative objects together with the heads of their selecting verbs. In addition, it stores information about prenominal adjectival modifiers of the noun: it either collects the head of the adjective or a mark which indicates that the noun is not modified, in which case the adjectival head feature is assigned the value ‘NoADJ’. We extracted only prenominal modifiers and ignored all postnominal modification. For the extraction experiment, we used a corpus of 4,982,800 parsed newspaper sentences ranging from 5 to 30 words. We extracted 1,805,840 ANV-triples; 1,233,547 triples (about 68%) featured the value ‘NoADJ’ instead of an adjectival modifier. After filtering potential parsing errors, i.e. triples with adjectives or verbs that were assigned the default lemma ‘unknown’, we ended up with 440,243 genuine triples, i.e. configurations in which an accusative object was modified by an adjectival modifier. About 70% of the modified occurrences are hapax legomena, i.e. triples that occurred only once in our corpus.
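The filtering step described above can be sketched as follows. The marker names ('NoADJ' for unmodified objects, 'unknown' for default lemmas) are taken from the text; the sample triples and variable names are purely illustrative.

```python
from collections import Counter

# Hypothetical triples as extracted from the Viterbi parses:
# (adjective_lemma, noun_lemma, verb_lemma).
raw = [
    ("rot", "Zahl", "schreiben"),
    ("NoADJ", "Zahl", "schreiben"),    # object without adjectival modifier
    ("unknown", "Zahl", "schreiben"),  # parsing error: default lemma
    ("rot", "Zahl", "schreiben"),
]

# Keep only genuinely modified objects with clean adjective and verb lemmas.
genuine = [t for t in raw
           if t[0] not in ("NoADJ", "unknown") and t[2] != "unknown"]

counts = Counter(genuine)
# Hapax legomena: triples observed exactly once.
hapaxes = [t for t, f in counts.items() if f == 1]
```

On the toy data above, only the two rot Zahlen schreiben occurrences survive the filter, so there are no hapaxes; on the real data, about 70% of the modified triples occur only once.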
4 Calculation of Significance
The resulting quadruples (A, N, V, frequency) are then evaluated by the log-likelihood ratio test (LL, Dunning 1993), a homogeneity test that compares the observed frequency of a pair of items with the frequency estimated under the assumption that the two items occurred independently of each other in the corpus, i.e. that their cooccurrence is a matter of chance. A high log-likelihood score means that the assumption of independence can be rejected with high confidence and that the pair is probably a significant combination. In particular, we compared the log-likelihood values of the three involved pairs, (A, N), (N, V), and (A, V)³, and furthermore of the triple (A, N, V), which we simplified to the nested binary tuple ⟨(A, N), V⟩. For the calculation of the log-likelihood ratio, we defined the probability space to consist of all observed triples, whereby we only considered prenominal attributive adjectives and accusative objects. This means that we ignore occurrences of the analyzed binary relations in other grammatical relations, e.g. whether a pair (A, N) also occurred in subject function. We determined the log-likelihood score of a tuple, e.g. (A, N), in dependence of the given triple, thereby ignoring all occurrences of V: we excluded all triples that include V from the probability space. Sorting according to the log-likelihood ratio gives preference to significant combinations and suppresses random combinations of generally highly frequent words. We implemented the ‘entropy version’ of the log-likelihood ratio, which makes reference to the partitions Pij, rows Ri, and columns Cj of a contingency table and compares observed frequencies Oij (read ‘observed frequency O in partition Pij’) with expected frequencies Eij (‘expected frequency E in partition Pij’, cf. Evert 2002).
For illustration, Table 1 gives the contingency table for the calculation of the log-likelihood score of the pair (A, N), given the triple (A, N, V), taking the observed prenominal adjectives (plus the feature ‘NoAdj’) as one parameter (‘Adj’) and the accusative object nominal (‘Noun’) as the other. The equation in (2) shows how the log-likelihood score is determined from the information contained in the contingency table.

(2)  log-likelihood = 2 Σij Oij log (Oij / Eij)
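The entropy version of the log-likelihood ratio in (2) can be computed generically from a 2x2 contingency table; the following is a sketch, not the authors' code:

```python
import math

def log_likelihood(o11, o12, o21, o22):
    """Entropy version of the log-likelihood ratio (Dunning 1993):
    LL = 2 * sum_ij O_ij * log(O_ij / E_ij),
    with expected frequencies E_ij = R_i * C_j / N from the margins."""
    n = o11 + o12 + o21 + o22
    r1, r2 = o11 + o12, o21 + o22          # row sums
    c1, c2 = o11 + o21, o12 + o22          # column sums
    ll = 0.0
    for o, r, c in ((o11, r1, c1), (o12, r1, c2),
                    (o21, r2, c1), (o22, r2, c2)):
        e = r * c / n                      # expected frequency
        if o > 0:                          # x*log(x/e) -> 0 as x -> 0
            ll += o * math.log(o / e)
    return 2 * ll
```

For a perfectly homogeneous table the score is 0; the more the observed counts deviate from the independence assumption, the higher the score.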
(Adj,Verb,Noun)   Noun                                 OtherNouns
Adj               O11 = |Adj,Noun,OtherVerbs|          O12 = |Adj,OtherNouns,OtherVerbs|          R1 = O11 + O12
OtherAdjs         O21 = |OtherAdjs,Noun,OtherVerbs|    O22 = |OtherAdjs,OtherNouns,OtherVerbs|    R2 = O21 + O22
                  C1 = O11 + O21                       C2 = O12 + O22                             N = R1 + R2 = C1 + C2

with expected frequencies Eij = Ri Cj / N, e.g. E11 = R1 C1 / N.

Table 1: Contingency table for LL(AN), given (Adj,Noun,Verb)

For each triple that was observed in the corpus, we collected four different log-likelihood scores. ‘LL’ abbreviates ‘log-likelihood score’; ‘A’, ‘N’, and ‘V’ are short forms of the involved constituents. (3)
Given a triple (adjective, noun, verb) we calculate
i. LL(ANV), whereby the normalizing factor is the set of all observed triples (including triples with the adjective feature ‘NoADJ’)
ii. LL(AN), whereby the normalizing factor is the set of all observed triples to the exclusion of those triples that include the given verb
iii. LL(NV), whereby the normalizing factor is the set of all observed triples to the exclusion of those triples that include the given adjective
iv. LL(AV), whereby the normalizing factor is the set of all observed triples to the exclusion of those triples that include the noun at stake

3 We were pointed to the relevance of the pair (A,V) by Franz Josef Hausmann, p.c.
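As an illustration of the restricted probability spaces, the contingency counts for case (ii), LL(AN), might be collected as follows. The function name and data layout are our own; the restriction (excluding all triples that contain the given verb) follows the text:

```python
def an_contingency(triples, adj, noun, verb):
    """Contingency counts for the pair (adj, noun), computed over the
    restricted probability space that excludes all triples containing
    the given verb (case ii above). Triples are (A, N, V) tuples."""
    space = [t for t in triples if t[2] != verb]
    o11 = sum(1 for a, n, _ in space if a == adj and n == noun)
    o12 = sum(1 for a, n, _ in space if a == adj and n != noun)
    o21 = sum(1 for a, n, _ in space if a != adj and n == noun)
    o22 = len(space) - o11 - o12 - o21
    return o11, o12, o21, o22
```

The four counts can then be fed into any 2x2 log-likelihood routine; cases (iii) and (iv) are analogous, with the adjective or the noun excluded instead of the verb.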
The top 20 ANV-triples, sorted by LL(ANV), are displayed in table 4.

5 Preclassifying candidate sets
Sorting the resulting lists according to the different log-likelihood scores does only part of the job of discriminating the different classes. This is due to the fact that the log-likelihood scores of the different reference sets cannot be compared directly. Furthermore, due to the ‘binary treatment’ of the ternary word groups, the log-likelihood scores of the triples do not properly differentiate whether an involved pair (e.g. adjective, noun) is a significant pair as such, or whether the two words are independent ‘outside’ the respective triple constellation. Ideally, we would expect the log-likelihood values of ANV, AN, and NV and their proportions to be correlated with the five classes we postulated in section 2, above. Table 2 summarizes these expected proportions.4

triple type                   LL(ANV)  LL(AN)  LL(NV)
type i: ANV collocation       high     low     low
type ii: AN collocation       low      high    low
type iii: NV collocation      low      low     high
type iv: combination ii+iii   high     high    high
type v: trivial ANV           low      low     low

Table 2: Expected proportions of log-likelihood ratios (LL)
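The idealized pattern of Table 2 can be read as a simple rule-of-thumb classifier. The sketch below assumes a single arbitrary threshold separating 'high' from 'low' scores; the paper does not specify such a threshold, and in practice the scores of the different reference sets are not directly comparable:

```python
def guess_type(ll_anv, ll_an, ll_nv, high=100.0):
    """Map the idealized high/low pattern of Table 2 onto a type label.
    The `high` threshold is purely illustrative."""
    a, b, c = ll_anv >= high, ll_an >= high, ll_nv >= high
    if a and b and c:
        return "iv"   # combination of AN and NV collocations
    if a and not b and not c:
        return "i"    # lexically fixed ANV idiom
    if b and not a and not c:
        return "ii"   # AN collocation
    if c and not a and not b:
        return "iii"  # NV collocation
    if not (a or b or c):
        return "v"    # trivial combination
    return "unclear"  # mixed pattern: Table 2 makes no prediction
```

The residual "unclear" case is one reason why the authors turn to a trained decision tree instead of hand-set thresholds.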
To approximate the intended classification, we employed an additional preprocessing step and trained a standard decision tree (C4.5, cf. Quinlan 1986) on a set of manually classified triples. Different versions of the decision tree helped to identify specific classes. We aimed at separating out, at least to a certain extent, idiomatic ANV-triples (type i), trivial combinations (type v) and the more strictly collocational cases (types ii, iii, and iv). We obtained the best results by defining the decision tree attributes as relations between the different kinds of log-likelihood values, in combination with thresholds on the log-likelihood scores; furthermore, we allowed the system to decide on additional thresholds on the frequency data. We implemented different versions of the decision tree based on (subsets of) the set of attributes listed in table 3. The decision trees were trained on 89 manually classified examples and tested on 25 test examples, whereby the overall set of 114 examples was almost equally distributed over the five classes (25 from class i, 24 from class ii, 19 from class iii, 25 from class iv, and finally 21 from class v). We did not find a decision tree which was able to discriminate all classes; therefore, we decided to apply different runs of different decision trees to presort the data. Figure 2 gives the decision tree that performed best on class i items. All five class i examples were correctly classified as class i, and there was only one false positive, a class iii item falsely identified as class i. All in all, it produced 13 errors (52.0%) on the 25 test examples: the other classes were not discriminated as well as class i. The tree is given in standard C4.5 notation. The number to the right of a colon denotes the classification; the first number in round brackets to its right names the number of times the path was followed in the

4 We disregard the log-likelihood score of pairs of adjective+verb here, since this combination is not collocational in the syntactic contexts we analyze.
attribute  condition                        value  else
A          ll(anv) > ll(an)                 y      n
B          ll(anv) > ll(nv)                 y      n
C          ll(anv) > ll(av)                 y      n
D          ll(an) … 50 and ll(av) < 10      y      n
…
J          ll(an) > ll(nv)                  y      n
with continuous values: ll(anv), ll(an), ll(nv), ll(av), f(anv), f(a), f(n), f(v), f(an), and f(nv)

Table 3: Attributes of decision tree learning
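The binary attributes of Table 3 can be derived from the four log-likelihood scores before training. The sketch below covers the fully legible attributes A, B, C, and J together with a few of the continuous scores; the function and dictionary layout are our own, not the paper's feature encoding:

```python
def features(ll_anv, ll_an, ll_nv, ll_av, f_anv):
    """Feature vector in the style of Table 3: binary comparisons
    between log-likelihood scores plus the raw continuous values
    (thresholds on the latter are left to the tree learner)."""
    return {
        "A": ll_anv > ll_an,   # triple score beats AN pair score
        "B": ll_anv > ll_nv,   # triple score beats NV pair score
        "C": ll_anv > ll_av,   # triple score beats AV pair score
        "J": ll_an > ll_nv,    # AN pair score beats NV pair score
        "ll_anv": ll_anv, "ll_an": ll_an,
        "ll_nv": ll_nv, "ll_av": ll_av, "f_anv": f_anv,
    }
```

Applied to the type i example offen Tür einrennen from table 5 (LL(ANV)=1726.96, LL(AN)=95.18, LL(NV)=200.67, LL(AV)=6.58, f=94), attributes A, B, and C come out positive and J negative.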
training. The optional number to the right of this gives the number of errors made at this point in the decision tree during training. The decision tree classifies 58 triples as elements of class i.

6 Results
In section 2, we described five different types of ANV-triples. In our experiment on automatically extracting those triples from a newspaper corpus, we used a stochastic grammar, sorting by means of the log-likelihood ratio and clustering by means of a decision tree trained on a set of manually classified data. In table 4, the 20 best ANV-triples are given, sorted by the log-likelihood scores LL(ANV) of the triples. The log-likelihood scores are less clearly interpretable than idealized in table 2, above: this is evident from the log-likelihood figures provided alongside the manually classified candidates given in table 5 (same table layout), which contains a certain number of hapax legomena and low-frequency items.

A             N            V          LL(ANV)   f(ANV)   LL(AN)    LL(NV)     LL(AV)
groß          Rolle        spielen    4898.20   486.00   38.80     32585.80   6.05
wichtig       Rolle        spielen    4152.43   431.00   505.91    26387.16   0.18
schwer        Verletzung   erleiden   3358.58   314.00   1112.44   2375.46    1591.45
technisch     Entwicklung  aufzeigen  2883.93   187.00   83.86     50.98      3.90
leicht        Verletzung   erleiden   2747.62   241.00   528.69    2897.80    677.77
rot           Zahl         schreiben  2070.60   192.00   555.51    1172.95    2.56
schwarz       Zahl         schreiben  2067.46   185.00   305.23    778.12     3.72
entscheidend  Rolle        spielen    1827.06   190.00   211.57    25779.82   0.15
offen         Tür          einrennen  1726.96   94.00    95.18     200.67     6.58
entsprechend  Beschluß     fassen     1709.36   134.00   427.11    2986.22    22.29
grün          Licht        geben      1622.59   375.00   1398.28   0.15       43.87
klar          Absage       erteilen   1522.36   120.00   123.76    2854.22    49.06
heftig        Kritik       üben       1453.40   126.00   852.48    4334.97    7.87
ordnend       Rolle        spielen    1303.58   130.00   43.88     9843.82    0.00
groß          Wert         legen      1292.30   113.00   51.14     3924.96    4.84
eigen         Weg          gehen      1130.17   112.00   200.95    4516.22    1.32
schwer        Vorwurf      erheben    1121.08   98.00    320.92    982.47     34.92
neu           Weg          gehen      1115.76   129.00   845.40    4495.05    1.02
positiv       Bilanz       ziehen     1069.24   113.00   722.88    2187.01    246.32

Table 4: Results sorted by log-likelihood ratio LL(ANV)
[C4.5 decision tree; it splits on ll(anv), f(a), ll(nv), f(nv), and f(v), with thresholds 5854, 6.34, 265.1, 390, 770, 4925, and 442.72.]

Figure 2: Decision tree that performed best for class i
Table 4 contains several combinations with the noun Verletzung ‘injury’, which have high frequency figures: this may be an artefact of our newspaper corpus. The result data also include examples which are not fully captured by the five descriptive types of section 2. These are cases in which an adjective is required but not restricted to a specific lexical item, like Lebensjahr vollenden ‘complete the nth year of life’. Potentially non-collocative triples (type v) are likely to contain general, non-collocative adjectives like the deictic entsprechend ‘corresponding’ or the listing item weiter ‘further’. We have heuristically extracted some of these, setting thresholds such that LL(ANV) < 20, LL(AN) < 30, LL(AV), and f(A) > 1000. This gives a list of 70 adjectives, many of which satisfy this criterion, cf. (4) for a sample. Such adjectives and the pertaining ANV-triples would be removed from the material to be given to a lexicographer for manual subclassification. (4)
ander, besonder, bestimmt, deutlich, eigene, einzig, entscheidend, entsprechend, erheblich, folgend, ganz, gesamt, gleich, gut, sogenannt, . . .
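The heuristic filter for non-collocative adjectives can be sketched as follows. The LL(ANV), LL(AN), and f(A) thresholds follow the text (the LL(AV) threshold is not legible in the source and is omitted here); the function name and data layout are illustrative:

```python
def noncollocative_adjs(rows, f_a):
    """Collect adjectives whose triples look non-collocative:
    low LL(ANV) and LL(AN) combined with a high overall adjective
    frequency. `rows` holds (adjective, ll_anv, ll_an) entries,
    `f_a` maps adjectives to their corpus frequencies."""
    hits = set()
    for adj, ll_anv, ll_an in rows:
        # Thresholds from the text; the LL(AV) condition is omitted.
        if ll_anv < 20 and ll_an < 30 and f_a.get(adj, 0) > 1000:
            hits.add(adj)
    return hits
```

Adjectives collected this way (and their ANV-triples) would be dropped before handing the material to a lexicographer.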
Table 6 contains the top ten of the ANV-triples classified as belonging to type i (left column) and the top ten from type v (right column). In the left column, we have marked with an asterisk (*) those items which we would manually classify as belonging to type i; the other items belong to type iv.

type      A             N               V               LL(ANV)   Freq(ANV)   LL(AN)    LL(NV)    LL(AV)
type i    offen         Tür             einrennen       1726.96   94          95.18     200.67    6.58
          golden        Nase            verdienen       679.78    55          68.39     0.91      59.76
          kalt          Schulter        zeigen          415.87    41          0.00      33.31     11.29
          letzt         Wort            haben           404.46    124         404.12    5.98      10.07
type ii   rot           Zahl            schreiben       2070.60   192         555.51    1172.95   2.56
          absolut       Mehrheit        verlieren       315.65    62          4667.07   205.12    0.00
          gut           Ruf             genießen        258.78    29          1045.71   384.04    1.20
          offen         Drogenszene     entgegenwirken  13.01     1           260.03    0.00      0.00
type iii  einstweilig   Verfügung       erwirken        606.67    44          911.66    3.38      73.55
          diplomatisch  Beziehung       aufnehmen       551.21    66          1142.49   115.06    4.81
          archimedisch  Punkt           lokalisieren    19.41     1           21.11     0.00      0.00
          Rot           Liste           ansehen         12.04     1           27.15     0.00      0.00
type iv   entsprechend  Beschluß        fassen          1709.36   134         427.11    2986.22   22.29
          neu           Arbeitsplatz    schaffen        337.58    48          261.70    1216.29   762.40
          bestehend     Verlustvortrag  tilgen          19.62     1           0.00      34.41     0.00
          gemeinsam     Vorstellung     entwickeln      8.75      1           2.87      106.03    169.02
type v    scharf        Kritik          üben            2432.86   199         704.72    3208.22   7.30
          klar          Absage          erteilen        1522.36   120         123.76    2854.22   49.06
          dringend      Appell          richten         200.48    14          7.43      758.98    41.70
          einstimmig    Urteil          fällen          12.64     1           1.91      640.37    9.84
          konkret       Zahl            nennen          423.27    51          166.85    1509.10   143.72
          neu           Gast            begrüßen        19.41     2           0.79      1235.58   307.36
          alt           Eiche           entwurzeln      18.28     1           40.11     0.00      0.00
          gesamt        Film            durchziehen     12.53     1           21.72     32.27     57.95

Table 5: Manually classified examples

offen Tür einrennen *          konkret Zahl nennen
deutlich Sprache sprechen *    ander Problem haben
frei Lauf lassen *             deutlich Zeichen setzen
klein Brötchen backen *        weit Auskunft geben
golden Nase verdienen *        genau Angabe machen
reißend Absatz finden          fatal Folge haben
groß Aufsehen erregen          groß Sorge machen
schwer Geschütz auffahren *    groß Schwierigkeit haben
groß Anklang finden            weit Information geben
gut Haar lassen *              aufschiebend Wirkung haben

Table 6: ANV-triples classified by the decision tree: top ten from type i (left) and from type v (right)
7 Discussion and Outlook
Our objective is to provide raw data on significant ANV-triples for lexicographers; in addition, these data are to be sorted in a way that allows the lexicographers to manually evaluate them with little effort. We argue that preprocessing based on both linguistic knowledge and statistical information is superior to shallow methods or simple part-of-speech pattern matching. The probabilistic grammar allows us to also identify non-adjacent configurations and thus to produce raw candidate data which are homogeneously of the same syntactic type.
Statistical sorting by means of the log-likelihood ratio test helps to identify significant triples and to even out the impact of general high-frequency items. To improve the results of this test and to make the figures more easily comparable, the current pair-based calculation of association measures would need to be extended to word triples. Nevertheless, the proportions between the log-likelihood ratios of the ANV-triples and of the (possibly related) AN and NV collocations seem to constitute a starting point for further subclassifying the data into idiomatic vs. collocational vs. trivial.

8 References
Bouma, Gosse, and Begoña Villada. 2002. Corpus-based acquisition of collocational prepositional phrases. In Computational Linguistics in the Netherlands (CLIN) 2001, Twente University.
Dunning, Ted. 1993. Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics 19(1):61-74.
Evert, Stefan. 2002. Mathematical Properties of AMs. Handout, Workshop Computational Approaches to Collocations, Vienna.
Evert, Stefan, and Brigitte Krenn. 2001. Methods for the Qualitative Evaluation of Lexical Association Measures. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, Toulouse, France.
Firth, John Rupert. 1957. A synopsis of linguistic theory 1930-55. In Studies in Linguistic Analysis, pp. 1-32. Oxford.
Hausmann, Franz Josef. 2003. Was sind eigentlich Kollokationen? Talk at IDS Jahrestagung, to appear.
Heid, Ulrich. 1994. On Ways Words Work Together - Topics in Lexical Combinatorics. In Willy Martin et al. (eds.), Proceedings of the VIth Euralex International Congress, pp. 226-257, Amsterdam.
Krenn, Brigitte. 2000. The Usual Suspects: Data-Oriented Models for the Identification and Representation of Lexical Collocations. PhD thesis, DFKI and Universität des Saarlandes, Saarbrücken.
Manning, Christopher D., and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing, 1st edition. Cambridge, MA: MIT Press.
Quinlan, John Ross. 1986. Induction of Decision Trees. Machine Learning 1:81-106.
Runcie, Moira, et al. (eds.). 2002. OCDSE - Oxford Collocations Dictionary for Students of English. Oxford: Oxford University Press.
Schmid, Helmut. 2000. LoPar: Design and Implementation. Arbeitspapiere des Sonderforschungsbereichs 340 Linguistic Theory and the Foundations of Computational Linguistics 149, Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart.
Schulte im Walde, Sabine. 2003. A Collocation Database for German Verbs and Nouns. In COMPLEX 2003, Budapest.
Schulte im Walde, Sabine, Helmut Schmid, Mats Rooth, Stefan Riezler, and Detlef Prescher. 2001. Statistical Grammar Models and Lexicon Acquisition. In Christian Rohrer, Antje Rossdeutscher, and Hans Kamp (eds.), Linguistic Form and its Computation, pp. 387–440. Stanford, CA: CSLI Publications.